ingest attachment plugin

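          Introduction

          The Elasticsearch ingest attachment plugin lets you write files in common formats to an index as attachments. The plugin uses Apache Tika to extract file content; supported formats include TXT, DOC, PPT, XLS, and PDF. Note: the source field must contain base64-encoded binary data.

          Pipeline parameters of the ingest attachment processor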

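          Name                 Required  Default         Description
          field                yes       -               Field from which the base64-encoded data is read.
          target_field         no        attachment      Field that holds the extracted attachment information; mainly used when there are multiple attachments.
          indexed_chars        no        100000          Maximum number of characters to keep in the field. -1 means no limit.
          indexed_chars_field  no        -               Name of a field in the document whose value overrides indexed_chars for that document.
          properties           no        all properties  Properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language.
          ignore_missing       no        false           If true and field does not exist, attachment processing is skipped and the document is written as is; otherwise an error is raised.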
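
The parameters in the table above can also be assembled into a pipeline definition programmatically. The following is a minimal Python sketch, not an official client API: the helper name build_attachment_pipeline is hypothetical, and it only builds the JSON request body for a PUT _ingest/pipeline/<name> call without sending it.

```python
import json

# Option names taken from the parameter table above.
ALLOWED_OPTIONS = {
    "target_field",
    "indexed_chars",
    "indexed_chars_field",
    "properties",
    "ignore_missing",
}


def build_attachment_pipeline(field, description="Extract attachment information", **options):
    """Return the request body for an attachment pipeline as a dict.

    Hypothetical helper for illustration only. `field` is required; every
    other option is optional and is simply left out when not given.
    """
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown attachment options: {sorted(unknown)}")
    processor = {"field": field}
    processor.update(options)
    return {"description": description, "processors": [{"attachment": processor}]}


# Example: store only two properties and tolerate documents without "data".
body = build_attachment_pipeline(
    "data",
    properties=["content", "content_type"],
    ignore_missing=True,
)
print(json.dumps(body, indent=2))
```

Rejecting unknown option names early is a design choice of this sketch; it catches typos before the request reaches the cluster.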

          Usage:

          Single attachment

          1. Create the pipeline

          PUT _ingest/pipeline/attachment
          {
            "description" : "Extract attachment information",
            "processors" : [
              {
                "attachment" : {
                  "field" : "data"
                }
              }
            ]
          }

          2. Write data to Elasticsearch

          An attachment can be written to Elasticsearch in two ways:

          (1) Write the base64 encoding of the attachment directly

          POST my_index/_doc?pipeline=attachment
          {
            "filename": "test.docx",
            "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
          }

          Here, 5rWL6K+V5paH5Lu25pWw5o2udGVzdA== is the attachment content encoded as base64.

          (2) Use the Linux curl command to write the attachment to Elasticsearch directly
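
As a small illustration of where that base64 string comes from, the following Python sketch encodes the sample content used in this example (测试文件数据test, taken from the query result in this section) with the standard library. The variable names are chosen for illustration only.

```python
import base64

text = "测试文件数据test"  # sample attachment content used in this example

# Encode the UTF-8 bytes of the content as base64 (single line, no wrapping).
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(encoded)

# Decoding reverses the step and recovers the original content.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```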

          curl -u user:pwd -H 'Content-Type: application/json' -XPOST IP:PORT/my_index/_doc?pipeline=attachment -d'
          {
            "filename": "test.docx",
            "data": "'`base64 -w 0 /test_path/test.docx | perl -pe 's/\n/\\n/g'`'"
          }'

          This way, the test.docx file is written to Elasticsearch directly.

          3. Query the data
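
For environments without curl, roughly the same request can be prepared in Python with the standard library. This is a sketch under assumptions: the helper names build_attachment_doc and build_request are hypothetical, basic authentication is assumed because the curl command uses -u user:pwd, and IP:PORT, user and pwd remain placeholders.

```python
import base64
import json
import urllib.request
from pathlib import Path


def build_attachment_doc(path):
    """Read a file and return a document carrying its base64-encoded content."""
    raw = Path(path).read_bytes()
    return {
        "filename": Path(path).name,
        # b64encode emits a single line, like `base64 -w 0`.
        "data": base64.b64encode(raw).decode("ascii"),
    }


def build_request(base_url, doc, user, pwd, index="my_index", pipeline="attachment"):
    """Build (but do not send) the POST request that indexes the document."""
    token = base64.b64encode(f"{user}:{pwd}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc?pipeline={pipeline}",
        data=json.dumps(doc).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Usage with the same placeholders as the curl command:
# doc = build_attachment_doc("/test_path/test.docx")
# req = build_request("http://IP:PORT", doc, "user", "pwd")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Building the body with json.dumps avoids the shell quoting that the curl variant needs around the base64 output.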

          GET my_index/_search
          {
             "query": {
                "match": {
                   "attachment.content": "test"
                }
             }
          }

          The result is as follows:

          {
             "took": 292,
             "timed_out": false,
             ...
                   {
                      "_index": "my_index",
                      "_type": "_doc",
                      "_id": "RL_1CXMByfR8Yao01Qs9",
                      "_score": 1.9692057,
                      "_source": {
                         "filename": "test.docx",
                         "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
                         "attachment": {
                            "content_type": "text/plain; charset=UTF-8",
                            "language": "lt",
                            "content": "测试文件数据test",
                            "content_length": 10
                         }
                      }
                   }
             ...
          }

          Limiting the attachment content length

          1. Create the pipeline

          PUT _ingest/pipeline/attachment
          {
            "description" : "Extract attachment information",
            "processors" : [
              {
                "attachment" : {
                  "field" : "data",
                  "indexed_chars" : 6,
                  "indexed_chars_field" : "my_field_size"
                }
              }
            ]
          }

          2. Write data

          POST /my_index/_doc?pipeline=attachment
          {
            "filename": "test.docx",
            "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
          }
          POST /my_index/_doc?pipeline=attachment
          {
            "filename": "test2.docx",
            "my_field_size": 2,
            "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
          }

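          We write two documents with attachments. Both attachments contain the same content: 测试文件数据test (six Chinese characters followed by "test"). In the second document, the field my_field_size, which indexed_chars_field points to, is set to 2.

          Note: only the direct base64 method is shown here.

          3. Query the data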

          GET my_index/_search
          {
             "query": {
                "match_all": {}
             }
          }

          The result is as follows:

          {
            "took": 292,
            "timed_out": false,
            ...
                 {
                  "_index": "my_index",
                  "_type": "_doc",
                  "_id": "RL_1CXMByfR8Yao01Qs9",
                  "_score": 1.9692057,
                  "_source": {
                    "filename": "test.docx",
                    "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
                    "attachment": {
                      "content_type": "text/plain; charset=UTF-8",
                      "language": "lt",
                      "content": "测试文件数据",
                      "content_length": 6
                    }
                  }
                },
                {
                  "_index": "my_index",
                  "_type": "_doc",
                  "_id": "RL_1CXMByfR8Yao01Qs9",
                  "_score": 1.9692057,
                  "_source": {
                    "filename": "test2.docx",
                    "my_field_size": 2,
                    "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
                    "attachment": {
                      "content_type": "text/plain; charset=UTF-8",
                      "language": "lt",
                      "content": "测试",
                      "content_length": 2
                    }
                  }
                }
            ...
          }

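          As the result shows, the first document keeps 6 characters of the attachment content, while the second document keeps 2 characters.

          Multiple attachments

          1. Create the pipeline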
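
The interaction between indexed_chars and indexed_chars_field in this example can be modeled in a few lines of Python. This is a simplified sketch of the behavior shown above, not the plugin's implementation: the function names are hypothetical, and the real extraction is done by Apache Tika.

```python
def effective_limit(doc, indexed_chars=100000, indexed_chars_field=None):
    """Return the character limit that applies to one document.

    A value found in the document's indexed_chars_field takes precedence
    over the processor-level indexed_chars.
    """
    if indexed_chars_field is not None and indexed_chars_field in doc:
        return doc[indexed_chars_field]
    return indexed_chars


def truncate_content(content, limit):
    """Keep at most `limit` characters; -1 means no limit."""
    return content if limit < 0 else content[:limit]


content = "测试文件数据test"  # decoded attachment content from the example
processor = {"indexed_chars": 6, "indexed_chars_field": "my_field_size"}

doc1 = {"filename": "test.docx"}
doc2 = {"filename": "test2.docx", "my_field_size": 2}

for doc in (doc1, doc2):
    limit = effective_limit(doc, **processor)
    print(doc["filename"], truncate_content(content, limit))
```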

          PUT _ingest/pipeline/attachment
          {
            "description" : "Extract attachment information from arrays",
            "processors" : [
              {
                "foreach": {
                  "field": "attachments",
                  "processor": {
                    "attachment": {
                      "target_field": "_ingest._value.attachment",
                      "field": "_ingest._value.data"
                    }
                  }
                }
              }
            ]
          }

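          Note that with multiple attachments, both field and target_field must be written as _ingest._value.*; otherwise the processor cannot match the correct fields.

          2. Write data with multiple attachments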
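
To make the role of _ingest._value more concrete, the following Python sketch imitates what the foreach processor does with this pipeline: it visits each element of the attachments array and writes the extracted result back into that same element. It is an illustration under assumptions: run_foreach_attachment is a hypothetical name, the attachments are assumed to be plain UTF-8 text, and only the content property is produced (no Tika, no language or content-type detection).

```python
import base64


def run_foreach_attachment(doc, array_field="attachments"):
    """Imitate foreach + attachment on one document (plain UTF-8 text only)."""
    for value in doc.get(array_field, []):
        # `value` plays the role of _ingest._value: the current array element.
        # "field": "_ingest._value.data" reads from that element.
        text = base64.b64decode(value["data"]).decode("utf-8")
        # "target_field": "_ingest._value.attachment" writes into the same
        # element; surrounding whitespace is trimmed here as an assumption.
        value["attachment"] = {"content": text.strip()}
    return doc


doc = {
    "attachments": [
        {"filename": "ipsum.txt", "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="},
        {"filename": "test.txt", "data": "VGhpcyBpcyBhIHRlc3QK"},
    ]
}
result = run_foreach_attachment(doc)
for item in result["attachments"]:
    print(item["filename"], repr(item["attachment"]["content"]))
```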

          POST /my_index/_doc?pipeline=attachment
          {
            "attachments" : [
              {
                "filename" : "ipsum.txt",
                "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
              },
              {
                "filename" : "test.txt",
                "data" : "VGhpcyBpcyBhIHRlc3QK"
              }
            ]
          }

          Note: only the direct base64 method is shown here.

          3. Query the data

          GET my_index/_search
          {
             "query": {
                "match_all": {}
             }
          }

          The result is as follows:

          {
            "took": 292,
            "timed_out": false,
            ...
                {
                  "_index": "my_index",
                  "_id": "RL_1CXMByfR8Yao01Qs9",
                  "_version": 1,
                  "_seq_no": 50,
                  "_primary_term": 1,
                  "found": true,
                  "_source": {
                    "attachments": [
                      {
                        "filename": "ipsum.txt",
                        "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
                        "attachment": {
                          "content_type": "text/plain; charset=ISO-8859-1",
                          "language": "en",
                          "content": "this is\njust some text",
                          "content_length": 24
                        }
                      },
                      {
                        "filename": "test.txt",
                        "data": "VGhpcyBpcyBhIHRlc3QK",
                        "attachment": {
                          "content_type": "text/plain; charset=ISO-8859-1",
                          "language": "en",
                          "content": "This is a test",
                          "content_length": 16
                        }
                      }
                    ]
                  }
                }
            ...
          }