ElasticsearchBES

    ingest attachment plugin

    Introduction

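    The Elasticsearch ingest attachment plugin lets you index files in common formats as attachments. It uses Apache Tika to extract file content and supports formats such as TXT, DOC, PPT, XLS, and PDF. Note: the source field must contain base64-encoded binary data.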
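The base64 requirement above can be illustrated with a short Python sketch. The helper name encode_attachment is hypothetical and not part of the plugin; the sample text is the same content used in the examples on this page.

```python
import base64


def encode_attachment(raw: bytes) -> str:
    """Return the base64 text form expected in the source field."""
    # b64encode returns bytes; decode to ASCII so the value fits in a JSON string.
    return base64.b64encode(raw).decode("ascii")


# Sample content: the UTF-8 bytes of a short text attachment.
sample = "测试文件数据test".encode("utf-8")
encoded = encode_attachment(sample)
print(encoded)
```

For a real file, the bytes would typically come from open(path, "rb").read() instead of an in-memory string.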

    Pipeline parameters of the ingest attachment processor

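    Name                 Required  Default         Description
    field                yes       -               The field to read the base64-encoded content from.
    target_field         no        attachment      The field that holds the extracted attachment information; mainly useful with multiple attachments.
    indexed_chars        no        100000          Maximum number of characters to keep in the extracted content. -1 means no limit.
    indexed_chars_field  no        -               Name of a document field whose value overrides indexed_chars for that document.
    properties           no        all properties  The properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language.
    ignore_missing       no        false           If true and field does not exist, the attachment is skipped and the document is indexed as-is; otherwise an error is returned.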
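As a rough illustration of how these parameters fit into a pipeline definition, the following Python sketch assembles the request body for PUT _ingest/pipeline/<name>. The helper names attachment_processor and pipeline_body are hypothetical; only the option names come from the table above. Options left unset are omitted so that the defaults listed in the table apply.

```python
import json
from typing import Any, Dict, List, Optional


def attachment_processor(
    field: str,
    target_field: Optional[str] = None,
    indexed_chars: Optional[int] = None,
    indexed_chars_field: Optional[str] = None,
    properties: Optional[List[str]] = None,
    ignore_missing: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build one attachment processor entry, omitting options left at None."""
    options = {
        "field": field,
        "target_field": target_field,
        "indexed_chars": indexed_chars,
        "indexed_chars_field": indexed_chars_field,
        "properties": properties,
        "ignore_missing": ignore_missing,
    }
    # Drop unset options so the defaults from the parameter table apply.
    options = {key: value for key, value in options.items() if value is not None}
    return {"attachment": options}


def pipeline_body(description: str, processors: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap processors in the pipeline body structure."""
    return {"description": description, "processors": processors}


body = pipeline_body(
    "Extract attachment information",
    [attachment_processor("data", indexed_chars=-1, properties=["content", "title"])],
)
print(json.dumps(body, indent=2))
```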

    Usage:

    Single attachment

    1. Create the pipeline

    PUT _ingest/pipeline/attachment
    {
      "description" : "Extract attachment information",
      "processors" : [
        {
          "attachment" : {
            "field" : "data"
          }
        }
      ]
    }

    2. Write data to Elasticsearch

    An attachment can be written to Elasticsearch in two ways:

    (1) Write the base64 encoding of the attachment directly

    POST my_index/_doc?pipeline=attachment
    {
      "filename": "test.docx",
      "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
    }

    Here, 5rWL6K+V5paH5Lu25pWw5o2udGVzdA== is the attachment content converted to base64.

    (2) Use the Linux curl command to write the attachment file directly to Elasticsearch

    curl -u user:pwd -H 'Content-Type: application/json' -XPOST 'IP:PORT/my_index/_doc?pipeline=attachment' -d'
    {
      "filename": "test.docx",
      "data": "'`base64 -w 0 /test_path/test.docx | perl -pe 's/\n/\\n/g'`'"
    }'

    This way, the file test.docx is encoded and written to Elasticsearch in a single step.
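The same request can be sketched in Python with only the standard library. This is a minimal sketch, not the only way to do it: the helper build_index_request and the address http://localhost:9200 are assumptions for illustration, and authentication (the -u user:pwd part of the curl command) is not handled here.

```python
import base64
import json
import urllib.request


def build_index_request(
    base_url: str, index: str, pipeline: str, filename: str, raw: bytes
) -> urllib.request.Request:
    """Prepare a POST that indexes one attachment through the given pipeline."""
    body = {
        "filename": filename,
        # The attachment processor reads base64 text from the "data" field.
        "data": base64.b64encode(raw).decode("ascii"),
    }
    url = f"{base_url}/{index}/_doc?pipeline={pipeline}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request(
    "http://localhost:9200",  # assumed address; replace with IP:PORT of the cluster
    "my_index",
    "attachment",
    "test.docx",
    "测试文件数据test".encode("utf-8"),
)
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a reachable cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```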

    3. Query the data

    GET my_index/_search
    {
       "query": {
          "match": {
             "attachment.content": "test"
          }
       }
    }

    The result is as follows:

    {
       "took": 292,
       "timed_out": false,
       ...
             {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "RL_1CXMByfR8Yao01Qs9",
                "_score": 1.9692057,
                "_source": {
                   "filename": "test.docx",
                   "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
                   "attachment": {
                      "content_type": "text/plain; charset=UTF-8",
                      "language": "lt",
                      "content": "测试文件数据test",
                      "content_length": 10
                   }
                }
             }
       ...
    }

    Limiting the attachment content length

    1. Create the pipeline

    PUT _ingest/pipeline/attachment
    {
      "description" : "Extract attachment information",
      "processors" : [
        {
          "attachment" : {
            "field" : "data",
            "indexed_chars" : 6,
            "indexed_chars_field" : "my_field_size"
          }
        }
      ]
    }

    2. Write data

    POST /my_index/_doc?pipeline=attachment
    {
      "filename": "test.docx",
      "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
    }
    POST /my_index/_doc?pipeline=attachment
    {
      "filename": "test2.docx",
      "my_field_size": 2,
      "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
    }

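    We index two documents with attachments; the attachment content of both is 测试文件数据test. In the second document, the field my_field_size, which is the field named by indexed_chars_field, is set to 2.

    Note: only the direct base64 method is shown here.

    3. Query the data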

    GET my_index/_search
    {
       "query": {
          "match_all": {}
       }
    }

    The result is as follows:

    {
      "took": 292,
      "timed_out": false,
      ...
           {
            "_index": "my_index",
            "_type": "_doc",
            "_id": "RL_1CXMByfR8Yao01Qs9",
            "_score": 1.9692057,
            "_source": {
              "filename": "test.docx",
              "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
              "attachment": {
                "content_type": "text/plain; charset=UTF-8",
                "language": "lt",
                "content": "测试文件数据",
                "content_length": 6
              }
            }
          },
          {
            "_index": "my_index",
            "_type": "_doc",
            "_id": "RL_1CXMByfR8Yao01Qs9",
            "_score": 1.9692057,
            "_source": {
              "filename": "test2.docx",
              "my_field_size": 2,
              "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
              "attachment": {
                "content_type": "text/plain; charset=UTF-8",
                "language": "lt",
                "content": "测试",
                "content_length": 2
              }
            }
          }
      ...
    }

    As shown, the first document keeps 6 characters of attachment content, while the second document keeps 2 characters.
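The way the two settings interact can be modeled locally in a few lines of Python. This is only a sketch of the behavior described above, not the plugin's implementation; the function names effective_limit and truncate are hypothetical.

```python
from typing import Any, Dict, Optional


def effective_limit(
    doc: Dict[str, Any],
    indexed_chars: int = 100000,
    indexed_chars_field: Optional[str] = None,
) -> int:
    """Pick the per-document limit if the document carries one, else the pipeline limit."""
    if indexed_chars_field is not None and indexed_chars_field in doc:
        return int(doc[indexed_chars_field])
    return indexed_chars


def truncate(content: str, limit: int) -> str:
    """Keep at most `limit` characters; a negative limit (-1) means no limit."""
    return content if limit < 0 else content[:limit]


content = "测试文件数据test"
docs = [
    {"filename": "test.docx"},
    {"filename": "test2.docx", "my_field_size": 2},
]
for doc in docs:
    limit = effective_limit(doc, indexed_chars=6, indexed_chars_field="my_field_size")
    print(doc["filename"], limit, truncate(content, limit))
```

With the pipeline from step 1 (indexed_chars set to 6), the first document falls back to the pipeline limit and the second uses its own my_field_size value.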

    Multiple attachments

    1. Create the pipeline

    PUT _ingest/pipeline/attachment
    {
      "description" : "Extract attachment information from arrays",
      "processors" : [
        {
          "foreach": {
            "field": "attachments",
            "processor": {
              "attachment": {
                "target_field": "_ingest._value.attachment",
                "field": "_ingest._value.data"
              }
            }
          }
        }
      ]
    }

    Note that with multiple attachments, field and target_field must be written in the form _ingest._value.*; otherwise the processor cannot match the correct fields.

    2. Write a document with multiple attachments

    POST /my_index/_doc?pipeline=attachment
    {
      "attachments" : [
        {
          "filename" : "ipsum.txt",
          "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
        },
        {
          "filename" : "test.txt",
          "data" : "VGhpcyBpcyBhIHRlc3QK"
        }
      ]
    }

    Note: only the direct base64 method is shown here.
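A document of this shape can be assembled from raw file contents with a short Python sketch. The helper name build_attachments_doc is hypothetical; the field names attachments, filename, and data match the pipeline in step 1, and the sample bytes are chosen so that they encode to the two base64 strings used in the request above.

```python
import base64
import json
from typing import Any, Dict, List, Tuple


def build_attachments_doc(files: List[Tuple[str, bytes]]) -> Dict[str, Any]:
    """Build a document whose "attachments" array holds one entry per file."""
    return {
        "attachments": [
            {
                "filename": name,
                # Each element is processed by foreach as _ingest._value.
                "data": base64.b64encode(raw).decode("ascii"),
            }
            for name, raw in files
        ]
    }


doc = build_attachments_doc(
    [
        ("ipsum.txt", b"this is\njust some text\n"),
        ("test.txt", b"This is a test\n"),
    ]
)
print(json.dumps(doc, indent=2))
```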

    3. Query the data

    GET my_index/_search
    {
       "query": {
          "match_all": {}
       }
    }

    The result is as follows:

    {
      "took": 292,
      "timed_out": false,
      ...
          {
            "_index": "my_index",
            "_id": "RL_1CXMByfR8Yao01Qs9",
            "_version": 1,
            "_seq_no": 50,
            "_primary_term": 1,
            "found": true,
            "_source": {
              "attachments": [
                {
                  "filename": "ipsum.txt",
                  "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
                  "attachment": {
                    "content_type": "text/plain; charset=ISO-8859-1",
                    "language": "en",
                    "content": "this is\njust some text",
                    "content_length": 24
                  }
                },
                {
                  "filename": "test.txt",
                  "data": "VGhpcyBpcyBhIHRlc3QK",
                  "attachment": {
                    "content_type": "text/plain; charset=ISO-8859-1",
                    "language": "en",
                    "content": "This is a test",
                    "content_length": 16
                  }
                }
              ]
            }
          }
      ...
    }