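ingest attachment plugin
Last updated: 2024-07-12
Overview
The Elasticsearch ingest attachment plugin lets you write files in common formats into an index as attachments. It uses Apache Tika to extract file content and supports formats such as TXT, DOC, PPT, XLS, and PDF.
Note:
- The source field must contain base64-encoded binary data.
- The ingest attachment plugin is currently supported on Elasticsearch instances of every version except 5.6.16.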
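Pipeline parameters of ingest attachment
Name | Required | Default | Description |
---|---|---|---|
field | yes | - | Field from which the base64-encoded content is read |
target_field | no | attachment | Field that holds the extracted attachment information; mainly useful when a document carries multiple attachments |
indexed_chars | no | 100000 | Maximum number of characters extracted and stored; -1 means no limit |
indexed_chars_field | no | - | Name of a document field whose value overrides indexed_chars for that document |
properties | no | all properties | Properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language |
ignore_missing | no | false | If true and field does not exist, the attachment step is skipped and the doc is written as-is; otherwise an error is raised |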
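The parameters in the table can be assembled into a pipeline body programmatically. The following is a minimal Python sketch; the helper name `build_attachment_pipeline` is hypothetical and not part of any Elasticsearch client. It only builds the JSON body and leaves unset options out so that the defaults listed in the table apply.

```python
import json


def build_attachment_pipeline(field, target_field=None, indexed_chars=None,
                              indexed_chars_field=None, properties=None,
                              ignore_missing=None,
                              description="Extract attachment information"):
    """Return a request body for PUT _ingest/pipeline/<name> (hypothetical helper)."""
    if indexed_chars is not None and indexed_chars < -1:
        raise ValueError("indexed_chars must be -1 (no limit) or non-negative")

    processor = {"field": field}  # 'field' is the only required option
    optional = {
        "target_field": target_field,
        "indexed_chars": indexed_chars,
        "indexed_chars_field": indexed_chars_field,
        "properties": properties,
        "ignore_missing": ignore_missing,
    }
    # Omit options that were not set so the plugin's defaults are used.
    processor.update({k: v for k, v in optional.items() if v is not None})
    return {"description": description,
            "processors": [{"attachment": processor}]}


body = build_attachment_pipeline("data", indexed_chars=-1,
                                 properties=["content", "title"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `PUT _ingest/pipeline/attachment` request.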
Usage:
Single attachment
1. Create the pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
2. Write data to Elasticsearch
An attachment can be written to Elasticsearch in two ways:
(1) Write the base64 encoding of the attachment directly
POST my_index/_doc?pipeline=attachment
{
  "filename": "test.docx",
  "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
}
Here, 5rWL6K+V5paH5Lu25pWw5o2udGVzdA== is the base64 encoding of the attachment content.
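As a quick check, the sample value can be reproduced with Python's standard `base64` module. This sketch assumes the attachment content is the UTF-8 text 测试文件数据test, which is what the query results in this document show; the function name `to_base64` is illustrative only.

```python
import base64


def to_base64(text: str) -> str:
    """Encode text as UTF-8 bytes, then as a base64 ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


encoded = to_base64("测试文件数据test")
print(encoded)  # 5rWL6K+V5paH5Lu25pWw5o2udGVzdA==

# Decoding gives back the original text.
print(base64.b64decode(encoded).decode("utf-8"))  # 测试文件数据test
```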
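(2) Write the attachment to Elasticsearch directly with the Linux curl command
curl -u user:pwd -H 'Content-Type: application/json' -XPOST 'IP:PORT/my_index/_doc?pipeline=attachment' -d '
{
  "filename": "test.docx",
  "data": "'"$(base64 -w 0 /test_path/test.docx)"'"
}'
With this method the test.docx file is encoded and written to Elasticsearch in one step. The -w 0 option makes base64 emit a single unwrapped line, so no further newline handling is needed.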
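The same request can be prepared from a script. Below is a minimal sketch using only the Python standard library; `build_document` and `build_request` are hypothetical helper names, and the base URL `http://localhost:9200` and the `user`/`pwd` credentials are placeholders, not values from this document. The request is only constructed here; sending it is left to the caller.

```python
import base64
import json
import os
import tempfile
import urllib.request


def build_document(path):
    """Read a file and return a doc whose 'data' field is its base64 encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    return {"filename": os.path.basename(path),
            "data": base64.b64encode(raw).decode("ascii")}


def build_request(base_url, index, pipeline, document, user, password):
    """Build (but do not send) a POST to <index>/_doc routed through the pipeline."""
    url = f"{base_url}/{index}/_doc?pipeline={pipeline}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )


# Demo with a throwaway file so the sketch runs without a real attachment.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"This is a test\n")
document = build_document(tmp.name)
os.remove(tmp.name)

request = build_request("http://localhost:9200", "my_index", "attachment",
                        document, "user", "pwd")
print(request.get_method(), request.full_url)
# To send against a reachable cluster: urllib.request.urlopen(request)
```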
3. Query the data
GET my_index/_search
{
  "query": {
    "match": {
      "attachment.content": "test"
    }
  }
}
The result is as follows:
{
  "took": 292,
  "timed_out": false,
  ...
  {
    "_index": "my_index",
    "_type": "_doc",
    "_id": "RL_1CXMByfR8Yao01Qs9",
    "_score": 1.9692057,
    "_source": {
      "filename": "test.docx",
      "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
      "attachment": {
        "content_type": "text/plain; charset=UTF-8",
        "language": "lt",
        "content": "测试文件数据test",
        "content_length": 10
      }
    }
  }
  ...
}
Limiting the attachment content length
1. Create the pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 6,
        "indexed_chars_field" : "my_field_size"
      }
    }
  ]
}
2. Write the data
POST /my_index/_doc?pipeline=attachment
{
  "filename": "test.docx",
  "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
}
POST /my_index/_doc?pipeline=attachment
{
  "filename": "test2.docx",
  "my_field_size": 2,
  "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
}
We write two docs with attachments; both attachments contain the text 测试文件数据test. In the second doc, the field my_field_size, which indexed_chars_field points to, is set to 2.
Note: only the direct base64 method is shown here.
3. Query the data
GET my_index/_search
{
  "query": {
    "match_all": {}
  }
}
The result is as follows:
{
  "took": 292,
  "timed_out": false,
  ...
  {
    "_index": "my_index",
    "_type": "_doc",
    "_id": "RL_1CXMByfR8Yao01Qs9",
    "_score": 1.9692057,
    "_source": {
      "filename": "test.docx",
      "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
      "attachment": {
        "content_type": "text/plain; charset=UTF-8",
        "language": "lt",
        "content": "测试文件数据",
        "content_length": 6
      }
    }
  },
  {
    "_index": "my_index",
    "_type": "_doc",
    "_id": "RL_1CXMByfR8Yao01Qs9",
    "_score": 1.9692057,
    "_source": {
      "filename": "test2.docx",
      "my_field_size": 2,
      "data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
      "attachment": {
        "content_type": "text/plain; charset=UTF-8",
        "language": "lt",
        "content": "测试",
        "content_length": 2
      }
    }
  }
  ...
}
As the result shows, the first doc keeps 6 characters of the attachment content (the pipeline-level indexed_chars), while the second doc keeps only 2 characters (the per-document value in my_field_size).
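The interaction between the two limits can be modeled in a few lines of Python. This is only a sketch of the behavior shown in the results above, not the plugin's implementation; the function names `effective_limit` and `truncate_content` are hypothetical.

```python
def effective_limit(doc, indexed_chars=100000, indexed_chars_field=None):
    """Pick the per-document limit if the doc carries one, else the pipeline limit."""
    if indexed_chars_field is not None and indexed_chars_field in doc:
        return int(doc[indexed_chars_field])
    return indexed_chars


def truncate_content(content, limit):
    """Keep at most 'limit' characters; a negative limit (-1) means no limit."""
    return content if limit < 0 else content[:limit]


content = "测试文件数据test"
doc1 = {"filename": "test.docx"}
doc2 = {"filename": "test2.docx", "my_field_size": 2}

first = truncate_content(content, effective_limit(doc1, 6, "my_field_size"))
second = truncate_content(content, effective_limit(doc2, 6, "my_field_size"))
print(first)   # 测试文件数据
print(second)  # 测试
```

In this model, the per-document field wins whenever it is present, which matches the second doc keeping 2 characters even though the pipeline sets indexed_chars to 6.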
Multiple attachments
1. Create the pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}
Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processor cannot match the correct fields of each array element.
2. Write data with multiple attachments
POST /my_index/_doc?pipeline=attachment
{
  "attachments" : [
    {
      "filename" : "ipsum.txt",
      "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
    },
    {
      "filename" : "test.txt",
      "data" : "VGhpcyBpcyBhIHRlc3QK"
    }
  ]
}
Note: only the direct base64 method is shown here.
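A request body of this shape can be built from raw file contents. The sketch below is illustrative: `build_multi_attachment_document` is a hypothetical helper, and the byte strings are the decoded forms of the two sample base64 values above.

```python
import base64
import json


def build_multi_attachment_document(files):
    """Build a doc with an 'attachments' array from (filename, raw_bytes) pairs."""
    return {
        "attachments": [
            {"filename": name,
             "data": base64.b64encode(raw).decode("ascii")}
            for name, raw in files
        ]
    }


document = build_multi_attachment_document([
    ("ipsum.txt", b"this is\njust some text\n"),
    ("test.txt", b"This is a test\n"),
])
print(json.dumps(document, indent=2))
```

Each array element carries its own data field, which is why the foreach processor addresses it as _ingest._value.data.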
3. Query the data
GET my_index/_search
{
  "query": {
    "match_all": {}
  }
}
The result is as follows:
{
  "took": 292,
  "timed_out": false,
  ...
  {
    "_index": "my_index",
    "_id": "RL_1CXMByfR8Yao01Qs9",
    "_version": 1,
    "_seq_no": 50,
    "_primary_term": 1,
    "found": true,
    "_source": {
      "attachments": [
        {
          "filename": "ipsum.txt",
          "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "en",
            "content": "this is\njust some text",
            "content_length": 24
          }
        },
        {
          "filename": "test.txt",
          "data": "VGhpcyBpcyBhIHRlc3QK",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "en",
            "content": "This is a test",
            "content_length": 16
          }
        }
      ]
    }
  }
  ...
}
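Once a result like this has been parsed into a dictionary, the extracted text of each attachment can be collected from `_source`. The following is a minimal sketch; `attachment_contents` is a hypothetical helper, and the sample `source` is a trimmed copy of the `_source` object above with the `data` fields left out.

```python
def attachment_contents(source, array_field="attachments", target="attachment"):
    """Map each filename to the text extracted into its target_field sub-object."""
    contents = {}
    for item in source.get(array_field, []):
        extracted = item.get(target)
        if extracted is None:
            continue  # element was not processed, e.g. it had no data field
        contents[item["filename"]] = extracted.get("content", "")
    return contents


source = {
    "attachments": [
        {"filename": "ipsum.txt",
         "attachment": {"language": "en",
                        "content": "this is\njust some text"}},
        {"filename": "test.txt",
         "attachment": {"language": "en",
                        "content": "This is a test"}},
    ]
}

for name, text in attachment_contents(source).items():
    print(name, repr(text))
```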