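ingest attachment plugin
Last updated: 2024-07-12
Introduction
The Elasticsearch ingest attachment plugin lets you write files in common formats to an index as attachments. The plugin uses Apache Tika to extract file content and supports formats such as TXT, DOC, PPT, XLS, and PDF.
Note:
- The source field must contain base64-encoded binary data.
- The ingest attachment plugin is currently supported on Elasticsearch instances of all versions except 5.6.16.
Pipeline parameters of ingest attachment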
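Name | Required | Default | Description |
---|---|---|---|
field | yes | - | The field from which the base64-encoded data is read. |
target_field | no | attachment | The field that holds the extracted attachment information; mainly useful when a document has multiple attachments. |
indexed_chars | no | 100000 | The maximum number of characters to keep in the extracted content. -1 means no limit. |
indexed_chars_field | no | - | The name of a field in the document from which the indexed_chars limit is read. |
properties | no | all properties | The properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language. |
ignore_missing | no | false | If true and field does not exist, the attachment is ignored and the doc is written as-is; otherwise an error is raised. |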
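As a rough illustration of how these parameters combine, the following Python sketch assembles a pipeline body that could be sent to `PUT _ingest/pipeline/<name>`. The helper name `build_attachment_pipeline` and its keyword arguments are hypothetical conveniences, not part of Elasticsearch; only the keys it emits come from the table above.

```python
import json

# Optional parameters listed in the table above.
ALLOWED_OPTIONS = {
    "target_field",
    "indexed_chars",
    "indexed_chars_field",
    "properties",
    "ignore_missing",
}


def build_attachment_pipeline(field, description="Extract attachment information", **options):
    """Return a pipeline body (dict) with a single attachment processor.

    Hypothetical helper: options that are not passed are left out of the
    body, so the plugin defaults from the table apply.
    """
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported attachment options: {sorted(unknown)}")
    processor = {"field": field}
    processor.update(options)
    return {"description": description, "processors": [{"attachment": processor}]}


if __name__ == "__main__":
    body = build_attachment_pipeline(
        "data",
        indexed_chars=-1,
        properties=["content", "content_type"],
        ignore_missing=True,
    )
    print(json.dumps(body, indent=2))
```

Rejecting unknown keys early is a design choice of this sketch; it simply keeps typos in parameter names from reaching the cluster.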
Usage:
Single attachment
1. Create a pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data"
}
}
]
}
2. Write data to Elasticsearch
An attachment can be written to Elasticsearch in two ways:
(1) Write the base64 encoding of the attachment directly
POST my_index/_doc?pipeline=attachment
{
"filename": "test.docx",
"data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
}
Here, 5rWL6K+V5paH5Lu25pWw5o2udGVzdA== is the attachment content converted to base64.
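To make the encoding step concrete, here is a small Python sketch that produces the base64 string for an attachment and builds the document body. The function names `encode_attachment` and `build_document` are illustrative assumptions; the field names `filename` and `data` follow the example above.

```python
import base64
import json


def encode_attachment(raw: bytes) -> str:
    """Base64-encode raw attachment bytes as an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def build_document(filename: str, raw: bytes) -> dict:
    """Build the doc body expected by a pipeline whose `field` is "data"."""
    return {"filename": filename, "data": encode_attachment(raw)}


if __name__ == "__main__":
    # In practice the bytes would come from a file, e.g. open(path, "rb").read().
    sample = "测试文件数据test".encode("utf-8")
    print(json.dumps(build_document("test.docx", sample)))
```

Encoding the UTF-8 bytes of the sample text 测试文件数据test this way yields the same string used in the request above.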
(2) Write the attachment to Elasticsearch directly with the Linux curl command
curl -u user:pwd -H 'Content-Type: application/json' -XPOST IP:PORT/my_index/_doc?pipeline=attachment -d'
{
"filename": "test.docx",
"data": "'`base64 -w 0 /test_path/test.docx | perl -pe 's/\n/\\n/g'`'"
}'
In this way, the test.docx file is written to Elasticsearch directly.
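For environments where a script is preferable to curl, the same request could be assembled with the Python standard library. This is only a sketch: `build_index_request` is a hypothetical helper, and the host, credentials, and file content in the usage lines are placeholders. It builds the request without sending it.

```python
import base64
import json
import urllib.request


def build_index_request(base_url, index, pipeline, filename, raw, user, password):
    """Build (but do not send) a POST request equivalent to the curl command."""
    body = json.dumps({
        "filename": filename,
        "data": base64.b64encode(raw).decode("ascii"),
    }).encode("utf-8")
    # HTTP basic authentication, the counterpart of curl's -u user:pwd.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc?pipeline={pipeline}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace them with a real endpoint and file bytes.
    req = build_index_request(
        "http://127.0.0.1:9200", "my_index", "attachment",
        "test.docx", b"test", "user", "pwd",
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it to a reachable cluster.
```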
3. Query the data
GET my_index/_search
{
"query": {
"match": {
"attachment.content": "test"
}
}
}
The result is as follows:
{
"took": 292,
"timed_out": false,
...
{
"_index": "my_index",
"_type": "_doc",
"_id": "RL_1CXMByfR8Yao01Qs9",
"_score": 1.9692057,
"_source": {
"filename": "test.docx",
"data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
"attachment": {
"content_type": "text/plain; charset=UTF-8",
"language": "lt",
"content": "测试文件数据test",
"content_length": 10
}
}
}
...
}
Limiting the attachment content length
1. Create a pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : 6,
"indexed_chars_field" : "my_field_size"
}
}
]
}
2. Write data
POST /my_index/_doc?pipeline=attachment
{
"filename": "test.docx",
"data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
}
POST /my_index/_doc?pipeline=attachment
{
"filename": "test2.docx",
"my_field_size": 2,
"data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA=="
}
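We write two docs with attachments. The content of both attachments is 测试文件数据test (six Chinese characters followed by "test"). In the second doc, my_field_size, the field named by indexed_chars_field, is set to 2.
Note: Only the method of writing base64 directly is shown here.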
3. Query the data
GET my_index/_search
{
"query": {
"match_all": {}
}
}
The result is as follows:
{
"took": 292,
"timed_out": false,
...
{
"_index": "my_index",
"_type": "_doc",
"_id": "RL_1CXMByfR8Yao01Qs9",
"_score": 1.9692057,
"_source": {
"filename": "test.docx",
"data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
"attachment": {
"content_type": "text/plain; charset=UTF-8",
"language": "lt",
"content": "测试文件数据",
"content_length": 6
}
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "RL_1CXMByfR8Yao01Qs9",
"_score": 1.9692057,
"_source": {
"filename": "test2.docx",
"my_field_size": 2,
"data": "5rWL6K+V5paH5Lu25pWw5o2udGVzdA==",
"attachment": {
"content_type": "text/plain; charset=UTF-8",
"language": "lt",
"content": "测试",
"content_length": 2
}
}
}
...
}
As shown, the attachment content of the first doc is kept to 6 characters, while that of the second doc is kept to 2 characters.
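The way the two limits interact can be modeled in a few lines of Python. This is a simplified model of the behavior observed above, not the plugin's actual code; `effective_limit` and `truncate_content` are hypothetical names.

```python
def effective_limit(doc, indexed_chars=100000, indexed_chars_field=None):
    """Return the per-document limit if the doc carries one, else the pipeline limit."""
    if indexed_chars_field is not None and indexed_chars_field in doc:
        return int(doc[indexed_chars_field])
    return indexed_chars


def truncate_content(text, limit):
    """Keep at most `limit` characters; -1 (any negative value in this sketch) means no limit."""
    return text if limit < 0 else text[:limit]


if __name__ == "__main__":
    text = "测试文件数据test"
    for doc in ({"filename": "test.docx"}, {"filename": "test2.docx", "my_field_size": 2}):
        limit = effective_limit(doc, indexed_chars=6, indexed_chars_field="my_field_size")
        print(doc["filename"], truncate_content(text, limit))
```

The pipeline-level value (6) applies to the doc without my_field_size, and the per-document value (2) takes over when the field is present, matching the two results above.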
Multiple attachments
1. Create a pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information from arrays",
"processors" : [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data"
}
}
}
}
]
}
Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the correct fields cannot be matched.
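To show why the `_ingest._value.*` prefix matters, the sketch below imitates the foreach processor in plain Python: each array element is visited in turn, and the result is written back into that same element. It is a conceptual stand-in only. `extract_text` is a hypothetical placeholder that handles plain UTF-8 text, whereas the real plugin delegates extraction to Apache Tika.

```python
import base64


def extract_text(data_b64):
    """Placeholder for the attachment processor: decode base64 as UTF-8 text."""
    return base64.b64decode(data_b64).decode("utf-8").strip()


def foreach_attachment(doc, array_field="attachments"):
    """Apply the placeholder processor to every element of `array_field`."""
    for value in doc.get(array_field, []):
        # `value` plays the role of _ingest._value: "data" is read from the
        # current element and "attachment" is written back into it.
        value["attachment"] = {"content": extract_text(value["data"])}
    return doc


if __name__ == "__main__":
    doc = {
        "attachments": [
            {"filename": "ipsum.txt", "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="},
            {"filename": "test.txt", "data": "VGhpcyBpcyBhIHRlc3QK"},
        ]
    }
    for item in foreach_attachment(doc)["attachments"]:
        print(item["filename"], repr(item["attachment"]["content"]))
```

Without the per-element prefix, the processor would look for a top-level data field, which does not exist in a document shaped like this one.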
2. Write data with multiple attachments
POST /my_index/_doc?pipeline=attachment
{
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK"
}
]
}
Note: Only the method of writing base64 directly is shown here.
3. Query the data
GET my_index/_search
{
"query": {
"match_all": {}
}
}
The result is as follows:
{
"took": 292,
"timed_out": false,
...
{
"_index": "my_index",
"_id": "RL_1CXMByfR8Yao01Qs9",
"_version": 1,
"_seq_no": 50,
"_primary_term": 1,
"found": true,
"_source": {
"attachments": [
{
"filename": "ipsum.txt",
"data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
"attachment": {
"content_type": "text/plain; charset=ISO-8859-1",
"language": "en",
"content": "this is\njust some text",
"content_length": 24
}
},
{
"filename": "test.txt",
"data": "VGhpcyBpcyBhIHRlc3QK",
"attachment": {
"content_type": "text/plain; charset=ISO-8859-1",
"language": "en",
"content": "This is a test",
"content_length": 16
}
}
]
}
}
...
}