Baidu NLP Chinese Word Segmentation Plugin

Last updated: 2024-06-12

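analysis-baidu-nlp is a Chinese word segmentation plugin developed in-house by the Baidu AI Cloud Elasticsearch (ES) team. The team reports industry-leading performance and accuracy for Chinese word segmentation.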

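Background

analysis-baidu-nlp is built on DeepCRF, a model developed internally by Baidu NLP. The model draws on more than a decade of Baidu's experience in Chinese search, and Baidu describes its performance and accuracy as industry-leading.

The plugin provides segmentation results at two granularities, basic and phrase, to suit different application needs. Phrase-granularity results are produced by intelligently combining basic-granularity tokens.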

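Notes:

The dictionary model is loaded into JVM off-heap memory the first time it is used. We recommend choosing a package whose nodes have at least 8 GB of memory.

The NLP Chinese word segmentation plugin currently supports instances running versions 6.5.3, 6.8.14, 7.4.2, and 7.10.2. If your cluster does not support the plugin, submit a ticket and the BES team will help upgrade the cluster. For the upgrade procedure, see ES version upgrade.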
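The two notes above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the plugin: the function name check_cluster and its parameters are hypothetical, and the version list and the 8 GB threshold are simply copied from the notes.

```python
from typing import List

# Versions listed in the note above as supporting the plugin.
SUPPORTED_VERSIONS = {"6.5.3", "6.8.14", "7.4.2", "7.10.2"}
# Recommended minimum node memory (GB) from the note above.
MIN_NODE_MEMORY_GB = 8


def check_cluster(version: str, node_memory_gb: float) -> List[str]:
    """Return a list of problems; an empty list means the cluster looks suitable."""
    problems = []
    if version not in SUPPORTED_VERSIONS:
        problems.append(
            f"version {version} is not in the supported list; an upgrade is needed"
        )
    if node_memory_gb < MIN_NODE_MEMORY_GB:
        problems.append(
            f"node memory {node_memory_gb} GB is below the recommended "
            f"{MIN_NODE_MEMORY_GB} GB"
        )
    return problems


for problem in check_cluster("7.9.0", 4):
    print(problem)
```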

Segmentation granularity

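analysis-baidu-nlp provides Analyzers at two granularities:

Basic-granularity model (bd-nlp-basic)
Phrase-granularity model (bd-nlp-phrase)

Both Analyzers have a case-normalization filter and a stop-word filter built in, so they work out of the box.

Two Tokenizers with the same names are also provided:

Basic-granularity model (bd-nlp-basic)
Phrase-granularity model (bd-nlp-phrase)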

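The two Tokenizers return only the raw segmentation result. You can add your own stop-word filtering and other, more complex filters to suit your application. The following example defines a stop filter with the stop words and, is, and the: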

PUT /test_index
{
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {
                    "type": "stop",
                    "stopwords": ["and", "is", "the"]
                }
            }
        }
    }
}
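The request above only defines the my_stop filter; to take effect it still has to be referenced from an analyzer. The sketch below builds one possible create-index body that pairs the filter with the bd-nlp-basic Tokenizer. The analyzer name my_stop_analyzer and the helper build_index_settings are hypothetical names chosen for illustration, and adding the built-in lowercase filter is an assumption about what an application might want.

```python
import json


def build_index_settings(tokenizer="bd-nlp-basic", stopwords=("and", "is", "the")):
    """Build a create-index body that attaches a custom stop filter to a plugin tokenizer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Same filter definition as in the request above.
                    "my_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    # Hypothetical analyzer name used only in this sketch.
                    "my_stop_analyzer": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            }
        }
    }


body = build_index_settings()
# The printed JSON can be used as the body of PUT /test_index.
print(json.dumps(body, indent=4, ensure_ascii=False))
```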

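Comparison with ik at basic and phrase granularity

Basic granularity comparison

Comparing basic-granularity segmentation of “维修基金” with bd-nlp-basic and ik_max_word

bd-nlp-basic segmentation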

POST /_analyze
{
    "text": "维修基金",
    "analyzer": "bd-nlp-basic"
}

Segmentation result:

{
   "tokens": [
      {
         "token": "维修",
         "start_offset": 0,
         "end_offset": 2,
         "type": "WORD",
         "position": 0
      },
      {
         "token": "基金",
         "start_offset": 2,
         "end_offset": 4,
         "type": "WORD",
         "position": 1
      }
   ]
}

ik_max_word segmentation

POST _analyze
{
    "analyzer": "ik_max_word",
    "text": "维修基金"
}

Segmentation result:

{
   "tokens": [
      {
         "token": "维修基金",
         "start_offset": 0,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 0
      },
      {
         "token": "维修",
         "start_offset": 0,
         "end_offset": 2,
         "type": "CN_WORD",
         "position": 1
      },
      {
         "token": "维",
         "start_offset": 0,
         "end_offset": 1,
         "type": "CN_WORD",
         "position": 2
      },
      {
         "token": "修",
         "start_offset": 1,
         "end_offset": 2,
         "type": "CN_CHAR",
         "position": 3
      },
      {
         "token": "基金",
         "start_offset": 2,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 4
      },
      {
         "token": "基",
         "start_offset": 2,
         "end_offset": 3,
         "type": "CN_WORD",
         "position": 5
      },
      {
         "token": "金",
         "start_offset": 3,
         "end_offset": 4,
         "type": "CN_CHAR",
         "position": 6
      }
   ]
}
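One way to read the two results above is to check whether the returned tokens overlap. The sketch below uses only the tokens and offsets shown in the responses above; the helper names token_texts and has_overlaps are hypothetical.

```python
def token_texts(response):
    """Return the token strings of an _analyze response in order."""
    return [t["token"] for t in response["tokens"]]


def has_overlaps(response):
    """Return True if any two tokens cover the same character range."""
    spans = sorted((t["start_offset"], t["end_offset"]) for t in response["tokens"])
    max_end = 0
    for start, end in spans:
        if start < max_end:
            return True
        max_end = max(max_end, end)
    return False


# Tokens and offsets copied from the two responses above.
bd_basic = {"tokens": [
    {"token": "维修", "start_offset": 0, "end_offset": 2},
    {"token": "基金", "start_offset": 2, "end_offset": 4},
]}
ik_max_word = {"tokens": [
    {"token": "维修基金", "start_offset": 0, "end_offset": 4},
    {"token": "维修", "start_offset": 0, "end_offset": 2},
    {"token": "维", "start_offset": 0, "end_offset": 1},
    {"token": "修", "start_offset": 1, "end_offset": 2},
    {"token": "基金", "start_offset": 2, "end_offset": 4},
    {"token": "基", "start_offset": 2, "end_offset": 3},
    {"token": "金", "start_offset": 3, "end_offset": 4},
]}

for name, resp in (("bd-nlp-basic", bd_basic), ("ik_max_word", ik_max_word)):
    print(name, len(token_texts(resp)), "tokens, overlapping:", has_overlaps(resp))
```

For this input, bd-nlp-basic returns two non-overlapping tokens, while ik_max_word returns seven tokens that cover the same characters several times.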

Comparing phrase-granularity segmentation of “清明节，又称踏青节、行清节、三月节、祭祖节等” with bd-nlp-phrase and ik_smart

bd-nlp-phrase segmentation

POST /_analyze
{
    "text": "清明节，又称踏青节、行清节、三月节、祭祖节等",
    "analyzer": "bd-nlp-phrase"
}

Phrase segmentation result:

{
   "tokens": [
      {
         "token": "清明节",
         "start_offset": 0,
         "end_offset": 3,
         "type": "WORD",
         "position": 0
      },
      {
         "token": "又称",
         "start_offset": 4,
         "end_offset": 6,
         "type": "WORD",
         "position": 2
      },
      {
         "token": "踏青节",
         "start_offset": 6,
         "end_offset": 9,
         "type": "WORD",
         "position": 3
      },
      {
         "token": "行清节",
         "start_offset": 10,
         "end_offset": 13,
         "type": "WORD",
         "position": 5
      },
      {
         "token": "三月节",
         "start_offset": 14,
         "end_offset": 17,
         "type": "WORD",
         "position": 7
      },
      {
         "token": "祭祖",
         "start_offset": 18,
         "end_offset": 20,
         "type": "WORD",
         "position": 9
      },
      {
         "token": "节",
         "start_offset": 20,
         "end_offset": 21,
         "type": "WORD",
         "position": 10
      }
   ]
}

ik_smart segmentation

POST _analyze
{
    "analyzer": "ik_smart",
    "text": "清明节，又称踏青节、行清节、三月节、祭祖节等"
}

Segmentation result:

{
   "tokens": [
      {
         "token": "清明节",
         "start_offset": 0,
         "end_offset": 3,
         "type": "CN_WORD",
         "position": 0
      },
      {
         "token": "又称",
         "start_offset": 4,
         "end_offset": 6,
         "type": "CN_WORD",
         "position": 1
      },
      {
         "token": "踏青",
         "start_offset": 6,
         "end_offset": 8,
         "type": "CN_WORD",
         "position": 2
      },
      {
         "token": "节",
         "start_offset": 8,
         "end_offset": 9,
         "type": "CN_WORD",
         "position": 3
      },
      {
         "token": "行",
         "start_offset": 10,
         "end_offset": 11,
         "type": "CN_WORD",
         "position": 4
      },
      {
         "token": "清",
         "start_offset": 11,
         "end_offset": 12,
         "type": "CN_CHAR",
         "position": 5
      },
      {
         "token": "节",
         "start_offset": 12,
         "end_offset": 13,
         "type": "CN_WORD",
         "position": 6
      },
      {
         "token": "三月",
         "start_offset": 14,
         "end_offset": 16,
         "type": "CN_WORD",
         "position": 7
      },
      {
         "token": "节",
         "start_offset": 16,
         "end_offset": 17,
         "type": "COUNT",
         "position": 8
      },
      {
         "token": "祭祖",
         "start_offset": 18,
         "end_offset": 20,
         "type": "CN_WORD",
         "position": 9
      },
      {
         "token": "节",
         "start_offset": 20,
         "end_offset": 21,
         "type": "CN_WORD",
         "position": 10
      }
   ]
}

Using the Analyze API

Basic-granularity segmentation

POST /_analyze
{
   "analyzer": "bd-nlp-basic",
   "text": "去年我们和他们展开了炉际竞赛，第一回合赢了，第二回合和第三回合却败下阵来。"
}

Segmentation result:

{
  "tokens": [
     {
        "token": "去年",
        "start_offset": 0,
        "end_offset": 2,
        "type": "WORD",
        "position": 0
     },
     {
        "token": "我们",
        "start_offset": 2,
        "end_offset": 4,
        "type": "WORD",
        "position": 1
     },
     {
        "token": "和",
        "start_offset": 4,
        "end_offset": 5,
        "type": "WORD",
        "position": 2
     },
     {
        "token": "他们",
        "start_offset": 5,
        "end_offset": 7,
        "type": "WORD",
        "position": 3
     },
     {
        "token": "展开",
        "start_offset": 7,
        "end_offset": 9,
        "type": "WORD",
        "position": 4
     },
     {
        "token": "炉际",
        "start_offset": 10,
        "end_offset": 12,
        "type": "WORD",
        "position": 6
     },
     {
        "token": "竞赛",
        "start_offset": 12,
        "end_offset": 14,
        "type": "WORD",
        "position": 7
     },
     {
        "token": "第一",
        "start_offset": 15,
        "end_offset": 17,
        "type": "WORD",
        "position": 9
     },
     {
        "token": "回合",
        "start_offset": 17,
        "end_offset": 19,
        "type": "WORD",
        "position": 10
     },
     {
        "token": "赢",
        "start_offset": 19,
        "end_offset": 20,
        "type": "WORD",
        "position": 11
     },
     {
        "token": "第二",
        "start_offset": 22,
        "end_offset": 24,
        "type": "WORD",
        "position": 14
     },
     {
        "token": "回合",
        "start_offset": 24,
        "end_offset": 26,
        "type": "WORD",
        "position": 15
     },
     {
        "token": "和",
        "start_offset": 26,
        "end_offset": 27,
        "type": "WORD",
        "position": 16
     },
     {
        "token": "第三",
        "start_offset": 27,
        "end_offset": 29,
        "type": "WORD",
        "position": 17
     },
     {
        "token": "回合",
        "start_offset": 29,
        "end_offset": 31,
        "type": "WORD",
        "position": 18
     },
     {
        "token": "败",
        "start_offset": 32,
        "end_offset": 33,
        "type": "WORD",
        "position": 20
     },
     {
        "token": "下",
        "start_offset": 33,
        "end_offset": 34,
        "type": "WORD",
        "position": 21
     },
     {
        "token": "阵",
        "start_offset": 34,
        "end_offset": 35,
        "type": "WORD",
        "position": 22
     },
     {
        "token": "来",
        "start_offset": 35,
        "end_offset": 36,
        "type": "WORD",
        "position": 23
     }
  ]
}
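The position values in the result above skip some numbers, which suggests that the Analyzer dropped some characters. The sketch below recovers the dropped text from the token offsets. The offsets are copied from the response above; the helper dropped_spans is hypothetical, and attributing the gaps to the built-in stop-word filter and to punctuation removal is an inference rather than documented behavior.

```python
TEXT = "去年我们和他们展开了炉际竞赛，第一回合赢了，第二回合和第三回合却败下阵来。"

# (start_offset, end_offset) of every token in the response above, in order.
OFFSETS = [
    (0, 2), (2, 4), (4, 5), (5, 7), (7, 9), (10, 12), (12, 14),
    (15, 17), (17, 19), (19, 20), (22, 24), (24, 26), (26, 27),
    (27, 29), (29, 31), (32, 33), (33, 34), (34, 35), (35, 36),
]


def dropped_spans(text, offsets):
    """Return the pieces of text that are not covered by any token."""
    dropped = []
    cursor = 0
    for start, end in offsets:
        if start > cursor:
            dropped.append(text[cursor:start])
        cursor = end
    if cursor < len(text):
        dropped.append(text[cursor:])
    return dropped


print(dropped_spans(TEXT, OFFSETS))
```

For this sentence the uncovered pieces are 了, ，, 了，, 却, and the final 。, while 和 is kept as a token.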

Phrase-granularity segmentation

POST /_analyze
{
   "analyzer": "bd-nlp-phrase",
   "text": "去年我们和他们展开了炉际竞赛，第一回合赢了，第二回合和第三回合却败下阵来。"
}

Segmentation result:

{
  "tokens": [
     {
        "token": "去年",
        "start_offset": 0,
        "end_offset": 2,
        "type": "WORD",
        "position": 0
     },
     {
        "token": "我们",
        "start_offset": 2,
        "end_offset": 4,
        "type": "WORD",
        "position": 1
     },
     {
        "token": "和",
        "start_offset": 4,
        "end_offset": 5,
        "type": "WORD",
        "position": 2
     },
     {
        "token": "他们",
        "start_offset": 5,
        "end_offset": 7,
        "type": "WORD",
        "position": 3
     },
     {
        "token": "展开",
        "start_offset": 7,
        "end_offset": 9,
        "type": "WORD",
        "position": 4
     },
     {
        "token": "炉际竞赛",
        "start_offset": 10,
        "end_offset": 14,
        "type": "WORD",
        "position": 6
     },
     {
        "token": "第一回合",
        "start_offset": 15,
        "end_offset": 19,
        "type": "WORD",
        "position": 8
     },
     {
        "token": "赢",
        "start_offset": 19,
        "end_offset": 20,
        "type": "WORD",
        "position": 9
     },
     {
        "token": "第二回合",
        "start_offset": 22,
        "end_offset": 26,
        "type": "WORD",
        "position": 12
     },
     {
        "token": "和",
        "start_offset": 26,
        "end_offset": 27,
        "type": "WORD",
        "position": 13
     },
     {
        "token": "第三",
        "start_offset": 27,
        "end_offset": 29,
        "type": "WORD",
        "position": 14
     },
     {
        "token": "回合",
        "start_offset": 29,
        "end_offset": 31,
        "type": "WORD",
        "position": 15
     },
     {
        "token": "败",
        "start_offset": 32,
        "end_offset": 33,
        "type": "WORD",
        "position": 17
     },
     {
        "token": "下",
        "start_offset": 33,
        "end_offset": 34,
        "type": "WORD",
        "position": 18
     },
     {
        "token": "阵",
        "start_offset": 34,
        "end_offset": 35,
        "type": "WORD",
        "position": 19
     },
     {
        "token": "来",
        "start_offset": 35,
        "end_offset": 36,
        "type": "WORD",
        "position": 20
     }
  ]
}
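Comparing the two results shows how phrase-granularity tokens combine basic-granularity tokens. The sketch below uses a few tokens and offsets copied from the two responses above and lists, for each phrase token, the basic tokens it covers. The helper merged_from is a hypothetical name, and the matching by character offsets is only an illustration, not a description of how the plugin combines tokens internally.

```python
# (token, start_offset, end_offset) copied from the bd-nlp-basic response.
basic = [
    ("炉际", 10, 12), ("竞赛", 12, 14),
    ("第一", 15, 17), ("回合", 17, 19), ("赢", 19, 20),
]
# Corresponding tokens copied from the bd-nlp-phrase response.
phrase = [("炉际竞赛", 10, 14), ("第一回合", 15, 19), ("赢", 19, 20)]


def merged_from(phrase_tokens, basic_tokens):
    """Map each phrase token to the basic tokens inside its offset range.

    Only phrase tokens that span more than one basic token are reported.
    """
    result = {}
    for text, start, end in phrase_tokens:
        parts = [b for b, s, e in basic_tokens if s >= start and e <= end]
        if len(parts) > 1:
            result[text] = parts
    return result


for phrase_token, parts in merged_from(phrase, basic).items():
    print(phrase_token, "=", " + ".join(parts))
```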

Specifying an Analyzer for an index

PUT test
{
   "mappings": {
      "doc": {
         "properties": {
            "k1": {
               "type": "text",
               "analyzer": "bd-nlp-basic" // use the basic-granularity model
            },
            "k2": {
               "type": "text",
               "analyzer": "bd-nlp-phrase" // use the phrase-granularity model
            }
         }
      }
   },
   "settings": {
      "index": {
         "number_of_shards": "1",
         "number_of_replicas": "0"
      }
   }
}
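The request above nests the mapping under a type named doc, which matches the 6.x request format. On Elasticsearch 7.x, mapping types are omitted from create-index requests by default. The sketch below builds the mappings section for either major version; the helper build_mappings and its parameters are hypothetical, and the version handling reflects general Elasticsearch behavior rather than anything specific to this plugin.

```python
import json


def build_mappings(properties, es_major, type_name="doc"):
    """Build the "mappings" section of a create-index body.

    6.x: properties are nested under a mapping type name.
    7.x: properties sit directly under "mappings" (typeless).
    """
    if es_major >= 7:
        return {"properties": properties}
    return {type_name: {"properties": properties}}


properties = {
    "k1": {"type": "text", "analyzer": "bd-nlp-basic"},
    "k2": {"type": "text", "analyzer": "bd-nlp-phrase"},
}

for major in (6, 7):
    body = {"mappings": build_mappings(properties, major)}
    print(f"ES {major}.x:", json.dumps(body, ensure_ascii=False))
```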

Specifying a Tokenizer for an index

PUT /test
{
    "settings":{
        "analysis":{
            "analyzer":{
                "my_analyzer":{
                    "tokenizer":"bd-nlp-basic",   // define a custom analyzer
                    "filter":[
                        "lowercase"               // add the filters your application needs
                    ]
                }
            }
        }
    },
    "mappings":{
        "properties":{
            "k2":{
                "type":"text",
                "analyzer":"my_analyzer"         // apply the custom analyzer to the field
            }
        }
    }
}
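Once the index exists, the custom analyzer can be tried out through the index-level _analyze endpoint. The sketch below only constructs the HTTP request with the Python standard library and does not send it. The endpoint http://localhost:9200 is an assumed address, and build_analyze_request is a hypothetical helper name.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed cluster address; replace with your own


def build_analyze_request(index, analyzer, text):
    """Build a POST /<index>/_analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return Request(
        f"{ES_URL}/{index}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request("test", "my_analyzer", "维修基金")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it to the cluster; authentication, if the cluster requires it, is left out of this sketch.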

Precision and recall

Results on a large internal Baidu dataset:

Model	Precision	Recall	F-score
analysis-baidu-nlp	98.8%	98.9%	98.8%
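The source does not say how the F-score is computed. Assuming it is the standard F1 score, that is, the harmonic mean of precision and recall, the reported figures can be cross-checked with a short calculation. The function name f1_score is a hypothetical helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table above.
f1 = f1_score(0.988, 0.989)
print(f"F1 = {f1:.4f}")
```

Under that assumption the result is roughly 0.9885, which is consistent with the 98.8% in the table.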
