Baidu NLP Chinese Word Segmentation Plugin

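          analysis-baidu-nlp is a Chinese word segmentation plugin developed in-house by the Baidu AI Cloud Elasticsearch (ES) team, with industry-leading performance and accuracy in Chinese word segmentation.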

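          Background

          analysis-baidu-nlp is built on the DeepCRF model developed in-house by Baidu NLP. The model draws on more than ten years of Baidu's work in Chinese search, and its performance and accuracy are industry-leading.

          The plugin provides segmentation results at two granularities, basic and phrase, to suit different application needs. Phrase-granularity results are produced by intelligently combining basic-granularity tokens.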

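          Note:

          • The dictionary model is loaded into JVM off-heap memory the first time it is used, so we recommend a package whose nodes have at least 8 GB of memory.
          • The NLP Chinese word segmentation plugin currently supports instances running version 6.5.3 or 7.4.2. If your cluster does not support the plugin, submit a ticket and the BES team will help upgrade the cluster. For the upgrade procedure, see ES version upgrade.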

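          Segmentation granularity

          analysis-baidu-nlp provides Analyzers at two granularities:

          1. Basic-granularity model (bd-nlp-basic)
          2. Phrase-granularity model (bd-nlp-phrase)

          Both Analyzers have a case filter and a stop word filter built in and work out of the box.

          Two Tokenizers with the same names are also provided:

          1. Basic-granularity model (bd-nlp-basic)
          2. Phrase-granularity model (bd-nlp-phrase)

          The Tokenizers return only the raw segmentation result, so you can add your own stop word filtering and any more complex filters your application needs.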
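          As an illustration of pairing a raw Tokenizer with your own filters, the sketch below builds index settings for a custom analyzer that combines the bd-nlp-basic Tokenizer with Elasticsearch's standard lowercase and stop token filters. The analyzer name my_analyzer, the filter name my_stop, and the stop word list are hypothetical placeholders, not values shipped with the plugin.

```python
import json


def custom_analyzer_settings(tokenizer, stopwords):
    """Build index settings for a custom analyzer on top of a raw tokenizer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # "stop" is Elasticsearch's built-in stop token filter type;
                    # the name "my_stop" and the word list are placeholders.
                    "my_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            }
        }
    }


body = custom_analyzer_settings("bd-nlp-basic", ["的", "了"])
# Print the JSON body that could be sent with a create-index request.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

          Swapping "bd-nlp-basic" for "bd-nlp-phrase" gives the same filter chain on top of the phrase-granularity Tokenizer.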

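          Comparison with ik at basic and phrase granularity

          Basic granularity comparison

          Comparing the segmentation of “维修基金” by bd-nlp-basic and ik_max_word:

          • bd-nlp-basic segmentation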
          POST /_analyze
          {
              "text": "维修基金",
              "analyzer": "bd-nlp-basic"
          }

          Segmentation result:

          {
             "tokens": [
                {
                   "token": "维修",
                   "start_offset": 0,
                   "end_offset": 2,
                   "type": "WORD",
                   "position": 0
                },
                {
                   "token": "基金",
                   "start_offset": 2,
                   "end_offset": 4,
                   "type": "WORD",
                   "position": 1
                }
             ]
          }
          • ik_max_word segmentation
          POST _analyze
          {
              "analyzer": "ik_max_word",
              "text": "维修基金"
          }

          Segmentation result:

          {
             "tokens": [
                {
                   "token": "维修基金",
                   "start_offset": 0,
                   "end_offset": 4,
                   "type": "CN_WORD",
                   "position": 0
                },
                {
                   "token": "维修",
                   "start_offset": 0,
                   "end_offset": 2,
                   "type": "CN_WORD",
                   "position": 1
                },
                {
                   "token": "维",
                   "start_offset": 0,
                   "end_offset": 1,
                   "type": "CN_WORD",
                   "position": 2
                },
                {
                   "token": "修",
                   "start_offset": 1,
                   "end_offset": 2,
                   "type": "CN_CHAR",
                   "position": 3
                },
                {
                   "token": "基金",
                   "start_offset": 2,
                   "end_offset": 4,
                   "type": "CN_WORD",
                   "position": 4
                },
                {
                   "token": "基",
                   "start_offset": 2,
                   "end_offset": 3,
                   "type": "CN_WORD",
                   "position": 5
                },
                {
                   "token": "金",
                   "start_offset": 3,
                   "end_offset": 4,
                   "type": "CN_CHAR",
                   "position": 6
                }
             ]
          }
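          To compare two such responses programmatically, a small helper can list the token strings and check whether an analyzer emits overlapping tokens. This is a sketch over the two results above, abbreviated to token text and offsets; the helper names token_texts and has_overlaps are hypothetical.

```python
def token_texts(analyze_response):
    """Return the token strings from an _analyze response, in order."""
    return [t["token"] for t in analyze_response["tokens"]]


def has_overlaps(analyze_response):
    """True if any token starts before an earlier token has ended."""
    furthest_end = 0
    for t in analyze_response["tokens"]:
        if t["start_offset"] < furthest_end:
            return True
        furthest_end = max(furthest_end, t["end_offset"])
    return False


# Abbreviated copies of the two responses shown above.
bd_basic = {"tokens": [
    {"token": "维修", "start_offset": 0, "end_offset": 2},
    {"token": "基金", "start_offset": 2, "end_offset": 4},
]}
ik_max_word = {"tokens": [
    {"token": "维修基金", "start_offset": 0, "end_offset": 4},
    {"token": "维修", "start_offset": 0, "end_offset": 2},
    {"token": "维", "start_offset": 0, "end_offset": 1},
    {"token": "修", "start_offset": 1, "end_offset": 2},
    {"token": "基金", "start_offset": 2, "end_offset": 4},
    {"token": "基", "start_offset": 2, "end_offset": 3},
    {"token": "金", "start_offset": 3, "end_offset": 4},
]}

for name, response in (("bd-nlp-basic", bd_basic), ("ik_max_word", ik_max_word)):
    print(name, len(token_texts(response)), has_overlaps(response))
```

          For this input, bd-nlp-basic yields 2 non-overlapping tokens, while ik_max_word yields 7 tokens that overlap one another.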

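          Phrase granularity comparison

          Comparing the phrase-level segmentation of “清明节,又称踏青节、行清节、三月节、祭祖节等” by bd-nlp-phrase and ik_smart:

          • bd-nlp-phrase phrase segmentation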
          POST /_analyze
          {
              "text": "清明节,又称踏青节、行清节、三月节、祭祖节等",
              "analyzer": "bd-nlp-phrase"
          }

          Phrase segmentation result:

          {
             "tokens": [
                {
                   "token": "清明节",
                   "start_offset": 0,
                   "end_offset": 3,
                   "type": "WORD",
                   "position": 0
                },
                {
                   "token": "又称",
                   "start_offset": 4,
                   "end_offset": 6,
                   "type": "WORD",
                   "position": 2
                },
                {
                   "token": "踏青节",
                   "start_offset": 6,
                   "end_offset": 9,
                   "type": "WORD",
                   "position": 3
                },
                {
                   "token": "行清节",
                   "start_offset": 10,
                   "end_offset": 13,
                   "type": "WORD",
                   "position": 5
                },
                {
                   "token": "三月节",
                   "start_offset": 14,
                   "end_offset": 17,
                   "type": "WORD",
                   "position": 7
                },
                {
                   "token": "祭祖",
                   "start_offset": 18,
                   "end_offset": 20,
                   "type": "WORD",
                   "position": 9
                },
                {
                   "token": "节",
                   "start_offset": 20,
                   "end_offset": 21,
                   "type": "WORD",
                   "position": 10
                }
             ]
          }
          • ik_smart smart segmentation
          POST _analyze
          {
              "analyzer": "ik_smart",
              "text": "清明节,又称踏青节、行清节、三月节、祭祖节等"
          }

          Segmentation result:

          {
             "tokens": [
                {
                   "token": "清明节",
                   "start_offset": 0,
                   "end_offset": 3,
                   "type": "CN_WORD",
                   "position": 0
                },
                {
                   "token": "又称",
                   "start_offset": 4,
                   "end_offset": 6,
                   "type": "CN_WORD",
                   "position": 1
                },
                {
                   "token": "踏青",
                   "start_offset": 6,
                   "end_offset": 8,
                   "type": "CN_WORD",
                   "position": 2
                },
                {
                   "token": "节",
                   "start_offset": 8,
                   "end_offset": 9,
                   "type": "CN_WORD",
                   "position": 3
                },
                {
                   "token": "行",
                   "start_offset": 10,
                   "end_offset": 11,
                   "type": "CN_WORD",
                   "position": 4
                },
                {
                   "token": "清",
                   "start_offset": 11,
                   "end_offset": 12,
                   "type": "CN_CHAR",
                   "position": 5
                },
                {
                   "token": "节",
                   "start_offset": 12,
                   "end_offset": 13,
                   "type": "CN_WORD",
                   "position": 6
                },
                {
                   "token": "三月",
                   "start_offset": 14,
                   "end_offset": 16,
                   "type": "CN_WORD",
                   "position": 7
                },
                {
                   "token": "节",
                   "start_offset": 16,
                   "end_offset": 17,
                   "type": "COUNT",
                   "position": 8
                },
                {
                   "token": "祭祖",
                   "start_offset": 18,
                   "end_offset": 20,
                   "type": "CN_WORD",
                   "position": 9
                },
                {
                   "token": "节",
                   "start_offset": 20,
                   "end_offset": 21,
                   "type": "CN_WORD",
                   "position": 10
                }
             ]
          }

          Using the Analyze API

          Basic-granularity segmentation

          POST /_analyze
          {
             "analyzer": "bd-nlp-basic",
             "text": "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
          }

          Segmentation result:

          {
            "tokens": [
               {
                  "token": "去年",
                  "start_offset": 0,
                  "end_offset": 2,
                  "type": "WORD",
                  "position": 0
               },
               {
                  "token": "我们",
                  "start_offset": 2,
                  "end_offset": 4,
                  "type": "WORD",
                  "position": 1
               },
               {
                  "token": "和",
                  "start_offset": 4,
                  "end_offset": 5,
                  "type": "WORD",
                  "position": 2
               },
               {
                  "token": "他们",
                  "start_offset": 5,
                  "end_offset": 7,
                  "type": "WORD",
                  "position": 3
               },
               {
                  "token": "展开",
                  "start_offset": 7,
                  "end_offset": 9,
                  "type": "WORD",
                  "position": 4
               },
               {
                  "token": "炉际",
                  "start_offset": 10,
                  "end_offset": 12,
                  "type": "WORD",
                  "position": 6
               },
               {
                  "token": "竞赛",
                  "start_offset": 12,
                  "end_offset": 14,
                  "type": "WORD",
                  "position": 7
               },
               {
                  "token": "第一",
                  "start_offset": 15,
                  "end_offset": 17,
                  "type": "WORD",
                  "position": 9
               },
               {
                  "token": "回合",
                  "start_offset": 17,
                  "end_offset": 19,
                  "type": "WORD",
                  "position": 10
               },
               {
                  "token": "赢",
                  "start_offset": 19,
                  "end_offset": 20,
                  "type": "WORD",
                  "position": 11
               },
               {
                  "token": "第二",
                  "start_offset": 22,
                  "end_offset": 24,
                  "type": "WORD",
                  "position": 14
               },
               {
                  "token": "回合",
                  "start_offset": 24,
                  "end_offset": 26,
                  "type": "WORD",
                  "position": 15
               },
               {
                  "token": "和",
                  "start_offset": 26,
                  "end_offset": 27,
                  "type": "WORD",
                  "position": 16
               },
               {
                  "token": "第三",
                  "start_offset": 27,
                  "end_offset": 29,
                  "type": "WORD",
                  "position": 17
               },
               {
                  "token": "回合",
                  "start_offset": 29,
                  "end_offset": 31,
                  "type": "WORD",
                  "position": 18
               },
               {
                  "token": "败",
                  "start_offset": 32,
                  "end_offset": 33,
                  "type": "WORD",
                  "position": 20
               },
               {
                  "token": "下",
                  "start_offset": 33,
                  "end_offset": 34,
                  "type": "WORD",
                  "position": 21
               },
               {
                  "token": "阵",
                  "start_offset": 34,
                  "end_offset": 35,
                  "type": "WORD",
                  "position": 22
               },
               {
                  "token": "来",
                  "start_offset": 35,
                  "end_offset": 36,
                  "type": "WORD",
                  "position": 23
               }
            ]
          }
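          In the result above, position is not always consecutive: it jumps from 4 to 6 between 展开 and 炉际, where the input contains 了, and from 7 to 9 across the comma. One plausible reading, which the source does not state, is that tokens dropped by the Analyzer still consume a position. The sketch below finds such jumps; the helper name position_gaps is hypothetical, and the sample is an excerpt of the response above.

```python
def position_gaps(tokens):
    """Return (previous_token, next_token, skipped) for each jump in position."""
    gaps = []
    for prev, cur in zip(tokens, tokens[1:]):
        skipped = cur["position"] - prev["position"] - 1
        if skipped > 0:
            gaps.append((prev["token"], cur["token"], skipped))
    return gaps


# Excerpt of the bd-nlp-basic response shown above.
tokens = [
    {"token": "展开", "start_offset": 7, "end_offset": 9, "position": 4},
    {"token": "炉际", "start_offset": 10, "end_offset": 12, "position": 6},
    {"token": "竞赛", "start_offset": 12, "end_offset": 14, "position": 7},
    {"token": "第一", "start_offset": 15, "end_offset": 17, "position": 9},
]

print(position_gaps(tokens))
# [('展开', '炉际', 1), ('竞赛', '第一', 1)]
```

          Position jumps like these affect position-sensitive queries such as match_phrase, so they are worth checking when you design phrase searches.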

          Phrase-granularity segmentation

          POST /_analyze
          {
             "analyzer": "bd-nlp-phrase",
             "text": "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
          }

          Segmentation result:

          {
            "tokens": [
               {
                  "token": "去年",
                  "start_offset": 0,
                  "end_offset": 2,
                  "type": "WORD",
                  "position": 0
               },
               {
                  "token": "我们",
                  "start_offset": 2,
                  "end_offset": 4,
                  "type": "WORD",
                  "position": 1
               },
               {
                  "token": "和",
                  "start_offset": 4,
                  "end_offset": 5,
                  "type": "WORD",
                  "position": 2
               },
               {
                  "token": "他们",
                  "start_offset": 5,
                  "end_offset": 7,
                  "type": "WORD",
                  "position": 3
               },
               {
                  "token": "展开",
                  "start_offset": 7,
                  "end_offset": 9,
                  "type": "WORD",
                  "position": 4
               },
               {
                  "token": "炉际竞赛",
                  "start_offset": 10,
                  "end_offset": 14,
                  "type": "WORD",
                  "position": 6
               },
               {
                  "token": "第一回合",
                  "start_offset": 15,
                  "end_offset": 19,
                  "type": "WORD",
                  "position": 8
               },
               {
                  "token": "赢",
                  "start_offset": 19,
                  "end_offset": 20,
                  "type": "WORD",
                  "position": 9
               },
               {
                  "token": "第二回合",
                  "start_offset": 22,
                  "end_offset": 26,
                  "type": "WORD",
                  "position": 12
               },
               {
                  "token": "和",
                  "start_offset": 26,
                  "end_offset": 27,
                  "type": "WORD",
                  "position": 13
               },
               {
                  "token": "第三",
                  "start_offset": 27,
                  "end_offset": 29,
                  "type": "WORD",
                  "position": 14
               },
               {
                  "token": "回合",
                  "start_offset": 29,
                  "end_offset": 31,
                  "type": "WORD",
                  "position": 15
               },
               {
                  "token": "败",
                  "start_offset": 32,
                  "end_offset": 33,
                  "type": "WORD",
                  "position": 17
               },
               {
                  "token": "下",
                  "start_offset": 33,
                  "end_offset": 34,
                  "type": "WORD",
                  "position": 18
               },
               {
                  "token": "阵",
                  "start_offset": 34,
                  "end_offset": 35,
                  "type": "WORD",
                  "position": 19
               },
               {
                  "token": "来",
                  "start_offset": 35,
                  "end_offset": 36,
                  "type": "WORD",
                  "position": 20
               }
            ]
          }
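          Comparing the two results for the same sentence shows how phrase-granularity tokens combine basic-granularity tokens, for example 炉际 and 竞赛 into 炉际竞赛. The sketch below uses character offsets to recover that relationship from the two responses; the helper name group_by_phrase is hypothetical, and the samples are excerpts of the responses above.

```python
def group_by_phrase(basic_tokens, phrase_tokens):
    """Map each phrase token to the basic tokens whose offsets it covers."""
    groups = []
    for p in phrase_tokens:
        parts = [
            b["token"]
            for b in basic_tokens
            if p["start_offset"] <= b["start_offset"]
            and b["end_offset"] <= p["end_offset"]
        ]
        groups.append((p["token"], parts))
    return groups


# Excerpts of the bd-nlp-basic and bd-nlp-phrase responses shown above.
basic = [
    {"token": "炉际", "start_offset": 10, "end_offset": 12},
    {"token": "竞赛", "start_offset": 12, "end_offset": 14},
    {"token": "第一", "start_offset": 15, "end_offset": 17},
    {"token": "回合", "start_offset": 17, "end_offset": 19},
    {"token": "赢", "start_offset": 19, "end_offset": 20},
]
phrase = [
    {"token": "炉际竞赛", "start_offset": 10, "end_offset": 14},
    {"token": "第一回合", "start_offset": 15, "end_offset": 19},
    {"token": "赢", "start_offset": 19, "end_offset": 20},
]

for phrase_token, parts in group_by_phrase(basic, phrase):
    print(phrase_token, "<-", parts)
```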

          Specifying an Analyzer for an index

          PUT test
          {
             "mappings": {
                "doc": {
                   "properties": {
                      "k1": {
                         "type": "text",
                         "analyzer": "bd-nlp-basic" // use the basic-granularity model
                      },
                      "k2": {
                         "type": "text",
                         "analyzer": "bd-nlp-phrase" // use the phrase-granularity model
                      }
                   }
                }
             },
             "settings": {
                "index": {
                   "number_of_shards": "1",
                   "number_of_replicas": "0"
                }
             }
          }

          Specifying a Tokenizer for an index

          PUT /test
          {
              "settings":{
                  "analysis":{
                      "analyzer":{
                          "my_analyzer":{
                              "tokenizer":"bd-nlp-basic",   // define a custom analyzer
                              "filter":[
                                  "lowercase"               // add the filters your application needs
                              ]
                          }
                      }
                  }
              },
              "mappings":{
                  "properties":{
                      "k2":{
                          "type":"text",
                          "analyzer":"my_analyzer"         // apply the custom analyzer to the field
                      }
                  }
              }
          }
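          The two create-index examples above shape their mappings differently: the first nests properties under a type named doc, and the second is typeless. In stock Elasticsearch, the typed form is the 6.x convention and the typeless form is the 7.x default; that BES behaves the same way on 6.5.3 and 7.4.2 is an assumption here. A minimal sketch of choosing the shape by major version, with the hypothetical helper name build_mappings:

```python
import json


def build_mappings(properties, es_major_version, type_name="doc"):
    """Return a mappings object shaped for the given Elasticsearch major version."""
    if es_major_version <= 6:
        # 6.x convention: field definitions are nested under a mapping type name.
        return {type_name: {"properties": properties}}
    # 7.x default: typeless mappings.
    return {"properties": properties}


fields = {
    "k1": {"type": "text", "analyzer": "bd-nlp-basic"},
    "k2": {"type": "text", "analyzer": "bd-nlp-phrase"},
}

for version in (6, 7):
    print(version, json.dumps({"mappings": build_mappings(fields, version)}))
```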

          Precision and recall

          Test results on a large internal Baidu data set:

          Model                 Precision   Recall   F-score
          analysis-baidu-nlp    98.8%       98.9%    98.8%
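
          The source does not say how the F value is defined. Assuming it is the standard F1 score, the harmonic mean of precision and recall, the reported figures are consistent with one another, as this short check shows (the function name f_score is hypothetical):

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the table above.
f1 = f_score(0.988, 0.989)
print(f"{f1:.1%}")
# 98.8%
```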