Baidu Elasticsearch (BES)

    NLP Chinese Word Segmentation Plugin


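    analysis-baidu-nlp is a Chinese word segmentation plugin developed in-house by the Baidu AI Cloud Elasticsearch (ES) team. Both its performance and its accuracy on Chinese word segmentation are at an industry-leading level.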

    Background

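    analysis-baidu-nlp is built on the DeepCRF model developed internally by Baidu NLP. The model draws on more than a decade of Baidu's experience in Chinese search, and both its performance and its accuracy are at an industry-leading level.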

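    The plugin provides segmentation results at two granularities, basic and phrase, to suit different application needs. Phrase-granularity results are produced by intelligently combining basic-granularity tokens.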

    Note: the dictionary model is loaded into JVM off-heap memory on first use, so we recommend a plan whose nodes have at least 8 GB of memory.

    Segmentation Granularity

    analysis-baidu-nlp provides Analyzers at two granularities:

    1. Basic-granularity model (bd-nlp-basic)
    2. Phrase-granularity model (bd-nlp-phrase)

    Both Analyzers have a case-normalization filter and a stop word filter built in, so they work out of the box.

    Two Tokenizers with the same names are also provided:

    1. Basic-granularity model (bd-nlp-basic)
    2. Phrase-granularity model (bd-nlp-phrase)

    The two Tokenizers return only the raw segmentation output. Users can add their own stop word filters and more complex filters as their application requires.
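
    As a minimal sketch of the Tokenizer route described above, the Python snippet below builds index settings that pair the bd-nlp-phrase Tokenizer with Elasticsearch's standard lowercase and stop token filters. The names build_settings, my_phrase_analyzer and my_stop, and the sample stop words, are hypothetical and chosen only for illustration.

```python
import json


def build_settings(tokenizer, stopwords):
    """Build index settings for a custom analyzer on top of a bd-nlp Tokenizer.

    The analyzer name "my_phrase_analyzer" and the filter name "my_stop" are
    hypothetical; "stop" and "lowercase" are standard Elasticsearch filters.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    "my_phrase_analyzer": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            }
        }
    }


# Sample stop words are illustrative only.
body = build_settings("bd-nlp-phrase", ["的", "了"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

    The printed JSON can be used as the body of a PUT request that creates an index, in the same way as the index examples later in this document.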

    Comparison with ik at Basic and Phrase Granularity

    Basic-granularity comparison

    Comparing bd-nlp-basic and ik_max_word segmentation of “维修基金”:

    • bd-nlp-basic segmentation
    POST /_analyze
    {
        "text": "维修基金",
        "analyzer": "bd-nlp-basic"
    }

    Segmentation result:

    {
       "tokens": [
          {
             "token": "维修",
             "start_offset": 0,
             "end_offset": 2,
             "type": "WORD",
             "position": 0
          },
          {
             "token": "基金",
             "start_offset": 2,
             "end_offset": 4,
             "type": "WORD",
             "position": 1
          }
       ]
    }
    • ik_max_word segmentation
    POST _analyze
    {
        "analyzer": "ik_max_word",
        "text": "维修基金"
    }

    Segmentation result:

    {
       "tokens": [
          {
             "token": "维修基金",
             "start_offset": 0,
             "end_offset": 4,
             "type": "CN_WORD",
             "position": 0
          },
          {
             "token": "维修",
             "start_offset": 0,
             "end_offset": 2,
             "type": "CN_WORD",
             "position": 1
          },
          {
             "token": "维",
             "start_offset": 0,
             "end_offset": 1,
             "type": "CN_WORD",
             "position": 2
          },
          {
             "token": "修",
             "start_offset": 1,
             "end_offset": 2,
             "type": "CN_CHAR",
             "position": 3
          },
          {
             "token": "基金",
             "start_offset": 2,
             "end_offset": 4,
             "type": "CN_WORD",
             "position": 4
          },
          {
             "token": "基",
             "start_offset": 2,
             "end_offset": 3,
             "type": "CN_WORD",
             "position": 5
          },
          {
             "token": "金",
             "start_offset": 3,
             "end_offset": 4,
             "type": "CN_CHAR",
             "position": 6
          }
       ]
    }
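
    One way to read the two responses: bd-nlp-basic returns tokens that do not overlap, while ik_max_word also emits the sub-words it recognizes, so its tokens cover the same characters several times. The Python sketch below checks this using the offsets copied from the responses above; overlapping_pairs is a hypothetical helper written for this illustration.

```python
def overlapping_pairs(tokens):
    """Return pairs of tokens whose character ranges overlap.

    Each token is a (text, start_offset, end_offset) tuple.
    """
    pairs = []
    for i, (a, a_start, a_end) in enumerate(tokens):
        for b, b_start, b_end in tokens[i + 1:]:
            if a_start < b_end and b_start < a_end:
                pairs.append((a, b))
    return pairs


# Tokens copied from the two responses above.
bd_basic = [("维修", 0, 2), ("基金", 2, 4)]
ik_max_word = [
    ("维修基金", 0, 4), ("维修", 0, 2), ("维", 0, 1), ("修", 1, 2),
    ("基金", 2, 4), ("基", 2, 3), ("金", 3, 4),
]

print(len(overlapping_pairs(bd_basic)))     # 0
print(len(overlapping_pairs(ik_max_word)))  # 10
```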

    Phrase-granularity comparison

    Comparing bd-nlp-phrase and ik_smart segmentation of “清明节,又称踏青节、行清节、三月节、祭祖节等”:

    • bd-nlp-phrase phrase segmentation
    POST /_analyze
    {
        "text": "清明节,又称踏青节、行清节、三月节、祭祖节等",
        "analyzer": "bd-nlp-phrase"
    }

    Phrase segmentation result:

    {
       "tokens": [
          {
             "token": "清明节",
             "start_offset": 0,
             "end_offset": 3,
             "type": "WORD",
             "position": 0
          },
          {
             "token": "又称",
             "start_offset": 4,
             "end_offset": 6,
             "type": "WORD",
             "position": 2
          },
          {
             "token": "踏青节",
             "start_offset": 6,
             "end_offset": 9,
             "type": "WORD",
             "position": 3
          },
          {
             "token": "行清节",
             "start_offset": 10,
             "end_offset": 13,
             "type": "WORD",
             "position": 5
          },
          {
             "token": "三月节",
             "start_offset": 14,
             "end_offset": 17,
             "type": "WORD",
             "position": 7
          },
          {
             "token": "祭祖",
             "start_offset": 18,
             "end_offset": 20,
             "type": "WORD",
             "position": 9
          },
          {
             "token": "节",
             "start_offset": 20,
             "end_offset": 21,
             "type": "WORD",
             "position": 10
          }
       ]
    }
    • ik_smart smart segmentation
    POST _analyze
    {
        "analyzer": "ik_smart",
        "text": "清明节,又称踏青节、行清节、三月节、祭祖节等"
    }

    Segmentation result:

    {
       "tokens": [
          {
             "token": "清明节",
             "start_offset": 0,
             "end_offset": 3,
             "type": "CN_WORD",
             "position": 0
          },
          {
             "token": "又称",
             "start_offset": 4,
             "end_offset": 6,
             "type": "CN_WORD",
             "position": 1
          },
          {
             "token": "踏青",
             "start_offset": 6,
             "end_offset": 8,
             "type": "CN_WORD",
             "position": 2
          },
          {
             "token": "节",
             "start_offset": 8,
             "end_offset": 9,
             "type": "CN_WORD",
             "position": 3
          },
          {
             "token": "行",
             "start_offset": 10,
             "end_offset": 11,
             "type": "CN_WORD",
             "position": 4
          },
          {
             "token": "清",
             "start_offset": 11,
             "end_offset": 12,
             "type": "CN_CHAR",
             "position": 5
          },
          {
             "token": "节",
             "start_offset": 12,
             "end_offset": 13,
             "type": "CN_WORD",
             "position": 6
          },
          {
             "token": "三月",
             "start_offset": 14,
             "end_offset": 16,
             "type": "CN_WORD",
             "position": 7
          },
          {
             "token": "节",
             "start_offset": 16,
             "end_offset": 17,
             "type": "COUNT",
             "position": 8
          },
          {
             "token": "祭祖",
             "start_offset": 18,
             "end_offset": 20,
             "type": "CN_WORD",
             "position": 9
          },
          {
             "token": "节",
             "start_offset": 20,
             "end_offset": 21,
             "type": "CN_WORD",
             "position": 10
          }
       ]
    }

    Using the Analyze API

    Basic-granularity segmentation

    POST /_analyze
    {
       "analyzer": "bd-nlp-basic",
       "text": "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
    }

    Segmentation result:

    {
      "tokens": [
         {
            "token": "去年",
            "start_offset": 0,
            "end_offset": 2,
            "type": "WORD",
            "position": 0
         },
         {
            "token": "我们",
            "start_offset": 2,
            "end_offset": 4,
            "type": "WORD",
            "position": 1
         },
         {
            "token": "和",
            "start_offset": 4,
            "end_offset": 5,
            "type": "WORD",
            "position": 2
         },
         {
            "token": "他们",
            "start_offset": 5,
            "end_offset": 7,
            "type": "WORD",
            "position": 3
         },
         {
            "token": "展开",
            "start_offset": 7,
            "end_offset": 9,
            "type": "WORD",
            "position": 4
         },
         {
            "token": "炉际",
            "start_offset": 10,
            "end_offset": 12,
            "type": "WORD",
            "position": 6
         },
         {
            "token": "竞赛",
            "start_offset": 12,
            "end_offset": 14,
            "type": "WORD",
            "position": 7
         },
         {
            "token": "第一",
            "start_offset": 15,
            "end_offset": 17,
            "type": "WORD",
            "position": 9
         },
         {
            "token": "回合",
            "start_offset": 17,
            "end_offset": 19,
            "type": "WORD",
            "position": 10
         },
         {
            "token": "赢",
            "start_offset": 19,
            "end_offset": 20,
            "type": "WORD",
            "position": 11
         },
         {
            "token": "第二",
            "start_offset": 22,
            "end_offset": 24,
            "type": "WORD",
            "position": 14
         },
         {
            "token": "回合",
            "start_offset": 24,
            "end_offset": 26,
            "type": "WORD",
            "position": 15
         },
         {
            "token": "和",
            "start_offset": 26,
            "end_offset": 27,
            "type": "WORD",
            "position": 16
         },
         {
            "token": "第三",
            "start_offset": 27,
            "end_offset": 29,
            "type": "WORD",
            "position": 17
         },
         {
            "token": "回合",
            "start_offset": 29,
            "end_offset": 31,
            "type": "WORD",
            "position": 18
         },
         {
            "token": "败",
            "start_offset": 32,
            "end_offset": 33,
            "type": "WORD",
            "position": 20
         },
         {
            "token": "下",
            "start_offset": 33,
            "end_offset": 34,
            "type": "WORD",
            "position": 21
         },
         {
            "token": "阵",
            "start_offset": 34,
            "end_offset": 35,
            "type": "WORD",
            "position": 22
         },
         {
            "token": "来",
            "start_offset": 35,
            "end_offset": 36,
            "type": "WORD",
            "position": 23
         }
      ]
    }
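
    In the response above, the offsets and position values have gaps (for example, 展开 is at position 4 and 炉际 at position 6), which appears to be where the Analyzer dropped stop words and punctuation. The Python sketch below recovers the dropped pieces from the original text. dropped_spans is a hypothetical helper, and the offsets are copied from the response above.

```python
def dropped_spans(text, offsets):
    """Return the pieces of `text` not covered by any (start, end) token offset.

    Assumes offsets are sorted and non-overlapping, as in the response above.
    """
    gaps, cursor = [], 0
    for start, end in offsets:
        if start > cursor:
            gaps.append(text[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(text):
        gaps.append(text[cursor:])
    return gaps


text = "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
# (start_offset, end_offset) pairs copied from the bd-nlp-basic response above.
basic_offsets = [
    (0, 2), (2, 4), (4, 5), (5, 7), (7, 9), (10, 12), (12, 14), (15, 17),
    (17, 19), (19, 20), (22, 24), (24, 26), (26, 27), (27, 29), (29, 31),
    (32, 33), (33, 34), (34, 35), (35, 36),
]
# Prints the stop words and punctuation that produced no token.
print(dropped_spans(text, basic_offsets))
```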

    Phrase-granularity segmentation

    POST /_analyze
    {
       "analyzer": "bd-nlp-phrase",
       "text": "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
    }

    Segmentation result:

    {
      "tokens": [
         {
            "token": "去年",
            "start_offset": 0,
            "end_offset": 2,
            "type": "WORD",
            "position": 0
         },
         {
            "token": "我们",
            "start_offset": 2,
            "end_offset": 4,
            "type": "WORD",
            "position": 1
         },
         {
            "token": "和",
            "start_offset": 4,
            "end_offset": 5,
            "type": "WORD",
            "position": 2
         },
         {
            "token": "他们",
            "start_offset": 5,
            "end_offset": 7,
            "type": "WORD",
            "position": 3
         },
         {
            "token": "展开",
            "start_offset": 7,
            "end_offset": 9,
            "type": "WORD",
            "position": 4
         },
         {
            "token": "炉际竞赛",
            "start_offset": 10,
            "end_offset": 14,
            "type": "WORD",
            "position": 6
         },
         {
            "token": "第一回合",
            "start_offset": 15,
            "end_offset": 19,
            "type": "WORD",
            "position": 8
         },
         {
            "token": "赢",
            "start_offset": 19,
            "end_offset": 20,
            "type": "WORD",
            "position": 9
         },
         {
            "token": "第二回合",
            "start_offset": 22,
            "end_offset": 26,
            "type": "WORD",
            "position": 12
         },
         {
            "token": "和",
            "start_offset": 26,
            "end_offset": 27,
            "type": "WORD",
            "position": 13
         },
         {
            "token": "第三",
            "start_offset": 27,
            "end_offset": 29,
            "type": "WORD",
            "position": 14
         },
         {
            "token": "回合",
            "start_offset": 29,
            "end_offset": 31,
            "type": "WORD",
            "position": 15
         },
         {
            "token": "败",
            "start_offset": 32,
            "end_offset": 33,
            "type": "WORD",
            "position": 17
         },
         {
            "token": "下",
            "start_offset": 33,
            "end_offset": 34,
            "type": "WORD",
            "position": 18
         },
         {
            "token": "阵",
            "start_offset": 34,
            "end_offset": 35,
            "type": "WORD",
            "position": 19
         },
         {
            "token": "来",
            "start_offset": 35,
            "end_offset": 36,
            "type": "WORD",
            "position": 20
         }
      ]
    }
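
    The two responses above illustrate how phrase-granularity tokens combine basic-granularity ones. The Python sketch below maps each phrase token to the basic tokens inside its offset range, using a subset of the tokens from the responses; group_by_phrase is a hypothetical helper. In the phrase response, 第三 and 回合 remain separate tokens, which suggests that merging is decided by the model rather than by a fixed rule.

```python
def group_by_phrase(basic, phrase):
    """Map each phrase token to the basic tokens that lie inside its range.

    Tokens are (text, start_offset, end_offset) tuples.
    """
    groups = {}
    for p_text, p_start, p_end in phrase:
        parts = [t for t, s, e in basic if p_start <= s and e <= p_end]
        groups[(p_text, p_start)] = parts
    return groups


# A subset of the tokens from the two responses above.
basic = [("炉际", 10, 12), ("竞赛", 12, 14), ("第一", 15, 17), ("回合", 17, 19),
         ("第二", 22, 24), ("回合", 24, 26)]
phrase = [("炉际竞赛", 10, 14), ("第一回合", 15, 19), ("第二回合", 22, 26)]

for (p_text, _), parts in group_by_phrase(basic, phrase).items():
    print(p_text, "=", " + ".join(parts))
```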

    Specifying an Analyzer for an Index

    PUT test
    {
       "mappings": {
          "doc": {
             "properties": {
                "k1": {
                   "type": "text",
                   "analyzer": "bd-nlp-basic" // use the basic-granularity model
                },
                "k2": {
                   "type": "text",
                   "analyzer": "bd-nlp-phrase" // use the phrase-granularity model
                }
             }
          }
       },
       "settings": {
          "index": {
             "number_of_shards": "1",
             "number_of_replicas": "0"
          }
       }
    }

    Specifying a Tokenizer for an Index

    PUT /test
    {
        "settings":{
            "analysis":{
                "analyzer":{
                    "my_analyzer":{
                        "tokenizer":"bd-nlp-basic",   // custom analyzer built on the bd-nlp-basic Tokenizer
                        "filter":[
                            "lowercase"               // add the filters your application needs
                        ]
                    }
                }
            }
        },
        "mappings":{
            "properties":{
                "k2":{
                    "type":"text",
                    "analyzer":"my_analyzer"         // apply the custom analyzer to this field
                }
            }
        }
    }
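
    To check what the custom analyzer produces, the index-scoped Analyze API can be called with the analyzer name. The Python sketch below only builds such a request using the standard library. The base URL is a placeholder, authentication is omitted, and analyze_request is a hypothetical helper.

```python
import json
import urllib.request


def analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST <index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# base_url is a placeholder; replace it with your cluster's endpoint.
req = analyze_request("http://localhost:9200", "test", "my_analyzer", "维修基金")
print(req.get_method(), req.full_url)
# To send it against a reachable cluster: urllib.request.urlopen(req).read()
```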

    Precision and Recall

    Test results on a large internal Baidu dataset:

    Model                Precision   Recall   F-score
    analysis-baidu-nlp   98.8%       98.9%    98.8%
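
    The source does not state how the F-score is computed. Assuming it is the usual F1 score, the harmonic mean of precision and recall, the reported figures are consistent with each other, as this short Python check shows:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of the F-score)."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.988, 0.989)
print(f"{f1:.1%}")  # 98.8%
```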