Elasticsearch 2.2.0 Analysis: Chinese Word Segmentation
Source: http://my.oschina.net/secisland/blog/617822
Elasticsearch ships with many built-in analyzers, but none of them handle Chinese particularly well, so a dedicated plugin is needed. Two commonly used choices that give good results are smartcn, based on the Chinese Academy of Sciences' ICTCLAS, and IKAnalyzer. At the moment IKAnalyzer does not yet support the latest Elasticsearch 2.2.0, whereas smartcn is an officially supported plugin that provides an analyzer for Chinese or mixed Chinese-English text and already works with 2.2.0. However, smartcn does not support custom dictionaries, so it is mainly useful for a quick test; the second part of this article shows how to make IKAnalyzer work with the latest version.
smartcn
Install the analyzer: plugin install analysis-smartcn
Uninstall: plugin remove analysis-smartcn
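The plugin command here is the script shipped in the bin directory of the Elasticsearch installation, so run from the installation root the two commands look roughly like this (a sketch; adjust the path to your own layout):
bin/plugin install analysis-smartcn
bin/plugin remove analysis-smartcn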
Test:
Request: POST http://127.0.0.1:9200/_analyze/
{ "analyzer": "smartcn", "text": "聯想是全球最大的筆記本廠商" }
Returned result:
{ "tokens": [ { "token": "聯想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }, { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3 }, { "token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4 }, { "token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5 }, { "token": "筆記本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6 }, { "token": "廠商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7 } ] }
For comparison, let's look at what the standard analyzer produces: in the request above, replace smartcn with standard.
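Concretely, the request body becomes (same endpoint and text as above; only the analyzer name changes):
{ "analyzer": "standard", "text": "聯想是全球最大的筆記本廠商" }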
The returned result:
{ "tokens": [ { "token": "聯", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 }, { "token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 }, { "token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 }, { "token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 }, { "token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 }, { "token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }, { "token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 }, { "token": "筆", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 }, { "token": "記", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 }, { "token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 }, { "token": "廠", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 }, { "token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 } ] }
As you can see, this is essentially unusable for Chinese: every single character becomes its own token.
This article was originally written by secisland (賽克藍德); please credit the author and source when republishing.
Making IKAnalyzer support 2.2.0
At the time of writing, the latest version on GitHub (https://github.com/medcl/elasticsearch-analysis-ik) only supports Elasticsearch 2.1.1, while the latest Elasticsearch is already 2.2.0, so a small modification is needed before it can be used.
1. Download the source code and extract it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0 (see the snippet after this list).
2. Build the code with mvn package.
3. When the build finishes, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Extract that file into the Elasticsearch/plugins directory.
5. Add one line to the Elasticsearch configuration file: index.analysis.analyzer.ik.type : "ik"
6. Restart Elasticsearch.
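The change in step 1 amounts to a one-line edit of the version property in pom.xml, which should end up looking roughly like this (a sketch; the surrounding elements of the real pom.xml are omitted and may differ slightly):
<properties>
    <elasticsearch.version>2.2.0</elasticsearch.version>
</properties>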
Test: the request is the same as above, only with the analyzer replaced by ik.
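In other words, the request body is (same endpoint and text; only the analyzer name changes):
{ "analyzer": "ik", "text": "聯想是全球最大的筆記本廠商" }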
The returned result:
{ "tokens": [ { "token": "聯想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 }, { "token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2 }, { "token": "筆記本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3 }, { "token": "筆記", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4 }, { "token": "筆", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5 }, { "token": "記", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6 }, { "token": "本廠", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 }, { "token": "廠商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8 } ] }
As you can see, the two analyzers produce noticeably different results.
Extending the dictionary: add the words you need to mydict.dic under config\ik\custom, then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
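The dictionary file is simply a plain-text word list with one entry per line; for the example below it would contain something like this (a sketch of mydict.dic, assuming only the one custom word is added):
賽克藍德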
For example, after adding the word 賽克藍德, query again:
Request: POST http://127.0.0.1:9200/_analyze/
Parameters:
{ "analyzer": "ik", "text": "賽克藍德是一家數據安全公司" }
Returned result:
{ "tokens": [ { "token": "賽克藍德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 }, { "token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1 }, { "token": "藍", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2 }, { "token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 }, { "token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 }, { "token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5 }, { "token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6 }, { "token": "數據", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7 }, { "token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8 }, { "token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9 } ] }
As the result shows, 賽克藍德 is now recognised as a single word.
secisland (賽克藍德) will continue to analyse the features of the latest Elasticsearch releases in upcoming posts, so stay tuned. You are also welcome to follow the secisland public account.