HanLP Chinese Word Segmentation Plugin for Solr

Posted by xcxc 9 years ago | 70K views | Tags: HanLP, Chinese word segmentation

Built on HanLP; supports Solr 5.x and is compatible with Lucene 5.x.

Quick Start

Place the two JAR files, hanlp-portable.jar and hanlp-solr-plugin.jar, in ${webapp}/WEB-INF/lib.

Edit the Solr core's configuration file ${core}/conf/schema.xml:

<fieldType name="text_cn" class="solr.TextField">
<analyzer type="index" enableIndexMode="true" class="com.hankcs.lucene.HanLPAnalyzer"/>
<analyzer type="query" enableIndexMode="true" class="com.hankcs.lucene.HanLPAnalyzer"/>
</fieldType>

Usage

When rewriting a query, you can make use of attributes in the HanLPAnalyzer tokenization results, such as the part of speech. For example:

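import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import com.hankcs.lucene.HanLPAnalyzer;

String text = "中華人民共和國很遼闊";
// Print each character with its index so the offsets below are easy to read
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
tokenStream.reset();
while (tokenStream.incrementToken())
{
    // Term text
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // Start and end offsets
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // Position increment relative to the previous token
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // Part of speech
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
}
// Finish and release the stream, as the Lucene TokenStream workflow expects
tokenStream.end();
tokenStream.close();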
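
As a rough sketch of the query-rewriting idea above, the snippet below keeps only the tokens whose part-of-speech tag is on an allow list and joins them into a query string. It deliberately does not depend on Lucene: the `Token` class, the `rewrite` method, and the sample terms and tags are hypothetical stand-ins for values you would read from `CharTermAttribute` and `TypeAttribute`. The tags are illustrative and are not actual HanLPAnalyzer output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of HanLP or the Solr plugin.
final class PosQueryRewriteSketch {

    // Stand-in for one token: the term text plus its type (part-of-speech tag).
    static final class Token {
        final String term;
        final String pos;

        Token(String term, String pos) {
            this.term = term;
            this.pos = pos;
        }
    }

    // Keeps tokens whose tag is in keepTags and joins their terms with " AND ".
    static String rewrite(List<Token> tokens, Set<String> keepTags) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (keepTags.contains(t.pos)) {
                kept.add(t.term);
            }
        }
        return String.join(" AND ", kept);
    }

    public static void main(String[] args) {
        // Hand-written sample data (assumed tags), not real analyzer output.
        List<Token> tokens = Arrays.asList(
                new Token("中華人民共和國", "ns"),
                new Token("很", "d"),
                new Token("遼闊", "a"));
        Set<String> keep = new HashSet<>(Arrays.asList("ns", "a"));
        System.out.println(rewrite(tokens, keep));
    }
}
```

In a real rewrite step the token list would be filled inside the `incrementToken()` loop, and the allow list would depend on which parts of speech matter for your queries.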

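In other scenarios, you can construct a HanLPTokenizer from a custom segmenter (for example, one with named entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter):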

// Build a tokenizer from a customized segmenter
Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
                                    .enableJapaneseNameRecognize(true)
                                    .enableIndexMode(true), null, false);
tokenizer.setReader(new StringReader("林志玲亮相網友:確定不是波多野結衣？"));

Advanced Configuration

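The HanLP segmenter is configured mainly through the hanlp.properties file on the class path. See the HanLP natural language processing package documentation for more on the related settings, such as:

Stop words
User dictionaries
Part-of-speech tagging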
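
To make the configuration step a little more concrete, here is a minimal sketch that reads dictionary-related entries from a properties file using only `java.util.Properties`. The key names (`CoreStopWordDictionaryPath`, `CustomDictionaryPath`, `ShowTermNature`), the `;` separator, and the sample values are assumptions written from memory of HanLP's sample hanlp.properties, so check them against the file shipped with your HanLP version. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; HanLP reads hanlp.properties itself.
final class HanLPPropertiesSketch {

    // Loads key=value pairs from any Reader (a file, a classpath resource, a string).
    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a ';'-separated value into trimmed, non-empty entries.
    static List<String> splitPaths(String value) {
        List<String> paths = new ArrayList<>();
        if (value == null) {
            return paths;
        }
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Assumed key names and made-up paths; adjust to your own hanlp.properties.
        String sample =
                "CoreStopWordDictionaryPath=data/dictionary/stopwords.txt\n"
              + "CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; my_words.txt\n"
              + "ShowTermNature=true\n";
        Properties props = load(new StringReader(sample));
        System.out.println("Stop words: " + props.getProperty("CoreStopWordDictionaryPath"));
        System.out.println("User dictionaries: " + splitPaths(props.getProperty("CustomDictionaryPath")));
        System.out.println("Show part of speech: " + props.getProperty("ShowTermNature"));
    }
}
```

The sketch parses an in-memory string so it runs without any files. To inspect the file your deployment actually uses, you could pass a reader over `getResourceAsStream("hanlp.properties")` from the class loader instead.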

Project home page: http://www.baiduhome.net/lib/view/home/1440338627749

Article URL: http://www.baiduhome.net/lib/view/open1440338627749.html
