The Definition Similarity Rule in the superword Open-Source Project

Posted by nyyb 10 years ago | 8K views | superword

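Two words can be related as synonyms, antonyms, near-synonyms (how near?), related terms (how related?), and so on. How do we determine the relationship between two words? Can a computer find it automatically? It can, and it can also quantify how near or how related the two words are.

This article describes the definition similarity rule in the superword open-source project, which uses the definitions of words to compute how similar the words are to each other. The definitions come from the Webster dictionary; the Oxford dictionary is also supported. The similarity is computed with one of the 10 similarity algorithms provided by the word segmentation project (word分詞).

The definition similarity rule consists of the following 6 steps:

1. Get the definition of the word to compare: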

String wordDefinition = MySQLUtils.getWordDefinition(word, WordLinker.Dictionary.WEBSTER.name());

2. Get the graded vocabulary (the original post links to a page that describes how the graded vocabulary is defined):

Set<Word> words = (Set<Word>)application.getAttribute("words_"+request.getAttribute("words_type"));

3. Get the definitions of the words in the graded vocabulary (the original post links to the code):

List<String> allWordDefinition = MySQLUtils.getAllWordDefinition(WordLinker.Dictionary.WEBSTER.name(), words);

4. Pick any one of the 10 similarity algorithms provided by word分詞, and tell it to use the pure-English segmenter that word分詞 also provides:

TextSimilarity textSimilarity = new CosineTextSimilarity();
textSimilarity.setSegmentationAlgorithm(SegmentationAlgorithm.PureEnglish);
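
To give a feel for what a cosine-based text similarity measures, here is a minimal, self-contained sketch. It is not the implementation used by word分詞: the class name CosineSketch, the helper names, and the tokenization (lowercasing and splitting on non-letters) are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CosineSketch {
    // Count how often each lowercase token occurs.
    // Splitting on non-letters is a simplification, not the library's segmenter.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> tf) {
        double sum = 0;
        for (int c : tf.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two term-frequency vectors:
    // 0 when no token is shared, close to 1 when the texts use the same tokens.
    static double cosine(String a, String b) {
        Map<String, Integer> ta = termFrequencies(a);
        Map<String, Integer> tb = termFrequencies(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            dot += e.getValue() * tb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(ta);
        double nb = norm(tb);
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        System.out.println(cosine("a large body of water", "a small body of water"));
    }
}
```

In the example in main, the two definitions share four of their five tokens, so the score comes out at roughly 0.8. Two definitions with no words in common score 0.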

5. Compute the similarity and return the 100 most similar words:

int count = 100;
Hits result = textSimilarity.rank(wordDefinition, allWordDefinition, count);
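
Conceptually, a ranking step like this scores every candidate definition against the target definition and keeps the highest-scoring ones. The sketch below illustrates only that idea. It does not reproduce the library's rank method or its Hit and Hits classes: Scored and topN are hypothetical names, and the scores are assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TopNSketch {
    // Hypothetical stand-in for a scored candidate; not the library's Hit class.
    static final class Scored {
        final String text;
        final double score;

        Scored(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    // Sort a copy of the candidates by descending score and keep at most count of them.
    static List<Scored> topN(List<Scored> candidates, int count) {
        List<Scored> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        int limit = Math.min(Math.max(count, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Scored> candidates = Arrays.asList(
                new Scored("a", 0.2),
                new Scored("b", 0.9),
                new Scored("c", 0.5));
        for (Scored s : topN(candidates, 2)) {
            System.out.println(s.text + " " + s.score);
        }
    }
}
```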

6. Output the results:

StringBuilder temp = new StringBuilder();
int i=1;
temp.append("<table border=\"1\">\n");
for(Hit hit : result.getHits()){
    // The hit text is treated as "word_definition"; the definition may itself contain "_"
    String[] attrs = hit.getText().split("_");
    String w = attrs[0];
    StringBuilder definition = new StringBuilder(attrs[1]);
    for(int j=2; j<attrs.length; j++){
        // Put the separator back before each remaining part to restore the definition
        definition.append("_").append(attrs[j]);
    }
    temp.append("<tr>");
    temp.append("<td> ").append(i++)
            .append(". </td><td> ")
            .append(WordLinker.toLink(w))
            .append(" </td><td> ")
            .append(definition)
            .append(" </td><td> ")
            .append(hit.getScore())
            .append("</td><td> ")
            // Link back to this page for the matched word w, not the whole "word_definition" text
            .append("<a target=\"_blank\" href=\"definition-similar-rule.jsp?word=" + w + "&count=" + count + "&words_type=" + request.getAttribute("words_type") + "\">Similar</a>")
            .append(" </td>\n");
    temp.append("</tr>\n");
}
temp.append("</table>\n");
htmlFragment = temp.toString();
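
The output code separates each hit's text into a word and its definition at the first underscore. As a more compact alternative, String.split with a limit of 2 does the same without reassembling the pieces. The sketch below assumes the text has the form "word_definition", as the code above implies; HitTextSketch and splitWordAndDefinition are hypothetical names, not part of superword.

```java
public class HitTextSketch {
    // Split "word_definition" at the first underscore only,
    // so underscores inside the definition are left untouched.
    static String[] splitWordAndDefinition(String hitText) {
        String[] parts = hitText.split("_", 2);
        String word = parts[0];
        // Fall back to an empty definition when there is no underscore at all.
        String definition = parts.length > 1 ? parts[1] : "";
        return new String[] { word, definition };
    }

    public static void main(String[] args) {
        String[] parsed = splitWordAndDefinition("sea_a large body of salt_water");
        System.out.println(parsed[0]); // sea
        System.out.println(parsed[1]); // a large body of salt_water
    }
}
```

Unlike indexing attrs[1] directly, this variant also avoids an ArrayIndexOutOfBoundsException when a text contains no underscore.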

The results are shown in the figure below:

1. Using the definitions from the Webster dictionary

[Figure: the definition similarity rule in the superword open-source project]