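Counting Chinese Word Frequencies with Pig + Ansj
Lately I've grown really fond of Pig. Its built-in functions cover most needs, it supports user-defined functions (UDFs), and it can load plain text, Avro, and other data formats. You can use illustrate to inspect the result of each step Pig executes and describe to see the schema of an alias. And it runs MapReduce jobs from lightweight scripts, which is a joy.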
1. Word Count
A = load '/user/.*/req-temp/text.txt' as (text:chararray);
B = foreach A generate flatten(TOKENIZE(text)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
The implementation of TOKENIZE.java is excerpted below. The abstract class EvalFunc<T> is used to implement transformations on data fields; its exec() method is called by Pig at run time.
public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        ...
        DataBag output = mBagFactory.newDefaultBag();
        ...
        String delim = " \",()*";
        ...
        StringTokenizer tok = new StringTokenizer((String)o, delim, false);
        while (tok.hasMoreTokens()) {
            output.add(mTupleFactory.newTuple(tok.nextToken()));
        }
        return output;
        ...
    }
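}
The TOKENIZE class extends the abstract class EvalFunc<T> and uses StringTokenizer to split English text into tokens, returning a DataBag. That is why the Pig script has to unnest the bag with flatten before it can count individual words.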
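To make the tokenize, flatten, group, and count chain concrete, here is a minimal plain-Java sketch that runs without Pig. The class name TokenizeSketch and its methods are made up for illustration; only the delimiter string and the StringTokenizer loop come from the TOKENIZE excerpt above. A List<String> stands in for the DataBag, and a map stands in for group plus COUNT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the Pig pipeline; not part of Pig itself.
public class TokenizeSketch {
    // Same delimiters as the TOKENIZE excerpt: space, double quote, comma, parentheses, asterisk.
    static final String DELIM = " \",()*";

    // One input line -> a list of tokens (plays the role of the DataBag).
    static List<String> tokenize(String text) {
        List<String> bag = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(text, DELIM, false);
        while (tok.hasMoreTokens()) {
            bag.add(tok.nextToken());
        }
        return bag;
    }

    // Unnest the per-line bags into single words, then count each word.
    static Map<String, Long> wordCount(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {    // "flatten": one record per word
                counts.merge(word, 1L, Long::sum);  // "group by word" + COUNT
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be, or not to be", "(to) \"be\"")));
        // Chinese text contains none of the ASCII delimiters (the comma here is
        // full-width), so the whole sentence comes back as a single token.
        System.out.println(tokenize("樹好家風,嚴管才是厚愛。"));
    }
}
```

The second call shows why a delimiter-based tokenizer is not enough for Chinese: with no spaces between words, a real word segmenter is needed.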
2. Chinese Word Segmentation with Ansj
To write a Pig UDF, add the following Maven dependencies:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>${pig.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg-all-in-one</artifactId>
    <version>3.0</version>
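</dependency>
Run hadoop version to get the Hadoop version and pig -i to get the Pig version. Make sure the Pig version in the dependency matches the one deployed on the cluster; otherwise you will get an error like: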
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D
Following the same pattern, modifying TOKENIZE.java gives the Chinese segmentation UDF, Segment.java:
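package com.pig.udf;

import java.io.IOException;
import java.util.List;

// Ansj package names as used in the older ansj_seg releases; an assumption, check against your version.
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.ansj.util.FilterModifWord;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
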
public class Segment extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            if (input == null)
                return null;
            if (input.size() == 0)
                return null;
            Object o = input.get(0);
            if (o == null)
                return null;
            DataBag output = mBagFactory.newDefaultBag();
            if (!(o instanceof String)) {
                int errCode = 2114;
                String msg = "Expected input to be chararray, but" +
                        " got " + o.getClass().getName();
                throw new ExecException(msg, errCode, PigException.BUG);
            }
            // filter punctuation
            FilterModifWord.insertStopNatures("w");
            List<Term> words = ToAnalysis.parse((String) o);
            words = FilterModifWord.modifResult(words);
            for (Term word : words) {
                output.add(mTupleFactory.newTuple(word.getName()));
            }
            return output;
        } catch (ExecException ee) {
            throw ee;
        }
    }

    @SuppressWarnings("deprecation")
    @Override
    public Schema outputSchema(Schema input) {
        ...
    }
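...
Ansj supports stop words defined by part of speech: FilterModifWord.insertStopNatures("w"); drops the terms tagged as punctuation. Package the Java code into a jar, put it on HDFS, and then register the jar to call the function: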
REGISTER hdfs:///user/.*/piglib/udf-0.0.1-SNAPSHOT-jar-with-dependencies.jar;
A = load '/user/.*/req-temp/renmin.txt' as (text:chararray);
B = foreach A generate flatten(com.pig.udf.Segment(text)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
Here is an excerpt from a People's Daily editorial:
樹好家風,嚴管才是厚愛。古人說:“居官所以不能清白者,率由家人喜奢好侈使然也。”要看到,好的家風,能系好人生的“第一粒扣子”。“修身、齊家”,才能“治國、平天下”,領導干部首先要“正好家風、管好家人、處好家事”,才能看好“后院”、堵住“后門”。“父母之愛子,則為之計深遠”,與其冒著風險給子女留下大筆錢財,不如給子女留下好家風、好作風,那才是讓子女受益無窮的東西,才是真正的“為之計深遠”。
The resulting word frequencies are:
(3,能)
(2,要)
(2,計)
(1,讓)
(1,說)
(1,那)
(2,風)
(1,不如)
(1,不能)
(1,與其)
(1,東西)
(1,人生)
(1,作風)
(1,使然)
(1,修身)
(1,厚愛)
(1,受益)
(1,古人)
(1,后門)
(1,后院)
Without a user-defined dictionary loaded, Ansj's segmentation quality is not ideal.
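The (2,風) row above suggests that 家風 was split into single characters rather than kept as one word. To illustrate why a dictionary entry can change that, here is a toy forward-maximum-matching segmenter in plain Java. This is not Ansj's algorithm and does not use the Ansj API; the class DictionarySketch, its segment method, and both dictionaries are hypothetical and only show the general effect of adding a word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy segmenter for illustration only; unrelated to Ansj's real implementation.
public class DictionarySketch {
    // Forward maximum matching: at each position take the longest dictionary
    // word (up to maxLen characters), falling back to a single character.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fallback: emit one character
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                if (dict.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> base = new HashSet<>(Arrays.asList("作風"));
        Set<String> custom = new HashSet<>(base);
        custom.add("家風"); // hypothetical user-defined entry

        System.out.println(segment("好家風好作風", base, 4));   // 家風 falls apart into 家 and 風
        System.out.println(segment("好家風好作風", custom, 4)); // 家風 is kept as one word
    }
}
```

With a real segmenter the mechanics differ, but the idea is the same: domain words that are missing from the default dictionary need to be supplied by the user.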