Crunch: a Hadoop-based ETL extraction tool written in Go
A fast-to-develop, fast-to-run, Go-based toolkit for building ETL and feature extraction on Hadoop.
Quick start
Crunch is optimized to be a big-bang-for-the-buck library, yet almost every aspect is extensible.
Let's say you have a log of semi-structured and deeply nested JSON. Each line contains a record.
You would like to:
- Parse JSON records
- Extract fields
- Clean up/process fields
- Extract features - run custom code on field values and output the result as new field(s)
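The four steps above can be sketched with the Go standard library alone, to show what a single record goes through. This is an illustration and not Crunch's implementation: `lookup`, `toRow`, the dotted-path syntax, and the `is_ipv4` feature are all hypothetical names chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// lookup walks a dotted path such as "action.source" through nested JSON
// objects. It returns def when a segment is missing or the leaf is not a
// scalar. (Hypothetical helper, not part of Crunch.)
func lookup(rec map[string]interface{}, path, def string) string {
	var cur interface{} = rec
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return def
		}
		cur, ok = obj[key]
		if !ok {
			return def
		}
	}
	switch v := cur.(type) {
	case string:
		return v
	case float64:
		return strconv.FormatFloat(v, 'f', -1, 64)
	case bool:
		return strconv.FormatBool(v)
	default:
		return def
	}
}

// toRow runs the four steps on one JSON line and returns one TSV line.
func toRow(line string) (string, error) {
	// 1. Parse the JSON record.
	var rec map[string]interface{}
	if err := json.Unmarshal([]byte(line), &rec); err != nil {
		return "", err
	}
	// 2. Extract fields, falling back to a default when a path is absent.
	ip := lookup(rec, "head.x-forwarded-for", "0.0.0.0")
	source := lookup(rec, "action.source", "")
	// 3. Clean up: trim whitespace and normalize case.
	source = strings.ToLower(strings.TrimSpace(source))
	// 4. Feature: run custom code on a field value and emit a new column.
	isIPv4 := strconv.FormatBool(net.ParseIP(ip).To4() != nil)
	return strings.Join([]string{ip, source, isIPv4}, "\t"), nil
}

func main() {
	row, err := toRow(`{"head":{"x-forwarded-for":"10.1.2.3"},"action":{"source":" Web "}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(row)
}
```

Keeping extraction (`lookup`) separate from the per-record function makes it easy to add columns without touching the parsing code, which is roughly the split a row description gives you.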
Here is a detailed view:
    // Describe your row
    transform := crunch.NewTransformer()
    row := crunch.NewRow()

    // Use "field_name type". Types are Hive types.
    row.FieldWithValue("ev_smp int", "1.0")

    // If no type given, assume 'string'
    row.FieldWithDefault("ip", "0.0.0.0", makeQuery("head.x-forwarded-for"), transform.AsIs)
    row.FieldWithDefault("ev_ts", "", makeQuery("action.timestamp"), transform.AsIs)
    row.FieldWithDefault("ev_source", "", makeQuery("action.source"), transform.AsIs)

    row.Feature("doing ip to location", []string{"country", "city"},
        func(r crunch.DataReader, row *crunch.Row) []string {
            // call your "standard" Go code for doing ip2location
            return ip2location(row["ip"])
        })

    // By default, will build a hadoop-compatible streamer process that understands json: (stdin[JSON] to stdout[TSV])
    // Also will plug-in Crunch's CLI utility functions (use -help)
    crunch.ProcessJson(row)