Crunch: a Hadoop-based ETL and feature-extraction tool written in Go


Develop fast, run fast: a Go toolkit for building Hadoop-based ETL and feature-extraction tools.

Quick start

Crunch is optimized to be a big-bang-for-the-buck library, yet almost every aspect of it is extensible.

Let's say you have a log of semi-structured and deeply nested JSON. Each line contains a record.

You would like to:

  1. Parse JSON records
  2. Extract fields
  3. Cleanup/process fields
  4. Extract features - run custom code on field values and output the result as new field(s)
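
For example, a single log line might look roughly like this (a made-up record, shaped only to match the field queries used in the detailed view below):

{"head": {"x-forwarded-for": "203.0.113.7"}, "action": {"timestamp": "2014-11-20T06:38:45Z", "source": "web"}}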


So here is a detailed view:

// Describe your row
transform := crunch.NewTransformer()
row := crunch.NewRow()
// Use "field_name type". Types are Hive types.
row.FieldWithValue("ev_smp int", "1.0")
// If no type given, assume 'string'
row.FieldWithDefault("ip", "0.0.0.0", makeQuery("head.x-forwarded-for"), transform.AsIs)
row.FieldWithDefault("ev_ts", "", makeQuery("action.timestamp"), transform.AsIs)
row.FieldWithDefault("ev_source", "", makeQuery("action.source"), transform.AsIs)
row.Feature("doing ip to location", []string{"country", "city"},
  func(r crunch.DataReader, row *crunch.Row) []string {
    // call your "standard" Go code for doing ip2location
    return ip2location(row["ip"])
  })

// By default, this builds a Hadoop-compatible streamer process that understands JSON (stdin[JSON] to stdout[TSV]).
// It also plugs in Crunch's CLI utility functions (use -help).
crunch.ProcessJson(row)
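
The snippet assumes two helpers defined in your own code: makeQuery, which builds whatever query value Crunch's FieldWithDefault expects for pulling a value out of the nested JSON, and ip2location, which turns an IP address into the country and city values declared by the Feature call. A minimal, hypothetical sketch of the latter (a stubbed lookup table standing in for a real GeoIP library) might look like this:

// ip2location resolves an IP address to a country and a city.
// Hypothetical stub: in real use you would call a GeoIP lookup library here.
// The returned slice must line up with the output fields declared in
// row.Feature above ("country", then "city").
func ip2location(ip string) []string {
  // Example table only; replace with a real lookup.
  known := map[string][]string{
    "203.0.113.7": {"US", "San Francisco"},
  }
  if loc, ok := known[ip]; ok {
    return loc
  }
  return []string{"unknown", "unknown"}
}

Since the process generated by crunch.ProcessJson reads JSON lines on stdin and writes TSV on stdout, it can be tested locally by piping a few log lines through the built binary before deploying it as a Hadoop Streaming mapper.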


 

Project homepage: http://www.baiduhome.net/lib/view/home/1416465525055
