Hadoopy: A Python Wrapper for Hadoop, Implemented in Cython

Posted by jopen 13 years ago | 25K views | Hadoop, distributed computing / cloud computing / big data

Hadoopy is a Python wrapper for Hadoop Streaming, written in Cython. It is simple, fast, and easy to modify, and it has been tested on clusters of more than 700 nodes. Hadoopy's goals are:

Similar interface as the Hadoop API (design patterns usable between Python/Java interfaces)
General compatibility with dumbo to allow users to switch back and forth
Usable on Hadoop clusters without Python or admin access
Fast conversion and processing
Stay small and well documented
Be transparent with what is going on
Handle programs with complicated .so’s, ctypes, and extensions
Code written for hack-ability
Simple HDFS access (e.g., reading, writing, ls)
Support (and not replicate) the greater Hadoop ecosystem (e.g., Oozie, Whirr)

Killer features (what is unique to Hadoopy):

Automated job parallelization ‘auto-oozie’ available in the hadoopy flow project (maintained out of branch)
Local execution of unmodified MapReduce job with launch_local
Read/write sequence files of TypedBytes directly to HDFS from Python (readtb, writetb)
Allows printing to stdout and stderr in Hadoop tasks without causing problems (uses the ‘pipe hopping’ technique, both are available in the task’s stderr)
Works on clusters without any extra installation, Python, or any Python libraries (uses PyInstaller, which is included in this source tree)
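
To make the idea of running a MapReduce job locally more concrete, here is a minimal sketch in plain Python. It does not use Hadoopy and is not how launch_local is implemented; the helper name run_local and the sample input are hypothetical, chosen only for illustration. The mapper and reducer are written as generator functions taking (key, value) and (key, values), which is the general shape such functions take in Python MapReduce wrappers.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def run_local(kvs, mapper, reducer):
    """Hypothetical helper: map, sort by key (the 'shuffle'), then reduce."""
    # Map phase: apply the mapper to every input pair.
    mapped = [out for k, v in kvs for out in mapper(k, v)]
    # Shuffle phase: sorting brings equal keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (v for _, v in group)))
    return results


# Assumed sample input: (line offset, line text) pairs.
lines = [(0, "a b a"), (1, "b c")]
word_counts = run_local(lines, mapper, reducer)
print(word_counts)
```

Because the mapper and reducer are ordinary functions, the same pair can be exercised in a unit test without a cluster, which is the practical appeal of local execution.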

Additional features:

Works on OS X
Critical path is in Cython
Simple HDFS access (readtb and ls) inside Python, even inside running jobs
Unit test interface
Reporting using status and counters (and print statements! no need to be scared of them in Hadoopy)
Supports design patterns in the Lin & Dyer book
Typedbytes support (very fast)
Oozie support
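
For readers unfamiliar with how status and counter reporting works under Hadoop Streaming, the following is a small sketch of the underlying convention: a task writes specially formatted lines to stderr, and the framework interprets them. The function names report_counter and report_status are hypothetical and are not Hadoopy's API; the source does not show Hadoopy's own helpers, so treat this only as an illustration of the protocol they would sit on top of.

```python
import io
import sys


def report_counter(group, name, amount=1, stream=None):
    """Write a Hadoop Streaming counter update line (hypothetical helper)."""
    stream = sys.stderr if stream is None else stream
    stream.write("reporter:counter:%s,%s,%d\n" % (group, name, amount))
    stream.flush()


def report_status(message, stream=None):
    """Write a Hadoop Streaming status update line (hypothetical helper)."""
    stream = sys.stderr if stream is None else stream
    stream.write("reporter:status:%s\n" % message)
    stream.flush()


# Capture the lines in memory here so they can be inspected;
# inside a real task they would go to sys.stderr.
buf = io.StringIO()
report_counter("words", "seen", 3, stream=buf)
report_status("halfway done", stream=buf)
reported = buf.getvalue()
```

Since these lines share stderr with ordinary diagnostic output, anything else a task prints there must not start with the `reporter:` prefix unless it is meant as a report.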
