Distributed Stream Processing Framework: Apache Samza

Posted by jopen 12 years ago

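Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. It is built for processing real-time data and closely resembles Storm, Twitter's stream processing system. It has the following features: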

Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple call-back based "process message" API that should be familiar to anyone who's used Map/Reduce.
Managed state: Samza manages snapshotting and restoration of a stream processor's state. Samza will restore a stream processor's state to a snapshot consistent with the processor's last read messages when the processor is restarted.
Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
Processor isolation: Samza works with Apache YARN, which supports processor security through Hadoop's security model, and resource isolation through Linux CGroups.
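
To make the "Simple API" and "Managed state" points more concrete, here is a minimal Python sketch of a callback-based task whose state is snapshotted and then restored after a simulated failure. It does not use Samza's actual Java API: `Envelope`, `CountingTask`, `run`, and the in-memory `log` list are hypothetical stand-ins for a message, a stream task, the container loop, and a single replayable Kafka partition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Envelope:
    """One message from a partition (hypothetical, not Samza's own class)."""
    partition: int
    offset: int
    key: str
    value: int


class CountingTask:
    """Toy stream task that keeps a running sum per key."""

    def __init__(self):
        self.state = defaultdict(int)  # local state, keyed by message key
        self.last_offset = -1          # offset of the last processed message

    def process(self, envelope, emit):
        # Callback invoked once per message, in partition order.
        self.state[envelope.key] += envelope.value
        self.last_offset = envelope.offset
        emit((envelope.key, self.state[envelope.key]))

    def snapshot(self):
        # Copy the state together with the offset it is consistent with.
        return {"state": dict(self.state), "offset": self.last_offset}

    @classmethod
    def restore(cls, snap):
        # Rebuild a task from a snapshot taken earlier.
        task = cls()
        task.state.update(snap["state"])
        task.last_offset = snap["offset"]
        return task


def run(task, log, emit):
    """Feed the task every message after its last processed offset."""
    for envelope in log:
        if envelope.offset > task.last_offset:
            task.process(envelope, emit)


# A single ordered, replayable partition (assumed in-memory stand-in).
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
log = [Envelope(0, i, k, v) for i, (k, v) in enumerate(pairs)]

out = []
task = CountingTask()
run(task, log[:2], out.append)     # process the first two messages
snap = task.snapshot()

task = CountingTask.restore(snap)  # simulate a restart after a failure
run(task, log, out.append)         # replay; already-seen offsets are skipped
```

Because the snapshot records the offset alongside the state, replaying the whole partition after the restart picks up exactly where the task left off, without double-counting the first two messages. In a real deployment the framework, not user code, would be expected to handle the snapshotting and replay.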