Apache Flink: An Efficient, Distributed, General-Purpose Data Processing Platform

Posted by jopen 11 years ago | 42K views | Apache Flink

Apache Flink is an efficient, distributed, general-purpose data processing platform.

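Apache Flink is an open source system for declarative data analysis. It combines the efficiency, programming flexibility, and scalability of distributed MapReduce-like platforms with the query optimization techniques found in parallel databases.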

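// Read the input file as a data set of text lines.
DataSet<String> input = env.readTextFile(inputPath);

input.flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
   // Split each line on spaces and emit a (word, 1) pair per token.
   public void flatMap(String value, Collector<Tuple2<String, Long>> out) {
       for (String s : value.split(" ")) {
           out.collect(new Tuple2<String, Long>(s, 1L));
       }
   }
})
.groupBy(0)   // group by the word (tuple field 0)
.sum(1)       // sum the counts (tuple field 1)
.writeAsText(outputPath);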
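
To make clear what the word-count program above computes, here is a minimal single-machine sketch in plain Java with no Flink dependency. The class name WordCountSketch and the method countWords are hypothetical names chosen for illustration; they are not part of the Flink API. The nested loops play the role of flatMap, and the map merge plays the role of groupBy(0) followed by sum(1).

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mirrors flatMap -> groupBy(0) -> sum(1)
// on a single machine, without using Flink.
public class WordCountSketch {

    public static Map<String, Long> countWords(Iterable<String> lines) {
        // TreeMap keeps the words sorted so the printed result is stable.
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Same tokenization as the snippet above: split on single spaces.
            for (String word : line.split(" ")) {
                // Equivalent to emitting (word, 1) and summing per word.
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(Arrays.asList("to be or not to be")));
    }
}
```

The Flink version expresses the same logic declaratively, which leaves the system free to decide how to distribute the work.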

System Stack

The Apache Flink stack consists of:

Programming APIs for different languages (Java, Scala) and paradigms (record-oriented, graph-oriented).
A program optimizer that decides how to execute the program for good performance. Among other things, it chooses data movement and caching strategies.
A distributed runtime that executes programs in parallel across many machines.
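
As a rough illustration of what "data movement" and "parallel execution" mean for a grouped aggregation, the sketch below hash-partitions words so that all occurrences of a word land in the same partition, then counts each partition in its own task. This is not Flink's implementation: the class PartitionedCountSketch and its methods are hypothetical, and threads in a single JVM stand in for machines in a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: threads stand in for cluster machines.
public class PartitionedCountSketch {

    // Data movement: route each word to a partition chosen by its hash,
    // so equal words always end up in the same partition.
    static List<List<String>> partitionByKey(List<String> words, int parallelism) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String w : words) {
            partitions.get(Math.floorMod(w.hashCode(), parallelism)).add(w);
        }
        return partitions;
    }

    // Parallel execution: one task per partition. Keys never overlap between
    // partitions, so the partial results can be combined with putAll.
    public static Map<String, Long> count(List<String> words, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Map<String, Long>>> futures = new ArrayList<>();
            for (List<String> part : partitionByKey(words, parallelism)) {
                futures.add(pool.submit(() -> {
                    Map<String, Long> local = new HashMap<>();
                    for (String w : part) {
                        local.merge(w, 1L, Long::sum);
                    }
                    return local;
                }));
            }
            Map<String, Long> result = new HashMap<>();
            for (Future<Map<String, Long>> f : futures) {
                result.putAll(f.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(count(words, 3));
    }
}
```

In a real cluster the partitioning step moves records over the network, which is why the choice of strategy is left to an optimizer rather than hard-coded as it is here.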

Flink runs independently from Hadoop, but integrates seamlessly with YARN (Hadoop's next-generation scheduler). Various file systems (including the Hadoop Distributed File System) can act as data sources.