Distributed Computing System Spark Releases Version 1.0.0
Spark is an open-source cluster computing system based on in-memory computation, designed to make data analysis faster. Spark is remarkably compact: it was developed by a small team at UC Berkeley's AMP Lab led by Matei. It is written in Scala, and the core of the project consists of only 63 Scala files, making it short and concise.

Spark is an open-source cluster computing environment similar to Hadoop, but there are useful differences between the two that make Spark superior for certain workloads: Spark enables in-memory distributed datasets, which allow it to optimize iterative workloads in addition to providing interactive queries.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so that Scala can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is in fact complementary to Hadoop and can run in parallel on the Hadoop file system; this behavior is supported through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analysis applications.
Spark Cluster Computing Architecture
Although Spark resembles Hadoop, it provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of cluster computing workload: one that reuses a working dataset across parallel operations, such as machine learning algorithms. To optimize these workloads, Spark introduces the concept of in-memory cluster computing, in which datasets can be cached in memory to reduce access latency.
Spark also introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects distributed across a set of nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt. Rebuilding part of a dataset relies on a fault-tolerance mechanism that maintains "lineage", i.e. the information needed to reconstruct a portion of the dataset from the process by which it was derived. An RDD is represented as a Scala object and can be created from a file; as a parallelized collection sliced across nodes; as a transformation of another RDD; or by changing the persistence of an existing RDD, for example by requesting that it be cached in memory.
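As a minimal sketch of those creation paths (assuming an existing SparkContext named sc, as provided by spark-shell, and a placeholder input path), the following Scala snippet builds an RDD from a file and from a parallelized collection, derives a transformed RDD, and requests in-memory caching:

```scala
// Minimal sketch of the RDD creation paths described above.
// Assumes a SparkContext `sc` and a hypothetical input file path.
val lines = sc.textFile("hdfs:///data/input.txt")   // RDD created from a file
val nums  = sc.parallelize(1 to 1000)                // parallelized collection sliced across nodes
val errs  = lines.filter(_.contains("ERROR"))        // transformation of another RDD
errs.cache()                                         // change persistence: request caching in memory
println(errs.count())                                // action that triggers the computation
```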
Applications in Spark are called driver programs, and they implement operations that run on a single node or in parallel across a set of nodes. Like Hadoop, Spark supports single-node and multi-node clusters. For multi-node operation, Spark relies on the Mesos cluster manager. Mesos provides an effective platform for resource sharing and isolation among distributed applications, and this setup allows Spark and Hadoop to coexist in a shared pool of nodes.
Apache Spark 1.0 has been released. This is a major release that brings a large number of new features and strong API compatibility guarantees. It adds a major new component, Spark SQL, for working with structured data in Spark; it also enhances Java and Python language support, adds full support for the Hadoop/YARN security model, and provides a unified submission process across cluster managers.
Downloads: source package (5 MB tgz) and prebuilt packages for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Other downloads are available at the Apache download site.
Spark Release 1.0.0
Spark 1.0.0 is a major release marking the start of the 1.X line. This release brings both a variety of new features and strong API compatibility guarantees throughout the 1.X line. Spark 1.0 adds a new major component, Spark SQL, for loading and manipulating structured data in Spark. It includes major extensions to all of Spark’s existing standard libraries (ML, Streaming, and GraphX) while also enhancing language support in Java and Python. Finally, Spark 1.0 brings operational improvements including full support for the Hadoop/YARN security model and a unified submission process for all supported cluster managers.
You can download Spark 1.0.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.
API Stability
Spark 1.0.0 is the first release in the 1.X major line. Spark is guaranteeing stability of its core API for all 1.X releases. Historically Spark has already been very conservative with API changes, but this guarantee codifies our commitment to application writers. The project has also clearly annotated experimental, alpha, and developer APIs to provide guidance on future API changes of newer components.
Integration with YARN Security
For users running in secured Hadoop environments, Spark now integrates with the Hadoop/YARN security model. Spark will authenticate job submission, securely transfer HDFS credentials, and authenticate communication between components.
Operational and Packaging Improvements
This release significantly simplifies the process of bundling and submitting a Spark application. A new spark-submit tool allows users to submit an application to any Spark cluster, including local clusters, Mesos, or YARN, through a common process. The documentation for bundling Spark applications has been substantially expanded. We’ve also added a history server for Spark’s web UI, allowing users to view Spark application data after individual applications are finished.
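As a hedged illustration of this unified submission flow (the class name, jar name, and master values below are placeholders, not prescribed by the release notes), a small driver program packaged as a jar could be launched with the same spark-submit command shape against local, Mesos, or YARN clusters:

```scala
// Hypothetical driver program, packaged into my-app.jar and launched with the
// new spark-submit tool; the same command shape targets different cluster managers, e.g.:
//   spark-submit --class example.SimpleApp --master local[4] my-app.jar
//   spark-submit --class example.SimpleApp --master yarn-cluster my-app.jar
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc   = new SparkContext(conf)
    val sum  = sc.parallelize(1 to 100).reduce(_ + _)  // trivial distributed computation
    println(s"sum = $sum")
    sc.stop()
  }
}
```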
Spark SQL
This release introduces Spark SQL as a new alpha component. Spark SQL provides support for loading and manipulating structured data in Spark, either from external structured data sources (currently Hive and Parquet) or by adding a schema to an existing RDD. Spark SQL’s API interoperates with the RDD data model, allowing users to interleave Spark code with SQL statements. Under the hood, Spark SQL uses the Catalyst optimizer to choose an efficient execution plan, and can automatically push predicates into storage formats like Parquet. In future releases, Spark SQL will also provide a common API to other storage systems.
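A rough sketch of what this interleaving can look like, following the shape of the alpha API (the Person case class, file path, and table name are illustrative assumptions, not part of the release notes):

```scala
// Mixing RDD code with SQL via the new alpha Spark SQL component.
// Assumes an existing SparkContext `sc` and a comma-separated text file.
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly converts RDDs of case classes to SchemaRDDs

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(fields => Person(fields(0), fields(1).trim.toInt))

people.registerAsTable("people")    // attach a schema and a table name to the RDD
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => "Name: " + row(0)).collect().foreach(println)
```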
MLlib Improvements
In 1.0.0, Spark’s MLlib adds support for sparse feature vectors in Scala, Java, and Python. It takes advantage of sparsity in both storage and computation in linear methods, k-means, and naive Bayes. In addition, this release adds several new algorithms: scalable decision trees for both classification and regression, distributed matrix algorithms including SVD and PCA, model evaluation functions, and L-BFGS as an optimization primitive. The programming guide and code examples for MLlib have also been greatly expanded.
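As a small sketch of the new sparse feature support (the vector size, index positions, and values below are arbitrary examples):

```scala
// A sparse vector of size 10 with non-zero entries at indices 1 and 4,
// wrapped in a LabeledPoint as the linear methods and naive Bayes expect.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val features = Vectors.sparse(10, Array(1, 4), Array(3.0, 7.5))
val example  = LabeledPoint(1.0, features)
println(example)
```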
GraphX and Streaming Improvements
In addition to usability and maintainability improvements, GraphX in Spark 1.0 brings substantial performance boosts in graph loading, edge reversal, and neighborhood computation. These operations now require less communication and produce simpler RDD graphs. Spark’s Streaming module has added performance optimizations for stateful stream transformations, along with improved Flume support, and automated state cleanup for long running jobs.
Extended Java and Python Support
Spark 1.0 adds support for Java 8's new lambda syntax in its Java bindings. Java 8 supports a concise syntax for writing anonymous functions, similar to the closure syntax in Scala and Python. This change requires small updates from users of the current Java API, which are noted in the documentation. Spark's Python API has been extended to support several new functions. We've also included several stability improvements in the Python API, particularly for large datasets. PySpark now supports running on YARN as well.
Documentation
Spark’s programming guide has been significantly expanded to centrally cover all supported languages and discuss more operators and aspects of the development life cycle. The MLlib guide has also been expanded with significantly more detail and examples for each algorithm, while documents on configuration, YARN and Mesos have also been revamped.
Smaller Changes
- PySpark now works with more Python versions than before – Python 2.6+ instead of 2.7+, and NumPy 1.4+ instead of 1.7+.
- Spark has upgraded to Avro 1.7.6, adding support for Avro specific types.
- Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs.
- Support for off-heap storage in Tachyon has been added via a special build target.
- Datasets persisted with DISK_ONLY now write directly to disk, significantly improving memory usage for large datasets.
- Intermediate state created during a Spark job is now garbage collected when the corresponding RDDs become unreferenced, improving performance.
- Spark now includes a Javadoc version of all its API docs and a unified Scaladoc for all modules.
- A new SparkContext.wholeTextFiles method lets you operate on small text files as individual records.
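The sketch below illustrates two of these items together, the DISK_ONLY persistence change and the new wholeTextFiles method (the HDFS directory path is a placeholder; an existing SparkContext sc is assumed):

```scala
import org.apache.spark.storage.StorageLevel

// Each small file under the directory becomes a single (path, content) record.
val files = sc.wholeTextFiles("hdfs:///data/small-files/")

// DISK_ONLY-persisted data is now written straight to disk rather than buffered in memory first.
files.persist(StorageLevel.DISK_ONLY)

files.map { case (path, content) => (path, content.length) }
     .collect()
     .foreach { case (path, len) => println(s"$path: $len characters") }
```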
Migrating to Spark 1.0
While most of the Spark API remains the same as in 0.x versions, a few changes have been made for long-term flexibility, especially in the Java API (to support Java 8 lambdas). The documentation includes migration information to upgrade your applications.