Lightning-Fast Cluster Computing: Spark 0.6.0 Released
Spark 0.6.0 is a major release that brings several new features, architectural changes, and performance enhancements. The most notable additions are a standalone deploy mode, a Java API, and expanded documentation. In some areas, performance improves by as much as 2x.
You can also download this release as a source package (2 MB tar.gz) or a prebuilt package (48 MB tar.gz).
Simpler Deployment
In addition to running on Mesos, Spark now has a standalone deploy mode that lets you quickly launch a cluster without installing an external cluster manager. The standalone mode only needs Java installed on each machine, and Spark deployed to it.
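For example, a driver program connects to a running standalone cluster simply by passing the master's spark:// URL to SparkContext. A minimal Scala sketch, assuming a master at the placeholder address spark://master-host:7077 and the two-argument SparkContext constructor:

```scala
import spark.SparkContext

// Connect to a standalone master instead of Mesos;
// "spark://master-host:7077" is a placeholder URL.
val sc = new SparkContext("spark://master-host:7077", "StandaloneTest")

// Run a trivial job to confirm the cluster is reachable.
val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
println(evens) // 500
```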
In addition, there is experimental support for running on YARN (Hadoop NextGen), currently in a separate branch.
Java API
Java programmers can now use Spark through a new Java API layer. This layer makes available all of Spark's features, including parallel transformations, distributed datasets, broadcast variables, and accumulators, in a Java-friendly manner.
Expanded Documentation
Spark's documentation has been expanded with a new quick start guide, additional deployment instructions, configuration guide, tuning guide, and improved Scaladoc API documentation.
Engine Changes
Under the hood, Spark 0.6 has new, custom storage and communication layers brought in from the upcoming Spark Streaming project. These can improve performance over past versions by as much as 2x. Specifically:
- A new communication manager using asynchronous Java NIO lets shuffle operations run faster, especially when sending large amounts of data or when jobs have many tasks.
- A new storage manager supports per-dataset storage level settings (e.g. whether to keep the dataset in memory, deserialized, on disk, etc., or even replicated across nodes).
- Spark's scheduler and control plane have been optimized to better support ultra-low-latency jobs (under 500ms) and high-throughput scheduling decisions.
New APIs
- This release adds the ability to control caching strategies on a per-RDD level, so that different RDDs may be stored in memory, on disk, as serialized bytes, etc. You can choose your storage level using the persist() method on RDD (first sketch after this list).
- A new Accumulable class generalizes Accumulators for the case when the type being accumulated is not the same as the type of the elements being added (e.g. you wish to accumulate a collection, such as a Set, by adding individual elements; second sketch below).
- You can now dynamically add files or JARs that should be shipped to your workers with SparkContext.addFile/Jar (third sketch below).
- More Spark operators (e.g. joins) support custom partitioners (fourth sketch below).
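First, per-RDD storage levels via persist(). A minimal sketch, assuming StorageLevel lives in the spark.storage package and exposes a DISK_ONLY constant; consult the 0.6 documentation for the exact set of levels:

```scala
import spark.SparkContext
import spark.storage.StorageLevel

val sc = new SparkContext("local", "StorageLevels")
val data = sc.parallelize(1 to 1000000)

// Keep this derived RDD on disk only (constant name is an assumption).
val onDisk = data.map(_ * 2).persist(StorageLevel.DISK_ONLY)

// cache() remains shorthand for the default in-memory level.
val inMemory = data.filter(_ % 3 == 0).cache()
```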
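Second, Accumulable. The sketch below assumes an AccumulableParam[R, T] interface (R the accumulated type, T the element type) with addAccumulator, addInPlace, and zero methods, and that SparkContext.accumulable takes the initial value plus the param:

```scala
import spark.{AccumulableParam, SparkContext}

// Accumulate individual Strings into a Set[String]: the accumulated
// type (a collection) differs from the type of the elements added.
object StringSetParam extends AccumulableParam[Set[String], String] {
  def addAccumulator(set: Set[String], s: String): Set[String] = set + s
  def addInPlace(a: Set[String], b: Set[String]): Set[String] = a ++ b
  def zero(initial: Set[String]): Set[String] = Set.empty[String]
}

val sc = new SparkContext("local", "AccumulableExample")
val distinctWords = sc.accumulable(Set.empty[String])(StringSetParam)

sc.parallelize(Seq("spark", "mesos", "spark")).foreach { w =>
  distinctWords += w // each task adds single elements
}
println(distinctWords.value) // Set(spark, mesos)
```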
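Third, shipping dependencies at runtime, using the addFile/addJar methods named above; the paths here are hypothetical:

```scala
import spark.SparkContext

val sc = new SparkContext("local", "ShipDependencies")

// Ship a read-only data file to every worker (hypothetical path).
sc.addFile("/home/user/lookup-table.txt")

// Ship an extra JAR so that tasks can load classes from it.
sc.addJar("/home/user/extra-library.jar")
```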
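Fourth, custom partitioners on joins. A sketch, assuming a join overload that accepts a Partitioner, per the note above:

```scala
import spark.{HashPartitioner, SparkContext}
import spark.SparkContext._ // implicit conversions for RDDs of pairs

val sc = new SparkContext("local", "PartitionedJoin")
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "mug")))

// Control how the join's shuffle spreads keys over 8 partitions.
val joined = users.join(orders, new HashPartitioner(8))
joined.collect().foreach(println) // (1,(alice,book)), (1,(alice,pen))
```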
Enhanced Debugging
Spark's logs now print which operation in your program each RDD and job belongs to, making it easier to tie problems back to the part of your code that caused them.
Maven Artifacts
Spark is now available in Maven Central, making it easier to link it into your programs without having to build the JAR yourself. Use the following Maven coordinates to add it to a project:
- groupId: org.spark-project
- artifactId: spark-core_2.9.2
- version: 0.6.0
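For sbt users, the same coordinates translate into a one-line dependency; a sketch, where the _2.9.2 suffix must match your Scala version:

```scala
// In an sbt build definition:
libraryDependencies += "org.spark-project" % "spark-core_2.9.2" % "0.6.0"
```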
Compatibility
This release is source-compatible with Spark 0.5 programs, but you will need to recompile them against 0.6. In addition, the configuration for caching has changed: instead of having a spark.cache.class parameter that sets one caching strategy for all RDDs, you can now set a per-RDD storage level. Spark will warn if you try to set spark.cache.class.
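In practice the migration amounts to deleting the global property and calling persist() where you need it. A sketch; the old property name comes from the note above, while the cache class and storage level shown are illustrative:

```scala
import spark.SparkContext
import spark.storage.StorageLevel

// 0.5 style (now triggers a warning): one global strategy for all RDDs,
// e.g. System.setProperty("spark.cache.class", "spark.DiskSpillingCache")

// 0.6 style: choose a level for each RDD individually.
val sc = new SparkContext("local", "Migration")
val records = sc.textFile("/data/records.txt") // hypothetical path
val hot  = records.cache()                     // default in-memory level
val cold = records.map(_.toUpperCase).persist(StorageLevel.DISK_ONLY)
```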
Credits
Spark 0.6 was the work of a large set of new contributors from Berkeley and outside.
- Tathagata Das contributed the new communication layer, and parts of the storage layer.
- Haoyuan Li contributed the new storage manager.
- Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.
- Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.
- Josh Rosen contributed the Java API, and several bug fixes.
- Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.
- Reynold Xin contributed numerous bug and performance fixes.
- Imran Rashid contributed the new Accumulable class.
- Harvey Feng contributed improvements to shuffle operations.
- Shivaram Venkataraman improved Spark's memory estimation and wrote a memory tuning guide.
- Ravi Pandya contributed Spark run scripts for Windows.
- Mosharaf Chowdhury provided several fixes to broadcast.
- Henry Milner pointed out several bugs in sampling algorithms.
- Ray Racine provided improvements to the EC2 scripts.
- Paul Ruan and Bill Zhao helped with testing.
Thanks also to all the Spark users who have diligently suggested features and reported bugs.