Apache Spark 1.3 released: an open-source, in-memory cluster computing system


Apache Spark 1.3 has been released. Version 1.3 introduces the long-awaited DataFrame API, an evolution of Spark's RDD abstraction designed to make working with large datasets simple and fast. The release also brings a large number of improvements across Streaming, ML, and SQL.

Spark is an open-source cluster computing system based on in-memory computation, designed to make data analytics faster. Spark is remarkably compact: it was developed by a small team at UC Berkeley's AMP Lab led by Matei Zaharia, it is written in Scala, and the core of the project consists of only 63 Scala files.
Spark is an open-source cluster computing environment similar to Hadoop, but there are some useful differences between the two that make Spark superior for certain workloads: Spark enables in-memory distributed datasets, which, in addition to supporting interactive queries, also optimize iterative workloads.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on top of the Hadoop file system; this is made possible by a third-party cluster framework called Mesos. Developed at the AMP Lab (Algorithms, Machines, and People Lab) at UC Berkeley, Spark can be used to build large-scale, low-latency data analytics applications.
Spark cluster computing architecture
Although Spark has similarities with Hadoop, it provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of workload in cluster computing: workloads that reuse a working dataset across parallel operations, such as machine learning algorithms. To optimize these workloads, Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory to reduce access latency.
Spark also introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects distributed across a set of nodes. These collections are resilient: if part of a dataset is lost, it can be rebuilt. Rebuilding part of a dataset relies on a fault-tolerance mechanism that maintains "lineage", i.e. the information that allows a portion of the dataset to be reconstructed from the process by which it was derived. An RDD is represented as a Scala object, and it can be created from a file, as a parallelized slice spread across nodes, as a transformation of another RDD, or by changing the persistence of an existing RDD, such as requesting that it be cached in memory.
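To make these RDD concepts concrete, here is a minimal PySpark sketch covering each way of obtaining an RDD; the file path, application name, and data are illustrative assumptions, not values taken from this article.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")  # local mode, purely for illustration

# An RDD created from a file (the path is hypothetical)
lines = sc.textFile("hdfs:///path/to/input.txt")

# An RDD created as a parallelized slice spread across the nodes
numbers = sc.parallelize(range(1000), 4)

# An RDD derived as a transformation of another RDD
words = lines.flatMap(lambda line: line.split())

# Changing the persistence of an existing RDD: request that it be cached in memory
words.cache()
print(words.count())  # the first action materializes the RDD and populates the cache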
Applications in Spark are called driver programs, and they carry out operations that run on a single node or in parallel across a set of nodes. Like Hadoop, Spark supports both single-node and multi-node clusters. For multi-node operation, Spark relies on the Mesos cluster manager. Mesos provides an efficient platform for resource sharing and isolation among distributed applications, and this setup allows Spark to coexist with Hadoop in a single shared pool of nodes.
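As a hedged illustration of the driver-program model described above, the following sketch configures a SparkContext against a Mesos master; the master URL, host name, and memory setting are placeholders rather than values from this article.

from pyspark import SparkConf, SparkContext

# The mesos:// URL and executor memory below are illustrative placeholders
conf = (SparkConf()
        .setAppName("driver-example")
        .setMaster("mesos://mesos-master.example.com:5050")
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())  # a trivial job executed in parallel by the cluster
sc.stop()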

Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.

We’ll be publishing in-depth overview posts covering Spark’s new features over the coming weeks. Some of the salient features of this release are:

A new DataFrame API

The DataFrame API that we recently announced officially ships in Spark 1.3. DataFrames evolve Spark’s RDD model, making operations with structured datasets even faster and easier. They are inspired by, and fully interoperable with, Pandas and R data frames, and are available in Spark’s Java, Scala, and Python APIs as well as the upcoming (unreleased) R API. DataFrames introduce new simplified operators for filtering, aggregating, and projecting over large datasets. Internally, DataFrames leverage the Spark SQL logical optimizer to intelligently plan the physical execution of operations to work well on large datasets. This planning permeates all the way into physical storage, where optimizations such as predicate pushdown are applied based on analysis of user programs. Read more about the DataFrame API in the SQL programming guide.

# Construct a DataFrame from a JSON dataset.
users = context.load("s3n://path/to/users.json", "json")

# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
young = users[users.age < 21]

# DataFrames support existing RDD operators
print("Young users: " + young.count())
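The snippet above only exercises filtering; as a hedged extension of the same hypothetical users dataset, the lines below sketch the projection and aggregation operators mentioned in the paragraph (the name and gender columns are assumptions about that dataset).

# Project a subset of columns, including a derived column
young.select(young.name, young.age + 1)

# Aggregate: count young users by gender
young.groupBy("gender").count().show()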

 

Spark SQL Graduates from Alpha

Spark SQL graduates from an alpha component in this release, guaranteeing compatibility for the SQL dialect and semantics in future releases. Spark SQL’s data source API now fully interoperates with the new DataFrame component, allowing users to create DataFrames directly from Hive tables, Parquet files, and other sources. Users can also intermix SQL and DataFrame operators on the same datasets. New in 1.3 is the ability to read and write tables over a JDBC connection, with native support for Postgres, MySQL, and other RDBMS systems. The API also adds write support for producing output tables, whether to JDBC or any other source.

> CREATE TEMPORARY TABLE impressions
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:postgresql:dbserver",
    dbtable "impressions"
  )

> SELECT COUNT(*) FROM impressions
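The SQL above registers a JDBC table; as a hedged Python counterpart, the sketch below loads the same hypothetical table through the data source API and mixes SQL with DataFrame operators (sqlContext is assumed to be an existing SQLContext, and the connection URL and table name are the placeholders from the example above).

# Load a DataFrame through the JDBC data source
impressions = sqlContext.load(source="org.apache.spark.sql.jdbc",
                              url="jdbc:postgresql:dbserver",
                              dbtable="impressions")

# Intermix SQL and DataFrame operators on the same data
impressions.registerTempTable("impressions")
sqlContext.sql("SELECT COUNT(*) FROM impressions").show()
print(impressions.count())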

Built-in Support for Spark Packages

At the end of 2014, we announced an initiative to create a community package repository for Spark. Today Spark Packages has 45 community projects catering to Spark developers, including data source integrations, testing utilities, and tutorials. To make packages easy to use, Spark 1.3 includes support for pulling published packages into the Spark shell or a program with a single flag.

# Launching Spark shell with a package
./bin/spark-shell --packages databricks/spark-avro:0.2

 

For developers, Spark Packages has also created an SBT plugin to make publishing packages easy and introduced automatic Spark compatibility checks of new releases.

Lower Level Kafka Support in Spark Streaming

Over the last few releases, Kafka has become a popular input source for Spark Streaming. Spark 1.3 adds a new Kafka streaming source that leverages Kafka’s replay capabilities to provide reliable delivery semantics without the use of a write-ahead log. It also provides primitives that enable exactly-once guarantees for applications with strong consistency requirements. Kafka support adds a Python API in this release, along with new primitives for creating Python APIs in the future. For a full list of Spark Streaming features, see the upstream release notes.
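As a hedged illustration of the Python Kafka API mentioned above, the sketch below builds a receiver-based Kafka stream; the ZooKeeper address, consumer group, and topic name are placeholders, sc is assumed to be an existing SparkContext, and the spark-streaming-kafka artifact is assumed to be on the classpath.

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 2)  # 2-second batches, chosen arbitrarily

# ZooKeeper quorum, consumer group, and topic map are placeholders for your own setup
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", {"events": 1})

# Each record is a (key, message) pair; count messages per batch
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()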

New Algorithms in MLlib

Spark 1.3 provides a rich set of new algorithms. Latent Dirichlet allocation (LDA) is one of the first topic modeling algorithms to appear in MLlib; we’ll be documenting LDA in more detail in a follow-up post. Spark’s logistic regression has been generalized to multinomial logistic regression for multiclass classification. This release also adds improved clustering through Gaussian mixture models and power iteration clustering, and frequent itemset mining through FP-growth. Finally, an efficient block matrix abstraction is introduced for distributed linear algebra. Several other algorithms and utilities are added and discussed in the full release notes.
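As one hedged example of these additions, the sketch below fits a Gaussian mixture model with MLlib’s Python API; the data is synthetic, the parameter values are arbitrary, and sc is assumed to be an existing SparkContext.

from numpy import array
from pyspark.mllib.clustering import GaussianMixture

# Synthetic 1-D data with two rough clusters (illustrative only)
data = sc.parallelize([array([x]) for x in
                       [-5.1, -4.9, -5.3, -4.8, 4.7, 5.2, 5.0, 4.9]])

# Fit a two-component Gaussian mixture model
model = GaussianMixture.train(data, k=2, convergenceTol=1e-3, maxIterations=100)

print(model.weights)                  # mixing weights of the two components
print(model.predict(data).collect())  # cluster assignments for each point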


This post only scratches the surface of interesting features in Spark 1.3. Overall, this release contains more than 1000 patches from 176 contributors, making it our largest release yet. Head over to the official release notes to learn more about this release, and watch the Databricks blog for more detailed posts about the major features in the next few weeks!
