Spark 2.0 Released

The Apache Spark team has announced the production release of Spark 2.0.0. Release notes are here. Read on for details of the new features, together with explanations drawn from Spark Summit and elsewhere.

Measured by the number of contributors, Apache Spark remains the most active open source project in the Big Data ecosystem.

The Spark team guarantees stability for all non-experimental APIs throughout the 2.x line.

Highlights

Spark Summit: Matei Zaharia summarizes highlights of the release. Slides here.

— Webinar: Reynold Xin and Jules S. Damji introduce you to Spark 2.0.

— Reynold Xin explains technical details of Spark 2.0.

SQL Processing

Key Changes

New and updated APIs:

  • In Scala and Java, the DataFrame and Dataset APIs are unified; DataFrame is now simply an alias for Dataset[Row].
  • In Python and R, DataFrame remains the main programming interface, given those languages' lack of compile-time type safety.
  • SparkSession is the new entry point for the DataFrame API, replacing SQLContext and HiveContext (both of which are retained for backward compatibility).
  • Enhancements to the Accumulator and Aggregator APIs.

Spark 2.0 supports SQL:2003 and can now run all 99 TPC-DS queries:

  • Native SQL parser supports ANSI SQL and HiveQL.
  • Native DDL command implementations.
  • Subquery support.
  • View canonicalization support.

Additional new features:

  • Native CSV support.
  • Off-heap memory management for caching and runtime.
  • Hive-style bucketing.
  • Approximate summary statistics.

Performance enhancements:

  • Speedups of 2X–10X for common SQL and DataFrame operators, largely from whole-stage code generation.
  • Improved performance with Parquet and ORC.
  • Improvements to Catalyst query optimizer for common workloads.
  • Improved performance for window functions.
  • Automatic file coalescing for native data sources.

Explainers

Spark Summit: Andrew Or explains memory management in Spark 2.0+. Slides here.

Spark Summit: Databricks’ Michael Armbrust explains structured analysis in Spark: DataFrames, Datasets, and Streaming. Slides here.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— On KDnuggets, Paige Roberts explains Project Tungsten.

— Sameer Agarwal, Davies Liu, and Reynold Xin dive deep into Spark 2.0’s second-generation Tungsten engine. This paper inspired Tungsten’s design.

Spark Summit: Yin Huai dives deeply into Catalyst, the Spark optimizer. Slides here.

— On the Databricks blog, Davies Liu and Herman van Hövell explain SQL subqueries in Spark 2.0.

Spark Summit: AMPLab’s Ankur Dave explains GraphFrames for graph queries in Spark SQL. Slides here.

Spark Streaming

Key Changes

Spark 2.0 includes an experimental release of Structured Streaming, a high-level streaming API built on the Spark SQL engine: the user expresses a streaming computation as a query over an unbounded table, and the engine runs it incrementally as new data arrives.

Explainers

Spark Summit: Tathagata Das explains Structured Streaming. Slides here.

— In an O’Reilly podcast, Ben Lorica asks Michael Armbrust about Structured Streaming.

— In InfoWorld, Ian Pointer explains Structured Streaming’s significance.

Machine Learning

Key Changes

The DataFrame-based API (previously named Spark ML) is now the primary API for machine learning in Spark; the RDD-based API is now in maintenance mode.

ML persistence is a key new feature, enabling the user to save and load ML models and pipelines in Scala, Java, Python, and R.

Additional techniques supported vary by API:

  • DataFrames-based API: Bisecting k-means clustering, Gaussian Mixture Model (GMM), MaxAbsScaler feature transformer.
  • PySpark: LDA, GMM, Generalized linear regression
  • SparkR: Naïve Bayes, k-means clustering, and survival regression, plus new families and link functions for GLM.

Explainers

Spark Summit: Joseph Bradley previews machine learning in Spark 2.0. Slides here.

— On the Databricks blog, Joseph Bradley explains model persistence in Spark 2.0.

— Tim Hunter, Hossein Falaki, and Joseph Bradley explain approximate algorithms.

SparkR

Key Changes

SparkR now includes three user-defined-function APIs: dapply, gapply, and spark.lapply. The first two apply an R function to each partition or each group of a SparkDataFrame, respectively; spark.lapply distributes a function over a list of elements, which is useful for tasks such as hyper-parameter tuning.

As noted above, the SparkR API supports additional machine learning techniques and pipeline persistence. The API also supports more DataFrame functionality, including SparkSession, window functions, plus read/write support for JDBC and CSV.

Explainers

Spark Summit: Xiangrui Meng explains the latest developments in SparkR. Slides here.

— Live webinar: Hossein Falaki and Denny Lee demonstrate exploratory analysis with Spark and R.

— UseR 2016: Hossein Falaki and Shivaram Venkataraman deliver a tutorial on SparkR.
