Spark 2.0 Released
The Apache Spark team has announced the production release of Spark 2.0.0; release notes are available on the Apache Spark site. Read on for details of the new features, together with explanations culled from Spark Summit and elsewhere.
Measured by the number of contributors, Apache Spark remains the most active open source project in the Big Data ecosystem.
The Spark team guarantees API stability for all production releases in the Spark 2.X line.
— Webinar: Reynold Xin and Jules S. Damji introduce you to Spark 2.0.
— Reynold Xin explains technical details of Spark 2.0.
New and updated APIs:
- In Scala and Java, the DataFrame and Dataset APIs are unified; DataFrame is now simply an alias for Dataset[Row].
- In Python and R, which lack compile-time type safety, DataFrame remains the main programming interface.
- For the DataFrame API, SparkSession replaces SQLContext and HiveContext as the single entry point (see the sketch after this list).
- Enhancements to the Accumulator and Aggregator APIs.
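To make the changes concrete, here is a minimal Scala sketch of the new entry point and the unified API; the Person case class and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession subsumes SQLContext and HiveContext as the entry point.
val spark = SparkSession.builder()
  .appName("Spark2Example")
  .getOrCreate()

import spark.implicits._

// Hypothetical domain type; in Scala, DataFrame is just Dataset[Row].
case class Person(name: String, age: Long)

val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS() // typed Dataset[Person]
val df = people.toDF()                                          // untyped DataFrame

df.filter($"age" > 40).show()
```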
Spark 2.0 supports SQL:2003 and can now run all 99 TPC-DS queries:
- Native SQL parser supports ANSI SQL and HiveQL.
- Native DDL command implementations.
- Subquery support (see the sketch after this list).
- View canonicalization support.
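As a sketch of the new subquery support, the query below filters a hypothetical sales view with a scalar subquery; IN and EXISTS subqueries are handled similarly by the native parser:

```scala
// Assumes `spark` is a SparkSession and `salesDf` an existing DataFrame.
salesDf.createOrReplaceTempView("sales")

// Scalar subquery in the WHERE clause: rows above the average amount.
val aboveAverage = spark.sql("""
  SELECT region, amount
  FROM sales
  WHERE amount > (SELECT AVG(amount) FROM sales)
""")
aboveAverage.show()
```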
Additional new features:
- Native CSV support (see the sketch after this list).
- Off-heap memory management for caching and runtime.
- Hive-style bucketing.
- Approximate summary statistics.
- Speedups of 2X-10X for common SQL and DataFrame operators.
- Improved performance with Parquet and ORC.
- Improvements to Catalyst query optimizer for common workloads.
- Improved performance for window functions.
- Automatic file coalescing for native data sources.
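Two of these additions in a short sketch: the built-in CSV reader and approximate summary statistics via approxQuantile. The file path and column name are hypothetical:

```scala
// Native CSV support: no external spark-csv package required.
val measurements = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/measurements.csv")

// Approximate summary statistics: median and 95th percentile,
// computed within a 1% relative error bound.
val quantiles = measurements.stat.approxQuantile("value", Array(0.5, 0.95), 0.01)
```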
— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.
— On KDnuggets, Paige Roberts explains Project Tungsten.
— On the Databricks blog, Davies Liu and Herman van Hövell explain SQL subqueries in Spark 2.0.
Spark 2.0 includes an experimental release of Structured Streaming.
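A minimal Structured Streaming sketch, with the input directory and schema hypothetical; new JSON files dropped into the directory are treated as rows appended to an unbounded table, and the running counts are continuously updated:

```scala
import org.apache.spark.sql.types._

// A schema is required for streaming file sources.
val schema = new StructType()
  .add("event", StringType)
  .add("ts", TimestampType)

val events = spark.readStream
  .schema(schema)
  .json("/data/incoming")

// The same DataFrame operations as in batch mode.
val counts = events.groupBy("event").count()

// Continuously print the updated counts to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```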
— In an O’Reilly podcast, Ben Lorica asks Michael Armbrust about Structured Streaming.
— In InfoWorld, Ian Pointer explains Structured Streaming’s significance.
The DataFrame-based API (previously named Spark ML) is now the primary API for machine learning in Spark; the RDD-based API enters maintenance mode.
ML persistence is a key new feature, enabling the user to save and load ML models and pipelines in Scala, Java, Python, and R.
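A sketch of pipeline persistence in Scala; the pipeline stages, the training DataFrame, and the save path are hypothetical:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical text-classification pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(trainingDf) // trainingDf: DataFrame with text and label columns

// Save the fitted pipeline; the on-disk format is language-independent,
// so it can be loaded back from Scala, Java, Python, or R.
model.write.overwrite().save("/models/spam-pipeline")
val restored = PipelineModel.load("/models/spam-pipeline")
```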
Additional techniques supported vary by API:
- DataFrame-based API: bisecting k-means clustering (sketched after this list), Gaussian Mixture Model (GMM), and the MaxAbsScaler feature transformer.
- PySpark: LDA, GMM, and generalized linear regression.
- SparkR: Naïve Bayes, k-means clustering, and survival regression, plus new families and link functions for GLM.
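For instance, the new bisecting k-means estimator in the DataFrame-based API follows the usual fit/transform pattern; the input DataFrame, assumed to carry a "features" vector column, is hypothetical:

```scala
import org.apache.spark.ml.clustering.BisectingKMeans

// Hierarchical divisive clustering, new in the DataFrame-based API.
val bkm = new BisectingKMeans()
  .setK(4)   // number of leaf clusters
  .setSeed(1L)

val model = bkm.fit(featureDf)             // featureDf: DataFrame with a "features" column
val clustered = model.transform(featureDf) // adds a "prediction" column
```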
— On the Databricks blog, Joseph Bradley explains model persistence in Spark 2.0.
— Tim Hunter, Hossein Falaki, and Joseph Bradley explain approximate algorithms.
SparkR gains three user-defined-function mechanisms: dapply, which applies an R function to each partition of a SparkDataFrame; gapply, which applies a function to each group; and spark.lapply, which distributes a function over a list of elements and suits tasks such as hyper-parameter tuning.
As noted above, the SparkR API supports additional machine learning techniques and pipeline persistence. It also gains more DataFrame functionality, including SparkSession, window functions, and read/write support for JDBC and CSV data sources.
— Live webinar: Hossein Falaki and Denny Lee demonstrate exploratory analysis with Spark and R.
— UseR 2016: Hossein Falaki and Shivaram Venkataraman deliver a tutorial on SparkR.