Spark 1.6 Released
The Spark team announces release of Spark 1.6.0. For a full list of new features, review the release notes here. On the Databricks blog, Michael Armbrust, Patrick Wendell and Reynold Xin announce the release and summarize key enhancements. In November, the same authors announced a preview of the release.
The contributor base continues to grow, as shown in the chart below:
Key enhancements include:
- Datasets API
- Automatic memory management
- Improvements to Spark SQL performance and API
- Faster streaming state management
- Expanded machine learning capabilities
Let’s review each of these in turn.
The most important new feature in Spark 1.6 is the Datasets API, which allows users to easily express transformations on domain objects, while also providing the performance advantages of the Spark SQL execution engine. The Datasets API supports static typing and user functions that run directly on existing JVM types (such as user classes, Avro schemas, etc). It leverages the Spark Catalyst optimizer and Project Tungsten’s fast in-memory encoding.
On the Databricks blog, Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zaharia introduce you to Spark Datasets. The team expects to enhance the API over the next several releases.
Automatic Memory Management
In previous releases, Spark partitioned available memory into two static regions, one for execution and the other to cache hot data. Developers specified the desired memory configuration, but doing so required user expertise; operations requiring more than the specified amount of memory spilled to disk, sandbagging performance.
With Release 1.6, the Spark memory manager automatically expands and contracts memory regions as needed at runtime. Since developers no longer need to guess about how much memory to allocate, over-allocate to one region or another to support a single operation, more memory is available for other operations.
Improvements to Spark SQL
There are a number of improvements to Spark SQL, including:
- Improved scan performance when reading flat schemas from Parquet.
- Better query plans for high-cardinality distinct aggregations.
- Improved performance for complex types in columnar cache.
- Ability to run queries directly on files, like Apache Drill.
Spark SQL performs reasonably well already against other SQL engines. It will be interesting to see some benchmarks with the new TPC standards.
Spark Streaming: Improved State Management
In a post on the Databricks blog from July, 2015, Tathagata Das, Mahei Zaharia and Patrick Wendell describe how Spark Streaming works. In previous releases, updateStateByKey allowed the user to maintain per-key state and manage that state using an updateFunction, which called for each key, used new data and the existing state of the key to generate an updated state.
Based on user feedback, the Spark team defined a need for a new capability in the streaming API (called mapWithStateAPI) that tracks changes in the data rather than scanning the entire dataset. The team reports a 10X speedup in state management through this enhancement.
Expanded Machine Learning Capabilities
The team has added three new feature transformers for ML pipelines.
— The ChiSqSelector transformer operates on labeled data with categorical features. It performs a Chi-Squared test of independence on the class, then selects the top features.
— SQLTransformer implements transformations defined by a SQL statement.
— QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
Also new in ML: an algorithm for Accelerated Failure Time (AFT) analysis, a parametric survival regression model for censored data, also called log-linear model for survival analysis. AFT is easier to parallelize than the Cox Proportional Hazards model.
An additional new feature for ML is pipeline persistence, or the ability to save and load ML Pipelines.
Additional new features for ML and MLlib:
- Normal equation solver for ordinary least squares with a limited number of features.
- Bisecting k-means clustering, a hierarchical clustering technique that is faster than regular k-means but produces different results.
- A Latent Dirichret Allocation (LDA) wrapper for topic modeling in ML pipelines.
For streaming analytics, Spark 1.6 implements streaming significance testing to support use cases such as A/B testing.
There are also some enhancements to the R and Python APIs. In general, the R API lags significantly behind the Java, Scala and Python APIs. In Spark 1.6, there are no enhancements to GraphX.