On Friday, March 13, the Apache Spark team announced availability Release 1.3.0. See Databricks’ announcement here; additional coverage here.
Spark continues to maintain its rapid cadence of enhancements, with 175 contributors and more than 1,000 commits. The new DataFrame API, previously called SchemaRDD, is a key feature of the new release. DataFrames are a key prerequisite for the long awaited R interface to Spark; originally envisioned for Release 1.3 at the earliest, the Spark team now expects to include this in Release 1.4.
While there is strong developer interest in Spark Core, the APIs and the SQL, MLLib and Streaming libraries, interest in GraphX remains low. Together with the low confidence in GraphX among users (surveyed recently by Typesafe and Databricks), this raises questions about the future of the module.
Here is a list of new features in this release:
Spark Core
- Multi-level aggregation trees to speed up reduce operations
- Improved error reporting for certain operations
- Spark’s Jetty dependency is now shaded
- Support for SSL encryption
- Support for realtime GC metrics and record counts added to the UI
DataFrame API
- New capability, previously called SchemaRDD
- Includes named fields along with schema information
- Common interchange format among spark components as well as import/export
- Build from Hive tables, JSON data, JDBC databases or any of Spark’s data source APIs
Spark SQL
- Graduates from alpha
- Backward compatibility for HiveQL and stable APIs
- Support for writing tables in data sources
- New JDBC data source enables interface with MySQL, Postgres and other RDBMS systems
- Ability to merge compatible schemas in Parquet
Spark MLLib
New algorithms:
- Latent Dirichlet Allocation for topic modeling
- Multinomial Logistic Regression for multi-class classification
- Gaussian Mixture Modeling and Power Iteration Clustering for clustering
- FP-Growth for frequent pattern mining (association learning)
- Block Matrix Abstraction for distributed linear algebra
Spark has also added an initial capability to import and export models for some algorithms using a Spark-specific format. The team plans to add import/export capability for additional models in the future, as well as PMML support. Design document here.
Also new in 1.3.0: performance improvements for k-Means and ALS; Python API for the ML Pipeline, Gradient-Boosted Trees and Gaussian Mixture Model; and support for DataFrames.
Spark Streaming
- Direct Kafka API eliminates need for write-ahead logs
- Python Kafka API
- Streaming Logistic Regression
- Binary record support
- Ability to load an initial state RDD
Spark GraphX
Updates to GraphX include several utility functions, including a tool to transform a graph into a canonical edge graph.