Spark Release 1.3.0 Goes Live

On Friday, March 13, the Apache Spark team announced availability Release 1.3.0.  See Databricks’ announcement here; additional coverage here.

Spark continues to maintain its rapid cadence of enhancements, with 175 contributors and more than 1,000 commits.  The new DataFrame API, previously called SchemaRDD, is a key feature of the new release.  DataFrames are a key prerequisite for the long awaited R interface to Spark; originally envisioned for Release 1.3 at the earliest, the Spark team now expects to include this in Release 1.4.

While there is strong developer interest in Spark Core, the APIs and the SQL, MLLib and Streaming libraries, interest in GraphX remains low.  Together with the low confidence in GraphX among users (surveyed recently by Typesafe and Databricks), this raises questions about the future of the module.

Here is a list of new features in this release:

Spark Core

  • Multi-level aggregation trees to speed up reduce operations
  • Improved error reporting for certain operations
  • Spark’s Jetty dependency is now shaded
  • Support for SSL encryption
  • Support for realtime GC metrics and record counts added to the UI

DataFrame API

  • New capability, previously called SchemaRDD
  • Includes named fields along with schema information
  • Common interchange format among spark components as well as import/export
  • Build from Hive tables, JSON data, JDBC databases or any of Spark’s data source APIs

Spark SQL

  • Graduates from alpha
  • Backward compatibility for HiveQL and stable APIs
  • Support for writing tables in data sources
  • New JDBC data source enables interface with MySQL, Postgres and other RDBMS systems
  • Ability to merge compatible schemas in Parquet

Spark MLLib

New algorithms:

Spark has also added an initial capability to import and export models for some algorithms using a Spark-specific format.  The team plans to add import/export capability for additional models in the future, as well as PMML support.  Design document here.

Also new in 1.3.0: performance improvements for k-Means and ALS; Python API for the ML Pipeline, Gradient-Boosted Trees and Gaussian Mixture Model; and support for DataFrames.

Spark Streaming

Spark GraphX

Updates to GraphX include several utility functions, including a tool to transform a graph into a canonical edge graph.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.