Thomas Dinsmore's Blog

Spark Release 1.3.0 Goes Live

March 17, 2015

Written by:

On Friday, March 13, the Apache Spark team announced availability Release 1.3.0. See Databricks’ announcement here; additional coverage here.

Spark continues to maintain its rapid cadence of enhancements, with 175 contributors and more than 1,000 commits. The new DataFrame API, previously called SchemaRDD, is a key feature of the new release. DataFrames are a key prerequisite for the long awaited R interface to Spark; originally envisioned for Release 1.3 at the earliest, the Spark team now expects to include this in Release 1.4.

While there is strong developer interest in Spark Core, the APIs and the SQL, MLLib and Streaming libraries, interest in GraphX remains low. Together with the low confidence in GraphX among users (surveyed recently by Typesafe and Databricks), this raises questions about the future of the module.

Here is a list of new features in this release:

Spark Core

Multi-level aggregation trees to speed up reduce operations
Improved error reporting for certain operations
Spark’s Jetty dependency is now shaded
Support for SSL encryption
Support for realtime GC metrics and record counts added to the UI

DataFrame API

New capability, previously called SchemaRDD
Includes named fields along with schema information
Common interchange format among spark components as well as import/export
Build from Hive tables, JSON data, JDBC databases or any of Spark’s data source APIs

Spark SQL

Graduates from alpha
Backward compatibility for HiveQL and stable APIs
Support for writing tables in data sources
New JDBC data source enables interface with MySQL, Postgres and other RDBMS systems
Ability to merge compatible schemas in Parquet

Spark MLLib

New algorithms:

Latent Dirichlet Allocation for topic modeling
Multinomial Logistic Regression for multi-class classification
Gaussian Mixture Modeling and Power Iteration Clustering for clustering
FP-Growth for frequent pattern mining (association learning)
Block Matrix Abstraction for distributed linear algebra

Spark has also added an initial capability to import and export models for some algorithms using a Spark-specific format. The team plans to add import/export capability for additional models in the future, as well as PMML support. Design document here.

Also new in 1.3.0: performance improvements for k-Means and ALS; Python API for the ML Pipeline, Gradient-Boosted Trees and Gaussian Mixture Model; and support for DataFrames.

Spark Streaming

Direct Kafka API eliminates need for write-ahead logs
Python Kafka API
Streaming Logistic Regression
Binary record support
Ability to load an initial state RDD

Spark GraphX

Updates to GraphX include several utility functions, including a tool to transform a graph into a canonical edge graph.

Leave a comment Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Thomas Dinsmore's Blog

Spark Release 1.3.0 Goes Live

AI Is Coming For Your Job!!!

Spring 2024 Preview

More on AI Venture Funding

Spark Release 1.3.0 Goes Live

Share this:

Leave a comment Cancel reply

AI Is Coming For Your Job!!!

Spring 2024 Preview

More on AI Venture Funding