Spark 1.4 Released

On June 11, the Spark team announced availability of Release 1.4.  More than 210 contributors from 70 different organizations contributed more than 1,000 patches.  Spark continues to expand its contributor base, the best measure of health for an open source project.

Screen Shot 2015-06-12 at 2.00.20 PM

Spark Core

The Spark team continues to improve Spark operability, performance and compatibility.  Key enhancements include:

  • The first phase in Project Tungsten performance improvements, a cache-friendly sort algorithm
  • Also for improved performance, serialized shuffle output
  • For the Spark UI, visualization for Spark DAGs and operational monitoring
  • A REST API for application information, such as job, stage, task and storage status
  • For Python users, support for Python 3.x, plus external spilling for Python groupByKey operations
  • Two YARN enhancements: support for YARN on EC2 and security for long-running YARN applications
  • Two Mesos enhancements: Docker support and cluster mode.

DataFrames and SQL

This release includes extensions of analytic functions for DataFrames, operational utilities for Spark SQL and support for ORCFile format.

A complete list of enhancements to the DataFrame API is here.

R Interface

AMPLab released a developer version of SparkR in January 2014.  In June 2014, Alteryx and Databricks announced a partnership to lead development of this component.  In March, 2015, SparkR officially merged into Spark.

SparkR offers an interface to use Apache Spark from R.  In Spark 1.4, SparkR supports operations like selection, filtering and aggregation on large datasets.  It’s important to note that as of this release SparkR does not support an interface to MLLib, Streaming or GraphX.

Machine Learning

In Spark 1.4, ML pipelines graduate from alpha release, add feature transformations (Vector Assembler, String Indexer, Bucketizer etc.) and a Python API.  Additional enhancements to ML include:

There appears to be an effort under way to rebuild MLLib’s supervised learning algorithms in ML.

Enhancements to MLLib include:

There is a single enhancement to GraphX in Spark 1.4, a personalized PageRank.  Spark’s graph analytics capabilities are comparatively static.

Streaming

The enhancements to Spark Streaming include improvements to the UI plus enhanced support for Kafka and Kinesis and a pluggable interface for write ahead logs.  Enhanced support for Kafka includes better error reporting, support for Kafka 0.8.2.1 and Kafka with Scala 2.11, input rate tracking and a Python API for Kakfa direct mode.

Advertisements

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s