On June 11, the Spark team announced availability of Release 1.4. More than 210 contributors from 70 different organizations contributed more than 1,000 patches. Spark continues to expand its contributor base, the best measure of health for an open source project.
Spark Core
The Spark team continues to improve Spark operability, performance and compatibility. Key enhancements include:
- The first phase in Project Tungsten performance improvements, a cache-friendly sort algorithm
- Also for improved performance, serialized shuffle output
- For the Spark UI, visualization for Spark DAGs and operational monitoring
- A REST API for application information, such as job, stage, task and storage status
- For Python users, support for Python 3.x, plus external spilling for Python groupByKey operations
- Two YARN enhancements: support for YARN on EC2 and security for long-running YARN applications
- Two Mesos enhancements: Docker support and cluster mode.
DataFrames and SQL
This release includes extensions of analytic functions for DataFrames, operational utilities for Spark SQL and support for ORCFile format.
- Hive-style partitioning support in data sources API
- Ability to optimize very large joins through sort-merge
- Improved DataFrame API for equijoins and self-joins
- Dedicated UI for SQL JDBC
- Improved DataFrame and SQL error message reporting
- Window function support
- Column arithmetics in DataFrames
- Summary statistics (count, mean, std, min and max) for DataFrames
- Rollup and cube functions
A complete list of enhancements to the DataFrame API is here.
R Interface
AMPLab released a developer version of SparkR in January 2014. In June 2014, Alteryx and Databricks announced a partnership to lead development of this component. In March, 2015, SparkR officially merged into Spark.
SparkR offers an interface to use Apache Spark from R. In Spark 1.4, SparkR supports operations like selection, filtering and aggregation on large datasets. It’s important to note that as of this release SparkR does not support an interface to MLLib, Streaming or GraphX.
Machine Learning
In Spark 1.4, ML pipelines graduate from alpha release, add feature transformations (Vector Assembler, String Indexer, Bucketizer etc.) and a Python API. Additional enhancements to ML include:
- ElasticNet: Binary Logistic Regression with L1/L2
- OnevsRest: Multiclass to Binary Reduction
- Standardized developer APIs
There appears to be an effort under way to rebuild MLLib’s supervised learning algorithms in ML.
Enhancements to MLLib include:
- Stabilized API for decision trees, RandomForests and GradientBoosted Trees
- API for feature attributes
- Bernoulli Naive Bayes algorithm
- Latent Dirichlet Allocation with online variational inference
- Support recommendAll in matrix factorization model
- PMML export for k-means, linear regression, ridge regression, lasso, support vector machines and binary logistic regression
There is a single enhancement to GraphX in Spark 1.4, a personalized PageRank. Spark’s graph analytics capabilities are comparatively static.
Streaming
The enhancements to Spark Streaming include improvements to the UI plus enhanced support for Kafka and Kinesis and a pluggable interface for write ahead logs. Enhanced support for Kafka includes better error reporting, support for Kafka 0.8.2.1 and Kafka with Scala 2.11, input rate tracking and a Python API for Kakfa direct mode.
2 thoughts on “Spark 1.4 Released”