For an overview of Spark, see the Apache Spark Page.
On September 11, the Spark team announced the release of Spark 1.1. This latest version of Spark includes a number of significant enhancements:
- As announced at the Spark Summit, Shark has now converged with Spark SQL. Databricks has migrated its Shark workloads to Spark SQL and reports a 2X-5X performance improvement.
- The team has added to MLlib a library of basic statistics for exploratory analysis, including correlations and hypothesis testing. There are also new tools for stratified sampling and random data generation.
- Also new to MLlib: utilities for feature transformation and for feature extraction in text mining. Feature extraction techniques include Word2Vec and TF-IDF; transformation techniques include normalization and scaling.
- New MLlib algorithms include non-negative matrix factorization and singular value decomposition (SVD) using the Lanczos algorithm. The combination of feature extraction capabilities and a robust SVD gives Spark a strong foundation for text mining.
- For Spark Streaming, the team has added support for Amazon Kinesis and a streaming linear regression algorithm.
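To make the feature extraction and transformation items above concrete, here is a minimal sketch of TF-IDF weighting followed by L2 normalization. It is plain Python, not MLlib's API: the function names `tf_idf` and `l2_normalize` and the sample documents are made up for illustration, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Illustrative TF-IDF. docs is a list of token lists.

    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): a term found in every
    # document gets weight log(1) = 0.
    idf = {t: math.log((n_docs + 1) / (count + 1)) for t, count in df.items()}
    # Term frequency is the raw count of the term in the document.
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def l2_normalize(vec):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: (w / norm if norm else 0.0) for t, w in vec.items()}


# Hypothetical sample corpus, already tokenized.
docs = [["spark", "sql"], ["spark", "streaming"]]
weights = tf_idf(docs)
normalized = [l2_normalize(w) for w in weights]
```

In this toy corpus "spark" appears in both documents, so its weight drops to zero, while the term unique to each document carries all of that document's weight after normalization.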
There are also many bug fixes, as well as performance and usability improvements. With ~175 contributors for this release, Spark continues to be one of the most active projects in the Hadoop ecosystem.
Since the release of Spark 1.0, Databricks has announced certification for three additional Spark distributions:
- BlueData, a pioneer in big data private cloud.
- Guavus, an operational intelligence platform.
- Stratio, a commercially supported open source “Pure Spark” distribution.
In related news, Databricks and O’Reilly Media recently announced a certification program, which will launch October 15-17 at Strata NY + Hadoop World.