Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)
Updated and bumped July 10, 2014.
For a powerpoint version on Slideshare, go here.
Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop. Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014. According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.
Organizations seeking to implement advanced analytics in Hadoop face two key challenges. First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.
A second key challenge is the plethora of analytic point solutions in Hadoop. These include, among others, Mahout for machine learning; Giraph, and GraphLab for graph analytics; Storm and S4 for streaming; or Hive, Impala and Stinger for interactive queries. Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.
Spark directly addresses these challenges. It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data. This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.
Second, Spark offers an integrated framework for analytics, including:
- Machine learning (MLLib)
- Graph analytics (GraphX)
- Streaming analytics (Spark Streaming)
- SQL interface (Spark SQL)
A closely related project, Shark, supports fast queries in Hadoop. Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project. The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.
Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs. RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs. RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence. They are designed to be fault tolerant, so that if an operation fails it can be reconstructed.
For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase). Hadoop supports text files, SequenceFiles and any other Hadoop InputFormat. Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.
Spark’s machine learning library, MLLib, is rapidly growing. In Release 1.0.0 (the latest release) it includes:
- Linear regression
- Logistic regression
- k-means clustering
- Support vector machines
- Alternating least squares (for collaborative filtering)
- Decision trees for classification and regression
- Naive Bayes classifier
- Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
- Model evaluation functions
- L-BFGS optimization primitive
Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization. MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).
In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark. Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark. The Mahout team will maintain existing MapReduce projects. There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.
Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.
GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework. It enables users to interactively load, transform, and compute on massive graphs. Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.
Spark Streaming offers an additional abstraction called discretized streams, or DStreams. DStreams are a continuous sequence of RDDs representing a stream of data. The user creates DStreams from live incoming data or by transforming other DStreams. Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.
Currently, Spark supports programming interfaces for Scala, Java and Python; MLLib algorithms support sparse feature vectors in all three languages. For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014
There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0. In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined. In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface. So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.
Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release for GraphX, enhancements to MLLib and many other enhancements). Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.
Spark is now available in every major Hadoop distribution. Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks. (For more on Cloudera’s support for Spark, go here). In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.
Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.
IBM’s commitment to Spark is unclear. While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.
In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine. Datastax will partner with Databricks on this project; availability expected summer 2014.
At the 2014 Spark Summit, SAP announced its support for Spark. SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.
On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem. The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.
At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.
In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.
Databricks offers a certification program for Spark; participants currently include:
- Alpine Data Labs
In May, Databricks and Concurrent Inc announced a strategic partnership. Concurrent plans to add Spark support to its Cascading development environment for Hadoop.
In December, the first Spark Summit attracted more than 450 participants from more than 180 companies. Presentations covered a range of applications such as neuroscience, audience expansion, real-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).
There is a rapidly growing list of Spark Meetups, including:
- San Francisco
- Washington DC Area
Now available for pre-order on Amazon:
Finally, this series of videos provides some good basic knowledge about Spark.