Spark 1.1 Update

For an overview of Spark, see the Apache Spark Page.

On September 11, the Spark team announced the release of Spark 1.1. This latest version includes a number of significant enhancements:

  • As announced at the Spark Summit, Shark is now converged with Spark SQL.  Databricks has migrated its Shark workloads to Spark, and reports 2X-5X performance improvement.
  • The team has added a library of basic statistics for exploratory analysis, including correlations and hypothesis testing.  There are also new tools for stratified sampling and random generation.
  • Also new to MLlib: utilities for feature extraction for text mining and feature transformation.  Feature extraction techniques include Word2Vec and TF-IDF; transformation techniques include normalization and scaling.
  • New MLlib algorithms include non-negative matrix factorization and singular value decomposition (SVD) using the Lanczos algorithm.  The combination of feature extraction capabilities and a robust SVD gives Spark a strong foundation for text mining.
  • For Spark Streaming, the team has added support for Amazon Kinesis and a streaming linear regression algorithm.
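
To make the text-mining features above concrete, here is a minimal sketch of TF-IDF weighting followed by normalization. It is plain Python and does not use Spark's MLlib API; the function names, the toy documents and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((n_docs + 1) / (d + 1)) for t, d in df.items()}
    # Weight = raw term count in the document times the term's IDF.
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: (w / norm if norm else 0.0) for t, w in vec.items()}

# Hypothetical toy corpus, already tokenized.
docs = [
    "spark sql absorbs shark".split(),
    "spark streaming adds kinesis".split(),
    "spark mllib adds svd".split(),
]
vectors = [l2_normalize(v) for v in tf_idf(docs)]
```

A term that appears in every document ("spark" here) gets a weight of zero, while terms unique to one document dominate that document's vector. Normalized vectors like these are the kind of input a downstream step such as SVD would typically consume.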

There are also many bug fixes, as well as performance and usability improvements.  With ~175 contributors for this release, Spark continues to be one of the most active projects in the Hadoop ecosystem.

Since the release of Spark 1.0, Databricks has announced certification for three additional Spark distributions:

  • Bluedata, a pioneer in big data private cloud.
  • Guavus, an operational intelligence platform.
  • Stratio, a commercially supported open source “Pure Spark” distribution.

In related news, Databricks and O’Reilly Media recently announced a certification program, which will launch October 15-17 at Strata NY + Hadoop World.

Spark Summit 2014 Roundup

Key highlights from the 2014 Spark Summit:

  • Spark is the single most active project in the Hadoop ecosystem
  • Among Hadoop distributors, Cloudera and MapR are clear leaders with Spark
  • SAP now offers a certified Spark distribution and integration with HANA
  • Datastax has delivered a Cassandra connector for Spark
  • Databricks plans to offer a cloud service for Spark
  • Spark SQL will absorb the Shark project for fast SQL
  • Cloudera, MapR, IBM and Intel plan to port Hive to Spark
  • Spark MLlib will double its supported algorithms in the next release

Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event.  Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.

It’s always ironic when manual registration at a tech conference produces long lines.

Databricks CTO Matei Zaharia kicked off the keynotes with his recap of Spark progress since the last summit.   Zaharia enumerated Spark’s two big goals: a unified platform for Big Data applications combined with a standard library for analytics.  CEO Ion Stoica followed with a Databricks update, including an announcement of the SAP alliance and an impressive demo of Databricks Cloud, currently in private beta.  Separately, Databricks announced $33 million in Series B funding.

Spark Release Manager Patrick Wendell delivered an overview of planned development over the next several releases.   Wendell confirmed Spark’s commitment to stable APIs; patches that break the API fail the build.   The project will deliver dot releases every three months beginning in August 2014, and maintenance releases as needed.   Development focus in the near future will be in the libraries:

  • Spark SQL: optimization, extensions (toward SQL 92), integration (NoSQL, RDBMS), incorporation of Shark
  • MLlib: rapid expansion of algorithms (including descriptive statistics, NMF, sparse SVM, LDA), tighter integration with R
  • Streaming: new data sources, tighter Flume integration
  • GraphX: optimizations and API stability

Mike Franklin of Berkeley’s AMPLab summarized new developments in the Berkeley Data Analytics Stack (“BadAss”), including significant new work in genomics and energy, as well as improvements to Tachyon and MLBase.  Dave Patterson elaborated on AMPLab’s work in genomics, providing examples showing how Spark has markedly reduced both cost and runtime for genomic analysis.

Cloudera, Datastax, MapR and SAP demonstrated that the first rule of success is to show up:

  • Mike Olson of Cloudera responded to Hortonworks’ snark by confirming Cloudera’s commitment to Impala as well as Hive on Spark.  Olson drew a round of applause when he invited Hortonworks to join the Hive on Spark consortium.
  • Martin van Ryswyk of Datastax announced immediate availability of a Cassandra driver for Spark, a component that exposes Cassandra tables as Spark RDDs.  Datastax continues to work on tighter integration with Spark, including support for Spark SQL, Streaming and GraphX libraries.  In the breakouts, Datastax delivered a deeper briefing on integration with Spark Streaming.
  • M.C. Srivas of MapR highlighted Spark benefits realized by four MapR customers, including Cisco, a health insurer, an ad platform and a pharma company.  MapR continues to claim support for Shark as a differentiator, a point mooted by the announcement that Spark SQL will soon absorb Shark.
  • Aiaz Kazi of SAP seemed pleased that most of the audience had heard of SAP HANA, and delivered an overview of SAP’s integration with Spark.

IBM wasted a Platinum sponsorship by sending some engineers to talk about “System T”, IBM’s text mining application, with passing references to Spark.  Although IBM Infosphere BigInsights is a certified Spark distribution, IBM appears uncommitted to Spark; the lack of executive presence at the Summit stood out in sharp contrast to Cloudera and MapR.

Silver sponsors Hortonworks and Pivotal hosted tables in the vendor area, but did not present anything.

Neuroscientist Jeremy Freeman, back by popular demand from the 2013 Spark Summit, presented the latest developments in his team’s research into animal brains using Spark as an analytics platform.  Freeman’s presentations are among the best demonstrations of applied analytics that I’ve seen in any forum.

A number of vendors in the Spark ecosystem delivered presentations showing how their applications leverage Spark, including:

The most significant change from the 2013 Spark Summit is the number of reported production users for Spark.  While the December conference focused on Spark’s potential, I counted several dozen production users among the presentations I attended.

Also among the sellout crowd: a SAS executive checking to see if there is anything to this open source and vendor-neutral stuff.  Apparently, he did not get Jim Goodnight’s message that “Big Data is hype manufactured by media”.

Smart Money: Venture Capital for Analytics 2013

Thanks to Crunchbase’s downloadable database, we can report that in 2013 investors poured more than $2 billion into analytics startups, up 38% from 2012.  Crunchbase reports that 2013 funding for analytics ventures was more than five times greater than in 2009.

Source: Crunchbase

Palantir led the pack in new funding, going to the well twice, in October and December, to raise a total of $304m based on a valuation of $9b.  As a point of reference, at 4X revenue, industry leader SAS is worth about $12b.
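
The figures in the paragraph above imply a couple of numbers that are easy to back out. The sketch below is only back-of-the-envelope arithmetic on the amounts quoted there; the variable names are illustrative, and it treats the $9b valuation as the base for the ownership share, which is an assumption since the text does not say whether the valuation is pre- or post-money.

```python
# All amounts in billions of US dollars, taken from the paragraph above.
palantir_raised = 0.304      # total raised across the two 2013 rounds
palantir_valuation = 9.0     # valuation quoted for those rounds
sas_value = 12.0             # "about $12b"
revenue_multiple = 4.0       # "at 4X revenue"

# Revenue implied for SAS by the quoted multiple.
sas_implied_revenue = sas_value / revenue_multiple

# Share of Palantir represented by the new money (assumes the
# valuation is the right denominator; pre- vs post-money is unknown).
palantir_share = palantir_raised / palantir_valuation

print(f"Implied SAS revenue: ${sas_implied_revenue:.1f}b")
print(f"New money as a share of Palantir's valuation: {palantir_share:.1%}")
```

On these assumptions, the comparison rests on roughly $3b in SAS revenue, and Palantir's new investors bought only a small single-digit percentage of the company.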

Funding flowed to companies that build advanced analytics into focused vertical or horizontal solutions.  Examples include:

Investors paid special attention to vendors who specialize in social media analytic platforms:

Capital also flowed to companies offering general-purpose software, platforms and services for analytics, including:

Investors continue to fund startups offering easy-to-use interfaces for the business user, including:

Top investors in Analytics for 2013 include:

Clearly, investors are placing bets on a robust future for analytics.