Key highlights from the 2014 Spark Summit:
- Spark is the single most active project in the Hadoop ecosystem
- Among Hadoop distributors, Cloudera and MapR are clear leaders with Spark
- SAP now offers a certified Spark distribution and integration with HANA
- Datastax has delivered a Cassandra connector for Spark
- Databricks plans to offer a cloud service for Spark
- Spark SQL will absorb the Shark project for fast SQL
- Cloudera, MapR, IBM and Intel plan to port Hive to Spark
- Spark MLlib will double its supported algorithms in the next release
Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event. Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.
It’s always ironic when manual registration at a tech conference produces long lines.
Databricks CTO Matei Zaharia kicked off the keynotes with his recap of Spark progress since the last summit. Zaharia enumerated Spark’s two big goals: a unified platform for Big Data applications combined with a standard library for analytics. CEO Ion Stoica followed with a Databricks update, including an announcement of the SAP alliance and an impressive demo of Databricks Cloud, currently in private beta. Separately, Databricks announced $33 million in Series B funding.
Spark Release Manager Patrick Wendell delivered an overview of planned development over the next several releases. Wendell confirmed Spark’s commitment to stable APIs; patches that break the API fail the build. The project will deliver dot releases every three months beginning in August 2014, and maintenance releases as needed. Development focus in the near future will be in the libraries:
- Spark SQL: optimization, extensions (toward SQL 92), integration (NoSQL, RDBMS), incorporation of Shark
- MLlib: rapid expansion of algorithms (including descriptive statistics, NMF, sparse SVM, LDA), tighter integration with R
- Streaming: new data sources, tighter Flume integration
- GraphX: optimizations and API stability
Mike Franklin of Berkeley’s AMPLab summarized new developments in the Berkeley Data Analytics Stack (“BadAss”), including significant new work in genomics and energy, as well as improvements to Tachyon and MLBase. Dave Patterson elaborated on AMPLab’s work in genomics, providing examples showing how Spark has markedly reduced both cost and runtime for genomic analysis.
Cloudera, Datastax, MapR and SAP demonstrated that the first rule of success is to show up:
- Mike Olson of Cloudera responded to Hortonworks’ snark by confirming Cloudera’s commitment to Impala as well as Hive on Spark. Olson drew a round of applause when he invited Hortonworks to join the Hive on Spark consortium.
- Martin van Ryswyk of Datastax announced immediate availability of a Cassandra driver for Spark, a component that exposes Cassandra tables as Spark RDDs. Datastax continues to work on tighter integration with Spark, including support for Spark SQL, Streaming and GraphX libraries. In the breakouts, Datastax delivered a deeper briefing on integration with Spark Streaming.
- M.C. Srivas of MapR highlighted Spark benefits realized by four MapR customers: Cisco, a health insurer, an ad platform and a pharma company. MapR continues to claim support for Shark as a differentiator, a point mooted by the announcement that Spark SQL will soon absorb Shark.
- Aiaz Kazi of SAP seemed pleased that most of the audience had heard of SAP HANA, and delivered an overview of SAP’s integration with Spark.
IBM wasted a Platinum sponsorship by sending some engineers to talk about “System T”, IBM’s text mining application, with passing references to Spark. Although IBM InfoSphere BigInsights is a certified Spark distribution, IBM appears uncommitted to Spark; the lack of executive presence at the Summit stood out in sharp contrast to Cloudera and MapR.
Silver sponsors Hortonworks and Pivotal hosted tables in the vendor area, but did not present anything.
Neuroscientist Jeremy Freeman, back by popular demand from the 2013 Spark Summit, presented the latest developments in his team’s research into animal brains using Spark as an analytics platform. Freeman’s presentations are among the best demonstrations of applied analytics that I’ve seen in any forum.
A number of vendors in the Spark ecosystem delivered presentations showing how their applications leverage Spark.
The most significant change from the 2013 Spark Summit is the number of reported production users for Spark. While the December conference focused on Spark’s potential, I counted several dozen production users among the presentations I attended.
Also among the sellout crowd: a SAS executive checking to see if there is anything to this open source and vendor-neutral stuff. Apparently, he did not get Jim Goodnight’s message that “Big Data is hype manufactured by media”.