Spark Summit East: A Report (Updated)

Updated with links to slides where available.  Some links are broken; conference organizers have been notified.

Spark Summit East 2015 met on March 18 and 19 at the Sheraton Times Square in New York City.  Conference organizers announced another sellout (like the last two Spark Summits on the West Coast).

Competition for speaking slots at Spark events is heating up.  There were 170 submissions for 30 speaking slots at this event (an acceptance rate of about 18%), compared to 85 submissions for 50 slots (about 59%) at Spark Summit 2014.  Compared to the last Spark Summit, presentations in the Applications Track, which I attended, were more polished and demonstrated real progress in putting Spark to work.

The “father” of Spark, Matei Zaharia, kicked off the conference with a review of Spark progress in 2014 and planned enhancements for 2015.  Highlights of 2014 include:

  • Growth in contributors, from 150 to 500
  • Growth in the code base, from 190K lines to 370K lines
  • More than 500 known production instances at the close of 2014

Spark remains the most active project in the Hadoop ecosystem.

Also, in 2014, a team at Databricks smashed the Daytona GraySort record for petabyte-scale sorting.  The previous record, set in 2013, used MapReduce running on 2,100 machines to complete the task in 72 minutes.  The new record, set by Databricks with Spark running in the cloud, used 207 machines to complete the task in 23 minutes.
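To put those two runs side by side, here is a quick back-of-the-envelope calculation in Python.  It uses only the machine counts and run times quoted above; "machine-minutes" is my own rough yardstick for illustration, not a metric defined by the sort benchmark, and it ignores differences in hardware between the two runs.

```python
# Back-of-the-envelope comparison of the two sort records quoted in the text.
# "Machine-minutes" is an illustrative yardstick (an assumption of this sketch),
# not an official benchmark metric, and it ignores hardware differences.

def machine_minutes(machines: int, minutes: int) -> int:
    """Total machine time consumed by a run."""
    return machines * minutes

previous = machine_minutes(2100, 72)  # 2013 record, MapReduce
spark = machine_minutes(207, 23)      # 2014 record, Spark

print(f"Previous record: {previous:,} machine-minutes")
print(f"Spark record:    {spark:,} machine-minutes")
print(f"Wall-clock speedup:   {72 / 23:.1f}x")
print(f"Fewer machines:       {2100 / 207:.1f}x")
print(f"Machine-minute ratio: {previous / spark:.1f}x")
```

In other words, the Spark run finished about three times faster on roughly a tenth of the machines.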

Key enhancements projected for 2015 include:

  • DataFrames, which are similar to data frames in R, already released in Spark 1.3
  • R interface, which currently exists as SparkR, an independent project, targeted to be merged into Spark 1.4 in June
  • Enhancements to machine learning pipelines, which are sequences of tasks linked together into a process
  • Continued expansion of smart interfaces to external data sources, pushing logic into the sources
  • Spark packages — a repository for third-party packages (comparable to CRAN)
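For readers new to the pipeline idea mentioned in the list above (a sequence of tasks linked together into a process), here is a minimal sketch in plain Python.  This is not Spark's API; the `make_pipeline` helper and the three stage functions are hypothetical names invented for illustration, and a real Spark pipeline would run over distributed data rather than a Python list.

```python
from typing import Callable, Iterable, List

# A stage is any function that takes a list of records and returns a new list.
# This is a conceptual sketch only, not the Spark ML Pipelines API.
Stage = Callable[[List[dict]], List[dict]]

def make_pipeline(stages: Iterable[Stage]) -> Stage:
    """Chain stages so that the output of each one feeds the next."""
    stage_list = list(stages)

    def run(records: List[dict]) -> List[dict]:
        for stage in stage_list:
            records = stage(records)
        return records

    return run

# Hypothetical stages, for illustration only.
def tokenize(records: List[dict]) -> List[dict]:
    return [{**r, "tokens": r["text"].lower().split()} for r in records]

def count_tokens(records: List[dict]) -> List[dict]:
    return [{**r, "n_tokens": len(r["tokens"])} for r in records]

def flag_long(records: List[dict]) -> List[dict]:
    return [{**r, "is_long": r["n_tokens"] > 3} for r in records]

pipeline = make_pipeline([tokenize, count_tokens, flag_long])
result = pipeline([
    {"text": "Spark Summit East"},
    {"text": "DataFrames arrived in Spark 1.3"},
])

for row in result:
    print(row["text"], row["n_tokens"], row["is_long"])
```

The appeal of the pattern is that the whole chain becomes a single object that can be reused, swapped out stage by stage, or applied to new data in one call.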

Databricks CEO Ion Stoica followed with a pitch for Databricks Cloud, which included brief testimonials from myfitnesspal, Automatic, Zoomdata, Uncharted Software and Tresata.

Additional keynoters included Brian Schimpf of Palantir, Matthew Glickman of Goldman Sachs and Peter Wang of Continuum Analytics.

Spark contributors presented detailed views on the current state of Spark:

  • Michael Armbrust, Spark SQL lead developer, presented on the new DataFrames API and other enhancements to Spark SQL.
  • Tathagata Das delivered a talk on the current state and future of Spark Streaming.
  • Joseph Bradley covered MLlib, focusing on the Pipelines capability added in Spark 1.2.
  • Ankur Dave offered an overview of GraphX, Spark’s graph engine.

Several observations from the Applications Track:

(1) Geospatial applications had a strong presence.

  • Automatic, Tresata and Uncharted all showed live demonstrations of marketable products with geospatial components running on Spark
  • Mansour Raad of ESRI followed his boffo performance at Strata/Hadoop World last October with a virtuoso demonstration of Spark with massive spatial and temporal datasets and the ESRI open source GIS stack

(2) Spark provides a great platform for recommendation engines.

  • Comcast uses Spark to serve personalized recommendations based on analysis of billions of machine-generated events
  • Gilt Groupe uses Spark for a similar real-time application supporting flash sale events, where products are available for a limited time and in limited quantities
  • Leah McGuire of Salesforce described her work building a recommendation system using Spark

(3) Spark is gaining credibility in retail banking.

  • Sandy Ryza of Cloudera presented on Value at Risk (VaR) computations in Spark, a critical element in Basel reporting and stress testing
  • Startup Tresata demonstrated its application for Anti Money Laundering, which is built on a social graph built in Spark

(4) Spark has traction in the life sciences.

  • Jeremy Freeman of HHMI Janelia Research Center, a regular presenter at Spark Summits, covered Spark’s unique capability for streaming machine learning.
  • David Tester of Novartis presented plans to build a trillion-edge graph for genomic integration
  • Timothy Danford of Berkeley’s AMPLab delivered a presentation on next-generation genomics with Spark and ADAM
  • Kevin Mader of ETH Zurich spoke about turning big hairy 3D images into simple, robust, reproducible numbers without resorting to black boxes or magic

Also in the Applications Track: presenters from Baidu, myfitnesspal and Shopify.

SAS Misses 2014 Growth Forecast

At the beginning of 2014, SAS EVP and CMO Jim Davis predicted double-digit revenue growth for 2014; in October, CEO Jim Goodnight walked that back to 5%, citing a challenging business climate in Europe.  Today, SAS announced 2014 revenue of $3.09 billion, up 2.3%.

Meanwhile, IBM reported growth in analytics revenue of 7% in Q4.

The challenge for SAS is that the US market is saturated: virtually every enterprise that ever will use SAS already does so, and there are limits to the number of new products one can add to the stack.  Much of SAS’ growth comes from overseas, and a strong dollar impairs SAS’ ability to sell in foreign markets.

On the positive side, SAS reports a total of 3,400 sites for SAS Visual Analytics, its “Tableau-killer”, compared to 1,400 sites announced last year, for a net growth of 2,000 sites.  (In SAS’ parlance, a “site” is roughly equivalent to a server.)  Tableau has not yet released its 2014 results, but in Q3 Tableau reported that it added 2,500 customer accounts.

SAS also reports 24% revenue growth for its cloud services.   IT analyst Synergy Research Group reports that the cloud market is growing at a 49% annualized rate, although AWS, Microsoft, IBM and Google are all growing much faster than that.

In other news, the WSJ reports that Big Data analytics startup Palantir is now valued at $15 billion, which is about the same as what it would cost an acquirer to buy SAS at 5X revenue.
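The comparison is easy to check.  The sketch below uses only the figures quoted in this post (SAS' announced 2014 revenue, the 5X multiple, and the reported Palantir valuation); the 5X multiple is a rule of thumb assumed here, not an actual offer or appraisal.

```python
# Rough check of the SAS vs. Palantir comparison, using figures quoted in the post.
# The 5X revenue multiple is an assumption for illustration, not an appraisal.

SAS_REVENUE_2014_BN = 3.09    # SAS' announced 2014 revenue, in $ billions
REVENUE_MULTIPLE = 5          # multiple assumed in the text
PALANTIR_VALUATION_BN = 15.0  # valuation reported by the WSJ, in $ billions

sas_implied_price_bn = SAS_REVENUE_2014_BN * REVENUE_MULTIPLE
gap_bn = sas_implied_price_bn - PALANTIR_VALUATION_BN

print(f"SAS at {REVENUE_MULTIPLE}X revenue: ${sas_implied_price_bn:.2f} billion")
print(f"Palantir valuation: ${PALANTIR_VALUATION_BN:.2f} billion")
print(f"Difference:         ${gap_bn:.2f} billion")
```

At that multiple the two figures land within about half a billion dollars of each other.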

Smart Money: Venture Capital for Analytics 2013

Thanks to Crunchbase’s downloadable database, we can report that in 2013 investors poured more than $2 billion into Analytic startups, up 38% from 2012.  Crunchbase reports 2013 funding for Analytics ventures more than five times greater than in 2009.

Source: Crunchbase

Palantir led the pack in new funding, going to the well twice, in October and December, to raise a total of $304m based on a valuation of $9b.  As a point of reference, at 4X revenue, industry leader SAS is worth about $12b.

Funding flowed to companies that build advanced analytics into focused vertical or horizontal solutions.  Examples include:

Investors paid special attention to vendors who specialize in social media analytic platforms:

Capital also flowed to companies offering general-purpose software, platforms and services for analytics, including:

Investors continue to fund startups offering easy-to-use interfaces for the business user, including:

Top investors in Analytics for 2013 include:

Clearly, investors are placing bets on a robust future for analytics.