Spark Summit 2015: Preliminary Report

So I guess Spark really is enterprise ready.  Nick Heudecker, call your office.

There are several key themes coming from the Summit:

Spark Continues to Mature

Spark and its contributors deserve a round of applause.  Some key measures of growth since the 2014 Summit:

  • Contributor headcount increased from 255 to 730
  • Committed lines of code increased from 175K to 400K

There is increasing evidence of Spark’s scalability:

  • Largest cluster: 8,000 nodes
  • Largest job: 1 petabyte
  • Top streaming intake: 1TB/hour

Project Tungsten aims to make Spark faster and prepare it for the next five years; the project has already achieved significant performance improvements through better use of memory and CPU.

IBM and Spark

IBM dropped the big one with its announcement.  Key bits:

  • IBM will build Spark into the core of its analytic and commerce products, including IBM Watson Health Cloud.
  • IBM will open source its machine learning library (SystemML) and work with Databricks to port it to Spark.
  • IBM will offer Spark as a Cloud service on Bluemix.
  • IBM will commit 3,500 developers to Spark-related projects.
  • IBM (and its partners) will train more than a million people on Spark.

I will post separately on this next week.

Spark is Enterprise-Ready

If IBM’s announcement is not sufficient to persuade skeptics, presentations from Adobe, Airbnb, Baidu, Capital One, CIA, NASA/JPL, NBC Universal, Netflix, Thomson Reuters, Toyota and many others demonstrate that Spark already supports enterprise-level workloads.

In one of the breakouts, Arsalan Tavakoli-Shiraji of Databricks presented results from his analysis of more than 150 production deployments of Spark.  As expected, organizations use Spark for BI and advanced analytics; the big surprise is that 60% use non-HDFS data sources.  These organizations use Spark for data consolidation on the fly, decoupling compute from storage, with unification taking place on the processing layer.

Databricks Cloud is GA

Enough said.

SparkR

Spark 1.4 includes R bindings, opening Spark to the large community of R users.  Out of the gate, the R interface enables the R user to leverage Spark DataFrames; the Spark team plans to extend the capability to include machine learning APIs in Spark 1.5.

Spark’s Expanding Ecosystem

Every major Hadoop distributor showed up this year, but there were no major announcements from the distributors (other than IBM’s bombshell).

In other developments:

  • Amazon Web Services announced availability of a new Spark on EMR service
  • Intel announced a new Streaming SQL project for Spark
  • Lucidworks showcased its Fusion product, with Spark embedded
  • Alteryx announced its plans to integrate with Spark in its Release 10

One interesting footnote: while there were a number of presentations about Tachyon last year, there were none this year.

These are just the key themes.  I’ll publish a more detailed story next week.

Strata + Hadoop World 2014

A sellout crowd of 5,500 met at the Javits Center in New York last week for the 2014 Strata + Hadoop World conference.  There were three major themes:

Big Data in Action.   In his keynote address, Mike Olson of Cloudera noted the shift from talking about “geeky projects like Pig, Sqoop and Oozie” to talking about applications, such as fraud detection, product design and agriculture.   An entire track in the conference featured success stories from companies such as Goldman Sachs, Transamerica, American Express, L.L. Bean, FICO and Kaiser Permanente.

Symbiosis of Analytics and Big Data.  Paul Zikopoulos of IBM observed that “Big Data without analytics is just a bunch of data.”   Zikopoulos drew an analogy to the mining industry, which uses advanced technology to extract trace amounts of valuable material from large quantities of low-grade ore; in Big Data, we use advanced analytics to extract useful insight from large quantities of data with low value per byte.  Conference sessions reflected the critical role analytic technology plays in the Big Data value chain.

Spark has arrived.  The 2013 conference included two sessions about Spark; this year, thirteen sessions featured Spark, including the sold-out, full-day Spark Camp.  Moreover, vendors such as ClearStory Data and Platfora openly touted Spark integration, in the belief that this capability resonates with buyers.  Other conference sponsors recently certified on Spark include Pentaho, Skytree, Tableau, Talend and Trifacta; and MapR announced a project to deliver Apache Drill on Spark.

Among the notable Spark sessions:

  • Sean Owen of Cloudera delivered an excellent demonstration of Spark’s MLlib machine learning library for anomaly detection
  • Michael Armbrust of Databricks presented on Spark SQL and its uses as both a query language and a general framework for working with structured data

Advancing a theme he introduced last year, Olson speculated in his keynote that Hadoop will “disappear” this year because enterprises increasingly view Hadoop in the context of an overall data management strategy.  He cited the recent Teradata-Cloudera partnership as evidence of this trend.  That announcement is certainly significant, but it demonstrates the opposite of Olson’s high-level point; Teradata abandoned its exclusive relationship with Hortonworks because many of its customers prefer Cloudera to HDP, and they aren’t willing to switch simply because Teradata sells a “Unified Data Architecture.”  Most enterprises still make decisions about Hadoop separately from decisions about other elements in the warehousing mix, and there are currently few good reasons to change that behavior.

Rana El Kaliouby of Affectiva presented an excellent example of analytics and Big Data working together.  Affectiva uses streaming facial recognition to capture millions of data points as consumers react to content, and uses machine learning algorithms to draw insight from the data.  By mapping the streaming data to emotional states, they can identify what content resonates with consumers.

Several of the sponsored topics in the plenary sessions were quite good, including presentations by MapR, Intel, ClearStory and IBM; others were about what one expects from sponsored presentations.

There were also a number of entertaining presentations that had little to do with Big Data.  Shankar Vedantam of NPR, for example, spent ten minutes sermonizing about the propensity of the human mind to select facts that confirm existing biases, and selectively used facts to illustrate his point.  He should have paid attention in “Research Methods 101”; at best, his point seemed trite, like telling a convention of nutritionists that “dieting is hard.”

Eli Collins of Cloudera delivered the obligatory “ethics and Big Data” piece, in which he argued that we should “use data for good”; his piece was immediately followed, ironically, by a presentation about using facial recognition to get people to buy more candy.  Everyone agrees that doing good is a good thing, but a technologist delivering a sermon is as silly as a Baptist minister lecturing on Oozie.