Spark Summit East: A Report (Updated)

Updated with links to slides where available.  Some links are broken, conference organizers have been notified.

Spark Summit East 2015 met on March 18 and 19 at the Sheraton Times Square in New York City.  Conference organizers announced another sellout (like the last two Spark Summits on the West Coast).

Competition for speaking slots at Spark events is heating up.  There were 170 submissions for 30 speaking slots at this event, compared to 85 submissions for 50 slots at Spark Summit 2014.  Compared to the last Spark Summit, presentations in the Applications Track, which I attended, were more polished, and demonstrate real progress in putting Spark to work.

The “father” of Spark, Matei Zaharia, kicked off the conference with a review of Spark progress in 2014 and planned enhancements for 2015.  Highlights of 2014 include:

  • Growth in contributors, from 150 to 500
  • Growth in the code base, from 190K lines to 370K lines
  • More than 500 known production instances at the close of 2014

Spark remains the most active project in the Hadoop ecosystem.

Also, in 2014, a team at Databricks smashed the Daytona GreySort record for petabyte-scale sorting.  The previous record, set in 2013, used MapReduce running on 2,100 machines to complete the task in 72 minutes.  The new record, set by Databricks with Spark running in the cloud, used 207 machines to complete the task in 23 minutes.

Key enhancements projected for 2015 include:

  • DataFrames, which are similar to frames in R, already released in Spark 1.3
  • R interface, which currently exists as SparkR, an independent project, targeted to be merged into Spark 1.4 in June
  • Enhancements to machine learning pipelines, which are sequences of tasks linked together into a process
  • Continued expansion of smart interfaces to external data sources, pushing logic into the sources
  • Spark packages — a repository for third-party packages (comparable to CRAN)

Databricks CEO Ion Stoica followed with a pitch for Databricks Cloud, which included brief testimonials from myfitnesspal, Automatic, Zoomdata, Uncharted Software and Tresata.

Additional keynoters included Brian Schimpf of Palantir, Matthew Glickman of Goldman Sachs and Peter Wang of Continuum Analytics.

Spark contributors presented detailed views on the current state of Spark:

  • Michael Armbrust, Spark SQL lead developer presented on the new DataFrames API and other enhancements to Spark SQL.
  • Tathagata Das delivered a talk on the current state and future of Spark Streaming.
  • Joseph Bradley covered MLLib, focusing on the Pipelines capability added in Spark 1.2
  • Ankur Dave offered an overview of GraphX, Spark’s graph engine.

Several observations from the Applications track:

(1) Geospatial applications had a strong presence.

  • Automatic, Tresata and Uncharted all showed live demonstrations of marketable products with geospatial components running on Spark
  • Mansour Raad of ESRI followed his boffo performance at Strata/Hadoop World last October with a virtuoso demonstration of Spark with massive spatial and temporal datasets and the ESRI open source GIS stack

(2) Spark provides a great platform for recommendation engines.

  • Comcast uses Spark to serve personalized recommendations based on analysis of billions of machine-generated events
  • Gilt Groupe uses Spark for a similar real-time application supporting flash sale events, where products are available for a limited time and in limited quantities
  • Leah McGuire of Salesforce described her work building a recommendation system using Spark

(3) Spark is gaining credibility in retail banking.

  • Sandy Ryza of Cloudera presented on Value At Risk (VAR) computations in Spark, a critical element in Basel reporting and stress testing
  • Startup Tresata demonstrated its application for Anti Money Laundering, which is built on a social graph built in Spark

(4) Spark has traction in the life sciences

  • Jeremy Freeman of HHMI Janelia Research Center, a regular presenter at Spark Summits, covered Spark’s unique capability for streaming machine learning.
  • David Tester of Novartis presented plans to build a trillion-edge graph for genomic integration
  • Timothy Danforth of Berkeley’s AMPLab delivered a presentation on next-generation genomics with Spark and ADAM
  • Kevin Mader of ETH Zurich spoke about turning big hairy 3D images into simple, robust, reproducible numbers without resorting to black boxes or magic

Also in the applications track: presenters from Baidu, myfitnesspal and Shopify.

2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014.  Thought it would be fun to validate how well the crystal ball works.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with product certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite.  But the Mahout team has announced that all new projects must use a standard DSL that runs the job in Spark.

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Co-locating your analytics in the Hadoop cluster is less attractive than integrating your analytics with Hadoop.  With Spark fully integrated with Hadoop storage APIs, co-located solutions seem much less attractive.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

Graph analysis is really useful in the right hands, but organizations are still trying to figure out what to do with it.  That is why we still see posts like this; when something is hot, nobody writes articles about what to do with it; everyone knows what to do with it.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and the method tends to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops.  Tez isn’t really comparable to Giraph and GraphLab.  And right after I wrote this, the GraphLab open source project pretty much died.   GraphLab Inc., the commercial venture incepted to commercialize the open source project, is fiddling around with other stuff.   Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchin updates his analysis for 2014.  But the trend is your friend:

fig_1b_rvsas_2014-2-23

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

Speaking with enterprise customers, I like to ask why they switched from SAS to R.  The #1 response: the people we hire know R already, not SAS.  SAS’ free “University Edition” is an attempt to stem the bleeding that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs.  SAS has delivered an ACCESS engine to HANA but not much else considering the talk about joint solutions.  SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, which is an improvement over 2013 when they didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with AlteryxAlpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.

Apache Spark is 1.0

Today, the Spark project announced availability of Apache Spark 1.0.0, the first major release since the Apache Foundation named Spark a top-level project.  (Additional announcements here, here and here). With 117 contributors, Spark continues to build critical mass and engagement in the data science community.

Features of the new release include:

  • API stability
  • Integration with YARN security
  • Operational and packaging improvements
  • Spark SQL (Alpha)
  • MLLib enhancements, including
    • Support for sparse feature vectors
    • Scalable decision trees for classification and regression
    • Distributed SVD and PCA
    • Model evaluation functions
    • L-BFGS optimization primitive
  • GraphX enhancements, including performance improvements in graph loading, edge reversal and neighborhood computation
  • Streaming enhancements, including optimized performance for stateful stream transformations, improved Flume support and automated state cleanup for long-running jobs
  • Extended Java and Python support
  • Significant improvements to documentation

…and many small improvements, documented in the Release Notes.

For more information on Spark, read this backgrounder.

 

 

 

 

Machine Learning in Hadoop: Part Two

This is the second of a three-part series on the current state of play for machine learning in Hadoop.  Part One is here.  In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines.   This post will focus on machine learning, but it’s worth nothing that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first category.  The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase.  RHIPE, a project led by Mozilla’s Suptarshi Guha, offers similar functionality, but without the HBase integration.

Both projects enable R users to implement explicit parallelization in MapReduce.  R users write R scripts specifically intended to be run as mappers and reducers in Hadoop.  Users must have MapReduce skills, and must refactor program logic for distributed execution.  There are some differences between the two projects:

  • RHadoop uses standard R functions for Mapping and Reducing; RHIPE uses unevaluated R expressions
  • RHIPE users work with data in key,value pairs; RHadoop loads data into familar R data frames
  • As noted above, RHIPE lacks an interface to HBase
  • Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE

Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLLib.  Both projects have commercial backing, and show robust development activity.  Statistics from GitHub for the thirty days ended February 12 show the following:

  • 0xdata H2O: 18 contributors, 938 commits
  • Apache Spark: 77 contributors, 794 commits

H2O is a project of startup 0xdata, which operates on a services and support business model.  Recent coverage by this blog here;  additional coverage here, here and here.

MLLib is one of several projects included in Apache Spark.  Databricks and Cloudera offer commercial support.  Recent coverage by this blog here and here; additional coverage here, here, here and here.

As of this writing, H2O has more built-in analytic features than MLLib, and its R interface is more mature.  Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is solely focused on machine learning.

Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in its partnership with other machine learning vendors, such as SAS.  There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project.  Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.

Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit.  Development activity on these projects is much less robust than for H2O and Spark.  GitHub statistics for the thirty days ended February 12 speak volumes:

  • Apache Mahout: contributors, 54 commits
  • Vowpal Wabbit: 8 contributors, 57 commits

Neither project has significant commercial backing.  Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project.  (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout).  John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.

Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition.  Mahout’s supervised learning algorithms, however, are weak.  Criticisms of Mahout tend to fall into two categories:

  • The project itself is a mess
  • Mahout’s integration into MapReduce is suitable only for high latency analytics

On the first point, Mahout certainly does seem eclectic, to say the least.  Some of the algorithms are distributed, others are single-threaded; others are simply imported from other projects.  Many algorithms are underdeveloped, unsupported or both.  The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.

“High latency” is code for slow.  Slow food is a thing; “slow analytics” is not a thing.  The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.

Vowpal Wabbit has its advocates among data scientists; it is fast, feature rich and runs in Hadoop.  Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis.  Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.

In Part Three, we will cover commercial software for machine learning in Hadoop.