Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

“Spark is overhyped!” he declared. His evidence? A Gartner Hype Cycle chart.


One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which raises the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
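As a rough illustration of the point about running experiments in parallel, here is a minimal single-machine sketch in Python. It is not Spark code: a thread pool stands in for the cluster, and the toy data, the parameter grid, and the names run_experiment and parallel_search are hypothetical. On a Spark cluster the same shape would look roughly like sc.parallelize(grid).map(run_experiment), with the workers spread across machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy data following y = 3x + 1 (hypothetical, for illustration only).
DATA = [(x, 3.0 * x + 1.0) for x in range(10)]

def run_experiment(params):
    """Score one candidate (slope, intercept) pair by mean squared error."""
    slope, intercept = params
    mse = sum((slope * x + intercept - y) ** 2 for x, y in DATA) / len(DATA)
    return params, mse

def parallel_search(grid, workers=4):
    """Evaluate every candidate concurrently and return the lowest-error one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_experiment, grid))
    return min(results, key=lambda r: r[1])

# Twelve independent experiments: 4 slopes x 3 intercepts.
grid = list(product([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0]))
best_params, best_mse = parallel_search(grid)
print(best_params, best_mse)  # (3.0, 1.0) 0.0
```

Each experiment is independent of the others, which is why the work splits cleanly across workers; the only coordination needed is collecting the scores at the end.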

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, the Spark team has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and made many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLlib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms.)

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGBoost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

Big Analytics Roundup (November 16, 2015)

Just three main stories this week: possible trouble for a pair of analytic startups; Google releases TensorFlow to open source; and H2O delivers new capabilities at its annual meeting.

In other news, the Spark team announces Release 1.5.2, a maintenance release; and the Mahout guy announces Release 0.11.1, with bug fixes and performance improvements. (h/t Hadoop Weekly)

Two items of note from the Databricks blog:

— Darin McBeath describes Elsevier’s Spark use case and introduces spark-xml-utils, a Spark package contributed by his team. The package enables the Spark user to filter documents based on an XPath expression, return specific nodes for an XPath/XQuery expression and transform documents using an XSLT stylesheet.

— Rachit Agarwal and Anurag Khandelwal of Berkeley’s AMPLab introduce Succinct, a distributed datastore for queries on compressed data. They announce the release of Succinct Spark, a Spark package that enables search, count, range and random access queries on compressed RDDs. The authors claim a 75X performance advantage over native Spark when using Succinct as a document store.

Three interesting stories on streaming data:

  • In a podcast, Data Artisans CTO Stephan Ewen discusses Flink, Spark and the Kappa architecture.
  • Techalpine’s Kaushik Pal compares Spark and Flink for streaming data.
  • Will McGinnis helps you get started with Python and Flink.

(1) Analytic Startups in Trouble

In The Information, Steve Nellis and Peter Schulz explain why startups return to the funding well frequently — and why those that don’t may be in trouble.  Venture funding isn’t a perfect indicator of success, but is often the only indicator available.  On the list: Skytree Software and Alpine Data Labs.

(2) Google Releases TensorFlow for Machine Learning

On the Google Research blog, Google announces open source availability of TensorFlow.  TensorFlow is Google’s second generation machine learning system; it supports Deep Learning as well as any computation that can be expressed as a flow graph.   Read this white paper for details of the system.  At present, there are Python and C++ APIs;  Google notes that the C++ API may offer some performance advantages.

Video intro here.

In Wired, Cade Metz reports; Erik T. Mueller dismisses; and Metz returns to note that Deep Learning can leverage GPUs, and that AI’s future is in data, as if we didn’t know these things already.

On Slate, Will Oremus feels the buzz.

On his eponymous blog, Sachin Joglekar explains how to do k-means clustering with TensorFlow.

Separately, in VentureBeat, Jordan Novet rounds up open source frameworks for Deep Learning.

(3) H2O Releases Steam

It’s not a metaphor.  At its second annual H2O World event, H2O releases Steam, an open source data science hub that bundles model selection, model management and model scoring into a single container for elastic deployment.

On the H2O Blog, Yotam Levy wraps Day One, Day Two and Day Three of the H2O World event.  Speaker videos are here, slides here.  (Registration required.)  Some notable presentations:

— H2O: Tomas Nykodym on GLM; Mark Landry on GBM and Random Forests; Arno Candel on Deep Learning; Erin LeDell on Ensemble Modeling.

— Michal Malohlava of H2O and Richard Garris of Databricks explain how to run H2O on Databricks Cloud.  Separately, Michal demonstrates Sparkling Water, a Spark package that enables a Spark user to call H2O algorithms; Nidhi Mehta leads a hands-on with PySparkling Water;  and Xavier Tordoir of Data Fellas exhibits Interactive Genomes Clustering with Sparkling Water on the Spark Notebook.

— Szilard Pafka of Epoch summarizes his work to date benchmarking R, Python, Vowpal Wabbit, H2O, xgboost and Spark MLlib. As reported previously, Pafka’s benchmarks show that H2O and xgboost are the best performers; they are faster and deliver more accurate models.

As reported in last week’s roundup, H2O also announces a $20 million “B” round.

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.


  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, graduation of Spark SQL from Alpha, new algorithms in MLlib and Spark Streaming, a direct Kafka API for Spark Streaming, plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics with Spark, writes on tuning Spark jobs on the Cloudera Engineering blog.
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process Terabyte scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on Git.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.


  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.


High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.  Ryft’s Christian Shrauder blogs about FPGA.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze high-dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab’s Jiannen Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.


SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that Airbnb has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal is an application that “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.


Automated Predictive Modeling

A colleague asks: can we automate predictive modeling?

How we answer the question depends on the context.   Consider the two variations on the question below, with more precise wording:

  1. Can we completely eliminate the need for expertise in predictive modeling — so that an “ordinary business user” can do it?
  2. Can we make expert analysts more productive by automating certain repetitive tasks?

The first form of the question — the search for “business user” analytics — is a common vision among software marketing folk and industry analysts; it is based on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics.   That premise is largely false, for reasons that warrant a separate blog post; for now, let’s just stipulate that the answer is no, it is not possible to eliminate human expertise from predictive modeling, for the same reason that robotic surgery does not eliminate the need for cardiologists.

However, if we focus on the second form of the question and concentrate on how to make expert analysts more productive, the situation is much more promising.  Many data preparation tasks are easy to automate; these include such tasks as detecting and eliminating zero-variance columns, treating missing values and handling outliers.  The most promising area for automation, however, is in model testing and assessment.
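To make the data preparation tasks named above concrete, here is a minimal sketch in plain Python. The column names, the use of None for missing values, the choice of median imputation, and the fixed clipping bounds are all assumptions made for illustration; a real pipeline would choose these rules per data set.

```python
from statistics import median, pvariance

def drop_zero_variance(columns):
    """Keep only columns whose non-missing values actually vary."""
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) > 1 and pvariance(present) > 0:
            kept[name] = values
    return kept

def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def clip_outliers(values, lower, upper):
    """Cap every value at fixed lower and upper bounds."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical raw data: one missing age, one implausible age, one constant column.
raw = {
    "age": [34, None, 29, 41, 250],
    "country_code": [1, 1, 1, 1, 1],
}

clean = {
    name: clip_outliers(impute_median(values), 0, 100)
    for name, values in drop_zero_variance(raw).items()
}
print(clean)  # {'age': [34, 37.5, 29, 41, 100]}
```

Each rule is mechanical once the analyst has chosen it, which is what makes this stage easy to automate.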

Optimizing a predictive model requires experimentation and tuning.  For any given problem, there are many available modeling techniques, and for each technique there are many ways to specify and parameterize a model.  For the most part, trial and error is the only way to identify the best model for a given problem and data set.  (The No Free Lunch theorem formalizes this concept.)

Since the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model.  In applied predictive analytics, however, the analyst’s time is strictly limited; a client in the marketing services industry reports an SLA of thirty minutes or less to build a predictive model.  Strict time constraints do not permit much time for experimentation.

Analysts tend to deal with this problem by settling for sub-optimal models, arguing that models need only be “good enough,” or defending use of one technique above all others.  As clients grow more sophisticated, however, these tactics become ineffective.  In high-stakes hard-money analytics — such as trading algorithms, catastrophic risk analysis and fraud detection — small improvements in model accuracy have a bottom line impact, and clients demand the best possible predictions.

Automated modeling techniques are not new.  Before Unica launched its successful suite of marketing automation software, the company’s primary business was advanced analytics, with a particular focus on neural networks.  In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model.   Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four different types of predictive models.  Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.

Two other commercial attempts at automated predictive modeling date from the late 1990s.  The first, MarketSwitch, was less than successful.  MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers”.  Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine and replaced the “automated modeling” capability with outsourced analytic services.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization.   The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for Marketing analytics, which it attempted to sell directly to C-level executives.  This effort was modestly successful, leading to sale of the company in 2013 to SAP for an estimated $40 million.

In the last several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products.  In 2010, SAS introduced SAS Rapid Modeler, an add-in to SAS Enterprise Miner.  Rapid Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection and model selection.  The user specifies a data set and response measure; Rapid Modeler determines whether the response is continuous or categorical, and uses this information together with other diagnostics to test a range of modeling techniques.  The user can control the scope of techniques to test by selecting basic, intermediate or advanced methods.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster and Auto Numeric nodes.  The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning and variable recasting.   The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules and set limits on model training.

All of the software products discussed so far are commercially licensed.  There are two open source projects worth noting: the caret package in open source R and the MLBase project.  The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques.   The package includes pre-processing tools to support tasks such as dummy coding, detecting zero variance predictors, identifying correlated predictors as well as tools to support model training and tuning.  The training function in caret currently supports 149 different modeling techniques; it supports parameter optimization within a selected technique, but does not optimize across techniques.  To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results.

MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group, is an ambitious effort to develop a scalable machine learning platform on Apache Spark.  The ML Optimizer seeks to simplify machine learning problems for end users by automating the model selection task so that the user need only specify a response variable and set of predictors.   The Optimizer project is still in active development, with Alpha release expected in 2014.

What have we learned from various attempts to implement automated predictive modeling?  Commercial startups like KXEN and MarketSwitch only marginally succeeded because they tried to oversell the concept as a means to replace the analyst altogether.  Most organizations understand that human judgement plays a key role in analytics, and they aren’t willing to entrust hard money analytics entirely to a black box.

What will the next generation of automated modeling platforms look like?  There are seven key features that are critical for an automated modeling platform:

  • Automated model-dependent data transformations
  • Optimization across and within techniques
  • Intelligent heuristics to limit the scope of the search
  • Iterative bootstrapping to expedite search
  • Massively parallel design
  • Platform agnostic design
  • Custom algorithms

Some methods require data to be transformed in certain specific ways; neural nets, for example, typically work with standardized predictors, while Naive Bayes and CHAID require all predictors to be categorical.  The analyst should not have to perform these operations manually; instead, the transformation operations should be built into the test plan script and run automatically; this ensures the maximum number of possible techniques for any data set.
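One way to build such transformations into a test plan is a lookup from technique to the transform it requires. The sketch below is an assumption-laden illustration in plain Python: the registry TRANSFORMS, the technique names, and the choice of equal-width binning are hypothetical, not any product's actual behavior.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to zero mean and unit variance (assumes the column is not constant)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def bin_equal_width(values, bins=3):
    """Turn a numeric predictor into categorical bin labels 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:  # constant column: everything falls in one bin
        return [0] * len(values)
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Hypothetical registry: the transform each technique needs before training.
TRANSFORMS = {
    "neural_net": standardize,       # standardized predictors
    "naive_bayes": bin_equal_width,  # categorical predictors
    "decision_tree": list,           # assumed to take the predictor unchanged
}

def prepare(technique, values):
    """Apply the transform registered for the given technique."""
    return TRANSFORMS[technique](values)

income = [10, 20, 30, 40, 50, 60]
print(prepare("naive_bayes", income))  # [0, 0, 1, 1, 2, 2]
print(prepare("neural_net", income))
```

Because the transform is selected by technique name, the same raw column can feed every technique in the plan without manual rework by the analyst.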

To find the best predictive model, we need to be able to search across techniques and to tune parameters within techniques.  Potentially, this can mean a massive number of model train-and-test cycles to run; we can use heuristics to limit the scope of techniques to be evaluated based on characteristics of the response measure and the predictors.   (For example, a categorical response measure rules out a number of techniques, and a continuous response measure rules out a different set of techniques).  Instead of a brute force search for the best technique and parameterization, a “bootstrapping” approach can use information from early iterations to specify subsequent tests.
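The search described above can be sketched in a few lines of Python. This is one simple reading of the idea, not a reference design: the catalog of techniques, the parameter ranges, and toy_score (a stand-in for a real train-and-test cycle) are all hypothetical. The heuristic prunes techniques by response type, and the second round of tests is a finer grid centered on the best result of the first round, which is a coarse-to-fine search rather than any specific published algorithm.

```python
# Hypothetical catalog: supported response type and parameter range per technique.
CATALOG = {
    "logistic_regression": {"response": "categorical", "range": (0.0, 1.0)},
    "naive_bayes": {"response": "categorical", "range": (0.0, 2.0)},
    "linear_regression": {"response": "continuous", "range": (0.0, 1.0)},
}

# Stand-in for model quality: each technique peaks at some parameter value.
PEAKS = {
    "logistic_regression": (0.30, 0.90),
    "naive_bayes": (1.50, 0.80),
    "linear_regression": (0.70, 0.85),
}

def toy_score(technique, param):
    """Pretend train-and-test cycle; higher is better."""
    peak, ceiling = PEAKS[technique]
    return ceiling - (param - peak) ** 2

def grid(lo, hi, steps):
    """Evenly spaced parameter values from lo to hi inclusive."""
    return [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]

def search(response_type, score=toy_score):
    """Return (score, technique, parameter) for the best model found."""
    best = None
    for name, spec in CATALOG.items():
        if spec["response"] != response_type:
            continue  # heuristic: skip techniques that cannot fit this response
        lo, hi = spec["range"]
        # Round 1: coarse grid over the full range.
        coarse = max(grid(lo, hi, 5), key=lambda p: score(name, p))
        # Round 2: finer grid centered on the round-1 winner.
        step = (hi - lo) / 4
        fine_grid = grid(max(lo, coarse - step), min(hi, coarse + step), 11)
        fine = max(fine_grid, key=lambda p: score(name, p))
        candidate = (score(name, fine), name, fine)
        if best is None or candidate > best:
            best = candidate
    return best

print(search("categorical"))
print(search("continuous"))
```

With these toy numbers, each technique costs 16 evaluations instead of the 21 a uniformly fine grid over the same range would need, and techniques that do not match the response type cost nothing at all.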

Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing.  Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural application for cloud (a point that deserves yet another blog post of its own).  The next generation of automated predictive modeling will be in the cloud from its inception.

Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions.  Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer that they do not own; hence, as a practical matter we should assume that the next generation predictive modeling platform will work with open source machine learning libraries, such as those available in R or Python.

We can’t eliminate the need for human expertise from predictive modeling.   But we can build tools that enable analysts to build better models.

R Interface to Apache Spark

The team at AMPLab has announced a developer preview of SparkR, an R package enabling R users to run jobs on an Apache Spark cluster.   Spark is an open source project that supports distributed in-memory computing for advanced analytics, such as fast queries, machine learning, streaming analytics and graph engines.  Spark works with every data format supported in Hadoop, and supports YARN 2.2.

SparkR exposes the Spark API as distributed lists in R and automatically serializes the necessary variables to execute a function on the cluster.

SparkR is available now on GitHub.  It requires Scala 2.10 and Spark version 0.9.0 or higher, and it depends on the rJava and testthat R packages.