Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

Spark is overhyped! he declared. His evidence? This:

[Figure: Gartner Hype Cycle chart]

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which raises the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.
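To make that concrete, here's a minimal PySpark sketch (the paths and bucket names are hypothetical): the same DataFrame API reads from HDFS, S3, or local disk, and swapping the back end doesn't change the analysis code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic").getOrCreate()

# The same read API works no matter where the bytes live; all paths here
# are hypothetical.
events = spark.read.parquet("hdfs:///warehouse/events")         # HDFS
sales = spark.read.csv("s3a://analytics-bucket/sales.csv",      # S3, no Hadoop
                       header=True, inferSchema=True)
surveys = spark.read.json("file:///tmp/survey_responses.json")  # local disk
```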

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.
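As a sketch of what that muck looks like in practice (the connection string, tables and columns are all hypothetical), here's a warehouse extract joined to raw clickstream logs and reduced to a small analysis dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangling").getOrCreate()

# Customers live in a relational warehouse (hypothetical JDBC URL; the
# driver jar is assumed to be on the classpath)...
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://dw-host:5432/crm")
             .option("dbtable", "customers")
             .load())

# ...while clickstream logs live in HDFS as JSON.
clicks = spark.read.json("hdfs:///logs/clickstream/2017/*")

# Integrate on the fly and reduce to a small analysis dataset.
analysis = (clicks.join(customers, "customer_id")
            .groupBy("segment")
            .agg({"session_minutes": "avg"}))

analysis.toPandas()  # the result is small; continue on a single machine
```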

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
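Here's a rough sketch of the pattern, assuming a live SparkContext (sc) and scikit-learn installed on the worker nodes; the dataset and parameter grid are stand-ins for your own:

```python
from sklearn.datasets import load_iris          # stand-in for your small data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
data = sc.broadcast((X, y))     # small data: ship one copy to every worker

grid = [{"n_estimators": n, "max_depth": d}
        for n in (50, 100, 200) for d in (3, 5, 10)]

def evaluate(params):
    X, y = data.value
    score = cross_val_score(RandomForestClassifier(**params), X, y, cv=5).mean()
    return (score, params)

# Nine experiments run concurrently across the cluster instead of serially.
best_score, best_params = (sc.parallelize(grid, len(grid))
                           .map(evaluate)
                           .max(key=lambda result: result[0]))
```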

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning into a general-purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and made many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLLib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine designed for reinforcement learning applications.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms).

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms and support for scikit-learn, Spark ML and XGBoost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and Power BI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

[Figure: Gartner Magic Quadrant for Advanced Analytics Platforms, 2016]

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016. Spark competes with SAS’ proprietary back end, but SAS will be forced to support it due to partnerships with the Hadoop distributors. Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases. Spark MLLib suffers by comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve. Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical for most applications, the difference does show up in benchmarks like this one.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital. Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount. Palantir and Opera Solutions, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple. Commercial software vendors can and will thrive when they focus on the end user. This approach works well for AtScale, Alteryx, RapidMiner and Zoomdata, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination — an immediate need for large-scale computing resources for a finite period — is best served by some form of cloud. The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management. But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom. Sorry, guys — the biggest data breaches in the past two years were from on-premises systems. Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing. Commercial and open source tools that automate modeling in various ways have been available since the 1980s. Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality. In 2016, enterprises will be able to buy software that delivers expert-level predictive models of the kind that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the Titanic sinking]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance. By itself, Teradata software is nothing special; there are plenty of open source alternatives, like Greenplum. Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice. Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like rearranging deck chairs on the Titanic. The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

2016 Big Analytics Predictions Roundup

Before publishing my own predictions for 2016 later this week, I thought it would be fun to round up published predictions on analytics and Big Data.  Looking through this list, I see a few patterns:

— Streaming is hot.  Analysts do not seem to understand distinctions between streaming data, streaming analytics and real-time decisioning.

— “Data Science” continues to be a term that means whatever you like.

— Security and anti-fraud analytics will be a thing in 2016.  (They were also a thing in 2015.)

— Industry analysts are divided about whether or not the analytics talent crunch will persist.

— IoT is a great concept for selling data management tools, but few know how to make sense of it.

On ZDNet, Andrew Brust summarizes 60 predictions from 17 executives and sees the following:

  1. Increased adoption of streaming analytics
  2. Maturation of IoT technologies
  3. Value and maturity in Big Data products
  4. Increased deployment of artificial intelligence and machine learning

On KDnuggets, Gregory Piatetsky reports on five predictions for 2016 from Tom Davenport of the International Institute of Analytics.  (Webinar replay here.)

  1. Cognitive technology will be the next thing after automated analytics.
  2. Analytical microservices will facilitate embedded analytics.
  3. Data Science and predictive analytics will merge.
  4. The analytics talent crunch will ease due to increased enrollment in graduate programs.
  5. Analytics will focus on data curation and management.

Davenport is smoking something if he thinks cognitive computing will be a thing in 2016.

In Forbes, Gil Press synthesizes the IIA’s predictions (above) with predictions from Forrester, IDC and Gartner to get six predictions:

  1. Analytics will be embedded everywhere.
  2. Machine learning will replace manual data wrangling.
  3. The shortage of analytics talent will persist.
  4. Analytics projects will be riskier than typical IT projects.
  5. Cognitive computing will be the next buzzword.  (Press clearly does not agree with Davenport).
  6. Data monetization will take off.

Predictions (2) and (3) conflict with one another; since analysts spend 80% of their time data wrangling, tooling that automates this step will relieve the talent shortage.

On Datanami, Alex Woodie wades through “dozens” of predictions and publishes the 33 most interesting.  Many of these are self-serving, obvious or nonsensical, so I will do the work Woodie’s editor did not do and distill the list to five:

  1. Streaming analytics will mature and prove its worth.
  2. Apache Kafka will be an essential integration point in enterprise infrastructure.
  3. Business user access to Hadoop data will improve.
  4. Spark will significantly displace MapReduce for Hadoop workloads.
  5. Spark processing outside of Hadoop will also increase significantly.

Teryn O’Brien of Silicon Angle reports on a webinar hosted by Alteryx that included Bob Laurent of Alteryx, Clarke Patterson of Cloudera and Francois Ajenstat of Tableau.  The panel offered three predictions:

  1. Analyst jobs will be hot and analysts will be everyday heroes.
  2. Spark, the cloud and IoT will be big in 2016.
  3. Advanced analytics will play a key role in the Presidential election.

On ITPortal, Dell’s Todd O’Brien predicts three things for 2016:

  1. The role of Citizen Data Scientists will expand and evolve.  (Me: WTF?)
  2. Analytics will significantly affect vertical markets, especially manufacturing.
  3. All innovation will trace back to analytics.

On the first point, I think that O’Brien is trying to say that companies should buy analytics software that is easy to use, like what Dell offers.

On the FICO blog, FICO’s chief analytics officer Scott Zoldi offers five predictions for 2016:

  1. Streaming analytics will come of age in 2016.
  2. “Prescriptive analytics” (his term for anomaly detection) will be a must-have security technology.
  3. “Lifestyle analytics” (predictions embedded in consumer interactions) will integrate prescriptive analytics into daily life.
  4. Businesses will rethink Big Data governance.
  5. Fake data scientists will emerge.

On a SAS blog, Polly Mitchell-Guthrie predicts five things:

  1. Machine learning (will be) established in the enterprise.
  2. IOT hype hits reality.
  3. Big Data moves beyond hype.
  4. Analytics improve cybersecurity.
  5. Analytics drives increased industry-academic interaction.

It’s standard practice at SAS to call any new IT trend “hype.”

In a press release, the health analytics vendor SCIO Health Analytics makes four predictions for 2016:

  1. Greater focus on educating health consumers.
  2. Demand for more precision in health analytics.
  3. More time will be spent on reimbursement strategies.
  4. The need for data and transparency across domains will increase.

Prediction #1 may be true, but it’s not really about health analytics.

On the Talend blog, CMO Ashley Stirrup predicts four things:

  1. Real-time analytics will take center stage.
  2. New business threats will emerge.
  3. CIO turnover will accelerate.
  4. Businesses will retool.

#2 and #4 aren’t really predictions; they simply state the obvious.

Benchmark: Spark Beats MapReduce

A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads.  In this benchmark, Spark outperformed MapReduce on Word Count, k-means and PageRank, while MapReduce outperformed Spark on Sort.

On the ADT Dev Watch blog, Dave Ramel summarizes the paper, arguing that it “brings into question … [the] Databricks Daytona GraySort claim.” This point refers to Databricks’ record-setting entry in the 2014 Sort Benchmark run by Chris Nyberg, Mehul Shah and Naga Govindaraju.

However, Ramel appears to have overlooked section 3.3.1 of the paper, where the researchers explicitly address this question:

“This difference is mainly because our cluster is connected using 1 Gbps Ethernet, as compared to a 10 Gbps Ethernet [in the record-setting configuration]; i.e., in our cluster configuration, the network can become a bottleneck for Sort in Spark.”

In other words, had they deployed Spark on a cluster with high-speed network connections, it likely would have run the Sort faster than MapReduce did.

I guess we’ll know when Nyberg et al. release the 2015 GraySort results.

The IBM benchmark team found that k-means ran about 5X faster in Spark than in MapReduce.  Ramel highlights the difference between this and the Spark team’s claim that machine learning algorithms run “up to” 100X faster.

The actual performance comparison shown on the Spark website compares logistic regression, which the IBM researchers did not test.  One possible explanation — the Spark team may have tested against Mahout’s logistic regression algorithm, which runs on a single machine.  It’s hard to say, since the Spark team provides no backup documentation for its performance claims.  That needs to change.

Spark is Too Big to Fail

Reacting to growing interest in Apache Spark, there is a developing contrarian meme:

  • David Ramel asks: are Spark and Hadoop friends or foes?
  • Jack Vaughan compares Spark to the PDP-11, dismisses it as “just processing.”
  • Doug Henschen praises Spark, pans Databricks
  • Nicole Laskowski complains that Spark Summit East “felt like a Databricks show.”
  • Andrew Oliver thinks Spark needs to grow up
  • Andrew Brust worries that vendors are ahead of customers on Spark
  • IBM’s James Kobielus characterizes Spark as “the shiny new thing”
  • Gartner’s Nick Heudecker asserts that Spark is “not enterprise ready”

Spark skepticism falls into three broad categories:

  • Hadoop Purism: Spark deviates from the MapReduce/HDFS framework, and some people aren’t happy about that
  • Backseat Driving: Some analysts argue that Spark is great but Databricks, the commercial venture behind Spark, should do X, Y or Z
  • FUD: Spark’s competitors — commercial and open source — plant “issues” and “concerns” about Spark with industry analysts

Let’s examine each in turn.

“Spark Competes With Hadoop”

Spark does not compete with Hadoop; it competes with MapReduce.  Hadoop is an ecosystem of projects; there are a few components included in all commercial distributions (e.g. Hive, Pig, HBase), but these aren’t used at every site.  The ability to mix and match components is a strength for Hadoop.

Some software, like Spark, can run co-located in a Hadoop cluster or on clustered machines outside of Hadoop.  This should not surprise anyone; clustering and distributed computing existed before Hadoop.  Why does it matter if a software component can run both ways?  Users and use cases will drive implementation, and if Spark works better with Cassandra than with HDFS, or if a Spark user does not need the other Hadoop bits, so be it.
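A minimal sketch of the point, with hypothetical host names: only the master URL changes between running co-located under YARN and running on a standalone cluster; the application code is identical either way.

```python
from pyspark import SparkConf, SparkContext

# Co-located in a Hadoop cluster: let YARN schedule the executors.
conf = SparkConf().setAppName("etl-job").setMaster("yarn-client")

# Standalone, outside Hadoop entirely (hypothetical master host):
# conf = SparkConf().setAppName("etl-job").setMaster("spark://spark-master:7077")

sc = SparkContext(conf=conf)
```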

While there are reports of organizations that have abandoned MapReduce, most organizations will use Spark together with MapReduce; if users are happy with existing MapReduce jobs, there is no need to rewrite them.  For new applications, however, some users will choose Spark over MapReduce for a variety of reasons: better runtime performance, more efficient programming, more built-in features, or simply because it’s the latest thing.  Isn’t competition a wonderful thing?

Organizations using standalone instances of Spark likely never considered using MapReduce for the application in question.  For these use cases, Spark competes with SAS, Skytree, H2O, GraphLab or some other machine learning software.

Databricks Envy

Sniping at Databricks is equally unwarranted. (Note: I’m not on the payroll.)  There are only so many ways to build a viable open source business model.   Offering a commercial product with additional bits is one way to do so; that is how Cloudera and MapR operate.  Databricks offers a hosted service for Spark with a few extra bits; if you don’t like Databricks’ offering, you can implement on-premises yourself or get Spark as a service through Amazon Web Services, BlueData, Qubole or elsewhere.

And if you really must have a notebook for Spark, try Zeppelin.

Of course, it’s true that Hortonworks open sources everything.  HDP loses $3.76 for every dollar they sell.  They hope to make it up on volume.

Databricks contributes heavily to the open source Spark project, supporting developers whose sole job is to improve Spark.  Most importantly, Databricks provides leadership and release management, which inspires confidence that Spark will not turn into a muddled mess like Mahout.

The complaint that Spark Summit East “felt like a Databricks show” is odd — one rarely hears complaints that Oracle OpenWorld “feels like an Oracle show.”  There were thirty-nine presentations on the agenda at Spark Summit East, and one — Ion Stoica’s keynote — highlighted Databricks Cloud.   In contrast, sponsored sessions accounted for a third of the sessions at the 2015 Strata + Hadoop World in Santa Clara.

“Spark Is Not Enterprise-Ready”

Some of the criticism is silly.   Andrew Oliver is shocked to discover that Databricks Cloud’s notebook, still in beta release, isn’t as slick as Tableau.  Also, a process he was watching timed out.  But wait!  That might be due to slow hotel wi-fi…

Meanwhile, SecurityTracker reports a major security flaw in IBM’s BigSQL.

Is Spark “enterprise ready?”  The same question could be asked about Hadoop, and conservative enterprises will answer “no” in both cases.  There is no single threshold that determines when a piece of software is “enterprise-ready”.  Use cases matter; the standard for software that will run your ATMs is not the same as the standard for software to be used for genomics research.

According to Gartner’s Heudecker, “actual adopters are mid- and late-stage startups such as Spark pureplay Databricks, ClearStory Data and Paxata, which uses Spark for data preparation. Other companies primarily use Spark to power dashboards.”  Interesting to hear Gartner dismiss the dashboard market; but enterprises are currently using Spark for more than dashboards.  A top global bank uses Spark today for Basel reporting and stress testing; if you’re not familiar with stress testing, suffice it to say that a bank that gets this application wrong is in a heap of trouble.

It’s true that vendors are ahead of customers on Spark.  This is hardly out of the ordinary with new technology; one could have said the same thing about Hive in 2010.  Vendors are always ahead of customers; it’s their job.

Spark is Too Big to Fail 

What are the alternatives to Spark?  Gartner’s Heudecker correctly notes that Spark excels at iterative processing, where MapReduce performance is sandbagged by its need to persist intermediate results to disk after each pass through the data.  High-performance advanced analytics must run in memory; there are commercial products available from SAS and Skytree, but for open source distributed analytics there are few alternatives to Spark.  Flink and Tez lack Spark’s analytic libraries; Impala can support SQL but lacks capabilities for machine learning, streaming analytics and graph analytics.

Whether or not Spark is fully buttoned down in Release 1.3 is irrelevant; at this point it is a settled matter that Spark is superior to MapReduce for advanced analytics applications.

I am not suggesting that Spark is free of bugs or issues.  Like every other commercial and open source software project, Spark has bugs; unlike some of the commercial products Gartner rates as “Leaders”, the Spark team is transparent about issues and fixes them quickly.   It’s also fair to say that this time next year Spark will have more features than it has today; the community of users and contributors will determine what features need to be added.

Unlike some other open source projects, Spark has strong leadership, a disciplined approach to development and an impressive release cadence.  People build software, and the people behind Spark have proven that they know what they are doing.

The list of Spark users is strong and growing.  I’ve attended every Spark Summit since the first one in 2013 and there is noticeable growth in the number and sophistication of the applications presented.  This is not hype; it is real progress by users who are accomplishing bigger and better things with Spark than they could have accomplished without it.

Spark has already achieved a level of commercial support that ensures it will live up to its promise.  It is available in every commercial Hadoop distribution and from DataStax, and it is endorsed by SAP and Oracle; it is inconceivable that these players will let Spark fail.  This is partly because reputations are at stake, and also because there are few other options for open source high-performance advanced analytics inside or outside of Hadoop.

SAS Versus R (Part 1)

Which is better for analytics, SAS or R?  One frequently sees discussions on this topic in social media; for examples, see here, here, here, here, here and here.   Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light.

The question is serious.  Many organizations with a large investment in SAS are actively considering whether to adopt R, either to supplement SAS or to replace it altogether.  The trend is especially marked in the analytic services industry, which is particularly sensitive to SAS licensing costs and restrictive conditions.

In this post, I will recap some common myths about SAS and R.  In a follow-up post,  I will summarize the pros and cons of each as an analytics platform.

Myths About SAS and R

Advocates for SAS and R often support their positions with beliefs that are little more than urban legends; as such, they are not good reasons to choose SAS over R or vice-versa.   Let’s review six of these myths.

(1) Regulatory agencies require applicants to use SAS.

This claim is often cited in the context of submissions to the Food and Drug Administration (FDA), apparently by those who have never read the FDA’s regulations governing submissions.  The FDA accepts submissions in a range of formats, including SAS Transport Files (which an R user can create using the StatTransfer utility).   Nowhere in its regulations does the FDA mandate what software should be used to produce the analysis; like most government agencies, the FDA is legally required to support standards that do not favor single vendors.

Pharmaceutical firms tend to rely heavily on SAS because they trust the software, and not due to any FDA mandate.  Among its users, SAS has a deservedly strong reputation for quality; it is a mature product and its statistical techniques are mature, well-tested and completely documented.  In short, the software works, which means there is very little incentive for an established user to experiment with something else, just to save on licensing fees.

That trust in SAS isn’t a permanent state of affairs.  R is gradually making inroads in the life sciences community; it has already largely displaced SAS in the academic world.  Like many other regulatory bodies, the FDA itself uses open source R together with SAS.

(2) R is better than SAS because it is object oriented.

This belief is wrong on two counts: (1) it assumes that object-oriented languages are best for all use cases; and (2) it further assumes that SAS offers no object-oriented capability.

Object-oriented languages are more efficient and easier to use for many analysis tasks.  In real-world analytics, however, we often work with messy and complex data; a cursor-based language like the SAS DATA Step offers the user a great deal of flexibility, which is why it is so widely used.  Anyone who has ever attempted to translate SAS “first and last” processing into an object-oriented language understands this point.  (Yes, it can be done, but it requires a high level of expertise in the OOL to do it.)
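For readers who haven’t fought that battle, here is roughly what SAS first./last. BY-group flags look like when rebuilt in pandas; the claims data is hypothetical:

```python
import pandas as pd

# BY-group processing needs ordered input in both tools; this sort plays
# the role of PROC SORT.
claims = pd.DataFrame({
    "policy":     ["A", "A", "A", "B", "B"],
    "claim_date": ["2014-01-05", "2014-03-10", "2014-07-22",
                   "2014-02-14", "2014-09-01"],
    "amount":     [1200, 450, 300, 800, 950],
}).sort_values(["policy", "claim_date"])

# Equivalents of the SAS first.policy / last.policy automatic flags.
claims["first_policy"] = claims.groupby("policy").cumcount() == 0
claims["last_policy"] = claims.groupby("policy").cumcount(ascending=False) == 0
```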

In Release 9.3, SAS introduced DS2, an object-oriented language with a defined migration path from SAS DATA Step programming. Hence, for those tasks where object-oriented programming is desirable, DS2 meets this need for the SAS user.  (DS2 is included with Base SAS).

(3) You never know what’s inside open source software like R.

Since R is an open programming environment, anyone can develop a package and contribute it to the project.  Commercial software vendors like to plant FUD about open source software by suggesting that contributors may be amateurs or worse — in contrast to the “professional” engineering of commercial software.

One of the key virtues of open source software is that you do know what’s inside it because — unlike commercial software — you can inspect the source code.  With commercial software, you must have faith in the vendor’s integrity, technical support and willingness to stand by its warranty.  For open source software, there is no warranty nor is one required; the code speaks for itself.

When a contributor publishes an enhancement to R, a large community of users evaluates and tests the new feature.  This “crowdsourced” testing quickly flags and logs issues with software syntax and semantics, and logged issues are available for anyone to see.

Commercial software vendors like SAS have professional testing and QA departments, but since testing is expensive there is considerable pressure to minimize the expense.   Under the pressure of Marketing and Sales deadlines, systematic testing is often the first task to be cut.  Bismarck once said that nobody should witness how laws or sausages are made; the same is true for commercial software.

SAS does not disclose the headcount it commits to software testing and QA, but given the size of the R user base, it’s fair to say that the number of people who test and evaluate each R release is far greater than the number of people who evaluate each SAS release.

(4) R is better than SAS because it has thousands of packages.

This is like arguing that Wal-Mart is a better store than Brooks Brothers because it carries more items.  Wal-Mart’s breadth of product makes it a great shopping destination for many shoppers, but a Brooks Brothers shopper appreciates the store’s focus on a certain look and personalized service.

By analogy, R’s cornucopia of functionality is both a feature and a bug.  Yes, there is a package in R to support every conceivable analytic need; in many cases, there is more than one package.  As of this writing, there are 486 packages that support linear regression, which is great unless you only need one and don’t want to sift through 486.

Of course, actual R users don’t check every package to find what they need; they settle on a few trusted packages based on actual experience, word-of-mouth, books, periodicals or other sources of information.  In practice, relatively few R packages are actually used; the graph below shows package downloads from RStudio’s popular CRAN mirror in September 2014.

[Figure: package downloads from RStudio’s CRAN mirror, September 2014]

(For the record, the ten most downloaded packages from RStudio’s CRAN mirror in September 2014 were Rcpp, plyr, ggplot2, stringr, digest, reshape2, RColorBrewer, labeling, colorspace and scales.)

For actual users, the relevant measure isn’t the total number of features supported in SAS and R; it’s how those features align with user needs.

N.B. — Some readers may quibble with my use of statistics from a single CRAN mirror as representative of the R community at large.  It’s a fair point — there are at least 105 public CRAN mirror sites worldwide — but given RStudio’s strong market presence it’s a reasonable proxy.

(5) Switching from SAS to R is expensive because you have to rewrite all of your code.

It’s true that when switching from SAS to R you have to rewrite programs that you want to keep; there is no engine that will translate SAS code to R code. However, SAS users tend to overestimate the effort and cost to accomplish this task.

Analytic teams that have used SAS for some years typically accumulate a large stock of programs and data; much of this accumulation, however, is junk that will never be re-used.    Keep in mind that analytic users don’t work the same way as software developers in IT or a software engineering organization.  Production developers tend to work in a collaborative environment that ensures consistent, reliable and stable results.  Analytic users, on the other hand, tend to work individually on ad hoc analysis projects; they are often inconsistently trained in software best practices.

When SAS users are pressed to evaluate a library of existing programs and identify the “keepers”, they rarely identify more than 10-20% of the existing library.  Hence, the actual effort and expense of program conversion should not be a barrier for most organizations if there is a compelling business case to switch.

It’s also worth noting that sticking with SAS does not free the organization from the cost of code migration, as SAS customers discovered when SAS 9 was released.

The real cost of switching from SAS to R is measured in human capital — in the costs of retraining skilled professionals.  For many organizations, this is a deal-breaker at present; but as more R-savvy analysts enter the workforce, the costs of switching will decline.

(6) R is a good choice when working with Big Data.

When working with Big Data, neither “legacy” SAS nor open source R is a good choice, for different reasons.

Open source R runs in memory on a single machine; it can work with data up to available memory, then fails.  It is possible to run R in a Hadoop cluster or as table functions inside MPP databases.  However, since R runs independently on each node, this is useful only for embarrassingly parallel tasks; for most advanced analytics tasks, you will need to invoke a distributed analytics engine.   There are a number of distributed engines you can invoke from R, including H2O, ScaleR and Skytree, but at this point R is simply a client and the actual work is done by the distributed engine.

“Legacy” SAS uses file-swapping to handle out-of-memory problems, but at great cost to performance; when a data set is too large to load into memory, “legacy” SAS slows to a crawl.  Through SAS/ACCESS, SAS supports the ability to pass through SQL operations to MPP databases and HiveQL, MapReduce and Pig to Hadoop; however, as is the case with R, “legacy” SAS simply functions as a client and the work is done in the database or Hadoop.  The user can accomplish the same tasks using any SQL or Hadoop interface.

To its credit, SAS also offers distributed in-memory software that runs inside Hadoop (the SAS High-Performance Analytics suite and SAS In-Memory Statistics for Hadoop).  Of course, these products do not replicate “legacy” SAS; they are entirely new products that support a subset of “legacy” SAS functionality at extra cost.  Some migration may be required, since they run DS2 but not the traditional SAS DATA Step.  (I cite these points not to denigrate the new SAS software, which appears to be well designed and implemented,  but to highlight the discontinuity for SAS users between the “legacy” product and the scalable High Performance products.)

If your organization works with Big Data, your primary focus should be on choosing the right scalable analytics platform, with secondary emphasis on the client or API used to invoke it.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefits of parallel computing are speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.
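One way to quantify the constraint on speedup is Amdahl’s Law: if a fraction p of a task can run in parallel and the remainder is serial, then n workers deliver a speedup of at most 1 / ((1 − p) + p/n).  With p = 0.95, for example, the ceiling is 20X no matter how many workers you add; “scale out without limit” holds only to the extent that the serial fraction is negligible.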

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.
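A toy illustration of the linear parallel pattern, with made-up numbers: each worker returns a (sum, count) pair for its chunk, and the driver combines the pairs into a grand mean.

```python
# Hypothetical shards of a distributed column, one list per worker.
chunks = [[2.0, 4.0], [6.0], [8.0, 10.0, 12.0]]

# Map step: each worker computes (sum, count) for its own chunk.
partials = [(sum(chunk), len(chunk)) for chunk in chunks]

# Reduce step: the grand mean is the weighted mean of the worker means.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
grand_mean = total / count   # 42.0 / 6 = 7.0
```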

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.
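In PySpark, the reorganize-then-fit pattern looks roughly like this; the sales RDD and the forecasting function are hypothetical stand-ins:

```python
# `sales` is assumed to be an RDD of (store_id, (date, units)) pairs.
def fit_forecast(history):
    # Stand-in for a real time series fit on one store's history; here,
    # a naive "last observed value" forecast.
    return sorted(history)[-1][1]

forecasts = (sales
             .groupByKey()               # the shuffle is the reorganization cost
             .mapValues(fit_forecast))   # one independent fit per store
```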

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

[Table: Distributed Analytics Software, May 2014]

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADlib should run in any SQL environment, but Pivotal’s Greenplum database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Advanced Analytics in Hadoop, Part Two

In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop.    But what if the analytic methods you need are not implemented in the current Mahout release?  The short answer is that you are either going to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.

Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout.  Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.

Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout but without significant enhancement.   The exception is IBM, whose InfoSphere BigInsights Hadoop distribution incorporates a suite of text mining features nicely demonstrated in this series of videos.  IBM Research has also developed SystemML, a suite of machine learning algorithms written in MapReduce, although as of this writing SystemML is a research project and not generally available software.

To simplify program development in MapReduce for analysts, Revolution Analytics launched its RHadoop open source project earlier this year.  RHadoop’s rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics.   This example shows how an rmr user can implement k-means clustering with 28 lines of code; a comparable procedure, run in Hortonworks with a combination of Python, Pig and Java, requires 100 lines of code.

For analytic use cases where the primary concern is to implement scoring in Hadoop, Zementis offers the Universal PMML Plug-In(TM) for Datameer.  This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer.   According to Michael Zeller, CEO of Zementis, the Plug-In can actually be deployed into any Hadoop distribution.  There is an excellent video about this product from the Hadoop Summit at this link.

Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources.  It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth.  Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.

Finally, a word on what SAS is doing with Hadoop.  Prior to January 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website.   In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program.  While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS.  There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture incent it to preach moving the data to the analytics and not the other way around.

RevoScaleR Beats SAS, Hadoop for Regression on Large Dataset

Still catching up on news from the Strata conference.

This post from Revolution Analytics’ blog summarizes an excellent paper jointly presented at Strata by Allstate and Revolution Analytics.

The paper documents how a team at Allstate struggled to run predictive models with SAS on a data set of 150 million records.  The team then attempted to run the same analysis using three alternatives to SAS: a custom MapReduce program running in a Hadoop cluster, open source R, and RevoScaleR running on an LSF cluster.

Results:

— SAS PROC GENMOD on a Sun 16-core server (current state): five hours to run;

— Custom MapReduce on a 10 node/80-core Hadoop cluster: more than ten hours to run, and much more difficult to implement;

— Open source R: impossible; open source R cannot load the data set;

— RevoScaleR running on a 5-node/20-core LSF cluster: a little over five minutes to run.

In this round of testing, Allstate did not consider in-database analytics, such as dbLytix running in IBM Netezza; it would  be interesting to see results from such a test.

Some critics have pointed out that the environments aren’t equal.  It’s a fair point to raise, but expanding the SAS server to 20 cores (matching the RevoScaleR cluster) wouldn’t materially reduce SAS runtime, since PROC GENMOD is single-threaded.    SAS does have some multi-threaded PROCs and tools like HPA that can run models in parallel, so it’s possible that a slightly different use case would produce more favorable results for SAS.

It’s theoretically possible that an even larger Hadoop environment would run the problem faster, but one must balance that consideration with the time, effort and cost to achieve the desired results.  One point that the paper does not address is the time needed to extract the data from Hadoop and move it to the server, a key consideration for a production architecture.  While predictive modeling in Hadoop is clearly in its infancy, this architecture will have some serious advantages for large data sets that are already resident in Hadoop.

One other key point not considered in this test is the question of scoring — once the predictive models are constructed, how will Allstate put them into production?

— Since SAS’ PROC GENMOD can only export a model to SAS, Allstate would either have to run all production scoring in SAS or manually write a custom scoring procedure;

— Hadoop would certainly require a custom MapReduce procedure;

— With RevoScaleR, Allstate can push the scoring into IBM Netezza.

This testing clearly shows that RevoScaleR is superior to open source R, and for this particular use case clearly outperforms SAS.  It also demonstrates that predictive analytics running in Hadoop is an idea whose time has not yet arrived.