Spark is Too Big to Fail

In reaction to growing interest in Apache Spark, a contrarian meme is developing:

David Ramel asks: are Spark and Hadoop friends or foes?
Jack Vaughan compares Spark to the PDP-11, dismisses it as “just processing.”
Doug Henschen praises Spark, pans Databricks.
Nicole Laskowski complains that Spark Summit East “felt like a Databricks show.”
Andrew Oliver thinks Spark needs to grow up.
Andrew Brust worries that vendors are ahead of customers on Spark.
IBM’s James Kobielus characterizes Spark as “the shiny new thing.”
Gartner’s Nick Heudecker asserts that Spark is “not enterprise ready.”

Spark skepticism falls into three broad categories:

Hadoop Purism: Spark deviates from the MapReduce/HDFS framework, and some people aren’t happy about that
Backseat Driving: Some analysts argue that Spark is great but Databricks, the commercial venture behind Spark, should do X, Y or Z
FUD: Spark’s competitors — commercial and open source — plant “issues” and “concerns” about Spark with industry analysts

Let’s examine each in turn.

“Spark Competes With Hadoop”

Spark does not compete with Hadoop; it competes with MapReduce. Hadoop is an ecosystem of projects; there are a few components included in all commercial distributions (e.g. Hive, Pig, HBase), but these aren’t used at every site. The ability to mix and match components is a strength for Hadoop.

Some software, like Spark, can run co-located in a Hadoop cluster or on clustered machines outside of Hadoop. This should not surprise anyone; clustering and distributed computing existed before Hadoop. Why does it matter if a software component can run both ways? Users and use cases will drive implementation, and if Spark works better with Cassandra than with HDFS, or if a Spark user does not need the other Hadoop bits, so be it.

While there are reports of organizations that have abandoned MapReduce, most organizations will use Spark together with MapReduce; if users are happy with existing MapReduce jobs, there is no need to rewrite them. For new applications, however, some users will choose Spark over MapReduce for a variety of reasons: better runtime performance, more efficient programming, more built-in features or simply because it’s the latest thing. Isn’t competition a wonderful thing?

Organizations using standalone instances of Spark likely never considered using MapReduce for the application in question. For these use cases, Spark competes with SAS, Skytree, H2O, GraphLab or some other machine learning software.

Databricks Envy

Sniping at Databricks is equally unwarranted. (Note: I’m not on the payroll.) There are only so many ways to build a viable open source business model. Offering a commercial product with additional bits is one way to do so; that is how Cloudera and MapR operate. Databricks offers a hosted service for Spark with a few extra bits; if you don’t like Databricks’ offering, you can implement on-premises yourself or get Spark as a service through Amazon Web Services, BlueData, Qubole or elsewhere.

And if you really must have a notebook for Spark, try Zeppelin.

Of course, it’s true that Hortonworks open sources everything. HDP loses $3.76 for every dollar they sell. They hope to make it up on volume.

Databricks contributes heavily to the open source Spark project, supporting developers whose sole job is to improve Spark. Most importantly, Databricks provides leadership and release management, which inspires confidence that Spark will not turn into a muddled mess like Mahout.

The complaint that Spark Summit East “felt like a Databricks show” is odd; one rarely hears complaints that Oracle OpenWorld “feels like an Oracle show.” There were thirty-nine presentations on the agenda at Spark Summit East, and only one, Ion Stoica’s keynote, highlighted Databricks Cloud. In contrast, sponsored sessions accounted for a third of the sessions at the 2015 Strata + Hadoop World in Santa Clara.

“Spark Is Not Enterprise-Ready”

Some of the criticism is silly. Andrew Oliver is shocked to discover that Release 1.0 of Databricks Cloud’s notebook, currently still in beta release, isn’t as slick as Tableau. Also, a process he was watching timed out. But wait! That might be due to slow hotel wi-fi…

Meanwhile, SecurityTracker reports a major security flaw in IBM’s BigSQL.

Is Spark “enterprise ready?” The same question could be asked about Hadoop, and conservative enterprises will answer “no” in both cases. There is no single threshold that determines when a piece of software is “enterprise-ready”. Use cases matter; the standard for software that will run your ATMs is not the same as the standard for software to be used for genomics research.

According to Gartner’s Heudecker, “actual adopters are mid- and late-stage startups such as Spark pureplay DataBricks, ClearData Story and Paxata, which uses Spark for data preparation. Other companies primarily use Spark to power dashboards.” It is interesting to hear Gartner dismiss the dashboard market, but enterprises are currently using Spark for more than dashboards. A top global bank uses Spark today for Basel reporting and stress testing; if you’re not familiar with stress testing, suffice it to say that a bank that gets this application wrong is in a heap of trouble.

It’s true that vendors are ahead of customers on Spark. This is hardly out of the ordinary with new technology; one could have said the same thing about Hive in 2010. Vendors are always ahead of customers; it’s their job.

Spark is Too Big to Fail

What are the alternatives to Spark? Gartner’s Heudecker correctly notes that Spark excels at iterative processing, where MapReduce performance is sandbagged by its need to persist after each pass through the data. High-performance advanced analytics must run in memory; there are commercial products available from SAS and Skytree, but for open source distributed analytics there are few alternatives to Spark. Flink and Tez lack Spark’s analytic libraries; Impala can support SQL but lacks capabilities for machine learning, streaming analytics and graph analytics.
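To make the point about iterative processing concrete, here is a toy sketch in plain Python. It is not Spark or MapReduce code; the function names (one_pass, iterate_with_disk, iterate_in_memory) and the JSON file standing in for intermediate storage are my own assumptions for illustration. It simply counts how many times an iterative job touches disk when it must persist after every pass, versus keeping its working set in memory.

```python
import json
import os
import tempfile


def one_pass(values):
    """One iteration of a toy algorithm: move each value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def iterate_with_disk(values, passes, workdir):
    """Persist-after-every-pass style: write each result to disk, then read it back.

    The JSON file is a hypothetical stand-in for intermediate storage between jobs.
    """
    path = os.path.join(workdir, "state.json")
    disk_ops = 0
    for _ in range(passes):
        values = one_pass(values)
        with open(path, "w") as f:
            json.dump(values, f)   # write intermediate result
        disk_ops += 1
        with open(path) as f:
            values = json.load(f)  # read it back for the next pass
        disk_ops += 1
    return values, disk_ops


def iterate_in_memory(values, passes):
    """In-memory style: the working set stays in RAM between passes."""
    disk_ops = 0
    for _ in range(passes):
        values = one_pass(values)
    return values, disk_ops


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0]
    with tempfile.TemporaryDirectory() as workdir:
        on_disk, disk_ops = iterate_with_disk(data, 20, workdir)
    in_mem, mem_ops = iterate_in_memory(data, 20)
    print("same answer:", on_disk == in_mem)
    print("disk operations:", disk_ops, "vs", mem_ops)
```

Both versions compute the same answer; the only difference is that the disk-bound version pays two I/O operations per pass. On a real cluster that overhead is multiplied by data volume and the number of iterations, which is why iterative workloads such as machine learning feel the difference most.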

Whether or not Spark is fully buttoned down in Release 1.3 is irrelevant; at this point it is a settled matter that Spark is superior to MapReduce for advanced analytics applications.

I am not suggesting that Spark is free of bugs or issues. Like every other commercial and open source software project, Spark has bugs; unlike some of the commercial products Gartner rates as “Leaders”, the Spark team is transparent about issues and fixes them quickly. It’s also fair to say that this time next year Spark will have more features than it has today; the community of users and contributors will determine what features need to be added.

Unlike some other open source projects, Spark has strong leadership, a disciplined approach to development and an impressive release cadence. People build software, and the people behind Spark have proven that they know what they are doing.

The list of Spark users is strong and growing. I’ve attended every Spark Summit since the first one in 2013 and there is noticeable growth in the number and sophistication of the applications presented. This is not hype; it is real progress by users who are accomplishing bigger and better things with Spark than they could have accomplished without it.

Spark has already achieved a level of commercial support that ensures it will live up to its promise. Spark is available in every commercial Hadoop distribution and with DataStax, and it is endorsed by SAP and Oracle; it is inconceivable that these players will let Spark fail. This is partly because reputations are at stake, and also because there are few other options for open source high-performance advanced analytics inside or outside of Hadoop.

One response to “Spark is Too Big to Fail”

2015 in Big Analytics | The Big Analytics Blog

December 31, 2015 at 5:41 pm

[…] 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April. I wrote this post in response to a growing chorus of snark about Spark […]

Thomas Dinsmore's Blog