Big Analytics Roundup (September 14, 2015)
There are two big stories this week: the latest Spark release (story here) and Cloudera’s One Platform announcement. The latter story is big enough to warrant its own section below. I note, however, that Cloudera is simply announcing that it will continue to do what it is already doing: contribute heavily to Spark.
Here is a list of IBM’s contributions to Spark:
In other news, Gartner discovers Flink, which reminds me of what my science teacher said about the brontosaurus: stomp on his tail on Monday, and on Thursday he bleats.
Cloudera Announces “One Platform” for Spark and Hadoop
On Cloudera’s Vision blog, CSO Mike Olson announces plans to invest in Spark development, focusing on the following areas:
- Security: encryption, secure access to the Web UI, plus additional security features for Spark SQL and Spark Streaming.
- Scalability: improved scheduler logic, improvements to internal data movement, support for HDFS’ Discardable Distributed Memory, improvements to the Spark Job History Server and collaboration with Intel on hardware optimization. Olson sets a scalability target of several thousand jobs running on multi-tenant clusters with more than 10,000 nodes.
- Management: improvements to Spark-on-YARN for better multi-tenancy, performance and ease of use; better resource consumption and utilization metrics; simplified and automated configuration; better integration with Python.
- Streaming: improved performance and resilience for the streaming engine (to support jobs that run “days, months or years”), plus higher-level language extensions and a simple declarative interface.
This chart summarizes past and future development in each area:
Not surprisingly, the announcement sets off a firestorm of analysis (no pun intended).
In SiliconAngle, Maria Deutscher correctly notes that Cloudera’s initiative appears designed to head off competition from those who advocate using Spark without Hadoop. This is a real threat to Cloudera and the other Hadoop distributors; roughly half of all Spark users run it outside of Hadoop. You don’t really need Hadoop to leverage Spark; it works just fine with S3, Cassandra or a host of other datastores.
Gavin Clarke, in The Register, suggests that Cloudera is retiring MapReduce and replacing it with Spark. In Fortune, Derrick Harris sings the same tune. Nothing in Olson’s announcement supports that claim, although Cloudera certainly will promote Spark over MapReduce for new applications. Cloudera’s graphics show a continuing, if reduced, role for MapReduce.
It’s fair to say that Cloudera sees Spark supplanting MapReduce in the long run; this has not changed since Cloudera first announced its support for Spark in 2013. In InformationWeek, Charles Babcock interviews CTO Eli Collins, who notes that Cloudera plans to deliver the enhancements detailed by Olson in the Spark release that will ship with CDH 6.0 next year.
Timothy Prickett Morgan suggests that Cloudera will create its own Spark distribution. But in an interview with Alex Woodie, Collins notes that Cloudera will continue to invest in the Apache open source projects. Those two views aren’t completely inconsistent, but they highlight that this announcement simply says that Cloudera will do what it is already doing: contribute heavily to Spark, as shown below.
As if to underscore the previous point, Justin Kestelyn interviews Cloudera’s Spark committers for the Cloudera blog.
In ZDNet, Andrew Brust reports that Cloudera wants to integrate Spark with Cloudera Manager and Cloudera Navigator, which is only surprising if you thought this was done already.
Big Analytics Use Cases
Doug Henschen reports on three success stories for advanced analytics on Hadoop: Merck, Mercy Health Care and Progressive Insurance.
In eWeek, Darryl Taft reports on eight companies that use Spark to drive value.
On Techopedia, Kaushik Pal explains how Apache Drill democratizes analysis.
On the MarkLogic blog, Hemant Puranik explains how to use Spark with MarkLogic.
Sam Palani offers a guide to configuring Tableau with Spark.
On his eponymous blog, Eugene Zhulenev explains how to use Spark ML for audience modeling.
Steve Wooledge whiteboards a Spark use case for drug discovery on the MapR blog.
Two items on Slideshare:
- Amy Wang discusses how to predict loan defaults using Lending Club data.
- Erin LeDell and Mark Landry detail the top ten pitfalls in data science.
On the Domino blog, Sean Lorenz explains Deep Learning with H2O.
On YouTube, Tianqi Chen offers a video covering boosted trees on XGBoost.