Strata+Hadoop World NYC is upon us. Andrew Brust opines that there will be three themes at Strata this year: (1) Spark “versus” Hadoop; (2) streaming goes mainstream; (3) data governance matters. My take:
- “Spark versus Hadoop” is controversy for the sake of people who like controversy. Spark works with Hadoop, and Spark works with other platforms, or by itself. Use cases will determine the best platform.
- We’ve been hearing that streaming is mainstream for something like ten years now. There are a half-dozen commercial products in the space, plus multiple open source frameworks.
- Data governance is a soporific.
Because of the spate of Spark stories, this week’s roundup has four sections: Spark, SQL, Machine Learning and Streaming. The top story is Databricks’ Spark survey, which provoked a flurry of analysis.
2015 Spark Survey
Databricks released results of its 2015 Spark Survey, available here (registration required); an infographic is here. The “report” is a somewhat informative mashup of survey findings, plus other information, such as the headcount from Spark Summits. (Spoiler: it’s increasing.) On the Databricks blog, Matei Zaharia, Patrick Wendell and Denny Lee summarize key points. Additional analysis here, here, here, here, here, here, here and here.
Analysts, loving controversy, note that Spark users slightly prefer standalone configurations over Spark-on-YARN (i.e., co-located in Hadoop). Andrew Oliver, for example, commenting on Cloudera’s One Platform announcement earlier this month, cites the survey results to argue that Databricks is actively marketing against Spark-on-YARN. But compare these results to the Typesafe/Databricks Spark survey published in January, and you will note that respondents to the 2015 survey are slightly less likely to run Spark in a standalone cluster than respondents to the earlier survey.
Other analysts, like Tony Baer, note that 11% of respondents run Spark on Mesos, hinting darkly that since the AMPLab team developed both Spark and Mesos, there must be some sort of conspiracy against Hadoop. But in the earlier survey, 26% of respondents said they run on Mesos, so if someone is organizing a secret cabal to compete against Spark-on-YARN, it’s not working out too well.
The biggest news in the survey is the rapid growth in the share of respondents who use the Python API, from 22% to 58%, and the corresponding decline among those who use Scala or Java. The SQL and R interfaces are too new to compare to the previous survey, but it’s worth noting that in 2015 more respondents use the SQL interface than the Java interface.
Spark as a Service
Google announces Cloud Dataproc, a managed Spark and Hadoop service, currently available in beta. Key benefits claimed: cheap, fast, integrated with the other Google Cloud Platform services, easy to manage, simple and familiar. Google claims that it can set up or knock down a cluster in ninety seconds or less. Billing is by the minute, which is cool. Stories here, here, here, here, here, here, here, here, here, here, here, here, here, here, and here.
BlueData offers Yet Another Spark Service.
In case you’re not happy with available offerings for Spark-as-a-service from Databricks, Qubole, Amazon Web Services, Google and BlueData, MemSQL offers Streamliner. Stories here, here, here, here and here.
Miscellaneous Spark Bits
Jim Scott enters the Spark vs. Hadoop fray and gets it wrong. No, Spark does not need HDFS; it works perfectly well with other datastores.
Jim Scott (again) lists five use cases for Spark Streaming: credit card fraud detection, network security, genomic sequencing, real-time ad targeting and hospital readmission.
On the MapR blog, the ubiquitous Jim Scott explains why Spark is a great companion to Hadoop.
In IT Jungle, Alex Woodie wonders what IBM’s embrace of Spark means for the product line IBM now brands as “i-series” and everyone else calls “AS-400”. His answer: nothing, IBM has no plans to put Spark on these tired old boxes.
Writing for American Banker, Tom Groenfeldt interviews Tom Davenport, several vendors (Rob Thomas of IBM, David Wallace of SAS and Abhi Mehta of Tresata) and one banker. Tom Davenport says that bankers use different things, touts Teradata; Rob Thomas talks about IBM’s Spark initiative; David Wallace says that banks use SAS, and the one banker talks about using Accenture. From this muddle, Mr. Groenfeldt concludes that banks are turning to Spark.
In an article titled Retail Gains with Distributed Systems, Daniel Gutierrez talks about Hadoop and Spark, but provides no actual examples of retailers using these platforms.
SQL
MapR’s Drill team walks to start Dremio.
Jim Scott, who was quite busy last week, profiles Apache Drill.
On YouTube, a disembodied voice representing Syntelli Solutions offers you a Test Drive using Drill and Spotfire on AWS.
Cloudera benchmarks Impala with TPC-DS queries, concludes that maximum concurrency with good performance increases with the size of the cluster. This does not seem surprising at all; more nodes in the cluster means more horsepower.
Harish Butani of Sparkline Data benchmarks TPC-H queries using Spark SQL on Druid, and summarizes the results on LinkedIn. Conclusion: Spark on Druid runs a lot faster than Spark on Parquet. Full report here. Sparkline publishes a Spark Druid interface in Spark Packages.
On the MapR blog, Michele Nemschoff touts the Hadoop and Spark platform for retail analytics it sold to Quantium, an Australian analytic services provider.
Platfora announces Release 5.0, which leverages Spark behind the scenes for data preparation. Alex Woodie explains. More stories here, here, here and here.
ClearStory Data announces a triumph of branding (“Intelligent Data Harmonization”) and a few new features in a muddled press release.
Machine Learning
Carlos Guestrin announces that Dato is a big believer in open source software, which will make you feel good when you pay the subscription fees on Dato’s commercial software. Dato has released its SFrame columnar data frame to open source under a BSD license. SFrames are like pandas DataFrames or R data frames, with some additional features useful to data scientists, like out-of-core operations (working with data too large to fit in memory) and support for wide datasets.
No doubt SFrames are cool, but the key challenge for companies in this space is to figure out how to make analytics work with mainstream data formats. Any advantages of a new format are offset by the time and cost needed to ingest and export the data.
At the Moscow Data Fest, H2O argues that machine learning is the new SQL.
Sam Dean interviews H2O.ai VP Marketing Oleg Rogynskyy.
Two items from the Databricks blog cover improvements to Spark’s machine learning capabilities in Spark 1.5:
Cloudera’s Sandy Ryza et al. contribute Spark-Timeseries, a Python and Scala library for analyzing large-scale time series datasets. (h/t Hadoop Weekly)
Streaming
Concurrent and Data Artisans announce “strategic partnership” to support Cascading on Flink. Cascading touts.
On the MapR blog, Ellen Friedman introduces you to Flink.
TIBCO’s Kai Waehner presents a nice overview of stream processing frameworks and products. Not surprisingly, he likes TIBCO StreamBase, but the deck nicely summarizes differences between the commercial and open source options.