Big Analytics Roundup (October 5, 2015)
Announcements timed to coincide with Strata NYC 2015 drive the news this week. The single most interesting item for Big Analytics is the O’Reilly 2015 Data Science Survey, which warrants a post of its own. Two key points:
- Data scientists still use SQL, Excel, Python and R. (Doh!)
- Data scientists who spend time in meetings and presenting results of analysis earn more than the grunts who muck around with data.
The lesson is clear: if you want to earn the big bucks, stop messing around with Zeppelin and learn PowerPoint.
There are a few interesting presentations from Strata embedded in this post, plus two that stand out:
- Ron Kasabian of Intel and Michael Draugelis of Penn Medicine explain how they improve medical decision-making with predictive analytics.
- Iulia Pasov and Calin-Andrei Burloiu show how they use data science to measure and prevent churn at Avira.
Paul Kent has the toughest job at SAS — promoting an initiative his boss thinks is hype. In this sponsored presentation, Paul does a professional job presenting SAS’ Big Data story, which seems compelling. However, the challenge for SAS in Big Data remains: name a reference customer.
MapR’s Jim Scott takes another shot at the “will Spark replace Hadoop?” meme. All together now:
- Spark, like MapReduce, is a compute engine
- Hadoop = MapReduce + HDFS + YARN + (an ecosystem of other bits)
- Spark lacks a native file system, so it can never replace Hadoop
- Coming in 2016: Hadoop = Spark + HDFS + YARN + (an ecosystem…)
- Also possible: Spark + Cassandra, Spark + MongoDB, Spark + Druid, Spark + (your database here)
On the Intersog blog, Jenny Richards gets it right by focusing on the differences between Spark and MapReduce.
Spark Maintenance Release
The Spark team announces Spark 1.5.1, a maintenance release with about 80 bug fixes. On the Databricks blog, Reynold Xin explains how Spark version numbers work. Short version: top-level numbers correspond to API compatibility, dot releases include features and enhancements, and double-dot releases have bug fixes.
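For the literal-minded, here is a rough Python sketch of that numbering rule. It is an illustration only, not anything from the Spark project: the names `SparkVersion`, `parse_version` and `release_kind` are made up for this post, and it assumes plain three-part version strings such as "1.5.1".

```python
from typing import NamedTuple


class SparkVersion(NamedTuple):
    """Hypothetical helper type: a three-part version number."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> SparkVersion:
    # Assumes exactly three dot-separated integers, e.g. "1.5.1".
    major, minor, patch = (int(part) for part in text.split("."))
    return SparkVersion(major, minor, patch)


def release_kind(previous: str, current: str) -> str:
    """Classify a release by the leftmost version component that changed."""
    old, new = parse_version(previous), parse_version(current)
    if new.major != old.major:
        return "major (API compatibility may change)"
    if new.minor != old.minor:
        return "feature (new features and enhancements)"
    if new.patch != old.patch:
        return "maintenance (bug fixes)"
    return "same release"


print(release_kind("1.5.0", "1.5.1"))  # prints: maintenance (bug fixes)
```

By the same rule, going from 1.4.1 to 1.5.0 counts as a feature release.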
Spark Use Cases and Success Stories
At Strata NYC 2015, Databricks’ Reynold Xin describes “sketching” with Spark (aka exploratory analysis and feature engineering).
Also at Strata, Edd Dumbill presents the business case for Spark, Kafka “and friends.”
On the MongoDB blog, Mat Keep interviews Thiago Cardoso, co-founder and CTO of Hekima, a social media analytics startup. Hekima uses Spark, Hadoop and (you guessed it) MongoDB.
On SmartDataCollective, the ubiquitous Jim Scott describes what he calls use cases for Spark: exploratory analytics; machine learning; real-time dashboards and ETL.
At Big Data Day LA 2015, ESRI’s Adam Mollenkopf explains how to apply GeoSpatial Analytics with Spark. Video here.
A slew of software vendors announce integration with Spark.
–Syncsort announces integration with Spark and Kafka.
–Dataiku announces integration of its Data Science Studio (DSS) software with Spark. DSS offers a commercially licensed visual workbench enabling the user to build pipelines integrating a number of data sources and formats. Analytic functionality is modest.
–SnapLogic announces pending release of what it calls SnapLogic Elastic Integration Platform, which includes components branded as “Sparkplex” and Spark “Snaps”. The former includes a code generator that translates user requests from the SnapLogic visual pipeline designer into Spark code. The latter is SnapLogic branding for “prebuilt connectors”, of which there are many. SnapLogic claims that Snaps are plug and play, so it’s a “snap” to convert your pipeline from MapReduce to Spark. Stories here, here, here, here, and here.
Spark as a Service
In case you’re not happy with offerings from Databricks, Qubole, Amazon Web Services, Google, BlueData and MemSQL, Altiscale announces you-know-what.
Dremio, a startup led by MapR’s Drill gurus, lands a $10 million A round.
At the Hadoop Meeting NYC, the folks from Dremio present Drill use cases and a roadmap.
Polymath Abhishek Tiwari reflects on Drill.
On the MapR blog, Joseph Blue explains how to identify a data breach with Drill.
…adds support for Microsoft PowerBI to its BI on Hadoop story.
On ZDNet, Natalie Gagliordi touts Teradata’s embrace of Presto. (Note to ZDNet’s headline editor: Presto is not an Apache project.) She correctly notes that Presto is faster than Hive-on-MapReduce, but that’s a low bar: just about everything is faster than Hive releases prior to 0.13. Hive-on-Tez, on the other hand, competes well with Drill, Impala, Presto and Spark SQL. That’s a problem for Presto, because a challenger has to be outstanding at something. Once again, it appears that Teradata is betting on the wrong horse.
Databricks’ Hossein Falaki delivers a presentation at Strata on supercharging R with Spark; slides here. Spark’s R API is incomplete; as of Spark 1.5.1 it supports DataFrame operations (including SQL queries) and generalized linear models. That’s better than nothing, but R users who need to do serious machine learning need to look elsewhere for now.
On the MapR blog, Ellen Friedman introduces you to Flink.
Data Artisans’ Robert Metzger delivers a presentation about the architecture of Flink’s streaming runtime at ApacheCon Europe.
At Strata NYC 2015, Databricks’ Tathagata Das describes the new bits in Spark Streaming.