Big Analytics Roundup (December 14, 2015)
Quite a bit of hard news this week — nine stories, including software releases from Hortonworks and Confluent, a milestone for Apache Kylin, and three funding stories. Plus, a number of items “above the news.” Let’s get to it.
Risk Management on Spark
The growing number of applications that run on Spark shows the platform is maturing. ThinkReactive, a consultancy focused on next-generation solutions for risk analytics, is a great example. On the O’Reilly Radar blog, ThinkReactive’s Deenar Toraskar explains the need for speed in capital and risk management, and provides examples from his experience with portfolio stress testing and VaR. Risk management is the most demanding analytic application; Deenar is a Spark pioneer, and he was putting applications into production for leading banks back when the “experts” were still sneering at Spark.
Drill versus Spark SQL
MapR’s Jim Scott touts Apache Drill in a piece that masquerades as a neutral Drill versus Spark SQL comparison. MapR isn’t neutral on Drill; it has adopted Drill to counter Cloudera’s Impala and Hortonworks’ Hive on Tez. For a real assessment of the two engines (plus Hive and Impala), read this. Keep in mind that the “ANSI SQL is better than HiveQL” argument is theoretical; what matters is how a SQL engine performs on actual queries. In the Allegro evaluation, Drill failed to run several of the test queries.
In the MIT Technology Review, Will Knight surveys what developers are doing with TensorFlow, Google’s recently open-sourced machine learning software. Quite a lot, it seems. Speaking of TensorFlow, on the Google Research blog, Pete Warden explains how to use it to classify images.
- On the AtScale blog, Bruno Aziza wonders if Spark has killed Hadoop. The answer is no.
- On the Flink blog, Matthias Sax explains how to run Storm topologies in Flink.
- Jagrata Minardi and Mike Alperin describe how to visualize data with Spotfire and Spark SQL.
- On the Dato blog, Alon Palumbo explains out-of-core algorithms.
- A nameless author on the Dremio blog explains how to install Drill on Windows.
- Dr. Kenji Takeda summarizes Microsoft Azure’s resources for data science.
(1) Hortonworks Updates Spark in HDP
Hortonworks announces support for Spark 1.5.2 just in time for the release of Spark 1.6. The press release notes HDP’s continuing support for Apache Zeppelin, the open source data science notebook, and Project Magellan for geospatial analytics. Alex Woodie tries and fails to make a story out of it.
(2) Confluent Releases Confluent Platform 2.0
The folks behind Kafka announce Release 2.0 of their eponymous open source software, which is based on Kafka 0.9. Confluent Platform includes Kafka plus Java and C/C++ clients, a schema registry and connectors for JDBC, HDFS and Hive. Dave Ramel reports. More stories here, here and here.
(3) Facebook Announces Big Sur Release
On the Facebook engineering blog, Kevin Lee and Serkan Piantino describe Big Sur, a machine learning engine designed to run on NVIDIA’s Tesla Accelerated Computing Platform. Stories here and here. Great news for folks thinking that we have a shortage of open source tools for neural networks and Deep Learning.
(4) DataScience Lands Funding
(5) Palantir Raises Even More Capital
Analytics behemoth Palantir expands its planned $500 million private equity placement to $679.8 million. The company is now valued at $20 billion.
(6) Google Offers Cloud Vision for Image Classification
In case you need to classify cat pictures, Google offers a limited preview of the Cloud Vision API on Google Cloud Platform.
(7) Apache Kylin Advances
(8) MapR Announces Yet Another Streaming Engine
MapR announces MapR Streams, because…?
(9) Down Round for Platfora?
BI-on-Hadoop vendor Platfora announces a $30 million Series C round. The company previously raised $38 million in a C round back in March 2014.
Not News, But Kind of Interesting
In IT News, Katherine Noyes shares five things you need to know about Hadoop versus Spark.
- They do different things. No kidding. Hadoop, for one thing, includes a file system.
- You can use one without the other. Obviously that’s true for Hadoop, which predates Spark. Almost half of the respondents to Databricks’ recent Spark user survey use it standalone.
- Spark is faster. True, but not because MapReduce operates in steps. Spark is faster because it can retain objects in memory, while MapReduce persists to disk after each step. For the record, recent benchmarking performed by IBM shows a performance advantage for Spark over MapReduce closer to 5X than 100X.
- You may not need Spark’s speed. MapReduce is fine for embarrassingly parallel tasks that require just a single pass through the data, for the reason cited above: if MapReduce does not need to persist an intermediate result to disk, it will run about as fast as Spark.
- Failure recovery: different, but still good. Different, because Spark’s processing model is different.
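To make the speed argument concrete, here is a toy sketch in plain Python. It uses neither the Spark nor the Hadoop API; the function names (`run_disk_between_steps`, `run_in_memory`, `compare`) are hypothetical, and JSON files in a temp directory stand in for HDFS. The only point is to count how often an intermediate result has to make a round trip to disk: never for a single-pass job, and once per hand-off between steps for a multi-step one.

```python
import json
import os
import tempfile


def run_disk_between_steps(data, steps, workdir):
    """MapReduce-style toy: write each intermediate result to disk, then read it back."""
    disk_round_trips = 0
    for i, step in enumerate(steps):
        data = [step(x) for x in data]
        if i < len(steps) - 1:  # an intermediate result feeds the next step
            path = os.path.join(workdir, f"step_{i}.json")
            with open(path, "w") as f:
                json.dump(data, f)
            with open(path) as f:
                data = json.load(f)
            disk_round_trips += 1
    return data, disk_round_trips


def run_in_memory(data, steps):
    """Spark-style toy: keep intermediate results in memory between steps."""
    for step in steps:
        data = [step(x) for x in data]
    return data, 0


def compare(data, steps):
    """Run both toys, check they agree, and return the disk round-trip count."""
    with tempfile.TemporaryDirectory() as workdir:
        disk_result, trips = run_disk_between_steps(data, steps, workdir)
    mem_result, _ = run_in_memory(data, steps)
    assert disk_result == mem_result
    return trips
```

Calling `compare` with a single step reports zero round trips, which is the "you may not need Spark's speed" case; a three-step pipeline reports two. Real jobs differ in many other ways (scheduling, shuffle, serialization), so treat this as an illustration of the argument, not a benchmark.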