Big Analytics Roundup (October 12, 2015)
Dell and Silver Lake Partners announce plans to buy EMC for $67 billion, a transaction that is a big deal in the tech world, and mildly interesting for analytics. Dell acquired StatSoft in 2014, but nothing before or since suggests that Dell knows how to position and sell analytics. StatSoft is lost inside Dell, and will be even more lost inside Dell/EMC.
EMC acquired Greenplum in 2010; at the time, GP was a credible competitor to Netezza, Aster and Vertica. It turns out, however, that EMC’s superstar sales reps, accustomed to pushing storage boxes, struggled to sell analytic appliances. Moreover, with the leading data warehouse appliances vertically integrated with hardware vendors, Greenplum was out there in the middle of nowhere peddling an appliance that wasn’t really an appliance.
EMC shifted the Greenplum assets to its Pivotal Software unit, which subsequently open sourced the software it could not sell and exited the Hadoop distribution business under the ODP fig leaf. Alpine Data Labs, which used to be tied to Greenplum like bears to honey, figured out a year ago that it could not depend on Greenplum for growth, and has diversified its platform support.
What’s left of Pivotal Software is a consulting business, which is fine — all of the big tech companies have consulting arms. But I doubt that the software assets — Greenplum, Hawq and MADLib — have legs.
In other news, the Apache Software Foundation announces three interesting software releases:
- Apache Accumulo: Release 1.6.4, a maintenance release.
- Apache Ignite: Release 1.4.0, a feature release with SSL and log4j2 support, faster JDBC driver implementation and more.
- Apache Kafka: Release 0.8.2.2, a maintenance release.
On the MapR blog, Jim Scott takes a “Spark is a fighter jet” metaphor and flies it until the wings fall off.
Dave Ramel summarizes a paper he thinks is too long for you to read. The paper, written by scientists affiliated with IBM and several universities, reports on detailed performance tests for MapReduce and Spark across four different workloads. As I noted in a separate blog post, Ramel’s comment that the paper “calls into question” Spark’s record-setting performance on GraySort is wrong.
Ordinarily I don’t link sponsored content, but this article from Numascale is interesting. Numascale, a Norwegian company, offers analytic appliances with lots of memory; there’s an R appliance, a Spark appliance and a database appliance with MonetDB.
Spark on Amazon EMR
On Slideshare, Amazon’s Jonathan Fritz and Manjeet Chayel summarize best practices for data science with Spark on EMR. The presentation includes an overview of Spark DataFrames, a guide to running Spark on Amazon EMR, customer use cases, tips for optimizing performance and a plug for Zeppelin notebooks.
In Datanami, Alex Woodie describes how Uber uses Spark and Hadoop.
MapR’s Neeraja Rentachintala, Director of Product Management, rethinks SQL for Big Data. Without a trace of irony, the post explains how to bring SQL to NoSQL datastores.
On the Pivotal Big Data blog, Gavin Sherry touts Apache Hawq and Apache MADLib. Hawq is a SQL engine that federates queries across Greenplum Database and Hadoop; MADLib is a machine learning library. MADLib was always open source; Hawq, on the other hand, is a product Pivotal tried and failed to sell. In Datanami, George Leopold reports.
In CIO Today, Jennifer LeClaire speculates that Pivotal is “taking on” Oracle’s traditional database business with this move, which is a colossal pile of horse manure.
At Apache Big Data Europe, Caleb Welton explains Hawq’s architecture in a deep dive. The endorsement from GE’s Jeffrey Immelt is a bit rich considering GE’s ownership stake in Pivotal, but the rest of the deck is solid.
At Apache Big Data Europe, Nick Dimiduk delivers an overview of Phoenix, a relational database layer for HBase. Phoenix includes a query engine that transforms SQL into native HBase API calls, a metadata repository and a JDBC driver. SQL support is broad enough to run TPC benchmark queries. Dimiduk also introduces Apache Calcite, a query parser, compiler and planner framework currently in incubation.
On Forbes, Adrian Bridgewater touts the data blending capabilities of ClearStory Data and Alteryx without explaining why data blending is a thing.
On the AWS Big Data Blog, Songzhi Liu explains how to use Presto and Airpal on EMR. Airpal is a web-based query tool developed by Airbnb that runs on top of Presto.
MADLib is an open source project for machine learning in SQL. Developed by people affiliated with Greenplum, MADLib has always been an open source project, but is now part of the Apache community. Machine learning functionality is quite rich. Currently, MADLib supports PostgreSQL, Greenplum Database and Apache Hawq. In theory, the software should be able to run in any SQL engine that supports UDFs; since Oracle, IBM and Teradata all have their own machine learning stories, I doubt that we will see MADLib running on those platforms. (h/t Hadoop Weekly)
Apache Spark (SparkR)
On the Databricks blog, Eric Liang and Xiangrui Meng review additions to the R interface in Spark 1.5, including support for Generalized Linear Models.
Apache Spark (MLLib)
On the Cloudera blog, Jose Cambronero explains what he did this summer, which included running K-S tests in Spark.
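For readers who have not met the K-S (Kolmogorov-Smirnov) test, here is a minimal sketch of the one-sample statistic in plain Python. It is not the Spark implementation Cambronero describes; the function names `normal_cdf` and `ks_statistic` are made up for illustration, and the code simply measures the largest gap between a sample's empirical distribution and a reference CDF.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, used here as the reference distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: the largest vertical distance between the
    empirical CDF of `sample` and the reference `cdf`.

    Illustrative helper only; it does not compute a p-value.
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so check the gap on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
    print(ks_statistic(data, normal_cdf))
```

A small statistic suggests the sample is consistent with the reference distribution; turning it into a p-value requires the Kolmogorov distribution, which this sketch leaves out.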
At Apache Big Data Europe, DataStax’s Duy Hai Doan explains why you should care about Zeppelin’s web-based notebook for interactive analytics.
H2O and Spark (Sparkling Water)
In a guest post on the Cloudera blog, Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai explain how to create an integrated machine learning pipeline using Spark MLLib, H2O and Sparkling Water, H2O’s interface with Spark.
How Yahoo Does Deep Learning on Spark
Cyprien Noel, Jun Shi, Andy Feng and the Yahoo Big ML Team explain how Yahoo does Deep Learning with Caffe on Spark. Yahoo adds GPU nodes to its Hadoop clusters; each GPU node has 10X the processing power of a commodity Hadoop node. The GPU nodes connect to the rest of the cluster through Ethernet, while Infiniband provides high-speed connectivity among the GPUs.
Caffe is an open source Deep Learning framework developed by the Berkeley Vision and Learning Center (BVLC). In Yahoo’s implementation, Spark assembles the data from HDFS, launches multiple Caffe learners running on the GPU nodes, then saves the resulting model back to HDFS. (h/t Hadoop Weekly)
On the MapR blog, Henry Saputra recaps an overview of Flink’s stream and graph processing from a recent Meetup.
Apache Spark Streaming
On the MapR blog, Jim Scott offers a guide to Spark Streaming.