Big Analytics Roundup (February 22, 2016)

Spark Summit East met in New York City last week; I’ve listed a few highlights below and will publish a separate detailed review.

Spark Summit East

— On the Databricks blog, Scott Walent, Jen Aman, Dave Wang and Wayne Chan review Spark Summit East 2016.

— Key presentations on what’s next in Spark:

  • Matei Zaharia introduces Spark 2.0.
  • Reynold Xin outlines the future of real-time in Spark.
  • Michael Armbrust explains SQL, DataFrames, Datasets and Streaming.

— In the WSJ’s CIO Journal, Steven Norton recaps Forrester analyst Mike Gualtieri’s presentation from the Spark Summit East, stating that Spark typically enters the enterprise “piggybacked on top of Hadoop.”  That’s not what Gualtieri said, and it’s wrong; user surveys show that standalone Spark clusters are more prevalent than Spark-on-Hadoop.

— Hortonworks’ Shaun Connolly uses his ten-minute keynote to tout a customer (WebTrends) using Spark on Hortonworks and tease the audience about a pending joint announcement with Hewlett Packard Labs.


— In TechCrunch, Sankalan Prasad explains time series analysis with Spark and Parquet.

— On the Altocloud blog, Conor Fennell explains how to be agile with Apache Spark.

— On the Cloudera Engineering blog, Juliet Hougland and Sandy Ryza explain how to predict telco churn with Spark machine learning.  The sample dataset has 5,000 observations and 21 features, so one wonders why not just use R on a laptop.  (Corrected, and link changed: this item was originally credited to The Big Data Zone, a screen scraping site that steals and republishes content without attribution.  Apologies to the true authors.  Avoid The Big Data Zone.)

— On the IBM Hadoop Dev blog, Jesse Chen explains how to troubleshoot Spark.


— In CMS Wire, Virginia Backaitis asks four people at Spark Summit why Spark matters and gets four answers.

— Matt Turck asks if Big Data is still a thing and answers the question with an updated infographic.


On the Cloudera Engineering blog, a team of Cloudera engineers reports results from a benchmarking project comparing the performance of Apache Impala, Hive on Tez and Spark SQL.  For test cases, the team used 99 queries “derived from” TPC-DS; they modified the queries for some reason, so results are not comparable to the published TPC-DS results.  The team excluded Presto from the test because it could not run 64 of the 99 queries; that makes sense.  They also excluded Drill because they rarely see Cloudera customers using Drill, which doesn’t make sense.  Not surprisingly, Impala performed well in the test, and demonstrated good performance with ten concurrent users.  I look forward to a response from the Stinger and Drill teams.

Open Source Announcements

— Google releases TensorFlow Serving, an open source library for serving machine learning models.  TensorFlow Serving is an inference engine that handles the deployment of models after training and manages their lifetime.  On the Google Research Blog, Noah Fiedel explains.  Media coverage here.  Note that while TensorFlow Serving can run trained models in parallel, TensorFlow training is still limited to a single machine.

— The Apache Software Foundation announces Apache Arrow as a top-level project.  Arrow is a cross-system data layer for columnar in-memory analytics; it accelerates performance for analytics and simplifies integration among systems such as Drill and Spark.  Seeded with code from Apache Drill, Arrow also includes contributions from Calcite, Cassandra, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark and Storm.  There is extensive coverage of the story:

  • On the Dremio blog, a nameless author elaborates.
  • On the MapR blog, Parth Chandra explains Value Vectors in Drill and Arrow.
  • On the Cloudera Engineering blog, Marcel Kornacker et al. introduce Arrow.
  • Alex Woodie summarizes Arrow’s objectives.
  • Media storypalooza here.

— Apache also announces SystemML 0.9.0, with improvements to APIs, data ingestion, optimizations, language and runtime operators.  The release also includes two new algorithms, Alternating Least Squares and Cubic Splines.

— Apache Drill announces Release 1.5.0, which includes authentication and security for the web interfaces and REST API, experimental query support for Kudu, improved memory allocation, configurable caching for Hive metadata and a slew of bug fixes.

Commercial Announcements

— SAP announces that SAP Predictive Analytics 2.5, planned to be available “in the near future,” will include “native Spark modeling.”  It will be interesting to see what that means.  In his keynote at Spark Summit East, SAP VP Ken Tsai mentioned Spark push-down for queries and OLAP, but not for machine learning.

— Databricks announces general availability of Databricks Dashboards, a facility that enables Databricks users to publish visual reports for consumption by business users.  On the Databricks blog, Dave Wang explains.

— Databricks also announces beta release of Databricks Community Edition, a free version of Databricks’ eponymous platform that combines access to a Spark micro-cluster with online training and tutorials.  On the Databricks blog, Ion Stoica and Matei Zaharia elaborate.

— Qubole announces its intent to donate time on Qubole Data Service (QDS) to university classes.  QDS is a managed service offering for Hive, Spark, MapReduce, Cascading, Presto and HBase.

— MapR announces free Spark training, consisting of three online courses.  MapR also announces a limited-time promotional price of $100 for developer certification (registration required).

— Cloudera announces that it will bundle the latest major release of Apache Kafka into Cloudera Enterprise.  Cloudera also announces its partnership with Continuum Analytics.

— IBM announces Platform Conductor for Spark, a “converged solution” consisting of Apache Spark, Platform software for resource scheduling and workload management, plus IBM Spectrum Scale FPO storage.  It’s designed for folks who want Spark but don’t want YARN or HDFS.

— Mystery vendor Frontline Systems releases XLMiner SDK, a software development kit for data mining, text mining, forecasting and predictive analytics.  The SDK runs in parallel (not distributed) on 32- or 64-bit Windows machines, and features a suite of machine learning algorithms, a sampling algorithm that runs in Spark and APIs for C++, C#, Java, Python and R.

Funding Announcements

In-memory “data fabric” vendor GridGain lands a $15 million “B” round.  Sberbank and MoneyTime Ventures led; Almaz Capital and RTP Ventures participated.

Teradata Watch

Investor KBC Group dumps half of its TDC holdings.  Goldman Sachs and Summit Research issue sell ratings.  Barclays issued a sell rating earlier this month.
