Big Analytics Roundup (January 25, 2016)

This week, we have a new release of Spark-TS, Google’s proposal to create an Apache incubator project for Cloud Dataflow, Forrester’s assessment of Hadoop distributions, a couple of funding stories and a nice crop of explainers.

Just a reminder that Spark Summit East is coming up February 16-18.  I’ll be delivering a talk in the Executive track on Spark and the Future of Advanced Analytics.

On the blog, Andrew Fogg proposes twenty questions to ask if you want to detect a fake data scientist.  His approach is wrong.  The best way to distinguish between real and fake data scientists is to give them some data, specify a problem and ask for a solution.  Real data scientists will solve the problem.  If your HR department won’t let you do that, ask the applicant about problems they’ve solved in previous work.

At the morning paper, Adrian Colyer summarizes changes in networking, memory, storage and processors heading towards our data centers.  Also, Andrew offers a distributed systems seminar reading list.

In CIO, Thor Olavsrud summarizes thoughts on data and analytics trends from five tech executives.  To save you some time, I’ve tabulated the buzzword count below.

  • Hadoop: 24
  • Big Data: 19
  • Cloud: 10
  • IoT: 7
  • Spark: 6
  • Security: 5
  • Insight: 4
  • Data Lake: 4
  • Privacy: 2
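
For anyone who wants to run the same kind of tally on another article, here is a minimal sketch of one way to do it. It is not the method used for the counts above; the term list simply mirrors the tally, and the function names `count_buzzwords` and `format_tally` are made up for illustration. It counts case-insensitive whole-word matches, so "insights" would not count as "Insight" unless you add it to the list.

```python
import re
from collections import Counter

# Hypothetical term list, copied from the tally above.
BUZZWORDS = [
    "Hadoop", "Big Data", "Cloud", "IoT", "Spark",
    "Security", "Insight", "Data Lake", "Privacy",
]


def count_buzzwords(text, terms=BUZZWORDS):
    """Count case-insensitive, whole-word occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        # \b anchors keep "Spark" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def format_tally(counts):
    """Return 'Term: n' lines, most frequent first, skipping zero counts."""
    return [f"{term}: {n}" for term, n in counts.most_common() if n > 0]


if __name__ == "__main__":
    sample = "Hadoop and Spark run in the cloud. Big data on Hadoop needs security."
    for line in format_tally(count_buzzwords(sample)):
        print("  • " + line)
```

Paste the article text into `sample` (or read it from a file) to get a list in the same shape as the one above.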

On his personal blog, Bob Hayes summarizes results from his “Empirically-Based Approach to Understanding the Structure of Data Science”.  Hayes uses factor analysis to extract dimensions from responses to an internet survey about data science.  His conclusion: data scientists have skills in business, programming and statistics, which demonstrates that factor analysis often confirms what you already know.

Explainers

  • On the Databricks blog, Joseph Bradley and Xiangrui Meng explain new machine learning features in Spark 1.6, including pipeline persistence, new algorithms and machine learning in SparkR.
  • In SlideShare, Oracle Labs’ Juan Fumero explains how to integrate FastR with Flink.  Apparently, he couldn’t get his hands on Oracle R.
  • At the DataScience.LA Meetup, Erin LeDell explains H2O machine learning in Python.
  • In Techopedia, Kaushik Pal explains what Flink is.
  • At the Chicago Flink Meetup, Matthew Ring explains Flink and NiFi.
  • In Java Code Geeks, Neeraja Rentachintala explains performance enhancements in Apache Drill 1.4.
  • On the Win-Vector blog, John Mount explains parallel computing in R.  He tries to deliver a “gentle introduction” to parallel computing in R, but mostly shows how hard it is to do parallel computing in R.

Spark-TS Enhanced

On the Cloudera Engineering blog, Sandy Ryza announces Release 0.2.0 of Spark-TS, a Spark library for time series analysis.  New bits: nanosecond precision and java.time, Java API, non-string time series identifiers and lags. (h/t Hadoop Weekly)

Cloud Dataflow to Apache?

On the Google Cloud Platform blog, Google announces a proposal to make Cloud Dataflow an Apache Incubator project.  Cloud Dataflow, which Google contributed to open source in 2014, is a high-level pipeline framework that abstracts programming logic from execution engines, promoting portability.  The framework currently supports “runners” for Apache Flink, Apache Spark, single-node local execution and Google’s Cloud Dataflow service.  Media coverage here, here, here, here and here.
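
To make the "pipeline logic separate from execution engine" idea concrete, here is a toy sketch in plain Python.  It is not the Dataflow SDK and does not use any of its API; the classes `Pipeline`, `EagerRunner` and `LazyRunner` are hypothetical names invented for this illustration.  The point is only that the pipeline is defined once and then handed to interchangeable runners.

```python
class Pipeline:
    """Runner-agnostic description of a sequence of element-wise steps."""

    def __init__(self):
        self.steps = []  # list of (kind, function) pairs, in order

    def map(self, fn):
        self.steps.append(("map", fn))
        return self  # return self so calls can be chained

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self


class EagerRunner:
    """Hypothetical runner: materializes a full list after every step."""

    def run(self, pipeline, data):
        out = list(data)
        for kind, fn in pipeline.steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


class LazyRunner:
    """Hypothetical runner: streams elements through chained iterators."""

    def run(self, pipeline, data):
        stream = iter(data)
        for kind, fn in pipeline.steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


if __name__ == "__main__":
    # Define the logic once ...
    pipeline = Pipeline().map(lambda x: x * 2).filter(lambda x: x > 4)
    # ... then pick an execution strategy at run time.
    for runner in (EagerRunner(), LazyRunner()):
        print(type(runner).__name__, runner.run(pipeline, [1, 2, 3, 4]))
```

Both runners produce the same result from the same pipeline definition, which is the portability argument in miniature; a real runner would translate the steps into Flink or Spark jobs rather than Python loops.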

Forrester: Hadoop Distros Are Competitive, Except for Pivotal

Forrester evaluates Hadoop distributions.  You can pay $2,495 to buy a copy of the report, or go here for a free copy kindly provided by Cloudera, or read Alex Woodie’s summary here.

Key findings:

  • Cloudera has the strongest current offering, followed closely by MapR and IBM.
  • Hortonworks lags in its current offering, but has a slightly better strategy, whatever that means.
  • Pivotal lags significantly behind the others, but is still rated a “Strong Performer.”  In other words, Pivotal is toast.

Forrester rates Cloudera highly on administration, security and data governance; MapR scores well on architecture, data (?) and workload flexibility.  IBM scores well on data governance and workload flexibility.

Hortonworks does a little better on Forrester’s “strategy” dimension mainly due to a better rating on pricing.

Pivotal gets the booby prize, scoring last on most dimensions.  Forrester notes that Pivotal still may be a good choice for the poor souls who spent heavily on Greenplum and Hawq before Pivotal open-sourced them.

Google, Udacity Partner to Deliver TensorFlow Course

On the Google Research Blog, Vincent Vanhoucke announces a free course in Deep Learning with TensorFlow developed together with Udacity.  Udacity labels the course as intermediate to advanced.   The course is not designed for newbies; in addition to prerequisites for the Machine Learning Engineer Nanodegree, Udacity recommends:

  • Two or more years programming experience (with Python)
  • Git/GitHub experience
  • Basic machine learning knowledge
  • Basic statistics knowledge
  • Linear algebra (vectors and matrices)
  • Calculus (differentiation, integration and partial derivatives)

Qubole Closes New Funding

Qubole announces that it has completed a $30 million “C” round, with IVP in the lead together with CRV, Lightspeed and Norwest.  Qubole Data Service (QDS) is a self-service platform featuring Hive, Spark, MapReduce, Pig, Cascading, Presto and HBase services.  QDS runs on AWS, Azure or Google Cloud Platform, and supports connectors to popular databases as well as integration with Tableau, Birst, Qlik, Pentaho and Alteryx.

Splice Machine Bags Cash

RDBMS-on-Hadoop vendor Splice Machine announces a $9 million “C” round from existing investors.  The Register reports.  This latest round follows two “B” rounds in 2014 totaling $18 million.  Splice Machine offers a Derby-based ANSI SQL engine and a query optimizer that routes operations through Spark or MapReduce as appropriate.

New Bright Computing Software Version

Bright Computing announces release of Version 7.2 of its hardware-agnostic cluster management solutions.  Enhancements across all products include improved support for Docker, Kubernetes and Puppet, job-based metrics and resource pools.  Enhancements to Bright Cluster Manager for Big Data include multiple features for Spark administration, enhanced support for Accumulo, Kafka, Pig and Storm, support for Drill and Flink, and support for Zeppelin, Tachyon and Ignite.

Syncsort Surveys People, Learns Stuff

Syncsort releases its second Hadoop Market Adoption Survey, which includes responses from a convenience sample of more than 250 IT decision makers.

Key findings:

  • Asked about compute frameworks of greatest interest, 67% say Spark, 55% say MapReduce, 19% say Tez.  Hortonworks, take note.
  • Respondents rate staff and skills shortages as the most significant barrier to implementing Hadoop.  This is consistent with Syncsort’s previous survey.
  • Respondents see Hadoop as a way to complement existing data warehouse investments, a cost-effective data archiving strategy and a means to take advantage of analytical tools.
  • Consistent with the previous survey, when asked about use cases, respondents cited advanced/predictive analytics more than any other.  This leads me to believe that respondents have no idea what “advanced/predictive analytics” means.
  • Respondents were most likely to cite “increased agility” as a benefit from Hadoop.

Syncsort’s first survey, conducted in 2014, is here.

