Big Analytics Roundup (March 9, 2015)

Here’s a roundup of interesting Big Analytics news and analysis from the past week.  Featured this week: Hortonworks, Alpine, Spark and H2O.


Hortonworks

  • Matt Asay, writing in InfoWorld, deconstructs Hortonworks’ earnings fiasco, and with it the “100% open source” business model.

Alpine Data Labs

  • VentureBeat reports that Alpine Data Labs claims 10X year-over-year growth in user count and billings.
  • MarketWired reports the same story.
  • ITBusinessNet too.

There is no supporting press release from Alpine Data Labs.   The VentureBeat story includes the nugget that Alpine currently has “more than 60” customers; an insider tells me that the number is closer to 75, roughly twice as many as last year.  Alpine has changed its selling model, hiring its own sales force instead of selling through EMC and Pivotal.  This also means that Alpine has changed its messaging from “we run on Greenplum and PostgreSQL, but mostly on Greenplum” to “we run on anything.”  This is an aspiration, to be sure, but a good one.

Alpine has also changed its pricing model from a perpetual server-based model to a user-based subscription model.

Separately, Ventana Research publishes a positive review of Alpine Chorus 5.0.

Apache Spark

  • Jonathan Buckley of Qubole argues that the three open source projects that transformed Hadoop are Hive, Spark and Presto.  It’s an odd choice.  Hive is certainly a key project and Spark is red hot; Presto, not so much.
  • Data prep engine vendor Paxata announces a new release that runs on Spark, releases benchmark report showing significant performance improvements.
  • Databricks announces selection of Databricks Cloud as preferred platform for B2B vendor Radius Intelligence, publishes case study.
  • Forbes profiles Databricks CEO Ion Stoica.
  • Ian Lumb offers eight reasons why Spark is hot.
  • Databricks published a slideshare about Spark DataFrames, which will be available in Spark 1.3 later this month.
  • From the Cloudera blog, an excellent post showing how to build an application for financial markets risk calculations in Spark.


H2O

  • In an interview with KDNuggets, Ted Dunning touts Mahout and H2O over Spark.
  • H2O announces Cloudera certification for its Sparkling Water interface to Spark.


CMSWire rehashes the Gartner Magic Quadrant without adding value.   The author notes breathlessly that “many KNIME enthusiasts are data miners”, and “on the downside, (RapidMiner’s) user base is mostly data scientists”; as if these points are news, and as if there is something extraordinary about data miners and data scientists using data mining and data science tools.

2015: Predictions for Big Analytics

First, a review of last year’s predictions:

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon.  At my lunch table, every single person said his company is currently evaluating Spark.  There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.

(2) “Co-location” will be the latest buzzword.

Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop.  YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.

(3) Graph engines will be hot.

Graph engines did not take off in 2014.  Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production.  However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.

(4) R approaches parity with SAS in the commercial job market.

As of early 2014, when Bob Muenchen last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.

Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists.  I asked Linda what analytic languages hiring managers seek when they hire quants.  “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says.   “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion.”

(5) SAP emerges as the company most likely to buy SAS.

After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014.  The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.

(6) Competition heats up for “easy to use” predictive analytics.

Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch.  Alpine, Alteryx, and RapidMiner all gained market presence in 2014; Dell’s acquisition of StatSoft gives that company the deep pockets it needs for a makeover.  In easy to use cloud analytics, StatWing has added functionality, and IBM Watson Analytics emerged from beta.

Four out of six ain’t bad.  Now looking ahead:

(1) Apache Spark usage will explode.

While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists.  That will change in 2015, for several reasons:

  • The R interface planned for release in Q1 opens the platform to a large and engaged community of users
  • Alteryx, Alpine and other easy to use analytics tools currently support or plan to support Spark RDDs as a data source
  • Databricks Cloud offers an easy way to spin up a Spark cluster

As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.

(2) Analytics in the cloud will take off.

Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud.  And yet, all of the top ten data breaches in 2014 defeated an on-premises security system.  Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.

Cloud is eating the analytics world for three big reasons:

  • Analytic workloads tend to be lumpy and difficult to predict
  • Analytic projects often need to get up and running quickly
  • Analytic service providers operate in a variable cost world, with limited capital for infrastructure

Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others.  For the business user, StatWing and IBM Watson Analytics provide compelling independent cloud-based platforms.

Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services.  Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS, lacks a native access engine for Redshift, and does not support its Hadoop interface with EMR.

(3) Python will continue to gain on R as the preferred open source analytics platform.

The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome.  As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base.  For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up.  According to a recent poll by KDNuggets, more people switch from R to Python than the other way ’round.

Both languages have their virtues. The sheer volume of analytic features in R is much greater than in Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge.  Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.

Python has two key advantages over R.  As a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R.  Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use.  There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.

(4) H2O will continue to win respect and customers in the Big Analytics market.

If you’re interested in scalable analytics but haven’t checked out H2O, you should.  H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau.  H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.

While the software is freely available, H2O offers support and services for an attractive price.  The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and PayPal.

(5) SAS customers will continue to seek alternatives.

SAS once had an almost religious loyalty from its customers.  This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software.  While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration.  While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.

While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives.  This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important.  Even among big banks and pharma companies, though, SAS user headcount is declining.

0xdata Secures an “A” Round, Changes Name

VentureBeat reports that 0xdata has secured $8.9 million in “A” round financing from six investors.  (According to Crunchbase, the funding actually closed July 19).   In the same article, VentureBeat also reports that 0xdata will change its name to H2O, aligning its corporate and product brands.

Since I first reported on 0xdata/H2O earlier this year, the firm has doubled its user base, added key public references (including Cisco, eBay, Nielsen and PayPal), added Deep Learning, Anomaly Detection and a robust scoring engine to its tooling and, most importantly, delivered on integration with Apache Spark.  H2O nicely complements Spark’s MLLib machine learning library; combined, the two projects offer a powerful platform for predictive analytics with Big Data.

0xdata/H2O will host the first H2O World conference November 18 and 19 at the Computer History Museum in Mountain View, CA.


Machine Learning in Hadoop: Part Two

This is the second of a three-part series on the current state of play for machine learning in Hadoop.  Part One is here.  In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines.   This post will focus on machine learning, but it’s worth noting that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first category.  The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase.  RHIPE, a project led by Mozilla’s Saptarshi Guha, offers similar functionality, but without the HBase integration.

Both projects enable R users to implement explicit parallelization in MapReduce.  R users write R scripts specifically intended to be run as mappers and reducers in Hadoop.  Users must have MapReduce skills, and must refactor program logic for distributed execution.  There are some differences between the two projects:

  • RHadoop uses standard R functions for Mapping and Reducing; RHIPE uses unevaluated R expressions
  • RHIPE users work with data in key-value pairs; RHadoop loads data into familiar R data frames
  • As noted above, RHIPE lacks an interface to HBase
  • Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE
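
To make "explicit parallelization" concrete, here is a minimal single-process sketch of the mapper/reducer split that both projects ask the user to write.  It is in Python rather than R, and it does not use the RHadoop or RHIPE APIs; the function names (mapper, reducer, run_job) and the sample records are hypothetical.  The point is only that the analyst must recast a simple calculation (a per-group mean) as key-value emissions plus a per-key aggregation.

```python
from collections import defaultdict
from itertools import chain


def mapper(record):
    """Emit (key, value) pairs for one input record: (group, (x, 1))."""
    group, x = record
    yield group, (x, 1)


def reducer(key, values):
    """Combine every value that shares a key into a per-group mean."""
    total = sum(v for v, _ in values)
    count = sum(n for _, n in values)
    return key, total / count


def run_job(records):
    """Single-process stand-in for the shuffle step that Hadoop performs."""
    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        shuffled[key].append(value)
    return dict(reducer(k, vs) for k, vs in shuffled.items())


# Hypothetical input: (group, measurement) pairs.
records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0)]
means = run_job(records)
```

In a real cluster the mapper and reducer run on different nodes and the framework handles the shuffle; the refactoring burden on the analyst is the same.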

Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLLib.  Both projects have commercial backing, and show robust development activity.  Statistics from GitHub for the thirty days ended February 12 show the following:

  • 0xdata H2O: 18 contributors, 938 commits
  • Apache Spark: 77 contributors, 794 commits

H2O is a project of startup 0xdata, which operates on a services and support business model.  Recent coverage by this blog here;  additional coverage here, here and here.

MLLib is one of several projects included in Apache Spark.  Databricks and Cloudera offer commercial support.  Recent coverage by this blog here and here; additional coverage here, here, here and here.

As of this writing, H2O has more built-in analytic features than MLLib, and its R interface is more mature.  Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is solely focused on machine learning.

Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in its partnership with other machine learning vendors, such as SAS.  There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project.  Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.

Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit.  Development activity on these projects is much less robust than for H2O and Spark.  GitHub statistics for the thirty days ended February 12 speak volumes:

  • Apache Mahout: contributors, 54 commits
  • Vowpal Wabbit: 8 contributors, 57 commits

Neither project has significant commercial backing.  Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project.  (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout).  John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.

Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition.  Mahout’s supervised learning algorithms, however, are weak.  Criticisms of Mahout tend to fall into two categories:

  • The project itself is a mess
  • Mahout’s integration into MapReduce is suitable only for high latency analytics

On the first point, Mahout certainly does seem eclectic, to say the least.  Some of the algorithms are distributed, others are single-threaded; others are simply imported from other projects.  Many algorithms are underdeveloped, unsupported or both.  The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.

“High latency” is code for slow.  Slow food is a thing; “slow analytics” is not a thing.  The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.
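
The I/O pattern behind that complaint can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not Hadoop or Spark code: local JSON files stand in for HDFS, the model is a one-parameter least-squares fit by gradient descent, and the names fit_persisting and fit_in_memory are hypothetical.  Both functions compute the same answer; the first touches storage on every pass, the second only once.

```python
import json
import os
import tempfile


def gradient_step(data, w, lr=0.1):
    """One pass over the data: least-squares gradient step for y ~ w * x."""
    grad = sum((w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def fit_persisting(data_path, passes):
    """MapReduce-style: every pass re-reads data and state, then writes state."""
    state_path = data_path + ".state"
    with open(state_path, "w") as f:
        json.dump({"w": 0.0}, f)
    io_ops = 1
    for _ in range(passes):
        with open(state_path) as f:
            w = json.load(f)["w"]
        with open(data_path) as f:
            data = json.load(f)
        w = gradient_step(data, w)
        with open(state_path, "w") as f:
            json.dump({"w": w}, f)
        io_ops += 3
    return w, io_ops


def fit_in_memory(data_path, passes):
    """In-memory style: read the data once, then iterate without touching disk."""
    with open(data_path) as f:
        data = json.load(f)
    w = 0.0
    for _ in range(passes):
        w = gradient_step(data, w)
    return w, 1


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    with open(path, "w") as f:
        json.dump([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]], f)  # y = 2x exactly
    w_disk, io_disk = fit_persisting(path, passes=20)
    w_mem, io_mem = fit_in_memory(path, passes=20)
```

With twenty passes the persisting version performs 61 file operations against one for the in-memory version; on a cluster, each of those operations is a round trip to distributed storage.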

Vowpal Wabbit has its advocates among data scientists; it is fast, feature rich and runs in Hadoop.  Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis.  Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.

In Part Three, we will cover commercial software for machine learning in Hadoop.

Analytic Startups: 0xdata (Updated May 2014)

Updated May 22, 2014

0xdata (“Hexa-data”) is a small group of smart people from Stanford and Silicon Valley with VC backing and an open source software project for advanced analytics (H2O).  Founded in 2011, 0xdata first appeared on analyst dashboards in 2012 and has steadily built a presence in the data science community since then.

0xdata operates on a services business model, and does not offer commercially licensed software.  The firm has four public reference customers and claims more than 2,000 users.  0xdata has formal partnerships with Cloudera, Hortonworks, Intel and MapR.

0xdata’s H2O project is a library of distributed algorithms designed for deployment in Hadoop or free-standing clusters.  0xdata licenses H2O under the Apache 2.0 open source license.  The development team is very active; in the thirty days ended May 22, 19 contributors pushed 783 commits to the project on Git.

The roadmap is aggressive; as of May 2014 the library includes:

For Generalized Linear Models, k-Means and Gradient Boosting, H2O supports a Grid Search feature enabling users to specify multiple models for simultaneous development and comparison.   This feature is a significant timesaver when the optimal model parameters are unknown (which is ordinarily the case).
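
For readers unfamiliar with the idea, the following is a minimal sketch of a grid search in plain Python.  It does not use H2O's API: the model (a one-variable ridge regression with a closed-form fit), the parameter names l2 and fit_intercept, and the toy data are all assumptions chosen for illustration, and the candidates are fitted one after another rather than simultaneously.

```python
from itertools import product


def fit(train, l2, fit_intercept):
    """Closed-form one-variable ridge regression; returns (slope, intercept)."""
    xs, ys = zip(*train)
    x_bar = sum(xs) / len(xs) if fit_intercept else 0.0
    y_bar = sum(ys) / len(ys) if fit_intercept else 0.0
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / (sxx + l2)
    return w, y_bar - w * x_bar


def mse(model, data):
    """Mean squared error of a (slope, intercept) model on (x, y) pairs."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def grid_search(train, valid, grid):
    """Fit one model per parameter combination; return results sorted by error."""
    names = list(grid)
    results = []
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((mse(fit(train, **params), valid), params))
    return sorted(results, key=lambda r: r[0])


# Toy data generated from y = 2x + 1.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
valid = [(5.0, 11.0), (6.0, 13.0)]
grid = {"l2": [0.0, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}

ranked = grid_search(train, valid, grid)
best_error, best_params = ranked[0]
```

The user specifies the grid once and gets back every candidate ranked by validation error, which is the timesaver described above.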

Users interact directly with the software through a web browser or REST API.  Alternatively, R users can use the H2O.R package to invoke algorithms from RStudio or an alternative R development environment.  (Video demo here).  Scala users can work with H2O through the Scalala library.

For Hadoop deployment, H2O supports CDH4.x, MapR 2.x and AWS EC2.   H2O integrates with HDFS, and is co-located within Hadoop.   At present, H2O supports CSV, Gzip-compressed CSV, MS Excel (XLS), ARFF, Hive file format, “and others”.

Each H2O algorithm supports scoring and prediction capability.   There is currently no facility for PMML export; this is unnecessary if H2O is deployed in Hadoop (since one can simply use the native prediction capability).

In March, the Apache Mahout project announced that it will support H2O.

Strata Report: Advanced Analytics in Hadoop

Here is a quick review of the capabilities for advanced analytics in Hadoop for five vendors at the recent Strata NYC conference:



0xdata

  • H2O (open source project)
  • h2o (R package)


Smart people from Stanford with VC backing and a social media program.   Services business model with open source software.  H2O is an open source library of algorithms designed for deployment in Hadoop or free-standing clusters;  aggressive vision, but currently available functionality limited to GLM, k-Means, Random Forests.   Update: 0xdata just announced H2O 2.0, which includes Distributed Trees and Regression, such as Gradient Boosting Machine (GBM), Random Forest (RF), Generalized Linear Modeling (GLM), k-Means and Principal Component Analysis (PCA).  They also claim to run “100X faster than other predictive analytics providers”, although this claim is not supported by evidence.  R users can interface through the h2o package.  Limited customer base.  Partners with Cloudera and MapR.

Key Points

  • True open source model
  • Comprehensive roadmap
  • Limited functionality
  • Limited user base
  • Performance claims undocumented

Alpine Data Labs


  • Alpine 2.8


Alpine targets a business user persona with a visual workflow-oriented interface (comparable to SAS Enterprise Miner or SPSS Modeler).   Supports a reasonably broad range of analytic features.  Claims to run “in” a number of databases and Hadoop distributions, but company is opaque about how this works.  (Appears to be SQL/HiveQL push-down).   In practice, most customers seem to use Alpine with Greenplum.  Thin sales and customer base relative to claimed feature mix suggests uncertainty about product performance and stability.  Partners with Pivotal, Cloudera and MapR.

Key Points

  • Reasonable option for users already committed to Greenplum Database
  • Limited partner and user ecosystem
  • Performance and stability should be vetted thoroughly in POC




Oracle

Oracle R Distribution (ORD) is a free distribution of R with bug fixes and performance enhancements; Oracle R Enterprise is a supported version of ORD with additional enhancements (detailed below).

Oracle Advanced Analytics (an option of Oracle Database Enterprise Edition) bundles Oracle Data Mining, a distributed data mining engine that runs in Oracle Database, and Oracle R Enterprise.   Oracle Advanced Analytics provides an R to SQL transparency layer that maps R functions and algorithms to native in-database SQL equivalents.  When in-database equivalents are not available, Oracle Advanced Analytics can run R commands under embedded R mode.

Oracle R Connector for Hadoop (ORCH) is an R interface to Hadoop; it enables the user to write MapReduce tasks in R and interface with Hive.  As of ORCH 2.1.0, there is also a fairly rich collection of machine learning algorithms for supervised and unsupervised learning that can be pushed down into Hadoop.

Key Points

  • Good choice for Oracle-centric organizations
  • Oracle Data Mining is a mature product with an excellent user interface
  • Must move data from Hadoop to Oracle Database to leverage OAA
  • Hadoop push-down from R requires expertise in MapReduce



SAS

  • SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server


SAS/ACCESS Interface to Hadoop enables SAS users to pass Hive, Pig or MapReduce commands to Hadoop through a connection and move the results back to the SAS server.   With SAS/ACCESS you can haul your data out of Hadoop, plug it into SAS and use a bunch of other SAS products, but that architecture is pretty much a non-starter for most Strata attendees.   Update:  SAS has announced SAS/ACCESS for Impala.

Visual Analytics is a Tableau-like visualization tool with limited predictive analytic capabilities; LASR Server is the in-memory back end for Visual Analytics.  High Performance Analytics is a suite of distributed in-memory analytics.   LASR Server and HPA Server can be co-located in a Hadoop cluster, but require special hardware.  Partners with Cloudera and Hortonworks.

Key Points

  • Legacy SAS connects to Hadoop, does not run in Hadoop
  • SAS/ACCESS users must know exact Hive, Pig or MapReduce syntax
  • Visual Analytics cannot work with “raw” data in Hadoop
  • Minimum hardware requirements for LASR and HPA significantly exceed standard Hadoop worker node specs
  • High TCO, proprietary architecture for all SAS products



Skytree

  • Skytree Server


Academic machine learning project (FastLab, at Georgia Tech); with VC backing, launched as commercial software vendor January 2013.  Server-based technology, can connect to a range of data sources, including Hadoop.  Programming interface; claims ability to run from R, Weka, C++ and Python.  Good library of algorithms.  Partners with Cloudera, Hortonworks, MapR.  Skytree is opaque about technology and performance claims.

Key Points

  • Limited customer base, no announced sales since company launch
  • Hadoop integration is a connection, not “inside” architecture
  • Performance claims should be carefully vetted