The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is more agile when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive Toolkit, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. R ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDnuggets poll; and first in the Rexer survey. It ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles of a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the useR! conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDnuggets poll, 46% said they had used Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

The Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Package Index (PyPI). Goals of the project include updating the visual identity, making packages more discoverable, and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. In Databricks’ annual user survey, 18% of respondents reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDnuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

data Artisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.
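To make the steady-state argument concrete, here is a minimal sketch in plain Python (it is not Flink code). It assumes a simple linear model fit by least squares; the class name StreamingLinearFit, the helper functions, and the toy data are all hypothetical. The streaming version updates a handful of running sums one event at a time, the batch version fits the same data in one pass, and for this model the two should agree up to floating-point rounding.

```python
import random


class StreamingLinearFit:
    """Least-squares fit of y = a + b*x, updated one event at a time.

    Hypothetical helper for illustration; it keeps only running sums.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        # Fold a single event into the sufficient statistics.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def batch_fit(xs, ys):
    """Ordinary least squares on the full data set at once."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def make_events(n, seed=0):
    """Toy steady-state process (an assumption): y = 2 + 3x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


xs, ys = make_events(1000)

stream = StreamingLinearFit()
for x, y in zip(xs, ys):
    stream.update(x, y)

stream_a, stream_b = stream.coefficients()
batch_a, batch_b = batch_fit(xs, ys)
print("streaming:", stream_a, stream_b)
print("batch:    ", batch_a, batch_b)
```

If the process drifts, so that later events come from a different relationship than earlier ones, the two fits still agree with each other, but neither describes the future, which is the point of the paragraph above.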

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADlib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which raises the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which suggests that few others use it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises, if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADlib

MADlib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered the Apache Incubator.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as something like an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed (Deep) Machine Learning Community (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.
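For readers who have not met gradient boosting, here is a minimal sketch of the basic idea in plain Python. To be clear, this is ordinary gradient boosting for squared error with one-split regression stumps, not the XGBoost algorithm, which adds regularization, second-order gradient information and a good deal of systems engineering. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to what is left over."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (an assumption): a smooth curve that no single stump can fit.
xs = [i / 10.0 for i in range(40)]
ys = [x * x for x in xs]

baseline_mse = mse(boost(xs, ys, rounds=0), xs, ys)  # mean-only model
short_mse = mse(boost(xs, ys, rounds=5), xs, ys)
long_mse = mse(boost(xs, ys, rounds=50), xs, ys)
print(baseline_mse, short_mse, long_mse)
```

Each round only has to correct the errors the previous rounds left behind, which is why a pile of very weak learners can add up to an accurate model.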

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under the GPL license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like Microsoft Cognitive Toolkit (formerly CNTK) and TensorFlow, Theano represents neural networks as a symbolic graph.
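To show what "symbolic graph" means in practice, here is a toy sketch in plain Python. It is not Theano's API; the Node, Const, Var, Add and Mul classes are hypothetical and handle scalars only. The point is the workflow: the expression is first built as a graph of objects, a second graph for the derivative is produced from it, and numbers enter only when the graph is evaluated.

```python
class Node:
    """Base class for nodes in a symbolic expression graph."""

    def __add__(self, other):
        return Add(self, as_node(other))

    def __radd__(self, other):
        return Add(as_node(other), self)

    def __mul__(self, other):
        return Mul(self, as_node(other))

    def __rmul__(self, other):
        return Mul(as_node(other), self)


class Const(Node):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value

    def grad(self, wrt):
        return Const(0.0)


class Var(Node):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]

    def grad(self, wrt):
        # The derivative is 1 with respect to itself, 0 otherwise.
        return Const(1.0 if self is wrt else 0.0)


class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

    def grad(self, wrt):
        return self.left.grad(wrt) + self.right.grad(wrt)


class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)

    def grad(self, wrt):
        # Product rule, expressed as a new graph.
        return self.left.grad(wrt) * self.right + self.left * self.right.grad(wrt)


def as_node(value):
    return value if isinstance(value, Node) else Const(float(value))


# Define: build the graph for (w*x + b)^2 without any numbers.
x, w, b = Var("x"), Var("w"), Var("b")
linear = w * x + b
expr = linear * linear

# Derive: a second graph for d(expr)/dw.
d_expr_dw = expr.grad(w)

# Evaluate: only now do values enter.
env = {"x": 2.0, "w": 3.0, "b": 1.0}
value = expr.evaluate(env)        # (3*2 + 1)^2 = 49.0
slope = d_expr_dw.evaluate(env)   # 2 * (3*2 + 1) * 2 = 28.0
print(value, slope)
```

A real framework would also simplify the graph before running it (the sketch leaves multiplications by zero in place) and would compile it for CPUs or GPUs, which is where the "optimize" step comes in.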

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for cuDNN v5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed (Deep) Machine Learning Community (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and JavaScript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. It is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC), is a deep learning framework released under an open source BSD license.  Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++.  Users interact with Caffe through a Python API or through a command line interface.  Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLlib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLlib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys: the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the sinking of the Titanic]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software is nothing special; there are plenty of open source alternatives, like Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

Big Analytics Roundup (November 9, 2015)

My roundup of the Spark Summit Europe is here.

Two important events this week:

  • H2O World starts today and runs through Wednesday at the Computer History Museum in Mountain View CA.   Yotam Levy summarizes here and here.
  • Open Data Science Conference meets November 14-15 at the Marriott Waterfront in SFO

Five backgrounders and explainers:

  • At HUG London, Apache’s Ufuk Celebi delivers a nice intro to Flink.
  • On the Databricks blog, Yesware’s Justin Mills explains how his team migrates Spark applications from concept through prototype through production.
  • On Slideshare, Alpine’s Holden Karau delivers an overview of Spark with Python.
  • Chloe Green wakes from a three-year slumber and discovers Spark.
  • On the Cloudera Engineering blog, Madhu Ganta explains how to build a CEP app with Spark and Drools.

Third quarter financials drive the news:

(1) MapR: We Grew 160% in Q3

MapR posts its biggest quarter ever.

(2) HDP: We Grew 168% in Q3

HDP loses $1.33 on every dollar sold, tries to make it up on volume.  Stock craters.

(3) Teradata: We Got A Box of Steak Knives in Q3

Teradata reports more disappointing sales as customers continue to defer investments in big box solutions for data warehousing.  This is getting to be a habit with Teradata; the company missed revenue projections for 2014 as well as the first and second quarters of this year.  Any company can run into headwinds, but a management team that consistently misses targets clearly does not understand its own business and needs to go.

Full report here.

(4) “B” Round for H2O.ai

Machine learning software developer H2O.ai announces a $20 million Series B round led by Paxion Capital Partners.  H2O.ai leads development of H2O, an open source project for distributed in-memory machine learning.  The company reports 25 new support customers this year.

(5) Fuzzy Logix Lands Funds

In-database analytics vendor Fuzzy Logix announces a $5 million “A” round from New Science Ventures.  Fuzzy offers a library of analytic functions that run in a number of high-performance databases and in HiveQL.

(6) New Optimization Package for Spark

On the Databricks blog, Aaron Staple announces availability of Spark TFOCS, an optimization package based on the eponymous Matlab package.  (TFOCS=Templates for First Order Conic Solvers.)

(7) WSO2 Delivers IoT App on Spark 

IoT middleware vendor WSO2 announces Release 3.0 of its open source Data Analytics Server (DAS) platform.   DAS collects data streams and applies batch, real-time or interactive analytics; predictive analytics are in the roadmap.  For streaming data sources, DAS supports Java agents, JavaScript clients and 100+ connectors.  The software runs on Spark and Lucene.

(8) Hortonworks: We Aren’t Irrelevant

On the Hortonworks blog, Vinay Shukla and Ram Sriharsha tout Hortonworks’ contributions to Spark, including ORC support, an Ambari stack definition for Spark, tighter integration between Hive and Spark, minor enhancements to ML and user-facing documentation.  Looking at the roadmap, they discuss Magellan for geospatial and Zeppelin notebooks. (h/t Hadoop Weekly).

(9) Apache Drill Delivers Fast SQL-on-Laptop

On the MapR blog, Mitsutoshi Kiuchi offers a case study in how to run a silly benchmark.

Comparing the functionality of Drill and Spark SQL, Kiuchi argues that Drill “supports” NoSQL databases but Spark does not, relegating Spark’s packages to a footnote.  “Support” is a loaded word with open source software; technically, nothing is supported unless you pay for it, in which case the scope of support is negotiated as part of the SLA.  It’s also worth noting that MongoDB developed Spark’s interface to MongoDB (for example), which provides a certain amount of confidence.

Kiuchi does not consider other functional areas, such as security, YARN support, query fault tolerance, the user interface, metastore management and view support, where Drill comes up short.

In a previously published performance test of five SQL engines, Spark successfully ran nine out of eleven queries, while Drill ran eight out of ten.  On the eight queries both engines ran, Drill was slightly faster on six.  For this benchmark, Kiuchi runs three queries on his laptop with a tiny dataset.

As a general rule, one should ignore SQL-on-Hadoop benchmarks unless they run industry standard queries (e.g. TPC) with large datasets in a distributed configuration.

Python for Analytics

A reader complains that I did not include Python in a survey of Machine Learning in Hadoop.  It’s a fair point.  There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match.  Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered.  In this post I review the available evidence about the incidence of Python use for analytics; in a separate post, I will survey Python’s capabilities.

Python is a general purpose programming language whose syntax enables programmers to write efficient and concise code.  The Python Software Foundation manages an open source reference implementation written in C and nicknamed CPython.  Alternative implementations include Jython, written in Java; IronPython, for .NET; and PyPy, which includes a just-in-time compiler.

There is no dispute that Python is a popular language for general-purpose programming; according to the Transparent Language Popularity Index (TLPI), Python currently ranks seventh in popularity behind C, Java, Objective-C, C++, Basic and PHP.  By the same measure, exclusively analytic languages rank lower:

  • #14. R
  • #19. MATLAB
  • #26. Scala
  • #31. SAS

Measures like TLPI or the Tiobe Community Programming Index tell us something about the overall popularity of a language, but relatively little about its popularity for analytics. Many Python users aren’t at all engaged in analytics, and many analysts don’t use Python.

Python performs very well in Bob Muenchen’s analysis of analytic job postings (which he has perfected into a science).  Muenchen’s analysis shows that Python ranks third in analytic job postings, behind Java and SAS.  Python and R were at rough parity in job postings until early January 2013; since then, Python has outpaced R.

Surveys of analytic users show a mixed picture, reflecting differences in sampling and question construction.  In the 2013 Rexer survey, 64% of all respondents report writing their own code; the top reported choice is SQL (43%), followed by Java (26%) and Python (24%).  (These results are difficult to square with the overall finding that 70% of the respondents use R, which requires the user to write code.)   Rexer’s sample includes a mix of Power Analysts and Business Analysts, but relatively few Data Scientists.  (See this post for a definition of Analytic User Personas).

KDnuggets conducted its annual software poll in 2013; Python ranked fifth behind RapidMiner, R, Excel and Weka/Pentaho.   In a separate KDnuggets poll explicitly focused on programming languages for analytics, data mining and data science, Python ranked second behind R.  The KDnuggets online poll is a convenience sample (which is vulnerable to response bias), but there is no reason to believe that either R or Python users are over-represented relative to one another.  The KDnuggets community consists largely of Data Scientists and Power Analysts.

A follow-up poll by KDnuggets expressly about switching between Python and R found that more people use R than Python, and users switching from other tools are more likely to choose R over Python; however, more users are switching from R to Python than from Python to R.  The graphic below illustrates these relationships.

Switching Between Python and R

O’Reilly Media’s survey of data scientists at the 2012 and 2013 Strata conferences shows Python ranked third, behind SQL and R.  (The survey does not break out responses from 2012 and 2013).  More interesting is O’Reilly’s analysis of how reported usage of each tool correlates with all of the others; the graph shown below depicts all of the positive correlations significant at p=.05.

Strata Tool Correlation

The most striking thing in this graph is the separation between open source tools at the top of the graph and commercial tools at the bottom; respondents tend to use one or the other, but not both.  The dense network among open source tools indicates that those who use any open source tool tend to use many others.  (Weka’s isolation from other tools in the graph indicates either that (a) Weka is a really awesome tool or (b) Weka users have a unique perspective on life. Or both.)

Among respondents to O’Reilly’s survey, Python and R use are correlated, and so are Java and R use; but Python and Java use are not correlated.  Python and R use both correlate with Apache Hadoop and graph engines; Python also correlates with other components of the Hadoop ecosystem, such as Hive, Mahout and HBase.

To summarize: Python usage is firmly embedded in the open source analytics ecosystem; however, usage is largely concentrated among Data Scientists, with lower penetration among Power Analysts (for whom R and SAS remain the preferred languages).  The KDnuggets data suggests that new entrants to analytic programming are more likely to choose R over Python, but the rate of switching from R to Python suggests that Python addresses needs not currently met with R.

Arguments by Python advocates that Python will outpace R because it is easier to use strike me as silly.  R is not difficult to learn for motivated users.  Unmotivated users aren’t going to choose Python over R; they will choose a business analytics tool like Alpine, Alteryx or RapidMiner and skip coding entirely.  Analysts who want to code will choose a language for its functionality and not the elegance of its syntax.

Automated Predictive Modeling

A colleague asks: can we automate predictive modeling?

How we answer the question depends on the context.   Consider the two variations on the question below, with more precise wording:

  1. Can we completely eliminate the need for expertise in predictive modeling — so that an “ordinary business user” can do it?
  2. Can we make expert analysts more productive by automating certain repetitive tasks?

The first form of the question — the search for “business user” analytics — is a common vision among software marketing folk and industry analysts; it is based on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics.   That premise is largely false, for reasons that warrant a separate blog post; for now, let’s just stipulate that the answer is no, it is not possible to eliminate human expertise from predictive modeling, for the same reason that robotic surgery does not eliminate the need for cardiologists.

However, if we focus on the second form of the question and concentrate on how to make expert analysts more productive, the situation is much more promising.  Many data preparation tasks are easy to automate; these include such tasks as detecting and eliminating zero-variance columns, treating missing values and handling outliers.  The most promising area for automation, however, is in model testing and assessment.
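
To make the data preparation point concrete, here is a minimal Python sketch of the three tasks just named. It assumes numeric columns held as plain lists, with None marking a missing value. The function names, the median imputation and the clipping rule based on median absolute deviation (MAD) are illustrative choices of mine, not a description of any particular product.

```python
from statistics import median


def drop_zero_variance(columns):
    """Keep only columns whose observed (non-missing) values vary."""
    kept = {}
    for name, values in columns.items():
        observed = {v for v in values if v is not None}
        if len(observed) > 1:
            kept[name] = values
    return kept


def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def clip_outliers(values, k=3.0):
    """Pull values more than k MADs from the median back to the fence."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return list(values)  # no spread to measure against; leave as-is
    low, high = center - k * mad, center + k * mad
    return [min(max(v, low), high) for v in values]


def prepare(columns):
    """Run the three steps in order on a {name: [values]} mapping."""
    prepared = {}
    for name, values in drop_zero_variance(columns).items():
        prepared[name] = clip_outliers(impute_median(values))
    return prepared
```

Calling prepare on a small table drops any constant column, fills the gaps and reins in extreme values in one pass. A real tool would let the analyst override each rule, since the right treatment for missing values and outliers depends on the problem.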

Optimizing a predictive model requires experimentation and tuning.  For any given problem, there are many available modeling techniques, and for each technique there are many ways to specify and parameterize a model.  For the most part, trial and error is the only way to identify the best model for a given problem and data set.  (The No Free Lunch theorem formalizes this concept.)

Since the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model.  In applied predictive analytics, however, the analyst’s time is strictly limited; a client in the marketing services industry reports an SLA of thirty minutes or less to build a predictive model.  Strict time constraints do not permit much time for experimentation.

Analysts tend to deal with this problem by settling for sub-optimal models, arguing that models need only be “good enough,” or defending use of one technique above all others.  As clients grow more sophisticated, however, these tactics become ineffective.  In high-stakes hard-money analytics — such as trading algorithms, catastrophic risk analysis and fraud detection — small improvements in model accuracy have a bottom line impact, and clients demand the best possible predictions.

Automated modeling techniques are not new.  Before Unica launched its successful suite of marketing automation software, the company’s primary business was advanced analytics, with a particular focus on neural networks.  In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model.   Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four different types of predictive models.  Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.

Two other commercial attempts at automated predictive modeling date from the late 1990s.  The first, MarketSwitch, was less than successful.  MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers”.  Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine and replaced the “automated modeling” capability with outsourced analytic services.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization.   The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for Marketing analytics, which it attempted to sell directly to C-level executives.  This effort was modestly successful, leading to sale of the company in 2013 to SAP for an estimated $40 million.

In the last several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products.  In 2010, SAS introduced SAS Rapid Modeler, an add-in to SAS Enterprise Miner.  Rapid Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection and model selection.  The user specifies a data set and response measure; Rapid Modeler determines whether the response is continuous or categorical, and uses this information together with other diagnostics to test a range of modeling techniques.  The user can control the scope of techniques to test by selecting basic, intermediate or advanced methods.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster and Auto Numeric nodes.  The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning and variable recasting.   The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules and set limits on model training.

All of the software products discussed so far are commercially licensed.  There are two open source projects worth noting: the caret package in open source R and the MLBase project.  The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques.   The package includes pre-processing tools to support tasks such as dummy coding, detecting zero variance predictors, identifying correlated predictors as well as tools to support model training and tuning.  The training function in caret currently supports 149 different modeling techniques; it supports parameter optimization within a selected technique, but does not optimize across techniques.  To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results.
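
The script the analyst has to write is essentially a loop over techniques and parameter settings that records a score for each run. A rough sketch of that idea follows, in Python rather than R purely for illustration. The three toy techniques, the TEST_PLAN structure and the function names are all hypothetical, and none of this is caret's API.

```python
from itertools import product
from statistics import mean


def fit_mean(train):
    """Baseline: always predict the training mean."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_knn(train, k=3):
    """k-nearest-neighbor regression on a single predictor."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def fit_line(train):
    """Ordinary least squares with one predictor."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sxx
    return lambda x: y_bar + slope * (x - x_bar)


# Test plan: technique name -> (fit function, grid of parameter values).
TEST_PLAN = {
    "mean": (fit_mean, {}),
    "knn": (fit_knn, {"k": [1, 3, 5]}),
    "line": (fit_line, {}),
}


def rmse(model, rows):
    """Root mean squared error of a fitted model on (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows) ** 0.5


def run_test_plan(plan, train, holdout):
    """Fit every technique/parameter combination; return results, best first."""
    results = []
    for name, (fit, grid) in plan.items():
        keys = list(grid)
        for combo in product(*(grid[key] for key in keys)):
            params = dict(zip(keys, combo))
            model = fit(train, **params)
            results.append(
                {"technique": name, "params": params, "rmse": rmse(model, holdout)}
            )
    return sorted(results, key=lambda r: r["rmse"])
```

The point of the sketch is the shape of the bookkeeping: one table of results spanning all techniques, so the comparison across techniques is as routine as the tuning within each one.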

MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group, is an ambitious effort to develop a scalable machine learning platform on Apache Spark.  The ML Optimizer seeks to simplify machine learning problems for end users by automating the model selection task so that the user need only specify a response variable and set of predictors.   The Optimizer project is still in active development, with Alpha release expected in 2014.

What have we learned from various attempts to implement automated predictive modeling?  Commercial startups like KXEN and MarketSwitch only marginally succeeded because they tried to oversell the concept as a means to replace the analyst altogether.  Most organizations understand that human judgement plays a key role in analytics, and they aren’t willing to entrust hard money analytics entirely to a black box.

What will the next generation of automated modeling platforms look like?  There are seven key features that are critical for an automated modeling platform:

  • Automated model-dependent data transformations
  • Optimization across and within techniques
  • Intelligent heuristics to limit the scope of the search
  • Iterative bootstrapping to expedite search
  • Massively parallel design
  • Platform agnostic design
  • Custom algorithms

Some methods require data to be transformed in certain specific ways; neural nets, for example, typically work with standardized predictors, while Naive Bayes and CHAID require all predictors to be categorical.  The analyst should not have to perform these operations manually; instead, the transformation operations should be built into the test plan script and run automatically; this ensures the maximum number of possible techniques for any data set.
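
As a sketch of what "built into the test plan" could mean, the Python below keeps a lookup from technique to the transformation it needs and applies it automatically. The technique labels, the REQUIRED_TRANSFORM table and the equal-width binning rule are assumptions made for illustration; a production tool would offer richer choices, such as supervised binning.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit variance (typical for neural nets)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def to_categories(values, bins=3):
    """Equal-width binning for techniques that need categorical predictors."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # guard against a constant column
    return [min(int((v - low) / width), bins - 1) for v in values]


# Hypothetical lookup: which transformation each technique requires.
REQUIRED_TRANSFORM = {
    "neural_net": standardize,
    "naive_bayes": to_categories,
    "chaid": to_categories,
}


def prepare_for(technique, predictors):
    """Return predictors transformed as the named technique requires."""
    transform = REQUIRED_TRANSFORM.get(technique)
    if transform is None:
        # No special requirement recorded: pass the data through unchanged.
        return {name: list(values) for name, values in predictors.items()}
    return {name: transform(values) for name, values in predictors.items()}
```

With a table like this, adding a technique to the test plan never requires the analyst to remember its data requirements; the wrapper looks them up.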

To find the best predictive model, we need to be able to search across techniques and to tune parameters within techniques.  Potentially, this can mean a massive number of model train-and-test cycles to run; we can use heuristics to limit the scope of techniques to be evaluated based on characteristics of the response measure and the predictors.   (For example, a categorical response measure rules out a number of techniques, and a continuous response measure rules out a different set of techniques).  Instead of a brute force search for the best technique and parameterization, a “bootstrapping” approach can use information from early iterations to specify subsequent tests.
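
Both ideas can be sketched in a few lines of Python. The catalog of techniques and the ten-level threshold are made up for illustration. For the "bootstrapping" step I have substituted successive halving, one well-known way to let early, cheap iterations decide which candidates deserve more expensive tests; it is my stand-in, not necessarily the approach any vendor uses. The evaluate callback is a placeholder for whatever train-and-test routine is available.

```python
# Hypothetical catalog: which response types each technique can model.
CATALOG = {
    "logistic_regression": {"categorical"},
    "naive_bayes": {"categorical"},
    "linear_regression": {"continuous"},
    "decision_tree": {"categorical", "continuous"},
    "neural_net": {"categorical", "continuous"},
}


def response_type(values, max_levels=10):
    """Heuristic: numeric with many distinct values means continuous."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric and len(set(values)) > max_levels:
        return "continuous"
    return "categorical"


def eligible_techniques(response, catalog=CATALOG):
    """Limit the search to techniques that suit the response measure."""
    kind = response_type(response)
    return [name for name, kinds in catalog.items() if kind in kinds]


def successive_halving(candidates, evaluate, budget=100):
    """Score all candidates cheaply, keep the better half, double the budget."""
    survivors = list(candidates)
    if not survivors:
        raise ValueError("no candidates to evaluate")
    while len(survivors) > 1:
        scores = {c: evaluate(c, budget) for c in survivors}
        ranked = sorted(survivors, key=scores.get, reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2  # spend more (rows, iterations, time) on fewer candidates
    return survivors[0]
```

Here budget could mean sample rows, training iterations or wall-clock seconds. The weak candidates are eliminated while tests are still cheap, which matters when the analyst has thirty minutes rather than thirty days.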

Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing.  Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural application for cloud (a point that deserves yet another blog post of its own).  The next generation of automated predictive modeling will be in the cloud from its inception.
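
Because each train-and-test cycle is independent of the others, dispatching them in parallel takes very little code. The Python sketch below uses a local thread pool from the standard library as a stand-in for a cluster or cloud service; for CPU-bound model fitting one would pass a process pool or a distributed executor instead. The job format and the train_and_test callback are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(jobs, train_and_test, max_workers=4,
                 executor_cls=ThreadPoolExecutor):
    """Run independent train-and-test jobs concurrently; best score first.

    jobs is a list of (technique, params) tuples, and train_and_test is a
    caller-supplied function that takes one job and returns a score where
    higher is better.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # map() returns scores in the same order as the submitted jobs.
        scores = list(pool.map(train_and_test, jobs))
    return sorted(zip(scores, jobs), key=lambda pair: pair[0], reverse=True)
```

The same function works unchanged whether there are three jobs or three thousand; only the executor and the worker count need to scale, which is exactly the elastic workload that favors cloud deployment.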

Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions.  Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer that they do not own; hence, as a practical matter we should assume that the next generation predictive modeling platform will work with open source machine learning libraries, such as those available for R or Python.

We can’t eliminate the need for human expertise from predictive modeling.   But we can build tools that enable analysts to build better models.