The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is agiler when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDNuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles to a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDNuggets poll, 46% said they use Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Packaging Index (PyPI.) Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDNuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which begs the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team has delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as like an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed Machine Learning Common (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under GPU license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Common (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and Javascript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. IT is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC) is a deep learning framework released under an open source BSD license.  Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++.  Users interact with Caffe through a Python API or through a command line interface.  Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is in on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software itself is nothing special; there are plenty of open source alternatives, like Apache Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

Spark 1.4 Released

On June 11, the Spark team announced availability of Release 1.4.  More than 210 contributors from 70 different organizations contributed more than 1,000 patches.  Spark continues to expand its contributor base, the best measure of health for an open source project.

Screen Shot 2015-06-12 at 2.00.20 PM

Spark Core

The Spark team continues to improve Spark operability, performance and compatibility.  Key enhancements include:

  • The first phase in Project Tungsten performance improvements, a cache-friendly sort algorithm
  • Also for improved performance, serialized shuffle output
  • For the Spark UI, visualization for Spark DAGs and operational monitoring
  • A REST API for application information, such as job, stage, task and storage status
  • For Python users, support for Python 3.x, plus external spilling for Python groupByKey operations
  • Two YARN enhancements: support for YARN on EC2 and security for long-running YARN applications
  • Two Mesos enhancements: Docker support and cluster mode.

DataFrames and SQL

This release includes extensions of analytic functions for DataFrames, operational utilities for Spark SQL and support for ORCFile format.

A complete list of enhancements to the DataFrame API is here.

R Interface

AMPLab released a developer version of SparkR in January 2014.  In June 2014, Alteryx and Databricks announced a partnership to lead development of this component.  In March, 2015, SparkR officially merged into Spark.

SparkR offers an interface to use Apache Spark from R.  In Spark 1.4, SparkR supports operations like selection, filtering and aggregation on large datasets.  It’s important to note that as of this release SparkR does not support an interface to MLLib, Streaming or GraphX.

Machine Learning

In Spark 1.4, ML pipelines graduate from alpha release, add feature transformations (Vector Assembler, String Indexer, Bucketizer etc.) and a Python API.  Additional enhancements to ML include:

There appears to be an effort under way to rebuild MLLib’s supervised learning algorithms in ML.

Enhancements to MLLib include:

There is a single enhancement to GraphX in Spark 1.4, a personalized PageRank.  Spark’s graph analytics capabilities are comparatively static.

Streaming

The enhancements to Spark Streaming include improvements to the UI plus enhanced support for Kafka and Kinesis and a pluggable interface for write ahead logs.  Enhanced support for Kafka includes better error reporting, support for Kafka 0.8.2.1 and Kafka with Scala 2.11, input rate tracking and a Python API for Kakfa direct mode.

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, Spark SQL graduates from Alpha, new algorithms in MLLib and Spark Streaming, a direct Kafka API for Spark Streaming, plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics With Sparkwrites on tuning Spark jobs, on the Cloudera Engineering blog
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process Terabyte scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on Git.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

H2O

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.   Ryft’s Christian Shrauder blogs about FGPA.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze highly dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab‘s Jiannen Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

RapidMiner

SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that AirBNB has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal is an application that “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

Zementis

Still More Comments on Microsoft and Revolution Analytics

Three full business days post-announcement, and stories continue to roll in.

Stephen Sowyer of TDWI writes an excellent summary of what Microsoft will likely do with Revolution Analytics.  He correctly notes, for example, that Microsoft is unlikely to develop a business user interface for R with code-generating capabilities (comparable to SAS Enterprise Guide, for example).  This is difficult to do, and the demand is low; people who care about R tend to like working in a programming environment, and value the ability to write their own code.  Business users, on the other hand, tend to be indifferent about the underlying code generated by the application.

Since Revolution’s Windows-based IDE requires some investment to keep it competitive, the most likely scenario is that Microsoft will add R to the Visual Studio suite.

Mr. Sowyer also notes that popular data warehouses (such as Oracle, IBM Netezza and Teradata Aster) can run R scripts in-database.  While this is true, what these databases cannot do is run R scripts in distributed mode, which limits the capability to embarrassingly parallel tasks.  Enabling R scripts to run in distributed databases — necessary for Big Data — is a substantial development project, which is why Revolution Analytics completed only two such ports (one to Hadoop and one to Teradata).

While Microsoft’s deep pockets give Revolution Analytics the means to support more platforms, they still need the active collaboration of database vendors.  Oracle and Pivotal have their own strategies for R, so partnerships with those vendors is unlikely.

For some time now, commercial database vendors have attempted to differentiate their product by including machine learning engines.  Teradata was the first, in 1987, followed by IBM DB2 in 1992; SQL Server followed in the late 1990s, and Oracle acquired what was left of Thinking Machines in 1999 primarily so it could build Darwin software for predictive analytics into Oracle database.  None of these efforts has gained much traction with working analysts. for several reasons: (1) database vendors generally sell to the IT organization and not to an organization’s end users; (2) as a result, most organizations do not link the purchase decision for databases and analytics; (3) users for predictive analytics tend to be few in number compared to SQL and BI users, and their needs tend to get overlooked.

Bottom line: I think it is doubtful that Microsoft will pursue enabling R to run in relational databases other than SQL Server, and they will drop Revolution’s “Write Once Deploy Anywhere” tagline, as it is impossible to deliver.

Elsewhere, Mr. Dan Woods doubles down on his argument that Microsoft should emulate Tibco, which is like arguing that the Seattle Seahawks should emulate the Jacksonville Jaguars.  Sorry, JAX; it just wasn’t your year.

 

More Comments on Microsoft + Revolution Analytics

My inbox continues to fill with Google Alerts about Microsoft’s announced purchase of Revolution Analytics — too numerous to link.

Most of these stories simply repackage the Microsoft announcement.

Clint Boulton of the WSJ’s CIO Journal writes one of the best analyses:

Microsoft is betting on the timeliness of its acquisition as more businesses adopt analytics. Revolution’s software helps companies use R, an open source programming language that more than two million programmers use daily to build predictive models. R is popular among university computer science students, many of whom continue to use it in their careers as data scientists.

Data scientists who extract data from of a data warehouse or Hadoop processing system, use R to slice and dice it for insights, and visualize the results. But businesses analyzing financial, social media and other data often need to scale the analytics across clusters of computers.

Several analysts pass along the factoid that two million people use R.   The truth is that nobody has any idea how many people use R; we don’t even know how many have downloaded the software.  The New York Times pointed out the difficulty in its piece five years ago:

While it is difficult to calculate exactly how many people use R, those most familiar with the software estimate that close to 250,000 people work with it regularly.

It’s possible that R has gained 1,750,000 users in the interceding five years.  It’s also possible that R has gained 10,000,000 users.  “Those most familiar with the software” are simply guessing.

While most analysts are neutral to positive on Microsoft’s move, Mr. Dan Woods takes a contrary view.  In an article published in Forbes and cross-posted on multiple platforms, Mr. Woods argues that Microsoft was wrong to buy Revolution Analytics, and instead should buy Tibco.   (That is the implication of his argument that Microsoft should “emulate” Tibco, since the only way to “emulate” Tibco is to own the clump of software Tibco packages up as TERR.)

Mr. Woods is a “content specialist”, as freelance writers call themselves today, and his expertise in analytics is exemplified by his most recent book, Wikis for Dummies, published in 2007.  One suspects that the private equity firm that acquired Tibco in September is peddling the pieces, and has engaged “content specialists” to bang the drum.

Mr. Woods gets two things right.  It’s true that R is a mess, and it is also true that the GPL license makes R difficult to commercialize.  R’s messiness is a byproduct of crowdsourced development; it is a feature to its devotees and a bug for everyone else.  (For those who simply cannot tolerate R’s messiness there is a simple solution: use Python.)  Under the GPL license, any enhancements become part of the free distribution, so if you distribute a product built with R you must share the source code of your product as well.

At the crux of his argument, though, Mr. Woods gets it wrong:

Revolution Analytics has made a business, like many open source-based companies, of supporting Open Source R.

This is factually incorrect.  Revolution only recently started to offer a consulting service for open source R users; for most of its history, its business was built around Revolution R Enterprise, a commercially supported enhanced R distribution.  This is not a trivial distinction.  Cloudera Hadoop, for example, is based on Apache Hadoop, but it is not the same thing; while many enterprises use commercially supported Hadoop distributions (from vendors like Cloudera, Hortonworks or MapR), hardly anyone uses open source Apache Hadoop in production.

The same is true for R; while many enterprises have an issue using open source R, they are willing to deploy commercially supported R distributions (such as Oracle R or Revolution R).  This is the business Microsoft enters by acquiring Revolution Analytics.

Regarding Mr. Woods’ point about the need to rebuild R from the ground up, that is neither possible nor necessary.  The GPL license prevents anyone from “rebuilding” R as a commercial venture; if anyone “rebuilds” the language it will be the open source development team itself.

In any case, one need not “make R scale” — one need only provide an R API to other platforms (such as Apache Spark or dbLytix) that can scale, so that R users can interface with them.   This is the approach taken by Revolution Analytics’ ScaleR software, which is actually written in C, but includes an interface from the R programming language.  By building this component into Azure, Microsoft can offer those who use R locally a scaleable back end.

Update: Mr. Woods doubles down here.

Microsoft Buys Revolution Analytics

On Friday, January 23, Microsoft announced an agreement to acquire Revolution Analytics.  Coverage of the announcement in the media is extensive, with stories by TechCrunchWiredZDNetVentureBeat and many others (here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here and here.)

Microsoft did not disclose the negotiated purchase price; Revolution’s total capitalization is around $40 million.  Given Revolution’s scale of operations, the acquisition will have minimal impact on Microsoft’s near-term revenue and profit.

Many analysts follow Microsoft, but few have heard of Revolution Analytics, and most seem to be stumped by this move.  An example:

Question: What is the significance of Microsoft acquiring Revolution Analytics?

Answer: I am not sure.

Microsoft gets four things with this deal:

  • Instant credibility with the growing open source analytics community
  • Consulting and support skills to help enterprise customers adopt R
  • A capable engineering organization (conveniently located in Seattle)
  • Software bits that should integrate well with the Microsoft stack

In addition to its primary offering, Revolution R Enterprise, Revolution distributes Revolution R Open, an enhanced free distribution of open source R; and Revolution R Cloud an elastic offering on the AWS Marketplace.  Revolution R Open is equivalent in many respects to Oracle R Distribution, which is also compiled with the Intel Math Kernel Libraries.  Revolution R Plus is commercially supported, and includes additional software bits for enterprise integration; this product is comparable to Oracle R Enterprise.

Revolution Analytics’ other key software assets include ScaleR, a distributed out-of-memory back end with a strong R interface; DeployR, a component that supports enterprise deployment of web-based applications; and DevelopR, a Windows-based IDE.

While the IDE has a number of useful features, it requires significant investment to compete effectively with RStudio, which has won the hearts and minds of R users.  Upgrading software simply to make it competitive with a “free” competitor strikes me as a dubious commercial move; it seems more likely that Microsoft will add an R capability to the Visual Studio suite.

Revolution’s ScaleR back end enables R users to leverage a platform for distributed analytics.  ScaleR already runs on Windows Server HPC clusters, which should make integration with Azure a straightforward matter.  This is important for Microsoft, since Azure Machine Learning currently maxes out at around 10Gb.

ScaleR’s integration with Hadoop currently runs through MapReduce; competing best-in-class Hadoop analytics (such as Spark, H2O, Skytree and SAS) run in memory for better performance.  Microsoft’s deep pockets give Revolution the means to make this product competitive.

SAS Versus R Part Two

In a previous post, I summarized some myths about SAS and R — arguments offered by proponents of one or the other that deserve to be dismissed.

In this post, I will review some arguments that do make sense — things to consider if you are an aspiring analyst or if you are an executive making decisions about software for your organization.

(1) Every analysis technique available in SAS is available in R — plus many more

It’s fair to say that any analysis you can do in SAS you can also do in R.  The reverse, however, is not true — there are many techniques available in R that are not available in SAS.

As an open source platform, R is open to innovation, and offers few barriers to entry for new techniques.  An analyst who develops a new technique can quickly publish it in R, even if the technique has only niche appeal; it’s a great example of the long tail effect in action.

Commercial software providers like SAS, on the other hand, use product management calculus to balance the benefits of introducing a new technique against the cost to develop and support it.  The marginal revenue from adding a feature is hard to measure, while the costs are known, so conservative companies like SAS tend to lag well behind the cutting edge.  SAS also tends to bundle popular new capabilities into new products rather than enhancing the existing product, forcing customers to add more SAS software licenses to the stack if they want the capability.

Random Forests is a case in point.  Breiman and Cutler published their seminal article describing the technique in October, 2001; the following year, they published the randomForest package in R.  In December, 2012, SAS released an “experimental” version of what it calls “HP Forests” in SAS High Performance Analytics, and in 2013 included the PROC in SAS Enterprise Miner 13.1.

Ten years is a long time to wait.

(2) SAS is easier to learn and use than R

R mavens dispute this point, but they are wrong.  R is significantly harder to learn and use than SAS, at several levels, and for a number of reasons.

Bob Muenchen recently published an excellent catalogue of Things That Make R Hard to Learn.  Bob should know; he makes a living helping users cross the chasm from SAS to R.  Here is a brief except, but you should definitely read the whole thing:

R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.

There are two main reasons SAS is easier to user than R.  First, as a commercial product every element of SAS is governed by a common design that unifies the SAS programming language, user interfaces and documentation.  As a result, SAS programming syntax and documentation is generally consistent across procedures; statements generally mean the same thing whether you are working in PROC ACCESS or PROC XML.

Developers who contribute R packages, on the other hand, operate independently and without a comparable design.  While each individual package may be well or poorly written, there is no governing principle that ensures packages are consistent with one another.  While R aficionados celebrate its diversity, to the outsider it just seems messy.

SAS’ strong development tools add significant value for the user.  SAS Enterprise Guide, for example, included with Analytics Pro at no extra charge, offers a workflow interface and the ability to generate SAS or SQL code behind the scenes.  There is no equivalent code-generating tool available for R today.

(3) SAS offers an “enterprise-grade” solution

Individual analysts surveyed by Rexer last year said that cost and ease of use are the most important factors they consider when choosing analytic tools.  For enterprises, however, the selection criteria are more complex.

Technical support is a key concern for most organizations; some go so far as to adopt blanket policies banning the use of unsupported software.  SAS invests heavily in its Support organization; unlike many large software vendors, Technical Support is a career track at SAS, with low employee turnover.  With locations located in multiple countries, SAS is able to support customers globally and at enterprise scale.

When SAS licenses its software, it warrants that the software is materially free of defects.  This warranty is backed up by a contractual commitment to fix defects that surface.  Hence, SAS offers the customer a “single throat to choke” — customers know when they license SAS that a single organization is responsible for development, distribution, implementation and support of the software, and accountability is clear.

Open source R, of course, has no organic technical support.  Organizations such as Revolution Analytics offer technical support either for open source R or Revolution’s own commercial R distribution.  Third party service providers like Revolution can be highly knowledgeable and effective; however, if there is a software defect in an R package, the support provider can only notify the developer and request resolution.

(4) SAS costs more than R

“Duh!” you say; “R is free!”  True enough.  R is open source software, distributed with a free license to use; for a single analyst, the incremental TCO to download, install and use R on an existing machine is zero.   This is also true for other key components of the R ecosystem, such as RStudio, the popular development environment.   Low cost of entry is a key driver behind R’s growing popularity.

SAS, on the other hand, charges a subscription fee which consists of a term license to use the software plus technical support and maintenance into a subscription fee.  Entry costs to license the most basic package (SAS Analytics Pro) costs $8,700 (first year fee) at the SAS online store; this package includes Base SAS, SAS/STAT and SAS/Graph.  SAS renewal fees generally run 25-30% of the first year fee.  SAS bundles its analytic features into a number of separate packages, such as SAS/ETS for time series, SAS/OR for optimization and SAS/IML for matrix manipulation; if you require these capabilities, you must pay extra.  SAS also offers access engines for an assortment of data sources, each of which can be licensed individually for $3,000 each.

The version of SAS sold through the online store is for single Windows machines only.  SAS sells its software for servers through its sales force, and pricing is negotiated; “list” price depends on the computing power of the server, measured by cores or sockets.  Server pricing for Analytics Pro starts in the low six figures.

SAS offers a virtualized “University Edition” which is free but not open source.  See here for a review.

Bottom line — for the analyst

Aspiring analysts ask “should I learn SAS or R?”   I’m tempted to answer “why not both?” but that begs the question of which to learn first.

If SAS is the primary tool at your organization or university, learn it and use it.  There are still more jobs available for SAS users than R users (though the gap is narrowing); and even prospective employers who do not currently use SAS treat it as a proxy for analytics know-how.

If your organization or university supports both SAS and R, look for trends in usage.  Is the R community growing rapidly?  Are the “best and brightest” people using SAS or R?  Is your management putting out subtle (or not-so-subtle) messages promoting use of one or the other?   Take the pulse of your organization and make your choice.

If your organization or university does not already license SAS, if you aspire to free-lance consulting or you are simply unemployed, learn R.  Doing so costs you nothing, and there are plenty of low-cost options for training and self-directed learning.

Bottom line — for the enterprise

If you are making decisions about software for an analytics team or an entire organization, the calculus is more complex.

R has more analytic techniques than SAS, but what techniques to you actually need?  Take note of your team’s actual current and future analytic needs, and act accordingly.  If you are using SAS today, the chances are very good that a handful of PROCs account for 95% of current usage; the same is true for R.

SAS is easier to learn than R, but if all or most of your analysts already know R, what difference does it make?  Many younger analysts entering the workforce already know how to use R, and it is a waste of time and money to force them to learn SAS.  On the other hand, if your analysts rely on SAS, you can expect to invest considerable time and money for retraining.

Do you need an enterprise solution?  If you organization spans multiple countries, if you support more than twenty users, the chances are that the answer is “yes”.  For a larger organization, it’s hard to beat SAS’ ability to mobilize support, training and consulting resources around the world.  This is likely to change in the future, as organizations like Revolution Analytics build scale and credibility.

SAS costs more than R, but R is not free.  If you are concerned about SAS costs, carefully evaluate your spending and take note of the value offered by SAS.  Keep in mind that software licensing costs are only one component of Total Cost of Ownership (TCO); third-party support for R is not free, and neither is training and conversion.  Do the math.

In general, SAS works well for organizations that are in the middle of Tom Davenport’s maturity cycle pictured above.   These organizations have the basic data infrastructure and business cases for analytics, combined with a need for rapid scale and consistency across locations.  As organizations mature, they become less dependent on a single vendor for analytics and more willing to develop a “best-in-breed” approach; they are more interested in innovation and “cutting-edge” techniques, and the analysts they hire have the will and skill to learn R.  These organizations are adopting R at an increasing rate.

Adoption of R is most pervasive among analytic service providers, such as consultants, system integrators and marketing service providers.  These organizations are sensitive to software costs and tend to hire most highly skilled analysts, for whom R’s learning curve is not a serious issue.  Costs aside, SAS restrictions on use — designed to prevent cannibalization– are highly problematic for service providers.

SAS Versus R (Part 1)

Which is better for analytics, SAS or R?  One frequently sees discussions on this topic in social media; for examples, see here, here, here, here, here and here.   Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light.

The question is serious.  Many organizations with a large investment in SAS are actively considering whether to adopt R, either to supplement SAS or to replace it altogether.  The trend is especially marked in the analytic services industry, which is particularly sensitive to SAS licensing costs and restrictive conditions.

In this post, I will recap some common myths about SAS and R.  In a follow-up post,  I will summarize the pros and cons of each as an analytics platform.

Myths About SAS and R

Advocates for SAS and R often support their positions with beliefs that are little more than urban legends; as such, they are not good reasons to choose SAS over R or vice-versa.   Let’s review six of these myths.

(1) Regulatory agencies require applicants to use SAS.

This claim is often cited in the context of submissions to the Food and Drug Administration (FDA), apparently by those who have never read the FDA’s regulations governing submissions.  The FDA accepts submissions in a range of formats including SAS Transport Files (which an R user can create using the StatTransfer utility.)   Nowhere in its regulations does the FDA mandate what software should be used to produce the analysis; like most government agencies, the FDA is legally required to support standards that do not favor single vendors.

Pharmaceutical firms tend to rely heavily on SAS because they trust the software, and not due to any FDA mandate.  Among its users, SAS has a deservedly strong reputation for quality; it is a mature product and its statistical techniques are mature, well-tested and completely documented.  In short, the software works, which means there is very little incentive for an established user to experiment with something else, just to save on licensing fees.

That trust in SAS isn’t a permanent state of affairs.  R is gradually making inroads in the life sciences community; it has already largely displaced SAS in the academic world.  Like many other regulatory bodies, the FDA itself uses open source R together with SAS.

(2) R is better than SAS because it is object oriented.

This belief is wrong on two counts: (1) it assumes that object-oriented languages are best for all use cases; and (2) it further assumes that SAS offers no object-oriented capability.

Object-oriented languages are more efficient and easier to use for many analysis tasks.  In real-world analytics, however, we often work with messy and complex data; a cursor-based language like the SAS DATA Step offers the user a great deal of flexibility, which is why it is so widely used.  Anyone who has ever attempted to translate SAS “first and last” processing into an object-oriented language understands this point.  (Yes, it can be done; but it requires a high-level of expertise in the OOL to do it).

In Release 9.3, SAS introduced DS2, an object-oriented language with a defined migration path from SAS DATA Step programming. Hence, for those tasks where object-oriented programming is desirable, DS2 meets this need for the SAS user.  (DS2 is included with Base SAS).

(3) You never know what’s inside open source software like R.

Since R is an open programming environment, anyone can develop a package and contribute it to the project.  Commercial software vendors like to plant FUD about open source software by suggesting that contributors may be amateurs or worse — in contrast to the “professional” engineering of commercial software.

One of the key virtues of open source software is that you do know what’s inside it because — unlike commercial software — you can inspect the source code.  With commercial software, you must have faith in the vendor’s integrity, technical support and willingness to stand by its warranty.  For open source software, there is no warranty nor is one required; the code speaks for itself.

When a contributor publishes an enhancement to R, a large community of users evaluates and tests the new feature.  This “crowdsourced” testing quickly flags and logs issues with software syntax and semantics, and logged issues are available for anyone to see.

Commercial software vendors like SAS have professional testing and QA departments, but since testing is expensive there is considerable pressure to minimize the expense.   Under the pressure of Marketing and Sales deadlines, systematic testing is often the first task to be cut.  Bismarck once said that nobody should witness how laws or sausages are made; the same is true for commercial software.

SAS does not disclose the headcount it commits to software testing and QA, but given the size of the R user base, it’s fair to say that the number of people who test and evaluate each R release is far greater than the number of people who evaluate each SAS release.

(4) R is better than SAS because it has thousands of packages.

This is like arguing that Wal-Mart is a better store than Brooks Brothers because it carries more items.  Wal-Mart’s breadth of product makes it a great shopping destination for many shoppers, but a Brooks Brothers shopper appreciates the store’s focus on a certain look and personalized service.

By analogy, R’s cornucopia of functionality is both a feature and a bug.  Yes, there is a package in R to support every conceivable analytic need; in many cases, there is more than one package.  As of this writing, there are 486 packages that support linear regression, which is great unless you only need one and don’t want to sift through 486.

Of course, actual R users don’t check every package to find what they need; they settle on a few trusted packages based on actual experience, word-of-mouth, books, periodicals or other sources of information.  In practice, relatively few R packages are actually used; the graph below shows package downloads from RStudio’s popular CRAN mirror in September 2014.

CRAN Downloads

(For the record, the ten most downloaded packages from RStudio’s CRAN mirror in September 2014 were Rcpp, plyr, ggplot2, stringr, digest, reshape2, RColorBrewer, labeling, colorspace and scales.)

For actual users, the relevant measure isn’t the total number of features supported in SAS and R; it’s how those features align with user needs.

N.B. — Some readers may quibble with my use of statistics from a single CRAN mirror as representative of the R community at large.  It’s a fair point — there are at least 105 public CRAN mirror sites worldwide — but given RStudio’s strong market presence it’s a reasonable proxy.

(5) Switching from SAS to R is expensive because you have to rewrite all of your code.

It’s true that when switching from SAS to R you have to rewrite programs that you want to keep; there is no engine that will translate SAS code to R code. However, SAS users tend to overestimate the effort and cost to accomplish this task.

Analytic teams that have used SAS for some years typically accumulate a large stock of programs and data; much of this accumulation, however, is junk that will never be re-used.    Keep in mind that analytic users don’t work the same way as software developers in IT or a software engineering organization.  Production developers tend to work in a collaborative environment that ensures consistent, reliable and stable results.  Analytic users, on the other hand, tend to work individually on ad hoc analysis projects; they are often inconsistently trained in software best practices.

When SAS users are pressed to evaluate a library of existing programs and identify the “keepers”, they rarely identify more than 10-20% of the existing library.  Hence, the actual effort and expense of program conversion should not be a barrier for most organizations if there is a compelling business case to switch.

It’s also worth noting that sticking with SAS does not free the organization from the cost of code migration, as SAS customers discovered when SAS 9 was released.

The real cost of switching from SAS to R is measured in human capital — in the costs of retraining skilled professionals.  For many organizations, this is a deal-breaker at present; but as more R-savvy analysts enter the workforce, the costs of switching will decline.

(6) R is a good choice when working with Big Data.

When working with Big Data, neither “legacy” SAS nor open source R is a good choice, for different reasons.

Open source R runs in memory on a single machine; it can work with data up to available memory, then fails.  It is possible to run R in a Hadoop cluster or as table functions inside MPP databases.  However, since R runs independently on each node, this is useful only for embarrassingly parallel tasks; for most advanced analytics tasks, you will need to invoke a distributed analytics engine.   There are a number of distributed engines you can invoke from R, including H2O, ScaleR and Skytree, but at this point R is simply a client and the actual work is done by the distributed engine.

“Legacy” SAS uses file-swapping to handle out-of-memory problems, but at great cost to performance; when a data set is too large to load into memory, “legacy” SAS slows to a crawl.  Through SAS/ACCESS, SAS supports the ability to pass through SQL operations to MPP databases and HiveQL, MapReduce and Pig to Hadoop; however, as is the case with R, “legacy” SAS simply functions as a client and the work is done in the database or Hadoop.  The user can accomplish the same tasks using any SQL or Hadoop interface.

To its credit, SAS also offers distributed in-memory software that runs inside Hadoop (the SAS High-Performance Analytics suite and SAS In-Memory Statistics for Hadoop).  Of course, these products do not replicate “legacy” SAS; they are entirely new products that support a subset of “legacy” SAS functionality at extra cost.  Some migration may be required, since they run DS2 but not the traditional SAS DATA Step.  (I cite these points not to denigrate the new SAS software, which appears to be well designed and implemented,  but to highlight the discontinuity for SAS users between the “legacy” product and the scalable High Performance products.)

If your organization works with Big Data, your primary focus should be on choosing the right scalable analytics platform, with secondary emphasis on the client or API used to invoke it.

2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014.  Thought it would be fun to validate how well the crystal ball works.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with product certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite.  But the Mahout team has announced that all new projects must use a standard DSL that runs the job in Spark.

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Co-locating your analytics in the Hadoop cluster is less attractive than integrating your analytics with Hadoop.  With Spark fully integrated with Hadoop storage APIs, co-located solutions seem much less attractive.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

Graph analysis is really useful in the right hands, but organizations are still trying to figure out what to do with it.  That is why we still see posts like this; when something is hot, nobody writes articles about what to do with it; everyone knows what to do with it.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and the method tends to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops.  Tez isn’t really comparable to Giraph and GraphLab.  And right after I wrote this, the GraphLab open source project pretty much died.   GraphLab Inc., the commercial venture incepted to commercialize the open source project, is fiddling around with other stuff.   Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchin updates his analysis for 2014.  But the trend is your friend:

fig_1b_rvsas_2014-2-23

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

Speaking with enterprise customers, I like to ask why they switched from SAS to R.  The #1 response: the people we hire know R already, not SAS.  SAS’ free “University Edition” is an attempt to stem the bleeding that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs.  SAS has delivered an ACCESS engine to HANA but not much else considering the talk about joint solutions.  SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, which is an improvement over 2013 when they didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with AlteryxAlpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.