The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is agiler when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDNuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles to a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDNuggets poll, 46% said they use Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Packaging Index (PyPI.) Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDNuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which begs the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team has delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as like an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed Machine Learning Common (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under GPU license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Common (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and Javascript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. IT is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC) is a deep learning framework released under an open source BSD license.  Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++.  Users interact with Caffe through a Python API or through a command line interface.  Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Big Analytics Roundup (November 16, 2015)

Just three main stories this week: possible trouble for a pair of analytic startups; Google releases TensorFlow to open source; and H2O delivers new capabilities at its annual meeting.

In other news, the Spark team announces Release 1.5.2, a maintenance release; and the Mahout guy announces Release 0.11.1, with bug fixes and performance improvements. (h/t Hadoop Weekly)

Two items of note from the Databricks blog:

— Darin McBeath describes Elsevier’s Spark use case and introduces spark-xml-utils, a Spark package contributed by his team.  The package enables the Spark user to filter documents based on an Path expression, return specific nodes for an Path/XQuery expression and transform documents using an XLST stylesheet.

— Rachit Agarwal and Anurag Khandelwal of Berkeley’s AMPLab introduce Succinct, a distributed datastore for queries on compressed data.   They announce release of Succinct Spark, a Spark package that enables search, count, range and random access queries on compressed RDDs.  The authors claim a 75X performance advantage over native Spark using Succinct as a document store,

Three interesting stories on streaming data:

  • In a podcast, Data Artisans CTO Stephan Ewen discusses Flink, Spark and the Kappa architecture.
  • Techalpine’s Kaushik Pal compares Spark and Flink for streaming data.
  • Will McGinnis helps you get started with Python and Flink.

(1) Analytic Startups in Trouble

In The Information, Steve Nellis and Peter Schulz explain why startups return to the funding well frequently — and why those that don’t may be in trouble.  Venture funding isn’t a perfect indicator of success, but is often the only indicator available.  On the list: Skytree Software and Alpine Data Labs.

(2) Google Releases TensorFlow for Machine Learning

On the Google Research blog, Google announces open source availability of TensorFlow.  TensorFlow is Google’s second generation machine learning system; it supports Deep Learning as well as any computation that can be expressed as a flow graph.   Read this white paper for details of the system.  At present, there are Python and C++ APIs;  Google notes that the C++ API may offer some performance advantages.

Video intro here.

In Wired, Cade Metz reports; Erik T. Mueller dismisses; and Metz returns to note that Deep Learning can leverage GPUs, and that AI’s future is in data, as if we didn’t know these things already.

On Slate, Will Oremus feels the buzz.

On his eponymous blog, Sachin Joglekar explains how to do k-means clustering with TensorFlow.

Separately, in VentureBeat, Jordan Novet rounds up open source frameworks for Deep Learning.

(3) H2O.ai Releases Steam

It’s not a metaphor.  At its second annual H2O World event, H2O releases Steam, an open source data science hub that bundles model selection, model management and model scoring into a single container for elastic deployment.

On the H2O Blog, Yotam Levy wraps Day One, Day Two and Day Three of the H2O World event.  Speaker videos are here, slides here.  (Registration required.)  Some notable presentations:

— H2O: Tomas Nykodym on GLM; Mark Landry on GBM and Random Forests; Arno Candel on Deep Learning; Erin LaDell on Ensemble Modeling.

— Michal Malohlava of H2O and Richard Garris of Databricks explain how to run H2O on Databricks Cloud.  Separately, Michal demonstrates Sparkling Water, a Spark package that enables a Spark user to call H2O algorithms; Nidhi Mehta leads a hands-on with PySparkling Water;  and Xavier Tordoir of Data Fellas exhibits Interactive Genomes Clustering with Sparkling Water on the Spark Notebook.

— Szilard Pafka of Epoch summarizes his work to date benchmarking R, Python, Vowpal Wabbit, H2O, xgboost and Spark MLLib.  As reported previously, Pafka’s benchmarks show that H2O and xgboost are the best performers; they are faster and deliver more accurate models.

As reported in last week’s roundup, H2O.ai also announces a $20 million “B” round.

Big Analytics Roundup (October 26, 2015)

Fourteen stories this week, beginning with an announcement from IBM.  This week, IBM celebrates 14 straight quarters of declining revenue at its IBM Insight conference, appropriately enough at the Mandalay Bay in Vegas, where the restaurants are overhyped and overpriced.

Meanwhile, the first Spark Summit Europe meets in Amsterdam, in the far more interesting setting of the Beurs van Berlage.  There will be a live stream on Wednesday and Thursday — details here.  Sadly, I can’t make this one — the first Spark Summit I’ve missed — but am looking forward to the live stream.

(1) IBM Announces Spark on Bluemix

At its IBM Insight beauty show, IBM announces availability of its Apache Spark cloud service.  Actually, IBM announced it back in July, but that was a public beta.   On ZDNet, Andrew Brust gushes, noting that IBM has DB2, Watson, Netezza, Cognos, TM1, SPSS, Informix and Cloudant in its portfolio.  He fails to note that of those products, exactly one — Cloudant — actually interfaces with Spark.

There were rumors that IBM would have an exciting announcement about Spark at this show, but if this is it — yawn.  Looking at IBM’s “Spark in the cloud” offering, I don’t see anything that sets it apart from other available offerings unless you have a Blue fetish.

Update: Rod Reicks of IBM writes to note that IBM’s new release of SPSS Analytics Server runs processes in Spark.  For the uninitiated, Analytics Server is a product you license from IBM that enables SPSS Modeler user to run selected operations in Hadoop.  Previous versions ran through MapReduce only.  Reicks claims that the latest version runs through Spark when available.

I say “claims” because there is no reference to this feature in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting.  So the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

You’d think that with IBM’s armies of people they could at least find someone to write documentation.

(2) Mahout Book FAIL

Packt announces a book on Clustering with Mahout with an entire chapter devoted to Canopy Clustering, which the Mahout team just deprecated.

(3) Concurrent Adds Spark Support

Concurrent announces Release 2.0 of Driven, its oddly-named performance management software, which now includes support for Apache Spark.

(4) Flink Founder Touts Streaming Analytics

At Big Data Spain, Data Artisans co-founder Kostas Tzoumas argues that streaming is the basis for all analytics, which is a bit over the top: as they say, if all you have is a hammer, the world looks like a nail.  Still, his deck is a nice intro to Flink, which has made some progress this year.

(5) AtScale Announces Release 3.0

AtScale, one of the more interesting startups in the BI space, delivers Release 3.0 of its OLAP-on Hadoop platform.  Rather than introducing a new user interface into the mix, AtScale makes it possible for BI users to work with Hadoop tables without jumping back and forth to programming tools.  The product currently supports Tableau, Excel, Qlik, Spotfire, MicroStrategy and JasperSoft, and runs on CDH, HDP or MapR with Impala, Spark SQL or Hive on Tez.  The new release includes enhanced role-based security, including Kerberos, Username/Password or LDAP.

(6) Neo: Graphs are Eating the World

Graph database leader Neo announces immediate availability of Neo4j 2.3, which includes what it calls “intelligent applications at scale” and Docker support.  Exactly what Neo means by “intelligence applications at scale” means is unclear, but if Neo is claiming that you no longer have to dump a graph into Spark to run a PageRank, I’ll believe it when I see it.

(7) New Notebook Sharing for Databricks 

Databricks announces new notebook sharing capabilities for its eponymous product.  On the Databricks blog, Denise Li and Dave Wang explain.

(8) Teradata: Blah, Blah, Blah, IoT, Blah, Blah Blah

At its annual user conference, Teradata announces that it’s heard about IoT.    Teradata also announces that it will make Aster available on Hadoop, which would have been interesting in 2012.  Aster, for the uninitiated, includes a SQL on MapReduce engine, which is rendered obsolete by fast SQL engines like Presto, which Teradata has just embraced.

(9) Flink Forward Redux

As I noted last week, the first Flink Forward conference met in Berlin two weeks ago.  William Benton records his impressions.

Presentations are here.  Some highlights:

  • Dongwon Kim benchmarks Flink against MR, MR on Tez and Spark.  Flink wins.
  • Kostas Tzoumas outlines the Flink development roadmap through Release 1.0.
  • Martin Junghanns explains graph analytics with Flink.
  • Anwar Rizal demonstrates streaming decision trees with Flink.

Henning Kropp offers resources for diving deeply into Flink.

(10) Pyramid Analytics Lands New Funding

Amsterdam-based BI startup Pyramid Analytics announces a $30 million “B” round to help it try to explain why we need more BI software.

(11) Harte Hanks Switches from CDH to MapR

John Leonard explains why Harte Hanks switched from Cloudera to MapR.  Most likely explanation: they were able to cut a cheaper deal with MapR.

(12) Audience Modeling with Spark

Guest posting on the Databricks blog, Eugene Zhulenev explains audience modeling with Spark ML pipelines.

(13) New Functions in Drill

On the MapR blog, Neeraja Rentachintala describes new capabilities in Drill Release 1.2, including SQL window functions.

(14) Integrating Spark and Redshift

“Redshift is where data goes to die.”  — Rob Ferguson, Spark Summit East

On the Databricks blog, Sameer Wadkar of Axiomine explains how to use the spark-redshift package, first introduced in March of this year and now in version 0.5.2.  So you can yank your data out of Redshift and do something with it. (h/t Hadoop Weekly)

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First some definition.  The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I’ve written previously about the importance of distributed computing for high-performance predictive analytics, why it’s difficult to deliver and potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor’s data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster‘s built-in analytic functions.)  While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in.  In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they’re effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H20 is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters. Current functionality (Release 2.8.4.4) includes Cox Proportional Hazards modeling, Deep Learning, generalized linear models, gradient boosted classification and regression, k-Means clustering, Naive Bayes classifier, principal components analysis, and Random Forests. The software also includes tooling for data transformation, model assessment and scoring.   Users interact with the software through a web interface, a REST API or the h2o package in R.  H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.

H2O.ai provides commercial support for the open source software.  There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, Paypal and Nielsen.

MADLib 

Distributor: Pivotal Software

MADLib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7).  While primarily a captive project of Pivotal Software — most of the top contributors are Pivotal or EMC employees — the support for PostgreSQL qualifies it for this list.    MADLib includes rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project incepted in 2011 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community.  The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago.   Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce.  To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark.  For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that’s worth).  Several commercial vendors, including Predixion Software and RapidMiner leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics.  Spark is a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics; currently (Release 1.2) it supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3.  Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLLib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers.  Thanks to large and growing contributor community, Spark MLLib’s functionality is expanding faster than any other open source or commercial software listed in this article.

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases.  Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum database.  (Alpine and Greenplum have common roots in the EMC ecosystem).   Usability is the product’s key selling point, and the analytic feature set is relatively modest; however, Chorus’ collaboration and data cataloguing capabilities are unique.  Alpine’s customer list is growing; the list does not include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database.  Embedded in SQL, analytics may be invoked from a range of application, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS.   The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately).  The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes.  Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees.  The cadence of enhancements for this product is very slow; first released in May 2013, IBM has released a single maintenance release since then.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts.  For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, Naive Bayes classifier and two Spark MLLib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and Datastax Enterprise.  As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components, including Revolution R, an enhanced and commercially supported R distribution, a Windows IDE, integration tools and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface.  A little over a year ago, Revolution released its version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce.   The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering.   Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC.  Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks).  In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated “Analysis” nodes within the Hadoop cluster.)  While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users leverage the algorithms from SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December, 2011 with great fanfare.  To date the product lacks a single public reference customer; this could mean that SAS’ Marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between.  As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA’s inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture. Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA.  SAS prices this software based on the number of cores in the servers upon which it is deployed; while I can’t disclose the list price per core, it’s fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first year fee.

Skytree

Product: Skytree Infinity

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013.  Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark.  Distributed algorithms can run as a free-standing cluster or co-located in Hadoop under YARN.  The product has a programming interface; the vendor claims ability to run from R, Weka, C++ and Python.   Neither Skytree’s modest list of algorithms nor its short list of public reference customers has changed in the past two years.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefit of parallel computing is speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADLib should run in any SQL environment, but Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics tasks are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Machine Learning in Hadoop: Part Two

This is the second of a three-part series on the current state of play for machine learning in Hadoop.  Part One is here.  In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines.   This post will focus on machine learning, but it’s worth nothing that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first category.  The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase.  RHIPE, a project led by Mozilla’s Suptarshi Guha, offers similar functionality, but without the HBase integration.

Both projects enable R users to implement explicit parallelization in MapReduce.  R users write R scripts specifically intended to be run as mappers and reducers in Hadoop.  Users must have MapReduce skills, and must refactor program logic for distributed execution.  There are some differences between the two projects:

  • RHadoop uses standard R functions for Mapping and Reducing; RHIPE uses unevaluated R expressions
  • RHIPE users work with data in key,value pairs; RHadoop loads data into familar R data frames
  • As noted above, RHIPE lacks an interface to HBase
  • Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE

Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLLib.  Both projects have commercial backing, and show robust development activity.  Statistics from GitHub for the thirty days ended February 12 show the following:

  • 0xdata H2O: 18 contributors, 938 commits
  • Apache Spark: 77 contributors, 794 commits

H2O is a project of startup 0xdata, which operates on a services and support business model.  Recent coverage by this blog here;  additional coverage here, here and here.

MLLib is one of several projects included in Apache Spark.  Databricks and Cloudera offer commercial support.  Recent coverage by this blog here and here; additional coverage here, here, here and here.

As of this writing, H2O has more built-in analytic features than MLLib, and its R interface is more mature.  Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is solely focused on machine learning.

Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in its partnership with other machine learning vendors, such as SAS.  There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project.  Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.

Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit.  Development activity on these projects is much less robust than for H2O and Spark.  GitHub statistics for the thirty days ended February 12 speak volumes:

  • Apache Mahout: contributors, 54 commits
  • Vowpal Wabbit: 8 contributors, 57 commits

Neither project has significant commercial backing.  Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project.  (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout).  John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.

Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition.  Mahout’s supervised learning algorithms, however, are weak.  Criticisms of Mahout tend to fall into two categories:

  • The project itself is a mess
  • Mahout’s integration into MapReduce is suitable only for high latency analytics

On the first point, Mahout certainly does seem eclectic, to say the least.  Some of the algorithms are distributed, others are single-threaded; others are simply imported from other projects.  Many algorithms are underdeveloped, unsupported or both.  The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.

“High latency” is code for slow.  Slow food is a thing; “slow analytics” is not a thing.  The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.

Vowpal Wabbit has its advocates among data scientists; it is fast, feature rich and runs in Hadoop.  Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis.  Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.

In Part Three, we will cover commercial software for machine learning in Hadoop.

2014 Predictions: Advanced Analytics

A few predictions for the coming year.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation.  Organizations will increasingly question the value of “point solutions” for Hadoop analytics versus Spark’s integrated platform for machine learning, streaming, graph engines and fast queries.

At least one commercial software vendor will release software using Spark as a foundation.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

(2) “Co-location” will be the latest buzzword.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.

YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  In practice, that means it will be possible to stand up co-located server-based analytics (e.g. SAS) on a few nodes with expanded memory inside Hadoop.  This asymmetric architecture adds some latency (since data moves from the HDFS data nodes to the analytic nodes), but not as much as when data moves outside of Hadoop entirely.  For most analytic use cases, the cost of data movement will be more than offset by the improved performance of in-memory iterative processing.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS and HDP

(3) Graph engines will be hot.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

(4) R approaches parity with SAS in the commercial job market.

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

(5) SAP emerges as the company most likely to buy SAS.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

(6) Competition heats up for “easy to use” predictive analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)

Updated and bumped July 10, 2014.

For a powerpoint version on Slideshare, go here.

Introduction

Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop.  Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014.  According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.

Organizations seeking to implement advanced analytics in Hadoop face two key challenges.  First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.

A second key challenge is the plethora of analytic point solutions in Hadoop.  These include, among others, Mahout for machine learning; Giraph, and GraphLab for graph analytics; Storm and S4 for streaming; or HiveImpala and Stinger for interactive queries.  Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.

Spark directly addresses these challenges.  It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data.  This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.

Second, Spark offers an integrated framework for analytics, including:

A closely related project, Shark, supports fast queries in Hadoop.  Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project.  The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.

Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs.  RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs.  RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence.  They are designed to be fault tolerant, so that if an operation fails it can be reconstructed.

For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase).  Hadoop supports text files, SequenceFiles and any other Hadoop InputFormat.  Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.

Analytic Features

Spark’s machine learning library, MLLib, is rapidly growing.   In Release 1.0.0 (the latest release) it includes:

  • Linear regression
  • Logistic regression
  • k-means clustering
  • Support vector machines
  • Alternating least squares (for collaborative filtering)
  • Decision trees for classification and regression
  • Naive Bayes classifier
  • Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
  • Model evaluation functions
  • L-BFGS optimization primitive

Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization.  MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).

In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark.  Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark.  The Mahout team will maintain existing MapReduce projects.  There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.

Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework.  It enables users to interactively load, transform, and compute on massive graphs.  Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.

Spark Streaming offers an additional abstraction called discretized streams, or DStreams.  DStreams are a continuous sequence of RDDs representing a stream of data.  The user creates DStreams from live incoming data or by transforming other DStreams.  Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.

Currently, Spark supports programming interfaces for Scala, Java and Python;  MLLib algorithms support sparse feature vectors in all three languages.  For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014

There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0.  In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined.   In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.  So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.

Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release for GraphX, enhancements to MLLib and many other enhancements).  Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.

Distribution

Spark is now available in every major Hadoop distribution.  Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks.  (For more on Cloudera’s support for Spark, go here).  In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.

Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.

IBM’s commitment to Spark is unclear.  While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.

In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine.  Datastax will partner with Databricks on this project; availability expected summer 2014.

At the 2014 Spark Summit, SAP announced its support for Spark.  SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.

On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem.   The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.

At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.

Ecosystem

In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.

Databricks offers a certification program for Spark; participants currently include:

In May, Databricks and Concurrent Inc announced a strategic partnership.  Concurrent plans to add Spark support to its Cascading development environment for Hadoop.

Community

In December, the first Spark Summit attracted more than 450 participants from more than 180 companies.  Presentations covered a range of applications such as neuroscienceaudience expansionreal-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).

The 2014 Spark Summit was be held June 30 through July 2 in San Francisco.  The event sold out at more than a thousand participants.  For a summary, see this post.

There is a rapidly growing list of Spark Meetups, including:

Now available for pre-order on Amazon:

Finally, this series of videos provides some good basic knowledge about Spark.

Notes from Strata 2013

Last week I attended the O’Reilly Strata 2013 Conference.    Here are some notes on presentations pertinent to analytics, in four categories:

  • Vendors
  • Users
  • Technical
  • Thought Provokers

Vendor Presentations

We wouldn’t have trade shows without sponsors, and the big ones get ten minutes of fame.  Some used their time well, others not so much.  I’ll refrain from shaming the bloviaters, but will single out three for applause:

  • John Schroeder of MapR did a nice preso on the business case for Hadoop, with a refreshing focus on measurable revenue impact and cost reduction;
  • Girish Juneja from Intel delivered a thoughtful summary of Intel’s participation in open source projects.  Not a lot of sizzle, but refreshingly free of hype;
  • Charles Zedlewski of Cloudera provided a terrific explanation of the history and direction of Hadoop and made a compelling case for the platform.

Someone should tell O’Reilly that Skytree is a vendor.  Skytree managed to get a 45-minute slot in a non-vendor track, and Alexander Gray of Skytree used the time to say stuff that data miners learned years ago.

User Presentations

Several presenters spoke about how their organization uses analytics.   In any conference, presentations like this are often the most compelling.

  • Rajat Taneja from Electronic Arts spoke about the depth of information captured by gaming companies, and how they use this information to improve the gaming experience.  Good presentation, with great visuals
  • Eric Colson of Stitch Fix (and formerly with Netflix) spoke about recommendation engines.  Stitch Fix sends bundles of new clothing to buyers on spec, and they have finely tuned the bundling process using a mix of machine learning and human decisions.  Eric spoke about the respective strengths of machine and human decisioning, and how to use them together effectively.
  • Michael Bailey of Facebook gets credit for truth in packaging for “Introduction to Forecasting”.  His presentation covered very basic content, the sort of thing covered in Stat 101, and he did a fine job presenting that.  Michael hinted at Facebook’s complex forecasting problem — they have to simultaneously forecast eyeballs and ad placements — and it would be great to hear more about that in a future presentation.

Technical Presentations

It’s tough to deliver detailed content in a short session; most of the presenters I saw struck the right balance.

  • Sharmila Shahani-Mulligan and others from ClearStory Data presented to an overflow audience interested in learning more about Spark and Shark.  Spark is an open source in-memory distributed computational engine that runs on top of Hadoop. It is designed to support iterative algorithms, and supports Java, Scala and Python.  Shark is part of Hive, integrates with Spark, and offers a SQL interface
  • Dr. Vijay Srinivas Agneeswaran of Impetus Technologies delivered what I thought was the best presentation in the show.   He summarized the limits of legacy analytics, discussed analytics in Hadoop (such as Mahout),  and spoke about a third wave of distributed analytics based on technologies like Spark, HaLoop, Twister, Apache Hama and GraphLab.
  • Jayant Shekhar of Cloudera delivered a very detailed presentation on how to build a recommendation engine.

Thought Provokers

Several presenters spoke on broad conceptual topics, with mixed results.

  • James Hendler of RPI spoke on the subject of “broad data”.  His presentation seemed thoughtful, but to be honest he lost me.
  • Nathan Marz of Twitter has co-authored a book on Big Data coming out soon.   After listening to his short preso on data modeling (“Human Fault Tolerance”) , I added the book to my wish list.
  • Kate Crawford of Microsoft presented on the subject of hidden biases in big data.  Her presentation covered material well known to seasoned analysts (“hey, did you know that your data may be biased?), Kate’s presentation was excellent, and full of good examples.

Overall, an excellent show.

Overstock.com Joins the Mahout Parade

Interesting story here in Wired about how Overstock.com used Mahout to build and deploy a recommendation engine to replace RichRelevance, thereby saving $2 million in annual fees.

Overstock joins an elite list of companies who are monetizing Mahout. including  Adobe,  Amazon,  AOL,  Buzzlogic,  Foursquare,  Twitter and Yahoo.

(h/t Bill Zanine)