This is the second of a three-part series on the current state of play for machine learning in Hadoop. Part One is here. In this post, we cover open source options.
As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines. This post will focus on machine learning, but it’s worth nothing that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.
Tools for machine learning in Hadoop can be classified into two main categories:
- Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
- Fully distributed machine learning software that integrates with Hadoop
There are two major open source projects in the first category. The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase. RHIPE, a project led by Mozilla’s Suptarshi Guha, offers similar functionality, but without the HBase integration.
Both projects enable R users to implement explicit parallelization in MapReduce. R users write R scripts specifically intended to be run as mappers and reducers in Hadoop. Users must have MapReduce skills, and must refactor program logic for distributed execution. There are some differences between the two projects:
- RHadoop uses standard R functions for Mapping and Reducing; RHIPE uses unevaluated R expressions
- RHIPE users work with data in key,value pairs; RHadoop loads data into familar R data frames
- As noted above, RHIPE lacks an interface to HBase
- Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE
Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLLib. Both projects have commercial backing, and show robust development activity. Statistics from GitHub for the thirty days ended February 12 show the following:
- 0xdata H2O: 18 contributors, 938 commits
- Apache Spark: 77 contributors, 794 commits
As of this writing, H2O has more built-in analytic features than MLLib, and its R interface is more mature. Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is solely focused on machine learning.
Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in its partnership with other machine learning vendors, such as SAS. There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project. Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.
Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit. Development activity on these projects is much less robust than for H2O and Spark. GitHub statistics for the thirty days ended February 12 speak volumes:
- Apache Mahout: 5 contributors, 54 commits
- Vowpal Wabbit: 8 contributors, 57 commits
Neither project has significant commercial backing. Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project. (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout). John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.
Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition. Mahout’s supervised learning algorithms, however, are weak. Criticisms of Mahout tend to fall into two categories:
- The project itself is a mess
- Mahout’s integration into MapReduce is suitable only for high latency analytics
On the first point, Mahout certainly does seem eclectic, to say the least. Some of the algorithms are distributed, others are single-threaded; others are simply imported from other projects. Many algorithms are underdeveloped, unsupported or both. The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.
“High latency” is code for slow. Slow food is a thing; “slow analytics” is not a thing. The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.
Vowpal Wabbit has its advocates among data scientists; it is fast, feature rich and runs in Hadoop. Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis. Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.
In Part Three, we will cover commercial software for machine learning in Hadoop.