This is the second of a three-part series on the current state of play for machine learning in Hadoop. Part One is here. In this post, we cover open source options.
As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines. This post will focus on machine learning, but it’s worth noting that open source options for fast queries include Impala and Shark; for streaming analytics, Storm, S4 and Spark Streaming; and for graph engines, Giraph, GraphLab and Spark GraphX.
Tools for machine learning in Hadoop can be classified into two main categories:
- Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
- Fully distributed machine learning software that integrates with Hadoop
There are two major open source projects in the first category. The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase. RHIPE, a project led by Mozilla’s Saptarshi Guha, offers similar functionality, but without the HBase integration.
Both projects enable R users to implement explicit parallelization in MapReduce. R users write R scripts specifically intended to be run as mappers and reducers in Hadoop. Users must have MapReduce skills, and must refactor program logic for distributed execution. There are some differences between the two projects:
- RHadoop uses standard R functions for mapping and reducing; RHIPE uses unevaluated R expressions
- RHIPE users work with data as key/value pairs; RHadoop loads data into familiar R data frames
- As noted above, RHIPE lacks an interface to HBase
- Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE
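To make the “run-beside” pattern concrete, here is a minimal sketch, in Python rather than R, of the explicit map/reduce refactoring both projects demand of their users. The function names and the local driver standing in for Hadoop’s shuffle are hypothetical, purely for illustration:

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit a (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reducer: sum the counts emitted for a single word.
    return (word, sum(counts))

def run_local(lines):
    # Local stand-in for Hadoop's shuffle: group mapper output by key,
    # then apply the reducer to each group.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_local(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In RHadoop or RHIPE the same pattern is expressed in R, with the framework distributing the mappers and reducers across the cluster; the point is that the user, not the tool, must recast program logic into this shape.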
Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLlib. Both projects have commercial backing and show robust development activity. Statistics from GitHub for the thirty days ended February 12 show the following:
- 0xdata H2O: 18 contributors, 938 commits
- Apache Spark: 77 contributors, 794 commits
H2O is a project of startup 0xdata, which operates on a services and support business model. Recent coverage by this blog here; additional coverage here, here and here.
MLlib is one of several projects included in Apache Spark. Databricks and Cloudera offer commercial support. Recent coverage by this blog here and here; additional coverage here, here, here and here.
As of this writing, H2O has more built-in analytic features than MLlib, and its R interface is more mature. Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is focused solely on machine learning.
Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in partnerships with other machine learning vendors, such as SAS. There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project. Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.
Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit. Development activity on these projects is much less robust than for H2O and Spark. GitHub statistics for the thirty days ended February 12 speak volumes:
- Apache Mahout: 5 contributors, 54 commits
- Vowpal Wabbit: 8 contributors, 57 commits
Neither project has significant commercial backing. Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project. (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout.) John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.
Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition. Mahout’s supervised learning algorithms, however, are weak. Criticisms of Mahout tend to fall into two categories:
- The project itself is a mess
- Mahout’s integration into MapReduce is suitable only for high latency analytics
On the first point, Mahout certainly does seem eclectic, to say the least. Some of the algorithms are distributed, others are single-threaded, and still others are simply imported from other projects. Many algorithms are underdeveloped, unsupported or both. The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.
“High latency” is code for slow. Slow food is a thing; “slow analytics” is not a thing. The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results to disk after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.
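A toy sketch of why this matters, assuming a hypothetical storage class that counts full passes over the input (nothing here is a real framework API): under MapReduce semantics every iteration re-reads its input from HDFS, while an in-memory engine reads it once and caches it.

```python
class Storage:
    """Toy stand-in for HDFS: counts how often the dataset is read."""
    def __init__(self, data):
        self._data = data
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)

def iterate_mapreduce(storage, iterations):
    # MapReduce-style: each pass re-reads input and persists its result.
    total = 0
    for _ in range(iterations):
        total = sum(storage.read())  # full scan from "disk" every time
    return total

def iterate_in_memory(storage, iterations):
    # Spark/H2O-style: read once, cache, then iterate over memory.
    cached = storage.read()
    total = 0
    for _ in range(iterations):
        total = sum(cached)
    return total

disk = Storage(range(100))
iterate_mapreduce(disk, 10)
print(disk.reads)   # 10 reads: one full pass per iteration

disk = Storage(range(100))
iterate_in_memory(disk, 10)
print(disk.reads)   # 1 read: data cached in memory
```

With ten iterations the MapReduce-style loop touches storage ten times and the cached loop once; for real iterative algorithms such as gradient descent or k-means, that difference in I/O dominates runtime.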
Vowpal Wabbit has its advocates among data scientists; it is fast, feature-rich and runs in Hadoop. Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis. Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.
In Part Three, we will cover commercial software for machine learning in Hadoop.
15 thoughts on “Machine Learning in Hadoop: Part Two”
1) >> In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout.
Not completely true. Myrrix is an evolution of Mahout.
2) There is no mention of Python. There is a lot of action happening in the intersection of Python and ML.
Thanks for reading and commenting.
Re #1, seems to be a semantic point. Sean Owen, founder of Myrrix, is also the single largest contributor to Mahout over the life of the project.
Re #2, no disrespect intended to the Pythonista community, but it’s a general purpose language and not an analytic language. Everyone knows that you can write custom machine learning code in Python, Java, MapReduce, C etc.
1) Sean Owen left Apache Mahout PMC and is not much active in the Mahout community. 2) Agree that Python is a general purpose language, take a look at Pandas, Scikit-learn for analytics and ML.
1) So what?
FWIW I agree with the assessment of Mahout above. Myrrix was really a different project. The reasoning for joining Cloudera was not related to Mahout. I have put a comparatively large amount of effort into it, but not currently. CDH5 contains Mahout and it’s fully supported (a lot by me). But the next major release (like, over a year away) will almost certainly not include it, due to the issues above.
Thanks for reading and commenting. Did not intend to start a controversy here, but simply wanted to note that Cloudera has invested something in Mahout, which is more than the other distributors can say.
The Mahout project team claims that Myrrix is “built on” Mahout. If that is not correct, you may want to contact them and correct the record.
According to the Myrrix website, Myrrix “evolved from” Mahout. That implies a fork, but not a completely unrelated project.
If Cloudera’s rationale for acquiring Myrrix “was not related” to Mahout and you are now providing most of the support for Mahout, it would appear that someone miscalculated, no?
Code-wise there is virtually no relationship between the two. “Spiritually” you might say there is, as I authored a lot of the recommender stuff in Mahout, and Myrrix was a recommender also based on Hadoop from the same person. “Built on” isn’t quite accurate but not a big deal to me; there is a non-zero relationship between the two projects.
Cloudera has supported Mahout from before I arrived. In truth there is little use of it that results in customer support work, and it was on its way out before I came. I only meant it as a passing comment, that as it happens, I’ve taken over the little support work there is as well, as I certainly have most background in the project within the company.
Really my work is around the successor project, called Oryx, and Spark, customer engagement, etc. Mahout is a few percent of the work.
Cloudera may have “supported” Mahout prior to the acquisition of Myrrix, but they weren’t exactly promoting it. There are five postings pertinent to Mahout on the Cloudera Developer Blog, and one of those talks about how to use Pig instead of Mahout. By that measure, Mahout ranks dead last among the Apache bits.
I think I and others at Cloudera are in violent agreement with you. As I alluded to, it was only narrowly included in CDH5. Still, a promise is a promise, and Cloudera can and does support and fix Mahout for customers — I delivered a patch for RDF this month, for example.
The goal should be advancing the sophistication of open platforms for ML on Hadoop, regardless of the project or where it’s from, and that is why I/Cloudera are putting energy into things like Spark (now an Apache project actually) and to a lesser extent Oryx.
But breaking from the Apache orthodoxy draws attacks from pure-play Hadoop vendors (*cough* http://mail-archives.apache.org/mod_mbox/mahout-dev/201311.mbox/%3CCAEccTywczrx866CuFp07o0bA=hvV+W-AGDykMg8KSK2WYAwtHQ@mail.gmail.com%3E). It is usually easier to err on the side of supporting Apache stuff just because it’s Apache.
Again, many thanks for reading and commenting.
Cloudera’s support for Spark and Oryx put it squarely in the lead for machine learning in Hadoop, IMHO
Love the first two articles, but can’t seem to find the 3rd in the series. Is it not written yet or do I have terrible search skills?
Your search skills are fine, I never got around to writing part 3. Looking to update the whole topic soon, so stay tuned. Thanks for reading!