This is the first of a two-part post on the current state of advanced analytics in Hadoop. In this post, I’ll cover some definitions and the business logic of advanced analytics in Hadoop, then summarize the current state of Mahout. In a second post, I’ll cover some alternatives to Mahout, both currently available and in the pipeline.
For starters, a few definitions.
I use the term advanced analytics to cover machine learning tools (including statistical methods) for the discovery and deployment of useful patterns in data. Discovery means the articulation of patterns as rules or mathematical expressions; deployment means the mobilization of discovered patterns to improve a business process. Advanced analytics may include supervised learning or unsupervised learning, but not queries, reports or other analysis where the user specifies the pattern of interest in advance. Examples of advanced analytic methods include decision trees, neural networks, clustering, association rules and similar methods.
By “in Hadoop” I mean that the complete analytic cycle (from discovery to deployment) runs in the Hadoop environment, with no data movement outside of Hadoop.
Analysts can and do code advanced analytics directly in MapReduce. For some insight into the challenges this approach poses, review the slides from a recent presentation at Strata by Allstate and Revolution Analytics.
The business logic for advanced analytics in Hadoop is similar to the logic for in-database analytics. External memory-based analytic software packages (such as SAS or SPSS) provide easy-to-use interfaces and rich functionality but they require the user to physically extract data from the datastore. This physical data movement takes time and effort, and may force the analyst to work with a sample of the data or otherwise modify the analytic approach. Moreover, once the analysis is complete, deployment back into the datastore may require a complete extract and reload of the data, custom programming or both. The end result is an extended analytic discovery-to-deployment cycle.
Eliminating data movement radically reduces analytic cycle time. This is true even when actual run time for model development in an external memory-based software package is faster, because the time needed for data movement and model deployment tends to be much greater than the time needed to develop and test models in the first place. This means that advanced analytics running in Hadoop do not need to be faster than external memory-based analytics; in fact, they can run slower than external analytic software and still reduce cycle time since the front-end and back-end integration tasks are eliminated.
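To make the arithmetic behind this argument concrete, here is a minimal sketch in Python. The hour figures are hypothetical, chosen only for illustration and not taken from any measurement; the `Cycle` class and its field names are likewise invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    """One pass through the discovery-to-deployment cycle, in hours."""
    extract_hours: float  # moving data out of the datastore
    model_hours: float    # developing and testing the model
    deploy_hours: float   # getting the model back to where the data lives

    def total(self) -> float:
        return self.extract_hours + self.model_hours + self.deploy_hours


# Hypothetical numbers for illustration only; these are not measurements.
external = Cycle(extract_hours=8.0, model_hours=2.0, deploy_hours=12.0)
in_hadoop = Cycle(extract_hours=0.0, model_hours=6.0, deploy_hours=0.0)

external_total = external.total()    # 22.0
in_hadoop_total = in_hadoop.total()  # 6.0
print(external_total, in_hadoop_total)
```

Under these assumed figures, model development in Hadoop takes three times as long, yet the overall cycle is still much shorter because the extract and deployment steps drop out.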
Ideal use cases for advanced analytics in Hadoop have the following profile:
- Source data is already in Hadoop
- Applications that consume the analytics are also in Hadoop
- Business need to use all available data (i.e., sampling is not acceptable)
- Business need for minimal analytic cycle time; this is not the same as a need for minimal score latency, which can be accomplished without updating the model itself
The best use cases for advanced analytics running in Hadoop are dynamic applications where the solution itself must be refreshed constantly. These include microclustering, where there is a business need to update the clustering scheme whenever a new entity is added to the datastore; and recommendation engines, where each new purchase by a customer can produce new recommendations.
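As a rough illustration of why a recommendation engine needs constant refreshing, the sketch below keeps item co-occurrence counts in memory and updates them on every purchase. It is a single-process toy under stated assumptions, not how any Hadoop-based recommender is implemented; the class name, method names and sample data are all hypothetical.

```python
from collections import defaultdict


class CooccurrenceRecommender:
    """Toy item-based recommender driven by co-purchase counts."""

    def __init__(self):
        self.baskets = defaultdict(set)                    # customer -> items bought
        self.cooc = defaultdict(lambda: defaultdict(int))  # item -> item -> count

    def add_purchase(self, customer, item):
        """Record a purchase and update co-occurrence counts immediately."""
        if item in self.baskets[customer]:
            return
        for other in self.baskets[customer]:
            self.cooc[item][other] += 1
            self.cooc[other][item] += 1
        self.baskets[customer].add(item)

    def recommend(self, customer, n=3):
        """Rank items the customer does not own by summed co-occurrence."""
        owned = self.baskets[customer]
        scores = defaultdict(int)
        for item in owned:
            for other, count in self.cooc[item].items():
                if other not in owned:
                    scores[other] += count
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(scores, key=lambda i: (-scores[i], i))[:n]


# Hypothetical purchase history.
rec = CooccurrenceRecommender()
for customer, item in [
    ("ann", "tent"), ("ann", "stove"),
    ("bob", "tent"), ("bob", "stove"), ("bob", "lantern"),
    ("dan", "lantern"), ("dan", "batteries"),
    ("cat", "tent"),
]:
    rec.add_purchase(customer, item)

before = rec.recommend("cat")      # ['stove', 'lantern']
rec.add_purchase("cat", "lantern")
after = rec.recommend("cat")       # ['stove', 'batteries']
print(before, after)
```

A single new purchase changes both the underlying counts and the recommendations, which is the sense in which the solution itself, not just the score, has to be refreshed.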
Apache Mahout is an open source project to develop scalable machine learning libraries whose core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm. Mahout currently supports classification, clustering, association, dimension reduction, recommendation and lexical analysis use cases. Consistent with the ideal use cases described above, the recommendation engines and clustering capabilities are the most widely used in commercial applications.
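To give a feel for what expressing an algorithm in the MapReduce paradigm means, here is a minimal sketch of one K-Means iteration split into a map step and a reduce step. It runs in a single Python process and is not Mahout's code (Mahout is written in Java); the function names and sample points are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce pass and return the updated centroids."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)  # grouping by key stands in for the shuffle
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical data: two obvious groups of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

In a real cluster the map calls would run in parallel across the nodes holding the data and the framework would do the grouping, but the division of labor is the same.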
As of Release 0.7 (June 16, 2012), the following algorithms are implemented:
Classification: Logistic Regression, Bayesian, Random Forests, Online Passive Aggressive and Hidden Markov Models
Clustering: Canopy, K-Means, Fuzzy K-Means, Expectation Maximization, Mean Shift, Hierarchical, Dirichlet Process, Latent Dirichlet Allocation, Spectral, Minhash, and Top Down
Association: Parallel FP-Growth
Dimension Reduction: Singular Value Decomposition and Stochastic Singular Value Decomposition
Recommenders: Distributed Item-Based Collaborative Filtering and Collaborative Filtering with Parallel Matrix Factorization
Lexical Analysis: Collocations
For a clever introduction to machine learning and Mahout, watch this video.
For more detail, review this presentation on Slideshare.
There are no recently released books on Mahout. This book is two releases out of date, but provides a good introduction to the project.
Mahout is currently used for commercial applications by Amazon, Buzzlogic, Foursquare, Twitter and Yahoo, among others. Check the Powered by Mahout page for an extended list.
Next post: Alternatives to Mahout, some partial solutions and enablers, and projects in the pipeline.