Advanced Analytics in Hadoop, Part One

This is the first of a two-part post on the current state of advanced analytics in Hadoop.  In this post, I’ll cover some definitions, the business logic of advanced analytics in Hadoop, and summarize the current state of Mahout.  In a second post, I’ll cover some alternatives to Mahout, currently available and in the pipeline.

For starters, a few definitions.

I use the term advanced analytics to cover machine learning tools (including statistical methods) for the discovery and deployment of useful patterns in data.   Discovery means the articulation of patterns as rules or mathematical expressions;  deployment means the mobilization of discovered patterns to improve a business process.  Advanced analytics may include supervised learning or unsupervised learning, but not queries, reports or other analysis where the user specifies the pattern of interest in advance.  Examples of advanced analytic methods include decision trees, neural networks, clustering, association rules and similar methods.

By “In Hadoop” I mean the complete analytic cycle (from discovery to deployment) runs in the Hadoop environment with no data movement outside of Hadoop.

Analysts can and do code advanced analytics directly in MapReduce.  For some insight into the challenges this approach poses, review the slides from a recent presentation at Strata by Allstate and Revolution Analytics.

The business logic for advanced analytics in Hadoop is similar to the logic for in-database analytics.   External memory-based analytic software packages (such as SAS or SPSS) provide easy-to-use interfaces and rich functionality but they require the user to physically extract data from the datastore.  This physical data movement takes time and effort, and may force the analyst to work with a sample of the data or otherwise modify the analytic approach.  Moreover, once the analysis is complete, deployment back into the datastore may require a complete extract and reload of the data, custom programming or both.  The end result is an extended analytic discovery-to-deployment cycle.

Eliminating data movement radically reduces analytic cycle time.  This is true even when actual run time for model development in an external memory-based software package is faster, because the time needed for data movement and model deployment tends to be much greater than the time needed to develop and test models in the first place.  This means that advanced analytics running in Hadoop do not need to be faster than external memory-based analytics; in fact, they can run slower than external analytic software and still reduce cycle time since the front-end and back-end integration tasks are eliminated.

Ideal use cases for advanced analytics in Hadoop have the following profile:

  • Source data is already in Hadoop
  • Applications that consume the analytics are also in Hadoop
  • Business need to use all of available data (e.g. sampling is not acceptable)
  • Business need for minimal analytic cycle time; this is not the same as a need for minimal score latency, which can be accomplished without updating the model itself

The best use cases for advanced analytics running in Hadoop are dynamic applications where the solution itself must be refreshed constantly.  These include microclustering, where there is a business need to update the clustering scheme whenever a new entity is added to the datastore; and recommendation engines, where each new purchase by a customer can produce new recommendations.

Apache Mahout is an open source project to develop scalable machine learning libraries whose core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.   Mahout currently supports classification, clustering, association, dimension reduction, recommendation and lexical analysis use cases.   Consistent with the ideal use cases described above, the recommendation engines and clustering capabilities are the most widely used in commercial applications.

As of Release 0.7 (June 16, 2012), the following algorithms are implemented:

Classification: Logistic Regression, Bayesian, Random Forests, Online Passive Aggressive and Hidden Markov Models

Clustering: Canopy, K-Means, Fuzzy K-Means, Expectation Maximization, Mean Shift, Hierarchical, Dirchlet Process, Latent Dirichlet, Spectral, Minhash, and Top Down

Association: Parallel FP-Growth

Dimension Reduction: Singular Value Decomposition and Stochastic Singular Value Decomposition

Recommenders: Distributed Item-Based Collaborative Filtering and Collaborative Filtering with Parallel Matrix Factorization

Lexical Analysis: Collocations

For a clever introduction to machine learning and Mahout, watch this video.

For more detail, review this presentation on Slideshare.

There are no recently released books on Mahout.  This book is two releases out of date, but provides a good introduction to the project.

Mahout is currently used for commercial applications by Amazon, Buzzlogic, Foursquare, Twitter and Yahoo, among others.   Check the Powered by Mahout page for an extended list.

Next post: Alternatives to Mahout, some partial solutions and enablers, and projects in the pipeline.

What Business Practices Enable Agile Analytics?

Part four in a four-part series.

We’ve mentioned some of the technical innovations that support an Agile approach to analytics; there are also business practices to consider.   Some practices in Agile software development apply equally well to analytics as any other project, including the need for a sustainable development pace; close collaboration; face-to-face conversation; motivated and trustworthy contributors, and continuous attention to technical excellence.  Additional practices pertinent to analytics include:

  • Commitment to open standards architecture
  • Rigorous selection of the right tool for the task
  • Close collaboration between analysts and IT
  • Focus on solving the client’s business problem

More often than not, customers with serious cycle time issues are locked into closed single-vendor architecture.  Lacking an open architecture to interface with data at the front end and back end of the analytics workflow, these organizations are forced into treating the analytics tool as a data management tool and decision engine; this is comparable to using a toothbrush to paint your house.  Server-based analytic software packages are very good at analytics, but perform poorly as databases and decision engines.

Agile analysts take a flexible, “best-in-class” approach to solving the problem at hand.  No single vendor offers “best-in-class” tools for every analytic method and algorithm.  Some vendors, like KXEN, offer unique algorithms that are unavailable from other vendors; others, like Salford Systems, have specialized experience and intellectual property that enables them to offer a richer feature set for certain data mining methods.  In an Agile analytics environment, analysts freely choose among commercial, open source and homegrown software, using a mashup of tools as needed.

While it may seem like a platitude to call for collaboration between an organization’s analytics community and the IT organization, we frequently see customers who have developed complex processes for analytics that either duplicate existing IT processes, or perform tasks that can be done more efficiently by IT. Analysts should spend their time doing analysis, not data movement, management, enhancement, cleansing or scoring; but surveyed analysts typically report that they spend much of their time performing these tasks.  In some cases, this is because IT has failed to provide the needed support; in other cases, the analytics team insists on controlling the process.   Regardless of the root cause, IT and analytics leadership alike need to recognize the need for collaboration, and an appropriate division of labor.

Focusing the analytics effort on the client’s business problem is essential for the practice of Agile analytics.  Organizations frequently get stuck on issues that are difficult to resolve because the parties are focused on different goals; in the analytics world, this takes the form of debates over tools, methods and procedures.  Analysts should bear in mind that clients are not interested in winning prizes for the “best” model, and they don’t care about the analyst’s advanced degrees.   Business requires speed, agility and clarity, and analysts who can’t deliver on these expectations will not survive.

What Is Agile Analytics?

This post is the second in a four-part series.

Agile Analytics is an approach to predictive analytics that emphasizes:

  • Client satisfaction through rapid delivery of usable predictions
  • Focus on model performance when deployed “in market”
  • Iterative and evolutionary approach to model development
  • Rapid cycle time through radical reduction in time to deployment

The Agile approach focuses on the client’s end goal: using data-driven predictions to make better decisions that impact the business.  In contrast, conventional approaches to predictive modeling (such as the well-known SEMMA[1] model) tend to focus on the model development process, with minimal attention given to either the client’s business problem or how the model will be deployed.

Since Agile Analytics is most concerned with how well the predictive model supports the client’s decision-making process, the analyst evaluates the model based on how well it serves this purpose when deployed under market conditions.  In practice, this means that the analyst evaluates model accuracy in production together with score latency, deployment cost and interpretability – a critical factor when building predictive analytics into a human process.   Conventional approaches typically evaluate predictive models solely on model accuracy when back-tested on a sample, a measure that often overstates the accuracy that the model will achieve when deployed under market conditions.

Agile analysts stress rapid deployment and iterative learning; they assume that the knowledge produced from tracking an initial model after it is deployed enables enhancements in subsequent iterations, and they build this expectation into the modeling process.  An Agile analyst quickly develops a predictive model using fast, robust methods and available data, deploys the model, monitors the model in production and improves it as soon as possible.  A conventional analyst tends to take extra time perfecting an initial model prior to deployment, and may pay no attention to in-market performance unless the client complains about anomalies.

Reducing cycle time is critical for the Agile analyst, since every iteration produces new knowledge.  The Agile analyst aggressively looks for ways to reduce the time needed to develop and deploy models, and factors cycle time into the choice of analytic methods.  Conventional analysts are often strikingly unengaged with what happens outside of the model development task; larger analytic teams often delegate tasks like data marshalling, cleansing and scoring to junior members, who perform the “grunt” work with programming tools.


[1] Sample, Explore, Modify, Model, Assess