2013 Rexer Data Miner Survey

Rexer Analytics published its 2013 Data Miner Survey just before the Holidays, and it’s an excellent read.

As always when working with survey research, one should use some caution in interpreting the results; it’s very difficult to build a representative sample of analysts and data miners.  While it is easy to find fault with Rexer’s sample — which vendors who are unhappy with some of the findings will likely try to do — there is no better survey of working analysts available today.

Key findings:

  • Customer Analytics is the most frequently cited application for analytics:
    • Understanding customers
    • Improving customer experience
    • Customer acquisition, upsell and cross-sell
  • Respondents recognize growing data volumes, but the size of their analytic data sets is stable
    • In other words, one should not confuse managing Big Data with analyzing Big Data
  • R is the most widely used analytic software
    • 70% of respondents say they use R
    • 24% say R is their primary tool, more than any other software
  • Text mining is mainstream; 70% of respondents say they mine text now or plan to start
  • Time to deployment remains an issue; respondents report deployment cycles ranging from weeks to a year or more

One of the most interesting pieces of analysis in the survey is a clustering based on the importance ratings of tool selection criteria.  Rexer’s analysis reveals two principal dimensions in the data, one labeled as “Cost” and the other labeled as “Ease of Use and Interface Quality”.  The largest cluster, which includes respondents who rated everything important, should be discounted as an artifact of questionnaire design; it reflects a phenomenon known as the “wrist effect”, where respondents simply check all of the boxes on one end of the scale.   Of the remaining respondents:

  • Respondents who value the ability to write one’s own code generally do not value ease of use, and vice versa.  These respondents are most likely to cite SAS or R as their primary software
    • Among these users, those who cite the importance of cost are much more likely to cite R as their primary tool
    • Those who place a lower value on cost tend to value the quality of the user interface
  • Respondents who value ease of use and the quality of the user interface are more likely to be new to analytics
    • These respondents are most likely to cite Statistica, RapidMiner and IBM SPSS Modeler as their primary tool
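
The screening step described before this list (setting aside respondents who rated everything at one end of the scale) can be sketched in a few lines of Python.  This is only an illustration: the 1-to-5 scale, the respondent IDs and the ratings are hypothetical, and it does not reproduce Rexer’s data or method.

```python
# Hypothetical importance ratings on an assumed 1-5 scale; not Rexer's data.
SCALE_MIN, SCALE_MAX = 1, 5

ratings = {
    "r1": [5, 5, 5, 5, 5],  # everything rated "very important"
    "r2": [5, 2, 4, 1, 3],
    "r3": [1, 1, 1, 1, 1],  # everything rated at the low end
    "r4": [4, 5, 3, 5, 2],
}


def is_straight_liner(answers):
    """True when every answer is identical and sits on one end of the scale."""
    return len(set(answers)) == 1 and answers[0] in (SCALE_MIN, SCALE_MAX)


# Keep only respondents whose ratings actually discriminate between criteria.
usable = {rid: a for rid, a in ratings.items() if not is_straight_liner(a)}
discarded = sorted(set(ratings) - set(usable))

print("cluster on:", sorted(usable))
print("set aside:", discarded)
```

Only the respondents in `usable` would then go into the clustering, so the segments reflect real differences in priorities.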

For more information about the survey and to get a copy, go here.

Analytic Applications (Part Two): Managerial Analytics

This is the second in a four-part taxonomy of analytics based on how the analytic work product is used.  In the first post of this series, I covered Strategic Analytics, or analytics that support the C-suite.  In this post, I will cover Managerial Analytics: analytics that support middle management, including functional and regional line managers.

At this level, questions and issues are functionally focused:

  • What is the best way to manage our cash?
  • Is product XYZ performing according to expectations?
  • How effective are our marketing programs?
  • Where can we find the best opportunities for new retail outlets?

There are differences in nomenclature across functions, as well as distinct opportunities for specialized analytics (retail store location analysis, marketing mix analysis, new product forecasting), but managerial questions and issues tend to fall into three categories:

  • Measuring the results of existing entities (products, programs, stores, factories)
  • Optimizing the performance of existing entities
  • Planning and developing new entities

Measuring existing entities with reports, dashboards, drill-everywhere (etc.) is the sweet spot for enterprise business intelligence systems.  Such systems are highly effective when the data is timely and credible, reports are easy to use and the system reflects a meaningful assessment framework.  This means that metrics (activity, revenue, costs, profits) reflect the goals of the business function and are standardized to enable comparison across entities.

Given the state of BI technology, analysis teams within functions (Marketing, Underwriting, Store Operations, etc.) spend a surprisingly large amount of time preparing routine reports for managers.  (For example, an insurance client asked my firm to assess the actual work performed by a group of more than one hundred SAS users.  The client was astonished to learn that 80% of the SAS usage could be done in Cognos, which the client also owned.)

In some cases, this is simply due to a lack of investment by the organization in the necessary tools and enablers, a problem that is easily fixed.  More often than not, though, the root cause is the absence of consensus within the function of what is to be measured and how performance should be compared across entities.   In organizations that lack measurement discipline, assessment is a free-for-all where individual program and product managers seek out customized reports that show their program or product to the best advantage; in this environment, every program or product is a winner and analytics lose credibility with management.  There is no technical “fix” for this problem; it takes leadership for management to set out clear goals for the organization and build consensus for an assessment framework.

Functional analysts often complain that they spend so much time preparing routine reports that they have little or no time to perform analytics that optimize the performance of existing entities.  Optimization technology is not new, but tends to be used more pervasively in Operational Analytics (which I will discuss in the next post in this series).   Functionally focused optimization tools for management decisions have been available for well over a decade, but adoption is limited for several reasons:

  • First, an organization stuck in the “ad hoc” trap described above will never build the kind of history needed to optimize anything.
  • Second, managers at this level tend to be overly optimistic about the value of their own judgment in business decisions, and resist efforts to replace intuitive judgment with systematic and metrics-based optimization.
  • Finally, in areas such as Marketing Mix decisions, constrained optimization necessarily means choosing one entity over another for resources; this is inherently a leadership decision, so unless functional leadership understands and buys into the optimization approach it will not be used.

Analytics for planning and developing new entities (such as programs, products or stores) usually require information from outside of the organization, and may also require skills not present in existing staff.  For both reasons, analytics for this purpose are often outsourced to providers with access to pertinent skills and data.  For analysts inside the organization, technical requirements look a lot like those for Strategic Analytics: the ability to rapidly ingest data from any source combined with a flexible and agile programming environment and functional support for a wide range of generic analytic problems.

In the next post in this series, I’ll cover Operational Analytics, defined as analytics whose purpose is to improve the efficiency or effectiveness of a business process.

Analytic Applications (Part One)

Conversations about analytics tend to get muddled because the word describes everything from a simple SQL query to climate forecasting.  There are several different ways to classify analytic methods, but in this post I propose a taxonomy of analytics based on how the results are used.

Before we can define enterprise best practices for analytics, we need to understand how they add value to the organization.  One should not lump all analytics together because, as I will show, the generic analytic applications have fundamentally different requirements for people, processes and tooling.

There are four generic analytic applications:

  • Strategic Analytics
  • Managerial Analytics
  • Operational Analytics
  • Customer-Enabling Analytics

In today’s post, I’ll address Strategic Analytics; the rest I’ll cover in subsequent posts.

Strategic Analytics directly address the needs of the C-suite.  This includes answering non-repeatable questions, performing root-cause analysis and supporting make-or-break decisions (among other things).   Some examples:

  • “How will Hurricane Sandy impact our branch banks?”
  • “Why does our top-selling SUV turn over so often?”
  • “How will a merger with XYZ Co. impact our business?”

Strategic issues are inherently not repeatable and fall outside of existing policy; otherwise the issue would be delegated.   Issues are often tinged with a sense of urgency, and a need for maximum credibility; when a strategic decision must be taken, time is of the essence, and the numbers must add up.   Answers to strategic questions frequently require data that is not readily accessible and may be outside of the organization.

Conventional business intelligence systems do not address the needs of Strategic Analytics, due to the ad hoc and sui generis nature of the questions and supporting data requirements.   This does not mean that such systems add no value to the organization; in practice, the enterprise BI system may be the first place an analyst will go to seek an answer.  But no matter how good the enterprise BI system is, it will never be sufficiently complete to provide all of the answers needed by the C-suite.

The analyst is key to the success of Strategic Analytics.  This type of work tends to attract the best and most capable analysts, who are able to work rapidly and accurately under pressure.  Backgrounds tend to be eclectic: an insurance company I’ve worked with, for example, has a strategic analysis team that includes an anthropologist, an economist, an epidemiologist and a graduate of the local community college who worked her way up in the Claims Department.

Successful strategic analysts develop domain, business and organizational expertise that lends credibility to their work.  Above all, the strategic analyst takes a skeptical approach to the data, and demonstrates the necessary drive and initiative to get answers.  This often means doing hard stuff, such as working with programming tools and granular data to get to the bottom of a problem.

More often than not, the most important contribution of the IT organization to Strategic Analytics is to stay out of the way.  Conventional IT production standards are a bug, not a feature, in this kind of work, where the sandbox environment is the production environment.  Smart IT organizations recognize this, and allow the strategic analysts some latitude in how they organize and manage data.   Dumb IT organizations try to force the strategic analysis team into a “Production” framework.  This simply inhibits agility, and encourages top executives to outsource strategic issues to outside consultants.

Analytic tooling tends to reflect the diverse backgrounds of the analysts, and can be all over the map.  Strategic analysts use SAS, R, Stata, Statsoft, or whatever to do the work, and drop the results into PowerPoint.  One of the best strategy analysts I’ve ever worked with used nothing other than SQL and Excel.  Since strategic analysis teams tend to be small, there is little value in demanding use of a single tool set; moreover, most strategic analysts want to use the best tool for the job, and prefer to use niche tools that are optimized for a single problem.

The most important common requirement is the capability to rapidly ingest and organize data from any source and in any format.  For many organizations, this has historically meant using SAS.  (A surprisingly large number of analytic teams use SAS to ingest and organize the data, but perform the actual analysis using other tools.)  Growing data volumes, however, pose a performance challenge for the conventional SAS architecture, so analytic teams increasingly look to data warehouse appliances like IBM Netezza, to Hadoop, or to a combination of the two.

In the next post, I’ll cover Managerial Analytics, which includes analytics designed to monitor and optimize the performance of programs and products.

Advanced Analytics in Hadoop, Part One

This is the first of a two-part post on the current state of advanced analytics in Hadoop.  In this post, I’ll cover some definitions, the business logic of advanced analytics in Hadoop, and summarize the current state of Mahout.  In a second post, I’ll cover some alternatives to Mahout, currently available and in the pipeline.

For starters, a few definitions.

I use the term advanced analytics to cover machine learning tools (including statistical methods) for the discovery and deployment of useful patterns in data.   Discovery means the articulation of patterns as rules or mathematical expressions;  deployment means the mobilization of discovered patterns to improve a business process.  Advanced analytics may include supervised learning or unsupervised learning, but not queries, reports or other analysis where the user specifies the pattern of interest in advance.  Examples of advanced analytic methods include decision trees, neural networks, clustering, association rules and similar methods.

By “In Hadoop” I mean the complete analytic cycle (from discovery to deployment) runs in the Hadoop environment with no data movement outside of Hadoop.

Analysts can and do code advanced analytics directly in MapReduce.  For some insight into the challenges this approach poses, review the slides from a recent presentation at Strata by Allstate and Revolution Analytics.

The business logic for advanced analytics in Hadoop is similar to the logic for in-database analytics.   External memory-based analytic software packages (such as SAS or SPSS) provide easy-to-use interfaces and rich functionality but they require the user to physically extract data from the datastore.  This physical data movement takes time and effort, and may force the analyst to work with a sample of the data or otherwise modify the analytic approach.  Moreover, once the analysis is complete, deployment back into the datastore may require a complete extract and reload of the data, custom programming or both.  The end result is an extended analytic discovery-to-deployment cycle.

Eliminating data movement radically reduces analytic cycle time.  This is true even when actual run time for model development in an external memory-based software package is faster, because the time needed for data movement and model deployment tends to be much greater than the time needed to develop and test models in the first place.  This means that advanced analytics running in Hadoop do not need to be faster than external memory-based analytics; in fact, they can run slower than external analytic software and still reduce cycle time since the front-end and back-end integration tasks are eliminated.
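
The arithmetic behind this argument can be written as a small Python sketch.  The three-stage breakdown follows the paragraph above, but every hour figure is hypothetical and chosen only to illustrate the point; real numbers will vary widely by organization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticCycle:
    extract_hours: float   # moving data out of the datastore
    modeling_hours: float  # developing and testing the model
    deploy_hours: float    # getting the model back to where it scores data

    @property
    def total_hours(self) -> float:
        return self.extract_hours + self.modeling_hours + self.deploy_hours


# All figures are hypothetical and exist only to show the arithmetic.
external = AnalyticCycle(extract_hours=40, modeling_hours=8, deploy_hours=80)
in_hadoop = AnalyticCycle(extract_hours=0, modeling_hours=24, deploy_hours=0)

for name, cycle in (("external package", external), ("in Hadoop", in_hadoop)):
    print(f"{name}: {cycle.total_hours} hours from discovery to deployment")
```

In this made-up case, modeling takes three times as long in Hadoop, yet the total cycle is far shorter because the extract and deployment steps disappear.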

Ideal use cases for advanced analytics in Hadoop have the following profile:

  • Source data is already in Hadoop
  • Applications that consume the analytics are also in Hadoop
  • Business need to use all available data (i.e., sampling is not acceptable)
  • Business need for minimal analytic cycle time; this is not the same as a need for minimal score latency, which can be accomplished without updating the model itself

The best use cases for advanced analytics running in Hadoop are dynamic applications where the solution itself must be refreshed constantly.  These include microclustering, where there is a business need to update the clustering scheme whenever a new entity is added to the datastore; and recommendation engines, where each new purchase by a customer can produce new recommendations.
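
To make the recommendation case concrete, here is a toy, in-memory Python sketch of item-based collaborative filtering, refreshed after a new purchase.  It is not Mahout code and does not run on Hadoop; the customers, products and the cosine-style similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(baskets):
    """Cosine-style similarity between items, based on how often they co-occur."""
    item_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for items in baskets.values():
        unique = sorted(set(items))
        for item in unique:
            item_counts[item] += 1
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), together in pair_counts.items():
        score = together / sqrt(item_counts[a] * item_counts[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(user, baskets, sims, top_n=2):
    """Rank items the user does not own by summed similarity to owned items."""
    owned = set(baskets[user])
    scores = defaultdict(float)
    for item in owned:
        for other, score in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += score
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical purchase histories.
baskets = {
    "ann": ["tent", "stove", "lantern"],
    "bob": ["tent", "stove"],
    "cat": ["tent", "lantern", "boots"],
    "dan": ["stove"],
}

before = recommend("dan", baskets, item_similarities(baskets))
baskets["dan"].append("tent")  # a new purchase arrives
after = recommend("dan", baskets, item_similarities(baskets))

print("before purchase:", before)
print("after purchase:", after)
```

A single new purchase changes both the similarity table and the ranked list, which is why this kind of application benefits from running where the data already lives.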

Apache Mahout is an open source project to develop scalable machine learning libraries whose core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.   Mahout currently supports classification, clustering, association, dimension reduction, recommendation and lexical analysis use cases.   Consistent with the ideal use cases described above, the recommendation engines and clustering capabilities are the most widely used in commercial applications.

As of Release 0.7 (June 16, 2012), the following algorithms are implemented:

Classification: Logistic Regression, Bayesian, Random Forests, Online Passive Aggressive and Hidden Markov Models

Clustering: Canopy, K-Means, Fuzzy K-Means, Expectation Maximization, Mean Shift, Hierarchical, Dirichlet Process, Latent Dirichlet Allocation, Spectral, Minhash, and Top Down

Association: Parallel FP-Growth

Dimension Reduction: Singular Value Decomposition and Stochastic Singular Value Decomposition

Recommenders: Distributed Item-Based Collaborative Filtering and Collaborative Filtering with Parallel Matrix Factorization

Lexical Analysis: Collocations

For a clever introduction to machine learning and Mahout, watch this video.

For more detail, review this presentation on Slideshare.

There are no recently released books on Mahout.  This book is two releases out of date, but provides a good introduction to the project.

Mahout is currently used for commercial applications by Amazon, Buzzlogic, Foursquare, Twitter and Yahoo, among others.   Check the Powered by Mahout page for an extended list.

Next post: Alternatives to Mahout, some partial solutions and enablers, and projects in the pipeline.

How Important is Model Accuracy?

Go to a trade show for predictive analytics and listen to the presentations; most will focus on building more accurate predictive models.  Presenters will differ on how this should be done: some will tell you to purchase their brand of software, others will encourage you to adopt one method or another, but most will agree: accuracy isn’t everything, it’s the only thing.

I’m not going to argue in this post that accuracy isn’t a good thing (all other things equal), but consider the following scenario: you have a business problem that can be mitigated with a predictive model.  You ask three consultants to submit proposals, and here’s what you get:

  • Consultant A proposes to spend one day and promises to produce a model that is more accurate than a coin flip
  • Consultant B proposes to spend one week, and promises to produce a model that is more accurate than Consultant A’s model
  • Consultant C proposes to spend one year, and promises to produce the most accurate model of all

Which one will you choose?

This is an extreme example, of course, but my point is that one rarely hears analysts talk about the time and effort needed to achieve a given level of accuracy, or the time and effort needed to implement a predictive model in production.  But in real enterprises, there are essential trade-offs that must be factored into the analytics process.  As we evaluate these three proposals, consider the following points:

(1) We can’t know how accurate a prediction will be; we can only know how accurate it was.

We judge the accuracy of a prediction after the event of interest has occurred.  In practice, we evaluate the accuracy of a predictive model by examining how accurate it is with historical data.  This is a pretty good method, since past patterns often persist into the future.  The key word is “often”, as in “not always”; the world isn’t in a steady state, and black swans happen.   This does not mean we should abandon predictive modeling, but it does mean we should treat very small differences in model back-testing with skepticism.
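
One rough way to see why small back-test differences deserve skepticism is to compare the gap with its sampling noise.  The Python sketch below uses a normal approximation and treats the two back-tests as independent samples, which is a simplification (models scored on the same holdout are correlated, and a paired test would be more appropriate).  The accuracy figures and holdout size are hypothetical.

```python
from math import sqrt


def accuracy_std_error(accuracy, n_cases):
    """Standard error of an accuracy estimate from n_cases holdout records."""
    return sqrt(accuracy * (1.0 - accuracy) / n_cases)


def z_score_for_difference(acc_a, acc_b, n_cases):
    """Gap between two accuracies, in units of its approximate standard error."""
    se_a = accuracy_std_error(acc_a, n_cases)
    se_b = accuracy_std_error(acc_b, n_cases)
    return (acc_b - acc_a) / sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical back-test: 80% vs. 81% accuracy on 2,000 holdout cases each.
z = z_score_for_difference(0.80, 0.81, 2000)
distinguishable = abs(z) > 1.96  # conventional 95% threshold

print(f"z = {z:.2f}; distinguishable from noise: {distinguishable}")
```

With these made-up numbers, the one-point improvement is well inside the noise, so it says little about which model will perform better in production.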

(2) Overall model accuracy is irrelevant.

We live in an asymmetrical world, and errors in prediction are not all alike.   Let’s suppose that your doctor thinks that you may have a disease that is deadly, but can be treated with an experimental treatment that is painful and expensive.  The doctor gives you the option of two different tests, and tells you that Test X has an overall accuracy rate of 60%, while Test Y has an overall accuracy of 40%.

Hey, you think, that’s a no-brainer; give me Test X.

What the doctor did not tell you is that all of the errors for Test X are false negatives: the test says you don’t have the disease when you actually do.  Test Y, on the other hand, produces a lot of false positives, but also correctly predicts all of the actual disease cases.

If you chose Test X, congratulations!  You’re dead.
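
The two tests can be laid out as confusion matrices in a short Python sketch.  The counts are my own assumption: 100 hypothetical patients, 40 of whom have the disease, chosen so that the results match the 60% and 40% accuracy figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # has the disease, test says yes
    fp: int  # no disease, test says yes
    tn: int  # no disease, test says no
    fn: int  # has the disease, test says no

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    @property
    def recall(self) -> float:
        """Share of actual disease cases that the test catches."""
        actual_positives = self.tp + self.fn
        return self.tp / actual_positives if actual_positives else 0.0


# Hypothetical counts for 100 patients, 40 of whom have the disease.
test_x = ConfusionMatrix(tp=0, fp=0, tn=60, fn=40)  # every error is a false negative
test_y = ConfusionMatrix(tp=40, fp=60, tn=0, fn=0)  # every error is a false positive

for name, cm in (("Test X", test_x), ("Test Y", test_y)):
    print(f"{name}: accuracy {cm.accuracy:.0%}, disease cases caught {cm.recall:.0%}")
```

Test X wins on overall accuracy while catching none of the actual disease cases; Test Y loses on accuracy while catching all of them.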

(3) We can’t know the value of model accuracy without understanding the differential cost of errors.

In the previous example, the differential cost of errors is starkly exaggerated: on the one hand, we risk death; on the other hand, we undergo painful and expensive treatment.  In commercial analytics, the economics tend to be more subtle: in Marketing, for example, we send a promotional message to a customer who does not respond (false positive) or decline a credit line for a prospective customer who would have paid on time (false negative).  The actual measure of a model, however, isn’t its statistical accuracy, but its economic accuracy: the overall impact of the predictive model on the business process it is designed to support.
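
A minimal Python sketch of the idea: weight each kind of error by what it costs the business and compare models on total cost.  The two models, their error counts and the unit costs are all hypothetical.

```python
def accuracy(n_cases, false_positives, false_negatives):
    """Share of cases predicted correctly."""
    return (n_cases - false_positives - false_negatives) / n_cases


def error_cost(false_positives, false_negatives, cost_per_fp, cost_per_fn):
    """Total cost of a model's mistakes, given a unit cost for each error type."""
    return false_positives * cost_per_fp + false_negatives * cost_per_fn


N_CASES = 10_000
COST_PER_FP = 1.0    # assumed: a wasted promotional message
COST_PER_FN = 50.0   # assumed: margin lost on a good customer turned away

# Hypothetical error counts for two models scored on the same customers.
model_a = {"false_positives": 900, "false_negatives": 100}
model_b = {"false_positives": 300, "false_negatives": 400}

acc_a = accuracy(N_CASES, **model_a)
acc_b = accuracy(N_CASES, **model_b)
cost_a = error_cost(cost_per_fp=COST_PER_FP, cost_per_fn=COST_PER_FN, **model_a)
cost_b = error_cost(cost_per_fp=COST_PER_FP, cost_per_fn=COST_PER_FN, **model_b)

print(f"Model A: accuracy {acc_a:.0%}, cost of errors {cost_a:,.0f}")
print(f"Model B: accuracy {acc_b:.0%}, cost of errors {cost_b:,.0f}")
```

With these assumed costs, Model B is the more accurate of the two but its errors cost more than three times as much, because its mistakes fall on the expensive side.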

Taking these points into consideration, Consultant A’s quick and dirty approach looks a lot better, for three reasons:

  • Better results in back-testing for Consultants B and C may or may not be sustainable in production
  • Consultant A’s model benefits the business sooner than the other models
  • Absent data on the cost of errors, it’s impossible to say whether Consultants B and C add more value

A fourth point goes to the heart of what Agile Analytics is all about.  While Consultants B and C continue to tinker in the sandbox, Consultant A has in-market results to use in building a better predictive model.

The bottom line is this: the first step in any predictive modeling effort must always focus on understanding the economics of the business process to be supported.  Unless and until the analyst knows the business value of a correct prediction — and the cost of incorrect predictions — it is impossible to say which predictive model is “best”.