How to Optimize Your Marketing Spend

There are formal methods and tools you can use to optimize marketing spend, including software from SAS, IBM and HP (among others).  The usefulness of these methods, however, depends on basic disciplines that are missing from many Marketing organizations.

In this post I’d like to propose some informal rules for marketing optimization.  These do not exclude using formal methods as well — think of them as organizing principles to put in place before you go shopping for optimization software.

(1) Ignore your agency’s “metrics”.

You use agencies to implement your Marketing campaigns, and they will be more than happy to provide you with analysis that shows you how much value you’re getting from the agency.  Ignore these.  Asking your agency to measure results of the campaigns they implement is like asking Bernie Madoff to serve as the custodian for your investments.

Every agency analyst understands that the role of analytics is to make the account team look good.   This influences the analytic work product in a number of ways, from use of bogus and irrelevant metrics to cherry-picking the numbers.

Digital media agencies are very good at execution, and they should play a role in developing strategy.  But if you are serious about getting the most from your Marketing effort, you should have your own people measure campaign results, or engage an independent analytics firm to perform this task for you.

(2) Use market testing to measure every campaign.

Market testing isn’t the gold standard for campaign measurement; it’s the only standard.  The principle is straightforward: you assign marketing treatments to prospects at random, including a control group who receive no treatment.  You then measure subsequent buying behavior among members of the treatment and control groups; the campaign impact is the difference between the two.

The beauty of test marketing is that you do not need a hard link between impressions and revenue at the point of sale, nor do you need to control for other impressions or market noise.  If treatments and controls are assigned at random, any differences in buying behavior are attributable to effects of the campaign.

Testing takes more effort to design and implement, which is one reason your agency will object to it.  The other reason is that rigorous testing often shows that brilliant creative concepts have no impact on sales.  Agency strategists tend to see themselves as advocates for creative “branding”; they oppose metrics that expose them as gasbags.  That is precisely why you should insist on it.

(3) Kill campaigns that do not cover media costs.

Duh, you think.  Should be obvious, right?  Think again.

A couple of years ago, I reviewed the digital media campaigns for a big retailer we shall call Big Brand Stores.  Big Brand ran forty-two digital campaigns per fiscal year; stunningly, exactly one campaign — a remarketing campaign — showed incremental revenue sufficient to cover media costs.  (This analysis made no attempt to consider other costs, including creative development, site-side development, program management or, for that matter, cost of goods sold.)

There is a technical term for campaigns that do not cover media costs.  They’re called “losers”.

The client’s creative and media strategists had a number of excuses for running these campaigns, such as:

  • “We’re investing in building the brand.”
  • “We’re driving traffic into the stores.”
  • “Our revenue attribution is faulty.”

Building a brand is a worthy project; you do it by delivering great products and services over time, not by spamming everyone you can find.

It’s possible that some of the shoppers rummaging through the marked-down sweaters in your bargain basement saw your banner ad this morning.  Possible, but not likely; it’s more likely they’re there because they know exactly when you mark down sweaters every season.

Complaints about revenue attribution usually center on the “last click” versus “full-funnel” debate, a tiresome argument you can avoid by insisting on measurement through market testing.

If you can’t measure the gain, don’t do the campaign.

(4) Stop doing one-off campaigns.

Out of Big Brand’s forty-two campaigns, thirty-nine were one-offs: campaigns run once and never again.  A few of these made strategic sense: store openings, special competitive situations and so forth.  The majority were simply “concepts”, based on the premise that the client needed to “do something” in April.

The problem with one-off campaigns is that you learn little or nothing from them.  The insight you get from market testing enables you to tune and improve one campaign with a well-defined value proposition targeted to a particular audience.  You get the most value from that insight when you repeat the campaign.  Marketing organizations stuck in the one-off trap never build the knowledge and insight needed to compete effectively.  They spend much, but learn nothing.

Allocate no more than ten percent of your Marketing spend to one-off campaigns.  Hold this as a reserve for special situations — an unexpected competitive threat, product recall or natural disaster.  Direct the rest of your budget toward ongoing programs defined by strategy.  For more on that, read the next section.

(5) Drive campaign concepts from strategy.

Instead of spending your time working with the agency to decide which celebrity endorsement to tout in April, develop ongoing programs that address key strategic business problems.  For example, among a certain segment of consumers, awareness and trial of your product may be low; for a credit card portfolio, share of revolving balances may be lagging competing cards among certain key segments.

The exact challenge depends on your business situation; what matters is that you choose initiatives that (a) can be influenced through a sustained program of marketing communications and (b) will make a material impact to your business.

Note that “getting lots of clicks in April” satisfies the former but not the latter.

This principle assumes that you have a strategic segmentation in place, because segmentation is to Marketing what maneuver is to warfare.  You simply cannot expect to succeed by attempting to appeal to all consumers in the same way.  Your choice of initiatives should also demonstrate some awareness of the customer lifecycle; for example, you don’t address current customers in the same way that you address prospective customers or former customers.

When doing this, keep the second and third principles in mind: a campaign concept is only a concept until it is tested.  A particular execution may fail market testing, but if you have chosen your initiatives well you will try again using a different approach.  Keep in mind that you learn as much from failed market tests as from successful market tests.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefit of parallel computing is speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADLib should run in any SQL environment, but Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics tasks are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Python for Analytics

A reader complains that I did not include Python in a survey of Machine Learning in Hadoop.  It’s a fair point.  There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match.  Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered.  In this post I review the available evidence about the incidence of Python use for analytics; in a separate post, I will survey Python’s capabilities.

Python is a general purpose programming language whose syntax enables programmers to write efficient and concise code.  The Python Software Foundation manages an open source reference implementation written in C and nicknamed CPython.  Alternative implementations include Jython, written in Java; IronPython, for .net; and PyPy, a just-in-time compiler.

There is no dispute that Python is a popular language for general-purpose programming; according to the Transparent Language Popularity Index (TLPI), Python currently ranks seventh in popularity behind  C, Java, Objective C, C++, Basic and PHP.  By the same measure, exclusively analytic languages rank lower:

  • #14. R
  • #19. MATLAB
  • #26. Scala
  • #31. SAS

Measures like TLPI or the Tiobe Community Programming Index tell us something about the overall popularity of a language, but relatively little about its popularity for analytics. Many Python users aren’t at all engaged in analytics, and many analysts don’t use Python.

Python performs very well in Bob Muenchen’s analysis of analytic job postings (which he has perfected into a science).  Muenchen’s analysis shows that Python ranks third in analytic job postings, behind Java and SAS.  Python and R were at rough parity in job postings until early January 2013; since then, Python has outpaced R.

Surveys of analytic users show a mixed picture, reflecting differences in sampling and question construction.  In the 2013 Rexer survey, 64% of all respondents report writing their own code; the top reported choice is SQL (43%), followed by Java (26%) and Python (24%).  (These results are difficult to square with the overall finding that 70% of the respondents use R, which requires the user to write code.)   Rexer’s sample includes a mix of Power Analysts and Business Analysts, but relatively few Data Scientists.  (See this post for a definition of Analytic User Personas).

KDnuggets conducted its annual software poll in 2013; Python ranked fifth behind RapidMiner, R, Excel and Weka/Pentaho.   In a separate KDnuggets poll explicitly focused on programming languages for analytics, data mining and data science, Python ranked second behind R.  The KDnuggets online poll is a convenience sample (which is vulnerable to response bias), but there is no reason to believe that either R or Python users are over-represented relative to one another.  The KDnuggets community consists largely of Data Scientists and Power Analysts.

A follow-up poll by KDnuggets expressly about switching between Python and R found that more people use R than Python, and users switching from other tools are more likely to choose R over Python; however, more users are switching from R to Python than from Python to R.  The graphic below illustrates these relationships.

Switching Between Python and R

O’Reilly Media’s survey of data scientists at the 2012 and 2013 Strata conferences shows Python ranked third, behind SQL and R.  (The survey does not break out responses from 2012 and 2013).  More interesting is O’Reilly’s analysis of how reported usage of each tool correlates with all of the others; the graph shown below depicts all of the positive correlations significant at p=.05.

Strata Tool Correlation

The most striking thing in this graph is the separation between open source tools at the top of the graph and commercial tools at the bottom; respondents tend to use one or the other, but not both.  The dense network among open source tools indicates that those who use any open source tool tend to use many others.  (Weka’s isolation from other tools in the graph indicates either that (a) Weka is a really awesome tool or (b) Weka users have a unique perspective on life. Or both.)

Among respondents to O’Reilly’s survey, Python and R use are correlated, and so are Java and R use; but Python and Java use are not correlated.  Python and R use both correlate with Apache Hadoop and graph engines; Python also correlates with other components of the Hadoop ecosystem, such as Hive, Mahout and Hbase.

To summarize: Python usage is firmly embedded in the open source analytics ecosystem; however, usage is largely concentrated among Data Scientists, with lower penetration among Power Analysts (for whom R and SAS remain the preferred languages).  The KDnuggets data suggests that new entrants to analytic programming are more likely to choose R over Python, but the rate of switching from R to Python suggests that Python addresses needs not currently met with R.

Arguments by Python advocates that Python will outpace R because it is easier to use strike me as silly.  R is not difficult to learn for motivated users.  Unmotivated users aren’t going to choose Python over R; they will choose a business analytics tool like Alpine, Alteryx or Rapid Miner and skip coding entirely.  Analysts who want to code will choose a language for its functionality and not the elegance of its syntax.

Analytic User Personas

Analytic users are not all the same; in most organizations, there are a number of different user “personalities”, or personas, with distinct needs.  If you develop an analytics architecture for your organization or develop analytic software to sell to others, it is important to understand these personas.  In this essay, I profile four personas:

  • Power Analyst
  • Data Scientist
  • Business Analyst
  • Analytic Consumer

Your organization may or may not include all four personas; for example, if your organization consistently outsources predictive model building, you may have no Power Analysts or Data Scientists.   Moreover, if your organization is large enough, it may be valuable for you to recognize distinct subclasses of users within each persona.   In any event, your success depends on how well you understand the diverse needs of prospective users.

The Power Analyst

The Power Analyst sees advanced analytics as a full-time job, and holds positions such as Statistician or Actuary in organizations with significant investments in analytics, or as consultants in organizations that provide analytic services.  The Power Analyst understands conventional statistics and machine learning, and has considerable working experience in applied analytics.

Power Analysts prefer to work in an analytic programming language such as Legacy SAS or R.  They have enough training and working experience with the language to be productive, and consider analytic programming languages to be more flexible and powerful than analytic software packages with GUI interfaces.  They do not need analytics to be easy, and may look down on those who do.

The rightanalytic method is extremely important to Power Analysts; they tend to be more concerned with using the correctmethodology than with actual differences in business results achieved with different methods.  This means, for example, if a particular analytic problem calls for a specific method or class of methods, such as Survival Analysis, the Power Analyst will go to great lengths to use this method even if the improvement to predictive accuracy is very small.

In practice, since working Power Analysts tend to work with highly diverse problems and cannot always predict the nature of the problems they will need to address, they place a premium on being able to use a wide variety of analytic methods and techniques.  The need for a particular method or technique may be rare, but Power Analysts want to be able to use it if the need arises.

Since data preparation is critical to successful predictive analytics, Power Analytics need to be able to understand and control the data they work with.  This does not mean that Power Analytics want to manage the data or perform ETL tasks; it means that they need the data management processes to be transparent and responsive.  In organizations where IT does not place a premium on supporting predictive analytics, Power Analysts will take over data management and ETL to meet their own needs, but this is not necessarily the working model they prefer.

The work product of Power Analysts may be a management report of some kind showing the results of an analysis, a predictive model specification to be recoded for production, a predictive model object (such as a PMML document) or an actual executable scoring function written in a programming language such as Java or C.  Power Analysts do not want to be heavily involved in production deployment or routing model scoring, though they may be forced into this role if the organization has not invested in tooling for model score deployment. 

Power Analysts are highly engaged in the specific brand, release and version of analytic software.  In organizations where the analytics team has significant influence, they play a decisive role in selecting analytic software.   They also want control over the technical infrastructure supporting the analytic software, though they tend to be indifferent about specific brands of hardware, databases, storage and so forth.

In many organizations, the Power Analyst provides an “attest” function to validate that analytics are correctly performed; hence, they tend to have disproportionate authority in analytic matters based on their reputation and expertise.

The Data Scientist

As the Google Trends graph below illustrates, the term “Data Scientist” is of recent origin, hardly used at all prior to 2011 but rapidly increasing since then.

Google Trends Data Scientist

The Data Scientist is similar in many respects to the Power Analyst.  Both share a lack of interest in easy to usetooling, and a desire to engage at a granular level with the data.

The principal differences between Data Scientists and Power Analysts relate to background, training and approach.   Power Analysts tend to understand statistical methods, bring a statistical orientation to analytics, and tend to prefer working with higher-level languages with built-in analytic syntax; Data Scientists, on the other hand, tend to come from a machine learning, engineering or computer science background.  Consequently, they tend to prefer working with programming languages such as C, Java or Python and tend to be much better equipped to work with SQL and MapReduce. 

It is no accident that the growing usage of the Data Scientist label correlates with expanded deployment and use of Hadoop.  Data Scientists tend to have working experience with Hadoop, and this may be their preferred working environment.  They are comfortable working with MapReduce or Apache Spark, and will develop their own code on these platforms if there is no available “off-the-shelf” software that meets their needs.

Data Scientistsmachine learning roots influence their methods, techniques and approach, which affect their requirements for analytic tooling.  The machine learning discipline tends to focus less on choosing the rightanalytic method, and places the focus on results of the predictive analytics process, including the predictive power of the model produced by the process.  Hence, they are much more open to various forms of brute forcelearning, and choose methods that may be difficult to defend within the statistical paradigm but demonstrate good results.

Data Scientists tend to have low regard for existing analytic software vendors, especially those like SAS and IBM who cater to business customers by soft-peddling technical details; instead, they tend to prefer open source tooling.  They seek the best technicalsolution, one with sufficient flexibility to support innovation.  Data Scientists tend to engage directly in the process of productionizingtheir analytic findings; Power Analysts, in contrast, tend to prefer an entirely hands-offrole in the process.

Since the Data Scientist role has recently emerged, it may lack the sapiential authority enjoyed by the Power Analyst in conservative organizations.  In some organizations, “Data Science” is perceived negatively, and

The Business Analyst

The Business Analyst uses analytics within the context of a role in the organization where analytics is important but not the exclusive responsibility.  Business Analysts hold a range of titles, such as Loan Officer, Marketing Analyst or Merchandising Specialist.

Business Analysts are familiar with analytics and may have some training and experience.  Nevertheless, they prefer an easy-to-use interface and software such as SAS Enterprise Guide, SAS Enterprise Miner, SPSS Statistics or similar products.  

While Power Analysts are very concerned with choosing the rightmethod for the problem, Business Analysts tend to prefer a simpler approach.  For example, they may be familiar with regression analysis, but they are unlikely to be interested in all of the various kinds of regression and the details of how regression models are calculated.  They value wizardtooling that guides the selection of methods and techniques within a problem-solving framework.

The Business Analyst may be aware that data is important to the success of analytics, but does not want to deal with it directly.  Instead, the Business Analyst prefers to work with data that certified correct by others in the organization.  Face validity matters to the Business Analyst; data should be internally consistent and align with the analysts understanding of the business.

In most cases, the work product of a Business Analyst is a report summarizing the results of an analysis.   The work product may also be a decision of some kind, such as the volume of merchandise to a complex loan decision.  Business Analysts rarely produce predictive models for production deployment, because their working methods tend to lack the rigor and exhaustiveness of Power Analysts.

Business Analysts value good customer-friendly Technical Support, and tend to prefer to use software from vendors with demonstrated credibility in analytics.  

The Analytic Consumer

Analytic Consumers are fully focused on business questions and issues and do not engage directly in the productionof analytics; instead they use the results of analytics in the form of automated decisions, forecasts and other forms of intelligence that are embedded into the business processes in which they engage.

Analytic Consumers are not necessarily top managementor any other specific level in the organization; they are simply not professionally engaged in the sausage-makingof forecasts, automated decisions, and so forth.

While the Analytic Consumer may not engage with mathematical computations, they are concerned with the overall utility, performance and reliability of the systems they use.  For example, a customer service rep in a credit card call center may not be concerned with the analytic method used to determine a decision, but will be very concerned if the system takes a long time to reach a decision.  The rep may also object if the system does not provide reasonable explanations when it declines credit request, or appears to decline too many customers that seem to be good risks.

In most organizations, Analytic Consumers are the largest group of prospective users.  Since the range of possible ways that analytics can positively affect business processes is large and growing rapidly, and since embedded analytics have few barriers to use, this group of users also has the greatest growth potential.

In most organizations, there are many more prospective Analytic Consumers and Business Analysts than Power Analysts and Data Scientists; on the surface, this means that a strategy of appealing to Analytic Consumers and Business Analysts offers the greatest potential for business value.  However, few organizations are willing to entrust “hard money” analytic applications (such as fraud, credit risk or trading) to analytic novices; since the best and brightest analysts tend to be Power Analysts or Data Scientists, they tend to carry the most weight in decision-making about analytics.

Automated Predictive Modeling

A colleague asks: can we automate predictive modeling?

How we answer the question depends on the context.   Consider the two variations on the question below, with more precise wording:

  1. Can we completely eliminate the need for expertise in predictive modeling — so that an “ordinary business user” can do it?
  2. Can we make expert analysts more productive by automating certain repetitive tasks?

The first form of the question — the search for “business user” analytics — is a common vision among software marketing folk and industry analysts; it is based on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics.   That premise is largely false, for reasons that warrant a separate blog post; for now, let’s just stipulate that the answer is no, it is not possible to eliminate human expertise from predictive modeling, for the same reason that robotic surgery does not eliminate the need for cardiologists.

However, if we focus on the second form of the question and concentrate on how to make expert analysts more productive, the situation is much more promising.  Many data preparation tasks are easy to automate; these include such tasks as detecting and eliminating zero-variance columns, treating missing values and handling outliers.  The most promising area for automation, however, is in model testing and assessment.

Optimizing a predictive model requires experimentation and tuning.  For any given problem, there are many available modeling techniques, and for each technique there are many ways to specify and parameterize a model.  For the most part, trial and error is the only way identify the best model for a given problem and data set. (The No Free Lunch theorem formalizes this concept).

Since the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model.  In applied predictive analytics, however, the analyst’s time is strictly limited; a client in the marketing services industry reports an SLA of thirty minutes or less to build a predictive model.  Strict time constraints do not permit much time for experimentation.

Analysts tend to deal with this problem by settling for sub-optimal models, arguing that models need only be “good enough,” or defending use of one technique above all others.  As clients grow more sophisticated, however, these tactics become ineffective.  In high-stakes hard-money analytics — such as trading algorithms, catastrophic risk analysis and fraud detection — small improvements in model accuracy have a bottom line impact, and clients demand the best possible predictions.

Automated modeling techniques are not new.  Before Unica launched its successful suite of marketing automation software, the company’s primary business was advanced analytics, with a particular focus on neural networks.  In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model.   Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four different types of predictive models.  Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.

Two other commercial attempts at automated predictive modeling date from the late 1990s.  The first, MarketSwitch, was less than successful.  MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers”.  Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine and replaced the “automated modeling” capability with outsourced analytic services.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization.   The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for Marketing analytics, which it attempted to sell directly to C-level executives.  This effort was modestly successful, leading to sale of the company in 2013 to SAP for an estimated $40 million.

In the last several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products.  In 2010, SAS introduced SAS Rapid Modeler, an add-in to SAS Enterprise Miner.  Rapid Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection and model selection.  The user specifies a data set and response measure; Rapid Modeler determines whether the response is continuous or categorical, and uses this information together with other diagnostics to test a range of modeling techniques.  The user can control the scope of techniques to test by selecting basic, intermediate or advanced methods.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster and Auto Numeric nodes.  The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning and variable recasting.   The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules and set limits on model training.

All of the software products discussed so far are commercially licensed.  There are two open source projects worth noting: the caret package in open source R and the MLBase project.  The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques.   The package includes pre-processing tools to support tasks such as dummy coding, detecting zero variance predictors, identifying correlated predictors as well as tools to support model training and tuning.  The training function in caret currently supports 149 different modeling techniques; it supports parameter optimization within a selected technique, but does not optimize across techniques.  To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results.

MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group is an ambitious effort to develop a scalable machine learning platform on Apache Spark.  The ML Optimizer seeks to simplify machine learning problems for end users by automating the model selection task so that the user need only specify a response variable and set of predictors.   The Optimizer project is still in active development, with Alpha release expected in 2014.

What have we learned from various attempts to implement automated predictive modeling?  Commercial startups like KXEN and MarketSwitch only marginally succeeded because they tried to oversell the concept as a means to replace the analyst altogether.  Most organizations understand that human judgement plays a key role in analytics, and they aren’t willing to entrust hard money analytics entirely to a black box.

What will the next generation of automated modeling platforms look like?  There are seven key features that are critical for an automated modeling platform:

  • Automated model-dependent data transformations
  • Optimization across and within techniques
  • Intelligent heuristics to limit the scope of the search
  • Iterative bootstrapping to expedite search
  • Massively parallel design
  • Platform agnostic design
  • Custom algorithms

Some methods require data to be transformed in certain specific ways; neural nets, for example, typically work with standardized predictors, while Naive Bayes and CHAID require all predictors to be categorical.  The analyst should not have to perform these operations manually; instead, the transformation operations should be built into the test plan script and run automatically; this ensures the maximum number of possible techniques for any data set.

To find the best predictive model, we need to be able to search across techniques and to tune parameters within techniques.  Potentially, this can mean a massive number of model train-and-test cycles to run; we can use heuristics to limit the scope of techniques to be evaluated based on characteristics of the response measure and the predictors.   (For example, a categorical response measure rules out a number of techniques, and a continuous response measure rules out a different set of techniques).  Instead of a brute force search for the best technique and parameterization, a “bootstrapping” approach can use information from early iterations to specify subsequent tests.

Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing.  Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural application for cloud (a point that deserves yet another blog post of its own).  The next generation of automated predictive modeling will be in the cloud from its inception.

Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions.  Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer that they do not own; hence, as a practical matter we should assume that the next generation predictive modeling platform will work with open source machine learning libraries, such as R or Python.

We can’t eliminate the need for human expertise from predictive modeling.   But we can build tools that enable analysts to build better models.

Machine Learning in Hadoop: Part Two

This is the second of a three-part series on the current state of play for machine learning in Hadoop.  Part One is here.  In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines.   This post will focus on machine learning, but it’s worth nothing that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first category.  The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase.  RHIPE, a project led by Mozilla’s Suptarshi Guha, offers similar functionality, but without the HBase integration.

Both projects enable R users to implement explicit parallelization in MapReduce.  R users write R scripts specifically intended to be run as mappers and reducers in Hadoop.  Users must have MapReduce skills, and must refactor program logic for distributed execution.  There are some differences between the two projects:

  • RHadoop uses standard R functions for Mapping and Reducing; RHIPE uses unevaluated R expressions
  • RHIPE users work with data in key,value pairs; RHadoop loads data into familar R data frames
  • As noted above, RHIPE lacks an interface to HBase
  • Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE

Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLLib.  Both projects have commercial backing, and show robust development activity.  Statistics from GitHub for the thirty days ended February 12 show the following:

  • 0xdata H2O: 18 contributors, 938 commits
  • Apache Spark: 77 contributors, 794 commits

H2O is a project of startup 0xdata, which operates on a services and support business model.  Recent coverage by this blog here;  additional coverage here, here and here.

MLLib is one of several projects included in Apache Spark.  Databricks and Cloudera offer commercial support.  Recent coverage by this blog here and here; additional coverage here, here, here and here.

As of this writing, H2O has more built-in analytic features than MLLib, and its R interface is more mature.  Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is solely focused on machine learning.

Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in its partnership with other machine learning vendors, such as SAS.  There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project.  Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.

Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit.  Development activity on these projects is much less robust than for H2O and Spark.  GitHub statistics for the thirty days ended February 12 speak volumes:

  • Apache Mahout: contributors, 54 commits
  • Vowpal Wabbit: 8 contributors, 57 commits

Neither project has significant commercial backing.  Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project.  (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout).  John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.

Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition.  Mahout’s supervised learning algorithms, however, are weak.  Criticisms of Mahout tend to fall into two categories:

  • The project itself is a mess
  • Mahout’s integration into MapReduce is suitable only for high latency analytics

On the first point, Mahout certainly does seem eclectic, to say the least.  Some of the algorithms are distributed, others are single-threaded; others are simply imported from other projects.  Many algorithms are underdeveloped, unsupported or both.  The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.

“High latency” is code for slow.  Slow food is a thing; “slow analytics” is not a thing.  The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.

Vowpal Wabbit has its advocates among data scientists; it is fast, feature rich and runs in Hadoop.  Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis.  Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.

In Part Three, we will cover commercial software for machine learning in Hadoop.

Machine Learning in Hadoop: Part One

Much has changed since I last blogged on this subject a year ago (here and here).  This is the first of a three-part blog covering the current state of play for machine learning in Hadoop.  I use the term “machine learning” deliberately, to refer to tools that can learn from data in an automated or semi-automated manner; this includes traditional statistical modeling plus supervised and unsupervised machine learning.  For convenience, I will not cover fast query tools, BI applications, graph engines or streaming analytics; all of those are important, and deserve separate treatment.

Every analytics vendor claims the ability to work with Hadoop.  In Part One, we cover five things to consider when evaluating how well a particular machine learning tool integrates with Hadoop: deployment topology, hardware requirements, workload integration, data integration, and the user interface.  Of course, these are not the only things an organization should consider when evaluating software; other features, such as support for specific analytic methods, required authentication protocols and other needs specific to the organization may be decisive.

Deployment Topology

Where does the machine learning software reside relative to the Hadoop TaskTracker and Data Nodes (“worker nodes”)?  Is it (a) distributed among the Hadoop worker nodes; (b) deployed on special purpose “analytic” nodes or (c) deployed outside the Hadoop cluster?

Distribution among the worker nodes offers the best performance; under any other topology, data movement will impair performance.  If end users tend to work with relatively small snippets of data sampled from the data store, “beside” architectures may be acceptable, but fully distributed deployment is essential for very large datasets.

Deployment on special purpose “analytic” nodes is a compromise architecture, usually motivated either by a desire to reduce software licensing fees or avoid hardware upgrades for worker node servers.  There is nothing wrong with saving money, but clients should not be surprised if performance suffers under anything other than a fully distributed architecture.

Hardware Requirements

If the machine learning software supports distributed deployment on the Hadoop worker nodes, can it run effectively on standard Hadoop node servers?  The definition of a “standard” node server is a moving target; Cloudera, for example, recognizes that the appropriate hardware spec depends on planned workload.  Machine learning, as a rule, benefits from a high memory spec, but some machine learning software tools are more efficient than others in the way they use memory.

Clients are sometimes reluctant to implement a fully distributed machine learning architecture in Hadoop because they do not want to replace or upgrade a large number of node servers.  This reluctance is natural, but the problem is attributable in part to a gap in planning and rapidly changing technology.  Trading off performance for cost reduction may be the right thing to do, but it should be a deliberate decision.

Workload Integration

If the machine learning software can be distributed among the worker nodes, how well does it co-exist with other MapReduce and non-MapReduce applications?  The gold standard is the ability to run under Apache YARN, which supports resource management across MapReduce and non-MapReduce applications.   Machine learning software that pushes commands down to MapReduce is also acceptable, since the generated MapReduce jobs run under existing Hadoop workload management.

Software that effectively takes over the Hadoop cluster and prevents other jobs from running is only acceptable if the cluster will be dedicated to the machine learning application.   This is not completely unreasonable if the Hadoop cluster replaces a conventional standalone analytic server and file system; the TCO for a Hadoop cluster is very favorable relative to a dedicated high-end analytic server.  Obviously, clients should know how they plan to use the cluster when considering this.

Data Integration

Ideally, machine learning software should be able to work with every data format supported in Hadoop; most machine learning tools are more limited in what they can read and write. The ability to work with uncompressed text in HDFS is table stakes; more sophisticated tools can work with sequence files as well, and support popular compression formats such as Snappy and Bzip/Gzip.  There is also growing interest in use of Apache Avro.   Users may also want to work with data in HBase, Hive or Impala.

There is wide variation in the data formats supported by machine learning software; clients are well advised to tailor assessments to the actual formats they plan to use.

User Interface

There are many aspects of the user interface that matter to clients when evaluating software, but here we consider just one aspect:  Does the machine learning software require the user to specify native MapReduce commands, or does it effectively translate user requests to run in Hadoop behind the scenes?

If the user must specify MapReduce, Hive or Pig it begs the question: why not just perform that task directly in MapReduce, Hive or Pig?

In Part Two, we will examine current open source alternatives for machine learning in Hadoop. 

Analytic Startups: Skytree

Skytree started out as an academic machine learning project developed at Georgia Tech’s Fastlab.  Leadership shopped the software to a number of software vendors prior to 2011 and, finding no buyers, launched as a standalone venture in 2012.

In April 2013, Skytree announced Series A funding of $18 million, with backing from U.S. Venture Partners, UPS, Javelin Venture Partners and Osage University Partners.   The company has 18 U.S. employees in LinkedIn.

Skytree’s public reference customers include Adconian, Brookfield Residential Property Services, CANFAR, eHarmony, SETI Institute and United States Golf Association.  This customer list did not change in 2013 despite significant investment in marketing and sales.

Skytree has formally partnered with Cloudera, Hortonworks and MapR.

Compared to its peers, Skytree reveals very little about its technology, which is generally a yellow flag.

urlSkytree’s principal product is Skytree Server, a server-based library of distributed algorithms.   Skytree claims to support the following techniques:

  • Support Vector Machines (SVM)
  • Nearest Neighbor
  • K-Means
  • Principal Component Analysis (PCA)
  • Linear Regression
  • Two-Point Correlation
  • Kernal Density Estimation (KDE)
  • Gradient Boosted Trees
  • Random Forests

Skytree does not show images or videos of its user interface anywhere on its website.  The implication is that it lacks a visual interface, and programming is required.  Skytree claims a web services interface as well as interfaces to R, Weka, C++ and Python.

For data sources, Skytree claims the ability to connect to relational databases (presumably through ODBC); Hadoop (presumably HDFS); and to consume data from flat files and “common statistical packages”.

Skytree claims the ability to deploy on commodity Linux servers in local, cluster, cloud or Hadoop configurations.  (Absent YARN support, though, the latter will be a “beside” architecture, with data movement).

A second product, Skytree Advisor, launched in Beta in September.  Skytree Advisor is mostly interesting for what it reveals about Skytree Server.  The product includes some unique capabilities, including the ability to produce an actual report, but the user interface evokes a blue screen of death.   The status of this offering seems to be in doubt, as Skytree no longer promotes it.

Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)

Updated and bumped July 10, 2014.

For a powerpoint version on Slideshare, go here.


Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop.  Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014.  According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.

Organizations seeking to implement advanced analytics in Hadoop face two key challenges.  First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.

A second key challenge is the plethora of analytic point solutions in Hadoop.  These include, among others, Mahout for machine learning; Giraph, and GraphLab for graph analytics; Storm and S4 for streaming; or HiveImpala and Stinger for interactive queries.  Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.

Spark directly addresses these challenges.  It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data.  This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.

Second, Spark offers an integrated framework for analytics, including:

A closely related project, Shark, supports fast queries in Hadoop.  Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project.  The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.

Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs.  RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs.  RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence.  They are designed to be fault tolerant, so that if an operation fails it can be reconstructed.

For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase).  Hadoop supports text files, SequenceFiles and any other Hadoop InputFormat.  Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.

Analytic Features

Spark’s machine learning library, MLLib, is rapidly growing.   In Release 1.0.0 (the latest release) it includes:

  • Linear regression
  • Logistic regression
  • k-means clustering
  • Support vector machines
  • Alternating least squares (for collaborative filtering)
  • Decision trees for classification and regression
  • Naive Bayes classifier
  • Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
  • Model evaluation functions
  • L-BFGS optimization primitive

Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization.  MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).

In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark.  Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark.  The Mahout team will maintain existing MapReduce projects.  There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.

Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework.  It enables users to interactively load, transform, and compute on massive graphs.  Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.

Spark Streaming offers an additional abstraction called discretized streams, or DStreams.  DStreams are a continuous sequence of RDDs representing a stream of data.  The user creates DStreams from live incoming data or by transforming other DStreams.  Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.

Currently, Spark supports programming interfaces for Scala, Java and Python;  MLLib algorithms support sparse feature vectors in all three languages.  For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014

There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0.  In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined.   In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.  So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.

Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release for GraphX, enhancements to MLLib and many other enhancements).  Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.


Spark is now available in every major Hadoop distribution.  Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks.  (For more on Cloudera’s support for Spark, go here).  In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.

Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.

IBM’s commitment to Spark is unclear.  While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.

In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine.  Datastax will partner with Databricks on this project; availability expected summer 2014.

At the 2014 Spark Summit, SAP announced its support for Spark.  SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.

On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem.   The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.

At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.


In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.

Databricks offers a certification program for Spark; participants currently include:

In May, Databricks and Concurrent Inc announced a strategic partnership.  Concurrent plans to add Spark support to its Cascading development environment for Hadoop.


In December, the first Spark Summit attracted more than 450 participants from more than 180 companies.  Presentations covered a range of applications such as neuroscienceaudience expansionreal-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).

The 2014 Spark Summit was be held June 30 through July 2 in San Francisco.  The event sold out at more than a thousand participants.  For a summary, see this post.

There is a rapidly growing list of Spark Meetups, including:

Now available for pre-order on Amazon:

Finally, this series of videos provides some good basic knowledge about Spark.

Analytic Startups: 0xdata (Updated May 2014)

Updated May 22, 2014

0xdata (“Hexa-data”) is a small group of smart people from Stanford and Silicon Valley with VC backing and an open source software project for advanced analytics (H2O).  Founded in 2011, 0xdata first appeared on analyst dashboards in 2012 and has steadily built a presence in the data science community since then.

0xdata operates on a services business model, and does not offer commercially licensed software.  The firm has four public reference customers and claims more than 2,000 users.  0xdata has formal partnerships with Cloudera, Hortonworks, Intel and MapR.

0xdata’s H20 project is a library of distributed algorithms designed for deployment in Hadoop or free-standing clusters.  0xdata licenses H2O under the Apache 2.0 open source license.  The development team is very active; in the thirty days ended May 22, 19 contributors pushed 783 commits to the project on Git.

The roadmap is aggressive; as of May 2014 the library includes:

For Generalized Linear Models, k-Means and Gradient Boosting, H2O supports a Grid Search feature enabling users to specify multiple models for simultaneous development and comparison.   This feature is a significant timesaver when the optimal model parameters are unknown (which is ordinarily the case).

Users interact directly with the software through a web browser or REST API.  Alternatively, R users can use the H2O.R package to invoke algorithms from RStudio or an alternative R development environment.  (Video demo here).  Scala users can work with H2O through the Scalala library.

For Hadoop deployment, H2O supports CDH4.x, MapR 2.x and AWS EC2.   H2O integrates with HDFS, and is co-located within Hadoop.   At present, H2O supports CSV, Gzip-compressed CSV, MS Excel (XLS), ARRF, HIVE file format, “and others”.

Each H2O algorithm supports scoring and prediction capability.   There is currently no facility for PMML export; this is unnecessary if H2O is deployed in Hadoop (since one can simply use the native prediction capability).

In March, the Apache Mahout project announced that it will support H2O.