Advanced Analytics in Hadoop, Part Two

In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop.    But what if the analytic methods you need are not implemented in the current Mahout release?  The short answer is that you are either going to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.

Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout.  Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.

Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout but without significant enhancement.   The exception is IBM. whose Infosphere BigInsights Hadoop distribution incorporates a suite of text mining features nicely demonstrated in this series of videos.  IBM Research has also developed System ML, a suite of machine learning algorithms written in MapReduce, although as of this writing System ML is a research project and not generally available software.

To simplify program development in MapReduce for analysts, Revolution Analytics launched its Rhadoop open source project earlier this year.  Rhadoop’s  rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics.   This example shows how an rmr user can implement k-means clustering with 28 lines of code; a comparable procedure, run in Hortonworks with a combination of Python, Pig and Java requires 100 lines of code.

For analytic use cases where the primary concern is to implement scoring in Hadoop. Zementis offers the Universal PMML Plug-In(TM) for Datameer.  This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer.   According to Michael Zeller, CEO of Zementis, the Plug-In can actually be deployed into any Hadoop distribution.  There is an excellent video about this product from the Hadoop Summit at this link.

Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources.  It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth.  Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.

Finally, a word on what SAS is doing with Hadoop.  Prior to January, 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website.   In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program.  While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS.  There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture incents it to preach moving the data to the analytics and not the other way around.

What Is Driving Interest in Agile Analytics?

Part three in a four-part series.

A combination of market forces and technical innovation drive interest in Agile methods for analytics:

  • Clients require more timely and actionable analytics
  • Data warehouses have reduced latency in the data used by predictive models
  • Innovation directly impacts the analytic workflow itself

Business requirements for analytics are changing rapidly, and clients demand predictive analytics that can support decisions today.  For example, consider direct marketing:  ten years ago, firms relied mostly on direct mail and outbound telemarketing; marketing campaigns were served by batch-oriented systems, and analytic cycle times were measured in months or even years.  Today, firms have shifted that marketing spend to email, web media and social media, where cycle times are measured in days, hours or even minutes.  The analytics required to support these channels are entirely different, and must operate at a digital cadence.

Organizations have also substantially reduced the latency built into data warehouses.  Ten years ago, analysts frequently worked with monthly snapshot data, delivered a week or more into the following month.  While this is still the case for some organizations, data warehouses with daily, inter-day and real-time updates are increasingly common.  A predictive model score is as timely as the data it consumes; as firms drive latency from data warehousing processes, analytical processes are exposed as cumbersome and slow.

Numerous innovations in analytics create the potential to reduce cycle time:

  • In-database analytics eliminate the most time-consuming tasks, data marshalling and model scoring
  • Tighter database integration by vendors such as SAS and SPSS enable users to achieve hundred-fold runtime improvements for front-end processing
  • Enhancements to the PMML standard make it possible for firms to integrate a wide variety of end-user analytic tools with high performance data warehouses

All of these factors taken together add up to radical reductions in time to deployment for predictive models.  Organizations used to take a year or more to build and deploy models; a major credit card issuer I worked with in the 1990s needed two years to upgrade its behavior scorecards.  Today, IBM Netezza customers who practice Agile methods can reduce this cycle to a day or less.