In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop. But what if the analytic methods you need are not implemented in the current Mahout release? The short answer is that you are either going to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.
Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout. Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.
Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout but without significant enhancement. The exception is IBM, whose InfoSphere BigInsights Hadoop distribution incorporates a suite of text mining features nicely demonstrated in this series of videos. IBM Research has also developed System ML, a suite of machine learning algorithms written in MapReduce, although as of this writing System ML is a research project and not generally available software.
To simplify program development in MapReduce for analysts, Revolution Analytics launched its RHadoop open source project earlier this year. RHadoop’s rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics. This example shows how an rmr user can implement k-means clustering with 28 lines of code; a comparable procedure, run in Hortonworks with a combination of Python, Pig and Java, requires 100 lines of code.
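To make the k-means comparison a little more concrete, here is a minimal sketch of how a single k-means iteration breaks down into a map step and a reduce step. This is plain Python running in memory, not rmr and not Hadoop; the function names (nearest, map_phase, reduce_phase, kmeans) are my own illustrative choices and do not come from any of the libraries mentioned above.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit a (centroid index, point) pair for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce step: average the points that share a key to get the new centroids."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    # Assumption: a centroid that attracts no points is left where it was.
    new_centroids = list(centroids)
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds, feeding each result into the next."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


if __name__ == "__main__":
    sample = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(sample, [(0, 0), (10, 10)]))
```

In a real MapReduce job the framework handles the grouping by key between the two steps, and each iteration is typically a separate job; the sketch collapses all of that into one process to show only the shape of the computation.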
For analytic use cases where the primary concern is to implement scoring in Hadoop, Zementis offers the Universal PMML Plug-In(TM) for Datameer. This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer. According to Michael Zeller, CEO of Zementis, the Plug-In can actually be deployed into any Hadoop distribution. There is an excellent video about this product from the Hadoop Summit at this link.
Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources. It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth. Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.
Finally, a word on what SAS is doing with Hadoop. Prior to January 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website. In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program. While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS. There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture give it an incentive to preach moving the data to the analytics and not the other way around.