Notes From #BigDataMN

Analytics conferences tend to be held in places like Orlando or Las Vegas, where it’s sunny and warm all of the time and there are copious incidental pleasures to fill the off hours.  I can’t speak to the incidental pleasures of Minneapolis in January, but warm it is not; peak temperature on Monday had a minus sign in front of it, and that’s in Fahrenheit.

Nevertheless, a sellout crowd for MinneAnalytics’ #BigDataMN event filled the rooms at the Carlson School of Management in Minneapolis.  MinneAnalytics is one of the more visible regional analytics user groups, and its events are well organized and content-rich.

Vendors present at #BigDataMN included the usual suspects: IBM, EMC, Teradata Aster, Cloudera and several others.  SAS was conspicuous by its absence, which is noteworthy because MinneAnalytics is operated by the Twin Cities Area SAS Users Group.  It seems that SAS does not wish to appear at events where R is discussed favorably.  Those crafty strategists at SAS corporate headquarters know a threat when they see it.

At least a third of the presentations featured open source analytics.   Some highlights:

  • Erik Iverson, chair of the local R User Group, presented two excellent overviews of R.  The second of these, an introduction to R basics, drew an overflow audience of all ages, about 90% of whom, by show of hands, had no prior experience with R.  In his first presentation, a balanced “flyover” of R from a business perspective, Erik made the excellent point that prospective analysts entering the labor force today have grown up with R; by inference, we can expect perceived R learning-curve issues to fade as this cohort matures.
  • Winston Chang introduced RStudio’s new Shiny server for R web applications, a tool that gives the lie to the notion that R is suitable for academic research but little more.  This presentation had visible impact: as I stood in the back of the room, I could see a number of participants download and install RStudio then and there.  (A minimal Shiny sketch appears after this list.)
  • Luba Gloukov of Revolution Analytics offered an excellent interactive demonstration of how she uses Revolution R together with YouTube and Google Maps to identify and map emerging artists.  This was a fun and lively presentation.  One does not often associate the words “fun” and “lively” with an analytics conference.
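For readers who want to try Shiny after reading this, here is a minimal sketch of a Shiny app.  It is the canonical slider-and-histogram starter pattern, not Winston’s demo code, and all names in it are illustrative.

```r
library(shiny)

# Minimal illustrative Shiny app: a slider drives the bin count of a
# histogram of the built-in 'faithful' data set.
ui <- fluidPage(
  titlePanel("Shiny in a nutshell"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("waiting_hist")
)

server <- function(input, output) {
  output$waiting_hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)
```

Run the script from an R session and Shiny serves the app locally in your browser; RStudio’s Shiny Server does the same thing for remote users.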

Mark Pitts of UnitedHealth offered a balanced overview of SAS High Performance Analytics, based on his organization’s ongoing assessment of HPA and alternatives.  Mark nicely presented what HPA does well (it’s extremely fast with large data sets) together with its limitations (functionality is limited relative to standard single-threaded SAS).  Mark did not mention the cost of ownership of this product, which exceeds the GNP of some countries.  🙂

The format of this event — which provides most speakers with slots of twenty to twenty-five minutes — is excellent.  The short time slots prevent bloviation, and if a speaker is less than inspired the audience doesn’t have to choose between a catnap and checking email.  Conference presentations should be like speed dates: get in, make your point quickly, and if there’s a fit you can follow up afterwards.

Advanced Analytics in Hadoop, Part Two

In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop.    But what if the analytic methods you need are not implemented in the current Mahout release?  The short answer is that you are either going to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.

Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout.  Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.

Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout, but without significant enhancement.  The exception is IBM, whose InfoSphere BigInsights Hadoop distribution incorporates a suite of text mining features, nicely demonstrated in this series of videos.  IBM Research has also developed SystemML, a suite of machine learning algorithms written in MapReduce, although as of this writing SystemML is a research project and not generally available software.

To simplify MapReduce program development for analysts, Revolution Analytics launched its RHadoop open source project earlier this year.  RHadoop’s rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics.  This example shows how an rmr user can implement k-means clustering in 28 lines of code; a comparable procedure run in Hortonworks with a combination of Python, Pig and Java requires 100 lines of code.
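To give a sense of the rmr style, here is a minimal sketch — not the k-means example linked above — that computes per-group means in a single MapReduce pass over simulated data.  It assumes a working rmr2 installation; the "local" backend lets you test the logic without a Hadoop cluster.

```r
library(rmr2)

# Illustrative sketch of the rmr style: compute the mean value for each
# group in one MapReduce pass. Data are simulated; remove the "local"
# backend setting to run against a real Hadoop cluster.
rmr.options(backend = "local")

# Write 1,000 simulated (group, value) pairs to the file system.
input <- to.dfs(keyval(sample(1:5, 1000, replace = TRUE), rnorm(1000)))

group.means <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(k, v),          # pass records through
  reduce = function(k, vv) keyval(k, mean(vv))   # average each group
)

from.dfs(group.means)  # retrieve results as an R object
```

The point of the example is the shape of the code: the map and reduce steps are ordinary R functions, which is what makes rmr so much more compact than hand-written Java MapReduce.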

For analytic use cases where the primary concern is to implement scoring in Hadoop, Zementis offers the Universal PMML Plug-In™ for Datameer.  This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer.  According to Michael Zeller, CEO of Zementis, the Plug-In can be deployed into any Hadoop distribution.  There is an excellent video about this product from the Hadoop Summit at this link.
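On the R side of such a workflow, the open source pmml package can generate the PMML documents a scoring engine consumes.  A hedged sketch, with an arbitrary model and an illustrative file name:

```r
library(pmml)
library(XML)

# Fit a model in R, then export it as PMML so a PMML scoring engine
# (such as the Zementis Plug-In) can deploy it. The model and file
# name here are illustrative only.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
saveXML(pmml(fit), file = "iris_model.pmml")
```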

Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources.  It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth.  Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.

Finally, a word on what SAS is doing with Hadoop.  Prior to January 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website.  In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program.  While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS.  There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture incent it to preach moving the data to the analytics, and not the other way around.

RevoScaleR Beats SAS, Hadoop for Regression on Large Dataset

Still catching up on news from the Strata conference.

This post from Revolution Analytics’ blog summarizes an excellent paper jointly presented at Strata by Allstate and Revolution Analytics.

The paper documents how a team at Allstate struggled to run predictive models with SAS on a data set of 150 million records.  The team then attempted the same analysis with three alternatives to SAS: a custom MapReduce program running in a Hadoop cluster, open source R, and RevoScaleR running on an LSF cluster.

Results:

— SAS PROC GENMOD on a Sun 16-core server (current state): five hours to run;

— Custom MapReduce on a 10 node/80-core Hadoop cluster: more than ten hours to run, and much more difficult to implement;

— Open source R: impossible; open source R cannot load the data set;

— RevoScaleR running on a 5-node/20-core LSF cluster: a little over five minutes to run.
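Allstate’s code is not published in the paper, but a RevoScaleR model fit looks roughly like the sketch below; the file name, variables and Poisson family are my assumptions, chosen to mirror the kind of model PROC GENMOD is typically used for.

```r
library(RevoScaleR)

# Hypothetical sketch of the RevoScaleR approach; file and variable
# names are invented, not Allstate's. rxGlm streams over the .xdf file
# in chunks, so the full data set never has to fit in memory, and the
# work distributes across nodes when a cluster compute context is set.
claims <- RxXdfData("claims.xdf")

model <- rxGlm(claim_count ~ vehicle_age + driver_age + territory,
               data = claims, family = poisson())

summary(model)
```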

In this round of testing, Allstate did not consider in-database analytics, such as dbLytix running in IBM Netezza; it would  be interesting to see results from such a test.

Some critics have pointed out that the environments aren’t equal.  It’s a fair point to raise, but expanding the SAS server to 20 cores (matching the RevoScaleR cluster) wouldn’t materially reduce SAS runtime, since PROC GENMOD is single-threaded.    SAS does have some multi-threaded PROCs and tools like HPA that can run models in parallel, so it’s possible that a slightly different use case would produce more favorable results for SAS.

It’s theoretically possible that an even larger Hadoop environment would run the problem faster, but one must balance that consideration with the time, effort and cost to achieve the desired results.  One point that the paper does not address is the time needed to extract the data from Hadoop and move it to the server, a key consideration for a production architecture.  While predictive modeling in Hadoop is clearly in its infancy, this architecture will have some serious advantages for large data sets that are already resident in Hadoop.

One other key point not considered in this test is the question of scoring — once the predictive models are constructed, how will Allstate put them into production?

— Since SAS’ PROC GENMOD can only export a model to SAS, Allstate would either have to run all production scoring in SAS or manually write a custom scoring procedure;

— Hadoop would certainly require a custom MapReduce procedure;

— With RevoScaleR, Allstate can push the scoring into IBM Netezza.

This testing clearly shows that RevoScaleR is superior to open source R, and for this particular use case clearly outperforms SAS.  It also demonstrates that predictive analytics running in Hadoop is an idea whose time has not yet arrived.

Embrace Open Source Analytics

Suppose you could implement an analytics platform with comprehensive out-of-the-box capabilities, a flexible programming environment, good visualization capabilities and a growing body of skilled users.  Suppose this platform leveraged a massively parallel architecture for high performance and scalability.  And suppose you could do this without investing in software fees.

You don’t have to suppose, because IBM Netezza helps you leverage the power and capability of R.

R is the best-known open source analytics project, but there are many others, including the Data Mining Template Library, the dlib and Orange C++ libraries and the Java Data Mining Package.  In this article, we’ll focus on R.

There are three main reasons R should be part of your enterprise analytics architecture:

  • R has capabilities not available in commercial analytics software
  • Usage of R by analysts is growing rapidly
  • R’s total cost of ownership is attractive

R functionality is a superset of the functionality available in commercial analytics packages. There are currently 3,047 packages published in the CRAN repository, and almost 5,000 packages in all repositories worldwide.  Moreover, the number of available packages is growing rapidly.  While commercial software vendors must prioritize development effort towards features with predictable demand and broad appeal, R developers work under no such constraints.  As a result, new, cutting-edge and niche applications tend to be published in R before they are available in commercial packages.
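(The package count moves constantly; you can check the current number yourself from any R console, as in this one-liner.)

```r
# Count the packages currently available from the main CRAN mirror;
# the figure cited above will be stale by the time you read this.
nrow(available.packages(contriburl = contrib.url("https://cloud.r-project.org")))
```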

A customer we’re working with in the life sciences industry wants to apply four new methods to their analytic toolkit.  This customer spends almost a billion dollars each year to run hundreds of thousands of experiments; very small improvements in precision directly impact this customer’s bottom line.  Right now, all of these new methods are available in R, and none are available in commercial packages.

Interest in R is growing rapidly.  According to the most recent Rexer Analytics survey, more respondents prefer R than any other analytics package.  R also leads all other analytics packages on various measures of mindshare, including listserv activity, website popularity, page rank and blogging activity.

Some customers we work with express concerns that open source software may be full of bugs, Trojan horses or other security risks.  This view rests on the mistaken belief that developers can publish anything they like in R.  In fact, the R Project has a well-developed review and testing process, and well-defined procedures for bug tracking and fixing.  R’s large and highly engaged user community ensures that R packages receive as much scrutiny and testing as many commercial software packages.

Like many analytical packages, R performs calculations in memory, which limits the amount of data that can be used in analysis to the size of memory on the host.  IBM Netezza partner Revolution Analytics has developed a commercial version of R (Revolution R Enterprise) that combines the capability and value of open source R with the quality assurance and technical support of vendor-supported software.   Revolution has also developed a set of enhancements that enable R to scale to terabyte-sized problems.  The combination of Revolution R Enterprise and Netezza’s massively parallel architecture provides a truly scalable and high-performance analytics platform.
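As a sketch of what that looks like in practice (file and variable names are hypothetical), RevoScaleR converts a large flat file into its chunked .xdf format and then operates on it a block at a time, so the data never has to fit in memory:

```r
library(RevoScaleR)

# Illustrative sketch: import a large CSV into the chunked .xdf format,
# then summarize it without loading the full data set into RAM.
# File and variable names are hypothetical.
rxImport(inData = "transactions.csv", outFile = "transactions.xdf",
         overwrite = TRUE)

rxSummary(~ amount + region, data = "transactions.xdf")
```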

Open source analytics like R offer firms rich capabilities, a flexible platform and great value.  With Netezza and Revolution Analytics, R becomes a scalable, high-performance analytics platform.