Overstock.com Joins the Mahout Parade

Interesting story here in Wired about how Overstock.com used Mahout to build and deploy a recommendation engine to replace RichRelevance, thereby saving $2 million in annual fees.

Overstock joins an elite list of companies monetizing Mahout, including Adobe, Amazon, AOL, Buzzlogic, Foursquare, Twitter and Yahoo.

(h/t Bill Zanine)

RevoScaleR Beats SAS, Hadoop for Regression on Large Dataset

Still catching up on news from the Strata conference.

This post from Revolution Analytics’ blog summarizes an excellent paper jointly presented at Strata by Allstate and Revolution Analytics.

The paper documents how a team at Allstate struggled to run predictive models with SAS on a data set of 150 million records.  The team then attempted to run the same analysis using three alternatives to SAS: a custom MapReduce program running in a Hadoop cluster, open source R, and RevoScaleR running on an LSF cluster.


— SAS PROC GENMOD on a Sun 16-core server (current state): five hours to run;

— Custom MapReduce on a 10-node/80-core Hadoop cluster: more than ten hours to run, and much more difficult to implement;

— Open source R: impossible; open source R cannot load the data set into memory;

— RevoScaleR running on a 5-node/20-core LSF cluster: a little over five minutes to run.
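RevoScaleR's advantage here comes largely from external-memory ("chunked") algorithms, which process the data one block at a time and never need to hold the full data set in RAM — the very thing that defeats open source R.  As a rough illustration of that idea (not the Allstate setup, and using scikit-learn rather than RevoScaleR), here is a minimal Python sketch that fits a linear model incrementally, one chunk at a time; the data is simulated rather than read from disk:

```python
# Illustration of external-memory ("chunked") model fitting: the model is
# updated one chunk at a time, so the full data set never has to fit in RAM.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# Simulate streaming 100 chunks of 10,000 rows each (1M rows total);
# a real implementation would read each chunk from a file instead.
true_coefs = np.array([2.0, -1.0, 0.5])
for _ in range(100):
    X = rng.normal(size=(10_000, 3))
    y = X @ true_coefs + rng.normal(scale=0.1, size=10_000)
    model.partial_fit(X, y)  # update the model with this chunk only

print(np.round(model.coef_, 1))  # close to the true coefficients [2, -1, 0.5]
```

The same principle — an incremental update step that sees only one chunk at a time — is what lets a modest 20-core cluster handle a file that a single in-memory engine cannot open at all.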

In this round of testing, Allstate did not consider in-database analytics, such as dbLytix running in IBM Netezza; it would be interesting to see results from such a test.

Some critics have pointed out that the environments aren’t equal.  It’s a fair point to raise, but expanding the SAS server to 20 cores (matching the RevoScaleR cluster) wouldn’t materially reduce SAS runtime, since PROC GENMOD is single-threaded.  SAS does have some multi-threaded PROCs and tools like HPA that can run models in parallel, so it’s possible that a slightly different use case would produce more favorable results for SAS.

It’s theoretically possible that an even larger Hadoop environment would run the problem faster, but one must balance that consideration with the time, effort and cost to achieve the desired results.  One point that the paper does not address is the time needed to extract the data from Hadoop and move it to the server, a key consideration for a production architecture.  While predictive modeling in Hadoop is clearly in its infancy, this architecture will have some serious advantages for large data sets that are already resident in Hadoop.

One other key point not considered in this test is the question of scoring — once the predictive models are constructed, how will Allstate put them into production?

— Since SAS’ PROC GENMOD can only export a model to SAS, Allstate would either have to run all production scoring in SAS or manually write a custom scoring procedure;

— Hadoop would certainly require a custom MapReduce procedure;

— With RevoScaleR, Allstate can push the scoring into IBM Netezza.

This testing clearly shows that RevoScaleR is superior to open source R, and for this particular use case clearly outperforms SAS.  It also demonstrates that predictive analytics running in Hadoop is an idea whose time has not yet arrived.

Customer Endorsement for SAS High Performance Analytics

When SAS released its new in-memory analytic software last December, I predicted that SAS would have one reference customer in 2012.  I believed at the time that several factors, including pricing, inability to run most existing SAS programs and SAS’ track record with new products would prevent widespread adoption, but that SAS would do whatever it takes to get at least one customer up and running on the product.

It may surprise you to learn that SAS does not already have a number of public references for the product.  SAS uses the term ‘High Performance Analytics’ in two ways: as the name for its new high-end in-memory analytics software, and to refer to an entire category of products, both new and existing.  Hence, it’s important to read SAS’ customer success stories carefully; for example, SAS cites CSI-Piemonte as a reference for in-memory analytics, but the text of the story indicates the customer has selected SAS Grid Manager, a mature product.

Recently, a UnitedHealth Group executive spoke at SAS’ Analytics 2012 conference and publicly endorsed the High Performance Analytics product; a search through SAS press releases and blog postings suggests that this is the first genuine public endorsement.  You can read the story here.

Several comments:

— While it appears the POC succeeded, the story does not say that UnitedHealth Group has licensed SAS HPA for production.

— The executive interviewed in the article appears to be unaware of alternative technologies, some of which are already owned and used by his employer.

— The use case described in the article is not particularly challenging.  Four million rows of data was a large data set ten years ago; today we work with data sets that are orders of magnitude larger than that.

— The reported load rate of 9.2 TB is good, but not better than what can be achieved with competing products.  The story does not state whether this rate measures load from raw data to Greenplum or from Greenplum into SAS HPA’s memory.

— Performance for parsing unstructured data — “millions of rows of text data in a few minutes” — is not compelling compared to alternatives.

The money quote in this story: “this Big Data analytics stuff is expensive…”  That statement is certainly true of SAS High Performance Analytics, but not necessarily so for alternatives.  Due to the high cost of this software, the executive in the story does not believe SAS HPA can be deployed broadly as an architecture; instead, it must be implemented in a silo that will require users to move data around.

That path doesn’t lead to the Analytic Enterprise.

EMC Announces Partnership with Alpine Data Labs

Catching up on the news here.

The keyword in the title of this post is “announces”.  It’s not news that EMC partners with Alpine Data Labs.  Alpine Miner is a nifty product, but in the predictive analytics market Alpine is an ankle-biter compared to SAS, SPSS, Mathsoft and other vendors.  Greenplum and Alpine were sister companies funded by the same VC before EMC entered the picture.  When EMC acquired Greenplum, they passed on Alpine because (a) it didn’t fit into EMC’s all-things-data warehousing strategy, and (b) EMC didn’t want to mess up their new alliance with SAS.

SAS does not look kindly on alliance partners that compete with them; this is, in part, a knee-jerk response.  In the analytics software market, clients rarely switch from one vendor to another, and growth opportunities in the analytic tools market are limited.  Most of the action is in emerging users and analytic applications, where SAS’ core strengths don’t play as well.  Nevertheless, SAS expects to own every category in which it chooses to compete and expects its partners to go along even as SAS invades new territory.

After EMC acquired Greenplum, Greenplum and Alpine reps continued to work together on a “sell-with” basis in a kind of “stealth” partnership.

So it’s significant that EMC entered into a reseller agreement with Alpine and announced it to the world.  It’s a smart move by EMC; as I said earlier, Alpine is a nifty product. But it suggests that EMC isn’t getting the traction it expected from the SAS alliance — a view that’s supported by scuttlebutt from inside both SAS and EMC.