Still catching up on news from the Strata conference.
This post from Revolution Analytics’ blog summarizes an excellent paper jointly presented at Strata by Allstate and Revolution Analytics.
The paper documents how a team at Allstate struggled to run predictive models with SAS on a data set of 150 million records. The team then attempted to run the same analysis using three alternatives to SAS: a custom MapReduce program running in a Hadoop cluster, open source R, and RevoScaleR running on an LSF cluster. The results:
— SAS PROC GENMOD on a Sun 16-core server (current state): five hours to run;
— Custom MapReduce on a 10-node/80-core Hadoop cluster: more than ten hours to run, and much more difficult to implement;
— Open source R: impossible, open source R cannot load the data set;
— RevoScaleR running on a 5-node/20-core LSF cluster: a little over five minutes to run.
In this round of testing, Allstate did not consider in-database analytics, such as dbLytix running in IBM Netezza; it would be interesting to see results from such a test.
Some critics have pointed out that the environments aren’t equal. It’s a fair point to raise, but expanding the SAS server to 20 cores (matching the RevoScaleR cluster) wouldn’t materially reduce SAS runtime, since PROC GENMOD is single-threaded. SAS does have some multi-threaded PROCs and tools like HPA that can run models in parallel, so it’s possible that a slightly different use case would produce more favorable results for SAS.
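One rough way to put the unequal environments on a common footing is to compare core-hours as well as wall-clock time. The sketch below uses only the figures quoted above; the Hadoop and RevoScaleR runtimes are rounded to the stated bounds ("more than ten hours", "a little over five minutes"), so the outputs are approximations rather than measured values. Note also that the SAS figure counts all 16 cores on the server, even though a single-threaded PROC would not have used them all.

```python
# Rough core-hour comparison using the runtimes quoted in this post.
# Runtimes are approximate: Hadoop is a lower bound ("more than ten hours"),
# RevoScaleR is rounded down ("a little over five minutes").
runs = {
    "SAS PROC GENMOD": {"cores": 16, "hours": 5.0},
    "Custom MapReduce": {"cores": 80, "hours": 10.0},
    "RevoScaleR": {"cores": 20, "hours": 5.0 / 60.0},
}


def core_hours(run):
    """Cores available multiplied by wall-clock hours."""
    return run["cores"] * run["hours"]


def wall_clock_speedup(baseline, other):
    """How many times faster `other` finished than `baseline`."""
    return baseline["hours"] / other["hours"]


if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: {core_hours(run):.1f} core-hours")
    ratio = wall_clock_speedup(runs["SAS PROC GENMOD"], runs["RevoScaleR"])
    print(f"RevoScaleR vs SAS wall-clock speedup: about {ratio:.0f}x")
```

On these figures the wall-clock gap between SAS and RevoScaleR is roughly a factor of sixty, far larger than the difference in core counts (16 versus 20) could explain on its own.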
It’s theoretically possible that an even larger Hadoop environment would run the problem faster, but one must balance that consideration with the time, effort and cost to achieve the desired results. One point that the paper does not address is the time needed to extract the data from Hadoop and move it to the server, a key consideration for a production architecture. While predictive modeling in Hadoop is clearly in its infancy, this architecture will have some serious advantages for large data sets that are already resident in Hadoop.
One other key point not considered in this test is the question of scoring — once the predictive models are constructed, how will Allstate put them into production?
— Since SAS’ PROC GENMOD can only export a model to SAS, Allstate would either have to run all production scoring in SAS or manually write a custom scoring procedure;
— Hadoop would certainly require a custom MapReduce procedure;
— With RevoScaleR, Allstate can push the scoring into IBM Netezza.
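To give a sense of what a hand-written scoring procedure involves, here is a minimal sketch in Python. It assumes a generalized linear model with a log link, which is only one of the model forms PROC GENMOD can fit; the coefficient names and values are hypothetical and do not come from the Allstate paper.

```python
import math

# Hypothetical coefficients for illustration only; not from the Allstate paper.
COEFFICIENTS = {
    "intercept": -2.0,
    "driver_age": -0.01,
    "prior_claims": 0.3,
}


def score(record, coefficients=COEFFICIENTS):
    """Return the predicted mean for one record under a log-link GLM.

    The linear predictor is the intercept plus the sum of
    coefficient * feature value; the log link is inverted with exp().
    """
    linear_predictor = coefficients["intercept"]
    for name, beta in coefficients.items():
        if name == "intercept":
            continue
        linear_predictor += beta * record[name]
    return math.exp(linear_predictor)


if __name__ == "__main__":
    example = {"driver_age": 40, "prior_claims": 2}
    print(score(example))
```

The arithmetic itself is simple; the cost of a custom procedure lies mostly in keeping the coefficients, variable transformations and link function in sync with the model every time it is refit.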
This testing shows that RevoScaleR can handle a data set that open source R cannot even load, and that for this particular use case it clearly outperforms SAS. It also demonstrates that predictive analytics running in Hadoop is an idea whose time has not yet arrived.