Overstock.com Joins the Mahout Parade

Interesting story here in Wired about how Overstock.com used Mahout to build and deploy a recommendation engine to replace RichRelevance, thereby saving $2 million in annual fees.

Overstock joins an elite list of companies that are monetizing Mahout, including Adobe, Amazon, AOL, Buzzlogic, Foursquare, Twitter and Yahoo.

(h/t Bill Zanine)

Advanced Analytics in Hadoop, Part Two

In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop.  But what if the analytic methods you need are not implemented in the current Mahout release?  The short answer is that you will either have to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.

Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout.  Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.

Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout but without significant enhancement.  The exception is IBM, whose InfoSphere BigInsights Hadoop distribution incorporates a suite of text mining features nicely demonstrated in this series of videos.  IBM Research has also developed SystemML, a suite of machine learning algorithms written in MapReduce, although as of this writing SystemML is a research project and not generally available software.

To simplify MapReduce program development for analysts, Revolution Analytics launched its RHadoop open source project earlier this year.  RHadoop’s rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics.  This example shows how an rmr user can implement k-means clustering in 28 lines of code; a comparable procedure run in Hortonworks with a combination of Python, Pig and Java requires 100 lines of code.
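To give a sense of what that rmr code compresses, here is a minimal single-process sketch of the k-means map/reduce logic in Python.  This is not the actual rmr code; the points and starting centroids are invented for illustration.

```python
from collections import defaultdict
import math

def kmeans_iteration(points, centroids):
    """One MapReduce-style k-means pass: map each point to its nearest
    centroid, then reduce each group to its mean (the new centroid)."""
    # Map phase: emit (centroid_index, point) pairs
    groups = defaultdict(list)
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    # Reduce phase: average each group to produce updated centroids
    new_centroids = list(centroids)
    for i, members in groups.items():
        dim = len(members[0])
        new_centroids[i] = tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dim))
    return new_centroids

# Made-up data: two obvious clusters, two deliberately bad seeds
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centroids = kmeans_iteration(points, centroids)
print(centroids)  # converges to the two cluster means
```

In a real Hadoop job the map and reduce phases run distributed across the cluster, with one full pass over the data per iteration; rmr lets the R user express exactly this structure without writing the Java plumbing.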

For analytic use cases where the primary concern is to implement scoring in Hadoop, Zementis offers the Universal PMML Plug-In(TM) for Datameer.  This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer.  According to Michael Zeller, CEO of Zementis, the Plug-In can actually be deployed into any Hadoop distribution.  There is an excellent video about this product from the Hadoop Summit at this link.

Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources.  It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth.  Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.

Finally, a word on what SAS is doing with Hadoop.  Prior to January 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website.  In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program.  While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS.  There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture incent it to preach moving the data to the analytics and not the other way around.

Recent Books on Analytics

For your Christmas gift list,  here is a brief roundup of four recently published books on analytics.

Business Intelligence in Plain Language by Jeremy Kolb (Kindle Edition only) is a straightforward and readable summary of conventional wisdom about Business Intelligence.  Unlike many guides to BI, this book devotes some time and attention to data mining.  As an overview, however, Mr. Kolb devotes too little attention to the most commonly used techniques in predictive analytics, and too much attention to more exotic methods.  There is nothing wrong with this per se, but given the author’s conventional approach to implementation it seems eccentric.  At $6.99, though, even an imperfect book is a pretty good value.

Tom Davenport’s original Harvard Business Review article Competing on Analytics is one of the ten most-read articles in HBR’s history; Google Trends shows a spike in search activity for the term “analytics” concurrent with its publication, and steady growth in interest since then.  Mr. Davenport’s latest book Enterprise Analytics: Optimize Performance, Process, and Decisions Through Big Data is a collection of essays by Mr. Davenport and members of the International Institute of Analytics, a commercial research organization funded in part by SAS.  (Not coincidentally, SAS is the most frequently mentioned analytics vendor in the book.)  Mr. Davenport defines enterprise analytics in the negative, e.g. not “sequestered into several small pockets of an organization — market research, or actuarial or quality management”.  Ironically, though, the best essays in this book are about narrowly focused applications, while the worst essay, The Return on Investments in Analytics, is little more than a capital budgeting primer for first-year MBA students, with the word “analytics” inserted.  This book would benefit from a better definition of enterprise analytics, a clearer case for the value of “unsequestering” analytics from departmental silos, and more guidance on exactly how to make that happen.

Jean-Paul Isson and Jesse Harriott have hit a home run with Win with Advanced Business Analytics: Creating Business Value from Your Data, an excellent survey of the world of Business Analytics.   This book combines an overview of traditional topics in business analytics (with a practical “what works/what does not work” perspective) with timely chapters on emerging areas such as social media analytics, mobile analytics and the analysis of unstructured data.  A valuable contribution to the business library.

The “analytical leaders” featured in Wayne Eckerson’s  Secrets of Analytical Leaders: Insights from Information Insiders — Eric Colson, Dan Ingle, Tim Leonard, Amy O’Connor, Ken Rudin, Darren Taylor and Kurt Thearling — are executives who have actually done this stuff, which distinguishes them from many of those who write and speak about analytics.  The practical focus of this book is apparent from its organization — departing from the conventional wisdom of how to talk about analytics, Eckerson focuses on how to get an analytics initiative rolling, and keep it rolling.  Thus, we read about how to get executive support for an analytics program, how to gain momentum, how to hire, train and develop analysts, and so forth.  Instead of writing about “enterprise analytics” from a top-down perspective, Eckerson writes about how to deploy analytics in an enterprise — which is the real problem that executives need to solve.

Advanced Analytics in Hadoop, Part One

This is the first of a two-part post on the current state of advanced analytics in Hadoop.  In this post, I’ll cover some definitions, the business logic of advanced analytics in Hadoop, and summarize the current state of Mahout.  In a second post, I’ll cover some alternatives to Mahout, currently available and in the pipeline.

For starters, a few definitions.

I use the term advanced analytics to cover machine learning tools (including statistical methods) for the discovery and deployment of useful patterns in data.   Discovery means the articulation of patterns as rules or mathematical expressions;  deployment means the mobilization of discovered patterns to improve a business process.  Advanced analytics may include supervised learning or unsupervised learning, but not queries, reports or other analysis where the user specifies the pattern of interest in advance.  Examples of advanced analytic methods include decision trees, neural networks, clustering, association rules and similar methods.

By “In Hadoop” I mean the complete analytic cycle (from discovery to deployment) runs in the Hadoop environment with no data movement outside of Hadoop.

Analysts can and do code advanced analytics directly in MapReduce.  For some insight into the challenges this approach poses, review the slides from a recent presentation at Strata by Allstate and Revolution Analytics.

The business logic for advanced analytics in Hadoop is similar to the logic for in-database analytics.   External memory-based analytic software packages (such as SAS or SPSS) provide easy-to-use interfaces and rich functionality but they require the user to physically extract data from the datastore.  This physical data movement takes time and effort, and may force the analyst to work with a sample of the data or otherwise modify the analytic approach.  Moreover, once the analysis is complete, deployment back into the datastore may require a complete extract and reload of the data, custom programming or both.  The end result is an extended analytic discovery-to-deployment cycle.

Eliminating data movement radically reduces analytic cycle time.  This is true even when actual run time for model development in an external memory-based software package is faster, because the time needed for data movement and model deployment tends to be much greater than the time needed to develop and test models in the first place.  This means that advanced analytics running in Hadoop do not need to be faster than external memory-based analytics; in fact, they can run slower than external analytic software and still reduce cycle time since the front-end and back-end integration tasks are eliminated.

Ideal use cases for advanced analytics in Hadoop have the following profile:

  • Source data is already in Hadoop
  • Applications that consume the analytics are also in Hadoop
  • Business need to use all available data (e.g. sampling is not acceptable)
  • Business need for minimal analytic cycle time; this is not the same as a need for minimal score latency, which can be accomplished without updating the model itself

The best use cases for advanced analytics running in Hadoop are dynamic applications where the solution itself must be refreshed constantly.  These include microclustering, where there is a business need to update the clustering scheme whenever a new entity is added to the datastore; and recommendation engines, where each new purchase by a customer can produce new recommendations.

Apache Mahout is an open source project to develop scalable machine learning libraries whose core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.   Mahout currently supports classification, clustering, association, dimension reduction, recommendation and lexical analysis use cases.   Consistent with the ideal use cases described above, the recommendation engines and clustering capabilities are the most widely used in commercial applications.

As of Release 0.7 (June 16, 2012), the following algorithms are implemented:

Classification: Logistic Regression, Bayesian, Random Forests, Online Passive Aggressive and Hidden Markov Models

Clustering: Canopy, K-Means, Fuzzy K-Means, Expectation Maximization, Mean Shift, Hierarchical, Dirichlet Process, Latent Dirichlet, Spectral, Minhash, and Top Down

Association: Parallel FP-Growth

Dimension Reduction: Singular Value Decomposition and Stochastic Singular Value Decomposition

Recommenders: Distributed Item-Based Collaborative Filtering and Collaborative Filtering with Parallel Matrix Factorization

Lexical Analysis: Collocations
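As a rough illustration of the item-based collaborative filtering idea behind Mahout’s recommenders, here is a sketch in Python (Mahout itself is Java and runs distributed); the ratings matrix is invented.

```python
import math
from collections import defaultdict

# Hypothetical user -> {item: rating} data
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 4, "C": 5, "D": 2},
    "carol": {"B": 2, "C": 3, "D": 5},
}

def item_vectors(ratings):
    """Invert the data to item -> {user: rating}."""
    items = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, r in prefs.items():
            items[item][user] = r
    return items

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    num = sum(u[k] * v[k] for k in common)
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(user, ratings, top_n=1):
    """Score each unseen item by similarity-weighted ratings of seen items."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        scores[candidate] = sum(cosine(items[candidate], items[s]) * r
                                for s, r in seen.items())
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", ratings))  # ['D'] -- the only item alice hasn't rated
```

Mahout’s distributed item-based recommender computes the item-item similarity matrix as a MapReduce job over the full ratings history, which is what makes the approach viable at the scale of companies like Foursquare or Twitter.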

For a clever introduction to machine learning and Mahout, watch this video.

For more detail, review this presentation on Slideshare.

There are no recently released books on Mahout.  This book is two releases out of date, but provides a good introduction to the project.

Mahout is currently used for commercial applications by Amazon, Buzzlogic, Foursquare, Twitter and Yahoo, among others.   Check the Powered by Mahout page for an extended list.

Next post: Alternatives to Mahout, some partial solutions and enablers, and projects in the pipeline.

Book Review: Antifragile

There is a (possibly apocryphal) story about space scientist James Van Allen.  A reporter asked why the public should care about Van Allen belts, which are layers of particles held in place by Earth’s magnetic field.    Dr. Van Allen puffed on his pipe a few times, then responded:  “Van Allen belts?  I like them.  I make a living from them.”

One can imagine a similar conversation with Nassim Nicholas Taleb, author of The Black Swan and most recently Antifragile: Things That Gain From Disorder.

Reporter: Why should the public care about Black Swans?

Taleb: Black Swans?  I like them.  I make a living from them.

And indeed he does.  Born in Lebanon, educated at the University of Paris and the Wharton School, Mr. Taleb pursued a career in trading and arbitrage (UBS, CS First Boston, Banque Indosuez, CIBC Wood Gundy, Bankers Trust and BNP Paribas), where he developed the practice of tail risk hedging, a technique designed to insure a portfolio against rare but catastrophic events.  Later, he established his own hedge fund (Empirica Capital), then retired from active trading to pursue a writing and academic career.  Mr. Taleb now holds positions at NYU and Oxford, together with an assortment of adjunct appointments.

Antifragile is Mr. Taleb’s third book in a series on randomness.  The first, Fooled by Randomness, published in 2001, made Fortune‘s  2005  list of “the 75 smartest books we know.”   The Black Swan, published in 2007, elaborated Mr. Taleb’s theory of Black Swan Events (rare and unforeseen events of enormous consequences) and how to cope with them; the book has sold three million copies to date in thirty-seven languages.   Mr. Taleb was elevated to near rock-star status on the speaker circuit in part due to his claim to have predicted the recent financial crisis, a claim that would  be more credible had he published his book five years earlier.

I recommend this book; it is erudite, readable and full of interesting tidbits, such as an explanation of Homer’s frequent use of the phrase “the wine-dark sea”.   (Mr. Taleb attributes this to the absence of the word ‘blue’ in Ancient Greek.  I’m unable to verify this, but it sounds plausible.)  Erudition aside, Antifragile is an excellent sequel to The Black Swan because it enables Mr. Taleb to elaborate on how we should build institutions and businesses that benefit from unpredictable events.  Mr. Taleb contrasts the “too big to fail” model of New York banking with the “fail fast” mentality of Silicon Valley, which he cites as an example of antifragile business.

Some criticism is in order.  Mr. Taleb’s work sometimes seems to strive for a philosophical universalism that explains everything but provides few of the practical heuristics which he says are the foundation of an antifragile order.  In other words, if you really believe what Mr. Taleb says, don’t read this book.

Moreover, it’s not exactly news that there are limits to scientific rationalism; the problem, which thinkers have grappled with for centuries, is that it is difficult to build systematic knowledge outside of  a rationalist perspective.   One cannot build theology on the belief that the world is a dark and murky place where the gods can simply zap you at any time for no reason.  Mr. Taleb cites Nietzsche as an antifragile philosopher, and while Nietzsche may be widely read among adolescent lads and lassies, his work is pretty much a cul-de-sac.

One might wonder what the study of unpredictable events has to do with predictive analytics, where many of us make a living.  In Reckless Endangerment, Gretchen Morgenson documents how risk managers actually did a pretty good job identifying financial risks, but bank leadership chose to ignore, obfuscate or shift those risks to others.  Mr. Taleb’s work offers a more compelling explanation for this institutional failure than the customary “greedy robber baron” theory.  Moreover, everyone in the predictive analytics business (and every manager who relies on predictive analytics) should remember that predictive models have boundary conditions, which we ignore at our peril.

Fact-Check: SAS and Greenplum

Does SAS run “inside” Greenplum?  Can existing SAS programs run faster in Greenplum without modification?  Clients say that their EMC rep makes such claims.

The first claim rests on confusion about EMC Greenplum’s product line.  It’s important to distinguish between Greenplum Database and Greenplum DCA.  Greenplum DCA is a rack of commodity blade servers which can be configured with Greenplum Database running on some of the blades and SAS running on the other blades.  For most customers, a single DCA blade provides insufficient computing power to support SAS, so EMC and SAS typically recommend deployment on multiple blades, with SAS Grid Manager implemented for workload management.   This architecture is illustrated in this white paper on SAS’ website.

As EMC’s reference architecture clearly illustrates, SAS does not run “inside” Greenplum database (or any other database); it simply runs on server blades that are co-located in the same physical rack as the database.  The SAS instance installed on the DCA rack works just like any other SAS instance installed on freestanding servers.  SAS interfaces with Greenplum Database through a SAS/ACCESS interface, which is exactly the same way that SAS interacts with other databases.

Does co-locating SAS and the database in the same rack offer any benefits?  Yes: when data moves back and forth between SAS and Greenplum Database, it does so over a dedicated 10Gb Ethernet connection.  However, this is not a unique benefit — customers can implement a similar high-speed connection between a free-standing instance of SAS and any data warehouse appliance, such as IBM Netezza.

To summarize, SAS does not run “inside” Greenplum Database or any other database; moreover, SAS’  interface with Greenplum is virtually the same as SAS’ interface with any other supported database.  EMC offers customers the ability to co-locate SAS in the same rack of servers as the Greenplum Database, which expedites data movement between SAS and the database, but this is a capability that can be replicated cheaply in other ways.

The second claim — that SAS programs run faster in Greenplum DCA without modification — requires more complex analysis.   For starters, though, keep in mind that SAS programs always require at least some modification when moved from one SAS instance to another, if only to update SAS libraries and adjust for platform-specific options.  Those modifications are small, however, so let’s set them aside and grant EMC some latitude for sales hyperbole.

To understand how existing SAS program will perform inside DCA, we need to consider the building blocks of those existing programs:

  1. SAS DATA Steps
  2. SAS PROC SQL
  3. SAS Database-Enabled PROCs
  4. SAS Analytic PROCs (PROC LOGISTIC, PROC REG, and so forth)

Here’s how SAS will handle each of these workloads within DCA:

(1) SAS DATA Steps: SAS attempts to translate SAS DATA Step statements into SQL.   When this translation succeeds, SAS submits the SQL expression to Greenplum Database, which runs the query and returns the result set to SAS.  Since SAS DATA Step programming includes many concepts that do not translate well to SQL, in most cases SAS will extract all required data from the database and run the required operations as a single-threaded process on one of the SAS nodes.

(2) SAS PROC SQL: SAS submits the embedded SQL to Greenplum Database, which runs the query and returns the result set to SAS.  The SAS user must verify that the embedded SQL expression is syntactically correct for Greenplum.

(3) SAS Database-Enabled PROCs: SAS converts the user request to database-specific SQL and submits it to Greenplum Database, which runs the query and returns the result set to SAS.

(4) SAS Analytic PROCs:  In most cases, SAS runs the PROC on one of the server blades.  A limited number of SAS PROCs are automatically enabled for Grid Computing; these PROCs will run multi-threaded.

In each case, the SAS workload runs in the same way inside DCA as it would if implemented in a free-standing SAS instance with comparable computing power.   Existing SAS programs are not automatically enabled to leverage Greenplum’s parallel processing; the SAS user must explicitly modify the SAS program to exploit Greenplum Database just as they would when using SAS with other databases.

So, returning to the question: will existing SAS programs run faster in Greenplum DCA without modification?  Setting aside minor changes when moving any SAS program, the performance of existing programs when run in DCA will be no better than what would be achieved when SAS is deployed on competing hardware with comparable computing specifications.

SAS users can only realize radical performance improvements when they explicitly modify their programs to take advantage of in-database processing.   Greenplum has no special advantage in this regard; conversion effort is similar for all databases supported by SAS.

RevoScaleR Beats SAS, Hadoop for Regression on Large Dataset

Still catching up on news from Strata conference.

This post from Revolution Analytics’ blog summarizes an excellent paper jointly presented at Strata by Allstate and Revolution Analytics.

The paper documents how a team at Allstate struggled to run predictive models with SAS on a data set of 150 million records.  The team then attempted to run the same analysis using three alternatives to SAS: a custom MapReduce program running in a Hadoop cluster, open source R, and RevoScaleR running on an LSF cluster.

Results:

— SAS PROC GENMOD on a Sun 16-core server (current state): five hours to run;

— Custom MapReduce on a 10 node/80-core Hadoop cluster: more than ten hours to run, and much more difficult to implement;

— Open source R: impossible, open source R cannot load the data set;

— RevoScaleR running on a 5-node/20-core LSF cluster: a little over five minutes to run.

In this round of testing, Allstate did not consider in-database analytics, such as dbLytix running in IBM Netezza; it would  be interesting to see results from such a test.

Some critics have pointed out that the environments aren’t equal.  It’s a fair point to raise, but expanding the SAS server to 20 cores (matching the RevoScaleR cluster) wouldn’t materially reduce SAS runtime, since PROC GENMOD is single-threaded.    SAS does have some multi-threaded PROCs and tools like HPA that can run models in parallel, so it’s possible that a slightly different use case would produce more favorable results for SAS.

It’s theoretically possible that an even larger Hadoop environment would run the problem faster, but one must balance that consideration with the time, effort and cost to achieve the desired results.  One point that the paper does not address is the time needed to extract the data from Hadoop and move it to the server, a key consideration for a production architecture.  While predictive modeling in Hadoop is clearly in its infancy, this architecture will have some serious advantages for large data sets that are already resident in Hadoop.

One other key point not considered in this test is the question of scoring — once the predictive models are constructed, how will Allstate put them into production?

— Since SAS’ PROC GENMOD can only export a model to SAS, Allstate would either have to run all production scoring in SAS or manually write a custom scoring procedure;

— Hadoop would certainly require a custom MapReduce procedure;

— With RevoScaleR, Allstate can push the scoring into IBM Netezza.

This testing clearly shows that RevoScaleR is superior to open source R, and for this particular use case clearly outperforms SAS.  It also demonstrates that predictive analytics running in Hadoop is an idea whose time has not yet arrived.

Customer Endorsement for SAS High Performance Analytics

When SAS released its new in-memory analytic software last December, I predicted that SAS would have one reference customer in 2012.  I believed at the time that several factors, including pricing, inability to run most existing SAS programs and SAS’ track record with new products would prevent widespread adoption, but that SAS would do whatever it takes to get at least one customer up and running on the product.

It may surprise you to learn that SAS does not already have a number of public references for the product.  SAS uses the term ‘High Performance Analytics’ in two ways: as the name for its new high-end in-memory analytics software, and to refer to an entire category of products, both new and existing.  Hence, it’s important to read SAS’ customer success stories carefully; for example, SAS cites CSI-Piemonte as a reference for in-memory analytics, but the text of the story indicates the customer has selected SAS Grid Manager, a mature product.

Recently, a United Health Group executive spoke at SAS’ Analytics 2012 conference and publicly endorsed the High Performance Analytics product; a search through SAS press releases and blog postings appears to show that this is the first genuine public endorsement.  You can read the story here.

Several comments:

— While it appears the POC succeeded, the story does not say that United Healthcare has licensed SAS HPA for production.

— The executive interviewed in the article appears to be unaware of alternative technologies, some of which are already owned and used by his employer.

— The use case described in the article is not particularly challenging.  Four million rows of data was a large data set ten years ago; today we work with data sets that are orders of magnitude larger than that.

— The reported load rate of 9.2 TB is good, but not better than what can be achieved with competing products.  The story does not state whether this rate measures load from raw data into Greenplum or from Greenplum into SAS HPA’s memory.

— Performance for parsing unstructured data — “millions of rows of text data in a few minutes” — is not compelling compared to alternatives.

The money quote in this story: “this Big Data analytics stuff is expensive…”  That statement is certainly true of SAS High Performance Analytics, but not necessarily so for alternatives.   Due to the high cost of this software, the executive in the story does not believe SAS HPA can be deployed broadly as an architecture, but must be implemented in a silo that will require users to move data around.

That path doesn’t lead to the Analytic Enterprise.

How Important is Model Accuracy?

Go to a trade show for predictive analytics and listen to the presentations; most will focus on building more accurate predictive models.  Presenters will differ on how this should be done: some will tell you to purchase their brand of software, others will encourage you to adopt one method or another, but most will agree: accuracy isn’t everything, it’s the only thing.

I’m not going to argue in this post that accuracy isn’t a good thing (all other things equal), but consider the following scenario: you have a business problem that can be mitigated with a predictive model.  You ask three consultants to submit proposals, and here’s what you get:

  • Consultant A proposes to spend one day and promises to produce a model that is more accurate than a coin flip
  • Consultant B proposes to spend one week, and promises to produce a model that is more accurate than Consultant A’s model
  • Consultant C proposes to spend one year, and promises to produce the most accurate model of all

Which one will you choose?

This is an extreme example, of course, but my point is that one rarely hears analysts talk about the time and effort needed to achieve a given level of accuracy, or the time and effort needed to implement a predictive model in production.  But in real enterprises, there are essential trade-offs that must be factored into the analytics process.  As we evaluate these three proposals, consider the following points:

(1) We can’t know how accurate a prediction will be; we can only know how accurate it was.

We judge the accuracy of a prediction after the event of interest has occurred.  In practice, we evaluate the accuracy of a predictive model by examining how accurate it is with historical data.  This is a pretty good method, since past patterns often persist into the future.  The key word is “often”, as in “not always”; the world isn’t in a steady state, and black swans happen.   This does not mean we should abandon predictive modeling, but it does mean we should treat very small differences in model back-testing with skepticism.

(2) Overall model accuracy is irrelevant.

We live in an asymmetrical world, and errors in prediction are not all alike.   Let’s suppose that your doctor thinks that you may have a disease that is deadly, but can be treated with an experimental treatment that is painful and expensive.  The doctor gives you the option of two different tests, and tells you that Test X has an overall accuracy rate of 60%, while Test Y has an overall accuracy of 40%.

Hey, you think, that’s a no-brainer; give me Test X.

What the doctor did not tell you is that all of the errors for Test X are false negatives: the test says you don’t have the disease when you actually do.  Test Y, on the other hand, produces a lot of false positives, but also correctly predicts all of the actual disease cases.

If you chose Test X, congratulations!  You’re dead.
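To make the arithmetic concrete, assume a hypothetical population of 100 patients, 40 of whom actually have the disease.  These counts are invented, but they are consistent with the accuracy rates quoted above:

```python
def metrics(tp, fn, fp, tn):
    """Overall accuracy versus sensitivity (share of actual cases caught)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
    }

# Test X: 60% accurate overall, but every error is a missed disease case
test_x = metrics(tp=0, fn=40, fp=0, tn=60)
# Test Y: 40% accurate overall, yet it catches every actual disease case
test_y = metrics(tp=40, fn=0, fp=60, tn=0)

print(test_x)  # accuracy 0.6, sensitivity 0.0 -- misses everyone who is sick
print(test_y)  # accuracy 0.4, sensitivity 1.0 -- noisy, but nobody slips through
```

The “better” test by overall accuracy never catches a single case of the disease; the “worse” test catches them all.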

(3) We can’t know the value of model accuracy without understanding the differential cost of errors.

In the previous example, the differential cost of errors is starkly exaggerated: on the one hand, we risk death; on the other hand, we undergo painful and expensive treatment.  In commercial analytics, the economics tend to be more subtle: in Marketing, for example, we send a promotional message to a customer who does not respond (false positive), or decline a credit line for a prospective customer who would have paid on time (false negative).  The actual measure of a model, however, isn’t its statistical accuracy but its economic accuracy: the overall impact of the predictive model on the business process it is designed to support.
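A toy calculation shows how error costs can invert a statistical ranking; the error counts and dollar costs below are invented for illustration:

```python
def expected_cost(fp, fn, cost_fp, cost_fn):
    """Economic accuracy: the total cost of a model's errors, not their count."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical marketing campaign: a wasted mailing costs $1,
# a missed would-be responder costs $50 in lost margin.
model_a = expected_cost(fp=900, fn=10, cost_fp=1.0, cost_fn=50.0)   # 910 errors
model_b = expected_cost(fp=100, fn=100, cost_fp=1.0, cost_fn=50.0)  # 200 errors

print(model_a)  # 1400.0 -- more errors, lower cost
print(model_b)  # 5100.0 -- fewer errors, higher cost
```

Model A makes more than four times as many errors as Model B, yet it costs the business far less, because its errors fall on the cheap side of the asymmetry.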

Taking these points into consideration, Consultant A’s quick and dirty approach looks a lot better, for three reasons:

  • Better results in back-testing for Consultants B and C may or may not be sustainable in production
  • Consultant A’s model benefits the business sooner than the other models
  • Absent data on the cost of errors, it’s impossible to say whether Consultants B and C add more value

A fourth point goes to the heart of what Agile Analytics is all about.  While Consultants B and C continue to tinker in the sandbox, Consultant A has in-market results to use in building a better predictive model.

The bottom line is this: the first step in any predictive modeling effort must always focus on understanding the economics of the business process to be supported.  Unless and until the analyst knows the business value of a correct prediction — and the cost of incorrect predictions — it is impossible to say which predictive model is “best”.

Deconstructing SAS Analytics Accelerator

Now and then I get queries from clients about SAS Analytics Accelerator, an in-database product that SAS supports exclusively with Teradata database.  SAS does not publicize sales figures for individual products, so we don’t know for sure how widely Analytics Accelerator is used.  There are some clues, however.

  • Although SAS released this product in 2008, it has published no customer success stories.  Follow the customer success link from SAS’ overview for Analytics Accelerator and read the stories; none of them describes using this product.
  • SAS has never expanded platform support beyond Teradata.  SAS is a customer-driven company that does not let partner and channel considerations impact its customers.  Absence of a broader product rollout implies absence of market demand.
  • Unlike the rest of its product line, SAS has not significantly enhanced Analytics Accelerator since the product was introduced four years ago.

SAS supports seven in-database Base SAS PROCs (FREQ, MEANS, RANK, REPORT, SORT, SUMMARY and TABULATE) on many databases, including Aster, DB2, Greenplum, Netezza, Oracle and Teradata.  Analytics Accelerator supports seven SAS/STAT PROCs (CORR, CANCORR,  FACTOR, PRINCOMP, REG, SCORE, and VARCLUS), one SAS/ETS PROC (TIMESERIES), three Enterprise Miner PROCs (DMDB, DMINE and DMREG) plus macros for sampling and binning for use with Enterprise Miner.   Customers must license SAS/STAT, SAS/ETS and SAS Enterprise Miner to use the in-database capabilities.

On the surface, these appear to be features that offer SAS users significantly better integration with Teradata than with other databases.  When we dig beneath the surface, however, the story is different.

Anyone familiar with SAS understands from a quick look at the supported PROCs that it’s an odd list;  it includes some rarely used PROCs and omits PROCs that are frequently used in business analytics (such as PROC LOGISTIC and PROC GLM).  I generally ask SAS clients to list the PROCs they use most often; when they do so, they rarely list any of the PROCs supported by Analytics Accelerator.

The SAS/STAT PROCs supported by Analytics Accelerator do not actually run in Teradata; the PROC itself runs on a SAS server, just like any other SAS PROC.   Instead, SAS passes a request to Teradata to build a Sum of Squares Cross Product (SSCP) matrix.  SAS then pulls the SSCP matrix over to the SAS server, loads it into memory and proceeds with the computational algorithm.

This is a significant performance enhancement, since MPP databases are well suited to matrix operations and the volume of data moved over the network is reduced to a minimum.   But here’s the kicker: any SAS user can construct SSCP matrices in an MPP database (such as IBM Netezza) and import it into SAS/STAT.  You don’t need to license SAS Analytics Accelerator; every SAS customer who licenses SAS/STAT already has this capability.
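A sketch of the idea for simple linear regression: once the cross-products are computed (in the database, over the full data set), the coefficients can be solved from a handful of summary numbers.  This is illustrative Python, not SAS’s actual algorithm, and the data are made up:

```python
def sscp(rows):
    """Cross-products for y = b0 + b1*x: the only numbers that
    need to leave the database."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return n, sx, sy, sxx, sxy

def fit_from_sscp(n, sx, sy, sxx, sxy):
    """Solve the normal equations using only the summary statistics."""
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

# Toy data on the exact line y = 2 + 3x; in the Teradata scenario the
# database computes sscp() over millions of rows and ships five numbers.
data = [(x, 2 + 3 * x) for x in range(10)]
print(fit_from_sscp(*sscp(data)))  # (2.0, 3.0)
```

The point of the architecture is visible in the shape of the data: the raw table can be arbitrarily large, but the matrix that crosses the network grows only with the number of variables, not the number of rows.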

This explains, in part, the unusual selection of PROCs:  SAS chose PROCs that could be included with minimal R&D investment.  This is a smart strategy for SAS, but says little about the value of the product for users.

Since SAS/STAT does not currently export PMML documents for downstream integration, in-database support for PROC SCORE is intriguing; once again, though,  the devil is in the details.  Analytics Accelerator converts the SAS model to a SQL expression and submits it to the database; unfortunately, this translation only supports linear models.  SAS users can score with models developed in thirteen different SAS PROCs  (ACECLUS, CALIS, CANCORR, CANDISC, DISCRIM, FACTOR, PRINCOMP, ORTHOREG, QUANTREG, REG, ROBUSTREG, TCALIS and VARCLUS), but with the exception of PROC REG these are rarely used in predictive modeling for business analytics.  SAS seems to have simply selected those PROCs whose output is easy to implement in SQL, regardless of whether or not these PROCs are useful.
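The SQL translation itself is straightforward for linear models, which is presumably why these PROCs made the list.  Here is a hypothetical sketch of the kind of scoring expression such a translation produces; the column names and coefficients are invented:

```python
def model_to_sql(intercept, coefficients, table):
    """Render a fitted linear model as a SQL scoring expression,
    roughly the kind of translation described for PROC SCORE."""
    terms = " + ".join(f"{b} * {col}" for col, b in coefficients.items())
    return f"SELECT {intercept} + {terms} AS score FROM {table}"

# Hypothetical fitted model for illustration
print(model_to_sql(1.5, {"age": 0.02, "income": 0.0001}, "customers"))
# SELECT 1.5 + 0.02 * age + 0.0001 * income AS score FROM customers
```

A linear model reduces to a weighted sum, which every SQL engine can evaluate natively; a tree ensemble or neural network does not reduce so cleanly, which is why translation-based scoring stops at linear models.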

Overall, Analytics Accelerator lacks a guiding design approach, and reflects little insight into actual use cases; instead, SAS has cobbled together a collection of features that are easy to implement.   When clients consider the tasks they actually want to do in SAS, this product offers little value.