Fact-Check: SAS and Greenplum

Does SAS run “inside” Greenplum?  Can existing SAS programs run faster in Greenplum without modification?  Clients say that their EMC rep makes such claims.

The first claim rests on confusion about EMC Greenplum’s product line.  It’s important to distinguish between Greenplum Database and Greenplum DCA.  Greenplum DCA is a rack of commodity blade servers which can be configured with Greenplum Database running on some of the blades and SAS running on the other blades.  For most customers, a single DCA blade provides insufficient computing power to support SAS, so EMC and SAS typically recommend deployment on multiple blades, with SAS Grid Manager implemented for workload management.   This architecture is illustrated in this white paper on SAS’ website.

As EMC’s reference architecture clearly illustrates, SAS does not run “inside” Greenplum database (or any other database); it simply runs on server blades that are co-located in the same physical rack as the database.  The SAS instance installed on the DCA rack works just like any other SAS instance installed on freestanding servers.  SAS interfaces with Greenplum Database through a SAS/ACCESS interface, which is exactly the same way that SAS interacts with other databases.

Does co-locating SAS and the database in the same rack offer any benefits?  Yes, because when data moves back and forth between SAS and Greenplum Database, it does so over a dedicated 10Gb Ethernet connection.  However, this benefit is not unique: customers can implement a similar high-speed connection between a free-standing instance of SAS and any data warehouse appliance, such as IBM Netezza.

To summarize, SAS does not run “inside” Greenplum Database or any other database; moreover, SAS’  interface with Greenplum is virtually the same as SAS’ interface with any other supported database.  EMC offers customers the ability to co-locate SAS in the same rack of servers as the Greenplum Database, which expedites data movement between SAS and the database, but this is a capability that can be replicated cheaply in other ways.

The second claim — that SAS programs run faster in Greenplum DCA without modification — requires more complex analysis.   For starters, though, keep in mind that SAS programs always require at least some modification when moved from one SAS instance to another, if only to update SAS libraries and adjust for platform-specific options.  Those modifications are small, however, so let’s set them aside and grant EMC some latitude for sales hyperbole.

To understand how existing SAS programs will perform inside DCA, we need to consider the building blocks of those existing programs:

  1. SAS DATA Steps
  2. SAS PROC SQL
  3. SAS Database-Enabled PROCs
  4. SAS Analytic PROCs (PROC LOGISTIC, PROC REG, and so forth)

Here’s how SAS will handle each of these workloads within DCA:

(1) SAS DATA Steps: SAS attempts to translate SAS DATA Step statements into SQL.   When this translation succeeds, SAS submits the SQL expression to Greenplum Database, which runs the query and returns the result set to SAS.  Since SAS DATA Step programming includes many concepts that do not translate well to SQL, in most cases SAS will extract all required data from the database and run the required operations as a single-threaded process on one of the SAS nodes.

(2) SAS PROC SQL: SAS submits the embedded SQL to Greenplum Database, which runs the query and returns the result set to SAS.   The SAS user must verify that the embedded SQL expression is syntactically correct for Greenplum.

(3) SAS Database-Enabled PROCs:  SAS converts the user request to database-specific SQL and submits it to Greenplum Database, which runs the query and returns the result set to SAS.

(4) SAS Analytic PROCs:  In most cases, SAS runs the PROC on one of the server blades.  A limited number of SAS PROCs are automatically enabled for Grid Computing; these PROCs will run multi-threaded.

In each case, the SAS workload runs in the same way inside DCA as it would if implemented in a free-standing SAS instance with comparable computing power.   Existing SAS programs are not automatically enabled to leverage Greenplum’s parallel processing; the SAS user must explicitly modify the SAS program to exploit Greenplum Database just as they would when using SAS with other databases.
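The difference between extracting rows and pushing work into the database can be sketched in a few lines.  This is only an illustration, not SAS code: Python’s built-in sqlite3 module stands in for Greenplum Database, and the sales table, its columns and both function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database holding a small hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [("east", 10.0), ("east", 20.0),
            ("west", 5.0), ("west", 15.0), ("west", 25.0)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def totals_by_extract(conn):
    """Pull every row to the client, then aggregate there (no pushdown)."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)


def totals_by_pushdown(conn):
    """Let the database aggregate; only the summary rows come back."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


conn = make_db()
extracted, rows_moved_extract = totals_by_extract(conn)
pushed, rows_moved_pushdown = totals_by_pushdown(conn)
print(extracted == pushed)                      # same answer either way
print(rows_moved_extract, rows_moved_pushdown)  # rows returned to the client
```

Both functions produce the same totals; the only difference is where the work happens and how many rows cross the network.  With five rows the gap is trivial, but the same pattern holds for tables with billions of rows.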

So, returning to the question: will existing SAS programs run faster in Greenplum DCA without modification?  Setting aside minor changes when moving any SAS program, the performance of existing programs when run in DCA will be no better than what would be achieved when SAS is deployed on competing hardware with comparable computing specifications.

SAS users can only realize radical performance improvements when they explicitly modify their programs to take advantage of in-database processing.   Greenplum has no special advantage in this regard; conversion effort is similar for all databases supported by SAS.

How Important is Model Accuracy?

Go to a trade show for predictive analytics and listen to the presentations; most will focus on building more accurate predictive models.  Presenters will differ on how this should be done: some will tell you to purchase their brand of software, others will encourage you to adopt one method or another, but most will agree: accuracy isn’t everything, it’s the only thing.

I’m not going to argue in this post that accuracy isn’t a good thing (all other things equal), but consider the following scenario: you have a business problem that can be mitigated with a predictive model.  You ask three consultants to submit proposals, and here’s what you get:

  • Consultant A proposes to spend one day and promises to produce a model that is more accurate than a coin flip
  • Consultant B proposes to spend one week, and promises to produce a model that is more accurate than Consultant A’s model
  • Consultant C proposes to spend one year, and promises to produce the most accurate model of all

Which one will you choose?

This is an extreme example, of course, but my point is that one rarely hears analysts talk about the time and effort needed to achieve a given level of accuracy, or the time and effort needed to implement a predictive model in production.  But in real enterprises, there are essential trade-offs that must be factored into the analytics process.  As we evaluate these three proposals, consider the following points:

(1) We can’t know how accurate a prediction will be; we can only know how accurate it was.

We judge the accuracy of a prediction after the event of interest has occurred.  In practice, we evaluate the accuracy of a predictive model by examining how accurate it is with historical data.  This is a pretty good method, since past patterns often persist into the future.  The key word is “often”, as in “not always”; the world isn’t in a steady state, and black swans happen.   This does not mean we should abandon predictive modeling, but it does mean we should treat very small differences in model back-testing with skepticism.
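To put a rough number on “very small”, here is a minimal sketch that checks whether the gap between two back-tested accuracy figures is larger than ordinary sampling noise.  The function names, the accuracy figures and the holdout size are all hypothetical, and the check treats the two estimates as independent, which is a simplification when both models are tested on the same sample.  It says nothing about whether past patterns will persist; it only covers the noise in the back-test itself.

```python
import math


def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy estimate from n holdout cases."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def difference_is_within_noise(acc_a, acc_b, n, z=1.96):
    """Rough check: is the accuracy gap smaller than z standard errors?

    Assumes the two estimates are independent (a simplification).
    """
    se_diff = math.sqrt(
        accuracy_std_error(acc_a, n) ** 2 + accuracy_std_error(acc_b, n) ** 2
    )
    return abs(acc_a - acc_b) < z * se_diff


# Hypothetical back-test: 81% versus 80% accuracy on 1,000 holdout cases
print(difference_is_within_noise(0.81, 0.80, 1000))
# A ten-point gap on the same sample size is a different story
print(difference_is_within_noise(0.90, 0.80, 1000))
```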

(2) Overall model accuracy is irrelevant.

We live in an asymmetrical world, and errors in prediction are not all alike.   Let’s suppose that your doctor thinks that you may have a disease that is deadly, but can be treated with an experimental treatment that is painful and expensive.  The doctor gives you the option of two different tests, and tells you that Test X has an overall accuracy rate of 60%, while Test Y has an overall accuracy of 40%.

Hey, you think, that’s a no-brainer; give me Test X.

What the doctor did not tell you is that all of the errors for Test X are false negatives: the test says you don’t have the disease when you actually do.  Test Y, on the other hand, produces a lot of false positives, but also correctly predicts all of the actual disease cases.

If you chose Test X, congratulations!  You’re dead.

(3) We can’t know the value of model accuracy without understanding the differential cost of errors.

In the previous example, the differential cost of errors is starkly exaggerated: on the one hand, we risk death; on the other hand, we undergo painful and expensive treatment.  In commercial analytics, the economics tend to be more subtle: in Marketing, for example, we send a promotional message to a customer who does not respond (false positive) or decline a credit line for a prospective customer who would have paid on time (false negative).  The actual measure of a model, however, isn’t its statistical accuracy, but its economic accuracy: the overall impact of the predictive model on the business process it is designed to support.
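The Test X and Test Y example can be worked through with numbers.  The sketch below uses one set of counts for 100 patients that is consistent with the 60% and 40% accuracy figures above; the counts, the class name and the two cost figures are my assumptions, chosen only to illustrate the point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Counts of test outcomes for a group of patients."""
    true_pos: int
    false_neg: int
    false_pos: int
    true_neg: int

    def accuracy(self):
        total = self.true_pos + self.false_neg + self.false_pos + self.true_neg
        return (self.true_pos + self.true_neg) / total

    def error_cost(self, cost_false_neg, cost_false_pos):
        """Total cost of errors, given a (hypothetical) cost per error type."""
        return self.false_neg * cost_false_neg + self.false_pos * cost_false_pos


# 100 hypothetical patients, 40 of whom actually have the disease.
# Test X: every error is a false negative.  Test Y: every error is a
# false positive, and every actual case is caught.
test_x = Outcomes(true_pos=0, false_neg=40, false_pos=0, true_neg=60)
test_y = Outcomes(true_pos=40, false_neg=0, false_pos=60, true_neg=0)

# Assumed costs in arbitrary units: a missed case is far worse than
# an unnecessary treatment.
COST_FALSE_NEG = 1000
COST_FALSE_POS = 10

for name, test in (("X", test_x), ("Y", test_y)):
    print(name, test.accuracy(), test.error_cost(COST_FALSE_NEG, COST_FALSE_POS))
```

Under these assumed costs the “less accurate” Test Y is far cheaper in errors than Test X.  Change the cost ratio and the ranking can flip, which is exactly why the costs have to be known before the models can be compared.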

Taking these points into consideration, Consultant A’s quick and dirty approach looks a lot better, for three reasons:

  • Better results in back-testing for Consultants B and C may or may not be sustainable in production
  • Consultant A’s model benefits the business sooner than the other models
  • Absent data on the cost of errors, it’s impossible to say whether Consultants B and C add more value

A fourth point goes to the heart of what Agile Analytics is all about.  While Consultants B and C continue to tinker in the sandbox, Consultant A has in-market results to use in building a better predictive model.

The bottom line is this: the first step in any predictive modeling effort must always focus on understanding the economics of the business process to be supported.  Unless and until the analyst knows the business value of a correct prediction — and the cost of incorrect predictions — it is impossible to say which predictive model is “best”.

Deconstructing SAS Analytics Accelerator

Now and then I get queries from clients about SAS Analytics Accelerator, an in-database product that SAS supports exclusively with Teradata database.  SAS does not publicize sales figures for individual products, so we don’t know for sure how widely Analytics Accelerator is used.  There are some clues, however.

  • Although SAS released this product in 2008, it has published no customer success stories.  Follow the customer success link from SAS’ overview for Analytics Accelerator and read the stories; none of them describes using this product.
  • SAS has never expanded platform support beyond Teradata.  SAS is a customer-driven company that does not let partner and channel considerations impact its customers.   Absence of a broader product rollout implies absence of market demand.
  • SAS has not significantly enhanced Analytics Accelerator since introducing it four years ago, in contrast to the rest of its product line.

SAS supports seven in-database Base SAS PROCs (FREQ, MEANS, RANK, REPORT, SORT, SUMMARY and TABULATE) on many databases, including Aster, DB2, Greenplum, Netezza, Oracle and Teradata.  Analytics Accelerator supports seven SAS/STAT PROCs (CORR, CANCORR,  FACTOR, PRINCOMP, REG, SCORE, and VARCLUS), one SAS/ETS PROC (TIMESERIES), three Enterprise Miner PROCs (DMDB, DMINE and DMREG) plus macros for sampling and binning for use with Enterprise Miner.   Customers must license SAS/STAT, SAS/ETS and SAS Enterprise Miner to use the in-database capabilities.

On the surface, these appear to be features that offer SAS users significantly better integration with Teradata than with other databases.  When we dig beneath the surface, however, the story is different.

Anyone familiar with SAS understands from a quick look at the supported PROCs that it’s an odd list;  it includes some rarely used PROCs and omits PROCs that are frequently used in business analytics (such as PROC LOGISTIC and PROC GLM).  I generally ask SAS clients to list the PROCs they use most often; when they do so, they rarely list any of the PROCs supported by Analytics Accelerator.

The SAS/STAT PROCs supported by Analytics Accelerator do not actually run in Teradata; the PROC itself runs on a SAS server, just like any other SAS PROC.   Instead, SAS passes a request to Teradata to build a Sum of Squares Cross Product (SSCP) matrix.  SAS then pulls the SSCP matrix over to the SAS server, loads it into memory and proceeds with the computational algorithm.

This is a significant performance enhancement, since MPP databases are well suited to matrix operations and the volume of data moved over the network is reduced to a minimum.   But here’s the kicker: any SAS user can construct SSCP matrices in an MPP database (such as IBM Netezza) and import them into SAS/STAT.  You don’t need to license SAS Analytics Accelerator; every SAS customer who licenses SAS/STAT already has this capability.
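To make the SSCP idea concrete, here is a minimal sketch of the division of labor for a one-predictor regression.  It is not how SAS implements it: Python’s built-in sqlite3 module stands in for the MPP database, and the observations table and function names are hypothetical.  The database computes the sums of squares and cross products in a single pass; the client needs only those six numbers to finish the fit.

```python
import sqlite3


def sscp_from_database(conn):
    """Ask the database for the sums that make up the SSCP matrix of (1, x, y)."""
    query = """
        SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y), SUM(y * y)
        FROM observations
    """
    return conn.execute(query).fetchone()


def fit_line_from_sscp(n, sx, sy, sxx, sxy, syy):
    """Solve the normal equations for y = intercept + slope * x.

    Only the sums are needed, never the raw rows.  syy is not used for
    the coefficients; it would be needed for fit statistics such as R-squared.
    """
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)],  # points on y = 1 + 2x
)

intercept, slope = fit_line_from_sscp(*sscp_from_database(conn))
print(intercept, slope)
```

However many rows the table holds, only one row of sums leaves the database, which is where the reduction in data movement comes from.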

This explains, in part, the unusual selection of PROCs:  SAS chose PROCs that could be included with minimal R&D investment.  This is a smart strategy for SAS, but says little about the value of the product for users.

Since SAS/STAT does not currently export PMML documents for downstream integration, in-database support for PROC SCORE is intriguing; once again, though,  the devil is in the details.  Analytics Accelerator converts the SAS model to a SQL expression and submits it to the database; unfortunately, this translation only supports linear models.  SAS users can score with models developed in thirteen different SAS PROCs  (ACECLUS, CALIS, CANCORR, CANDISC, DISCRIM, FACTOR, PRINCOMP, ORTHOREG, QUANTREG, REG, ROBUSTREG, TCALIS and VARCLUS), but with the exception of PROC REG these are rarely used in predictive modeling for business analytics.  SAS seems to have simply selected those PROCs whose output is easy to implement in SQL, regardless of whether or not these PROCs are useful.
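It is easy to see why linear models are the low-hanging fruit: a linear score is just a weighted sum of columns, which maps directly onto a SQL expression.  The sketch below illustrates the general idea only; it is not the SQL that Analytics Accelerator generates.  The sqlite3 module stands in for the database, and the table, columns, coefficients and function name are hypothetical.

```python
import sqlite3


def linear_score_sql(table, intercept, coefficients):
    """Build a SELECT that scores every row with a linear model.

    Table and column names are interpolated directly, so they must come
    from a trusted source (this is a sketch, not production code).
    """
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        terms.append(f"{float(weight)!r} * {column}")
    return f"SELECT id, {' + '.join(terms)} AS score FROM {table} ORDER BY id"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, income REAL, tenure REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 2.0, 4.0), (2, 1.0, 0.0)],
)

# Hypothetical fitted model: score = 1.0 + 0.5 * income - 0.25 * tenure
sql = linear_score_sql("customers", 1.0, {"income": 0.5, "tenure": -0.25})
scores = conn.execute(sql).fetchall()
print(sql)
print(scores)
```

A decision tree or neural network does not reduce to a single weighted sum, so translating one into SQL takes considerably more work.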

Overall, Analytics Accelerator lacks a guiding design approach, and reflects little insight into actual use cases; instead, SAS has cobbled together a collection of features that are easy to implement.   When clients consider the tasks they actually want to do in SAS, this product offers little value.

Embrace Open Source Analytics

Suppose you could implement an analytics platform with comprehensive out-of-the-box capabilities, a flexible programming environment, good visualization capabilities and a growing body of skilled users.  Suppose this platform leveraged a massively parallel architecture for high performance and scalability.  And suppose you could do this without investing in software fees.

You don’t have to suppose, because IBM Netezza helps you leverage the power and capability of R.

R is the best known open source analytics project, but there are many other open source analytics projects available, including the Data Mining Template Library, the dlib and Orange C++ libraries and the Java Data Mining Package.  In this article, we’ll focus on R.

There are three main reasons R should be part of your enterprise analytics architecture:

  • R has capabilities not available in commercial analytics software
  • Usage of R by analysts is growing rapidly
  • R’s total cost of ownership is attractive

R functionality is a superset of the functionality available in commercial analytics packages. There are currently 3,047 packages published in the CRAN repository, and almost 5,000 packages in all repositories worldwide.  Moreover, the number of available packages is growing rapidly.  While commercial software vendors must prioritize development effort towards features with predictable demand and broad appeal, R developers work under no such constraints.  As a result, new, cutting-edge and niche applications tend to be published in R before they are available in commercial packages.

A customer we’re working with in the life sciences industry wants to apply four new methods to their analytic toolkit.  This customer spends almost a billion dollars each year to run hundreds of thousands of experiments; very small improvements in precision directly impact this customer’s bottom line.  Right now, all of these new methods are available in R, and none are available in commercial packages.

Interest in R is growing exponentially.   According to the most recent Rexer Analytics survey, more respondents name R as their preferred analytics package than name any other analytic software.  R outperforms all other analytics packages on various measures of mindshare, including listserv activity, website popularity, page rank and blogging activity.

Some customers we work with express concerns that open source software may be full of bugs, trojan horses or other security risks.  This view is based on the mistaken belief that developers can publish anything they like in R.  In fact, the R Project has a highly-developed review and testing process, and well-defined procedures for bug tracking and fixing.  R’s large and highly engaged user community ensures that R packages receive as much scrutiny and testing as many commercial software packages.

Like many analytical packages, R performs calculations in memory, which limits the amount of data that can be used in analysis to the size of memory on the host.  IBM Netezza partner Revolution Analytics has developed a commercial version of R (Revolution R Enterprise) that combines the capability and value of open source R with the quality assurance and technical support of vendor-supported software.   Revolution has also developed a set of enhancements that enable R to scale to terabyte-sized problems.  The combination of Revolution R Enterprise and Netezza’s massively parallel architecture provides a truly scalable and high-performance analytics platform.

Open source analytics like R offer firms rich capabilities, a flexible platform and great value.   With Netezza and Revolution Analytics, R is a scalable and high performance platform.

Leverage the In-Database Capabilities of Analytic Software

Many analysts have a strong preference for commercial analytic workbenches such as SAS or SPSS.  Both packages are widely used, respected by analysts, and each has strong advocates.  The purpose of this article is to point out that analytic users can benefit from the performance and simplicity of IBM Netezza in-database analytics without abandoning their preferred interface.

Let’s start with SAS.  One of the most frequent complaints from IT organizations about SAS users is the propensity for users to require significant amounts of storage space for SAS data sets.  A leading credit card issuer, for example, reports that users have more than one hundred terabytes of SAS files – and the volume is growing rapidly.

But SAS users can store data tables in the Netezza appliance and run data preparation steps against those tables using the SAS Pass-Through Facility.  In addition to centralizing storage, reducing data movement and simplifying security, users can realize 100X improvements in program runtime.

In-database PROCs are another SAS feature.  SAS currently enables in-database execution of FREQ, MEANS, RANK, REPORT, SORT, SUMMARY, and TABULATE in a number of databases and data warehouse appliances, including Netezza.   For the user, database-enabled PROCs operate like any other SAS PROC — but instead of running on the server, the PROC runs in the database.

SAS supports a number of other in-database capabilities through SAS/ACCESS, including the ability to pass functions and formats to Netezza, the ability to create temporary tables and the ability to leverage Netezza’s bulk load/unload facility.

SAS users can make calls to Netezza in-database functions by invoking Netezza In-Database Analytics through PROC SQL.  In-database functions are far more efficient for building analytic data sets, data cleansing and enhancement.  Customers who have implemented this approach have observed remarkable improvements in overall runtime: jobs that ran in hours now run in minutes.

SAS customers using SAS Enterprise Miner or SAS Model Manager can also benefit from SAS Scoring Accelerator, which enables an Enterprise Miner user to export a scoring function that runs on Netezza.  This capability helps the organization avoid a custom programming task, and enables the analyst to easily hand off model scoring to a production operation.

IBM SPSS Modeler also offers the capability to work directly with database tables in Netezza; like SAS, it can be configured to minimize storage on the SPSS server.  Modeler also offers Pushback SQL capabilities, which enable the user to perform functions within the Netezza appliance, including table joins, aggregation, selections, sorting, field derivation, field projection and scoring.  While the in-database functional capabilities of the two packages are similar, SPSS accomplishes this entirely within the graphical environment of the Stream canvas.

As with SAS, SPSS Modeler users can leverage Netezza in-database analytics to build, score and store predictive models, either through custom nodes or out-of-the-box integration in Release 15.0.  Again, a key difference between SAS and SPSS is that while SPSS Modeler surfaces Netezza in-database analytics through the graphical user environment, SAS users must have programming and SQL skills.

To summarize, leading commercial software packages like SAS and SPSS already offer the ability to manage files, perform data preparation, build models and run scoring processes entirely within the Netezza appliance.  Users of these tools can significantly improve runtime performance by leveraging these existing capabilities.

What Business Practices Enable Agile Analytics?

Part four in a four-part series.

We’ve mentioned some of the technical innovations that support an Agile approach to analytics; there are also business practices to consider.   Some practices in Agile software development apply as well to analytics as to any other project, including the need for a sustainable development pace; close collaboration; face-to-face conversation; motivated and trustworthy contributors; and continuous attention to technical excellence.  Additional practices pertinent to analytics include:

  • Commitment to open standards architecture
  • Rigorous selection of the right tool for the task
  • Close collaboration between analysts and IT
  • Focus on solving the client’s business problem

More often than not, customers with serious cycle time issues are locked into closed single-vendor architecture.  Lacking an open architecture to interface with data at the front end and back end of the analytics workflow, these organizations are forced into treating the analytics tool as a data management tool and decision engine; this is comparable to using a toothbrush to paint your house.  Server-based analytic software packages are very good at analytics, but perform poorly as databases and decision engines.

Agile analysts take a flexible, “best-in-class” approach to solving the problem at hand.  No single vendor offers “best-in-class” tools for every analytic method and algorithm.  Some vendors, like KXEN, offer unique algorithms that are unavailable from other vendors; others, like Salford Systems, have specialized experience and intellectual property that enables them to offer a richer feature set for certain data mining methods.  In an Agile analytics environment, analysts freely choose among commercial, open source and homegrown software, using a mashup of tools as needed.

While it may seem like a platitude to call for collaboration between an organization’s analytics community and the IT organization, we frequently see customers who have developed complex processes for analytics that either duplicate existing IT processes, or perform tasks that can be done more efficiently by IT. Analysts should spend their time doing analysis, not data movement, management, enhancement, cleansing or scoring; but surveyed analysts typically report that they spend much of their time performing these tasks.  In some cases, this is because IT has failed to provide the needed support; in other cases, the analytics team insists on controlling the process.   Regardless of the root cause, IT and analytics leadership alike need to recognize the need for collaboration, and an appropriate division of labor.

Focusing the analytics effort on the client’s business problem is essential for the practice of Agile analytics.  Organizations frequently get stuck on issues that are difficult to resolve because the parties are focused on different goals; in the analytics world, this takes the form of debates over tools, methods and procedures.  Analysts should bear in mind that clients are not interested in winning prizes for the “best” model, and they don’t care about the analyst’s advanced degrees.   Business requires speed, agility and clarity, and analysts who can’t deliver on these expectations will not survive.

What Is Driving Interest in Agile Analytics?

Part three in a four-part series.

A combination of market forces and technical innovation drives interest in Agile methods for analytics:

  • Clients require more timely and actionable analytics
  • Data warehouses have reduced latency in the data used by predictive models
  • Innovation directly impacts the analytic workflow itself

Business requirements for analytics are changing rapidly, and clients demand predictive analytics that can support decisions today.  For example, consider direct marketing:  ten years ago, firms relied mostly on direct mail and outbound telemarketing; marketing campaigns were served by batch-oriented systems, and analytic cycle times were measured in months or even years.  Today, firms have shifted that marketing spend to email, web media and social media, where cycle times are measured in days, hours or even minutes.  The analytics required to support these channels are entirely different, and must operate at a digital cadence.

Organizations have also substantially reduced the latency built into data warehouses.  Ten years ago, analysts frequently worked with monthly snapshot data, delivered a week or more into the following month.  While this is still the case for some organizations, data warehouses with daily, intra-day and real-time updates are increasingly common.  A predictive model score is only as timely as the data it consumes; as firms drive latency from data warehousing processes, analytical processes are exposed as cumbersome and slow.

Numerous innovations in analytics create the potential to reduce cycle time:

  • In-database analytics eliminate the most time-consuming tasks, data marshalling and model scoring
  • Tighter database integration by vendors such as SAS and SPSS enables users to achieve hundred-fold runtime improvements for front-end processing
  • Enhancements to the PMML standard make it possible for firms to integrate a wide variety of end-user analytic tools with high performance data warehouses

All of these factors taken together add up to radical reductions in time to deployment for predictive models.  Organizations used to take a year or more to build and deploy models; a major credit card issuer I worked with in the 1990s needed two years to upgrade its behavior scorecards.  Today, IBM Netezza customers who practice Agile methods can reduce this cycle to a day or less.

What Is Agile Analytics?

This post is the second in a four-part series.

Agile Analytics is an approach to predictive analytics that emphasizes:

  • Client satisfaction through rapid delivery of usable predictions
  • Focus on model performance when deployed “in market”
  • Iterative and evolutionary approach to model development
  • Rapid cycle time through radical reduction in time to deployment

The Agile approach focuses on the client’s end goal: using data-driven predictions to make better decisions that impact the business.  In contrast, conventional approaches to predictive modeling (such as the well-known SEMMA[1] model) tend to focus on the model development process, with minimal attention given to either the client’s business problem or how the model will be deployed.

Since Agile Analytics is most concerned with how well the predictive model supports the client’s decision-making process, the analyst evaluates the model based on how well it serves this purpose when deployed under market conditions.  In practice, this means that the analyst evaluates model accuracy in production together with score latency, deployment cost and interpretability – a critical factor when building predictive analytics into a human process.   Conventional approaches typically evaluate predictive models solely on model accuracy when back-tested on a sample, a measure that often overstates the accuracy that the model will achieve when deployed under market conditions.

Agile analysts stress rapid deployment and iterative learning; they assume that the knowledge produced from tracking an initial model after it is deployed enables enhancements in subsequent iterations, and they build this expectation into the modeling process.  An Agile analyst quickly develops a predictive model using fast, robust methods and available data, deploys the model, monitors the model in production and improves it as soon as possible.  A conventional analyst tends to take extra time perfecting an initial model prior to deployment, and may pay no attention to in-market performance unless the client complains about anomalies.

Reducing cycle time is critical for the Agile analyst, since every iteration produces new knowledge.  The Agile analyst aggressively looks for ways to reduce the time needed to develop and deploy models, and factors cycle time into the choice of analytic methods.  Conventional analysts are often strikingly unengaged with what happens outside of the model development task; larger analytic teams often delegate tasks like data marshalling, cleansing and scoring to junior members, who perform the “grunt” work with programming tools.

[1] Sample, Explore, Modify, Model, Assess

Agile Analytics: Overview

Is this the year of Agile Analytics?  Recent publications show growing interest in the application of Agile methods to analytics:

  • Ken Collier, an Agile pioneer, tackles analytics in his aptly named new book Agile Analytics.
  • A quick Google search surfaces a number of recent blogs and articles (here, here and here)
  • Curt Monash recently published an excellent two-part blog on the subject (here and here)

I’ve commented in the past on IBM’s Big Data Hub about techniques that contribute to Agile Analytics, such as in-database analytics, open source analytics and tighter integration with commercial packages like SAS.  In addition, I’ve commented on some of the barriers to agility, such as limitations of the PMML standard.

In this series, I’ll cover these topics:

(1) What is Agile Analytics?

(2) What’s driving interest in Agile Analytics?

(3) What business practices enable Agile Analytics?