SAS Versus R Part Two

In a previous post, I summarized some myths about SAS and R — arguments offered by proponents of one or the other that deserve to be dismissed.

In this post, I will review some arguments that do make sense — things to consider if you are an aspiring analyst or if you are an executive making decisions about software for your organization.

(1) Every analysis technique available in SAS is available in R — plus many more

It’s fair to say that any analysis you can do in SAS you can also do in R.  The reverse, however, is not true — there are many techniques available in R that are not available in SAS.

As an open source platform, R is open to innovation, and offers few barriers to entry for new techniques.  An analyst who develops a new technique can quickly publish it in R, even if the technique has only niche appeal; it’s a great example of the long tail effect in action.

Commercial software providers like SAS, on the other hand, use product management calculus to balance the benefits of introducing a new technique against the cost to develop and support it.  The marginal revenue from adding a feature is hard to measure, while the costs are known, so conservative companies like SAS tend to lag well behind the cutting edge.  SAS also tends to bundle popular new capabilities into new products rather than enhancing the existing product, forcing customers to add more SAS software licenses to the stack if they want the capability.

Random Forests is a case in point.  Breiman and Cutler published their seminal article describing the technique in October, 2001; the following year, they published the randomForest package in R.  In December, 2012, SAS released an “experimental” version of what it calls “HP Forests” in SAS High Performance Analytics, and in 2013 included the PROC in SAS Enterprise Miner 13.1.

Ten years is a long time to wait.

(2) SAS is easier to learn and use than R

R mavens dispute this point, but they are wrong.  R is significantly harder to learn and use than SAS, at several levels, and for a number of reasons.

Bob Muenchen recently published an excellent catalogue of Things That Make R Hard to Learn.  Bob should know; he makes a living helping users cross the chasm from SAS to R.  Here is a brief excerpt, but you should definitely read the whole thing:

R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.

There are two main reasons SAS is easier to use than R.  First, as a commercial product, every element of SAS is governed by a common design that unifies the SAS programming language, user interfaces and documentation.  As a result, SAS programming syntax and documentation are generally consistent across procedures; statements generally mean the same thing whether you are working in PROC ACCESS or PROC XML.

Developers who contribute R packages, on the other hand, operate independently and without a comparable design.  While each individual package may be well or poorly written, there is no governing principle that ensures packages are consistent with one another.  While R aficionados celebrate its diversity, to the outsider it just seems messy.

SAS’ strong development tools add significant value for the user.  SAS Enterprise Guide, for example, included with Analytics Pro at no extra charge, offers a workflow interface and the ability to generate SAS or SQL code behind the scenes.  There is no equivalent code-generating tool available for R today.

(3) SAS offers an “enterprise-grade” solution

Individual analysts surveyed by Rexer last year said that cost and ease of use are the most important factors they consider when choosing analytic tools.  For enterprises, however, the selection criteria are more complex.

Technical support is a key concern for most organizations; some go so far as to adopt blanket policies banning the use of unsupported software.  SAS invests heavily in its Support organization; unlike many large software vendors, Technical Support is a career track at SAS, with low employee turnover.  With offices in multiple countries, SAS is able to support customers globally and at enterprise scale.

When SAS licenses its software, it warrants that the software is materially free of defects.  This warranty is backed up by a contractual commitment to fix defects that surface.  Hence, SAS offers the customer a “single throat to choke” — customers know when they license SAS that a single organization is responsible for development, distribution, implementation and support of the software, and accountability is clear.

Open source R, of course, has no organic technical support.  Organizations such as Revolution Analytics offer technical support either for open source R or Revolution’s own commercial R distribution.  Third party service providers like Revolution can be highly knowledgeable and effective; however, if there is a software defect in an R package, the support provider can only notify the developer and request resolution.

(4) SAS costs more than R

“Duh!” you say; “R is free!”  True enough.  R is open source software, distributed with a free license to use; for a single analyst, the incremental TCO to download, install and use R on an existing machine is zero.   This is also true for other key components of the R ecosystem, such as RStudio, the popular development environment.   Low cost of entry is a key driver behind R’s growing popularity.

SAS, on the other hand, charges a subscription fee that bundles a term license to use the software with technical support and maintenance.  The entry cost to license the most basic package (SAS Analytics Pro) is $8,700 (first year fee) at the SAS online store; this package includes Base SAS, SAS/STAT and SAS/GRAPH.  SAS renewal fees generally run 25-30% of the first year fee.  SAS bundles its analytic features into a number of separate packages, such as SAS/ETS for time series, SAS/OR for optimization and SAS/IML for matrix manipulation; if you require these capabilities, you must pay extra.  SAS also offers access engines for an assortment of data sources, each of which can be licensed individually for $3,000.
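To make the arithmetic concrete, here is a rough Python sketch of cumulative license spend using the figures quoted above.  The function name and its parameters are hypothetical, and it assumes the renewal rate stays flat and applies to access engines as well as the base package, which may not match an actual SAS contract.

```python
def cumulative_license_cost(first_year_fee, renewal_rate, years,
                            access_engines=0, engine_fee=3000):
    """Total license spend over `years`, assuming a flat renewal rate.

    Assumption (not from SAS): access engines renew at the same rate
    as the base package.
    """
    if years < 1:
        return 0.0
    first_year = first_year_fee + access_engines * engine_fee
    renewal = renewal_rate * first_year
    return first_year + renewal * (years - 1)


# Analytics Pro alone over three years, at both ends of the 25-30% range.
low = cumulative_license_cost(8700, 0.25, years=3)
high = cumulative_license_cost(8700, 0.30, years=3)
print(f"Three-year cost, Analytics Pro only: ${low:,.0f} to ${high:,.0f}")

# The same package plus two access engines, at the low renewal rate.
with_engines = cumulative_license_cost(8700, 0.25, years=3, access_engines=2)
print(f"Three-year cost with two access engines: ${with_engines:,.0f}")
```

Server licenses are negotiated, so a calculation like this only applies to the single-machine pricing in the online store.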

The version of SAS sold through the online store is for single Windows machines only.  SAS sells its software for servers through its sales force, and pricing is negotiated; “list” price depends on the computing power of the server, measured by cores or sockets.  Server pricing for Analytics Pro starts in the low six figures.

SAS offers a virtualized “University Edition” which is free but not open source.  See here for a review.

Bottom line — for the analyst

Aspiring analysts ask “should I learn SAS or R?”  I’m tempted to answer “why not both?” but that raises the question of which to learn first.

If SAS is the primary tool at your organization or university, learn it and use it.  There are still more jobs available for SAS users than R users (though the gap is narrowing); and even prospective employers who do not currently use SAS treat it as a proxy for analytics know-how.

If your organization or university supports both SAS and R, look for trends in usage.  Is the R community growing rapidly?  Are the “best and brightest” people using SAS or R?  Is your management putting out subtle (or not-so-subtle) messages promoting use of one or the other?   Take the pulse of your organization and make your choice.

If your organization or university does not already license SAS, if you aspire to free-lance consulting or you are simply unemployed, learn R.  Doing so costs you nothing, and there are plenty of low-cost options for training and self-directed learning.

Bottom line — for the enterprise

If you are making decisions about software for an analytics team or an entire organization, the calculus is more complex.

R has more analytic techniques than SAS, but what techniques do you actually need?  Take note of your team’s actual current and future analytic needs, and act accordingly.  If you are using SAS today, the chances are very good that a handful of PROCs account for 95% of current usage; the same is true for R.

SAS is easier to learn than R, but if all or most of your analysts already know R, what difference does it make?  Many younger analysts entering the workforce already know how to use R, and it is a waste of time and money to force them to learn SAS.  On the other hand, if your analysts rely on SAS, you can expect to invest considerable time and money for retraining.

Do you need an enterprise solution?  If your organization spans multiple countries, or if you support more than twenty users, the chances are that the answer is “yes”.  For a larger organization, it’s hard to beat SAS’ ability to mobilize support, training and consulting resources around the world.  This is likely to change in the future, as organizations like Revolution Analytics build scale and credibility.

SAS costs more than R, but R is not free.  If you are concerned about SAS costs, carefully evaluate your spending and take note of the value offered by SAS.  Keep in mind that software licensing costs are only one component of Total Cost of Ownership (TCO); third-party support for R is not free, and neither is training and conversion.  Do the math.

In general, SAS works well for organizations that are in the middle of Tom Davenport’s maturity cycle pictured above.  These organizations have the basic data infrastructure and business cases for analytics, combined with a need for rapid scale and consistency across locations.  As organizations mature, they become less dependent on a single vendor for analytics and more willing to develop a “best-of-breed” approach; they are more interested in innovation and “cutting-edge” techniques, and the analysts they hire have the will and skill to learn R.  These organizations are adopting R at an increasing rate.

Adoption of R is most pervasive among analytic service providers, such as consultants, system integrators and marketing service providers.  These organizations are sensitive to software costs and tend to hire the most highly skilled analysts, for whom R’s learning curve is not a serious issue.  Costs aside, SAS restrictions on use, designed to prevent cannibalization, are highly problematic for service providers.

SAS Versus R (Part 1)

Which is better for analytics, SAS or R?  One frequently sees discussions on this topic in social media; for examples, see here, here, here, here, here and here.   Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light.

The question is serious.  Many organizations with a large investment in SAS are actively considering whether to adopt R, either to supplement SAS or to replace it altogether.  The trend is especially marked in the analytic services industry, which is particularly sensitive to SAS licensing costs and restrictive conditions.

In this post, I will recap some common myths about SAS and R.  In a follow-up post,  I will summarize the pros and cons of each as an analytics platform.

Myths About SAS and R

Advocates for SAS and R often support their positions with beliefs that are little more than urban legends; as such, they are not good reasons to choose SAS over R or vice-versa.   Let’s review six of these myths.

(1) Regulatory agencies require applicants to use SAS.

This claim is often cited in the context of submissions to the Food and Drug Administration (FDA), apparently by those who have never read the FDA’s regulations governing submissions.  The FDA accepts submissions in a range of formats, including SAS Transport Files (which an R user can create using the StatTransfer utility).  Nowhere in its regulations does the FDA mandate what software should be used to produce the analysis; like most government agencies, the FDA is legally required to support standards that do not favor single vendors.

Pharmaceutical firms tend to rely heavily on SAS because they trust the software, and not due to any FDA mandate.  Among its users, SAS has a deservedly strong reputation for quality; it is a mature product and its statistical techniques are mature, well-tested and completely documented.  In short, the software works, which means there is very little incentive for an established user to experiment with something else, just to save on licensing fees.

That trust in SAS isn’t a permanent state of affairs.  R is gradually making inroads in the life sciences community; it has already largely displaced SAS in the academic world.  Like many other regulatory bodies, the FDA itself uses open source R together with SAS.

(2) R is better than SAS because it is object oriented.

This belief is wrong on two counts: (1) it assumes that object-oriented languages are best for all use cases; and (2) it further assumes that SAS offers no object-oriented capability.

Object-oriented languages are more efficient and easier to use for many analysis tasks.  In real-world analytics, however, we often work with messy and complex data; a cursor-based language like the SAS DATA Step offers the user a great deal of flexibility, which is why it is so widely used.  Anyone who has ever attempted to translate SAS “first and last” processing into an object-oriented language understands this point.  (Yes, it can be done, but it requires a high level of expertise in the OOL.)
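For readers who have not met the idiom: within a sorted BY group, the DATA Step exposes flags marking the first and last row of each group.  The Python sketch below is one possible way to reproduce those flags; it is an illustration with hypothetical names and data, not a translation of any particular SAS program, and it assumes the rows are already sorted by the grouping key.

```python
from itertools import groupby
from operator import itemgetter


def flag_first_last(rows, key):
    """Yield (row, is_first, is_last) for rows already sorted by `key`,
    mimicking the FIRST.key / LAST.key flags of a DATA Step BY group."""
    for _, group in groupby(rows, key=itemgetter(key)):
        members = list(group)  # materialize one group at a time
        last_index = len(members) - 1
        for i, row in enumerate(members):
            yield row, i == 0, i == last_index


# Hypothetical data: visits per customer, sorted by customer id.
visits = [
    {"customer": "A", "amount": 10},
    {"customer": "A", "amount": 25},
    {"customer": "B", "amount": 7},
]

# Keep only the last record per customer, roughly what a subsetting
# `if last.customer;` statement would do in a DATA Step.
last_visits = [
    row for row, _, is_last in flag_first_last(visits, "customer") if is_last
]
print(last_visits)
```

Note that the sketch has to buffer each group in memory to know which row is last, which the row-at-a-time DATA Step does not require the programmer to think about.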

In Release 9.3, SAS introduced DS2, an object-oriented language with a defined migration path from SAS DATA Step programming. Hence, for those tasks where object-oriented programming is desirable, DS2 meets this need for the SAS user.  (DS2 is included with Base SAS).

(3) You never know what’s inside open source software like R.

Since R is an open programming environment, anyone can develop a package and contribute it to the project.  Commercial software vendors like to plant FUD about open source software by suggesting that contributors may be amateurs or worse — in contrast to the “professional” engineering of commercial software.

One of the key virtues of open source software is that you do know what’s inside it because — unlike commercial software — you can inspect the source code.  With commercial software, you must have faith in the vendor’s integrity, technical support and willingness to stand by its warranty.  For open source software, there is no warranty nor is one required; the code speaks for itself.

When a contributor publishes an enhancement to R, a large community of users evaluates and tests the new feature.  This “crowdsourced” testing quickly flags and logs issues with software syntax and semantics, and logged issues are available for anyone to see.

Commercial software vendors like SAS have professional testing and QA departments, but since testing is expensive there is considerable pressure to minimize the expense.   Under the pressure of Marketing and Sales deadlines, systematic testing is often the first task to be cut.  Bismarck once said that nobody should witness how laws or sausages are made; the same is true for commercial software.

SAS does not disclose the headcount it commits to software testing and QA, but given the size of the R user base, it’s fair to say that the number of people who test and evaluate each R release is far greater than the number of people who evaluate each SAS release.

(4) R is better than SAS because it has thousands of packages.

This is like arguing that Wal-Mart is a better store than Brooks Brothers because it carries more items.  Wal-Mart’s breadth of product makes it a great shopping destination for many shoppers, but a Brooks Brothers shopper appreciates the store’s focus on a certain look and personalized service.

By analogy, R’s cornucopia of functionality is both a feature and a bug.  Yes, there is a package in R to support every conceivable analytic need; in many cases, there is more than one package.  As of this writing, there are 486 packages that support linear regression, which is great unless you only need one and don’t want to sift through 486.

Of course, actual R users don’t check every package to find what they need; they settle on a few trusted packages based on actual experience, word-of-mouth, books, periodicals or other sources of information.  In practice, relatively few R packages are actually used; the graph below shows package downloads from RStudio’s popular CRAN mirror in September 2014.

CRAN Downloads

(For the record, the ten most downloaded packages from RStudio’s CRAN mirror in September 2014 were Rcpp, plyr, ggplot2, stringr, digest, reshape2, RColorBrewer, labeling, colorspace and scales.)

For actual users, the relevant measure isn’t the total number of features supported in SAS and R; it’s how those features align with user needs.

N.B. — Some readers may quibble with my use of statistics from a single CRAN mirror as representative of the R community at large.  It’s a fair point — there are at least 105 public CRAN mirror sites worldwide — but given RStudio’s strong market presence it’s a reasonable proxy.

(5) Switching from SAS to R is expensive because you have to rewrite all of your code.

It’s true that when switching from SAS to R you have to rewrite programs that you want to keep; there is no engine that will translate SAS code to R code. However, SAS users tend to overestimate the effort and cost to accomplish this task.

Analytic teams that have used SAS for some years typically accumulate a large stock of programs and data; much of this accumulation, however, is junk that will never be re-used.    Keep in mind that analytic users don’t work the same way as software developers in IT or a software engineering organization.  Production developers tend to work in a collaborative environment that ensures consistent, reliable and stable results.  Analytic users, on the other hand, tend to work individually on ad hoc analysis projects; they are often inconsistently trained in software best practices.

When SAS users are pressed to evaluate a library of existing programs and identify the “keepers”, they rarely identify more than 10-20% of the existing library.  Hence, the actual effort and expense of program conversion should not be a barrier for most organizations if there is a compelling business case to switch.

It’s also worth noting that sticking with SAS does not free the organization from the cost of code migration, as SAS customers discovered when SAS 9 was released.

The real cost of switching from SAS to R is measured in human capital — in the costs of retraining skilled professionals.  For many organizations, this is a deal-breaker at present; but as more R-savvy analysts enter the workforce, the costs of switching will decline.

(6) R is a good choice when working with Big Data.

When working with Big Data, neither “legacy” SAS nor open source R is a good choice, for different reasons.

Open source R runs in memory on a single machine; it can work with data up to available memory, then fails.  It is possible to run R in a Hadoop cluster or as table functions inside MPP databases.  However, since R runs independently on each node, this is useful only for embarrassingly parallel tasks; for most advanced analytics tasks, you will need to invoke a distributed analytics engine.   There are a number of distributed engines you can invoke from R, including H2O, ScaleR and Skytree, but at this point R is simply a client and the actual work is done by the distributed engine.
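The distinction between embarrassingly parallel work and everything else can be shown with a toy example.  The Python sketch below uses made-up data split across three hypothetical “nodes”: a mean can be assembled from independent per-node summaries with a trivial final combine, while a median cannot be recovered from per-node medians.  It illustrates the general idea only and says nothing about how any of the engines named above work internally.

```python
from statistics import median

# Hypothetical data already split across three "nodes".
partitions = [[1, 2, 3], [4, 5, 6, 7], [100]]


def local_summary(chunk):
    """Work each node can do alone: a partial sum and a row count."""
    return sum(chunk), len(chunk)


# Mean: combine the per-node summaries; no node needs another node's rows.
summaries = [local_summary(chunk) for chunk in partitions]
total = sum(s for s, _ in summaries)
count = sum(n for _, n in summaries)
global_mean = total / count

# Median: combining per-node medians does not give the true median,
# so independent per-node runs are not enough for this statistic.
median_of_medians = median(median(chunk) for chunk in partitions)
true_median = median(x for chunk in partitions for x in chunk)

print(global_mean, median_of_medians, true_median)
```

Many model-fitting algorithms behave more like the median than the mean here, which is why they need an engine that coordinates across nodes.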

“Legacy” SAS uses file-swapping to handle out-of-memory problems, but at great cost to performance; when a data set is too large to load into memory, “legacy” SAS slows to a crawl.  Through SAS/ACCESS, SAS supports the ability to pass through SQL operations to MPP databases and HiveQL, MapReduce and Pig to Hadoop; however, as is the case with R, “legacy” SAS simply functions as a client and the work is done in the database or Hadoop.  The user can accomplish the same tasks using any SQL or Hadoop interface.

To its credit, SAS also offers distributed in-memory software that runs inside Hadoop (the SAS High-Performance Analytics suite and SAS In-Memory Statistics for Hadoop).  Of course, these products do not replicate “legacy” SAS; they are entirely new products that support a subset of “legacy” SAS functionality at extra cost.  Some migration may be required, since they run DS2 but not the traditional SAS DATA Step.  (I cite these points not to denigrate the new SAS software, which appears to be well designed and implemented,  but to highlight the discontinuity for SAS users between the “legacy” product and the scalable High Performance products.)

If your organization works with Big Data, your primary focus should be on choosing the right scalable analytics platform, with secondary emphasis on the client or API used to invoke it.