SAS Versus R (Part 1)

Which is better for analytics, SAS or R?  One frequently sees discussions on this topic in social media.  Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light.

The question is serious.  Many organizations with a large investment in SAS are actively considering whether to adopt R, either to supplement SAS or to replace it altogether.  The trend is especially marked in the analytic services industry, which is particularly sensitive to SAS licensing costs and restrictive conditions.

In this post, I will recap some common myths about SAS and R.  In a follow-up post,  I will summarize the pros and cons of each as an analytics platform.

Myths About SAS and R

Advocates for SAS and R often support their positions with beliefs that are little more than urban legends; as such, they are not good reasons to choose SAS over R or vice versa.  Let’s review six of these myths.

(1) Regulatory agencies require applicants to use SAS.

This claim is often cited in the context of submissions to the Food and Drug Administration (FDA), apparently by those who have never read the FDA’s regulations governing submissions.  The FDA accepts submissions in a range of formats including SAS Transport Files (which an R user can create using the Stat/Transfer utility).  Nowhere in its regulations does the FDA mandate what software should be used to produce the analysis; like most government agencies, the FDA is legally required to support standards that do not favor single vendors.

Pharmaceutical firms tend to rely heavily on SAS because they trust the software, and not due to any FDA mandate.  Among its users, SAS has a deservedly strong reputation for quality; it is a mature product, and its statistical techniques are well-tested and completely documented.  In short, the software works, which means there is very little incentive for an established user to experiment with something else, just to save on licensing fees.

That trust in SAS isn’t a permanent state of affairs.  R is gradually making inroads in the life sciences community; it has already largely displaced SAS in the academic world.  Like many other regulatory bodies, the FDA itself uses open source R together with SAS.

(2) R is better than SAS because it is object oriented.

This belief is wrong on two counts: (1) it assumes that object-oriented languages are best for all use cases; and (2) it further assumes that SAS offers no object-oriented capability.

Object-oriented languages are more efficient and easier to use for many analysis tasks.  In real-world analytics, however, we often work with messy and complex data; a cursor-based language like the SAS DATA Step offers the user a great deal of flexibility, which is why it is so widely used.  Anyone who has ever attempted to translate SAS “first and last” processing into an object-oriented language understands this point.  (Yes, it can be done, but it requires a high level of expertise in the OOL to do it.)
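To make the “first and last” point concrete, here is a minimal sketch of what the translation can look like outside SAS.  Python stands in for the general-purpose language, and the function names and the patient/dose records are hypothetical illustrations, not code from any SAS or R library.  The input is assumed to be already sorted by the BY variable, as a SAS BY statement would require.

```python
from itertools import groupby
from operator import itemgetter


def flag_first_last(rows, by):
    """Yield (row, is_first, is_last) for rows already sorted by `by`.

    Roughly mimics the FIRST.<by> and LAST.<by> flags of SAS BY-group
    processing. This is an illustrative helper, not a library function.
    """
    for _, group in groupby(rows, key=itemgetter(by)):
        group = list(group)
        last_index = len(group) - 1
        for i, row in enumerate(group):
            yield row, i == 0, i == last_index


def total_dose_per_patient(rows):
    """Accumulate a running total and emit one record per BY group."""
    totals = []
    running = 0
    for row, is_first, is_last in flag_first_last(rows, "patient"):
        if is_first:      # reset the accumulator at the start of each group
            running = 0
        running += row["dose"]
        if is_last:       # output once, at the end of each group
            totals.append((row["patient"], running))
    return totals


# Hypothetical data, sorted by patient
visits = [
    {"patient": "A", "day": 1, "dose": 10},
    {"patient": "A", "day": 2, "dose": 20},
    {"patient": "B", "day": 1, "dose": 5},
]

print(total_dose_per_patient(visits))
```

The DATA Step gives you the two flags for free; here the helper has to be written (or found) before the row-by-row logic can even start, which is the extra expertise the myth overlooks.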

In Release 9.3, SAS introduced DS2, an object-oriented language with a defined migration path from SAS DATA Step programming.  Hence, for those tasks where object-oriented programming is desirable, DS2 meets this need for the SAS user.  (DS2 is included with Base SAS.)

(3) You never know what’s inside open source software like R.

Since R is an open programming environment, anyone can develop a package and contribute it to the project.  Commercial software vendors like to plant FUD about open source software by suggesting that contributors may be amateurs or worse — in contrast to the “professional” engineering of commercial software.

One of the key virtues of open source software is that you do know what’s inside it because — unlike commercial software — you can inspect the source code.  With commercial software, you must have faith in the vendor’s integrity, technical support and willingness to stand by its warranty.  For open source software, there is no warranty nor is one required; the code speaks for itself.

When a contributor publishes an enhancement to R, a large community of users evaluates and tests the new feature.  This “crowdsourced” testing quickly flags and logs issues with software syntax and semantics, and logged issues are available for anyone to see.

Commercial software vendors like SAS have professional testing and QA departments, but since testing is expensive there is considerable pressure to minimize the expense.   Under the pressure of Marketing and Sales deadlines, systematic testing is often the first task to be cut.  Bismarck once said that nobody should witness how laws or sausages are made; the same is true for commercial software.

SAS does not disclose the headcount it commits to software testing and QA, but given the size of the R user base, it’s fair to say that the number of people who test and evaluate each R release is far greater than the number of people who evaluate each SAS release.

(4) R is better than SAS because it has thousands of packages.

This is like arguing that Wal-Mart is a better store than Brooks Brothers because it carries more items.  Wal-Mart’s breadth of product makes it a great shopping destination for many shoppers, but a Brooks Brothers shopper appreciates the store’s focus on a certain look and personalized service.

By analogy, R’s cornucopia of functionality is both a feature and a bug.  Yes, there is a package in R to support every conceivable analytic need; in many cases, there is more than one package.  As of this writing, there are 486 packages that support linear regression, which is great unless you only need one and don’t want to sift through 486.

Of course, actual R users don’t check every package to find what they need; they settle on a few trusted packages based on actual experience, word-of-mouth, books, periodicals or other sources of information.  In practice, relatively few R packages are actually used; the graph below shows package downloads from RStudio’s popular CRAN mirror in September 2014.

[Figure: CRAN Downloads]

(For the record, the ten most downloaded packages from RStudio’s CRAN mirror in September 2014 were Rcpp, plyr, ggplot2, stringr, digest, reshape2, RColorBrewer, labeling, colorspace and scales.)

For actual users, the relevant measure isn’t the total number of features supported in SAS and R; it’s how those features align with user needs.

N.B. — Some readers may quibble with my use of statistics from a single CRAN mirror as representative of the R community at large.  It’s a fair point — there are at least 105 public CRAN mirror sites worldwide — but given RStudio’s strong market presence it’s a reasonable proxy.

(5) Switching from SAS to R is expensive because you have to rewrite all of your code.

It’s true that when switching from SAS to R you have to rewrite programs that you want to keep; there is no engine that will translate SAS code to R code. However, SAS users tend to overestimate the effort and cost to accomplish this task.

Analytic teams that have used SAS for some years typically accumulate a large stock of programs and data; much of this accumulation, however, is junk that will never be re-used.    Keep in mind that analytic users don’t work the same way as software developers in IT or a software engineering organization.  Production developers tend to work in a collaborative environment that ensures consistent, reliable and stable results.  Analytic users, on the other hand, tend to work individually on ad hoc analysis projects; they are often inconsistently trained in software best practices.

When SAS users are pressed to evaluate a library of existing programs and identify the “keepers”, they rarely identify more than 10-20% of the existing library.  Hence, the actual effort and expense of program conversion should not be a barrier for most organizations if there is a compelling business case to switch.

It’s also worth noting that sticking with SAS does not free the organization from the cost of code migration, as SAS customers discovered when SAS 9 was released.

The real cost of switching from SAS to R is measured in human capital — in the costs of retraining skilled professionals.  For many organizations, this is a deal-breaker at present; but as more R-savvy analysts enter the workforce, the costs of switching will decline.

(6) R is a good choice when working with Big Data.

When working with Big Data, neither “legacy” SAS nor open source R is a good choice, for different reasons.

Open source R runs in memory on a single machine; it can work with data up to available memory, then fails.  It is possible to run R in a Hadoop cluster or as table functions inside MPP databases.  However, since R runs independently on each node, this is useful only for embarrassingly parallel tasks; for most advanced analytics tasks, you will need to invoke a distributed analytics engine.   There are a number of distributed engines you can invoke from R, including H2O, ScaleR and Skytree, but at this point R is simply a client and the actual work is done by the distributed engine.
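The difference between an embarrassingly parallel task and one that needs a distributed engine can be sketched in a few lines.  The example below is only an illustration in Python with made-up data and a made-up scoring function: scoring rows with a fixed model works when each node runs independently, while a global statistic such as the median cannot be obtained by naively combining per-node answers.

```python
from statistics import median


def score_row(x, intercept=1.0, slope=2.0):
    """Apply a fixed, already-fitted model to one row (hypothetical model)."""
    return intercept + slope * x


def score_on_each_node(partitions):
    """Each node scores its own rows; no node needs to see another's data."""
    return [[score_row(x) for x in part] for part in partitions]


def naive_median_of_node_medians(partitions):
    """Combine independent per-node medians. This is NOT the true median."""
    return median(median(part) for part in partitions)


# Hypothetical data spread unevenly across three nodes
partitions = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 1000]]
all_rows = [x for part in partitions for x in part]

# Scoring: per-node results concatenate to the single-machine result
scored_flat = [y for part in score_on_each_node(partitions) for y in part]
scored_single = [score_row(x) for x in all_rows]

# Median: the per-node shortcut gives a different answer
true_median = median(all_rows)
naive = naive_median_of_node_medians(partitions)

print(scored_flat == scored_single, true_median, naive)
```

Getting the correct median (or fitting most models) requires the nodes to coordinate, and that coordination is the job the distributed engine does.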

“Legacy” SAS uses file-swapping to handle out-of-memory problems, but at great cost to performance; when a data set is too large to load into memory, “legacy” SAS slows to a crawl.  Through SAS/ACCESS, SAS supports the ability to pass through SQL operations to MPP databases and HiveQL, MapReduce and Pig to Hadoop; however, as is the case with R, “legacy” SAS simply functions as a client and the work is done in the database or Hadoop.  The user can accomplish the same tasks using any SQL or Hadoop interface.

To its credit, SAS also offers distributed in-memory software that runs inside Hadoop (the SAS High-Performance Analytics suite and SAS In-Memory Statistics for Hadoop).  Of course, these products do not replicate “legacy” SAS; they are entirely new products that support a subset of “legacy” SAS functionality at extra cost.  Some migration may be required, since they run DS2 but not the traditional SAS DATA Step.  (I cite these points not to denigrate the new SAS software, which appears to be well designed and implemented,  but to highlight the discontinuity for SAS users between the “legacy” product and the scalable High Performance products.)

If your organization works with Big Data, your primary focus should be on choosing the right scalable analytics platform, with secondary emphasis on the client or API used to invoke it.

SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures:

(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.

(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.

SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post.  At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server.   Paul was most generous with his time.  We discussed three areas:

(1) Can SAS LASR Server work directly with data in Hadoop?

According to SAS documentation, LASR Server can read data from traditional SAS datasets, relational databases (using SAS/ACCESS software) or data stored in SAS’ proprietary SASHDAT format.  That suggests SAS users must preprocess Hadoop data before loading it into LASR Server.

Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database.  (Of course, this applies to structured data only.)  Reading from SASHDAT is much faster, however, so users should weigh the time needed to pre-process data into SASHDAT against the longer runtime with SAS/ACCESS.

SAS/ACCESS Interface to Hadoop can read all widely used Hadoop data formats, including ORC, Parquet and Tab-Delimited; it can also read user-defined formats.  This builds on SAS’ long-standing ability to work with enterprise data everywhere.

Base SAS supports basic data cleansing and data transformation capability through DATA Step and DS2 processing, and can write SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, this transformation could require extracting the data and moving it to an external server.  Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to perform data transformation in place.  Users can also license SAS ETL Server and build a process to convert raw data and store it in SASHDAT.

SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.

(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?

At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.

Paul notes that given a fixed number of task tracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workload.  This issue is not unique to SAS, but potentially applies to any software co-located with Hadoop prior to the introduction of YARN.

Under Hadoop 1.0, Hadoop workload management was tightly married to MapReduce.  Applications operating independently from MapReduce (like SAS) were essentially ungoverned.  The introduction of YARN late last year eliminates this issue because it supports unified workload management for MapReduce and non-MapReduce applications.

(3) Can SAS LASR Server run on standard commodity hardware?

SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.

While the definition of a “standard” Hadoop DataNode server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128 GB of RAM; sources at Cloudera confirm this is a typical configuration.  A recently published paper from HP and Hortonworks positions the reference spec at 96 GB of RAM for memory-intensive applications.

In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256 GB of RAM.

It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations seeking to use Spark or Impala should specify more memory.   While some prospective customers may balk at the task of upgrading memory in every DataNode of a large cluster, the cost of memory is coming down, so this should not be an issue in the long run.
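For a rough sense of what “upgrading memory in every DataNode” means, the arithmetic can be sketched as below.  The per-node figures (64-128 GB typical, 256 GB minimum for LASR Server) come from the discussion above; the 100-node cluster size and the function name are assumptions chosen purely for illustration.

```python
def extra_ram_gb(nodes, current_gb_per_node, required_gb_per_node=256):
    """Total additional RAM (in GB) needed to bring every node up to spec.

    `required_gb_per_node` defaults to the 256 GB minimum cited above;
    the cluster size passed in is a hypothetical input.
    """
    per_node_gap = max(0, required_gb_per_node - current_gb_per_node)
    return nodes * per_node_gap


# Hypothetical 100-node cluster at the low and high end of the "standard" spec
low_end = extra_ram_gb(nodes=100, current_gb_per_node=64)
high_end = extra_ram_gb(nodes=100, current_gb_per_node=128)

print(low_end, high_end)
```

Multiply the result by whatever your hardware supplier quotes per GB to get a first-pass budget figure.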

Strata Report: Advanced Analytics in Hadoop

Here is a quick review of the capabilities for advanced analytics in Hadoop for five vendors at the recent Strata NYC conference:

0xData

Product(s)

  • H2O (open source project)
  • h2o (R package)

Description

Smart people from Stanford with VC backing and a social media program.  Services business model with open source software.  H2O is an open source library of algorithms designed for deployment in Hadoop or free-standing clusters; aggressive vision, but currently available functionality limited to GLM, k-Means, Random Forests.  Update: 0xData just announced H2O 2.0, which includes Distributed Trees and Regression, such as Gradient Boosting Machine (GBM), Random Forest (RF), Generalized Linear Modeling (GLM), k-Means and Principal Component Analysis (PCA).  They also claim to run “100X faster than other predictive analytics providers”, although this claim is not supported by evidence.  R users can interface through the h2o package.  Limited customer base.  Partners with Cloudera and MapR.

Key Points

  • True open source model
  • Comprehensive roadmap
  • Limited functionality
  • Limited user base
  • Performance claims undocumented

Alpine Data Labs

Product(s)

  • Alpine 2.8

Description

Alpine targets a business user persona with a visual workflow-oriented interface (comparable to SAS Enterprise Miner or SPSS Modeler).   Supports a reasonably broad range of analytic features.  Claims to run “in” a number of databases and Hadoop distributions, but company is opaque about how this works.  (Appears to be SQL/HiveQL push-down).   In practice, most customers seem to use Alpine with Greenplum.  Thin sales and customer base relative to claimed feature mix suggests uncertainty about product performance and stability.  Partners with Pivotal, Cloudera and MapR.

Key Points

  • Reasonable option for users already committed to Greenplum Database
  • Limited partner and user ecosystem
  • Performance and stability should be vetted thoroughly in POC

Oracle

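Product(s)

  • Oracle R Distribution
  • Oracle R Enterprise
  • Oracle Advanced Analytics (including Oracle Data Mining)
  • Oracle R Connector for Hadoop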

Description

Oracle R Distribution (ORD) is a free distribution of R with bug fixes and performance enhancements; Oracle R Enterprise is a supported version of ORD with additional enhancements (detailed below).

Oracle Advanced Analytics (an option of Oracle Database Enterprise Edition) bundles Oracle Data Mining, a distributed data mining engine that runs in Oracle Database, and Oracle R Enterprise.   Oracle Advanced Analytics provides an R to SQL transparency layer that maps R functions and algorithms to native in-database SQL equivalents.  When in-database equivalents are not available, Oracle Advanced Analytics can run R commands under embedded R mode.

Oracle R Connector for Hadoop (ORCH) is an R interface to Hadoop; it enables the user to write MapReduce tasks in R and interface with Hive.  As of ORCH 2.1.0, there is also a fairly rich collection of machine learning algorithms for supervised and unsupervised learning that can be pushed down into Hadoop.

Key Points

  • Good choice for Oracle-centric organizations
  • Oracle Data Mining is a mature product with an excellent user interface
  • Must move data from Hadoop to Oracle Database to leverage OAA
  • Hadoop push-down from R requires expertise in MapReduce

SAS

Product(s)

  • SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server

Description

SAS/ACCESS Interface to Hadoop enables SAS users to pass Hive, Pig or MapReduce commands to Hadoop through a connection and move the results back to the SAS server.   With SAS/ACCESS you can haul your data out of Hadoop, plug it into SAS and use a bunch of other SAS products, but that architecture is pretty much a non-starter for most Strata attendees.   Update:  SAS has announced SAS/ACCESS for Impala.

Visual Analytics is a Tableau-like visualization tool with limited predictive analytic capabilities; LASR Server is the in-memory back end for Visual Analytics.  High Performance Analytics is a suite of distributed in-memory analytics.   LASR Server and HPA Server can be co-located in a Hadoop cluster, but require special hardware.  Partners with Cloudera and Hortonworks.

Key Points

  • Legacy SAS connects to Hadoop, does not run in Hadoop
  • SAS/ACCESS users must know exact Hive, Pig or MapReduce syntax
  • Visual Analytics cannot work with “raw” data in Hadoop
  • Minimum hardware requirements for LASR and HPA significantly exceed standard Hadoop worker node specs
  • High TCO, proprietary architecture for all SAS products

Skytree

Product(s)

  • Skytree Server

Description

Academic machine learning project (FastLab, at Georgia Tech); with VC backing, launched as commercial software vendor January 2013.  Server-based technology, can connect to a range of data sources, including Hadoop.  Programming interface; claims ability to run from R, Weka, C++ and Python.  Good library of algorithms.  Partners with Cloudera, Hortonworks, MapR.  Skytree is opaque about technology and performance claims.

Key Points

  • Limited customer base, no announced sales since company launch
  • Hadoop integration is a connection, not “inside” architecture
  • Performance claims should be carefully vetted

SAS Global Forum Link-O-Rama

Google Alerts firing all day like an Uzi.  Quick summary.

Cloudera hearts SAS.  Cloudera announced a “strategic alliance” with SAS today at SAS Global Forum.    According to the announcement,

Customers are now empowered to quickly and easily analyze their data in Hadoop by connecting SAS directly to their Cloudera-powered Big Data repositories.

Customers were also so empowered last year at this time, when SAS first released an access engine for Cloudera.  SAS/ACCESS for Hadoop enables a user to embed MapReduce, HiveQL and Pig statements in a SAS program.

Leveraging SAS data management tools with Hadoop’s open platform and parallel architecture, business analysts can instantly query data in Hadoop without additional training

This statement is true for business analysts who already know MapReduce, HiveQL and Pig.

Does this mean that SAS predictive analytics will someday run inside a Cloudera Hadoop distribution?  Don’t hold your breath on that.   SAS seems to be putting all of its R&D eggs in the in-memory basket.

SAS hearts Cloud.  After pooh-poohing public cloud for years, SAS finally admits that cloud has some potential; hence the hoopla about SAS 9.4 being “cloud-ready”.  Note to SAS: all software is “cloud-ready” unless you deliberately build in obstacles, like a cumbersome license key, terms and conditions or pricing that makes it not worth doing.

Your software renewal fees at work.   The Umstead, captive “hotel, restaurant and spa” where SAS wines and dines customers and employees of the month, has lovely glass sculptures in custom pots personally selected by Mrs Goodnight, with the able assistance of the SAS art and scenic crew.

I ate at the Umstead once.  When it first opened, word came down that SAS employees weren’t supposed to stay there, because  it was way too fancy.  Then it seems they had a hard time filling the place, because there’s no good reason to hang out at the corner of Harrison and I-40 unless you’re a SAS employee or your SAS rep gives you a free ticket to come on down, hang out and watch the Visual Analytics demo or whatever.

Food was pretty good, a little better than the Bonefish Grill or Ruth’s Chris up the street.  Service was mannered, as if the young people were still learning where to put the salad fork, which is not the sort of thing they teach you at N.C. State.  The cocktail waitress had a tramp stamp.

SAS reboots High Performance Analytics Server.  The global user group for SAS High Performance Analytics Server can meet at one of the small tables in Starbucks at the Moscone Center.   Announced two years ago and launched seventeen months ago, as of this writing SAS still has no public success stories for the product, possibly because customers are unwilling to shell out a couple million in first year fees plus a couple more million for the appliance it runs in for a big sandbox.

As the press release puts it:

Each of the new products is laser-focused on analytic technique, including data mining, text mining, optimization, forecasting, statistics and econometrics, and useful across any industry.

Got that?  It’s “laser-focused”.  Sounds like SAS is repackaging the HPA stuff into smaller bundles, presumably with a lower price, which seems like a smart move.

SAS also plans to add the HPA algorithms into SAS/Stat, Analytics Pro and Enterprise Miner for deployment in legacy environments.  This is great news for people with tiny data sets who like to play with SAS.

In other HPA news, SAS announced support for Oracle Exadata, a move that will cause long faces at IBM, who were really, really optimistic recently that SAS would soon support HPA on IBM boxes.  Note to IBM: if you want SAS to run on your boxes, you have to buy the Gold Sponsorship.