SAS and Hadoop
SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities. Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.
Prior to January, 2012, a search for the words “Hadoop” or “MapReduce” returned no results on the SAS marketing and support websites, which says something about SAS’ leadership in this area. In March 2012, SAS announced support for Hadoop connectivity; since then, SAS has gradually expanded the features it supports with Hadoop.
As of today, there are four primary ways that a SAS user can leverage Hadoop:
- Legacy SAS users can connect to Hadoop through the SAS/ACCESS Interface to Hadoop
- SAS Enterprise Miner users can export scoring models to Hadoop with SAS Scoring Accelerator
- SAS LASR Server, the back end for SAS Visual Analytics, can be co-located in Hadoop
- The SAS High Performance Analytics suite can be co-located in Hadoop
Let’s take a look at each option.
“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc) that are used primarily from a programming interface. SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing. It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user. For more detailed information, read the manual.
SAS/ACCESS also supports six “Hadoop-enabled” PROCS (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop. If all you need to do is run frequency distributions, simple statistics and summary reports then SAS offers everything you need for analytics in Hadoop. If that is all you want to do, of course, you can use Datameer or Big Sheets and save on SAS licensing fees.
A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps. (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce). SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document); if SAS cannot implicitly translate a DATA Step into SQL/HiveQL, it copies the data back to the SAS server –without warning — and performs the operation there.
SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through. Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”. The bottom line is that since the user needs to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, the user might as well submit your jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.
SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.
SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera. Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes” — which means that in practice must customers must rebuild existing predictive models to take advantage of the product. Customers who already use SAS Enterprise Miner, can export the models in PMML and use them in any PMML-enabled database or decision engine and spend less on SAS licensing fees.
Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server. These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture. That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.
That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative. In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time. SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine. Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode. This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.
While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format. To import the data into SASHDAT, you will need to license SAS Data Integration Server.
A single in-memory node supported by a 16-core/256GB can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes. SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations. Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.
SAS seems to be doing a little better selling SAS VA/LASR Server; they have a big push on in 2013 to sell 2,000 copies of VA and heavily promote a one node version on a big H-P machine for $100K. Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.
While SAS has struggled to implement its in-memory software in Hadoop to date, YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop. Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.