SAS offers several products that run “inside” Hadoop, based on two distinct in-memory architectures:
(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.
(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.
SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post. At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server. Paul was most generous with his time. We discussed three areas:
(1) Can SAS LASR Server work directly with data in Hadoop?
According to SAS documentation, LASR Server can load data from traditional SAS datasets, from relational databases (using SAS/ACCESS software) or from data stored in SAS’ proprietary SASHDAT format. Read literally, that suggests SAS users must preprocess Hadoop data before loading it into LASR Server.
Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database. (Of course, this applies to structured data only.) Reading from SASHDAT is much faster, however, so users should weigh the time needed to pre-process data into SASHDAT against the longer load times they will see with SAS/ACCESS.
SAS/ACCESS Interface to Hadoop can read the widely used Hadoop data formats, including ORC, Parquet and tab-delimited text; it can also read user-defined formats. This builds on SAS’ long-standing ability to work with enterprise data wherever it resides.
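To make the two paths concrete, here is a minimal sketch of what loading Hadoop data into LASR Server might look like. The host names, ports, librefs and table names are invented for illustration, and the exact LIBNAME and PROC LASR options vary by release and deployment, so treat this as a sketch rather than a recipe.

/* Option 1: read Hadoop data through SAS/ACCESS Interface to Hadoop,     */
/* which presents HDFS data to SAS as if it were a relational database.   */
libname hdp hadoop server="hive.example.com" port=10000 user=sasdemo;

/* Start a LASR Server instance, then lift the Hive table into memory.    */
proc lasr create port=10010 path="/tmp/lasr";
   performance host="lasr-head.example.com";
run;

proc lasr add data=hdp.web_transactions port=10010;
   performance host="lasr-head.example.com";
run;

/* Option 2: read a table already stored in SASHDAT, which LASR Server    */
/* can load in parallel, far faster than going through Hive.              */
libname hdat sashdat path="/user/sasdemo/hdat";  /* other connection options depend on the site */

proc lasr add data=hdat.web_transactions port=10010;
   performance host="lasr-head.example.com";
run;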
Base SAS offers basic data cleansing and transformation capability through DATA Step and DS2 processing, and it can write the SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, such transformations may require extracting the data and moving it to an external SAS server. Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to transform the data in place. Users can also license SAS ETL Server and build a process that converts raw data and stores it in SASHDAT.
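As a rough illustration of the options above, the sketch below shows (a) a DATA Step that cleans data on a SAS server and lands the result in SASHDAT, and (b) explicit SQL pass-through that asks Hive to do an equivalent transformation in place. The librefs, table names and cleansing logic are hypothetical, and connection options depend on the installation.

/* (a) Transform with a DATA Step on a SAS server, writing the result     */
/*     to SASHDAT so LASR Server can load it quickly later.               */
libname hdp  hadoop  server="hive.example.com" port=10000 user=sasdemo;
libname hdat sashdat path="/user/sasdemo/hdat";

data hdat.clean_transactions;
   set hdp.raw_transactions;
   where amount > 0 and not missing(customer_id);
   amount_usd = round(amount / fx_rate, 0.01);
run;

/* (b) Push the same transformation down to Hive so the work stays        */
/*     inside Hadoop instead of moving data to a SAS server.               */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000 user=sasdemo);
   execute (
      create table clean_transactions as
      select customer_id, round(amount / fx_rate, 2) as amount_usd
      from raw_transactions
      where amount > 0 and customer_id is not null
   ) by hadoop;
   disconnect from hadoop;
quit;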
SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.
(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?
At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.
Paul notes that given a fixed number of TaskTracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workloads. This issue is not unique to SAS; it potentially applies to any software co-located with Hadoop prior to the introduction of YARN.
Under Hadoop 1.0, workload management was tightly married to MapReduce, so applications operating independently of MapReduce (like SAS) were essentially ungoverned. The introduction of YARN late last year addresses this issue by providing unified resource management for MapReduce and non-MapReduce applications alike.
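For readers unfamiliar with YARN, the fragment below is a generic capacity-scheduler.xml sketch showing how an administrator can split cluster capacity between a MapReduce queue and a queue reserved for co-located analytic engines. The queue names and percentages are arbitrary, and whether a particular SAS release actually runs its LASR workload under YARN depends on the version and deployment.

<!-- capacity-scheduler.xml (sketch): partition cluster capacity between -->
<!-- MapReduce jobs and a queue reserved for other frameworks.           -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>mapreduce,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.mapreduce.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>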
(3) Can SAS LASR Server run on standard commodity hardware?
SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.
While the definition of a “standard” Hadoop DataNode server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128GB of RAM; sources at Cloudera confirm this is a typical configuration. A recently published paper from HP and Hortonworks positions the reference spec at 96GB of RAM for memory-intensive applications.
In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256GB of RAM.
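To see why the memory spec matters, here is a back-of-the-envelope capacity sketch; the node count and overhead factor are my assumptions, not guidance from SAS or HP, and real sizing should be done with SAS and its hardware partners as noted above.

/* Rough estimate of cluster memory available for in-memory tables.       */
/* All figures below are illustrative assumptions, not vendor guidance.   */
data _null_;
   nodes           = 20;    /* assumed worker nodes in the cluster        */
   ram_per_node_gb = 256;   /* HP's recommended minimum for LASR Server   */
   reserved_pct    = 0.25;  /* assume ~25% held back for the OS, Hadoop   */
                            /* daemons and working space                  */
   usable_gb = nodes * ram_per_node_gb * (1 - reserved_pct);
   put "Approximate memory available for in-memory tables: " usable_gb "GB";
run;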
It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations planning to use Spark or Impala should also specify additional RAM. While some prospective customers may balk at upgrading memory in every DataNode of a large cluster, memory prices continue to fall, so this should not be an issue in the long run.