Fact-Check: SAS and Greenplum
Does SAS run “inside” Greenplum? Can existing SAS programs run faster in Greenplum without modification? Clients say that their EMC rep makes such claims.
The first claim rests on confusion about EMC Greenplum’s product line. It’s important to distinguish between Greenplum Database and Greenplum DCA. Greenplum DCA is a rack of commodity blade servers which can be configured with Greenplum Database running on some of the blades and SAS running on the other blades. For most customers, a single DCA blade provides insufficient computing power to support SAS, so EMC and SAS typically recommend deployment on multiple blades, with SAS Grid Manager implemented for workload management. This architecture is illustrated in this white paper on SAS’ website.
As EMC’s reference architecture clearly illustrates, SAS does not run “inside” Greenplum Database (or any other database); it simply runs on server blades that are co-located in the same physical rack as the database. The SAS instance installed on the DCA rack works just like any other SAS instance installed on freestanding servers. SAS interfaces with Greenplum Database through a SAS/ACCESS interface, which is exactly the same way that SAS interacts with other databases.
Does co-locating SAS and the database in the same rack offer any benefits? Yes, because when data moves back and forth between SAS and Greenplum Database, it does so over a dedicated 10 Gigabit Ethernet connection. However, this benefit is not unique: customers can implement a similar high-speed connection between a free-standing instance of SAS and any data warehouse appliance, such as IBM Netezza.
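For a rough sense of scale, the sketch below computes a lower bound on transfer time. It assumes the link is 10 Gigabit Ethernet (10 Gbit/s) running at full line rate with no protocol overhead. The function name and the sample extract sizes are illustrative assumptions, not figures from EMC or SAS.

```python
def transfer_seconds(data_gigabytes: float, link_gigabits_per_s: float = 10.0) -> float:
    """Lower bound on transfer time at full line rate (no protocol overhead)."""
    data_gigabits = data_gigabytes * 8  # 8 bits per byte
    return data_gigabits / link_gigabits_per_s


# Hypothetical extract sizes, in gigabytes (decimal units).
for size_gb in (10, 100, 1000):
    seconds = transfer_seconds(size_gb)
    print(f"{size_gb} GB extract: at least {seconds:.0f} seconds on the wire")
```

Under these assumptions a 1,000 GB extract needs at least 800 seconds just to cross the link, and real throughput will be lower. The same floor applies to any comparable link between SAS and another database.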
To summarize, SAS does not run “inside” Greenplum Database or any other database; moreover, SAS’ interface with Greenplum is virtually the same as SAS’ interface with any other supported database. EMC offers customers the ability to co-locate SAS in the same rack of servers as the Greenplum Database, which expedites data movement between SAS and the database, but this is a capability that can be replicated cheaply in other ways.
The second claim — that SAS programs run faster in Greenplum DCA without modification — requires more complex analysis. For starters, though, keep in mind that SAS programs always require at least some modification when moved from one SAS instance to another, if only to update SAS libraries and adjust for platform-specific options. Those modifications are small, however, so let’s set them aside and grant EMC some latitude for sales hyperbole.
To understand how existing SAS programs will perform inside DCA, we need to consider the building blocks of those existing programs:
- SAS DATA Steps
- SAS PROC SQL
- SAS Database-Enabled PROCs
- SAS Analytic PROCs (PROC LOGISTIC, PROC REG, and so forth)
Here’s how SAS will handle each of these workloads within DCA:
(1) SAS DATA Steps: SAS attempts to translate SAS DATA Step statements into SQL. When this translation succeeds, SAS submits the SQL expression to Greenplum Database, which runs the query and returns the result set to SAS. Since SAS DATA Step programming includes many concepts that do not translate well to SQL, in most cases SAS will extract all required data from the database and run the required operations as a single-threaded process on one of the SAS nodes.
(2) SAS PROC SQL: SAS submits the embedded SQL to Greenplum Database, which runs the query and returns the result set to SAS. The SAS user must verify that the embedded SQL expression is syntactically correct for Greenplum.
(3) SAS Database-Enabled PROCs: SAS converts the user request to database-specific SQL and submits it to Greenplum Database, which runs the query and returns the result set to SAS.
(4) SAS Analytic PROCs: In most cases, SAS runs the PROC on one of the server blades. A limited number of SAS PROCs are automatically enabled for Grid Computing; these PROCs will run multi-threaded.
In each case, the SAS workload runs in the same way inside DCA as it would if implemented in a free-standing SAS instance with comparable computing power. Existing SAS programs are not automatically enabled to leverage Greenplum’s parallel processing; the SAS user must explicitly modify the SAS program to exploit Greenplum Database just as they would when using SAS with other databases.
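The gap between extracting rows and pushing work into the database can be sketched in a few lines. The example below uses Python's built-in sqlite3 module as a stand-in for the database. This is an assumption for illustration only: it is neither Greenplum nor SAS, and the sales table and its contents are hypothetical. Pattern A mirrors the "extract everything, then compute on the client" path; Pattern B mirrors a program that has been explicitly rewritten so the database does the aggregation.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 5.0), ("west", 2.5)],
)

# Pattern A: pull every row to the client, then aggregate in a single loop.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Pattern B: push the aggregation into the database; only results come back.
pushed = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
db_totals = dict(pushed)

print("rows moved, pattern A:", len(rows))
print("rows moved, pattern B:", len(pushed))
print("same answer:", client_totals == db_totals)
conn.close()
```

Both patterns produce the same totals, but Pattern A moves every row across the connection while Pattern B moves one row per group. Nothing converts A into B automatically; someone has to rewrite the program, whichever database sits on the other end.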
So, returning to the question: will existing SAS programs run faster in Greenplum DCA without modification? Setting aside minor changes when moving any SAS program, the performance of existing programs when run in DCA will be no better than what would be achieved when SAS is deployed on competing hardware with comparable computing specifications.
SAS users can only realize radical performance improvements when they explicitly modify their programs to take advantage of in-database processing. Greenplum has no special advantage in this regard; conversion effort is similar for all databases supported by SAS.