How to Buy SAS Visual Analytics

Stories about SAS Visual Analytics are among the most widely read posts on this blog.  In the last two years I’ve received many queries from readers who complain that it’s hard to get clear answers about the software from SAS.

In software procurement, the customer has bargaining power until the deal closes; after that, power shifts to the vendor.   In this post, I’ve compiled some key questions prospective customers should resolve before signing a license agreement with SAS.

SAS Visual Analytics (VA), first launched in 2012, is now in its seventh dot release.  With a total of ~3,400 sites licensed, the most serious early release issues are resolved.  The product itself has improved.  In early releases, for example, it was impossible to join tables after loading them into VA; now you can.  SAS has gradually added features to the product, and will continue to do so.

Privately, SAS account executives describe VA as a “Tableau-Killer”; a more apt description is “Tableau for SAS Lovers.”   An experienced Tableau user will immediately notice features missing from VA.  On the other hand, SAS offers some statistical features (SAS Visual Statistics) not currently available in Tableau, for an extra license fee.

As this chart shows, Tableau is still alive:

[Chart: SAS VA vs. Tableau]

Source: Tableau Annual Report; SAS revenue press release

SAS positions VA to its existing BI customers as a replacement product, and not a moment too soon; Gartner reports that organizations are rapidly pulling the plug on the legacy SAS BI product.  SAS prices VA to sell, clearly seeking to underprice Tableau and build a footprint.  Ordinarily, SAS pricing is a closely held secret, but SAS discloses its low VA pricing in the latest Gartner BI Magic Quadrant report.

Is VA the Right Solution?

VA works with SAS LASR Server, a proprietary in-memory analytic datastore, which should not be confused with in-memory databases like SAP HANA, Exasol or MemSQL.   In-memory databases have many features that are missing from LASR Server, such as ACID compliance, ANSI SQL engines and automated archiving.  Most in-memory databases can update data in real time; for LASR Server, you update a table by reloading it.  Commercial in-memory databases support many different end-user products for visualization and BI, so you aren’t locked in with a single vendor.  LASR Server supports SAS software only.

Like any other in-memory datastore, LASR Server is best for small high-value databases that will be queried by many users who require low latency.  LASR Server reads an entire table into memory and persists it there, so the amount of available memory is a limiting factor.

Since LASR Server is a distributed engine, you can add more servers if you need more memory.  But keep in mind that while the cost of memory is declining, it is not free; it is still quite expensive per byte compared to disk storage.  In practice, most working in-memory databases support less than a terabyte of data.  By contrast, the smallest data warehouse appliances sold by vendors like IBM support thirty terabytes.

LASR Server’s principal selling point is speed.  The product is fast because it persists data in memory, decoupling disk I/O from the user experience.  (You still need to load data into LASR Server, but you can do this separately, when the user isn’t waiting for a response.)

In contrast, Tableau uses a patented (i.e., proprietary) data engine that interfaces with your data source.  For extracts not already cached on the server, Tableau submits a query whose runtime depends on the data source; if the supporting database is poorly tuned, the query may take a long time to run.  In most cases, VA will be faster than Tableau, but it’s debatable how critical this is for a decision support application.

VA and LASR Server are the right solution for your business problem if all of the following conditions are true:

  • You work with less than a terabyte of data
  • You are willing to limit your visualization and BI tools to SAS software
  • You expect more than a handful of concurrent users
  • Your users require subsecond query response times

If you are thinking of using VA and LASR Server in distributed mode (implemented across more than one server), keep in mind that distributed computing is an order of magnitude more difficult to deliver.  Since SAS pitches a low-cost “Single Box Solution” as an entry-level product, most of those 3,400 customer sites run on a single server.  Before you commit to licensing the product in a multi-server configuration, you should insist on additional proof of product viability from SAS.  For example, insist on references from customers running in production in configurations at least as large as what you have in mind; and consider a full proof-of-concept (funded by SAS).

SAS’ low software pricing for VA makes it seem attractive.  However, you need to focus on the total cost of ownership, which we discuss below.

Infrastructure Costs

According to SAS’ sizing guidelines for VA, a single 16-CPU server with 256G RAM can support a 20GB table with seven heavy users.  (That’s 20 gigabytes of uncompressed data.)

For a rough estimate of the amount of hardware required:

  1. Determine the size of the largest table you plan to load
  2. Determine the total amount of data you plan to load
  3. Determine the planned number of “heavy” and “light” users.  SAS defines a heavy user as “any SAS Visual Analytics Explorer user or a user who runs correlational analysis with multiple variables, box plots with four or more measures, or crosstabs with four or more class variables.”  In practice, this means every user.

In Step #4, you write a large check to your preferred hardware vendor, unless you are working with tiny data.
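
If you want to put rough numbers on that before talking to a vendor, here is a minimal back-of-envelope sketch in Python.  It simply extrapolates from the single SAS guideline quoted above (one 16-CPU, 256G node for roughly a 20GB uncompressed table and seven heavy users); the per-node ratios and the node-count logic are my assumptions, not a SAS formula, so treat the output as a starting point for the sizing conversation, not a quote.

    import math

    # Rough LASR Server sizing, extrapolated from the SAS guideline above:
    # one node (16 CPUs, 256G RAM) handles ~20GB of uncompressed data and ~7 heavy users.
    # These ratios are assumptions, not a published SAS formula.
    DATA_PER_NODE_GB = 20.0
    HEAVY_USERS_PER_NODE = 7.0

    def estimate_nodes(total_data_gb, heavy_users):
        """Return a rough node count -- a conversation starter, not a quote."""
        nodes_for_data = total_data_gb / DATA_PER_NODE_GB
        nodes_for_users = heavy_users / HEAVY_USERS_PER_NODE
        return max(1, math.ceil(max(nodes_for_data, nodes_for_users)))

    # Example: 200GB of uncompressed data and 40 heavy users -> 10 nodes at 256G RAM each.
    print(estimate_nodes(total_data_gb=200, heavy_users=40))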

SAS will tell you that VA runs on commodity servers.  That is technically true, but a little misleading.  SAS does not require you to buy your servers from any specific vendor; however, the specs needed for good performance are quite different from a typical Hadoop node server.  Not surprisingly, VA requires specially configured high-memory machines, such as these from HP.

[Image: HP server configurations for SAS Visual Analytics]

Node servers are just the beginning of the story. According to an HP engineer with extensive VA experience, networking is a key bottleneck in implementations.  Before you sign a license agreement for VA, check with your preferred hardware vendor to determine how much experience they have with the product.  Ask them to provide a firm quote for all of the necessary hardware, and a firm schedule for delivery and installation.

Keep in mind that SAS does not actually recommend hardware for any of its software.  While SAS will work with you to estimate volume and workload, it passes that information to the hardware vendors you specify, who produce the actual sizing and configuration recommendations.  Your hardware vendor plays a key role in the success of your implementation, so it’s important to choose a vendor with significant experience with this software.

Implementation

SAS publishes most of its documentation on its support website.  For VA, however, SAS keeps technical documentation for installation, configuration and administration under lock and key.  The implication is that it’s not pretty.  Before you sign a license agreement, you should insist that SAS provide the documentation for your team to review.

There is more to implementing this product than software installation.  Did you notice the fine print in SAS’ Hardware Sizing Guidelines?  I quote:

“These guidelines do not address the data management resources needed outside of SAS Visual Analytics.  Getting data into SAS Visual Analytics and performing other ETL functions are solely the responsibility of the user.”  

VA’s native capabilities for data cleansing and transformation have improved since the first release, but they are still rudimentary.  So unless your source data is perfectly clean and ready to use — ha ha — you’re going to need ETL processes to prepare your data.  Unless your prospective users are ETL experts, they will need someone to build those feeds; and unless you have SAS developers sitting on the bench, you’re going to need SAS or a SAS Partner to provide developers who can do the job.

If you are thinking about licensing VA, you are almost certainly using legacy SAS products already.  You may think that will make implementation easier, but think again: VA and LASR Server are fundamentally new products with a new architecture.  Your SAS users and developers will all need training.  Moreover, your existing SAS programs may need conversion to work with the new software.

Before you sign a license agreement for VA, insist on a firm, fixed price quote from SAS for all implementation tasks, including data feeds.  Your SAS Account Executive will tell you that SAS “does not do” fixed price quotes.  Nonsense.  SAS will happily give away consulting services if they can win your software business, so don’t take “no” for an answer.

SAS will need to do an assessment, of course, before fixing the price, which is fine as long as you don’t have to pay for it.

Time to Value

When SAS first released VA, implementations ran around three months under ideal circumstances.  Many ran much longer, due to unanticipated issues with networking and infrastructure.  With more experience, SAS has a better understanding of the product’s infrastructure requirements, and can set expectations accordingly.

Nevertheless, there is no reason for you to assume the risk of delay getting the product into production.  SAS charges you for a license to use the software from the moment you sign the contract; if the implementation project runs long, it’s on your dime.

You should insist on a firm contractual commitment from SAS to get the software up and running by a date certain, with financial penalties for failure to deliver.  It’s unlikely that SAS will agree to deferred payment of the first-year fee, or an acceptance deal, since this impacts revenue recognition.  But you should be able to negotiate an extended renewal anniversary based on the date of delivery and acceptance.  You can also negotiate deferred payment of the fixed price consulting fee.

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First, some definitions.  The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I’ve written previously about the importance of distributed computing for high-performance predictive analytics, why it’s difficult to deliver and potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor’s data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster’s built-in analytic functions).  While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in.  In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they’re effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H2O is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters. Current functionality (Release 2.8.4.4) includes Cox Proportional Hazards modeling, Deep Learning, generalized linear models, gradient boosted classification and regression, k-Means clustering, Naive Bayes classifier, principal components analysis, and Random Forests. The software also includes tooling for data transformation, model assessment and scoring.   Users interact with the software through a web interface, a REST API or the h2o package in R.  H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.
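
For a sense of the scripting interfaces, here is a minimal sketch that fits a generalized linear model with the h2o Python client.  In the 2.8.x release described above, scripting runs mainly through the R package and the REST API, so treat this as illustrative of the current Python client rather than a version-specific recipe; the file path and column names are hypothetical.

    import h2o
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    h2o.init()  # connect to a running H2O cluster, or start a local one

    # Hypothetical dataset and columns, for illustration only
    churn = h2o.import_file("hdfs://namenode/data/churn.csv")
    train, valid = churn.split_frame(ratios=[0.8], seed=42)

    model = H2OGeneralizedLinearEstimator(family="binomial")
    model.train(x=["tenure", "monthly_charges"], y="churned",
                training_frame=train, validation_frame=valid)

    print(model.auc(valid=True))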

H2O.ai provides commercial support for the open source software.  There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, Paypal and Nielsen.

MADlib

Distributor: Pivotal Software

MADlib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7).  While it is primarily a captive project of Pivotal Software (most of the top contributors are Pivotal or EMC employees), the PostgreSQL support qualifies it for this list.  MADlib offers rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.
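
Because the interface is plain SQL, MADlib functions can be invoked from any client that can submit a query.  Here is a minimal sketch that calls madlib.linregr_train from Python via psycopg2; the connection details, table and columns are hypothetical, and the example assumes MADlib is already installed in the target database.

    import psycopg2

    # Hypothetical connection; adjust for your Greenplum or PostgreSQL environment
    conn = psycopg2.connect(host="localhost", dbname="analytics", user="gpadmin")
    cur = conn.cursor()

    # Train a linear regression in-database; 'houses' and its columns are hypothetical
    cur.execute("""
        SELECT madlib.linregr_train(
            'houses',                    -- source table
            'houses_linregr',            -- output table for the fitted model
            'price',                     -- dependent variable
            'ARRAY[1, size, bedrooms]'   -- independent variables (1 = intercept)
        );
    """)
    conn.commit()

    cur.execute("SELECT coef, r2 FROM houses_linregr;")
    print(cur.fetchone())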

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project, launched in 2008 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community.  The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago.   Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce.  To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark.  For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that’s worth).  Several commercial vendors, including Predixion Software and RapidMiner, leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics.  Spark is a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics; currently (Release 1.2) it supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3.  Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLlib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers.  Thanks to a large and growing contributor community, Spark MLlib’s functionality is expanding faster than that of any other open source or commercial software listed in this article.
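
To illustrate the programming model, here is a minimal PySpark sketch that trains a k-means model with the RDD-based MLlib API current in Release 1.2; the HDFS path and file layout are hypothetical.

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="mllib-kmeans-sketch")

    # Hypothetical input: one comma-separated row of numeric features per line
    lines = sc.textFile("hdfs:///data/features.csv")
    vectors = lines.map(lambda line: [float(x) for x in line.split(",")])

    # Train k-means; Spark distributes the work across the cluster
    model = KMeans.train(vectors, k=5, maxIterations=20)
    print(model.clusterCenters)

    sc.stop()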

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases.  Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum Database.  (Alpine and Greenplum have common roots in the EMC ecosystem.)   Usability is the product’s key selling point, and the analytic feature set is relatively modest; however, Chorus’ collaboration and data cataloguing capabilities are unique.  Alpine’s customer list is growing; it does not yet include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database.  Embedded in SQL, analytics may be invoked from a range of applications, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS.   The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately).  The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes.  Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees.  The cadence of enhancements for this product is very slow; the product first shipped in May 2013, and IBM has delivered a single maintenance release since then.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts.  For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, Naive Bayes classifier and two Spark MLlib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and Datastax Enterprise.  As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components, including Revolution R, an enhanced and commercially supported R distribution, a Windows IDE, integration tools and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface.  A little over a year ago, Revolution released its version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce.   The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering.   Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC.  Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks).  In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated “Analysis” nodes within the Hadoop cluster.)  While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users typically access the algorithms through SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December 2011 with great fanfare.  To date the product lacks a single public reference customer; this could mean that SAS’ Marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between.  As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA’s inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture. Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA.  SAS prices this software based on the number of cores in the servers on which it is deployed; while I can’t disclose the list price per core, it’s fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first-year fee.

Skytree Infinity

Vendor: Skytree

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013.  Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark.  Distributed algorithms can run as a free-standing cluster or co-located in Hadoop under YARN.  The product has a programming interface; the vendor claims the ability to run it from R, Weka, C++ and Python.   Neither Skytree’s modest list of algorithms nor its short list of public reference customers has changed in the past two years.

SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures:

(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.

(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.

SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post.  At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server.   Paul was most generous with his time.  We discussed three areas:

(1) Can SAS LASR Server work directly with data in Hadoop?

According to SAS documentation, LASR Server can read data from traditional SAS datasets, relational databases (using SAS/Access Software) or data stored in SAS’ proprietary SASHDAT format.   That suggests SAS users must preprocess Hadoop data before loading it into LASR Server.

Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database. (Of course, this applies to structured data only). Reading from SASHDAT is much faster, however, so users should consider the tradeoff between the time needed to pre-process data into SASHDAT versus the runtime with SAS/ACCESS.

SAS/ACCESS Interface to Hadoop can read all widely used Hadoop data formats, including ORC, Parquet and Tab-Delimited; it can also read user-defined formats.  This builds on SAS’ long-standing ability to work with enterprise data everywhere.

Base SAS supports basic data cleansing and data transformation capability through DATA Step and DS2 processing, and can write SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, this transformation could require extracting the data and moving it to an external server.   Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to perform data transformation in place.   Users can also license SAS ETL Server and build a process to convert raw data and store it in SASHDAT.

SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.

(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?

At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.

Paul notes that given a fixed number of task tracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workload.  This issue is not unique to SAS, but potentially applies to any software co-located with Hadoop prior to the introduction of YARN.

Under Hadoop 1.0, Hadoop workload management was tightly married to MapReduce.  Applications operating independently from MapReduce (like SAS) were essentially ungoverned.  The introduction of YARN late last year eliminates this issue because it supports unified workload management for MapReduce and non-MapReduce applications.

(3) Can SAS LASR Server run on standard commodity hardware?

SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.

While the definition of a “standard” Hadoop DataNode server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128G RAM; sources at Cloudera confirm this is a typical configuration.   A recently published paper from HP and Hortonworks positions the reference spec at 96G RAM for memory-intensive applications.

In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256G RAM.

It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations seeking to use Spark or Impala should specify more memory.   While some prospective customers may balk at the task of upgrading memory in every DataNode of a large cluster, the cost of memory is coming down, so this should not be an issue in the long run.