How to Buy SAS Visual Analytics

Stories about SAS Visual Analytics are among the most widely read posts on this blog.  In the last two years I’ve received many queries from readers who complain that it’s hard to get clear answers about the software from SAS.

In software procurement, the customer has bargaining power until the deal closes; after that, power shifts to the vendor.   In this post, I’ve compiled some key questions prospective customers should resolve before signing a license agreement with SAS.

SAS Visual Analytics (VA), first launched in 2012, is now in its seventh dot release.  With a total of ~3,400 sites licensed, the most serious early release issues are resolved.  The product itself has improved.  In early releases, for example, it was impossible to join tables after loading them into VA; now you can.  SAS has gradually added features to the product, and will continue to do so.

Privately, SAS account executives describe VA as a “Tableau-Killer”; a more apt description is “Tableau for SAS Lovers.”   An experienced Tableau user will immediately notice features missing from VA.  On the other hand, SAS offers some statistical features (SAS Visual Statistics) not currently available in Tableau, for an extra license fee.

As this chart shows, Tableau is still alive:

[Chart: SAS VA vs. Tableau revenue]

Source: Tableau Annual Report; SAS Revenue Press Release

SAS positions VA to its existing BI customers as a replacement product, and not a moment too soon; Gartner reports that organizations are rapidly pulling the plug on the legacy SAS BI product.  SAS prices VA to sell, clearly seeking to underprice Tableau and build a footprint.  Ordinarily, SAS pricing is a closely held secret, but SAS discloses its low VA pricing in the latest Gartner BI Magic Quadrant report.

Is VA the Right Solution?

VA works with SAS LASR Server, a proprietary in-memory analytic datastore, which should not be confused with in-memory databases like SAP HANA, Exasol or MemSQL.   In-memory databases have many features that are missing from LASR Server, such as ACID compliance, ANSI SQL engines and automated archiving.  Most in-memory databases can update data in real time; for LASR Server, you update a table by reloading it.  Commercial in-memory databases support many different end-user products for visualization and BI, so you aren’t locked in with a single vendor.  LASR Server supports SAS software only.

Like any other in-memory datastore, LASR Server is best for small high-value databases that will be queried by many users who require low latency.  LASR Server reads an entire table into memory and persists it there, so the amount of available memory is a limiting factor.

Since LASR Server is a distributed engine, you can add more servers if you need more memory.  But keep in mind that while the cost of memory is declining, it is not free; it is still quite expensive per byte compared to disk storage.  In practice, most working in-memory databases support less than a terabyte of data.  By contrast, the smallest data warehouse appliances sold by vendors like IBM support thirty terabytes.

LASR Server’s principal selling point is speed.  The product is fast because it persists data in memory, decoupling the disk I/O bottleneck from the user experience.  (You still need to load data into LASR Server, but you can do this separately, when the user isn’t waiting for a response.)

In contrast, Tableau uses a patented (i.e., proprietary) data engine that interfaces with your data source.  For extracts not already cached on the server, Tableau submits a query whose runtime depends on the data source; if the supporting database is poorly tuned, the query may take a long time to run.  In most cases, VA will be faster than Tableau, but it’s debatable how critical this is for a decision support application.

VA and LASR Server are the right solution for your business problem if all of the following conditions are true:

  • You work with less than a terabyte of data
  • You are willing to limit your visualization and BI tools to SAS software
  • You expect more than a handful of concurrent users
  • Your users require subsecond query response times

If you are thinking of using VA and LASR Server in distributed mode (implemented across more than one server), keep in mind that distributed computing is an order of magnitude more difficult to deliver.  Since SAS pitches a low-cost “Single Box Solution” as an entry-level product, most of those 3,400 customer sites run on a single server.  Before you commit to licensing the product in a multi-server configuration, you should insist on additional proof of product viability from SAS.  For example, insist on references from customers running in production in configurations at least as large as what you have in mind; and consider a full proof-of-concept (funded by SAS).

SAS’ low software pricing for VA makes it seem attractive.  However, you need to focus on the total cost of ownership, which we discuss below.

Infrastructure Costs

According to SAS’ sizing guidelines for VA, a single 16-CPU server with 256GB RAM can support a 20GB table with seven heavy users.  (That’s 20 gigabytes of uncompressed data.)

For a rough estimate of the amount of hardware required:

  1. Determine the size of the largest table you plan to load
  2. Determine the total amount of data you plan to load
  3. Determine the planned number of “heavy” and “light” users.  SAS defines a heavy user as “any SAS Visual Analytics Explorer user or a user who runs correlational analysis with multiple variables, box plots with four or more measures, or crosstabs with four or more class variables.”  In practice, this means every user.

In Step #4, you write a large check to your preferred hardware vendor, unless you are working with tiny data.
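If you want a rough feel for the scale before the check-writing stage, the arithmetic is simple enough to sketch in a DATA step.  This is a back-of-envelope illustration only: the overhead ratio is just SAS’ own example restated (256GB of RAM supporting a 20GB table), and the data volume is hypothetical.  Your hardware vendor’s sizing governs, not this.

/* Back-of-envelope LASR Server sizing sketch.  The overhead ratio
   restates SAS' published example (256GB RAM per 20GB uncompressed
   table); the data volume below is hypothetical.  This is not a SAS
   formula; get a real sizing from your hardware vendor. */
data _null_;
   total_data_gb   = 500;     /* total uncompressed data to load */
   ram_per_node_gb = 256;     /* memory per node server          */
   overhead_ratio  = 256/20;  /* = 12.8, from SAS' example       */
   ram_needed_gb   = total_data_gb * overhead_ratio;
   nodes_needed    = ceil(ram_needed_gb / ram_per_node_gb);
   put "Estimated RAM required: " ram_needed_gb "GB";
   put "Estimated 256GB nodes:  " nodes_needed;
run;

For 500GB of data, this sketch yields 6,400GB of RAM across 25 nodes, which is why the check tends to be large.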

SAS will tell you that VA runs on commodity servers.  That is technically true, but a little misleading.  SAS does not require you to buy your servers from any specific vendor; however, the specs needed for good performance are quite different from a typical Hadoop node server.  Not surprisingly, VA requires specially configured high-memory machines, such as these from HP.

[Image: HP server configurations for SAS Visual Analytics]

Node servers are just the beginning of the story. According to an HP engineer with extensive VA experience, networking is a key bottleneck in implementations.  Before you sign a license agreement for VA, check with your preferred hardware vendor to determine how much experience they have with the product.  Ask them to provide a firm quote for all of the necessary hardware, and a firm schedule for delivery and installation.

Keep in mind that SAS does not actually recommend hardware for any of its software.  While SAS will work with you to estimate volume and workload, it passes this information to the hardware vendors you specify for the actual recommended sizing and configuration.  Your hardware vendor plays a key role in the success of your implementation of this product, so it’s important that you choose a vendor that has significant experience with this software.

Implementation

SAS publishes most of its documentation on its support website.  For VA, however, SAS keeps technical documentation for installation, configuration and administration under lock and key.  The implication is that it’s not pretty.  Before you sign a license agreement, you should insist that SAS provide the documentation for your team to review.

There is more to implementing this product than software installation.  Did you notice the fine print in SAS’ Hardware Sizing Guidelines?  I quote:

“These guidelines do not address the data management resources needed outside of SAS Visual Analytics.  Getting data into SAS Visual Analytics and performing other ETL functions are solely the responsibility of the user.”  

VA’s native capabilities for data cleansing and transformation have improved since the first release, but they are still rudimentary.  So unless your source data is perfectly clean and ready to use — ha ha — you’re going to need ETL processes to prepare your data.  Unless your prospective users are ETL experts, they will need someone to build those feeds; and unless you have SAS developers sitting on the bench, you’re going to need SAS or a SAS Partner to provide developers who can do the job.

If you are thinking about licensing VA, you are almost certainly using legacy SAS products already.  You may think that will make implementation easier, but think again: VA and LASR Server are fundamentally new products with a new architecture.  Your SAS users and developers will all need training.  Moreover, your existing SAS programs may need conversion to work with the new software.

Before you sign a license agreement for VA, insist on a firm, fixed price quote from SAS for all implementation tasks, including data feeds.  Your SAS Account Executive will tell you that SAS “does not do” fixed price quotes.  Nonsense.  SAS will happily give away consulting services if they can win your software business, so don’t take “no” for an answer.

SAS will need to do an assessment, of course, before fixing the price, which is fine as long as you don’t have to pay for it.

Time to Value

When SAS first released VA, implementations ran around three months under ideal circumstances.  Many ran much longer, due to unanticipated issues with networking and infrastructure.  With more experience, SAS has a better understanding of the product’s infrastructure requirements, and can set expectations accordingly.

Nevertheless, there is no reason for you to assume the risk of delay getting the product into production.  SAS charges you for a license to use the software from the moment you sign the contract; if the implementation project runs long, it’s on your dime.

You should insist on a firm contractual commitment from SAS to get the software up and running by a date certain, with financial penalties for failure to deliver.  It’s unlikely that SAS will agree to deferred payment of the first-year fee, or an acceptance deal, since this impacts revenue recognition.  But you should be able to negotiate an extended renewal anniversary based on the date of delivery and acceptance.  You can also negotiate deferred payment of the fixed price consulting fee.

Big Analytics Roundup (March 23, 2015)

This week, Spark Summit East produced a deluge of news and analysis on Apache Spark and Databricks.  Also in the news: a couple of ventures landed funding, SAP released software and SAS soft-launched something new for SAS Visual Analytics.

Analytic Startups

Venture Capital Dispatch on WSJ.D reports that Andreessen Horowitz has invested $7.5 million in AMPLab spinout Tachyon Nexus.  Tachyon Nexus supports the eponymous Tachyon project, a memory-centric storage layer that runs underneath Apache Spark or independently.

Social media mining venture Dataminr pulls $130 million in “D” round financing, demonstrating that the real money in analytics is in applications, not algorithms.

Apache Flink

On the Flink project blog, Fabian Hueske posts an excellent article that describes how joins work in Flink.

Apache Spark

ADTMag rehashes the tired debate about whether Spark and Hadoop are “friends” or “foes”.  Sounds like teens whispering in the hallways of Silicon Valley High.  Spark works with HDFS, and it works with other datastores; it all depends on your use case.  If that means a little less buzz for Hadoop purists, get over it.

To that point, Matt Kalan explains how to use Spark with MongoDB on the Databricks blog.

A paper published by a team at Berkeley summarizes results from Spark benchmark testing and draws surprising conclusions.

In other commentary about Spark:

  • TechCrunch reports on the growth of Spark.
  • TechRepublic wonders if anything can dim Spark.
  • InfoWorld lists five reasons to use Spark for Big Data.

In VentureBeat, Sharmila Mulligan relates how ClearStory Data’s big bet on Spark paid off, without explaining the nature of the payoff.  ClearStory has a nice product, but it seems a bit too early for a victory lap.

On the Spark blog, Justin Kestelyn describes exactly-once Spark Streaming with Apache Kafka, a new feature in Spark 1.3.

Databricks

Doug Henschen chides Ion Stoica for plugging Databricks Cloud at Spark Summit East, hinting darkly that some Big Data vendors are threatened by Spark and trying to plant FUD about it.  Vendors planting FUD about competitors that threaten them: who knew that people did such things?  It’s not clear what revenue model Henschen thinks Databricks should pursue; as Hortonworks’ numbers show, “contributing to open source” alone is not a viable business model.  If those Big Data vendors are unhappy that Databricks Cloud competes with what they offer, there is nothing to stop them from embracing Spark and standing up their own cloud service.

In other news:

  • On the Databricks blog, the folks from Uncharted Software describe PanTera, cool visualization software that runs in Databricks Cloud.
  • Rob Marvin of SD Times rounds up new product announcements from Spark Summit East.
  • In PCWorld, Joab Jackson touts the benefits of Databricks Cloud.
  • ConsumerElectronicsNet recaps Databricks’ announcement of the Jobs feature for Databricks Cloud, plus other news from Spark Summit East.
  • On ZDNet, Toby Wolpe reviews the new Jobs feature for production workloads in Databricks Cloud.
  • On the Databricks blog, Abi Mehta announces that Tresata’s TEAK application for AML will be implemented on Databricks Cloud.  Media coverage here, here and here.

Geospatial

MemSQL announced geospatial capabilities for its distributed in-memory NewSQL database.

J. Andrew Rogers asks why geospatial databases are hard to build, then answers his own question.

RapidMiner

Butler Analytics publishes a favorable review of RapidMiner.

SAP

SAP released a new on-premises version of Lumira Edge for visualization, adding to the list of software that is not as good as Tableau.  SAP also released Predictive Analytics 2.0, which marries the toylike SAP Predictive Analytics with KXEN InfiniteInsight, a product acquired in 2013.  According to SAP, Predictive Analytics 2.0 is a “single, unified analytics product” with two work environments, which sounds like SAP has packaged two different code bases behind a common datastore.  Going for a “three-fer”, SAP adds Lumira Edge to the bundle as well.

SAS

American Banker reports that SAS has “launched” SAS Transaction Monitoring Optimization for AML scenario testing; in this case, “launch” means marketing collateral is available.  The product is said to run on top of SAS Visual Analytics, which itself runs on top of SAS LASR Server, SAS’ “other” distributed in-memory platform.

SAS Misses 2014 Growth Forecast

At the beginning of 2014, SAS EVP and CMO Jim Davis predicted double-digit revenue growth for 2014; in October, CEO Jim Goodnight walked that back to 5%, citing a challenging business climate in Europe.  Today, SAS announced 2014 revenue of $3.09 billion, up 2.3%.

Meanwhile, IBM reported growth in analytics revenue of 7% in Q4.

The challenge for SAS is that the US market is saturated: virtually every enterprise that ever will use SAS already does so, and there are limits to the number of new products one can add to the stack.  Much of SAS’ growth comes from overseas, and a strong dollar impairs SAS’ ability to sell in foreign markets.

On the positive side, SAS reports a total of 3,400 sites for SAS Visual Analytics, its “Tableau-killer”, compared to 1,400 sites announced last year, for net growth of 2,000 sites.  (In SAS’ parlance, a “site” is roughly equivalent to a server.)  Tableau has not yet released its 2014 results, but in Q3 Tableau reported that it added 2,500 customer accounts.

SAS also reports 24% revenue growth for its cloud services.   IT analyst Synergy Research Group reports that the cloud market is growing at a 49% annualized rate, although AWS, Microsoft, IBM and Google are all growing much faster than that.

In other news, the WSJ reports that Big Data analytics startup Palantir is now valued at $15 billion, which is about the same as what it would cost an acquirer to buy SAS at 5X revenue.

SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures:

(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.

(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.

SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post.  At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server.   Paul was most generous with his time.  We discussed three areas:

(1) Can SAS LASR Server work directly with data in Hadoop?

According to SAS documentation, LASR Server can read data from traditional SAS datasets, relational databases (using SAS/ACCESS software) or data stored in SAS’ proprietary SASHDAT format.   That suggests SAS users must preprocess Hadoop data before loading it into LASR Server.

Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database. (Of course, this applies to structured data only). Reading from SASHDAT is much faster, however, so users should consider the tradeoff between the time needed to pre-process data into SASHDAT versus the runtime with SAS/ACCESS.

SAS/ACCESS Interface to Hadoop can read all widely used Hadoop data formats, including ORC, Parquet and Tab-Delimited; it can also read user-defined formats.  This builds on SAS’ long-standing ability to work with enterprise data everywhere.

Base SAS supports basic data cleansing and data transformation capability through DATA Step and DS2 processing, and can write SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, this transformation could require extract and movement to an external server.   Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to perform data transformation in place.   Users can also license SAS ETL Server and build a process to convert raw data and store it in SASHDAT.

SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.
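To make the two load paths concrete, here is a hypothetical sketch.  The server name, port and paths are invented for illustration, and exact LIBNAME and PROC LASR options vary by release and deployment; treat this as a sketch of the pattern Paul describes, not canonical code.

/* Path 1: read HDFS data as a virtual relational table through
   SAS/ACCESS Interface to Hadoop, then load it into a running
   LASR Server instance.  Connection details are hypothetical. */
libname hdp hadoop server="hadoop01.example.com" user="sasdemo";

proc lasr add data=hdp.weblogs port=10010;
run;

/* Path 2: faster loads from SAS' proprietary SASHDAT format,
   which must have been written out by SAS beforehand.
   Distributed deployments need additional engine options. */
libname hdat sashdat path="/hps/marketing";

proc lasr add data=hdat.weblogs port=10010;
run;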

(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?

At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.

Paul notes that given a fixed number of task tracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workload.  This issue is not unique to SAS, but potentially applies to any software co-located with Hadoop prior to the introduction of YARN.

Under Hadoop 1.0, Hadoop workload management was tightly married to MapReduce.  Applications operating independently from MapReduce (like SAS) were essentially ungoverned.  The introduction of YARN late last year eliminates this issue because it supports unified workload management for MapReduce and non-MapReduce applications.

(3) Can SAS LASR Server run on standard commodity hardware?

SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.

While the definition of a “standard” Hadoop DataNode server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128GB RAM; sources at Cloudera confirm this is a typical configuration.   A recently published paper from HP and Hortonworks positions the reference spec at 96GB RAM for memory-intensive applications.

In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256GB RAM.

It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations seeking to use Spark or Impala should specify more memory.   While some prospective customers may balk at the task of upgrading memory in every DataNode of a large cluster, the cost of memory is coming down, so this should not be an issue in the long run.

SAS: 5% Revenue Growth in 2013

Today, SAS announced 2013 revenue of $3.02 billion, up 5.2% from 2012.  Reported revenue from “cloud-based” solutions grew by 20%; most of this revenue comes from SAS Solutions On Demand, a private hosting service.

SAS claims more than 1,400 “sites” for SAS Visual Analytics, an impressive figure but well short of SAS’ goal of 2,000 licenses in 2013.  (In SAS lingo, a “site” is a machine — customers have many sites).   Internally, SAS executives refer to Visual Analytics as its “Tableau killer”; Tableau hasn’t reported 2013 results yet, but Q3 revenue was up 90%.  SAS competitor IBM reports 2013 Business Analytics revenue up 9%; Smarter Planet revenue up 20%; Cloud revenue up 69%.

The SAS press release does not cite sales for SAS High Performance Analytics Server, the “other” new in-memory product.

SAS SVP Jim Davis attributed the results in part to a decline in sales to the U.S. Federal government.  Forrester reports that Federal tech spending grew 4% last year.

Strata Report: Advanced Analytics in Hadoop

Here is a quick review of the capabilities for advanced analytics in Hadoop for five vendors at the recent Strata NYC conference:

0xdata

Product(s)

  • H2O (open source project)
  • h2o (R package)

Description

Smart people from Stanford with VC backing and a social media program.   Services business model with open source software.  H2O is an open source library of algorithms designed for deployment in Hadoop or free-standing clusters; aggressive vision, but currently available functionality is limited to GLM, k-Means and Random Forests.   Update: 0xdata just announced H2O 2.0, which includes distributed trees and regression algorithms such as Gradient Boosting Machine (GBM) and Random Forest (RF), plus Generalized Linear Modeling (GLM), k-Means and Principal Component Analysis (PCA).  They also claim to run “100X faster than other predictive analytics providers”, although this claim is not supported by evidence.  R users can interface through the h2o package.  Limited customer base.  Partners with Cloudera and MapR.

Key Points

  • True open source model
  • Comprehensive roadmap
  • Limited functionality
  • Limited user base
  • Performance claims undocumented

Alpine Data Labs

Product(s)

  • Alpine 2.8

Description

Alpine targets a business user persona with a visual workflow-oriented interface (comparable to SAS Enterprise Miner or SPSS Modeler).   Supports a reasonably broad range of analytic features.  Claims to run “in” a number of databases and Hadoop distributions, but the company is opaque about how this works.  (It appears to be SQL/HiveQL push-down.)   In practice, most customers seem to use Alpine with Greenplum.  Thin sales and a thin customer base relative to the claimed feature mix suggest uncertainty about product performance and stability.  Partners with Pivotal, Cloudera and MapR.

Key Points

  • Reasonable option for users already committed to Greenplum Database
  • Limited partner and user ecosystem
  • Performance and stability should be vetted thoroughly in POC

Oracle

Product(s)

  • Oracle R Distribution (ORD)
  • Oracle R Enterprise (ORE)
  • Oracle Advanced Analytics (OAA)
  • Oracle R Connector for Hadoop (ORCH)

Description

Oracle R Distribution (ORD) is a free distribution of R with bug fixes and performance enhancements; Oracle R Enterprise is a supported version of ORD with additional enhancements (detailed below).

Oracle Advanced Analytics (an option of Oracle Database Enterprise Edition) bundles Oracle Data Mining, a distributed data mining engine that runs in Oracle Database, and Oracle R Enterprise.   Oracle Advanced Analytics provides an R to SQL transparency layer that maps R functions and algorithms to native in-database SQL equivalents.  When in-database equivalents are not available, Oracle Advanced Analytics can run R commands under embedded R mode.

Oracle R Connector for Hadoop (ORCH) is an R interface to Hadoop; it enables the user to write MapReduce tasks in R and interface with Hive.  As of ORCH 2.1.0, there is also a fairly rich collection of machine learning algorithms for supervised and unsupervised learning that can be pushed down into Hadoop.

Key Points

  • Good choice for Oracle-centric organizations
  • Oracle Data Mining is a mature product with an excellent user interface
  • Must move data from Hadoop to Oracle Database to leverage OAA
  • Hadoop push-down from R requires expertise in MapReduce

SAS

Products

  • SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server

Description

SAS/ACCESS Interface to Hadoop enables SAS users to pass Hive, Pig or MapReduce commands to Hadoop through a connection and move the results back to the SAS server.   With SAS/ACCESS you can haul your data out of Hadoop, plug it into SAS and use a bunch of other SAS products, but that architecture is pretty much a non-starter for most Strata attendees.   Update:  SAS has announced SAS/ACCESS for Impala.

Visual Analytics is a Tableau-like visualization tool with limited predictive analytic capabilities; LASR Server is the in-memory back end for Visual Analytics.  High Performance Analytics is a suite of distributed in-memory analytics.   LASR Server and HPA Server can be co-located in a Hadoop cluster, but require special hardware.  Partners with Cloudera and Hortonworks.

Key Points

  • Legacy SAS connects to Hadoop, does not run in Hadoop
  • SAS/ACCESS users must know exact Hive, Pig or MapReduce syntax
  • Visual Analytics cannot work with “raw” data in Hadoop
  • Minimum hardware requirements for LASR and HPA significantly exceed standard Hadoop worker node specs
  • High TCO, proprietary architecture for all SAS products

Skytree

Product(s)

  • Skytree Server

Description

Academic machine learning project (FastLab at Georgia Tech); with VC backing, launched as a commercial software vendor in January 2013.  Server-based technology that can connect to a range of data sources, including Hadoop.  Programming interface; claims the ability to run from R, Weka, C++ and Python.  Good library of algorithms.  Partners with Cloudera, Hortonworks and MapR.  Skytree is opaque about technology and performance claims.

Key Points

  • Limited customer base, no announced sales since company launch
  • Hadoop integration is a connection, not “inside” architecture
  • Performance claims should be carefully vetted

SAP and SAS Couple Up

SAS and SAP announced a “strategic partnership” today at the SAP TechEd show.

According to SAS’ press release,

SAP and SAS will partner closely to create a joint technology and product roadmap designed to leverage the SAP HANA® platform and SAS analytics capabilities. By incorporating the in-memory SAP HANA platform into SAS applications and enabling SAS’ industry-proven advanced analytics algorithms to run on SAP HANA, decision makers will have the opportunity to leverage the value of real-time data analysis within their existing SAS and SAP HANA environments.

SAS and SAP plan to execute a co-sell pilot program to engage select joint customers to validate SAS applications running on SAP HANA. The goal of this program is to build and prioritize the two firms’ joint technology throughout 2014, in particular for industries such as financial services, telecommunications, retail, consumer products and manufacturing. The applications are expected to target business areas that require a combination of advanced analytics running on an in-memory platform that will be designed to yield high value results. Such opportunities exist in customer intelligence, risk management, asset management and anti-money laundering, among others.

How soon we forget; just six months ago, SAS leadership trashed SAP HANA from the stage at SAS Global Forum.

SAS and SAP share a commitment to in-memory computing, but they have fundamentally different approaches to the technology.  SAP HANA is a standards-based persistent in-memory database with a strong vendor ecosystem.  SAS, on the other hand, builds its in-memory analytics on a proprietary architecture, and has a vendor ecosystem of one.  HANA succeeds because it is an easy decision for SAP-centric companies to adopt the product for small high-concurrency databases with one data source.   Meanwhile, even the most loyal SAS customers choke at the TCO of SAS High Performance Analytics.

In-memory databases make economic sense when (a) you don’t have much data, (b) usage is read-only, (c) users want small random packets of data, and (d) there are lots of users.   The NBA’s statistics website (powered by SAP HANA) is a perfect example: less than a terabyte of data, but up to 20,000 concurrent users seeking information about how many free throws Hal Greer hit in 1968 against the Celtics.   That’s a great application for BI tools, but not for high-end predictive analytics.  SAP’s HANA Predictive Analytics Library may be toylike, but it’s likely good enough for that use case.

SAS Visual Analytics makes more sense coupled to an in-memory database like HANA than to its existing LASR Server architecture.   It doesn’t do anything that can’t be done in Business Objects, but there are likely a few customers in the market who are both SAS-centric and have an all-SAP back end.

SAS and Hadoop

SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities.    Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.

Prior to January 2012, a search for the words “Hadoop” or “MapReduce” returned no results on the SAS marketing and support websites, which says something about SAS’ leadership in this area.  In March 2012, SAS announced support for Hadoop connectivity; since then, SAS has gradually expanded the features it supports with Hadoop.

As of today, there are four primary ways that a SAS user can leverage Hadoop:

  • SAS/ACCESS Interface to Hadoop (with Legacy SAS)
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server

Let’s take a look at each option.

“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc) that are used primarily from a programming interface.  SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing.  It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user.  For more detailed information, read the manual.

SAS/ACCESS also supports six “Hadoop-enabled” PROCs (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop.  If all you need to do is run frequency distributions, simple statistics and summary reports, then SAS offers everything you need for analytics in Hadoop.  If that is all you want to do, of course, you can use Datameer or Big Sheets and save on SAS licensing fees.
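For illustration, here is a minimal sketch of the Hadoop-enabled PROC pattern.  The server name and the weblogs table are invented, and connection options vary by deployment; PROC FREQ is one of the six PROCs whose work can be pushed into the cluster.

/* LIBNAME points at Hive via SAS/ACCESS Interface to Hadoop;
   the connection details and the weblogs table are hypothetical. */
libname hdp hadoop server="hadoop01.example.com" user="sasdemo";

/* One of the six Hadoop-enabled PROCs */
proc freq data=hdp.weblogs;
   tables http_status;
run;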

A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps.  (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce.)  SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document); if SAS cannot implicitly translate a DATA Step into SQL/HiveQL, it copies the data back to the SAS server, without warning, and performs the operation there.

SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through.  Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”.   The bottom line: since the user needs to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, the user might as well submit those jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.
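To make the distinction concrete, here is a minimal sketch of explicit pass-through, with invented connection details and table names.  The HiveQL inside the inner parentheses is sent to Hadoop verbatim, so there is no guessing about what SAS will or will not translate.

proc sql;
   connect to hadoop (server="hadoop01.example.com" user="sasdemo");
   /* The inner query is HiveQL, executed by Hadoop as written */
   create table work.page_hits as
   select * from connection to hadoop
      ( select page, count(*) as hits
        from weblogs
        group by page );
   disconnect from hadoop;
quit;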

SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.

SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera.  Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes”, which means that in practice customers must rebuild existing predictive models to take advantage of the product.   Alternatively, customers who already use SAS Enterprise Miner can export their models in PMML, use them in any PMML-enabled database or decision engine, and spend less on SAS licensing fees.

Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server.   These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture.  That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.

That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative.  In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time.  SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine.  Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode.  This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.

While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format.   To import the data into SASHDAT, you will need to license SAS Data Integration Server.

A single in-memory node on a 16-core/256GB machine can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes.   SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations.  Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.

SAS seems to be doing a little better selling SAS VA/LASR Server; they have a big push on in 2013 to sell 2,000 copies of VA, and heavily promote a one-node version on a big H-P machine for $100K.  Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.

While SAS has struggled to implement its in-memory software in Hadoop to date,  YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop.  Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.

SAS Visual Analytics: FAQ (Updated 1/2014)

SAS charged its sales force with selling 2,000 licenses for Visual Analytics in 2013; the jury is still out on whether they met this target.  There’s lots of marketing action lately from SAS about this product, so here’s an FAQ.

Update:  SAS recently announced 1,400 sites licensed for Visual Analytics.  In SAS lingo, a site corresponds roughly to one machine, but one license can include multiple sites; so the actual number of licenses sold in 2013 is less than 1,400.  In April 2013 SAS executives claimed two hundred customers for the product.   In contrast, Tableau reports that it added seven thousand customers in 2013 bringing its total customer count to 17,000.

What is SAS Visual Analytics?

Visual Analytics is an in-memory visualization and reporting tool.

What does Visual Analytics do?

SAS Visual Analytics creates reports and graphs that are visually compelling.  You can view them on mobile devices.

VA is now in its fifth dot release.  Why do they call it Release 6.3?

SAS Worldwide Marketing thinks that if they call it Release 6.3, you will think it’s a mature product.  It’s one of the games software companies play.

Is Visual Analytics an in-memory database, like SAP HANA?

No.  HANA is a standards-based in-memory database that runs on many different brands of hardware and supports a range of end-user tools.  VA is a proprietary architecture available on a limited choice of hardware platforms.  It cannot support anything other than the end-user applications SAS chooses to develop.

What does VA compete with?

SAS claims that Visual Analytics competes with Tableau, Qlikview and Spotfire.  Internally, SAS leadership refers to the product as its “Tableau-killer” but as the reader can see from the update at the top of this page, Tableau is alive and well.

How well does it compare?

You will have to decide for yourself whether VA reports are prettier than those produced by Tableau, Qlikview or Spotfire.  On paper, Tableau has more functionality.

VA runs in memory.  Does that make it better than conventional BI?

All analytic applications perform computations in memory.  Tableau runs in memory, and so does Base SAS.   There’s nothing unique about that.

What makes VA different from conventional BI applications is that it loads the entire fact table into memory.  By contrast, BI applications like Tableau query a back-end database to retrieve the necessary data, then perform computations on the result set.

Performance of a conventional BI application depends on how fast the back-end database can retrieve the data.  With a high-performance database the performance is excellent, but in most cases it won’t be as fast as it would if the data were held in memory.

So VA is faster?  Is there a downside?

There are two.

First, since conventional BI systems don’t need to load the entire fact table into memory, they can support much larger datastores.  The largest H-P ProLiant box for VA maxes out at about 10 terabytes; the smallest Netezza appliance supports 30 terabytes, and scales to petabytes.

The other downside is cost; memory is still much more expensive than other forms of storage, and the machines that host VA are far more expensive than data warehouse appliances that can host far more data.

VA is for Big Data, right?

SAS and H-P appear to be having trouble selling VA in larger sizes, and are positioning a small version that can handle 75-100 Gigabytes of data.  That’s tiny.

The public references SAS has announced for this product don’t seem particularly large.  See below.

How does data get into VA?

VA can load data from a relational database or from a proprietary SASHDAT file.  SAS cautions that loading data from a relational database is only a realistic option when VA is co-located in a Teradata Model 720 or Greenplum DCA appliance.

To use SASHDAT files, you must first create them using SAS.
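In other words, expect a SAS job in the middle of every load.  A hypothetical sketch, assuming the ETL work is already done and a SASHDAT library has been configured (the path and dataset names are invented; distributed deployments need additional engine options):

/* Write a table in SAS' proprietary SASHDAT format so VA can load it.
   Path and dataset names are hypothetical. */
libname hdat sashdat path="/hps/finance";

data hdat.transactions;     /* writes the table in SASHDAT format */
   set work.transactions;   /* an ordinary SAS dataset, post-ETL  */
run;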

Does VA work with unstructured data?

VA works with structured data, so unstructured data must be structured first, then loaded either to a co-located relational database or to SAS’ proprietary SASHDAT format.

Unlike products like Datameer or IBM Big Sheets, VA does not support “schema on read”, and it lacks built-in tools for parsing unstructured text.

But wait, SAS says VA works with Hadoop.  What’s up with that?

A bit of marketing sleight-of-hand.  VA can load SASHDAT files that are stored in the Hadoop File System (HDFS); but first, you have to process the data in SAS, then load it back into HDFS.  In other words, you can’t visualize and write reports from the data that streams in from machine-generated sources — the kind of live BI that makes Hadoop really cool.  You have to batch the data, parse it, structure it, then load it with SAS to VA’s staging area.

Can VA work with streaming data?

SAS sells tools that can capture streaming data and load it to a VA data source, but VA works with structured data at rest only.

With VA, can my users track events in real time?

Don’t bet on it.   To be usable, data requires significant pre-processing before it is loaded into VA’s memory.  Moreover, once it is loaded it can’t be updated; updating the data in VA requires a full truncate and reload.   Thus, however fast VA is in responding to user requests, your users won’t be tracking clicks on their iPads in real time; they will be looking at yesterday’s data.

Does VA do predictive analytics?

Visual Analytics 6.1 can perform correlation, fit bivariate trend lines to plots and do simple forecasting.  That’s no better than Tableau.  Surprisingly, given the hype, Tableau actually supports more analysis functions.

While SAS claims that VA is better than SAP HANA because “HANA is just a database”, the reality is that SAP supports more analytics through its Predictive Analytics Library than SAS supports in VA.

Has anyone purchased VA?

A SAS executive claimed 200 customers in early 2013, a figure that should be taken with a grain of salt.  If there are that many customers for this product, they are hiding.

There are five public references, all of them outside the US:

SAS has also recently announced selection (but not implementation) by

OfficeMax has also purchased the product, according to this SAS blog.

As of January 2014, the four customers who announced selection or purchase are not cited as reference customers.

What about implementation?  This is an appliance, right?

Wrong.  SAS considers an implementation that takes a month to be wildly successful.  Implementation tasks include the same tasks you would see in any other BI project, such as data requirements, data modeling, ETL construction and so forth.  All of the back-end feeds must be built to put data into a format that VA can load.

Bottom line, does it make sense to buy SAS Visual Analytics?

Again, you will have to decide for yourself whether the SAS VA reports look better than Tableau or the many other options in this space.  BI beauty shows are inherently subjective.

You should also demand that SAS prove its claims to performance in a competitive POC.  Despite the theoretical advantage of an in-memory architecture, actual performance is influenced by many factors.  Visitors to the recent Gartner BI Summit who witnessed a demo were unimpressed; one described it to me as “dog slow”.  She didn’t mean that as a compliment.

The high cost of in-memory platforms means that VA and its supporting hardware will be much more expensive for any given quantity of data than Tableau or equivalent products. Moreover, its proprietary architecture means you will be stuck with a BI silo in your organization unless you are willing to make SAS your exclusive BI provider.  That makes this product very good for SAS; the question is whether it is good for you.

The early adopters for this product appear to be very SAS-centric organizations (with significant prior SAS investment).  They also appear to be fairly small.  If you have very little data, money to burn and are willing to experiment with a relatively new product, VA may be for you.

SAS and H-P Close the Curtains

Michael Kinsley wrote:

It used to be, there was truth and there was falsehood. Now there is spin and there are gaffes. Spin is often thought to be synonymous with falsehood or lying, but more accurately it is indifference to the truth. A politician engaged in spin is saying what he or she wishes were true, and sometimes, by coincidence, it is. Meanwhile, a gaffe, it has been said, is when a politician tells the truth — or more precisely, when he or she accidentally reveals something truthful about what is going on in his or her head. A gaffe is what happens when the spin breaks down.

Hence, a Kinsley gaffe means “accidentally telling the truth”.

Back in April, an H-P engineer committed a Kinsley gaffe by publishing a white paper that describes in some detail issues encountered by SAS and H-P on implementations of SAS Visual Analytics.  I blogged about this at the time here.

Some choice bits:

— “Needed pre-planning does not occur and the result is weeks to months of frantic activity to address those issues which should and could have been addressed earlier and in a more orderly fashion.”

— “(Data and management networks) are typically overlooked and are the cause of most issues and delays encountered during implementation.”

— “Since a switch with 100s to 1000s of ports is required to achieve the consolidation of network traffic, list price can start at about US$500,000 and be into the millions of dollars.”

And my personal favorite:

— “The potential exists, with even as few as 4 servers, for a Data Storm to occur.”

If you’re wondering what a Data Storm is, let’s just say that it’s not a good thing.

Since I published the blog post, SAS has withdrawn the paper from its website.   This is not too surprising, since every other paper on “SAS and Big Data” is also hidden from view.   Fortunately, I downloaded a copy of the paper for my records.   H-P can claim copyright, so I can’t upload the whole thing, but I’ve attached a few screen shots below so you can see that this paper is real.

You might wonder why SAS feels compelled to keep its “Big Data” stories under wraps.  Keep in mind that we’re not talking about software design or any other intellectual property that warrants protection; in this case, the vendors don’t want you to know the truth about implementation because it conflicts with the hype.  As the paper’s author puts it, “this sounds very scary and expensive.”  “Very scary” and “expensive” don’t mix with “buy this product now.”

If you’re evaluating SAS Visual Analytics ask your SAS rep for a copy of Paper 466-2013.  And ask if they’ve done anything about those Data Storms.

[Screenshot 1: excerpt from HP Paper 466-2013]

[Screenshot 2: excerpt from HP Paper 466-2013]