Spark 1.4 Released

On June 11, the Spark team announced availability of Release 1.4.  More than 210 contributors from 70 different organizations contributed more than 1,000 patches.  Spark continues to expand its contributor base, the best measure of health for an open source project.

Spark Core

The Spark team continues to improve Spark operability, performance and compatibility.  Key enhancements include:

  • The first phase in Project Tungsten performance improvements, a cache-friendly sort algorithm
  • Also for improved performance, serialized shuffle output
  • For the Spark UI, visualization for Spark DAGs and operational monitoring
  • A REST API for application information, such as job, stage, task and storage status (see the sketch following this list)
  • For Python users, support for Python 3.x, plus external spilling for Python groupByKey operations
  • Two YARN enhancements: support for YARN on EC2 and security for long-running YARN applications
  • Two Mesos enhancements: Docker support and cluster mode
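
To give a flavor of the new REST API, here is a minimal sketch (mine, not from the release notes) that lists applications on a running driver and counts their jobs.  The host and port are assumptions based on the default Spark UI.

    # List applications and count their jobs via the monitoring REST API.
    # Assumes a driver UI on localhost:4040 (the default).
    import json
    from urllib.request import urlopen

    base = "http://localhost:4040/api/v1"
    apps = json.loads(urlopen(base + "/applications").read().decode("utf-8"))
    for app in apps:
        url = "{0}/applications/{1}/jobs".format(base, app["id"])
        jobs = json.loads(urlopen(url).read().decode("utf-8"))
        print(app["name"], len(jobs), "jobs")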

DataFrames and SQL

This release extends the analytic functions available for DataFrames (notably window functions), adds operational utilities for Spark SQL and adds support for the ORCFile format.

A complete list of enhancements to the DataFrame API is here.
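
For illustration, a hedged sketch of the new window functions from Python; it assumes an existing SQLContext named sqlContext, and the data is invented.

    # Rank employees by salary within each department using a window.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    df = sqlContext.createDataFrame(
        [("eng", "alice", 120), ("eng", "bob", 110), ("ops", "carol", 95)],
        ["dept", "name", "salary"])

    w = Window.partitionBy("dept").orderBy(F.desc("salary"))
    df.select("dept", "name", F.rank().over(w).alias("rank_in_dept")).show()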

R Interface

AMPLab released a developer version of SparkR in January 2014.  In June 2014, Alteryx and Databricks announced a partnership to lead development of this component.  In March 2015, SparkR officially merged into Spark.

SparkR offers an interface to use Apache Spark from R.  In Spark 1.4, SparkR supports operations like selection, filtering and aggregation on large datasets.  It’s important to note that as of this release SparkR does not support an interface to MLLib, Streaming or GraphX.

Machine Learning

In Spark 1.4, ML pipelines graduate from alpha release, add feature transformers (VectorAssembler, StringIndexer, Bucketizer and others) and gain a Python API.  The release includes a number of additional enhancements to ML.
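
Here is a sketch of what a 1.4 pipeline looks like from Python, chaining the new feature transformers into a logistic regression; the DataFrames (train_df, test_df) and column names are illustrative, not from the release notes.

    # Index a categorical column, bucket a numeric one, assemble features,
    # then fit a logistic regression, all as one pipeline.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import Bucketizer, StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="channel", outputCol="channel_idx")
    buckets = Bucketizer(splits=[0.0, 18.0, 35.0, 65.0, 120.0],
                         inputCol="age", outputCol="age_bucket")
    assemble = VectorAssembler(inputCols=["channel_idx", "age_bucket", "spend"],
                               outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[indexer, buckets, assemble, lr]).fit(train_df)
    scored = model.transform(test_df)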

There appears to be an effort under way to rebuild MLLib’s supervised learning algorithms in ML.

MLLib picks up a long list of enhancements as well.

There is a single enhancement to GraphX in Spark 1.4: personalized PageRank.  Spark’s graph analytics capabilities are comparatively static.

Streaming

The enhancements to Spark Streaming include improvements to the UI, enhanced support for Kafka and Kinesis and a pluggable interface for write-ahead logs.  Enhanced support for Kafka includes better error reporting, support for Kafka 0.8.2.1 and Kafka with Scala 2.11, input rate tracking and a Python API for Kafka direct mode.
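
For illustration, a sketch of the new Python API for Kafka direct mode; the broker address and topic name are assumptions.

    # Count messages per 10-second batch from a Kafka topic, direct mode.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-direct-sketch")
    ssc = StreamingContext(sc, 10)

    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker1:9092"})
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()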

How to Buy SAS Visual Analytics

Stories about SAS Visual Analytics are among the most widely read posts on this blog.  In the last two years I’ve received many queries from readers who complain that it’s hard to get clear answers about the software from SAS.

In software procurement, the customer has bargaining power until the deal closes; after that, power shifts to the vendor.   In this post, I’ve compiled some key questions prospective customers should resolve before signing a license agreement with SAS.

SAS Visual Analytics (VA), first launched in 2012, is now in its seventh dot release.  With a total of ~3,400 sites licensed, the most serious early release issues are resolved.  The product itself has improved.  In early releases, for example, it was impossible to join tables after loading them into VA; now you can.  SAS has gradually added features to the product, and will continue to do so.

Privately, SAS account executives describe VA as a “Tableau-Killer”; a more apt description is “Tableau for SAS Lovers.”   An experienced Tableau user will immediately notice features missing from VA.  On the other hand, SAS offers some statistical features (SAS Visual Statistics) not currently available in Tableau, for an extra license fee.

As this chart shows, Tableau is still alive:

[Chart: SAS Visual Analytics vs. Tableau revenue.  Source: Tableau annual report; SAS revenue press release]

SAS positions VA to its existing BI customers as a replacement product, and not a moment too soon; Gartner reports that organizations are rapidly pulling the plug on the legacy SAS BI product.  SAS prices VA to sell, clearly seeking to underprice Tableau and build a footprint.  Ordinarily, SAS pricing is a closely held secret, but SAS discloses its low VA pricing in the latest Gartner BI Magic Quadrant report.

Is VA the Right Solution?

VA works with SAS LASR Server, a proprietary in-memory analytic datastore, which should not be confused with in-memory databases like SAP HANA, Exasol or MemSQL.   In-memory databases have many features that are missing from LASR Server, such as ACID compliance, ANSI SQL engines and automated archiving.  Most in-memory databases can update data in real time; for LASR Server, you update a table by reloading it.  Commercial in-memory databases support many different end-user products for visualization and BI, so you aren’t locked in with a single vendor.  LASR Server supports SAS software only.

Like any other in-memory datastore, LASR Server is best for small high-value databases that will be queried by many users who require low latency.  LASR Server reads an entire table into memory and persists it there, so the amount of available memory is a limiting factor.

Since LASR Server is a distributed engine you can add more servers if you need more memory.  But keep in mind that while the cost of memory is declining, it is not free; it is still quite expensive per byte compared to disk storage.  In practice, most working in-memory databases support less than a terabyte of data.  By contrast, the smallest data warehouse appliances sold by vendors like IBM support thirty terabytes.

LASR Server’s principal selling point is speed.  The product is fast because it persists data in memory, and separates the disk I/O bottleneck from the user experience.  (You still need to load data into LASR Server, but you can do this separately, when the user isn’t waiting for a response.)

In contrast, Tableau uses a patented (i.e., proprietary) data engine that interfaces with your data source.  For extracts not already cached on the server, Tableau submits a query whose runtime depends on the data source; if the supporting database is poorly tuned, the query may take a long time to run.  In most cases, VA will be faster than Tableau, but it’s debatable how critical this is for a decision support application.

VA and LASR Server are the right solution for your business problem if all of the following conditions are true:

  • You work with less than a terabyte of data
  • You are willing to limit your visualization and BI tools to SAS software
  • You expect more than a handful of concurrent users
  • Your users require subsecond query response times

If you are thinking of using VA and LASR Server in distributed mode (implemented across more than one server), keep in mind that distributed computing is an order of magnitude more difficult to deliver.  Since SAS pitches a low-cost “Single Box Solution” as an entry-level product, most of those 3,400 customer sites run on a single server.  Before you commit to licensing the product in a multi-server configuration, you should insist on additional proof of product viability from SAS.  For example, insist on references from customers running in production in configurations at least as large as what you have in mind; and consider a full proof-of-concept (funded by SAS).

SAS’ low software pricing for VA makes it seem attractive.  However, you need to focus on the total cost of ownership, which we discuss below.

Infrastructure Costs

According to SAS’ sizing guidelines for VA, a single 16-CPU server with 256G RAM can support a 20GB table with seven heavy users.  (That’s 20 gigabytes of uncompressed data.)

For a rough estimate of the amount of hardware required:

  1. Determine the size of the largest table you plan to load
  2. Determine the total amount of data you plan to load
  3. Determine the planned number of “heavy” and “light” users.  SAS defines a heavy user as “any SAS Visual Analytics Explorer user or a user who runs correlational analysis with multiple variables, box plots with four or more measures, or crosstabs with four or more class variables.”  In practice, this means every user.

In Step #4, you write a large check to your preferred hardware vendor, unless you are working with tiny data.
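
For a sense of the scale involved, here is a back-of-envelope sketch that simply extrapolates from SAS’ published data point; the linear scaling is my assumption, not a SAS guideline.

    # 256 GB of RAM supports a 20 GB uncompressed table (per SAS' guideline),
    # so figure roughly 12.8 GB of RAM per GB of uncompressed data.
    GB_RAM_PER_GB_DATA = 256.0 / 20.0

    def ram_needed_gb(total_uncompressed_gb):
        return total_uncompressed_gb * GB_RAM_PER_GB_DATA

    print(ram_needed_gb(100))  # 100 GB of data -> 1280 GB of cluster RAM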

SAS will tell you that VA runs on commodity servers.  That is technically true, but a little misleading.  SAS does not require you to buy your servers from any specific vendor; however, the specs needed for good performance are quite different from a typical Hadoop node server.  Not surprisingly, VA requires specially configured high-memory machines, such as these from HP.

[Image: HP server configurations for SAS Visual Analytics]

Node servers are just the beginning of the story. According to an HP engineer with extensive VA experience, networking is a key bottleneck in implementations.  Before you sign a license agreement for VA, check with your preferred hardware vendor to determine how much experience they have with the product.  Ask them to provide a firm quote for all of the necessary hardware, and a firm schedule for delivery and installation.

Keep in mind that SAS does not actually recommend hardware for any of its software.  While SAS will work with you to estimate volume and workload, it passes this information to the hardware vendors you specify for the actual recommended sizing and configuration.  Your hardware vendor plays a key role in the success of your implementation of this product, so it’s important that you choose a vendor that has significant experience with this software.

Implementation

SAS publishes most of its documentation on its support website.  For VA, however, SAS keeps technical documentation for installation, configuration and administration under lock and key.  The implication is that it’s not pretty.  Before you sign a license agreement, you should insist that SAS provide the documentation for your team to review.

There is more to implementing this product than software installation.  Did you notice the fine print in SAS’ Hardware Sizing Guidelines?  I quote:

“These guidelines do not address the data management resources needed outside of SAS Visual Analytics.  Getting data into SAS Visual Analytics and performing other ETL functions are solely the responsibility of the user.”  

VA’s native capabilities for data cleansing and transformation have improved since the first release, but they are still rudimentary.  So unless your source data is perfectly clean and ready to use — ha ha — you’re going to need ETL processes to prepare your data.  Unless your prospective users are ETL experts, they will need someone to build those feeds; and unless you have SAS developers sitting on the bench, you’re going to need SAS or a SAS Partner to provide developers who can do the job.

If you are thinking about licensing VA, you are almost certainly using legacy SAS products already.  You may think that will make implementation easier, but think again: VA and LASR Server are fundamentally new products with a new architecture.  Your SAS users and developers will all need training.  Moreover, your existing SAS programs may need conversion to work with the new software.

Before you sign a license agreement for VA, insist on a firm, fixed price quote from SAS for all implementation tasks, including data feeds.  Your SAS Account Executive will tell you that SAS “does not do” fixed price quotes.  Nonsense.  SAS will happily give away consulting services if they can win your software business, so don’t take “no” for an answer.

SAS will need to do an assessment, of course, before fixing the price, which is fine as long as you don’t have to pay for it.

Time to Value

When SAS first released VA, implementations ran around three months under ideal circumstances.  Many ran much longer, due to unanticipated issues with networking and infrastructure.  With more experience, SAS has a better understanding of the product’s infrastructure requirements, and can set expectations accordingly.

Nevertheless, there is no reason for you to assume the risk of delay getting the product into production.  SAS charges you for a license to use the software from the moment you sign the contract; if the implementation project runs long, it’s on your dime.

You should insist on a firm contractual commitment from SAS to get the software up and running by a date certain, with financial penalties for failure to deliver.  It’s unlikely that SAS will agree to deferred payment of the first-year fee, or an acceptance deal, since this impacts revenue recognition.  But you should be able to negotiate an extended renewal anniversary based on the date of delivery and acceptance.  You can also negotiate deferred payment of the fixed price consulting fee.

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First, some definitions.  The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I’ve written previously about the importance of distributed computing for high-performance predictive analytics, why it’s difficult to deliver and potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor’s data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster’s built-in analytic functions.)  While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in.  In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they’re effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H2O is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters. Current functionality (Release 2.8.4.4) includes Cox Proportional Hazards modeling, Deep Learning, generalized linear models, gradient boosted classification and regression, k-Means clustering, Naive Bayes classifier, principal components analysis, and Random Forests. The software also includes tooling for data transformation, model assessment and scoring.   Users interact with the software through a web interface, a REST API or the h2o package in R.  H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.

H2O.ai provides commercial support for the open source software.  There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, Paypal and Nielsen.

MADLib 

Distributor: Pivotal Software

MADLib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7).  While primarily a captive project of Pivotal Software — most of the top contributors are Pivotal or EMC employees — the support for PostgreSQL qualifies it for this list.    MADLib includes rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.
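
Because MADLib runs as SQL, any client that can submit a query can train a model.  Here is a hedged sketch from Python, following the linear regression example in MADLib’s documentation; the connection string is an assumption.

    # Train and inspect a MADlib linear regression model via psycopg2.
    import psycopg2

    conn = psycopg2.connect("dbname=analytics")
    cur = conn.cursor()
    cur.execute("""
        SELECT madlib.linregr_train(
            'houses',                      -- source table
            'houses_linregr',              -- output model table
            'price',                       -- dependent variable
            'ARRAY[1, tax, bath, size]'    -- independent variables
        );
    """)
    conn.commit()
    cur.execute("SELECT coef, r2 FROM houses_linregr;")
    print(cur.fetchone())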

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project launched in 2008 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community.  The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago.   Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce.  To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark.  For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that’s worth).  Several commercial vendors, including Predixion Software and RapidMiner, leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics.  Spark is a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics; currently (Release 1.2) it supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3.  Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLLib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers.  Thanks to a large and growing contributor community, Spark MLLib’s functionality is expanding faster than that of any other open source or commercial software listed in this article.
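
For a flavor of the Python API, a small sketch that trains one of the listed classifiers on a toy dataset; the data and parameters are purely illustrative.

    # Train a logistic regression classifier on a two-point toy RDD.
    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="mllib-sketch")
    points = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ])
    model = LogisticRegressionWithSGD.train(points, iterations=100)
    print(model.predict([1.0, 0.0]))  # expect 1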

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases.  Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum database.  (Alpine and Greenplum have common roots in the EMC ecosystem).   Usability is the product’s key selling point, and the analytic feature set is relatively modest; however, Chorus’ collaboration and data cataloguing capabilities are unique.  Alpine’s customer list is growing; the list does not include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database.  Embedded in SQL, analytics may be invoked from a range of applications, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS.   The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately).  The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes.  Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees.  The cadence of enhancements for this product is very slow; since its first release in May 2013, IBM has shipped a single maintenance release.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts.  For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, Naive Bayes classifier and two Spark MLLib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and Datastax Enterprise.  As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components, including Revolution R, an enhanced and commercially supported R distribution, a Windows IDE, integration tools and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface.  A little over a year ago, Revolution released version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce.   The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering.   Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC.  Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks).  In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated “Analysis” nodes within the Hadoop cluster.)  While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users leverage the algorithms from SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December 2011 with great fanfare.  To date the product lacks a single public reference customer; this could mean that SAS’ Marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between.  As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA’s inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture. Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA.  SAS prices this software based on the number of cores in the servers upon which it is deployed; while I can’t disclose the list price per core, it’s fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first year fee.

Skytree

Product: Skytree Infinity

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013.  Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark.  Distributed algorithms can run as a free-standing cluster or co-located in Hadoop under YARN.  The product has a programming interface; the vendor claims ability to run from R, Weka, C++ and Python.   Neither Skytree’s modest list of algorithms nor its short list of public reference customers has changed in the past two years.

SAS Versus R Part Two

In a previous post, I summarized some myths about SAS and R — arguments offered by proponents of one or the other that deserve to be dismissed.

In this post, I will review some arguments that do make sense — things to consider if you are an aspiring analyst or if you are an executive making decisions about software for your organization.

(1) Every analysis technique available in SAS is available in R — plus many more

It’s fair to say that any analysis you can do in SAS you can also do in R.  The reverse, however, is not true — there are many techniques available in R that are not available in SAS.

As an open source platform, R is open to innovation, and offers few barriers to entry for new techniques.  An analyst who develops a new technique can quickly publish it in R, even if the technique has only niche appeal; it’s a great example of the long tail effect in action.

Commercial software providers like SAS, on the other hand, use product management calculus to balance the benefits of introducing a new technique against the cost to develop and support it.  The marginal revenue from adding a feature is hard to measure, while the costs are known, so conservative companies like SAS tend to lag well behind the cutting edge.  SAS also tends to bundle popular new capabilities into new products rather than enhancing the existing product, forcing customers to add more SAS software licenses to the stack if they want the capability.

Random Forests is a case in point.  Breiman and Cutler published their seminal article describing the technique in October 2001; the following year, they published the randomForest package in R.  In December 2012, SAS released an “experimental” version of what it calls “HP Forests” in SAS High Performance Analytics, and in 2013 included the PROC in SAS Enterprise Miner 13.1.

Ten years is a long time to wait.

(2) SAS is easier to learn and use than R

R mavens dispute this point, but they are wrong.  R is significantly harder to learn and use than SAS, at several levels, and for a number of reasons.

Bob Muenchen recently published an excellent catalogue of Things That Make R Hard to Learn.  Bob should know; he makes a living helping users cross the chasm from SAS to R.  Here is a brief excerpt, but you should definitely read the whole thing:

R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.

There are two main reasons SAS is easier to use than R.  First, as a commercial product, every element of SAS is governed by a common design that unifies the SAS programming language, user interfaces and documentation.  As a result, SAS programming syntax and documentation are generally consistent across procedures; statements generally mean the same thing whether you are working in PROC ACCESS or PROC XML.

Developers who contribute R packages, on the other hand, operate independently and without a comparable design.  While each individual package may be well or poorly written, there is no governing principle that ensures packages are consistent with one another.  While R aficionados celebrate its diversity, to the outsider it just seems messy.

SAS’ strong development tools add significant value for the user.  SAS Enterprise Guide, for example, included with Analytics Pro at no extra charge, offers a workflow interface and the ability to generate SAS or SQL code behind the scenes.  There is no equivalent code-generating tool available for R today.

(3) SAS offers an “enterprise-grade” solution

Individual analysts surveyed by Rexer last year said that cost and ease of use are the most important factors they consider when choosing analytic tools.  For enterprises, however, the selection criteria are more complex.

Technical support is a key concern for most organizations; some go so far as to adopt blanket policies banning the use of unsupported software.  SAS invests heavily in its Support organization; unlike many large software vendors, Technical Support is a career track at SAS, with low employee turnover.  With offices in multiple countries, SAS is able to support customers globally and at enterprise scale.

When SAS licenses its software, it warrants that the software is materially free of defects.  This warranty is backed up by a contractual commitment to fix defects that surface.  Hence, SAS offers the customer a “single throat to choke” — customers know when they license SAS that a single organization is responsible for development, distribution, implementation and support of the software, and accountability is clear.

Open source R, of course, has no organic technical support.  Organizations such as Revolution Analytics offer technical support either for open source R or Revolution’s own commercial R distribution.  Third party service providers like Revolution can be highly knowledgeable and effective; however, if there is a software defect in an R package, the support provider can only notify the developer and request resolution.

(4) SAS costs more than R

“Duh!” you say; “R is free!”  True enough.  R is open source software, distributed with a free license to use; for a single analyst, the incremental TCO to download, install and use R on an existing machine is zero.   This is also true for other key components of the R ecosystem, such as RStudio, the popular development environment.   Low cost of entry is a key driver behind R’s growing popularity.

SAS, on the other hand, bundles a term license to use the software together with technical support and maintenance into a subscription fee.  An entry license for the most basic package (SAS Analytics Pro) costs $8,700 (first year fee) at the SAS online store; this package includes Base SAS, SAS/STAT and SAS/Graph.  SAS renewal fees generally run 25-30% of the first year fee.  SAS bundles its analytic features into a number of separate packages, such as SAS/ETS for time series, SAS/OR for optimization and SAS/IML for matrix manipulation; if you require these capabilities, you must pay extra.  SAS also offers access engines for an assortment of data sources, each of which can be licensed for $3,000.
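
To put those figures in perspective, a quick three-year calculation for a single desktop license; the 27.5% renewal rate is simply the midpoint of the quoted range, not a SAS number.

    # Three-year cost of one Analytics Pro desktop license.
    first_year = 8700.0
    renewal = 0.275 * first_year           # ~$2,393 per year, years 2 and 3
    three_year = first_year + 2 * renewal
    print(round(three_year))               # ~13485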

The version of SAS sold through the online store is for single Windows machines only.  SAS sells its software for servers through its sales force, and pricing is negotiated; “list” price depends on the computing power of the server, measured by cores or sockets.  Server pricing for Analytics Pro starts in the low six figures.

SAS offers a virtualized “University Edition” which is free but not open source.  See here for a review.

Bottom line — for the analyst

Aspiring analysts ask “should I learn SAS or R?”   I’m tempted to answer “why not both?” but that begs the question of which to learn first.

If SAS is the primary tool at your organization or university, learn it and use it.  There are still more jobs available for SAS users than R users (though the gap is narrowing); and even prospective employers who do not currently use SAS treat it as a proxy for analytics know-how.

If your organization or university supports both SAS and R, look for trends in usage.  Is the R community growing rapidly?  Are the “best and brightest” people using SAS or R?  Is your management putting out subtle (or not-so-subtle) messages promoting use of one or the other?   Take the pulse of your organization and make your choice.

If your organization or university does not already license SAS, if you aspire to freelance consulting or you are simply unemployed, learn R.  Doing so costs you nothing, and there are plenty of low-cost options for training and self-directed learning.

Bottom line — for the enterprise

If you are making decisions about software for an analytics team or an entire organization, the calculus is more complex.

R has more analytic techniques than SAS, but which techniques do you actually need?  Take note of your team’s actual current and future analytic needs, and act accordingly.  If you are using SAS today, the chances are very good that a handful of PROCs account for 95% of current usage; the same is true for R.

SAS is easier to learn than R, but if all or most of your analysts already know R, what difference does it make?  Many younger analysts entering the workforce already know how to use R, and it is a waste of time and money to force them to learn SAS.  On the other hand, if your analysts rely on SAS, you can expect to invest considerable time and money for retraining.

Do you need an enterprise solution?  If your organization spans multiple countries, or if you support more than twenty users, the chances are that the answer is “yes”.  For a larger organization, it’s hard to beat SAS’ ability to mobilize support, training and consulting resources around the world.  This is likely to change in the future, as organizations like Revolution Analytics build scale and credibility.

SAS costs more than R, but R is not free.  If you are concerned about SAS costs, carefully evaluate your spending and take note of the value offered by SAS.  Keep in mind that software licensing costs are only one component of Total Cost of Ownership (TCO); third-party support for R is not free, and neither is training and conversion.  Do the math.

In general, SAS works well for organizations that are in the middle of Tom Davenport’s analytics maturity cycle.   These organizations have the basic data infrastructure and business cases for analytics, combined with a need for rapid scale and consistency across locations.  As organizations mature, they become less dependent on a single vendor for analytics and more willing to develop a “best-in-breed” approach; they are more interested in innovation and “cutting-edge” techniques, and the analysts they hire have the will and skill to learn R.  These organizations are adopting R at an increasing rate.

Adoption of R is most pervasive among analytic service providers, such as consultants, system integrators and marketing service providers.  These organizations are sensitive to software costs and tend to hire the most highly skilled analysts, for whom R’s learning curve is not a serious issue.  Costs aside, SAS restrictions on use — designed to prevent cannibalization — are highly problematic for service providers.

SAS Versus R (Part 1)

Which is better for analytics, SAS or R?  One frequently sees discussions on this topic in social media; for examples, see here, here, here, here, here and here.   Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light.

The question is serious.  Many organizations with a large investment in SAS are actively considering whether to adopt R, either to supplement SAS or to replace it altogether.  The trend is especially marked in the analytic services industry, which is particularly sensitive to SAS licensing costs and restrictive conditions.

In this post, I will recap some common myths about SAS and R.  In a follow-up post,  I will summarize the pros and cons of each as an analytics platform.

Myths About SAS and R

Advocates for SAS and R often support their positions with beliefs that are little more than urban legends; as such, they are not good reasons to choose SAS over R or vice-versa.   Let’s review six of these myths.

(1) Regulatory agencies require applicants to use SAS.

This claim is often cited in the context of submissions to the Food and Drug Administration (FDA), apparently by those who have never read the FDA’s regulations governing submissions.  The FDA accepts submissions in a range of formats including SAS Transport Files (which an R user can create using the StatTransfer utility.)   Nowhere in its regulations does the FDA mandate what software should be used to produce the analysis; like most government agencies, the FDA is legally required to support standards that do not favor single vendors.
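
The constraint, in other words, is the accepted file format and not the software that produces it.  SAS Transport (XPORT) is an openly documented format that non-SAS tools can read and write; as a hedged illustration, here is how one might inspect a transport file with pandas (the file name is hypothetical).

    # Read a SAS Transport (XPORT) file without SAS.
    import pandas as pd

    df = pd.read_sas("adsl.xpt", format="xport")
    print(df.shape)
    print(df.head())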

Pharmaceutical firms tend to rely heavily on SAS because they trust the software, and not due to any FDA mandate.  Among its users, SAS has a deservedly strong reputation for quality; it is a mature product and its statistical techniques are mature, well-tested and completely documented.  In short, the software works, which means there is very little incentive for an established user to experiment with something else, just to save on licensing fees.

That trust in SAS isn’t a permanent state of affairs.  R is gradually making inroads in the life sciences community; it has already largely displaced SAS in the academic world.  Like many other regulatory bodies, the FDA itself uses open source R together with SAS.

(2) R is better than SAS because it is object oriented.

This belief is wrong on two counts: (1) it assumes that object-oriented languages are best for all use cases; and (2) it further assumes that SAS offers no object-oriented capability.

Object-oriented languages are more efficient and easier to use for many analysis tasks.  In real-world analytics, however, we often work with messy and complex data; a cursor-based language like the SAS DATA Step offers the user a great deal of flexibility, which is why it is so widely used.  Anyone who has ever attempted to translate SAS “first and last” processing into an object-oriented language understands this point.  (Yes, it can be done; but it requires a high-level of expertise in the OOL to do it).

In Release 9.3, SAS introduced DS2, an object-oriented language with a defined migration path from SAS DATA Step programming. Hence, for those tasks where object-oriented programming is desirable, DS2 meets this need for the SAS user.  (DS2 is included with Base SAS).

(3) You never know what’s inside open source software like R.

Since R is an open programming environment, anyone can develop a package and contribute it to the project.  Commercial software vendors like to plant FUD about open source software by suggesting that contributors may be amateurs or worse — in contrast to the “professional” engineering of commercial software.

One of the key virtues of open source software is that you do know what’s inside it because — unlike commercial software — you can inspect the source code.  With commercial software, you must have faith in the vendor’s integrity, technical support and willingness to stand by its warranty.  For open source software, there is no warranty nor is one required; the code speaks for itself.

When a contributor publishes an enhancement to R, a large community of users evaluates and tests the new feature.  This “crowdsourced” testing quickly flags and logs issues with software syntax and semantics, and logged issues are available for anyone to see.

Commercial software vendors like SAS have professional testing and QA departments, but since testing is expensive there is considerable pressure to minimize the expense.   Under the pressure of Marketing and Sales deadlines, systematic testing is often the first task to be cut.  Bismarck once said that nobody should witness how laws or sausages are made; the same is true for commercial software.

SAS does not disclose the headcount it commits to software testing and QA, but given the size of the R user base, it’s fair to say that the number of people who test and evaluate each R release is far greater than the number of people who evaluate each SAS release.

(4) R is better than SAS because it has thousands of packages.

This is like arguing that Wal-Mart is a better store than Brooks Brothers because it carries more items.  Wal-Mart’s breadth of product makes it a great shopping destination for many shoppers, but a Brooks Brothers shopper appreciates the store’s focus on a certain look and personalized service.

By analogy, R’s cornucopia of functionality is both a feature and a bug.  Yes, there is a package in R to support every conceivable analytic need; in many cases, there is more than one package.  As of this writing, there are 486 packages that support linear regression, which is great unless you only need one and don’t want to sift through 486.

Of course, actual R users don’t check every package to find what they need; they settle on a few trusted packages based on actual experience, word-of-mouth, books, periodicals or other sources of information.  In practice, relatively few R packages are actually used; the graph below shows package downloads from RStudio’s popular CRAN mirror in September 2014.

[Figure: package downloads from RStudio’s CRAN mirror, September 2014]

(For the record, the ten most downloaded packages from RStudio’s CRAN mirror in September 2014 were Rcpp, plyr, ggplot2, stringr, digest, reshape2, RColorBrewer, labeling, colorspace and scales.)
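
For the curious, here is a sketch of how one might pull similar numbers programmatically; it assumes the cranlogs web service (cranlogs.r-pkg.org), which publishes download statistics for RStudio’s mirror.

    # Fetch the ten most-downloaded CRAN packages over the last month.
    import json
    from urllib.request import urlopen

    url = "https://cranlogs.r-pkg.org/top/last-month/10"
    top = json.loads(urlopen(url).read().decode("utf-8"))
    for entry in top["downloads"]:
        print(entry["package"], entry["downloads"])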

For actual users, the relevant measure isn’t the total number of features supported in SAS and R; it’s how those features align with user needs.

N.B. — Some readers may quibble with my use of statistics from a single CRAN mirror as representative of the R community at large.  It’s a fair point — there are at least 105 public CRAN mirror sites worldwide — but given RStudio’s strong market presence it’s a reasonable proxy.

(5) Switching from SAS to R is expensive because you have to rewrite all of your code.

It’s true that when switching from SAS to R you have to rewrite programs that you want to keep; there is no engine that will translate SAS code to R code. However, SAS users tend to overestimate the effort and cost to accomplish this task.

Analytic teams that have used SAS for some years typically accumulate a large stock of programs and data; much of this accumulation, however, is junk that will never be re-used.    Keep in mind that analytic users don’t work the same way as software developers in IT or a software engineering organization.  Production developers tend to work in a collaborative environment that ensures consistent, reliable and stable results.  Analytic users, on the other hand, tend to work individually on ad hoc analysis projects; they are often inconsistently trained in software best practices.

When SAS users are pressed to evaluate a library of existing programs and identify the “keepers”, they rarely identify more than 10-20% of the existing library.  Hence, the actual effort and expense of program conversion should not be a barrier for most organizations if there is a compelling business case to switch.

It’s also worth noting that sticking with SAS does not free the organization from the cost of code migration, as SAS customers discovered when SAS 9 was released.

The real cost of switching from SAS to R is measured in human capital — in the costs of retraining skilled professionals.  For many organizations, this is a deal-breaker at present; but as more R-savvy analysts enter the workforce, the costs of switching will decline.

(6) R is a good choice when working with Big Data.

When working with Big Data, neither “legacy” SAS nor open source R is a good choice, for different reasons.

Open source R runs in memory on a single machine; it can work with data up to available memory, then fails.  It is possible to run R in a Hadoop cluster or as table functions inside MPP databases.  However, since R runs independently on each node, this is useful only for embarrassingly parallel tasks; for most advanced analytics tasks, you will need to invoke a distributed analytics engine.   There are a number of distributed engines you can invoke from R, including H2O, ScaleR and Skytree, but at this point R is simply a client and the actual work is done by the distributed engine.

“Legacy” SAS uses file-swapping to handle out-of-memory problems, but at great cost to performance; when a data set is too large to load into memory, “legacy” SAS slows to a crawl.  Through SAS/ACCESS, SAS supports the ability to pass through SQL operations to MPP databases and HiveQL, MapReduce and Pig to Hadoop; however, as is the case with R, “legacy” SAS simply functions as a client and the work is done in the database or Hadoop.  The user can accomplish the same tasks using any SQL or Hadoop interface.

To its credit, SAS also offers distributed in-memory software that runs inside Hadoop (the SAS High-Performance Analytics suite and SAS In-Memory Statistics for Hadoop).  Of course, these products do not replicate “legacy” SAS; they are entirely new products that support a subset of “legacy” SAS functionality at extra cost.  Some migration may be required, since they run DS2 but not the traditional SAS DATA Step.  (I cite these points not to denigrate the new SAS software, which appears to be well designed and implemented,  but to highlight the discontinuity for SAS users between the “legacy” product and the scalable High Performance products.)

If your organization works with Big Data, your primary focus should be on choosing the right scalable analytics platform, with secondary emphasis on the client or API used to invoke it.

Analytic Startups: Skytree

Skytree started out as an academic machine learning project developed at Georgia Tech’s FastLab.  Leadership shopped the software to a number of software vendors prior to 2011 and, finding no buyers, launched as a standalone venture in 2012.

In April 2013, Skytree announced Series A funding of $18 million, with backing from U.S. Venture Partners, UPS, Javelin Venture Partners and Osage University Partners.   LinkedIn shows 18 U.S. employees for the company.

Skytree’s public reference customers include Adconian, Brookfield Residential Property Services, CANFAR, eHarmony, SETI Institute and United States Golf Association.  This customer list did not change in 2013 despite significant investment in marketing and sales.

Skytree has formally partnered with Cloudera, Hortonworks and MapR.

Compared to its peers, Skytree reveals very little about its technology, which is generally a yellow flag.

Skytree’s principal product is Skytree Server, a server-based library of distributed algorithms.   Skytree claims to support the following techniques:

  • Support Vector Machines (SVM)
  • Nearest Neighbor
  • K-Means
  • Principal Component Analysis (PCA)
  • Linear Regression
  • Two-Point Correlation
  • Kernel Density Estimation (KDE)
  • Gradient Boosted Trees
  • Random Forests

Skytree does not show images or videos of its user interface anywhere on its website.  The implication is that it lacks a visual interface, and programming is required.  Skytree claims a web services interface as well as interfaces to R, Weka, C++ and Python.

For data sources, Skytree claims the ability to connect to relational databases (presumably through ODBC); Hadoop (presumably HDFS); and to consume data from flat files and “common statistical packages”.

Skytree claims the ability to deploy on commodity Linux servers in local, cluster, cloud or Hadoop configurations.  (Absent YARN support, though, the latter will be a “beside” architecture, with data movement).

A second product, Skytree Advisor, launched in Beta in September.  Skytree Advisor is mostly interesting for what it reveals about Skytree Server.  The product includes some unique capabilities, including the ability to produce an actual report, but the user interface evokes a blue screen of death.   The status of this offering seems to be in doubt, as Skytree no longer promotes it.

Analytic Startups: 0xdata (Updated May 2014)

Updated May 22, 2014

0xdata (“Hexa-data”) is a small group of smart people from Stanford and Silicon Valley with VC backing and an open source software project for advanced analytics (H2O).  Founded in 2011, 0xdata first appeared on analyst dashboards in 2012 and has steadily built a presence in the data science community since then.

0xdata operates on a services business model, and does not offer commercially licensed software.  The firm has four public reference customers and claims more than 2,000 users.  0xdata has formal partnerships with Cloudera, Hortonworks, Intel and MapR.

0xdata’s H2O project is a library of distributed algorithms designed for deployment in Hadoop or free-standing clusters.  0xdata licenses H2O under the Apache 2.0 open source license.  The development team is very active; in the thirty days ended May 22, 19 contributors pushed 783 commits to the project’s GitHub repository.

The roadmap is aggressive; as of May 2014, the library includes generalized linear models, k-Means, gradient boosting and a growing list of other algorithms.

For Generalized Linear Models, k-Means and Gradient Boosting, H2O supports a Grid Search feature enabling users to specify multiple models for simultaneous development and comparison.   This feature is a significant timesaver when the optimal model parameters are unknown (which is ordinarily the case).

Users interact directly with the software through a web browser or REST API.  Alternatively, R users can use the H2O.R package to invoke algorithms from RStudio or an alternative R development environment.  (Video demo here).  Scala users can work with H2O through the Scalala library.

For Hadoop deployment, H2O supports CDH4.x, MapR 2.x and AWS EC2.   H2O integrates with HDFS, and is co-located within Hadoop.   At present, H2O supports CSV, Gzip-compressed CSV, MS Excel (XLS), ARFF, Hive file format, “and others”.

Each H2O algorithm supports scoring and prediction capability.   There is currently no facility for PMML export; this is unnecessary if H2O is deployed in Hadoop (since one can simply use the native prediction capability).

In March, the Apache Mahout project announced that it will support H2O.

Strata Report: Advanced Analytics in Hadoop

Here is a quick review of the capabilities for advanced analytics in Hadoop for five vendors at the recent Strata NYC conference:

0xdata

Product(s)

  • H2O (open source project)
  • h2o (R package)

Description

Smart people from Stanford with VC backing and a social media program.   Services business model with open source software.  H2O is an open source library of algorithms designed for deployment in Hadoop or free-standing clusters;  aggressive vision, but currently available functionality limited to GLM, k-Means, Random Forests.   Update: 0xdata just announced H2O 2.0, which includes Distributed Trees and Regression, such as Gradient Boosting Machine (GBM), Random Forest (RF), Generalized Linear Modeling (GLM), k-Means and Principal Component Analysis (PCA).  They also claim to run “100X faster than other predictive analytics providers”, although this claim is not supported by evidence.  R users can interface through the h2o package.  Limited customer base.  Partners with Cloudera and MapR.

Key Points

  • True open source model
  • Comprehensive roadmap
  • Limited functionality
  • Limited user base
  • Performance claims undocumented

Alpine Data Labs

Product(s)

  • Alpine 2.8

Description

Alpine targets a business user persona with a visual workflow-oriented interface (comparable to SAS Enterprise Miner or SPSS Modeler).   Supports a reasonably broad range of analytic features.  Claims to run “in” a number of databases and Hadoop distributions, but company is opaque about how this works.  (Appears to be SQL/HiveQL push-down).   In practice, most customers seem to use Alpine with Greenplum.  Thin sales and customer base relative to claimed feature mix suggests uncertainty about product performance and stability.  Partners with Pivotal, Cloudera and MapR.

Key Points

  • Reasonable option for users already committed to Greenplum Database
  • Limited partner and user ecosystem
  • Performance and stability should be vetted thoroughly in POC

Oracle

Product(s)

  • Oracle R Distribution
  • Oracle R Enterprise
  • Oracle Advanced Analytics
  • Oracle R Connector for Hadoop

Description

Oracle R Distribution (ORD) is a free distribution of R with bug fixes and performance enhancements; Oracle R Enterprise is a supported version of ORD with additional enhancements (detailed below).

Oracle Advanced Analytics (an option of Oracle Database Enterprise Edition) bundles Oracle Data Mining, a distributed data mining engine that runs in Oracle Database, and Oracle R Enterprise.   Oracle Advanced Analytics provides an R to SQL transparency layer that maps R functions and algorithms to native in-database SQL equivalents.  When in-database equivalents are not available, Oracle Advanced Analytics can run R commands under embedded R mode.

Oracle R Connector for Hadoop (ORCH) is an R interface to Hadoop; it enables the user to write MapReduce tasks in R and interface with Hive.  As of ORCH 2.1.0, there is also a fairly rich collection of machine learning algorithms for supervised and unsupervised learning that can be pushed down into Hadoop.

Key Points

  • Good choice for Oracle-centric organizations
  • Oracle Data Mining is a mature product with an excellent user interface
  • Must move data from Hadoop to Oracle Database to leverage OAA
  • Hadoop push-down from R requires expertise in MapReduce

SAS

Products

  • SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server

Description

SAS/ACCESS Interface to Hadoop enables SAS users to pass Hive, Pig or MapReduce commands to Hadoop through a connection and move the results back to the SAS server.   With SAS/ACCESS you can haul your data out of Hadoop, plug it into SAS and use a bunch of other SAS products, but that architecture is pretty much a non-starter for most Strata attendees.   Update:  SAS has announced SAS/ACCESS for Impala.

Visual Analytics is a Tableau-like visualization tool with limited predictive analytic capabilities; LASR Server is the in-memory back end for Visual Analytics.  High Performance Analytics is a suite of distributed in-memory analytics.   LASR Server and HPA Server can be co-located in a Hadoop cluster, but require special hardware.  Partners with Cloudera and Hortonworks.

Key Points

  • Legacy SAS connects to Hadoop, does not run in Hadoop
  • SAS/ACCESS users must know exact Hive, Pig or MapReduce syntax
  • Visual Analytics cannot work with “raw” data in Hadoop
  • Minimum hardware requirements for LASR and HPA significantly exceed standard Hadoop worker node specs
  • High TCO, proprietary architecture for all SAS products

Skytree

Product(s)

  • Skytree Server

Description

Skytree began as an academic machine learning project (FastLab, at Georgia Tech); with VC backing, it launched as a commercial software vendor in January 2013.  Server-based technology that can connect to a range of data sources, including Hadoop.  Programming interface; claims the ability to run from R, Weka, C++ and Python.  Good library of algorithms.  Partners with Cloudera, Hortonworks and MapR.  Skytree is opaque about technology and performance claims.

Key Points

  • Limited customer base, no announced sales since company launch
  • Hadoop integration is a connection, not “inside” architecture
  • Performance claims should be carefully vetted

SAS and Hadoop

SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities.    Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.

Prior to January 2012, a search for the words "Hadoop" or "MapReduce" returned no results on the SAS marketing and support websites, which says something about SAS' leadership in this area.  In March 2012, SAS announced support for Hadoop connectivity; since then, SAS has gradually expanded the features it supports with Hadoop.

As of today, there are four primary ways that a SAS user can leverage Hadoop:

  • SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server

Let’s take a look at each option.

“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc.) that are used primarily from a programming interface.  SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing.  It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user.  For more detailed information, read the manual.

SAS/ACCESS also supports six “Hadoop-enabled” PROCs (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop.  If all you need to do is run frequency distributions, simple statistics and summary reports, then SAS offers everything you need for analytics in Hadoop.  If that is all you want to do, of course, you can use Datameer or Big Sheets and save on SAS licensing fees.
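
For illustration, here is a minimal sketch of what one of the Hadoop-enabled PROCs looks like in practice; the server, credentials and table names are hypothetical, and exact LIBNAME options vary by release:

    /* Hypothetical connection details for SAS/ACCESS Interface to Hadoop */
    libname hdp hadoop server="hive-node.example.com" port=10000
            user=sasdemo password=XXXXXXXX;

    /* FREQ is one of the six Hadoop-enabled PROCs, so this summarization
       can run inside the cluster instead of extracting the table first */
    proc freq data=hdp.weblogs;
       tables status_code;
    run;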

A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps.  (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce.)  SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document); if SAS cannot implicitly translate a DATA Step into SQL/HiveQL, it copies the data back to the SAS server (without warning) and performs the operation there.

SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through.  Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”.  The bottom line is that since users need to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, they might as well submit jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.
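
As a sketch of what explicit pass-through looks like in practice (the connection details, table and HiveQL here are hypothetical):

    proc sql;
       connect to hadoop (server="hive-node.example.com" port=10000 user=sasdemo);
       /* The HiveQL inside the inner parentheses is passed through verbatim;
          SAS does not parse or translate it */
       create table work.page_hits as
       select * from connection to hadoop
          ( select page, count(*) as hits
            from weblogs
            group by page );
       disconnect from hadoop;
    quit;

The user writes the HiveQL; SAS simply ships it to the cluster and lands the result set, which is exactly why Hive expertise is a prerequisite.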

SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.

SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera.  Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes”, which means that in practice customers must rebuild existing predictive models to take advantage of the product.  Customers who already use SAS Enterprise Miner can export models in PMML, use them in any PMML-enabled database or decision engine, and spend less on SAS licensing fees.

Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server.   These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture.  That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.

That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative.  In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time.  SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine.  Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode.  This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.

While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format.   To import the data into SASHDAT, you will need to license SAS Data Integration Server.

A single in-memory node supported by a 16-core/256GB machine can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes.  SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations.  Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.

SAS seems to be doing a little better selling SAS VA/LASR Server; they have a big push on in 2013 to sell 2,000 copies of VA, and heavily promote a one-node version on a big H-P machine for $100K.  Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.

While SAS has struggled to implement its in-memory software in Hadoop to date,  YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop.  Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.

SAS Visual Analytics: FAQ (Updated 1/2014)

SAS charged its sales force with selling 2,000 licenses for Visual Analytics in 2013; the jury is still out on whether they met this target.  There’s lots of marketing action lately from SAS about this product, so here’s an FAQ.

Update:  SAS recently announced 1,400 sites licensed for Visual Analytics.  In SAS lingo, a site corresponds roughly to one machine, but one license can include multiple sites, so the actual number of licenses sold in 2013 is less than 1,400.  In April 2013, SAS executives claimed two hundred customers for the product.  In contrast, Tableau reports that it added seven thousand customers in 2013, bringing its total customer count to 17,000.

What is SAS Visual Analytics?

Visual Analytics is an in-memory visualization and reporting tool.

What does Visual Analytics do?

SAS Visual Analytics creates reports and graphs that are visually compelling.  You can view them on mobile devices.

VA is now in its fifth dot release.  Why do they call it Release 6.3?

SAS Worldwide Marketing thinks that if they call it Release 6.3, you will think it’s a mature product.  It’s one of the games software companies play.

Is Visual Analytics an in-memory database, like SAP HANA?

No.  HANA is a standards-based in-memory database that runs on many different brands of hardware and supports a range of end-user tools.  VA is a proprietary architecture available on a limited choice of hardware platforms.  It cannot support anything other than the end-user applications SAS chooses to develop.

What does VA compete with?

SAS claims that Visual Analytics competes with Tableau, Qlikview and Spotfire.  Internally, SAS leadership refers to the product as its “Tableau-killer” but as the reader can see from the update at the top of this page, Tableau is alive and well.

How well does it compare?

You will have to decide for yourself whether VA reports are prettier than those produced by Tableau, Qlikview or Spotfire.  On paper, Tableau has more functionality.

VA runs in memory.  Does that make it better than conventional BI?

All analytic applications perform computations in memory.  Tableau runs in memory, and so does Base SAS.   There’s nothing unique about that.

What makes VA different from conventional BI applications is that it loads the entire fact table into memory.  By contrast, BI applications like Tableau query a back-end database to retrieve the necessary data, then perform computations on the result set.

Performance of a conventional BI application depends on how fast the back-end database can retrieve the data.  With a high-performance database the performance is excellent, but in most cases it won’t be as fast as it would if the data were held in memory.

So VA is faster?  Is there a downside?

There are two.

First, since conventional BI systems don’t need to load the entire fact table into memory, they can work with much larger datastores.  The largest H-P ProLiant box for VA maxes out at about 10 terabytes; the smallest Netezza appliance supports 30 terabytes and scales to petabytes.

The other downside is cost; memory is still much more expensive than other forms of storage, and the machines that host VA are far more expensive than data warehouse appliances that can host far more data.

VA is for Big Data, right?

SAS and H-P appear to be having trouble selling VA in larger sizes, and are positioning a small version that can handle 75-100 gigabytes of data.  That’s tiny.

The public references SAS has announced for this product don’t seem particularly large.  See below.

How does data get into VA?

VA can load data from a relational database or from a proprietary SASHDAT file.  SAS cautions that loading data from a relational database is only a realistic option when VA is co-located in a Teradata Model 720 or Greenplum DCA appliance.

To use SASHDAT files, you must first create them using SAS.
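
A minimal sketch of that step, assuming the SAS 9.4 SASHDAT LIBNAME engine; the grid host, HDFS path and TKGrid install location are hypothetical, and exact options vary by release:

    /* Point a SASHDAT libref at a directory in HDFS (hypothetical paths) */
    libname hdfs sashdat path="/user/va/data"
            server="gridhost.example.com" install="/opt/TKGrid";

    /* Writes work.sales out as sales.sashdat in HDFS, where LASR can load it */
    data hdfs.sales (replace=yes);
       set work.sales;
    run;

Note that this step runs through SAS, which is the point: there is no way to register raw HDFS data with VA directly.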

Does VA work with unstructured data?

VA works with structured data, so unstructured data must be structured first, then loaded either to a co-located relational database or to SAS’ proprietary SASHDAT format.

Unlike products like Datameer or IBM Big Sheets, VA does not support “schema on read”, and it lacks built-in tools for parsing unstructured text.

But wait, SAS says VA works with Hadoop.  What’s up with that?

A bit of marketing sleight of hand.  VA can load SASHDAT files that are stored in the Hadoop File System (HDFS); but first, you have to process the data in SAS, then load it back into HDFS.  In other words, you can’t visualize and write reports from the data that streams in from machine-generated sources, the kind of live BI that makes Hadoop really cool.  You have to batch the data, parse it, structure it, then load it with SAS to VA’s staging area.

Can VA work with streaming data?

SAS sells tools that can capture streaming data and load it to a VA data source, but VA works with structured data at rest only.

With VA, can my users track events in real time?

Don’t bet on it.  To be usable, data requires significant pre-processing before it is loaded into VA’s memory.  Moreover, once loaded, the data can’t be updated; updating the data in VA requires a full truncate and reload.  Thus, however fast VA is in responding to user requests, your users won’t be tracking clicks on their iPads in real time; they will be looking at yesterday’s data.

Does VA do predictive analytics?

Visual Analytics 6.1 can perform correlation, fit bivariate trend lines to plots and do simple forecasting.  That’s no better than Tableau; indeed, despite the hype, Tableau actually supports more analysis functions.

While SAS claims that VA is better than SAP HANA because “HANA is just a database”, the reality is that SAP supports more analytics through its Predictive Analytics Library than SAS supports in VA.

Has anyone purchased VA?

A SAS executive claimed 200 customers in early 2013, a figure that should be taken with a grain of salt.  If there are that many customers for this product, they are hiding.

There are five public references, all of them outside the US:

SAS has also recently announced selection (but not implementation) by

OfficeMax has also purchased the product, according to this SAS blog.

As of January 2014, the four customers who announced selection or purchase are not cited as reference customers.

What about implementation?  This is an appliance, right?

Wrong.  SAS considers an implementation that takes a month to be wildly successful.  Implementation tasks include the same tasks you would see in any other BI project, such as data requirements, data modeling, ETL construction and so forth.  All of the back-end feeds must be built to put data into a format that VA can load.

Bottom line, does it make sense to buy SAS Visual Analytics?

Again, you will have to decide for yourself whether the SAS VA reports look better than Tableau or the many other options in this space.  BI beauty shows are inherently subjective.

You should also demand that SAS prove its claims to performance in a competitive POC.  Despite the theoretical advantage of an in-memory architecture, actual performance is influenced by many factors.  Visitors to the recent Gartner BI Summit who witnessed a demo were unimpressed; one described it to me as “dog slow”.  She didn’t mean that as a compliment.

The high cost of in-memory platforms means that VA and its supporting hardware will be much more expensive for any given quantity of data than Tableau or equivalent products.  Moreover, its proprietary architecture means you will be stuck with a BI silo in your organization unless you are willing to make SAS your exclusive BI provider.  That makes this product very good for SAS; the question is whether it is good for you.

The early adopters for this product appear to be very SAS-centric organizations (with significant prior SAS investment).  They also appear to be fairly small.  If you have very little data, money to burn and are willing to experiment with a relatively new product, VA may be for you.