SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures:

(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.

(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.

SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post.  At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server.   Paul was most generous with his time.  We discussed three areas:

(1) Can SAS LASR Server work directly with data in Hadoop?

According to SAS documentation, LASR Server can read data from traditional SAS datasets, relational databases (using SAS/Access Software) or data stored in SAS’ proprietary SASHDAT format.   That suggests SAS users must preprocess Hadoop data before loading it into LASR Server.

Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database. (Of course, this applies to structured data only). Reading from SASHDAT is much faster, however, so users should consider the tradeoff between the time needed to pre-process data into SASHDAT versus the runtime with SAS/ACCESS.

SAS/ACCESS Interface to Hadoop can read all widely used Hadoop data formats, including ORC, Parquet and Tab-Delimited; it can also read user-defined formats.  This builds on SAS’ long-standing ability to work with enterprise data everywhere.

Base SAS supports basic data cleansing and data transformation capabilities through DATA Step and DS2 processing, and can write SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, this transformation could require extraction and movement of the data to an external server.   Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to perform data transformation in place.   Users can also license SAS ETL Server and build a process to convert raw data and store it in SASHDAT.

SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.

(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?

At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.

Paul notes that given a fixed number of task tracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workloads.  This issue is not unique to SAS, but potentially applies to any software co-located with Hadoop prior to the introduction of YARN.

Under Hadoop 1.0, Hadoop workload management was tightly married to MapReduce.  Applications operating independently from MapReduce (like SAS) were essentially ungoverned.  The introduction of YARN late last year eliminates this issue because it supports unified workload management for MapReduce and non-MapReduce applications.

(3) Can SAS LASR Server run on standard commodity hardware?

SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.

While the definition of a “standard” Hadoop DataNode server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128GB RAM; sources at Cloudera confirm this is a typical configuration.   A recently published paper from HP and Hortonworks positions the reference spec at 96GB RAM for memory-intensive applications.

In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256GB RAM.

It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations seeking to use Spark or Impala should specify more memory.   While some prospective customers may balk at the task of upgrading memory in every DataNode of a large cluster, the cost of memory is coming down, so this should not be an issue in the long run.

Spark 1.1 Update

For an overview of Spark, see the Apache Spark Page.

On September 11, the Spark team announced the release of Spark 1.1.   This latest version of Spark includes a number of significant enhancements:

  • As announced at the Spark Summit, Shark is now converged with Spark SQL.  Databricks has migrated its Shark workloads to Spark, and reports 2X-5X performance improvement.
  • The team has added a library of basic statistics for exploratory analysis, including correlations and hypothesis testing.  There are also new tools for stratified sampling and random generation.  (A short usage sketch follows this list.)
  • Also new to MLLib: utilities for feature extraction for text mining and feature transformation.  Feature extraction techniques include Word2Vec and TF-IDF;  transformation techniques include normalization and scaling.
  • New MLLib algorithms include non-negative matrix factorization and singular value decomposition (SVD) using the Lanczos algorithm.  The combination of feature extraction capabilities and a robust SVD give Spark a strong foundation for text mining.
  • For Spark Streaming, the team has added support for Amazon Kinesis and a streaming linear regression algorithm.
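
To make the statistics additions concrete, here is a minimal sketch of the new MLLib statistics utilities from Python.  The data is invented for illustration, and the module paths assume the PySpark API introduced around this release.

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="mllib-stats-sketch")

# a tiny, made-up dataset: each Vector is one observation with three columns
observations = sc.parallelize([
    Vectors.dense([1.0, 10.0, 100.0]),
    Vectors.dense([2.0, 20.0, 200.0]),
    Vectors.dense([3.0, 30.0, 300.0]),
])

summary = Statistics.colStats(observations)   # column means, variances, counts
print(summary.mean())

corr_matrix = Statistics.corr(observations, method="pearson")
print(corr_matrix)                            # pairwise column correlations
```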

There are also many bug fixes, as well as performance and usability improvements.  With ~175 contributors for this release, Spark continues to be one of the most active projects in the Hadoop ecosystem.

Since release of Spark 1.0, Databricks has announced certification for three additional Spark distributions:

  • Bluedata, a pioneer in big data private cloud.
  • Guavus, an operational intelligence platform.
  • Stratio, a commercially supported open source “Pure Spark” distribution.

In related news, Databricks and O’Reilly Media recently announced a certification program, which will be launched October 15-17 at Strata NY + Hadoop World.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity require more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefits of parallel computing are speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.
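
A minimal sketch of why scoring is embarrassingly parallel: each record is scored using only that record and a fixed set of coefficients, so records can be split across any number of workers and the results simply concatenated.  The model and data below are invented for illustration.

```python
import math

coefficients = {"intercept": -1.2, "age": 0.03, "balance": 0.0004}

def score(record):
    # uses only this record and the shared, read-only coefficients
    z = (coefficients["intercept"]
         + coefficients["age"] * record["age"]
         + coefficients["balance"] * record["balance"])
    return 1.0 / (1.0 + math.exp(-z))

records = [{"age": 35, "balance": 1200.0},
           {"age": 52, "balance": 300.0},
           {"age": 41, "balance": 8800.0}]

# Because score() never looks at more than one record, the records can be
# partitioned across any number of workers and the results concatenated.
scores = [score(r) for r in records]
print(scores)
```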

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently on each worker, then computing the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.
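
Here is a small sketch of that grand-mean calculation; the lists below stand in for chunks of a distributed table, and each (mean, count) pair is what a worker would return.

```python
# three "workers", each holding one chunk of the data
chunks = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]

# each worker computes its own mean and row count independently
partials = [(sum(c) / len(c), len(c)) for c in chunks]

# the driver combines them: a weighted mean of the worker means
grand_mean = (sum(m * n for m, n in partials)
              / sum(n for _, n in partials))
print(grand_mean)   # identical to the mean computed over all rows at once
```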

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.
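
A hedged PySpark sketch of the retail example: rows are keyed by store, shuffled so that each store’s history lands on one worker, and then a per-store forecasting routine runs independently.  The file path, field layout and forecast_store function are all hypothetical stand-ins.

```python
from pyspark import SparkContext

sc = SparkContext(appName="per-store-forecasts")

def parse(line):
    # hypothetical layout: store_id,date,sales
    store_id, date, sales = line.split(",")
    return (store_id, (date, float(sales)))

def forecast_store(history):
    # stand-in for a real single-series forecasting routine:
    # project the store's mean daily sales forward
    sales = [s for _, s in history]
    return sum(sales) / len(sales)

rows = sc.textFile("hdfs:///retail/sales.csv").map(parse)

# groupByKey forces the shuffle described above: every row for a given
# store must land on one worker before that store's model can be fit
forecasts = rows.groupByKey().mapValues(forecast_store)
print(forecasts.take(5))
```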

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.
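
The sketch below illustrates the complex parallel pattern with a toy distributed gradient descent for a straight-line fit.  The workers are simulated here as list chunks, but the structure is the point: partial gradients are computed locally, combined by a driver, and updated weights are re-broadcast on every iteration, which is exactly the communication that embarrassingly parallel tasks avoid.

```python
# (x, y) pairs held by three simulated workers; y is roughly 2x + 1
chunks = [
    [(1.0, 3.1), (2.0, 4.9)],
    [(3.0, 7.2), (4.0, 8.8)],
    [(5.0, 11.1), (6.0, 12.9)],
]

w, b = 0.0, 0.0          # model weights held by the driver
rate = 0.01

for _ in range(2000):    # multiple passes over the data
    grads = []
    for chunk in chunks:             # each block would run on its own worker
        gw = sum(2 * (w * x + b - y) * x for x, y in chunk)
        gb = sum(2 * (w * x + b - y) for x, y in chunk)
        grads.append((gw, gb, len(chunk)))
    n = sum(c for _, _, c in grads)
    w -= rate * sum(g for g, _, _ in grads) / n   # combine partial gradients
    b -= rate * sum(g for _, g, _ in grads) / n   # then re-broadcast w and b

print(w, b)   # converges toward slope ~2 and intercept ~1
```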

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.
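
For example, here is roughly what happens when a single-machine learner is pushed into a distributed job without a distributed algorithm behind it, sketched with PySpark and scikit-learn (both assumed to be installed on every node; the data is invented): each partition yields its own model, and combining them is left entirely to the user.

```python
import numpy as np
from pyspark import SparkContext
from sklearn.linear_model import LogisticRegression

sc = SparkContext(appName="local-models-per-partition")

# invented data: two features plus a 0/1 label, spread over four partitions
rows = sc.parallelize(
    [(0.2, 1.1, 0.0), (1.5, 0.3, 1.0), (0.7, 0.9, 0.0), (2.2, 0.1, 1.0)] * 50,
    numSlices=4)

def fit_local(partition):
    data = np.array(list(partition))
    X, y = data[:, :-1], data[:, -1]
    model = LogisticRegression().fit(X, y)   # a purely local fit
    yield model.coef_.ravel()

local_coefs = rows.mapPartitions(fit_local).collect()

# local_coefs holds four separate coefficient vectors, one per partition;
# any averaging or ensembling of these models has to be designed and
# coded by the analyst, because nothing here does it automatically.
print(local_coefs)
```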

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADlib should run in any SQL environment, but the Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Smart Money: More Funding for Analytics

Funding for analytic ventures remained robust in January, with 17 significant funding transactions and three acquisitions.   Key themes:

  • Outcomes-based medicine and health care
  • Vertical solutions for the energy industry
  • Solutions for risk management
  • Mobile analytics, including location-based targeting and app metrics
  • Social media sentiment analysis
  • Graph engines (and solutions based on graph engines)
  • In-memory SQL engines

All funding news via Crunchbase.

Funding

Health Catalyst led the way with $41 million in Series C funding.   Health Catalyst offers a solution stack consisting of a proprietary data warehouse optimized for electronic medical records, plus analytic applications designed to support outcomes-based health care.

Other transactions greater than $1 million include:

— MemSQL, provider of a high performance in-memory distributed database, raised $35 million in a Series B round.

— Still in stealth mode, marketing analytics provider Origami Logic closed on $15 million in Series B funding.

— Kreditech scored $15 million in debt financing.  Kreditech uses machine learning and Big Data to offer credit scoring for microlending.

— Radius closed on $13 million in Series B funding.  Radius supports B2B targeted marketing and lead generation for small businesses.

— Smart grid analytics provider AutoGrid landed $12.8 million in Series C funding.

— GNS Healthcare leverages Bayesian Networks and Monte Carlo Simulation to deliver solutions for outcomes-based medicine to hospitals, health insurance plans, pharmaceutical companies and other entities in the health care delivery chain.  GNS completed $10 million in Series B financing.

— Simple Energy raised $6 million in Series B funding.  Simple Energy offers services that help utilities improve customer interactions through microtargeting and social gaming.

— Binary Fountain, provider of software integrating social sentiment analysis with BPM, raised $5.7 million.

— 4C Insights integrates social media sentiment analysis with public data to support media planning and targeting.   The firm raised $5 million in Series B funding.

— Kontagent secured $4.8 million in venture funding.  Kontagent offers mobile analytic solutions to mobile app developers and marketers.

— Offshore analytic services provider Axtria received $4.8 million in venture funding.

— Enigma Technologies raised $4.5 million in Series A funding.  Enigma provides a platform for the analysis of public data that includes a repository and directory to sources, plus tools for search, export and simple analytics.

— Lumiata raised $4 million in Series A funding.  Lumiata leverages graph engine technology to deliver evidence-based predictions to medical practitioners.

— BI vendor Chartio received $2.2 million in venture funding

— Bottlenose, purveyor of dashboard and insight tools for social sentiment analysis, raised $1.1 million in debt financing.

— Geofeedia, a provider of open source location-based social media mining tools, received $1.25 million in Series A funding.

Acquisitions

There were three acquisitions of note; purchase prices were not disclosed.

— yp, the corporate successor to AT&T Interactive and AT&T Advertising Solutions, acquired Sense Networks on January 6.   Sense Networks uses predictive analytics to drive location-based behavioral targeting for mobile ad platforms.

— Pinterest acquired VisualGraph on January 6.  VisualGraph, a two-man operation, has developed a distributed in-memory visual search engine.

— Apigee, an API management company, acquired InsightsOne on January 8.   InsightsOne offers cloud-based infrastructure for predictive analytics based on Hadoop, plus an in-memory graph engine.

Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)

Updated and bumped July 10, 2014.

For a powerpoint version on Slideshare, go here.

Introduction

Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop.  Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014.  According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.

Organizations seeking to implement advanced analytics in Hadoop face two key challenges.  First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.

A second key challenge is the plethora of analytic point solutions in Hadoop.  These include, among others, Mahout for machine learning; Giraph and GraphLab for graph analytics; Storm and S4 for streaming; and Hive, Impala and Stinger for interactive queries.  Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.

Spark directly addresses these challenges.  It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data.  This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.

Second, Spark offers an integrated framework for analytics, including machine learning (MLLib), SQL (Spark SQL), graph analytics (GraphX) and streaming analytics (Spark Streaming).

A closely related project, Shark, supports fast queries in Hadoop.  Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project.  The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.

Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs.  RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs.  RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence.  They are designed to be fault tolerant: if a partition is lost, it can be reconstructed by replaying its lineage.
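
A minimal PySpark sketch of how RDDs behave in practice (the log path is hypothetical): each transformation extends the lineage lazily, persistence is an optional hint, and nothing executes until an action is called.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-lineage-sketch")

lines = sc.textFile("hdfs:///logs/access.log")     # stable source data
errors = lines.filter(lambda l: "ERROR" in l)      # derived RDD, recorded lazily
fields = errors.map(lambda l: l.split("\t"))       # another step in the lineage
fields.persist()                                   # optional persistence hint

print(fields.count())   # the action: only now does any computation run
```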

For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase).  Spark supports text files, SequenceFiles and any other Hadoop InputFormat.  Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.

Analytic Features

Spark’s machine learning library, MLLib, is rapidly growing.   In Release 1.0.0 (the latest release) it includes:

  • Linear regression
  • Logistic regression
  • k-means clustering
  • Support vector machines
  • Alternating least squares (for collaborative filtering)
  • Decision trees for classification and regression
  • Naive Bayes classifier
  • Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
  • Model evaluation functions
  • L-BFGS optimization primitive

Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization.  MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).
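
For illustration, here is a hedged sketch of fitting one of these models from Python; the class names follow the MLLib API of this era (LogisticRegressionWithSGD), and the input path and file layout are assumptions.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="mllib-logit-sketch")

def to_point(line):
    # hypothetical layout: label first, comma-separated features after
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("hdfs:///data/labeled.csv").map(to_point)

# gradient descent under the hood; regularization options are available
# through additional training parameters
model = LogisticRegressionWithSGD.train(data, iterations=100)

print(model.weights)
```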

In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark.  Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark.  The Mahout team will maintain existing MapReduce projects.  There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.

Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.
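
A short sketch of the Spark SQL workflow described above, using the early (1.x-era) Python API; method names changed in later releases, and the JSON file and its schema are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-sql-sketch")
sqlContext = SQLContext(sc)

# build a SchemaRDD from a (hypothetical) JSON dataset
people = sqlContext.jsonFile("hdfs:///data/people.json")
people.registerAsTable("people")          # expose it to SQL queries

adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
for row in adults.collect():
    print(row[0])
```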

GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework.  It enables users to interactively load, transform, and compute on massive graphs.  Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.

Spark Streaming offers an additional abstraction called discretized streams, or DStreams.  DStreams are a continuous sequence of RDDs representing a stream of data.  The user creates DStreams from live incoming data or by transforming other DStreams.  Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.
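
The sketch below shows the DStream pattern in Python for a word count over socket data.  Note that the Python API for Spark Streaming arrived in a later Spark release than the ones covered in this post, so treat this purely as an illustration of the micro-batch model.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-sketch")
ssc = StreamingContext(sc, batchDuration=5)       # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # live source becomes a DStream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))  # each batch is processed as an RDD

counts.pprint()          # print a sample of each batch's result

ssc.start()
ssc.awaitTermination()
```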

Currently, Spark supports programming interfaces for Scala, Java and Python;  MLLib algorithms support sparse feature vectors in all three languages.  For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014.

There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0.  In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined.   In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.  So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.

Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release of GraphX, enhancements to MLLib and many other improvements.  Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.

Distribution

Spark is now available in every major Hadoop distribution.  Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks.  (For more on Cloudera’s support for Spark, go here).  In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.

Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.

IBM’s commitment to Spark is unclear.  While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.

In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine.  Datastax will partner with Databricks on this project; availability expected summer 2014.

At the 2014 Spark Summit, SAP announced its support for Spark.  SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.

On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem.   The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.

At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.

Ecosystem

In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.

Databricks offers a certification program for Spark; participants currently include:

In May, Databricks and Concurrent Inc announced a strategic partnership.  Concurrent plans to add Spark support to its Cascading development environment for Hadoop.

Community

In December, the first Spark Summit attracted more than 450 participants from more than 180 companies.  Presentations covered a range of applications such as neuroscience, audience expansion, real-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).

The 2014 Spark Summit was held June 30 through July 2 in San Francisco.  The event sold out at more than a thousand participants.  For a summary, see this post.

There is a rapidly growing list of Spark Meetups, including:

Now available for pre-order on Amazon:

Finally, this series of videos provides some good basic knowledge about Spark.

SAP and SAS Couple Up

SAS and SAP announced a “strategic partnership” today at the SAP TechEd show.

According to SAS’ press release,

SAP and SAS will partner closely to create a joint technology and product roadmap designed to leverage the SAP HANA® platform and SAS analytics capabilities. By incorporating the in-memory SAP HANA platform into SAS applications and enabling SAS’ industry-proven advanced analytics algorithms to run on SAP HANA, decision makers will have the opportunity to leverage the value of real-time data analysis within their existing SAS and SAP HANA environments.

SAS and SAP plan to execute a co-sell pilot program to engage select joint customers to validate SAS applications running on SAP HANA. The goal of this program is to build and prioritize the two firms’ joint technology throughout 2014, in particular for industries such as financial services, telecommunications, retail, consumer products and manufacturing. The applications are expected to target business areas that require a combination of advanced analytics running on an in-memory platform that will be designed to yield high value results. Such opportunities exist in customer intelligence, risk management, asset management and anti-money laundering, among others.

How soon we forget; just six months ago, SAS leadership trashed SAP HANA from the stage at SAS Global Forum.

SAS and SAP share a commitment to in-memory computing, but they have fundamentally different approaches to the technology.  SAP HANA is a standards-based persistent in-memory database, with a strong vendor ecosystem.  SAS, on the other hand, builds its in-memory analytics on a proprietary architecture, and has a vendor ecosystem of one.  HANA succeeds because it is an easy decision for SAP-centric companies to adopt the product for small high-concurrency databases with one data source.   Meanwhile, even the most loyal SAS customers choke at the TCO of SAS High Performance Analytics.

In-memory databases make economic sense when (a) you don’t have much data, (b) usage is read-only, (c) users want small random packets of data, and (d) there are lots of users.   The NBA’s statistics website (powered by SAP HANA) is a perfect example: less than a terabyte of data, but up to 20,000 concurrent users seeking information about how many free throws Hal Greer hit in 1968 against the Celtics.   That’s a great application for BI tools, but not for high-end predictive analytics.  SAP’s HANA Predictive Analytics Library may be toylike, but it’s likely good enough for that use case.

SAS Visual Analytics makes more sense coupled to an in-memory database like HANA than to its existing LASR Server architecture.   It doesn’t do anything that can’t be done in Business Objects, but there are likely a few customers in the market who are both SAS-centric and have an all-SAP back end.

SAS and Hadoop

SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities.    Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.

Prior to January 2012, a search for the words “Hadoop” or “MapReduce” returned no results on the SAS marketing and support websites, which says something about SAS’ leadership in this area.  In March 2012, SAS announced support for Hadoop connectivity;  since then, SAS has gradually expanded the features it supports with Hadoop.

As of today, there are four primary ways that a SAS user can leverage Hadoop:

Let’s take a look at each option.

“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc) that are used primarily from a programming interface.  SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing.  It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user.  For more detailed information, read the manual.

SAS/ACCESS also supports six “Hadoop-enabled” PROCS (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop.  If all you need to do is run frequency distributions, simple statistics and summary reports then SAS offers everything you need for analytics in Hadoop.  If that is all you want to do, of course, you can use Datameer or Big Sheets and save on SAS licensing fees.

A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps.  (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce).  SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document);  if SAS cannot implicitly translate a DATA Step into SQL/HiveQL, it copies the data back to the SAS server (without warning) and performs the operation there.

SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through.  Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”.   The bottom line is that since the user needs to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, the user might as well submit jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.

SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.

SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera.  Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes”, which means that in practice customers must rebuild existing predictive models to take advantage of the product.   Customers who already use SAS Enterprise Miner can export the models in PMML and use them in any PMML-enabled database or decision engine, and spend less on SAS licensing fees.

Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server.   These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture.  That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.

That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative.  In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time.  SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine.  Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode.  This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.

While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format.   To import the data into SASHDAT, you will need to license SAS Data Integration Server.

A single in-memory node supported by a 16-core/256GB machine can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes.   SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations.  Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.

SAS seems to be doing a little better selling SAS VA/LASR Server; they have a big push on in 2013 to sell 2,000 copies of VA and heavily promote a one node version on a big H-P machine for $100K.  Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.

While SAS has struggled to implement its in-memory software in Hadoop to date,  YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop.  Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.

SAS Visual Analytics: FAQ (Updated 1/2014)

SAS charged its sales force with selling 2,000 licenses for Visual Analytics in 2013; the jury is still out on whether they met this target.  There’s lots of marketing action lately from SAS about this product, so here’s an FAQ.

Update:  SAS recently announced 1,400 sites licensed for Visual Analytics.  In SAS lingo, a site corresponds roughly to one machine, but one license can include multiple sites; so the actual number of licenses sold in 2013 is less than 1,400.  In April 2013 SAS executives claimed two hundred customers for the product.   In contrast, Tableau reports that it added seven thousand customers in 2013 bringing its total customer count to 17,000.

What is SAS Visual Analytics?

Visual Analytics is an in-memory visualization and reporting tool.

What does Visual Analytics do?

SAS Visual Analytics creates reports and graphs that are visually compelling.  You can view them on mobile devices.

VA is now in its fifth dot release.  Why do they call it Release 6.3?

SAS Worldwide Marketing thinks that if they call it Release 6.3, you will think it’s a mature product.  It’s one of the games software companies play.

Is Visual Analytics an in-memory database, like SAP HANA?

No.  HANA is a standards-based in-memory database that runs on many different brands of hardware and supports a range of end-user tools.  VA is a proprietary architecture available on a limited choice of hardware platforms.  It cannot support anything other than the end-user applications SAS chooses to develop.

What does VA compete with?

SAS claims that Visual Analytics competes with Tableau, Qlikview and Spotfire.  Internally, SAS leadership refers to the product as its “Tableau-killer” but as the reader can see from the update at the top of this page, Tableau is alive and well.

How well does it compare?

You will have to decide for yourself whether VA reports are prettier than those produced by Tableau, Qlikview or Spotfire.  On paper, Tableau has more functionality.

VA runs in memory.  Does that make it better than conventional BI?

All analytic applications perform computations in memory.  Tableau runs in memory, and so does Base SAS.   There’s nothing unique about that.

What makes VA different from conventional BI applications is that it loads the entire fact table into memory.  By contrast, BI applications like Tableau query a back-end database to retrieve the necessary data, then perform computations on the result set.

Performance of a conventional BI application depends on how fast the back-end database can retrieve the data.  With a high-performance database the performance is excellent, but in most cases it won’t be as fast as it would if the data were held in memory.

So VA is faster?  Is there a downside?

There are two.

First, since conventional BI systems don’t need to load the entire fact table into memory, they can support usage with much larger datastores.  The largest H-P Proliant box for VA maxes out at about 10 terabytes; the smallest Netezza appliance supports 30 terabytes, and scales to petabytes.

The other downside is cost; memory is still much more expensive than other forms of storage, and the machines that host VA are far more expensive than data warehouse appliances that can host far more data.

VA is for Big Data, right?

SAS and H-P appear to be having trouble selling VA in larger sizes, and are positioning a small version that can handle 75-100 Gigabytes of data.  That’s tiny.

The public references SAS has announced for this product don’t seem particularly large.  See below.

How does data get into VA?

VA can load data from a relational database or from a proprietary SASHDAT file.  SAS cautions that loading data from a relational database is only a realistic option when VA is co-located in a Teradata Model 720 or Greenplum DCA appliance.

To use SASHDAT files, you must first create them using SAS.

Does VA work with unstructured data?

VA works with structured data, so unstructured data must be structured first, then loaded either to a co-located relational database or to SAS’ proprietary SASHDAT format.

Unlike products like Datameer or IBM Big Sheets, VA does not support “schema on read”, and it lacks built-in tools for parsing unstructured text.

But wait, SAS says VA works with Hadoop.  What’s up with that?

A bit of Marketing sleight-of-hand.  VA can load SASHDAT files that are stored in the Hadoop File System (HDFS); but first, you have to process the data in SAS, then load it back into HDFS.  In other words, you can’t visualize and write reports from the data that streams in from machine-generated sources — the kind of live BI that makes Hadoop really cool.  You have to batch the data, parse it, structure it, then load it with SAS to VA’s staging area.

Can VA work with streaming data?

SAS sells tools that can capture streaming data and load it to a VA data source, but VA works with structured data at rest only.

With VA, can my users track events in real time?

Don’t bet on it.   To be usable, data requires significant pre-processing before it is loaded into VA’s memory.  Moreover, once it is loaded it can’t be updated; updating the data in VA requires a full truncate and reload.   Thus, however fast VA is in responding to user requests, your users won’t be tracking clicks on their iPads in real time; they will be looking at yesterday’s data.

Does VA do predictive analytics?

Visual Analytics 6.1 can perform correlation, fit bivariate trend lines to plots and do simple forecasting.  That’s no better than Tableau.  Surprisingly, given the hype, Tableau actually supports more analysis functions.

While SAS claims that VA is better than SAP HANA because “HANA is just a database”, the reality is that SAP supports more analytics through its Predictive Analytics Library than SAS supports in VA.

Has anyone purchased VA?

A SAS executive claimed 200 customers in early 2013, a figure that should be taken with a grain of salt.  If there are that many customers for this product, they are hiding.

There are five public references, all of them outside the US:

SAS has also recently announced selection (but not implementation) by

OfficeMax has also purchased the product, according to this SAS blog.

As of January 2014, the four customers who announced selection or purchase are not cited as reference customers.

What about implementation?  This is an appliance, right?

Wrong.  SAS considers an implementation that takes a month to be wildly successful.  Implementation tasks include the same tasks you would see in any other BI project, such as data requirements, data modeling, ETL construction and so forth.  All of the back end feeds must be built to put data into a format that VA can load.

Bottom line, does it make sense to buy SAS Visual Analytics?

Again, you will have to decide for yourself whether the SAS VA reports look better than Tableau or the many other options in this space.  BI beauty shows are inherently subjective.

You should also demand that SAS prove its claims to performance in a competitive POC.  Despite the theoretical advantage of an in-memory architecture, actual performance is influenced by many factors.  Visitors to the recent Gartner BI Summit who witnessed a demo were unimpressed; one described it to me as “dog slow”.  She didn’t mean that as a compliment.

The high cost of in-memory platforms means that VA and its supporting hardware will be much more expensive for any given quantity of data than Tableau or equivalent products. Moreover, its proprietary architecture means you will be stuck with a BI silo in your organization unless you are willing to make SAS your exclusive BI provider.  That makes this product very good for SAS; the question is whether it is good for you.

The early adopters for this product appear to be very SAS-centric organizations (with significant prior SAS investment).  They also appear to be fairly small.  If you have very little data, money to burn and are willing to experiment with a relatively new product, VA may be for you.

SAS and H-P Close the Curtains

Michael Kinsley wrote:

It used to be, there was truth and there was falsehood. Now there is spin and there are gaffes. Spin is often thought to be synonymous with falsehood or lying, but more accurately it is indifference to the truth. A politician engaged in spin is saying what he or she wishes were true, and sometimes, by coincidence, it is. Meanwhile, a gaffe, it has been said, is when a politician tells the truth — or more precisely, when he or she accidentally reveals something truthful about what is going on in his or her head. A gaffe is what happens when the spin breaks down.

Hence, a Kinsley gaffe means “accidentally telling the truth”.

Back in April, an H-P engineer committed a Kinsley gaffe by publishing a white paper that describes in some detail issues encountered by SAS and H-P on implementations of SAS Visual Analytics.  I blogged about this at the time here.

Some choice bits:

— “Needed pre-planning does not occur and the result is weeks to months of frantic activity to address those issues which should and could have been addressed earlier and in a more orderly fashion.”

— “(Data and management networks) are typically overlooked and are the cause of most issues and delays encountered during implementation.”

— “Since a switch with 100s to 1000s of ports is required to achieve the consolidation of network traffic, list price can start at about US$500,000 and be into the millions of dollars.”

And my personal favorite:

— “The potential exists, with even as few as 4 servers, for a Data Storm to occur.”

If you’re wondering what a Data Storm is, let’s just say that it’s not a good thing.

Since I published the blog post, SAS has withdrawn the paper from its website.   This is not too surprising, since every other paper on “SAS and Big Data” is also hidden from view.   Fortunately, I downloaded a copy of the paper for my records.   H-P can claim copyright, so I can’t upload the whole thing, but I’ve attached a few screen shots below so you can see that this paper is real.

You might wonder why SAS feels compelled to keep its “Big Data” stories under wraps.  Keep in mind that we’re not talking about software design or any other intellectual property that warrants protection; in this case, the vendors don’t want you to know the truth about implementation because it conflicts with the hype.  As the paper’s author puts it, “this sounds very scary and expensive.”  “Very scary” and “expensive” don’t mix with “buy this product now.”

If you’re evaluating SAS Visual Analytics ask your SAS rep for a copy of Paper 466-2013.  And ask if they’ve done anything about those Data Storms.

[Screenshots from the H-P white paper (Hp1, Hp2)]

A Few Interesting Things About SAS Visual Analytics

This post is more than two years old, but remains popular.  For an updated discussion, read How to Buy SAS Visual Analytics on this blog.

Thanks to a white paper recently published by an H-P engineer, we now have a better idea about what it takes to implement SAS Visual Analytics, SAS’ in-memory BI and visualization platform.

(Note: SAS has taken down the white paper since this post was published).

(Updated again June 27:   SAS has reposted an edited version of the white paper, with interesting parts removed.  The paper currently posted at this link is not the original.)

It’s an interesting picture.

A few key points:

(1) Implementation is a science project.  

Quoting from the paper:

…too often the needed pre-planning does not occur and the result is weeks to months of frantic activity to address those issues which should and could have been addressed earlier and in a more orderly fashion.

Someone should explain to SAS and H-P that vendors are supposed to provide customers with honest guidance about how to implement a product.  If “needed pre-planning does not occur”, it’s likely because the customer wasn’t told it was necessary.  Of course, some customers ignore vendor guidance; but if the issues described in this white paper are systematic (as the author suggests), there are only two plausible explanations: (a) the vendors don’t know how to implement the product, or (b) they’re positioning the product as an “appliance” that is “ready to run.”

Elsewhere in the paper, the author notes that a response to a key question about how to monitor this application is “evolving”.  In other words, thirteen months and three releases into production and the vendors still don’t have an answer.

(2) Pre-Release testing?  What’s that?

Quoting:

Experience on initial installations is showing that networking is proving to be one of the biggest challenges and impediments to a timely implementation.

Well, duh.  This is the sort of thing ordinarily revealed in something called “system testing” and “benchmarking”, the product of which is something called “reference architecture”.   Smart people generally think it’s a good idea to do this sort of testing and benchmarking before you release a product rather than figuring it out in the course of “initial installations”.

Pity the early adopters for this product.

(Data and management networks) are typically overlooked and are the cause of most issues and delays encountered during implementation.

Well, why are they overlooked?   Ordinarily when implementing a product one starts with something called an “architecture review” where you — I’m talking to you, SAS and H-P — tell the customer how the product works and point out these network thingies and why it’s a good idea to provision them and not leave them just hanging out there.

(3) It’s not an appliance.

The author refers to the hardware VA sits on as an “appliance”; word on the street is that SAS reps position VA as an alternative to appliances (such as IBM PureData, Teradata Aster or EMC Greenplum DCA).   Well, caveat emptor on that.  The author goes into an extended discussion of data networking, with much interesting detail on such topics as the type of cables you will need to wire this thing together (copper or glass fiber).

It seems that you need lots of cable, and for good performance you need good cable.

(4) Implement with care.

The potential exists, with even as few as 4 servers, for a Data Storm to occur.

No, he does not mean the Amiga game.    For those not hip to the lingo, a Data Storm is a Really Bad Thing that you don’t want to happen in your IT environment, and vendors generally design products that don’t set off Data Storms.

A Data Storm is to Business Intelligence what train wrecks are to travel.

(5) Infrastructure requirements may be daunting.

After an extended discussion of IP addresses and the load this product places on your network, the author writes:

Since a switch with 100s to 1000s of ports is required to achieve the consolidation of network traffic, list price can start at about US$500,000 and be into the millions of dollars.

And after more extended discussion about networking and cabling:

…while all this sounds very scary and expensive….

Dude, you have no idea.

…there can be assistance from vendors during the hardware ordering process that makes this simpler and clearer to comprehend.

No doubt.  Hardware vendors will be happy to explain all of this to you.

(6)  High Availability?  Not exactly.

The author opens the High Availability section with a Readiness Checklist detailing whether or not you should even attempt to cold swap a SAS Head Node.  The answer: it depends.  Left unanswered: what to do if the Readiness Checklist says Do Not Attempt.  I presume that the answer is “call H-P”, but I’m just guessing.

This leads to the next section:

How to transplant a SAS Head Node and survive the experience (you hope) in a SAS VA configuration.

Nine steps follow, closing with “Assuming that everything seems to be in working order…”

(7) Finding people to keep this running will be a challenge.

The author is a twenty-year veteran of SAS and H-P (with four acronyms after his name), and his paper is littered with words like “scary”, “problematic”  and “crisis”.