Big Analytics Roundup (April 6, 2015)

Late posting today due to holiday travel.

In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.

The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barrons, a must-read.

Analytic Software

Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot.  I’ll write a more detailed summary later this week.  Quick takes:  Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.

Apache Drill

Apache Drill announces Release 0.8.

Apache Spark

Analysis

In opensource.com, Jen Wike Huger interviews key Spark contributor Reynold Xin.

Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud.  (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)

Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit.  Key takeaways: no, Matei is not a musician, and yes, he likes Nutella. 

Spark has clearly reached a point of inflection when skeptical analysis emerges.  Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and the challenge of making those applications work well in MapReduce.  In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.

  • Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting.  Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East.  Who knew that Hadoop devotees are so sensitive?
  • In DataMill, Nicole Leskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
  • In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11.  His point will be lost on most readers.
  • Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.”   Note to Andrew: you can download the software here.

Spark Core

Matei Zaharia celebrates Spark’s fifth birthday with a brief history.

On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.

Spark Streaming

On the Databricks blog, Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kafka API available in Spark 1.3.

Databricks

Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud.  Case study available here.

Hadoop Ecosystem

In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity.  Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions).  Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS.  Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?

IBM

IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0.  BigInsights includes the usual Hadoop bits, plus:

  • BigSQL, a federation engine for SQL across relational databases and Hadoop
  • Big Sheets, a Datameer-like spreadsheet-on-Hadoop tool
  • SystemML, a home-grown machine learning library that runs in MapReduce
  • Text analytics capability
  • Big R, an interface that can push embarrassingly parallel R processing into Hadoop

Streaming and Real-Time Processing

On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.

SAS and Hadoop

SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities.    Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.

Prior to January, 2012, a search for the words “Hadoop” or “MapReduce” returned no results on the SAS marketing and support websites, which says something about SAS’ leadership in this area.  In March 2012, SAS announced support for Hadoop connectivity;  since then, SAS has gradually expanded the features it supports with Hadoop.

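As of today, there are four primary ways that a SAS user can leverage Hadoop:

  • Legacy SAS, through SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server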

Let’s take a look at each option.

“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc.) that are used primarily from a programming interface.  SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing.  It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user.  For more detailed information, read the manual.

SAS/ACCESS also supports six “Hadoop-enabled” PROCs (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop.  If all you need to do is run frequency distributions, simple statistics and summary reports, then SAS offers everything you need for analytics in Hadoop.  If that is all you want to do, of course, you can use Datameer or Big Sheets and save on SAS licensing fees.

A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps.  (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce.)  SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document); if SAS cannot implicitly translate a DATA step into SQL/HiveQL, it copies the data back to the SAS server, without warning, and performs the operation there.

SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through.  Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”.   The bottom line is that since the user needs to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, the user might as well submit those jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.

SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.

SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera.  Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes”, which means that in practice most customers must rebuild existing predictive models to take advantage of the product.   Customers who already use SAS Enterprise Miner can export the models in PMML and use them in any PMML-enabled database or decision engine and spend less on SAS licensing fees.

Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server.   These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture.  That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.

That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative.  In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time.  SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine.  Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode.  This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.

While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format.   To import the data into SASHDAT, you will need to license SAS Data Integration Server.

A single in-memory node supported by a 16-core/256GB machine can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes.   SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations.  Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.
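The node-count arithmetic above can be checked with a rough sizing sketch in Python.  It assumes one terabyte is 1000GB and that each node holds a fixed amount of table data somewhere in the 75-100GB range; the function name and constants are illustrative, not anything SAS publishes.  The count is sensitive to per-node capacity: at the low end of that range it comes out somewhat higher than 12.

```python
import math


def nodes_needed(dataset_gb: float, table_gb_per_node: float) -> int:
    """Smallest whole number of nodes whose combined capacity covers the dataset."""
    if dataset_gb <= 0 or table_gb_per_node <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(dataset_gb / table_gb_per_node)


# Assumption: a "terabyte-sized" dataset is taken as 1000GB.
DATASET_GB = 1000

# Best case: every node loads a 100GB table.
best_case = nodes_needed(DATASET_GB, 100)

# Worst case: every node loads only a 75GB table.
worst_case = nodes_needed(DATASET_GB, 75)

print(f"Nodes needed: {best_case} to {worst_case}")
```

Multiplying the node count by a per-node hardware and license cost gives a first approximation of the total, which is why the quotes reported by customers climb so quickly.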

SAS seems to be doing a little better selling SAS VA/LASR Server; the company has a big push on in 2013 to sell 2,000 copies of VA and is heavily promoting a one-node version on a big H-P machine for $100K.  Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.

While SAS has struggled to implement its in-memory software in Hadoop to date,  YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop.  Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.