Big Analytics Roundup (January 11, 2016)

With the holidays over, the weekly roundup is back.  Spark 1.6 is the biggest news, but scroll down for a look at that Yahoo! benchmark of streaming platforms.

Professional cat herder Andrew Oliver publishes 16 tips for 2016 and gets most of them right although, in the case of Spark, about nine months late.  Some of his couplings are unique, to say the least: most people familiar with YARN and Mesos, for example, would not group them together, since they do very different things.  I share Andrew’s pessimism concerning Tez, but Hortonworks will continue to flog that horse until Hive on Spark gets off the ground.

In a two-part post (here and here), Sumanth Narasappa describes how Apigee uses Spark to compute aggregates in a multi-tenant database architecture.

Spark 1.6 Released

The Apache Spark team releases Spark 1.6, with many enhancements, including:

  • Dataset API
  • Many performance enhancements (Project Tungsten)
  • API and UI enhancements for Streaming
  • New machine learning algorithms and API enhancements

Later this week, I will publish a separate post with additional detail.  Media coverage herehere, here and here.

On the Databricks blog, Michael Armbrust, Patrick Wendell and Reynold Xin summarize key enhancements, and note that in 2015 Spark doubled its contributor base.

Hortonworks offers a preview with HDP 2.3.

Yahoo! Benchmarks Streaming Platforms

On the Yahoo! Engineering blog, the Yahoo Storm Team reports the results of a project to benchmark streaming platforms.  While Apache Storm is the incumbent at Yahoo!, the authors note the emergence of Apache Apex, Apache Flink, Apache Samza, Apache Spark Streaming and Google Cloud Dataflow.

The report also includes links to recently published functional comparisons (here, here, here and here).

For this project, the authors limit the scope of testing to Storm, Flink and Spark Streaming.  The benchmark use case, published here, is an advertising application whose objective is to read JSON events from Kafka, identify relevant events associated with advertising campaigns, and store windowed campaign event counts in Redis.  The target SLA is one second.

As the authors note, this assessment is limited in scope to performance under normal operation, and does not measure processing guarantees (e.g. at least-once, exactly-once, at-most-once) or fault tolerance.  In practice, organizations place much more emphasis on fault tolerance when building streaming applications.

Moreover, while the “at-least-once” processing model may be good enough for Yahoo’s advertising metrics, it is not good enough for financial systems, where exactly-once processing is mandatory.

Think about that for a minute, then go and check your Yahoo credit card statement.

Fault tolerance and exactly-once processing affect performance and throughput, so it’s not clear why anyone would want to run a benchmark without them enabled.

Within the scope of the test, Storm and Flink delivered comparable performance.  This is just one benchmark with a single use case, but it seems to me that the results are problematic for Flink.  If Flink can’t outperform Storm, why exactly do we need it?

Finally, SAS Supports Spark

On the day that I publish a prediction that SAS will support Spark in 2016, SAS releases SAS Data Loader for Hadoop Release 2.4 with Spark support.  (For some reason, SAS did not issue a press release or flog the media as it usually does when it adds a feature.)  SAS Data Loader copies data to and from relational databases and SAS datasets to and from Hadoop; it also imports data from delimited files, performs data transformations and provides SQL-like operations.  When configured appropriately, Data Loader pushes data quality functions down to Spark.

SAS supports all five Hadoop distributions, and either Spark 1.2 or Spark 1.3, depending on the Hadoop distribution used.  It does not support free-standing Spark clusters.  The supported Spark version is almost a year old due to double release latency: the Hadoop distributor’s release cycle and SAS’ Hadoop certification and release cycle.

Two interesting points.  SAS claims that the ability to load data to SAS LASR Analytic Server is a feature of SAS Data Loader for Hadoop.  This will come as a surprise to SAS customers who were told they could load data directly from Hadoop into SAS LASR Analytic Server, without licensing another SAS product.

SAS also notes that SAS Data Loader runs SAS programs using the DS2 language.  In other words. for the majority of SAS customers who work with the standard SAS Programming Language, you will have to rewrite your SAS programs to use this software.

RDMA-Spark Launches

Ohio State’s Network-Based Computing Laboratory releases its new RDMA-based Apache Spark distribution, the aptly named RDMA-Spark.  (That’s Remote Direct Memory Access for the uninitiated.)  Some impressive performance tests published here.

Databricks Adds to Leadership Team

Announcements about changes to the executive team provoke speculation: who got fired?  In this case, nobody.  Ion Stoica, Ali Ghodsi and Patrick Wendell all move up, and Databricks adds Ron Gabrisko as SVP, Worldwide Sales.   Congratulations to all.



  • Hi Thomas; hope you’ll acknowledge these 3 things.
    1) Spark version numbers for DLH are baselines (or minimums)
    2) That DL4H loads LASR is orthogonal to that there other ways to load LASR. from DL4H might be convenient for some pipelines.
    3) DL4H generates DS2; customers dont have to rewrite anything, just upgrade to DL4H2.4

    –Paul (from SAS)

    • Paul,

      Thank you very much for reading and commenting.

      — On your first point, my post reflects the written SAS documentation, which specifies “System Requirements” and not “Minimum System Requirements”. If the SAS documentation does not reflect the true state of SAS support, I believe that you have an issue with the folks at SAS who write documentation and not with my post.

      According to the link below, SAS Technical Support will not attempt to reproduce issues on anything other than the “officially supported” releases.

      As you and I both know, a product can work in an environment that does not exactly match the published requirements. That is not relevant to a discussion about what platforms SAS supports. If SAS Technical Support will not attempt to reproduce issues on releases other than those specified in the System Requirements, it does not support the product on those releases.

      — Loading data directly from Hadoop to LASR is possible but inconvenient? That’s what I figured.

      — Most SAS users do not work with DS2. So, any existing jobs written to run in Base SAS will have to be rewritten, or rebuilt in DL4H. That’s not an issue for some customers, but it’s a show-stopper for others.

      Hope to see you at Spark Summit!



  • Not to jump into the middle of this fray, but I would like to make one clarification. DL4H does also run Base SAS programs, so there really is no reason to have to rewrite any programs.

    – Diane

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s