Big Analytics Roundup (January 11, 2016)
With the holidays over, the weekly roundup is back. Spark 1.6 is the biggest news, but scroll down for a look at that Yahoo! benchmark of streaming platforms.
Professional cat herder Andrew Oliver publishes 16 tips for 2016 and gets most of them right, although in the case of Spark he is about nine months late. Some of his couplings are unique, to say the least: most people familiar with YARN and Mesos, for example, would not group them together, since the two do very different things. I share Andrew’s pessimism concerning Tez, but Hortonworks will continue to flog that horse until Hive on Spark gets off the ground.
Spark 1.6 Released
- Dataset API
- Many performance enhancements (Project Tungsten)
- API and UI enhancements for Streaming
- New machine learning algorithms and API enhancements
On the Databricks blog, Michael Armbrust, Patrick Wendell and Reynold Xin summarize key enhancements, and note that in 2015 Spark doubled its contributor base.
Hortonworks offers a preview with HDP 2.3.
Yahoo! Benchmarks Streaming Platforms
On the Yahoo! Engineering blog, the Yahoo Storm Team reports the results of a project to benchmark streaming platforms. While Apache Storm is the incumbent at Yahoo!, the authors note the emergence of Apache Apex, Apache Flink, Apache Samza, Apache Spark Streaming and Google Cloud Dataflow.
For this project, the authors limit the scope of testing to Storm, Flink and Spark Streaming. The benchmark use case, published here, is an advertising application whose objective is to read JSON events from Kafka, identify relevant events associated with advertising campaigns, and store windowed campaign event counts in Redis. The target SLA is one second.
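Stripped of Kafka and Redis, the core of that use case reduces to parsing JSON events, filtering out the irrelevant ones, and counting events per campaign per time window. A minimal sketch in plain Python — the field names, the "view" filter, and the 10-second window are illustrative assumptions, not taken from the published benchmark code:

```python
import json
from collections import defaultdict

def count_windowed_events(raw_events, window_ms=10_000):
    """Count relevant events per (campaign_id, window_start).

    Sketch of the benchmark's core logic; field names ("event_time",
    "event_type", "campaign_id") and the "view" filter are made up
    for illustration.
    """
    counts = defaultdict(int)
    for raw in raw_events:
        event = json.loads(raw)
        if event["event_type"] != "view":   # keep only relevant events
            continue
        # Truncate the event timestamp to the start of its window.
        window_start = int(event["event_time"]) // window_ms * window_ms
        counts[(event["campaign_id"], window_start)] += 1
    return counts
```

In the real benchmark these counts are maintained in Redis per campaign and window; the streaming engine's job is to do this continuously while staying within the one-second SLA.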
As the authors note, this assessment is limited in scope to performance under normal operation; it does not measure processing guarantees (e.g. at-most-once, at-least-once, exactly-once) or fault tolerance. In practice, organizations place much more emphasis on fault tolerance when building streaming applications.
Moreover, while the “at-least-once” processing model may be good enough for Yahoo’s advertising metrics, it is not good enough for financial systems, where exactly-once processing is mandatory.
Think about that for a minute, then go and check your Yahoo credit card statement.
Fault tolerance and exactly-once processing affect performance and throughput, so it’s not clear why anyone would want to run a benchmark without them enabled.
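The difference between the guarantee levels is easy to demonstrate. Under at-least-once delivery, a redelivered message inflates a naive aggregate; getting an exactly-once effect takes extra work, such as deduplicating on message IDs. A toy sketch (the message IDs and tuple layout are invented for illustration, not drawn from the benchmark):

```python
def naive_total(messages):
    """At-least-once consumer: a redelivered message is counted twice."""
    return sum(amount for _msg_id, amount in messages)

def idempotent_total(messages):
    """Exactly-once effect: skip message IDs already processed."""
    seen, total = set(), 0
    for msg_id, amount in messages:
        if msg_id not in seen:
            seen.add(msg_id)
            total += amount
    return total

# Message "m2" is delivered twice, as at-least-once semantics permit.
delivery = [("m1", 100), ("m2", 50), ("m2", 50)]
naive_total(delivery)       # 200 -- the duplicate is double-counted
idempotent_total(delivery)  # 150 -- the duplicate is ignored
```

Tracking every ID forever is not practical at scale; real systems bound the deduplication state with sequence numbers or transactional sinks, and that bookkeeping is exactly the overhead this benchmark leaves unmeasured.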
Within the scope of the test, Storm and Flink delivered comparable performance. This is just one benchmark with a single use case, but it seems to me that the results are problematic for Flink. If Flink can’t outperform Storm, why exactly do we need it?
Finally, SAS Supports Spark
On the day that I publish a prediction that SAS will support Spark in 2016, SAS releases SAS Data Loader for Hadoop Release 2.4 with Spark support. (For some reason, SAS did not issue a press release or flog the media as it usually does when it adds a feature.) SAS Data Loader copies data between Hadoop and either relational databases or SAS datasets; it also imports data from delimited files, performs data transformations and provides SQL-like operations. When configured appropriately, Data Loader pushes data quality functions down to Spark.
SAS supports all five Hadoop distributions, and either Spark 1.2 or Spark 1.3, depending on the Hadoop distribution used. It does not support free-standing Spark clusters. The supported Spark version is almost a year old due to a doubled release latency: first the Hadoop distributor’s release cycle, then SAS’ Hadoop certification and release cycle.
Two interesting points. First, SAS claims that the ability to load data to SAS LASR Analytic Server is a feature of SAS Data Loader for Hadoop. This will come as a surprise to SAS customers who were told they could load data directly from Hadoop into SAS LASR Analytic Server without licensing another SAS product.
Second, SAS notes that SAS Data Loader runs SAS programs using the DS2 language. In other words, the majority of SAS customers, who work with the standard SAS programming language, will have to rewrite their SAS programs to use this software.
Ohio State Releases RDMA-Spark
Ohio State’s Network-Based Computing Laboratory releases its new RDMA-based Apache Spark distribution, the aptly named RDMA-Spark. (That’s Remote Direct Memory Access, for the uninitiated.) Some impressive performance test results are published here.
Databricks Adds to Leadership Team
Announcements about changes to the executive team provoke speculation: who got fired? In this case, nobody. Ion Stoica, Ali Ghodsi and Patrick Wendell all move up, and Databricks adds Ron Gabrisko as SVP, Worldwide Sales. Congratulations to all.