Big Analytics Roundup (June 6, 2016)

We have a slightly abbreviated roundup this week due to travel to the Spark Summit. Spark 2.0 is the top story; I will do a full roundup when the release goes GA.

Also, Bob Muenchen publishes another snippet from the long-awaited Rexer survey of working data miners. This one focuses on satisfaction with tools. KNIME and R look good; SAS and SPSS Statistics, not so much.

Forrester publishes its 2016 Big Data Hadoop-Optimized Systems report. Everyone’s either a Leader or a Strong Performer, just like in Lake Wobegon. You can buy the report here, or just look at the picture below. Teradata is really excited to be #2, although Big Data Hadoop-Optimized Systems cannibalize the rest of their product line.


Spark 2.0

— Spark 2.0 is in preview release. It’s available on Databricks, or directly from the Apache site.

— Jules Damji rounds up a slew of links on Spark 2.0.

— Alex Giamas is so excited about Spark 2.0 that he misunderstands the status of the machine learning libraries. No, MLlib is not deprecated — not yet, anyway. Spark may deprecate MLlib in the future, when ML gets to feature parity. Xiangrui Meng suggests that may happen in Spark 2.2. Update: Alex has corrected his article.

— Microsoft announces major new commitment to Spark. The specific products cited in the press release were all announced previously, with the possible exception of PowerBI on Spark Streaming.

Top Reads

— Three from Adrian Colyer:

— Paul Smaldino and Richard McElreath on the theory of bad science.

Benchmarks That Don’t Suck

— The Transaction Processing Performance Council (TPC) announces release of TPCx-BB, a benchmark designed to measure the performance of analytic data processing, queries and machine learning across thirty use cases. In Datanami, George Leopold reports.


— Joseph Bradley explains machine learning model persistence in Spark 2.0.

— Taylor Goetz explains new features in Apache Storm 1.0.

— Jordan Volz explains how to analyze fantasy basketball stats with Spark.

— Ian Pointer explains differences between Apache Storm and Heron, Twitter’s recently open-sourced streaming engine.

— Suresh Thalamati explains how to use the Spark Netezza connector, so you can move the data when you decommission that old box.


— Alex Woodie pooh-poohs Lambda, touts Kappa.

— Joel Shore is excited about streaming analytics.

Commercial Announcements

— Google announces BigQuery 1.11, with “standard” SQL support.
