Big Analytics Roundup (June 6, 2016)
We have a slightly abbreviated roundup this week due to travel to the Spark Summit. Spark 2.0 is the top story; I will do a full roundup when the release goes GA.
Also, Bob Muenchen publishes another snippet from the long-awaited Rexer survey of working data miners. This one focuses on satisfaction with tools. KNIME and R look good; SAS and SPSS Statistics, not so much.
Forrester publishes its 2016 Big Data Hadoop-Optimized Systems report. Everyone’s either a Leader or a Strong Performer, just like in Lake Wobegon. You can buy the report here, or just look at the picture below. Teradata is really excited to be #2, although Big Data Hadoop-Optimized Systems cannibalize the rest of their product line.
— Spark 2.0 is in preview release. It’s available on Databricks, or directly from the Apache site.
— Jules Damji rounds up a slew of links on Spark 2.0.
— Alex Giamas is so excited about Spark 2.0 that he misunderstands the status of the machine learning libraries. No, MLlib is not deprecated, at least not yet. Spark may deprecate MLlib (the RDD-based API) in the future, when ML (the DataFrame-based API) reaches feature parity. Xiangrui Meng suggests that may happen in Spark 2.2. Update: Alex has corrected his article.
— Microsoft announces a major new commitment to Spark. The specific products cited in the press release were all announced previously, with the possible exception of Power BI on Spark Streaming.
— Three from Adrian Colyer:
- Productivity in open source projects.
- Sequence to sequence learning with neural networks.
- Semi-supervised sequence learning.
— Paul Smaldino and Richard McElreath on the theory of bad science.
Benchmarks That Don’t Suck
— The Transaction Processing Performance Council (TPC) announces the release of TPCx-BB, a benchmark designed to measure the performance of analytic data processing, queries and machine learning across thirty use cases. In Datanami, George Leopold reports.
— Joseph Bradley explains machine learning model persistence in Spark 2.0.
— Taylor Goetz explains new features in Apache Storm 1.0.
— Jordan Volz explains how to analyze fantasy basketball stats with Spark.
— Ian Pointer explains the differences between Apache Storm and Heron, Twitter’s recently open-sourced streaming engine.
— Suresh Thalamati explains how to use the Spark Netezza connector, so you can move the data when you decommission that old box.
— Alex Woodie pooh-poohs Lambda, touts Kappa.
— Joel Shore is excited about streaming analytics.
— Google announces BigQuery 1.11, with “standard” SQL support.