Big Analytics Roundup (May 9, 2016)
The big news this week: Teradata’s CEO Mike Koehler walks the plank. TDC stock rises 21% on dismal numbers, which demonstrates how much Wall Street values leadership.
CRN releases its fourth annual Big Data 100 in listicle form to maximize clicks. Criteria for inclusion are “editor’s picks”, so whatever. I got through the As before giving up.
Dave Ramel details five leading Apache Big Data projects: Spark, Tez, Bigtop, REEF and Storm. What? It’s a nice summary of each, but Ramel is a slave to Apache’s silly classifications.
Here are four rules for benchmarks.
- Use a standard test protocol, such as TPC-DS.
- When there is no available standard, test multiple use cases. Make a decent effort to try a variety of workloads.
- Communicate with sponsors for all benchmarked software, or communicate with none of them.
- Publish your code and your data. (There’s this thing called GitHub….)
The ironically named Mammoth Data (current headcount: 15) violates all four rules in a Google-commissioned “study,” which concludes that Cloud Dataflow runs one use case faster than Spark. Professional cat herder Andrew Oliver replaces his Mammoth CEO hat with his analyst hat and touts the results.
Go to the back of the class, Andrew. Run more use cases, discuss results with the Spark team as well as the Google team, then let us know what you learned. I don’t doubt that Dataflow is a nifty tool, and look forward to seeing a benchmark we can trust.
— Adrian Colyer focuses on time series:
- Gorilla: a fast, scalable in-memory time series database.
- BTrDB (Berkeley Tree Database), optimized storage for time series processing.
- The Tarzan algorithm, a technique that discovers surprising patterns in a time series database. (Fixed link — h/t Oliver Vagner).
— On BrightTalk, Databricks’ Reynold Xin explains the new bits in Spark 2.0, to be released soon.
— On the DataRobot blog, Quantopian’s Thomas Wiecki explains how to predict out-of-sample performance for trading algorithms.
— Indeed.com’s Preetha Appan explains algorithms and architecture for recommendation engines.
— In a webcast, Sean Owen and Yann Delacourt explain real-time analytics with Spark.
— Microsoft’s Lixun Zhang explains the differences among open source R, Microsoft R Open and Microsoft R Server.
— In Datanami, George Leopold profiles DataRobot, a machine learning startup. One point he gets wrong: DataRobot runs on Hadoop both in the cloud and on premises.
— On the Google Cloud blog, Tyler Akidau offers Google’s perspective on why they moved Cloud Dataflow development to Apache Beam. DataArtisans chirps support. Here’s what OpenHub has to say about Apache Beam: [image: OpenHub project summary for Apache Beam]
— In WSJ’s CIO Journal, Steven Norton interviews Airbnb’s Mike Curtis, who name-drops Apache Spark. In the same venue, Clint Boulton previously reported that Airbnb uses Spark in its Aerosolve project.
— Jim O’Reilly offers a summary of the differences among AWS, Azure and Google Cloud.
— On the Qubole blog, Monique Chmiel tries to summarize the pros and cons of Python, R and Scala for Big Data, and largely fails. None of the three is suitable for Big Data on its own, so you have to evaluate them for their APIs to scalable platforms like Spark. As of today, the Spark APIs for Scala and Python are clearly superior to the R API.
News from commercial software providers, as well as commercial vendors that operate on an open source software model.
— Hortonworks announces that it lost $1.59 for every dollar it sold in Q1, which is slightly better than the $1.85 it lost in Q1 of 2015. At that rate, look for HDP to break even in 2018 or so, unless they run out of cash first. Wall Street drives stock down 18%.
— Teradata fires CEO, Wall Street celebrates. Don’t party too hard, guys; the numbers still stink.
Stuff I Really Don’t Care About
— Basho releases Riak TS to open source.