Big Analytics Roundup (May 23, 2016)
Google announces that it has designed an application-specific integrated circuit (ASIC) expressly for deep neural nets. Tech press goes bananas. The chips, branded Tensor Processing Units (TPUs), require fewer transistors per operation, so Google can fit more operations per second into the chip. In about a year of operation, Google has achieved an order-of-magnitude improvement in performance per watt for machine learning.
Google’s Felipe Hoffa summarizes Mark Litwintschik’s work benchmarking different platforms with the New York City Taxi and Limousine Commission’s public dataset of 1.1 billion trips. So far, Mark has tested PostgreSQL on AWS, Elasticsearch on AWS, Spark on AWS EMR, Redshift, Google BigQuery, Presto on AWS and Presto on Cloud Dataproc. Results make Google look good, but you should read Mark’s original posts.
Open Data Science Conference
The second annual Open Data Science Conference (ODSC) East met in Boston over the weekend. Attendance doubled from last year, to 2,400.
Registration was a snafu, because the conference organizers did not accurately predict walk-in traffic or staffing needs. The jokes write themselves.
Content was excellent. Keynoters included Stefan Karpinski, co-creator of Julia; Kirk Borne of Booz Allen Hamilton; Ingo Mierswa, CTO of RapidMiner; and Lukas Biewald, CEO of CrowdFlower. Track leaders included JJ Allaire and Joe Cheng of RStudio, Usama Fayyad of Barclays and John Thompson of the US Census Bureau. Sponsors included Basis Technology, CartoDB, CrowdFlower, Dataiku, DataRobot, Dato, Exaptive, Facebook, H2O.ai, MassMutual, McKinsey, Metis, Microsoft, RapidMiner, SFL Scientific and Wayfair.
Prompted by a tweet, I stopped at the Dataiku table. The conversation went like this:
- Me: What does Dataiku do, in 25 words or less?
- Dataiku: DataRobot.
- Me: What?
- Dataiku: We do what DataRobot does.
At this point, it was clear to me that Mr. Dataiku either did not know what DataRobot does, or thought I didn’t know what DataRobot does. So I changed the subject.
The next ODSC event is in October, in London.
— Michael Armbrust and Tathagata Das explain Structured Streaming in Spark 2.0.
— Adrian Colyer goes 5 for 5 for the week:
- End to end analysis of the spam value chain.
- The business model of spam-based advertising in pharmaceuticals.
- Understanding and detecting malicious web advertising.
- Malvertising: ad-injecting browser extensions.
- The landscape of domain-name typosquatting.
— Tim Hunter, Hossein Falaki and Joseph Bradley explain HyperLogLog and Quantiles in Spark.
— Microsoft’s Raymond Laghaeian explains how to use Azure ML predictions in Google Spreadsheet.
— Sam Dean celebrates Drill’s first anniversary.
— Taylor Goetz delivers a brief history of Apache Storm.
Open Source Announcements
— MongoDB releases a new Spark Connector.
— Apache Tajo announces Release 0.11.3, with five bug fixes.
— Apache Mahout announces Release 0.12.1, a maintenance release that resolves an issue with Flink integration.
— RedPoint Global snags a $12 million “C” round.
— TIBCO announces something called Accelerator for Apache Spark, a bundle of tools that connect TIBCO products with open source packages. While TIBCO refers to this component as open source, the software is available only to TIBCO customers, which means it isn’t Free and Open Source.
— MapR applauds itself.