Big Analytics Roundup (July 11, 2016)
Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.
Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.
Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.
The State of Fast Data
OpsClarity, a startup in the application monitoring space, publishes a survey of 4,000 respondents, a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.
As with any survey based on a convenience sample, the results should be taken with a grain of salt. There are some interesting findings, however. Key bits:
- In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.” The rest chose definitions in the minutes and even hours.
- Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
- Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
- Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
- HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
- A few diehards (9%) do not use open source software. 47% exclusively use open source.
- 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.
It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.
Alex Woodie summarizes the report.
Top Read of the Week
— Jake Vanderplas explains why Python is slow.
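A toy illustration of the kind of overhead Vanderplas discusses: in a pure Python loop, every addition boxes and unboxes full Python objects, while NumPy dispatches the same arithmetic to a typed C loop. (This is my own sketch, not code from the post; it assumes NumPy is installed.)

```python
import timeit

import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n)

def python_sum():
    # Each iteration type-checks and boxes/unboxes Python int objects.
    total = 0
    for x in py_list:
        total += x
    return total

# Same reduction, but executed as a single typed loop in C.
t_py = timeit.timeit(python_sum, number=10)
t_np = timeit.timeit(lambda: np_arr.sum(), number=10)
print(f"pure Python: {t_py:.4f}s  NumPy: {t_np:.4f}s")
```

On a typical machine the NumPy version is one to two orders of magnitude faster, even though both compute the same sum.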
Explainers
— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.
— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.
— Madison J. Myers explains how to get started with Apache SystemML.
— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.
— In a video, Evan Sparks explains what KeystoneML does.
— John Russell explains what pbdR is, and why you should care (if you use R).
— Manny Puentes of Altitude Digital explains how to invest in a big data platform.
Perspectives
— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.
— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.
— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.
— Raj Kosaraju opines on the impact of machine learning on everyday life.
— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.
— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.
Open Source Announcements
— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with the Alternating Direction Method of Multipliers (ADMM). Not surprisingly given the name, it runs on Spark.
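For reference, the textbook ADMM scheme (this is the standard formulation, not a description of Spark ADMM's specific API): a separable problem is split as minimizing $f(x) + g(z)$ subject to $Ax + Bz = c$, and the method alternates minimization of the augmented Lagrangian $L_\rho(x,z,y) = f(x) + g(z) + y^\top(Ax + Bz - c) + \tfrac{\rho}{2}\lVert Ax + Bz - c \rVert_2^2$ over each block, then updates the dual variable:

```latex
\begin{aligned}
x^{k+1} &= \operatorname*{arg\,min}_{x}\; L_\rho\!\left(x, z^k, y^k\right) \\
z^{k+1} &= \operatorname*{arg\,min}_{z}\; L_\rho\!\left(x^{k+1}, z, y^k\right) \\
y^{k+1} &= y^k + \rho \left( A x^{k+1} + B z^{k+1} - c \right)
\end{aligned}
```

The separability is what makes a distributed implementation natural: the $x$- and $z$-updates decompose across data partitions, with only the dual update requiring coordination.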
Commercial Announcements
— Talend announces plans for an IPO. The filing discloses that in 2015 Talend lost 28 cents for every dollar of revenue, slightly better than the 35 cents it lost in 2014. At that rate, Talend may break even in 2020, if nothing else happens in the interim.