Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.

Screen Shot 2016-07-07 at 6.25.48 AM

Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents conducted among a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report. (Fixed broken link).

Top Read of the Week

Guoqiang Jerry Chen, et. al. explain real-time data processing at Facebook. Adrian Colyer summarizes.

Explainers

— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.

Perspectives

— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost in 2015. At that rate, Talend may break even in 2020, if nothing else happens in the interim.

Advertisements

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s