Big Analytics Roundup (June 8, 2015)

With Spark Summit 2015 coming up in San Francisco next week, expect lots of announcements in the coming week from vendors seeking to catch the wave.

In HBR, Narrative Science CEO Stuart Frankel argues that the companies driving big salaries for data scientists are stupid. Okay, he doesn’t actually say they’re stupid, but he clearly thinks that data scientists aren’t worth the money, because “they don’t scale.” Mr. Frankel, who is an accountant and lawyer by training, falls into the solipsism of assuming that practices he does not understand must be irrational. His argument is tendentious, the latest in a litany of claims from software vendors that if you buy their software you can fire your analysts. Note that after six rounds of funding, the best Mr. Frankel’s venture can muster is a $10 million “D” round, which suggests that he hasn’t exactly harnessed a unicorn.

By way of themorningpaper, McSherry et al. study the behavior of scalable systems and note that in some cases your laptop computer will outperform massively parallel architectures.

Quant recruiter Linda Burtch dives deeper into her SAS versus R poll.  Key bits:

  • R is strong in academia, healthcare and tech, and is the overwhelming choice among analysts with Data Scientist titles
  • R preference increases with education, and decreases with experience

In other words, if you want a younger and better-educated team, standardize on R.

Reynold Xin’s excellent deck Apache Spark in 2015 and Beyond from ApacheCon is now available here.

There are too many good posts on the MapR blog to enumerate.  Go there, and read.

Some of the decks from last weekend’s Open Data Science Conference in Boston are uploaded here.

Amazon Machine Learning

In TechRepublic, Nick Heath summarizes Amazon Web Services’ new machine learning capability, noting complaints that users cannot export models from AML or import models to the scoring service.

Analytic Startups

Analytic services provider DataScience announces a $4.5 million Series A round.

Plotly, a cloud-based collaboration and visualization platform, lands a $5.5 million A round.

Apache Flink

Here is a piece about Flink at work at Bouygues Telecom.

Apache Kylin

From HBaseCon, an excellent deck on Kylin’s extreme OLAP engine.

Apache Spark

On the O’Reilly Data Show podcast, Ben Lorica interviews Spark release manager Patrick Wendell.

Huawei announces that it is now a certified Spark distributor.

BlueData announces support for Hadoop and Spark on Docker containers.

Training vendor Edureka publishes a nice Spark backgrounder on Slideshare.

MemSQL CEO Eric Frenkiel touts the troika of Kafka, Spark and MemSQL for real-time analytics.

In the WSJ’s CIO Journal, Clint Boulton and Steven Norton report on Spark use by Under Armour’s MyFitnessPal unit.

On the Databricks developer blog, Burak Yavuz and Reynold Xin preview enhanced statistical and mathematical functions with DataFrames planned for Spark 1.4.

Apache Zeppelin

On the Hortonworks Dev blog, part one of a series about data science with Spark.  The first part gets you started with Zeppelin, the open source notebook.


DataTorrent announces plans to open source the existing version of its RTS real-time engine as it releases a new version. DataTorrent claims (without evidence) to offer 10X to 100X faster throughput than Spark or Storm. Daniel Gutierrez passes along DataTorrent content under his byline here.


On KDnuggets, Bhavya Geethika Peddibhotla measures Python machine learning packages by commits and contributors and discovers that scikit-learn is in the lead, by a wide margin.  Check the scales on that graph.
