Big Analytics Roundup (June 8, 2015)
With Spark Summit 2015 coming up in San Francisco next week, expect lots of announcements from vendors seeking to catch the wave.
In HBR, Narrative Science CEO Stuart Frankel argues that the companies driving big salaries for data scientists are stupid. Okay, he doesn’t actually say they’re stupid, but he clearly thinks that data scientists aren’t worth the money, because “they don’t scale.” Mr. Frankel, who is an accountant and lawyer by training, falls into the solipsism of assuming that practices he does not understand must be irrational. His argument is tendentious, the latest in a litany of claims from software vendors that if you buy their software you can fire your analysts. Note that after six rounds of funding, the best Mr. Frankel’s venture can muster is a $10 million “D” round, which suggests that he hasn’t exactly harnessed a unicorn.
Quant recruiter Linda Burtch dives deeper into her SAS versus R poll. Key bits:
- R is strong in academia, healthcare and tech, and is the overwhelming choice among analysts with Data Scientist titles
- R preference increases with education, and decreases with experience
So in other words, if you want a younger and better educated team, standardize on R.
Reynold Xin’s excellent deck Apache Spark in 2015 and Beyond from ApacheCon is now available here.
There are too many good posts on the MapR blog to enumerate. Go there, and read.
Some of the decks from last weekend’s Open Data Science Conference in Boston are uploaded here.
Amazon Machine Learning
In TechRepublic, Nick Heath summarizes Amazon Web Services’ new machine learning capability, noting complaints that users cannot export models from AML or import models to the scoring service.
Analytic services provider DataScience announces a $4.5 million Series A round.
Plotly, a cloud-based collaboration and visualization platform, lands a $5.5 million A round.
Here is a piece about Flink at work at Bouygues Telecom.
From HBaseCon, an excellent deck on Kylin’s extreme OLAP engine.
On the O’Reilly Data Show podcast, Ben Lorica interviews Spark release manager Patrick Wendell.
Huawei announces that it is now a certified Spark distributor.
BlueData announces support for Hadoop and Spark on Docker containers.
Training vendor Edureka publishes a nice Spark backgrounder on Slideshare.
MemSQL CEO Eric Frenkiel touts the troika of Kafka, Spark and MemSQL for real-time analytics.
In the WSJ’s CIO Journal, Clint Boulton and Steven Norton report on Spark use by Under Armour’s MyFitnessPal unit.
On the Databricks developer blog, Burak Yavuz and Reynold Xin preview enhanced statistical and mathematical functions with DataFrames planned for Spark 1.4.
On the Hortonworks Dev blog, part one of a series about data science with Spark. The first part gets you started with Zeppelin, the open source notebook.
DataTorrent announces plans to open source the existing version of its RTS real-time engine as it releases a new version. DataTorrent claims (without evidence) to offer 10X to 100X faster throughput than Spark or Storm. Daniel Gutierrez passes along DataTorrent content under his byline here.
On KDnuggets, Bhavya Geethika Peddibhotla measures Python machine learning packages by commits and contributors and discovers that scikit-learn is in the lead, by a wide margin. Check the scales on that graph.