Big Analytics Roundup (June 1, 2015)

The Open Data Science Conference launched successfully in Boston this past weekend, attracting more than 1,200 attendees.  Sponsors included Booz Allen, Continuum Analytics, DataRobot, McGraw Hill Education and RStudio, among others.  Organizers plan additional events this year in Boston and San Francisco.

Mary Meeker releases her latest Internet Trends Report.

In Forbes, Louis Columbus rounds up analyst coverage of the Big Analytics space, making this blog a meta-roundup of sorts.

Evil Mad Scientist Paco Nathan reports from his travels to data science conferences around the world.  To read it, sign up for his newsletter here.


In ZDNet, Doug Henschen reports from Alteryx’s Inspire15 conference in Boston, taking note of plans for in-database and Spark integration.

Apache Drill

On the MapR blog (your go-to source for all things Drill), Nitin Bandugula publishes the third of a three-parter on analysis with Drill.  Also, Tomer Shiran explains how to deploy Drill and connect to BI tools.

The Drill team promises Drill in ten minutes.

Apache Flink

On Slideshare, Robert Metzger explains how to run, tune and debug Flink apps, but not why you would want to.

Apache Mahout

The Mahout team quietly releases 0.10.1, which includes bug fixes and such.

Apache Spark

On the kdnuggets blog, Gregory Piatetsky-Shapiro interviews Spark creator Matei Zaharia.  Zaharia shreds claims from Storm and Flink devotees who say that “pure” stream processing is better than Spark’s micro-batch streaming.

Nick Amato demos how to build a simple real-time dashboard using Spark with MapR.

Daoyuan Wang and Jie Huang explain how to tune Java garbage collection on the Databricks developer blog.

In Datanami, Alex Woodie reports on Basho Technologies’ Data Platform, which includes Spark.

Alex Woodie, again, reports on AMPLab’s alpha release of KeystoneML which, according to the announcement, is similar to Spark’s package for building machine learning pipelines, only way better.  Which raises the question: if it’s so much better, why isn’t it included in Spark?  I’m guessing that right now it breaks the Spark API and can’t be fixed before Spark 1.4.  Either that, or the developer isn’t on speaking terms with Spark’s leadership team.


Tom Phelan, Bluedata’s Chief Architect, makes some interesting points about data locality for Hadoop.

Graph Analytics

In the morning paper, a graphapalooza.  (h/t Hadoop Weekly)  For starters, there is this paper on Pregel, the Google project that inspired Apache Giraph, followed by two papers on GraphLab.  The first is a general introduction to the GraphLab framework for graph-parallel computing; the second provides more detail on machine learning with GraphLab in the cloud.

There is also a 2012 paper from graph maven Joseph Gonzalez on PowerGraph, a project subsequently rolled into Dato Create.  Dato is the oddly named commercial venture that supports GraphLab.  Two years ago, GraphLab led the race to replace Mahout, but since then Spark has left it in the dust.  GraphLab isn’t just for graph analytics — it supports a variety of machine learning capabilities.

Speaking of which, there is this more recent paper on Spark GraphX, also by Joseph Gonzalez.  GraphX is still fairly rudimentary relative to GraphLab, but the Spark leadership team believes that machine learning use cases often require mixing data-parallel and graph-parallel operations.  Think of Lotus 1-2-3 versus Microsoft Excel; Lotus had the better spreadsheet, but Excel was packaged with Word and PowerPoint, and buried it.

Predixion Software

In news that is likely to be misleading or a distraction, Predixion announces that it is short-listed for the 2015 Red Herring Top 100 North America Award.  They didn’t actually make it into the top 100 yet; they’re just short-listed.

The Milwaukee Brewers are dead last in their division, but may be among the top one hundred baseball teams in North America.


At his r4stats blog, Bob Muenchen reports results from the recent kdnuggets poll of data mining software use.  In the poll, R leads the pack; R, Python and Spark all show strong growth from last year.  Most commercial tools declined in share relative to open source, with several vendors showing steep declines.  (This may be due in part to kdnuggets’ crackdown on ballot-stuffing, which artificially inflated some vendors’ share in previous polls.)  The biggest loser: Predixion Software.


  • Robert Stratton

    Your talk on automation at the ODSC was interesting and I thought the conference was generally running at a pretty high standard. On the (off) topic of automated predictive analytics – I personally think a necessary next step is in addressing the challenge of incorporating domain knowledge through some kind of AI representation of relevant knowledge. This would make decisions at a meta level: decisions that would normally be made by the human in the process and sit above the level of the analytical software. Along the lines of Ross King’s Robot Scientist….

    • Robert,

      Thanks for reading and commenting. We can use embedded expertise to narrow the search space of a brute force search — that leads to better solutions in less time.


