Big Analytics Roundup (June 1, 2015)
The Open Data Science Conference launched successfully in Boston this past weekend, attracting more than 1,200 attendees. Sponsors included Booz Allen, Continuum Analytics, DataRobot, McGraw Hill Education and RStudio, among others. Organizers plan additional events this year in Boston and San Francisco.
Mary Meeker releases her latest Internet Trends Report.
In Forbes, Louis Columbus rounds up analyst coverage of the Big Analytics space, making this blog a meta-roundup of sorts.
Evil Mad Scientist Paco Nathan reports from his travels to data science conferences around the world. To read it, sign up for his newsletter here.
In ZDNet, Doug Henschen reports from Alteryx’s Inspire15 conference in Boston, taking note of plans for in-database and Spark integration.
On the MapR blog (your go-to source for all things Drill), Nitin Bandugula publishes the third of a three-parter on analysis with Drill. Also, Tomer Shiran explains how to deploy Drill and connect to BI tools.
The Drill team promises Drill in ten minutes.
On Slideshare, Robert Metzger explains how to run, tune and debug Flink apps, but not why you would want to.
The Mahout team quietly releases 0.10.1, which includes bug fixes and such.
On the kdnuggets blog, Gregory Piatetsky-Shapiro interviews Spark creator Matei Zaharia. Zaharia shreds claims from Storm and Flink devotees who say that “pure” stream processing is better than Spark’s micro-batch streaming.
Nick Amato demos how to build a simple real-time dashboard using Spark with MapR.
Daoyuan Wang and Jie Huang explain how to tune Java garbage collection on the Databricks developer blog.
In Datanami, Alex Woodie reports on Basho Technologies’ Data Platform, which includes Spark.
Alex Woodie, again, reports on AMPLab’s alpha release of KeystoneML which, according to the announcement, is similar to the Spark.ml package for building machine learning pipelines, only way better. Which raises the question: if it’s so much better, why isn’t it included in Spark? I’m guessing that right now it breaks the Spark API and can’t be fixed before Spark 1.4. Either that, or the developer isn’t on speaking terms with Spark’s leadership team.
Tom Phelan, BlueData’s Chief Architect, makes some interesting points about data locality for Hadoop.
In the morning paper, a graphapalooza. (h/t Hadoop Weekly) For starters, there is this paper on Pregel, the Google project that inspired Apache Giraph, followed by two papers on GraphLab. The first is a general introduction to the GraphLab framework for graph-parallel computing; the second provides more detail on machine learning with GraphLab in the cloud.
There is also a 2012 paper from graph maven Joseph Gonzalez on PowerGraph, a project subsequently rolled into Dato Create. Dato is the oddly named commercial venture that supports GraphLab. Two years ago, GraphLab led the race to replace Mahout, but since then Spark has left it in the dust. GraphLab isn’t just for graph analytics — it supports a variety of machine learning capabilities.
Speaking of which, there is this more recent paper on Spark GraphX, also by Joseph Gonzalez. GraphX is still fairly rudimentary relative to GraphLab, but the Spark leadership team believes that machine learning use cases often require mixing data-parallel and graph-parallel operations. Think of Lotus 1-2-3 versus Microsoft Excel; Lotus had the better spreadsheet, but Excel was packaged with Word and PowerPoint, and buried it.
In news that is likely to be misleading or a distraction, Predixion announces that it is short-listed for the 2015 Red Herring Top 100 North America Award. They haven’t actually made it into the top 100 yet; they’re just short-listed.
The Milwaukee Brewers are dead last in their division, but may be among the top one hundred baseball teams in North America.
At his r4stats blog, Bob Muenchen reports results from the recent kdnuggets.com poll of data mining software use. In the poll, R leads the pack; R, Python and Spark all show strong growth from last year. Most commercial tools declined in share relative to open source, with several vendors showing steep declines. (This may be due in part to kdnuggets’ crackdown on the ballot-stuffing that artificially inflated some vendors’ share in previous polls.) The biggest loser: Predixion Software.