Just three main stories this week: possible trouble for a pair of analytic startups; Google releases TensorFlow to open source; and H2O delivers new capabilities at its annual meeting.
Two items of note from the Databricks blog:
- Darin McBeath describes Elsevier’s Spark use case and introduces spark-xml-utils, a Spark package contributed by his team. The package enables the Spark user to filter documents based on an XPath expression, return specific nodes for an XPath/XQuery expression and transform documents using an XSLT stylesheet.
- Rachit Agarwal and Anurag Khandelwal of Berkeley’s AMPLab introduce Succinct, a distributed datastore for queries on compressed data. They announce the release of Succinct Spark, a Spark package that enables search, count, range and random access queries on compressed RDDs. The authors claim a 75X performance advantage over native Spark when using Succinct as a document store.
Three interesting stories on streaming data:
- In a podcast, Data Artisans CTO Stephan Ewen discusses Flink, Spark and the Kappa architecture.
- Techalpine’s Kaushik Pal compares Spark and Flink for streaming data.
- Will McGinnis helps you get started with Python and Flink.
(1) Analytic Startups in Trouble
In The Information, Steve Nellis and Peter Schulz explain why startups return to the funding well frequently — and why those that don’t may be in trouble. Venture funding isn’t a perfect indicator of success, but is often the only indicator available. On the list: Skytree Software and Alpine Data Labs.
(2) Google Releases TensorFlow for Machine Learning
On the Google Research blog, Google announces the open source availability of TensorFlow. TensorFlow is Google’s second-generation machine learning system; it supports Deep Learning as well as any computation that can be expressed as a flow graph. Read this white paper for details of the system. At present, there are Python and C++ APIs; Google notes that the C++ API may offer some performance advantages.
Video intro here.
On Slate, Will Oremus feels the buzz.
On his eponymous blog, Sachin Joglekar explains how to do k-means clustering with TensorFlow.
Separately, in VentureBeat, Jordan Novet rounds up open source frameworks for Deep Learning.
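For readers new to the "flow graph" idea mentioned above, here is a minimal sketch in plain Python. It is not TensorFlow's API: the `Node`, `placeholder` and `evaluate` names are hypothetical, invented only to show how a computation can be described as a graph of operations first and run later with concrete inputs.

```python
# Hypothetical illustration only: none of these names come from TensorFlow.
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """One vertex in the graph: an operation plus the nodes that feed it."""
    name: str
    op: Callable[..., float]
    inputs: Tuple["Node", ...] = ()


def placeholder(name: str) -> Node:
    """A node whose value must be supplied at run time through the feed."""
    def missing() -> float:
        raise KeyError(f"no value fed for placeholder {name!r}")
    return Node(name, missing)


def evaluate(node: Node, feed: Dict[str, float],
             cache: Optional[Dict[str, float]] = None) -> float:
    """Compute a node's value by first computing everything it depends on."""
    if cache is None:
        cache = {}
    if node.name in feed:
        return feed[node.name]
    if node.name not in cache:
        args = [evaluate(parent, feed, cache) for parent in node.inputs]
        cache[node.name] = node.op(*args)
    return cache[node.name]


# Describe y = w * x + b as a graph; nothing is computed yet.
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
product = Node("mul", lambda left, right: left * right, (w, x))
y = Node("add", lambda left, right: left + right, (product, b))

# Run the graph with concrete inputs.
result = evaluate(y, {"x": 3.0, "w": 2.0, "b": 1.0})
print(result)
```

Separating the description of the graph from its execution is what lets a real system decide where and how to run each operation; this toy version simply walks the graph recursively on one machine.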
(3) H2O.ai Releases Steam
It’s not a metaphor. At its second annual H2O World event, H2O releases Steam, an open source data science hub that bundles model selection, model management and model scoring into a single container for elastic deployment.
- Michal Malohlava of H2O and Richard Garris of Databricks explain how to run H2O on Databricks Cloud. Separately, Michal demonstrates Sparkling Water, a Spark package that enables a Spark user to call H2O algorithms; Nidhi Mehta leads a hands-on with PySparkling Water; and Xavier Tordoir of Data Fellas exhibits Interactive Genomes Clustering with Sparkling Water on the Spark Notebook.
- Szilard Pafka of Epoch summarizes his work to date benchmarking R, Python, Vowpal Wabbit, H2O, xgboost and Spark MLlib. As reported previously, Pafka’s benchmarks show that H2O and xgboost are the best performers; they are faster and deliver more accurate models.
As reported in last week’s roundup, H2O.ai also announces a $20 million “B” round.