Big Analytics Roundup (April 18, 2016)
In hard news this week, Storm hits a milestone with Release 1.0, Google releases TensorFlow 0.8 with distributed computing support, and DataStax announces DataStax Enterprise Graph. And, following on NVIDIA’s DGX-1 announcement last week, there are a number of items on Deep Learning featured below.
— Adrian Colyer summarizes a paper that summarizes 900 other papers on Deep Learning.
— Data Science Central compiles a slew of links on Deep Learning.
— Nicole Hemsoth interviews NVIDIA Veep Marc Hamilton, who ruminates on the convergence of supercomputing and Deep Learning.
— On the Pivotal Big Data blog, Alexey Grishchenko explains what’s up with Apache Hawq, the SQL-on-Hadoop-and-Greenplum engine that is now an Apache Incubator project. According to OpenHub, there’s a lot of activity on Hawq, and contributions are up sharply since it went Apache.
— In KDnuggets, Microsoft’s Brandon Rohrer publishes a handy pocket guide to data science.
— Nicholas A. Perez explains custom streaming sources in Spark.
— Ian Pointer explains Apache Beam, and how it aspires to be the uber-API.
— Abie Reifer explains Microsoft Azure HDInsight.
— Yong Feng of IBM’s Spark Technology Center explains results of a test run with Spark on Mesos.
— Gopal Wunnava explains geospatial intelligence with SparkR on Amazon EMR.
— IBM’s Fred Reiss explains SystemML, for those who missed his presentation at Spark Summit East.
— For masochistic sabermetricians, Nick Amato explains baseball statistics with Hive and Pig.
— Serdar Yegulalp reviews Apache Storm 1.0. He likes it.
— DataArtisans’ Kostas Tzoumas explains counting in streams, then touts Flink.
— Timothy Prickett Morgan reports on HPE’s efforts to put Spark on a Superdome. Results are interesting. But as with IBM running Spark on a mainframe, such efforts overlook a key benefit of Hadoop and Spark: the ability to avoid dealing with the likes of HPE and IBM.
— Katharine Kearnan interviews Nick Pentreath, one of the two Spark Committers IBM has hired. He predicts that in Spark 2.0, the ML pipeline API approaches parity with the MLlib API. Interestingly, he doesn’t expect a lot from SparkR.
— In Forbes, Chris Wilder recaps his visit to Google Cloud Platform NEXT 2016.
— Andrew Brust summarizes Hortonworks’ recent announcements, sees an emerging duopoly of Cloudera and Hortonworks. I’m not inclined to dismiss MapR and AWS so easily.
— Craig Stedman comments on Pivotal’s exit from the Hadoop distribution market, quotes some old guy wondering how much longer IBM will keep BigInsights alive. My take on Pivotal: honestly, I thought they exited a year ago.
— Cloud platform Altiscale’s Raymie Stata surveys Hadoop’s history, sees movement to the cloud.
— James Nunns wonders if the top Hadoop distributors can steal the show from Spark at Hadoop Summit 2016. If you count the number of times the word “Spark” appears in Hortonworks’ announcement, the answer is no.
— Ajay Khanna opines that absent data quality and metadata management, your data lake will turn into a data swamp.
— Nick Bishop interviews MSFT’s research chief, who assures him that AI is too stupid to wipe us out. I worry more about the chemtrails.
Open Source Announcements
— Google announces TensorFlow 0.8, with distributed computing support and new libraries for user-defined distributed models.
— Apache Mahout announces release of Mahout 0.12.0, with Flink bindings to the Samsara engine. Contributors from DataArtisans did most of the work, as most other contributors have long since exited this project.
— DataStax announces DataStax Enterprise Graph (DSE Graph), built on Apache Cassandra and Apache TinkerPop (a graph computing framework). A year ago, DataStax acquired Aurelius, the commercial venture behind Titan, an open source distributed graph database; Titan uses Cassandra as a back end. DSE Graph includes extensions found in DataStax Enterprise, including security, search, analytics and monitoring tools. Alex Handy reports.
— Databricks announces new content for its Community Edition:
- Lectures and labs from last year’s Machine Learning with Spark MOOC.
- Sample application: Million Songs analysis pipeline in R and Scala.
- Third party Notebook: Golden State Warriors pass analysis.
— Hortonworks previews HDP 2.4.2. Key bits:
- Spark 1.6.1.
- Spark SQL certified with ODBC.
- Bug fixes for Spark/Oozie connection for Kerberos-enabled clusters.
- Spark Streaming with Apache Kafka in a Kerberos-enabled cluster.
- Spark SQL with ORC performance improvements.
- Final technical preview of Apache Zeppelin with Kerberos, LDAP and identity propagation.
— Hortonworks also announces that Pivotal HDP is officially dead. Pivotal announces nothing.
— Teradata announces that its Think Big subsidiary is expanding its data lake and managed service offerings using Apache Spark. This is good news for the eight consultants at Think Big with Spark credentials, as it means less time spent on the bench. Meanwhile, Think Big contributes a distributed K-Modes in PySpark to open source, the first such contribution since 2014. For some reason, they did not contribute it to Spark Packages.
— Atigeo, a “compassionate technology company”, announces that it has added Spark 1.6 to its xPatterns platform.
— Lucidworks announces release of Lucidworks View, a component that simplifies development of applications on Solr and Spark.
— DataRPM, a “Cognitive Data Science” company with very little money, announces a partnership with Tamr, a data integration company with lots of money.