Big Analytics Roundup (April 20, 2015)
Top news this week: a couple of Spark maintenance releases, some interesting new Apache projects, an announcement from Hortonworks and some interesting content from Databricks and Teradata.
Also in the news this week, North Bridge and Black Duck Software release their ninth annual Future of Open Source survey. Meanwhile, Hortonworks, IBM and Pivotal announce ODP harmonization and round up endorsements from their own executives. It’s touching to see such excitement.
Also, the Open Data Science Conference has released the schedule for its Boston events in May.
If you haven’t bookmarked Andrea Mostosi’s incredibly comprehensive catalog of Big Data technologies, you should.
On the MapR blog, Kirk Borne touts Drill with seemingly exaggerated claims for something still in Release 0.8.
Also on the MapR blog — one senses a trend — Andries Engelbrecht offers a guide to social media analysis with Drill and MicroStrategy.
On Slideshare, Ted Dunning introduces Kylin, an Apache incubator project for OLAP cubes on Hadoop.
For an overview of Spark, see my Apache Spark page.
The Spark team releases two double-dot releases, Spark 1.2.2 and Spark 1.3.1. The former includes bug fixes in Spark Core and PySpark; the latter includes bug fixes for Spark Core, PySpark, Spark SQL and Spark Streaming. Ninety developers contributed to the two releases.
Huawei’s global big data team guest-posts on the Databricks blog, summarizing the newly added FP-Growth and Power Iteration Clustering algorithms. The article includes a performance comparison of FP-Growth in Spark versus a similar algorithm in Mahout. Spoiler: Spark is a lot faster.
Bob DuCharme uses Spark’s GraphX library to build a graph from the U.S. Library of Congress’ subject headings.
Michael Armbrust and colleagues dive deeply into Spark SQL’s Catalyst optimizer.
Hortonworks announces GA for Spark 1.2.1 in HDP 2.2.4. Horton’s announcement includes ORC file support for Spark, Ambari integration and an endorsement for Apache Zeppelin, a notebook for data science. Horton also announces that it has “worked with the community to ensure that Spark runs on a Kerberos-enabled cluster.” I don’t know what that means, exactly (you either support a feature or you don’t), but it sounds positive.
Saptak Sen offers a hands-on tour of Spark in the Hortonworks Sandbox.
Loraine Lawson asks whether Apache Spark is enterprise-ready, which is kind of ironic given the seven previous items.
Databricks publishes two primers, one for Apache Spark and the other for Databricks Cloud.
On the Databricks blog, CEO Ion Stoica touts the Jobs feature in Databricks Cloud.
IBM InfoSphere BigInsights