Big Analytics Roundup (April 20, 2015)

Top news this week: a couple of Spark maintenance releases, some interesting new Apache projects, an announcement from Hortonworks and some interesting content from Databricks and Teradata.

Also in the news this week, North Bridge and Black Duck Software release their ninth annual Future of Open Source survey. Meanwhile, Hortonworks, IBM and Pivotal announce ODP harmonization and round up endorsements from their own executives. It’s touching to see such excitement.

Also, the Open Data Science Conference has released the schedule for its Boston events in May.

If you haven’t bookmarked Andrea Mostosi’s incredibly comprehensive catalog of Big Data technologies, you should.

Apache Drill

On the MapR blog, Kirk Borne touts Drill with seemingly exaggerated claims for something still in Release 0.8.

Also on the MapR blog — one senses a trend — Andries Engelbrecht offers a guide to social media analysis with Drill and MicroStrategy.

Apache Kylin

On Slideshare, Ted Dunning introduces Kylin, an Apache incubator project for OLAP cubes on Hadoop.

Apache Spark

For an overview of Spark, see my Apache Spark page.

The Spark team releases two double-dot releases, Spark 1.2.2 and Spark 1.3.1. The former includes bug fixes in Spark Core and PySpark; the latter includes bug fixes for Spark Core, PySpark, Spark SQL and Spark Streaming. Ninety developers contributed to the two releases.

Huawei’s global big data team guest-posts on the Databricks blog and summarizes the newly added FP-Growth and Power Iteration Clustering algorithms. The article includes a performance comparison of FP-Growth in Spark versus a similar algorithm in Mahout. Spoiler: Spark is a lot faster.

Bob DuCharme uses Spark’s GraphX library to build a graph from the U.S. Library of Congress’ subject headings.

Michael Armbrust and colleagues dive deeply into Spark SQL’s Catalyst optimizer.

Talend adds an Apache Spark scenario to its Big Data Sandbox for Cloudera.

Hortonworks announces GA for Spark 1.2.1 in HDP 2.2.4. Horton’s announcement includes ORC file support for Spark, Ambari integration and an endorsement for Apache Zeppelin, a notebook for data science. Horton also announces that it has “worked with the community to ensure that Spark runs on a Kerberos-enabled cluster.” I don’t know what that means, exactly (you either support a feature or you don’t), but it sounds positive.

Saptak Sen offers a hands-on tour of Spark in the Hortonworks Sandbox.

Loraine Lawson asks whether Apache Spark is enterprise-ready, which is kind of ironic given the seven previous items.


Databricks publishes two primers, one for Apache Spark and the other for Databricks Cloud.

On the Databricks blog, CEO Ion Stoica touts the Jobs feature in Databricks Cloud.

Databricks announces that Boston-based Celtra has implemented its self-service ad platform in Databricks Cloud.  Case study here.

IBM InfoSphere BigInsights



Teradata Aster releases a couple of videos, one on Aster Analytics, the other on Aster R.
