Big Analytics Roundup (December 21, 2015)

With the holidays approaching, we still have some hard news, plus some explainers and end-of-2015 roundups.  I’ll post my own roundup of 2015 later this week.

On the BlueData blog, Anant Chintamaneni delivers an excellent overview of Hadoop virtualization, and the trend toward decoupling compute and storage. (h/t Hadoop Weekly)

Quick Hits

  • In InfoWorld, Sri Ambati delivers a well-written practical introduction to machine learning.
  • There’s a new look at RTInsights, a site that aggregates interesting content on real-time analytics.
  • At Amigo Bulls, Giulio Prisco wonders what Facebook’s open sourcing of Big Sur means for Facebook stock.
  • On Slideshare, Matt Dowle celebrates clean data.

Explainers


  • On the AWS Big Data blog, Nick Corbett explains how to tune your Titan Graph database on AWS.
  • On the Confluent blog, Liquan Pei explains how to build ETL with Kafka Connect.
  • Rick van der Lans delivers an excellent guide to SQL syntax with Apache Drill.
  • On DZone, Henryk Konsek explains how to connect Apache Camel with Apache Spark.  FWIW.
  • On the Slalom blog, Kevin Feit and Oliver Asmus explain how to get started with Microsoft Azure Machine Learning.  Not that much explaining is required; AML is very easy to use.

Best of 2015

  • Eric Knorr summarizes the year 2015 in cloud.  Key bits:  AWS pulls ahead; machine learning moves to the cloud; Microsoft’s hybrid cloud.
  • On the Apache Flink blog, Robert Metzger updates the community on the year 2015 in Flink.  More on Slideshare.


(1) Time Series Analytics for Spark

On the Cloudera Engineering Blog, Sandy Ryza introduces Spark-TS, a library of time series functions that fills a major gap in Spark functionality.  The library includes Scala and Python APIs.

(2) FuxiSort Smashes Sort Records

Here’s a story from October that I missed.  A team from Alibaba demolishes the sort speed records in four categories with the unfortunately named FuxiSort.

(3) Qubole Adds Google Cloud Platform Support

Big Data-as-a-service provider Qubole announces a Spark service for Google Cloud Platform.  Qubole Data Service (QDS) offers persistent Spark notebooks and automatic provisioning for Spark clusters.  QDS is now available on the three leading cloud platforms.

(4) TPC Releases Benchmark Standard for Big Data

The Transaction Processing Performance Council (TPC) releases TPC-DS 2.1, an industry standard benchmark for SQL-based Big Data systems.  The standard models the complete decision support process, measuring query response time, throughput, data integration performance and data load for a given system configuration.  Details here.

(5) New Stuff for Microsoft Azure Machine Learning

Microsoft announces general availability (GA) of a free Excel add-in for use with Web services published from Azure Machine Learning.  Details here.

In other AML news, Azure customers can now apply Azure Machine Learning models as a function on streaming data.  On Microsoft’s Machine Learning blog, Sudhesh Suresh reports.  Gary Ericson delivers a cheat sheet.

(6) Small Improvements for BigInsights

IBM announces BigInsights 4.1 Fix Pack 2, which adds support for SLES and Spark 1.5.1, plus enhancements to Big SQL and Text Analytics.

(7) New Bits for Drill

The Drill team announces Release 1.4, with minor enhancements.
