Big Analytics Roundup (August 3, 2015)

This week: IBM pours new wine into old bottles; priorities for the newly formed R Consortium; insight into Spark Streaming and Spark ML pipelines; and the usual snark.

The Linux Foundation’s Apache Big Data conference, to be held in Budapest in September, has already posted slides featuring Spark, Ignite, S2Graph, Kylin and WSO2.

Greta Roberts of Talent Analytics wants you to stop hiring data scientists.

R Consortium

On his r4stats blog, Bob Muenchen outlines areas that should be top priority for the newly formed R Consortium.  They include:

  1. Package Selection
  2. Package Accuracy
  3. Package Longevity
  4. Generic Functions
  5. Output to Microsoft Office
  6. Graphical User Interface

It’s hard to argue with Bob’s list of priorities, but I doubt the R Consortium will have much impact:

— Items 1-3 are all symptomatic of R’s heterodox approach to development, which is either a bug or a feature depending on your perspective.  There is a demand for curated and supported R environments in enterprises, which can and should be satisfied by service providers.

— Microsoft Office integration is a likely outcome of Microsoft’s acquisition of Revolution Analytics.

There is no future in an R GUI:

— Alteryx already offers a workflow-based GUI that can run R code snippets separately developed and tested.

— The missing link in the market today is a GUI-based R code generator that can accept graphical user input and generate R code on the fly.

— Building such an application is a major development project for a very narrow market.

— If the end user does not want to code, you might as well build the application in C or Java (e.g. Alteryx).

— Hence, the market consists of users who do not want to code, but insist that the tool output R code.  That audience can meet at a table for four in your local Starbucks.

Current membership in the R Consortium includes the R Foundation, Microsoft, RStudio, TIBCO, Alteryx, Google, HP and Oracle.  Conspicuously absent from the group, despite touting R capability: IBM, Pivotal and Teradata.

Apache Drill

Tugdual Grall explains how to create a new function in Drill.

On TechTarget, Jack Vaughan opines about SQL-on-Hadoop, including Drill.

Apache Druid

In ADTMag, David Ramel reports that folks at Yahoo like Druid for scalable OLAP on Hadoop.  This page summarizes Druid’s capabilities, with comparisons to other platforms.  Conspicuously missing:

Apache Flink

On Slideshare, Aljoscha Krettek of Data Artisans publishes a summary of new features planned for Flink 0.10.  Key bits:

  • High availability of master node
  • Live monitoring
  • Improvements to event-time, watermarks and windowing

On IndianWeb2, a survey of eight open source Big Data projects to watch, including Flink and Zeppelin.

Flink is also listed among eight cool new Python tools in a story on the Galvanize blog.

Apache Mesos

announces Mesos 0.23.0, which has a number of enhancements.

On InfoQ, Netflix’s Diptanu Choudhury summarizes distributed scheduling with Mesos.

Apache Spark

Four pieces on the Databricks blog:

  • Tathagata Das, Matei Zaharia and Patrick Wendell dive into Spark Streaming’s execution model.  They take on the “pure streaming versus micro-batching” controversy, if you can call it that, and put it to rest.
  • Joseph Bradley, Xiangrui Meng and Burak Yavuz summarize Spark 1.4’s new features for machine learning pipelines.  Key bits: a dozen new feature transformers, extension of the pipeline and Python APIs to include more machine learning algorithms and the ability to customize pipelines.
  • Tao Wang touts the new SequoiaDB Connector for Spark.
  • Burak Yavuz explains Spark packages and Maven libraries.

IBM successfully plants a piece about Bluemix on Forbes.

Hortonworks offers a tutorial on using IPython notebook with Spark.

On the Sigmoid blog, Arush Kharbanda looks under Spark’s hood.

New Wine in Old Bottles Department: When IBM announced plans to support Spark, I joked to a friend that somebody in the bowels of the company would propose putting Spark on a mainframe. Well, LOL.


Danny Bickson expresses excitement about two new GraphLab features.  Or are they Dato features?  With this company/project, it’s hard to tell.


On Slideshare, Erin LeDell reviews the state of machine learning in medicine.

In this video, Hank Roark discusses use of H2O and R with data from the New York Citi Bike service.

Julian Hillebrand explains how to predict social network influence with R and H2O for ensemble learning.
