Big Analytics Roundup (August 17, 2015)

Catching up from vacation last week.  Top stories: results of a SQL-on-Hadoop evaluation at Pearson; Google launches Dataflow (giving Flink a boost); while IBM shoehorns Spark onto a mainframe, Vertica gets the jump on IBM PureData with native Spark integration.

Kaggle announces two new competitions:

  • Springleaf Financial, an Indiana credit union founded in 1920, has rebranded to target millennials. They want you to help them target their direct marketing.  Contest ends October 19.
  • Dato, the oddly-named analytic software company, co-sponsors a contest with StumbleUpon to classify web pages as “sponsored” or “organic”.  You don’t have to use Dato software, but you’re eligible for a bonus if you do.  Contest ends October 6.

Alex Woodie asks if Scala will take over the Big Data world, fails to answer the question.  The correct answer is “no”.  Data scientists prefer to work in higher-level languages like R and Python, and Python works well for application development.


On the Qubole blog, Sumit Arora summarizes results from a recent evaluation of SQL-on-Hadoop by Pearson, the global learning company.  Arora’s team evaluated Spark SQL and Presto with Text, Avro, Parquet and ORC formats.  They excluded Impala from testing due to lack of support for complex query types and Amazon S3; they did not consider Drill or Hive on Tez.

Pearson selected Spark SQL and Parquet file format, for reasons detailed in the article.

Apache Drill

Marketing services provider Harte-Hanks announces selection of MapR and Drill for its open CRM platform.

Jim Scott of MapR proposes rethinking SQL for Big Data with Drill.  Video here.

Apache Flink

Kostas Tzoumas of Data Artisans describes Flink’s low-latency and exactly-once stream processing architecture.

On KDnuggets, Tzoumas and Stephan Ewen argue the case for Flink for stream processing.

Nezih Yigitbasi explains how to crunch Parquet files with Flink.

Apache Spark

IBM continues to pour new wine into old bottles.

While IBM tries to run Spark on mainframes and Pivotal messes around with Hawq, HP gets a jump on both, announces plans for integration of Vertica with Spark.  David Ramel summarizes the benefits.

In case you haven’t heard about Spark, Andy Patrizio explains its appeal.

Two interesting items on the MapR blog:

  • Joseph Blue asks if Harper Lee wrote To Kill a Mockingbird.  He describes using Spark and Lucene to compare the first chapter of Go Set a Watchman with To Kill a Mockingbird; the analysis suggests two different authors.
  • Nitin Bandugula details real-time use cases for Spark on Hadoop.

On the Cloudera blog, guest authors Sam Savage and Harry Powell describe the use case for Spark at Barclays.  Note to SAS: when you’ve lost the banks….

Databricks guest blogger Olivier Girardot compares Pandas and Spark DataFrames.


Databricks announces an upgrade of its platform, loosely called “Databricks 2.0”.  (As a SaaS offering, Databricks maintains a two-week release cadence.)  Key bits:

  • Support for Spark 1.4
  • Improved security and access control
  • Developer notebook versioning
  • Support for multi-tenancy

David Ramel reports.  More here,  here and here.  It seems that Databricks has hired a PR agency.


Dato posts videos from the 2015 Data Science Summit and Dato Conference.

On the Dato blog, Susan Romero explains how to use Dato Distributed to run pipelines developed in GraphLab Create in a distributed environment.  Interesting to note that GraphLab Create does not run in a distributed environment itself, so you need to license Dato Distributed to put your models to work in Hadoop.


On Slideshare, Hank Roark publishes an intro to data science with H2O.

Google Dataflow

Google releases its Dataflow hosted cloud service, announces partnerships with ClearStory Data, Salesforce, SpringML and Tamr, plus SDKs from Data Artisans and Cloudera.  Dataflow enables users to build data pipelines that integrate batch and streaming sources with a unified programming model.


The R core team releases R 3.2.2; the Revolutions blog reports.

Andrie de Vries compares the network structure of CRAN and BioConductor, concludes they are different.  Nice graphics, though I’m missing the practical significance.
