Big Analytics Roundup (August 31, 2015)

Top stories for the penultimate week of summer: an excellent SQL-on-Hadoop benchmark; a couple of stories about Gelly, Flink’s graph engine; Apache Ignite goes top-level; a preview of Spark 1.5; and new stuff from RStudio.

Also, on Slideshare, evil mad scientist Paco Nathan presents on “Uber for Education.”

SQL on Hadoop

I missed this story in June, but better late than never.  The folks at Allegro, a Warsaw-based collaborative, published results of an excellent benchmark of SQL-on-Hadoop technologies.  The analysis covered Hive on MapReduce (the “control”), Hive on Tez, Presto, Impala, Drill and Spark SQL.  (The authors note that they wanted to evaluate Hive on Spark, but could not make it work.)

The Allegro team first evaluated Kerberos support, YARN deployment, query fault tolerance, the available UI, JDBC support, and UDF and view support, as well as support for the CSV, JSON, Avro and Parquet formats.  For benchmarking, they used 11 HiveQL queries testing a mix of typical analytic tasks.

Some key findings:

  • Hive on Tez: ran all queries with stable and satisfactory performance
  • Spark SQL: better than average performance overall, but could not run two queries
  • Presto: convenient to use, but performance was disappointing
  • Impala: fastest overall, but could not run one of the queries
  • Drill: very fast, but could not run three queries

Apache Flink/Data Artisans

On Slideshare, Vasia Kalavri presents an overview of Gelly, Flink’s graph engine.  More about Gelly here.

Apache Ignite/GridGain

The Apache Software Foundation promotes Ignite to top-level project status.  SD Times reports.  Ignite is a high-performance integrated and distributed in-memory platform.  Ignite is the open source version of GridGain’s commercial product.

Apache Lens

ASF also promotes Lens to top-level status.  Apache Lens is a “Unified Analytics Platform”, whatever that is.  (h/t Hadoop Weekly)

Apache Spark/Databricks

Patrick Wendell of Databricks presented a preview of Spark 1.5 last Thursday.  Spark 1.5 will be available in mid-September (exact timing depends on the Apache voting process).  Developers from more than 50 companies contributed to the build.  A preview is available in Databricks now.  Key enhancements:

  • Execution concepts will be exposed: tracking memory usage, visualizing DataFrame execution tree
  • Project Tungsten will be on by default: binary processing for memory management, code generation for CPU efficiency
  • Performance optimizations in SQL/DataFrames: Metadata discovery, predicate pushdown in Parquet, outer joins and window functions
  • First class UDAF support
  • Improved interoperability with Hive
  • Read Parquet files encoded by Hive, Impala, Pig, Avro, Thrift, Spark SQL
  • Additional Python interfaces for Spark Streaming
  • R bindings for linear models
  • Python bindings for Power Iteration Clustering
  • New algorithms and transforms for ML Pipelines

There will also be some new packages available concurrently with the 1.5 release, including support for AWS Redshift, Magellan support for spatial analytics and a convex solver package.

On Datanami, George Leopold covers the story.

Alex Woodie interviews some Spark users and discovers that they often use it together with Hadoop.

Jessica Twentyman notes that Spark looks set to replace MapReduce, and inquires into the pace, scope and scale of replacement.  She finds a lot of smart people who are optimistic and a few who urge caution, citing Spark’s immaturity.

Darryl Taft explains how Spark transforms Big Data processing and development.  Spoiler: it’s faster.

In readwrite, Peter Schlampp provides six reasons that Apache Spark isn’t flickering out, thereby answering a question nobody is asking.  For the record, his reasons are: advanced analytics, simplification, support for multiple languages, faster results, Hadoop distribution agnosticism and high-growth adoption.

On the Cloudera blog, Jeff Palmucci of TripAdvisor describes how his team uses Spark.

Google Cloud

Google announces a new release of BigQuery with UDF support.

On HomeAI, Arno Candel presents a Deep Learning Webinar.


RStudio adds a new starter plan for its cloud service for Shiny apps.  Roger Oberg reports.
