Big Analytics Roundup (August 31, 2015)
Top stories for the penultimate week of summer: an excellent SQL-on-Hadoop benchmark; a couple of stories about Gelly, Flink’s graph engine; Apache Ignite goes top-level; a preview of Spark 1.5; and new stuff from RStudio.
Also, on Slideshare, evil mad scientist Paco Nathan presents on “Uber for Education.”
SQL on Hadoop
I missed this story in June, but better late than never. The folks at Allegro.tech, a Warsaw-based collaborative, published results of an excellent benchmark of SQL-on-Hadoop technologies. Scope of the analysis included Hive on MapReduce (the “control”), Hive on Tez, Presto, Impala, Drill and Spark SQL. (The authors note that they wanted to evaluate Hive on Spark, but could not make it work.)
The Allegro team first evaluated Kerberos support, YARN deployment, query fault tolerance, the available UI, JDBC support, and UDF and view support, as well as support for each of the CSV, JSON, Avro and Parquet formats. For benchmarking, they used 11 HiveQL queries testing a mix of typical analytic tasks.
Some key findings:
- Hive on Tez: ran all queries with stable and satisfactory performance
- Spark SQL: better than average performance overall, but could not run two queries
- Presto: convenient to use, but performance was disappointing
- Impala: fastest overall, but could not run one of the queries
- Drill: very fast, but could not run three queries
Apache Flink/Data Artisans
The Apache Software Foundation promotes Ignite to top-level project status. SD Times reports. Ignite is a high-performance integrated and distributed in-memory platform. Ignite is the open source version of GridGain’s commercial product.
ASF also promotes Lens to top-level status. Apache Lens is a “Unified Analytics Platform”, whatever that is. (h/t Hadoop Weekly)
Patrick Wendell of Databricks presented a preview of Spark 1.5 last Thursday. Spark 1.5 will be available in mid-September (exact timing depends on the Apache voting process). Developers from more than 50 companies contributed to the build. A preview is available in Databricks now. Key enhancements:
- Execution concepts will be exposed: tracking memory usage, visualizing DataFrame execution tree
- Project Tungsten will be on by default: binary processing for memory management, code generation for CPU efficiency
- Performance optimizations in SQL/DataFrames: Metadata discovery, predicate pushdown in Parquet, outer joins and window functions
- First class UDAF support
- Improved interoperability with Hive
- Read Parquet files encoded by Hive, Impala, Pig, Avro, Thrift, Spark SQL
- Additional Python interfaces for Spark Streaming
- R bindings for linear models
- Python bindings for Power Iteration Clustering
- New algorithms and transforms for ML Pipelines
There will also be some new packages available concurrently with the 1.5 release, including support for AWS Redshift, Magellan support for spatial analytics and a convex solver package.
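For readers wondering what "first class UDAF support" buys you, here is a minimal sketch in plain Python of the general pattern behind a user-defined aggregate function: initialize a buffer, update it row by row within each partition, merge the partial buffers, then evaluate the final result. This is not Spark's API; the `GeometricMean` class and the `aggregate` helper are hypothetical names invented for illustration, and the partitions are just Python lists.

```python
import math
from functools import reduce


class GeometricMean:
    """Hypothetical aggregate, for illustration only (not Spark's API)."""

    def initialize(self):
        # Buffer holds (row count, running sum of logarithms).
        return (0, 0.0)

    def update(self, buf, value):
        # Fold one input value into a partition-local buffer.
        count, log_sum = buf
        return (count + 1, log_sum + math.log(value))

    def merge(self, left, right):
        # Combine two partial buffers computed on different partitions.
        return (left[0] + right[0], left[1] + right[1])

    def evaluate(self, buf):
        # Turn the final buffer into the aggregate result.
        count, log_sum = buf
        return math.exp(log_sum / count) if count else None


def aggregate(agg, partitions):
    # Assumption: each "partition" is a plain list of positive numbers.
    partials = [reduce(agg.update, part, agg.initialize()) for part in partitions]
    return agg.evaluate(reduce(agg.merge, partials, agg.initialize()))


result = aggregate(GeometricMean(), [[1.0, 4.0], [16.0]])
```

Keeping `update` and `merge` separate is what lets an engine aggregate each partition independently and combine the partial results afterwards.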
On Datanami, George Leopold covers the story.
Alex Woodie interviews some Spark users and discovers that they often use it together with Hadoop.
Jessica Twentyman notes that Spark looks set to replace MapReduce, and inquires into the pace, scope and scale of replacement. She finds a lot of smart people who are optimistic and a few who urge caution, citing Spark’s immaturity.
Darryl Taft explains how Spark transforms Big Data processing and development. Spoiler: it’s faster.
In readwrite, Peter Schlampp provides six reasons that Apache Spark isn’t flickering out, thereby answering a question nobody is asking. For the record, his reasons are: advanced analytics, simplification, support for multiple languages, faster results, Hadoop distribution agnosticism and high-growth adoption.
On the Cloudera blog, Jeff Palmucci of TripAdvisor describes how his team uses Spark.
…announces a new release of BigQuery with UDF support.
On HomeAI, Arno Candel presents a Deep Learning Webinar.