Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, the graduation of Spark SQL from alpha, new algorithms in MLlib and Spark Streaming, a direct Kafka API for Spark Streaming, and additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics With Spark, writes on tuning Spark jobs on the Cloudera Engineering blog.
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process terabyte-scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on GitHub.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

H2O

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.   Ryft’s Christian Shrauder blogs about FPGAs.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze high-dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab’s Jiannen Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

RapidMiner

SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that Airbnb has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

Zementis

Spark Summit 2014 Roundup

Key highlights from the 2014 Spark Summit:

  • Spark is the single most active project in the Hadoop ecosystem
  • Among Hadoop distributors, Cloudera and MapR are clear leaders with Spark
  • SAP now offers a certified Spark distribution and integration with HANA
  • Datastax has delivered a Cassandra connector for Spark
  • Databricks plans to offer a cloud service for Spark
  • Spark SQL will absorb the Shark project for fast SQL
  • Cloudera, MapR, IBM and Intel plan to port Hive to Spark
  • Spark MLlib will double its supported algorithms in the next release

Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event.  Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.

It’s always ironic when manual registration at a tech conference produces long lines.

Databricks CTO Matei Zaharia kicked off the keynotes with his recap of Spark progress since the last summit.   Zaharia enumerated Spark’s two big goals: a unified platform for Big Data applications combined with a standard library for analytics.  CEO Ion Stoica followed with a Databricks update, including an announcement of the SAP alliance and an impressive demo of Databricks Cloud, currently in private beta.  Separately, Databricks announced $33 million in Series B funding.

Spark Release Manager Patrick Wendell delivered an overview of planned development over the next several releases.   Wendell confirmed Spark’s commitment to stable APIs; patches that break the API fail the build.   The project will deliver dot releases every three months beginning in August 2014, and maintenance releases as needed.   Development focus in the near future will be in the libraries:

  • Spark SQL: optimization, extensions (toward SQL 92), integration (NoSQL, RDBMS), incorporation of Shark
  • MLlib: rapid expansion of algorithms (including descriptive statistics, NMF, Sparse SVM, LDA), tighter integration with R
  • Streaming: new data sources, tighter Flume integration
  • GraphX: optimizations and API stability

Mike Franklin of Berkeley’s AMPLab summarized new developments in the Berkeley Data Analytics Stack (“BadAss”), including significant new work in genomics and energy, as well as improvements to Tachyon and MLBase.  Dave Patterson elaborated on AMPLab’s work in genomics, providing examples showing how Spark has markedly reduced both cost and runtime for genomic analysis.

Cloudera, Datastax, MapR and SAP demonstrated that the first rule of success is to show up:

  • Mike Olson of Cloudera responded to Hortonworks’ snark by confirming Cloudera’s commitment to Impala as well as Hive on Spark.  Olson drew a round of applause when he invited Hortonworks to join the Hive on Spark consortium.
  • Martin van Ryswyk of Datastax announced immediate availability of a Cassandra driver for Spark, a component that exposes Cassandra tables as Spark RDDs.  Datastax continues to work on tighter integration with Spark, including support for Spark SQL, Streaming and GraphX libraries.  In the breakouts, Datastax delivered a deeper briefing on integration with Spark Streaming.
  • M.C. Srivas of MapR highlighted Spark benefits realized by four MapR customers, including Cisco, a health insurer, an ad platform and a pharma company.  MapR continues to claim support for Shark as a differentiator, a point mooted by the announcement that Spark SQL will soon absorb Shark.
  • Aiaz Kazi of SAP seemed pleased that most of the audience had heard of SAP HANA, and delivered an overview of SAP’s integration with Spark.

IBM wasted a Platinum sponsorship by sending some engineers to talk about “System T”, IBM’s text mining application, with passing references to Spark.  Although IBM Infosphere BigInsights is a certified Spark distribution, IBM appears uncommitted to Spark; the lack of executive presence at the Summit stood out in sharp contrast to Cloudera and MapR.

Silver sponsors Hortonworks and Pivotal hosted tables in the vendor area, but did not present anything.

Neuroscientist Jeremy Freeman, back by popular demand from the 2013 Spark Summit, presented the latest developments in his team’s research into animal brains using Spark as an analytics platform.  Freeman’s presentations are among the best demonstrations of applied analytics that I’ve seen in any forum.

A number of vendors in the Spark ecosystem delivered presentations showing how their applications leverage Spark.

The most significant change from the 2013 Spark Summit is the number of reported production users for Spark.  While the December conference focused on Spark’s potential, I counted several dozen production users among the presentations I attended.

Also among the sellout crowd: a SAS executive checking to see if there is anything to this open source and vendor-neutral stuff.  Apparently, he did not get Jim Goodnight’s message that “Big Data is hype manufactured by media”.