Big Analytics Roundup (November 9, 2015)

My roundup of the Spark Summit Europe is here.

Two important events this week:

  • H2O World starts today and runs through Wednesday at the Computer History Museum in Mountain View CA.   Yotam Levy summarizes here and here.
  • Open Data Science Conference meets November 14-15 at the Marriott Waterfront in SFO

Five backgrounders and explainers:

  • At HUG London, Apache’s Ufuk Celebi delivers a nice intro to Flink.
  • On the Databricks blog, Yesware’s Justin Mills explains how his team migrates Spark applications from concept through prototype through production.
  • On Slideshare, Alpine’s Holden Karau delivers an overview of Spark with Python.
  • Chloe Green wakes from a three year slumber and discovers Spark.
  • On the Cloudera Engineering blog, Madhu Ganta explains how to build a CEP app with Spark and Drools.

Third quarter financials drive the news:

(1) MapR: We Grew 160% in Q3

MapR posts its biggest quarter ever.

(2) HDP: We Grew 168% in Q3

HDP loses $1.33 on every dollar sold, tries to make it up on volume.  Stock craters.

(3) Teradata: We Got A Box of Steak Knives in Q3

Teradata reports more disappointing sales as customers continue to defer investments in big box solutions for data warehousing.  This is getting to be a habit with Teradata; the company missed revenue projections for 2014 as well as the first and second quarters of this year.  Any company can run into headwinds, but a management team that consistently misses targets clearly does not understand its own business and needs to go.

Full report here.

(4) “B” Round for H2O.ai

Machine learning software developer H2O.ai announces a $20 million Series B round led by Paxion Capital Partners.  H2O.ai leads development of H2O, an open source project for distributed in-memory machine learning.  The company reports 25 new support customers this year.

(5) Fuzzy Logix Lands Funds

In-database analytics vendor Fuzzy Logix announces a $5 million “A” round from New Science Ventures.  Fuzzy offers a library of analytic functions that run in a number of high-performance databases and in HiveQL.

(6) New Optimization Package for Spark

On the Databricks blog, Aaron Staple announces availability of Spark TFOCS, an optimization package based on the eponymous Matlab package.  (TFOCS=Templates for First Order Conic Solvers.)

(7) WSO2 Delivers IoT App on Spark 

IoT middleware vendor WSO2 announces Release 3.0 of its open source Data Analytics Server (DAS) platform.   DAS collects data streams and applies batch, real-tim or interactive analytics; predictive analytics are in the roadmap.  For streaming data sources, DAS supports java agents, javascript clients and 100+ connectors.  The software runs on Spark and Lucene.

(8) Hortonworks: We Aren’t Irrelevant

On the Hortonworks blog, Vinay Shukla and Ram Sriharsha tout Hortonworks’ contributions to Spark, including ORC support, an Ambari stack definition for Spark, tighter integration between Hive and Spark, minor enhancements to ML and user-facing documentation.  Looking at the roadmap, they discuss Magellan for geospatial and Zeppelin notebooks. (h/t Hadoop Weekly).

(9) Apache Drill Delivers Fast SQL-on-Laptop

On the MapR blog, Mitsutoshi Kiuchi offers a case study in how to run a silly benchmark.

Comparing the functionality of Drill and Spark SQL, Kiuchi argues that Drill “supports” NoSQL databases but Spark does not, relegating Spark’s packages to a footnote.  “Support” is a loaded word with open source software; technically, nothing is supported unless you pay for it, in which case the scope of support is negotiated as part of the SLA.  It’s also worth noting that MongoDB developed Spark’s interface to MongoDB (for example), which provides a certain amount of confidence.

Kiuchi does not consider other functional areas, such as security, YARN support, query fault tolerance, the user interface, metastore management and view support, where Drill comes up short.

In a previously published performance test of five SQL engines, Spark successfully ran nine out of eleven queries, while Drill ran eight out of ten.  On the eight queries both engines ran, Drill was slightly faster on six.  For this benchmark, Kiuchi runs three queries on his laptop with a tiny dataset.

As a general rule, one should ignore SQL-on-Hadoop benchmarks unless they run industry standard queries (e.g. TPC) with large datasets in a distributed configuration.

Spark Summit Europe Roundup

The 2015 Spark Summit Europe met in Amsterdam October 27-29.  Here is a roundup of the presentations, organized by subject areas.   I’ve omitted a few less interesting presentations, including some advertorials from sponsors.

State of Spark

— In his keynoter, Matei Zaharia recaps findings from Databricks’ Spark user survey, notes growth in summit attendance, meetup membership and contributor headcount.  (Video here). Enhancements expected for Spark 1.6:

  • Dataset API
  • DataFrame integration for GraphX, Streaming
  • Project Tungsten: faster in-memory caching, SSD storage, improved code generation
  • Additional data sources for Streaming

— Databricks co-founder Reynold Xin recaps the last twelve months of Spark development.  New user-facing developments in the past twelve months include:

  • DataFrames
  • Data source API
  • R binding and machine learning pipelines

Back-end developments include:

  • Project Tungsten
  • Sort-based shuffle
  • Netty-based network

Of these, Xin covers DataFrames and Project Tungsten in some detail.  Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work.  Video here.

Getting Into Production

— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips to avoid them.  Key issues: moving beyond Python performance; using Spark with R; network and CPU-bound workloads.  Video here.

— Tuplejump’s Evan Chan summarizes Spark deployment options and explains how to productionize Spark, with special attention to the Spark Job Server.  Video here.

— Spark committer and Databricks engineer Andrew Or explains how to use the Spark UI to visualize and debug performance issues.  Video here.

— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization.  They tout Sentry, Cloudera’s preferred security platform.  Video here.

Spark for the Enterprise

— Revisting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality.  He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit.  His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark.  Video here.

— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others).  Video here.  Key insights:

  • Prediction is the most popular use case
  • Hive is most frequently co-installed, followed by HBase, Impala and Solr.
  • Customers want security and performance comparable to leading relational databases combined with simplicity.

Data Sources and File Systems

— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API.  (Video unavailable).

— Tachyon Nexus’ Gene Pang offers an excellent overview of Tachyon’s memory-centric storage architecture and how to use Spark with Tachyon.  Video here.

Spark SQL and DataFrames

— Michael Armbrust, lead developer for Spark SQL, explains DataFrames.  Good intro for those unfamiliar with the feature.  Video here.

— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark.  More detail about the use case and solution here.  The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata.  Video here.

— Informatica’s Kiran Lonikar summarizes a proposal to use GPUs to support columnar data frames.  Video here.

— Ema Orhian of Atigeo describes jaws, a restful data warehousing framework built on Spark SQL with Mesos and Tachyon support.  Video here.

Spark Streaming

— Helena Edelson, VP of Product Engineering at Tuplejump, offers a comprehensive overview of streaming analytics with Spark, Kafka, Cassandra and Akka.  Video here.

— Francois Garillot of Typesafe and Gerard Maas of virdata explain and demo Spark Streaming.    Video here.

— Iulian Dragos and Luc Bourlier explain how to leverage Mesos for Spark Streaming applications.  Video here.

Data Science and Machine Learning

— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project.  Video here.

— Alexander Ulanov, Senior Research Scientist at Hewlett-Packard Labs, describes his work with Deep Learning, building on MLLib’s multilayer perceptron capability.  Video here.

— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR.  He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages.  The SparkR roadmap, he notes, includes expanded MLLib functionality; UDF support; and a complete DataFrame API.  Finally, he demos SparkR and explains how to get started.  Video here.

— MLlib committer Joseph Bradley explains how to combine the strengths R, scikit-learn and MLlib.  Noting the strengths of R and scikit-learn libraries, he addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment?   Bradley demonstrates how to do this with Spark, using sentiment analysis as an example.  Video here.

— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLLib, Akka and Cassandra.  He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.

— Bitly’s Sarah Guido explains topic modeling, using Spark MLLib’s Latent Dirchlet Allocation.  Video here.

— Casey Stella describes using word2vec in MLLib to extract features from medical records for a Kaggle competition.  Video here.

— Piotr Dendek and Mateusz Fedoryszak of the University of Warsaw explain Random Ferns, a bagged form of Naive Bayes, for which they have developed a Spark package. Video here.

GeoSpatial Analytics

— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine.  Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL.  Video here.

Use Cases and Applications

— Ion Stoica summarizes Databricks’ experience working with hundreds of companies, distills to two generic Spark use cases:  (1) the “Just-in-Time Data Warehouse”, bypassing IT bottlenecks inherent in conventional DW; (2) the unified compute engine, combining multiple frameworks in a single platform.  Video here.

— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a use case real-time analytics.  SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration.  Video here.

— Yahoo’s Ayman Farahat describes a collaborative filtering application built on Spark that generates 26 trillion recommendations.  Training time: 52 minutes; prediction time: 8 minutes.  Video here.

— Sujit Pal explains how Elsevier uses Spark together with Solr, OpenNLP to annotate documents at scale.  Elsevier has donated the application, called SoDA, back to open source.  Video here.

— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure.  Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including pebble, Android, IOS, play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS.   With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior.  Video here.

— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes kafka, Secor, Swift, Parquet and elasticsearch for data collection; Spark SQL and MLLib for pattern learning; and a complex event processing engine for application in real time.  Video here.

Big Analytics Roundup (November 2, 2015)

Spark Summit Europe, Oracle Open World and IBM Insights all met last week, as did Cloudera’s Wrangle conference for data scientists.

But in the really important news, KC beats the Mets to take the Series.

Top news from the Spark Summit is Typesafe’s announcement of Spark support, plus some insight into what’s coming in Spark 1.6.  I will publish a separate roundup for the Spark Summit next week  when presentations are available.

Nine stories this week:

(1) Typesafe Announces Spark Support

Typesafe, the commercial venture behind Scala and Akka, announces commercial support for Apache Spark.   Planned service offerings include an offer of one day business hour response to questions for projects in development.  For production, SLAs range from 4 hour turnaround during business hours up to 24/7 with one hour turnaround.

(2) More Funding for Alteryx

The New York Times reports that Alteryx has landed an $85 million “C” round, led by Iconiq Capital.  That makes a total of $163 million in four rounds for the company.

(3) Oracle Adds Spark to Cloud

At Oracle Open World, Oracle announces Oracle Cloud Platform for Big Data, a PaaS offering;  Dave Ramel covers the story.   Key new bits include automated ingestion, preparation, repair, enrichment and governance, all built in Spark; and a DBaaS offering with Hadoop, Spark and NoSQL data services.

(4) IBM Adds Spark Support to Analytics Server

Full story here.  Great news for those who want to use the high-end version of the second most popular data mining workbench with the third and fourth most popular Hadoop distributions.

(5) Ned Explains Zeppelin

Ned’s Blog provides a nice Zeppelin walk-through, noting the UI’s rich list of language interpreters, which currently includesL HiveQL, Spark, Flink, Postgres, HAWQ, Tajo, AngularJS, Cassandra, Ignite, Phoenix, Geode, Kylin and Lens.

(6) IIT and ANL Deliver BSP with ZHT

Researchers from the Illinois Institute of Technology, Argonne Labs and Hortonworks report that they have implemented a graph processing system based on Bulk Synchronous Processing on ZHT, a distributed key-value store.   Nicole Hemsoth reports.   The new engine, called Pregelix, when benchmarked against Giraph, GraphLab, GraphX and Hama, outshines them all.

(7) Wrangle 2015 Meets in SFO

Cloudera’s Justin Kestelyn summarizes the event, which hosted data science teams from the likes of Uber, Facebook and Airbnb.  Tony Baer offers the trite perspective that data science is about people.

(8) MapR Offers Free Spark Training

MapR announces availability of its first free Apache Spark course as part of its Hadoop On-Demand Training program.  No word on quality, but it’s hard to beat the price.

(9) Cloudera Pushes HUE for Spark

On the Cloudera Engineering blog, Justin Kestelyn explains how to use HUE’s notebook app with SQL and Spark.