Spark Summit Europe Roundup
The 2015 Spark Summit Europe met in Amsterdam October 27-29. Here is a roundup of the presentations, organized by subject area. I’ve omitted a few less interesting presentations, including some advertorials from sponsors.
State of Spark
— In his keynote, Matei Zaharia recaps findings from Databricks’ Spark user survey and notes growth in summit attendance, meetup membership and contributor headcount. (Video here.) Enhancements expected for Spark 1.6 include:
- Dataset API
- DataFrame integration for GraphX and Streaming
- Project Tungsten: faster in-memory caching, SSD storage, improved code generation
- Additional data sources for Streaming
— Databricks co-founder Reynold Xin recaps the past twelve months of Spark development. New user-facing developments include:
- Data source API
- R bindings and machine learning pipelines
Back-end developments include:
- Project Tungsten
- Sort-based shuffle
- Netty-based network
Of these, Xin covers DataFrames and Project Tungsten in some detail. Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work. Video here.
Getting Into Production
— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips for avoiding them. Key issues: moving beyond Python performance; using Spark with R; network- and CPU-bound workloads. Video here.
— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization. They tout Sentry, Cloudera’s preferred security platform. Video here.
Spark for the Enterprise
— Revisiting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality. He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit. His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark. Video here.
— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others). Video here. Key insights:
- Prediction is the most popular use case.
- Hive is most frequently co-installed, followed by HBase, Impala and Solr.
- Customers want security and performance comparable to leading relational databases combined with simplicity.
Data Sources and File Systems
— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API. (Video unavailable).
Spark SQL and DataFrames
— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark. More detail about the use case and solution here. The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata. Video here.
Data Science and Machine Learning
— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project. Video here.
— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR. He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages. The SparkR roadmap, he notes, includes expanded MLlib functionality; UDF support; and a complete DataFrame API. Finally, he demos SparkR and explains how to get started. Video here.
— MLlib committer Joseph Bradley explains how to combine the strengths of R, scikit-learn and MLlib. Noting the strengths of the R and scikit-learn libraries, he addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment? Bradley demonstrates how to do this with Spark, using sentiment analysis as an example. Video here.
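The pattern Bradley describes, train with a single-machine library on the driver, then ship the fitted model to the workers and score each partition in parallel, can be sketched without Spark at all. The keyword "model" and the `score_partition` helper below are hypothetical stand-ins of my own, not Bradley's code: in real Spark code the fitted model would travel to executors as a broadcast variable and `score_partition` would run inside `mapPartitions`.

```python
# A hypothetical single-machine "model": a keyword-based sentiment scorer
# standing in for something trained with scikit-learn on the driver.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def score_partition(records):
    # In Spark, this function would run inside mapPartitions on each
    # executor, with the model shipped out via a broadcast variable.
    return [(r, score(r)) for r in records]

# Simulate an RDD as a list of partitions and "distribute" the scoring.
partitions = [
    ["I love Spark", "awful latency today"],
    ["great talk", "bad demo, good questions"],
]
results = [pair for part in map(score_partition, partitions) for pair in part]
```

The point of the pattern is that the model itself never needs to be distributed-aware; only the scoring loop is parallelized.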
— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLlib, Akka and Cassandra. He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.
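A minimal illustration of the distance-based idea Busa mentions: flag any reading whose distance from the sample mean exceeds k standard deviations. This is a plain-Python sketch, not ING's implementation; the function name, threshold and data are invented for illustration, and a streaming system would apply something like this per time window.

```python
from statistics import mean, stdev

def distance_outliers(values, k=3.0):
    """Flag points more than k standard deviations from the mean.

    A toy distance-based detector; the threshold k and any windowing
    are application-specific choices.
    """
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 25.0, 10.1]
print(distance_outliers(readings, k=2.0))  # → [25.0]
```

Density-based techniques differ in that they flag points in sparse neighborhoods rather than points far from a global center, which handles multi-modal data better.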
— Bitly’s Sarah Guido explains topic modeling, using Spark MLlib’s Latent Dirichlet Allocation. Video here.
— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine. Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL. Video here.
Use Cases and Applications
— Ion Stoica summarizes Databricks’ experience working with hundreds of companies and distills it into two generic Spark use cases: (1) the “Just-in-Time Data Warehouse,” which bypasses IT bottlenecks inherent in conventional data warehousing; (2) the unified compute engine, which combines multiple frameworks in a single platform. Video here.
— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a real-time analytics use case. SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration. Video here.
— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure. Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including Pebble, Android, iOS, Play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS. With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior. Video here.
— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes Kafka, Secor, Swift, Parquet and Elasticsearch for data collection; Spark SQL and MLlib for pattern learning; and a complex event processing engine for application in real time. Video here.