Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

Spark is overhyped! He declared. His evidence? This:

screen-shot-2017-02-09-at-2-58-05-pm

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which begs the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLLib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms.)

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGboost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

screen-shot-2017-02-13-at-12-30-38-pm
Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in AzureHD, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R ranks fifth; Matlab, fourteenth; Scala, fifteenth; Julia thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release.)

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

More stories herehereherehereherehereherehere, and here.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegualp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Big Analytics Roundup (April 25, 2016)

Mesosphere wins the internet this week with its announcement that it has open sourced DC/OS, its datacenter virtualization project built around Apache Mesos. While not an “analytics” project per se, DC/OS has the potential to transform how organizations provision and deploy their analytics platforms.

In a nutshell, Apache Mesos distributes workloads across physical IT resources. DC/OS adds a container orchestration platform; installation, management and monitoring tools; and improvements to networking, security, load balancing, security and other areas. For more details about DC/OS and why it matters, read this white paper by Benjamin Hindman and Edward Hsu of Mesosphere.

Mesosphere has assembled an alliance of 61 launch partners, including tech vendors, systems integrators and potential users. Big brands include Accenture, Capgemini, Cisco, EMC, HPE, Microsoft, MapR, Microsoft and Verizon. Notable startups include Alluxio, Canonical, Confluent, Lightbend and MemSQL.

Analysts chime in:

  • Gavin Clarke thinks Google forced Mesosphere’s hand by open sourcing Kubernetes.
  • Mike Wheatley, notes that many of the components were already open source.
  • On TechCrunch, Frederic Lardinois reports and comments.
  • In Computerworld, John Ribeiro reports.
  • Janakiram MSV wonders if DC/OS will emerge as an alternative to Kubernetes.
  • Sam Dean surveys the project and interviews Ben Hindman.
  • George Leopold notes the scope of the DC/OS ecosystem.
  • Joao Lima reports.

DC/OS ships with more than 30 open source packages ready to install as DC/OS services. Notable among them: Cassandra, Elasticsearch, Kafka, MemSQL, Spark, Storm and Zeppelin.

Explainers

— Andrie de Vries explains how he scraped CRAN to trace the growth in R packages.

— On the Cloudera Engineering blog, David Alves explains how to use Impala and Kudu for analytic workloads.

— Michael Hunger and William Lyon explain how they analyzed the Panama Papers with Neo4j.

— On the Microsoft Azure blog, Liam Cavanagh explains how to optimize document search in Azure.

— Adrian Colyer of the morning paper summarizes five papers on word vectors, reviews Global Vectors for Word Representation, delivers an overview of Deep Learning and covers ImageNet classification with deep convolutional neural networks.

— Mario Inchiosa and Roni Burd explain how Microsoft R Server delivers an R interface to Spark in HDInsight.

Perspectives

— In MIT Technology review, Tom Simonite interviews Google’s Jeff Dean, contributor to Spanner, Translate, BigTable, MapReduce, Google Brain. LevelDB and TensorFlow. They discuss the future of machine learning.

— David Weldon went to Strata and interviewed some people:

  • Ali Hodroj of GigaSpaces, a cloud enabling company. Hodroj is bullish on cloud.
  • H2O.ai’s Arno Candel, who is surprised that so many people are talking about Spark.
  • Nikita Ivanov of GridGain, who says that people are excited about in-memory computing.
  • DataArtisans’ Kostas Tzoumas, who thinks that more people would use Flink if they were better educated.

— Alex Woodie touts Apache Beam, the open source implementation of Google’s Cloud Dataflow, which aspires to unify everything.

— James Nunns surveys ten Big Analytics startups: Confluent, H2O.ai, AtScale, Interana, Tamr, Wavefront, BlueTalon, Cazena, DataTorrent and Databricks.

— In Silicon Angle, Wikibon’s Paul Gillin interviews Wikibon’s George Gilbert, who is bullish on Spark.

— John Leonard ruminates on Hadoop, noting the proliferation of cute animal logos, and the challenges of the open source business model.

— Sam Dean notices that there are quite a few new open source tools for machine learning.

— Jack Vaughan summarizes the educational challenges posed by machine learning.

Commercial Announcements

— Dataiku announces availability of Data Science Studio on Microsoft Azure.

— GridGain announces availability of a support package for Apache Ignite that includes its Professional Edition — essentially the same as Apache Ignite, with more frequent maintenance releases and some LGPL libraries.

— MemSQL announces closing on a $36 million “C” round. All existing investors participated, plus two new investors.

Big Analytics Roundup (April 18, 2016)

In hard news this week, Storm hits a milestone with Release 1.0, Google releases TensorFlow 0.8 with distributed computing support, and DataStax announces DataStax Enterprise Graph. And, following on NVIDIA’s DGX-1 announcement last week there are a number of items on Deep Learning featured below.

Deep Learning

— Adrian Colyer summarizes a paper that summarizes 900 other papers on Deep Learning.

— Data Science Central compiles a slew of links on Deep Learning.

— Nicole Hemsoth interviews NVIDIA Veep Marc Hamilton, who ruminates on the convergence of supercomputing and Deep Learning.

Explainers

— On the Pivotal Big Data blog, Alexey Grischchenko explains what’s up with Apache Hawq, the SQL-on-Hadoop-and-Greenplum engine that is now an Apache Incubator project. According to OpenHub, there’s a lot of activity on Hawq, and contributions are up sharply since it went Apache.

— In KDnuggets, Microsoft’s Brandon Rohrer publishes a handy pocket guide to data science.

— Nicholas A. Perez explains custom streaming sources in Spark.

— Ian Pointer explains Apache Beam, and how it aspires to be the uber-API.

— Abie Reifer explains Microsoft Azure HDInsight.

— Yong Feng of IBM’s Spark Technology Center explains results of a test run with Spark on Mesos.

— Gopal Wunnava explains geospatial intelligence with SparkR on Amazon EMR.

— IBM’s Fred Reiss explains SystemML, for those who missed his presentation at Spark Summit East.

— For masochistic sabremetricians, Nick Amato explains baseball statistics with Hive and Pig.

Perspectives

— Serdar Yegulalp reviews Apache Storm 1.0. He likes it.

— DataArtisans’ Kostas Tzoumas explains counting in streams, then touts Flink.

— Timothy Prickett Morgan reports on HPE’s efforts to put Spark on a Superdome. Results are interesting. But as with IBM running Spark on a mainframe, such efforts overlook a key benefit of Hadoop and Spark: the ability to avoid dealing with the likes of HPE and IBM.

— Katharine Kearnan interviews Nick Pentreath, one of the two Spark Committers IBM has hired. He predicts that in Spark 2.0, the ML pipeline API approaches parity with the MLlib API. Interestingly, he doesn’t expect a lot from SparkR.

— In Forbes, Chris Wilder recaps his visit to Google Cloud Platform NEXT 2016.

— Andrew Brust summarizes Hortonworks’ recent announcements, sees an emerging duopoly of Cloudera and Hortonworks. I’m not inclined to dismiss MapR and AWS so easily.

— Craig Stedman comments on Pivotal’s exit from the Hadoop distribution market, quotes some old guy wondering how much longer IBM will keep BigInsights alive. My take on Pivotal: honestly, I thought they exited a year ago.

— Cloud platform Altiscale’s Raymie Stata surveys Hadoop’s history, sees movement to the cloud.

— James Nunns wonders if the top Hadoop distributors can steal the show from Spark at Hadoop Summit 2016. If you count the number of times the word “Spark” appears in Hortonworks’ announcement, the answer is no.

— Ajay Khanna opines that absent data quality and metadata management, your data lake will turn into a data swamp.

— Nick Bishop interviews MSFT’s research chief, who assures him that AI is too stupid to wipe us out. I worry more about the chemtrails.

Open Source Announcements

— Apache Storm announces Release 1.0.0, with many enhancements. According to OpenHub, Storm is picking up steam, with 127 active contributors in the past 12 months.

— Google announces TensorFlow 0.8, with distributed computing support and new libraries for user-defined distributed models.

— Apache Mahout announces release of Mahout 0.12.0, with Flink bindings to the Samsara engine. Contributors from DataArtisans did most of the work, as most other contributors have long since exited this project.

Commercial Announcements

— DataStax announces DataStax Enterprise Graph (DSE Graph), built on Apache Cassandra and Apache Tinkerpop (a graph computing framework.) A year ago, Datastax acquired Aurelius, the commercial venture behind Titan, an open source distributed graph database; Titan uses Cassandra as a back end. DSE Graph includes extensions found in DataStax Enterprise, including security, search, analytics and monitoring tools. Alex Handy reports.

— Databricks announces new content for its Community Edition:

— Hortonworks previews HDP 2.4.2. Key bits:

  • Spark 1.6.1.
  • Spark SQL certified with ODBC.
  • Bug fixes for Spark/Oozie connection for Kerberos-enabled clusters.
  • Spark Streaming with Apache Kafka in a Kerberos-enabled cluster.
  • Spark SQL with ORC performance improvements.
  • Final technical preview of Apache Zeppelin with Kerberos, LDAP and identity propagation.

— Hortonworks also announces that Pivotal HDP is officially dead. Pivotal announces nothing.

— Teradata announces that its Think Big subsidiary is expanding its data lake and managed service offerings using Apache Spark. This is good news for the eight consultants at Think Big with Spark credentials, as it means less time spent on the bench. Meanwhile, Think Big contributes a distributed K-Modes in PySpark to open source, the first such contribution since 2014. For some reason, they did not contribute it to Spark packages.

— Atigeo, a “compassionate technology company”, announces that is has added Spark 1.6 to its xPatterns platform.

— Lucidworks announces release of Lucidworks View, a component that simplifies development of applications on Solr and Spark.

— DataRPM, “Cognitive Data Science” company with very little money announces partnership with Tamr, a data integration company with lots of money.

Spark Summit Europe Roundup

The 2015 Spark Summit Europe met in Amsterdam October 27-29.  Here is a roundup of the presentations, organized by subject areas.   I’ve omitted a few less interesting presentations, including some advertorials from sponsors.

State of Spark

— In his keynoter, Matei Zaharia recaps findings from Databricks’ Spark user survey, notes growth in summit attendance, meetup membership and contributor headcount.  (Video here). Enhancements expected for Spark 1.6:

  • Dataset API
  • DataFrame integration for GraphX, Streaming
  • Project Tungsten: faster in-memory caching, SSD storage, improved code generation
  • Additional data sources for Streaming

— Databricks co-founder Reynold Xin recaps the last twelve months of Spark development.  New user-facing developments in the past twelve months include:

  • DataFrames
  • Data source API
  • R binding and machine learning pipelines

Back-end developments include:

  • Project Tungsten
  • Sort-based shuffle
  • Netty-based network

Of these, Xin covers DataFrames and Project Tungsten in some detail.  Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work.  Video here.

Getting Into Production

— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips to avoid them.  Key issues: moving beyond Python performance; using Spark with R; network and CPU-bound workloads.  Video here.

— Tuplejump’s Evan Chan summarizes Spark deployment options and explains how to productionize Spark, with special attention to the Spark Job Server.  Video here.

— Spark committer and Databricks engineer Andrew Or explains how to use the Spark UI to visualize and debug performance issues.  Video here.

— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization.  They tout Sentry, Cloudera’s preferred security platform.  Video here.

Spark for the Enterprise

— Revisting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality.  He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit.  His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark.  Video here.

— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others).  Video here.  Key insights:

  • Prediction is the most popular use case
  • Hive is most frequently co-installed, followed by HBase, Impala and Solr.
  • Customers want security and performance comparable to leading relational databases combined with simplicity.

Data Sources and File Systems

— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API.  (Video unavailable).

— Tachyon Nexus’ Gene Pang offers an excellent overview of Tachyon’s memory-centric storage architecture and how to use Spark with Tachyon.  Video here.

Spark SQL and DataFrames

— Michael Armbrust, lead developer for Spark SQL, explains DataFrames.  Good intro for those unfamiliar with the feature.  Video here.

— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark.  More detail about the use case and solution here.  The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata.  Video here.

— Informatica’s Kiran Lonikar summarizes a proposal to use GPUs to support columnar data frames.  Video here.

— Ema Orhian of Atigeo describes jaws, a restful data warehousing framework built on Spark SQL with Mesos and Tachyon support.  Video here.

Spark Streaming

— Helena Edelson, VP of Product Engineering at Tuplejump, offers a comprehensive overview of streaming analytics with Spark, Kafka, Cassandra and Akka.  Video here.

— Francois Garillot of Typesafe and Gerard Maas of virdata explain and demo Spark Streaming.    Video here.

— Iulian Dragos and Luc Bourlier explain how to leverage Mesos for Spark Streaming applications.  Video here.

Data Science and Machine Learning

— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project.  Video here.

— Alexander Ulanov, Senior Research Scientist at Hewlett-Packard Labs, describes his work with Deep Learning, building on MLLib’s multilayer perceptron capability.  Video here.

— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR.  He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages.  The SparkR roadmap, he notes, includes expanded MLLib functionality; UDF support; and a complete DataFrame API.  Finally, he demos SparkR and explains how to get started.  Video here.

— MLlib committer Joseph Bradley explains how to combine the strengths R, scikit-learn and MLlib.  Noting the strengths of R and scikit-learn libraries, he addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment?   Bradley demonstrates how to do this with Spark, using sentiment analysis as an example.  Video here.

— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLLib, Akka and Cassandra.  He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.

— Bitly’s Sarah Guido explains topic modeling, using Spark MLLib’s Latent Dirchlet Allocation.  Video here.

— Casey Stella describes using word2vec in MLLib to extract features from medical records for a Kaggle competition.  Video here.

— Piotr Dendek and Mateusz Fedoryszak of the University of Warsaw explain Random Ferns, a bagged form of Naive Bayes, for which they have developed a Spark package. Video here.

GeoSpatial Analytics

— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine.  Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL.  Video here.

Use Cases and Applications

— Ion Stoica summarizes Databricks’ experience working with hundreds of companies, distills to two generic Spark use cases:  (1) the “Just-in-Time Data Warehouse”, bypassing IT bottlenecks inherent in conventional DW; (2) the unified compute engine, combining multiple frameworks in a single platform.  Video here.

— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a use case real-time analytics.  SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration.  Video here.

— Yahoo’s Ayman Farahat describes a collaborative filtering application built on Spark that generates 26 trillion recommendations.  Training time: 52 minutes; prediction time: 8 minutes.  Video here.

— Sujit Pal explains how Elsevier uses Spark together with Solr, OpenNLP to annotate documents at scale.  Elsevier has donated the application, called SoDA, back to open source.  Video here.

— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure.  Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including pebble, Android, IOS, play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS.   With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior.  Video here.

— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes kafka, Secor, Swift, Parquet and elasticsearch for data collection; Spark SQL and MLLib for pattern learning; and a complex event processing engine for application in real time.  Video here.

Big Analytics Roundup (October 12, 2015)

Dell and Silver Lake Partners announce plans to buy EMC for $67 billion, a transaction that is a big deal in the tech world, and mildly interesting for analytics.  Dell acquired StatSoft in 2014,but nothing before or since suggests that Dell knows how to position and sell analytics.  StatSoft is lost inside Dell, and will be even more lost inside Dell/EMC.

EMC acquired Greenplum in 2010; at the time, GP was a credible competitor to Netezza, Aster and Vertica.  It turns out, however, that EMC’s superstar sales reps, accustomed to pushing storage boxes, struggled to sell analytic appliances.  Moreover, with the leading data warehouse appliances vertically integrated with hardware vendors, Greenplum was out there in the middle of nowhere peddling an appliance that isn’t really an appliance.

EMC shifted the Greenplum assets to its Pivotal Software unit, which subsequently open sourced the software it could not sell and exited the Hadoop distribution business under the ODP fig leaf.  Alpine Data Labs, which used to be tied to Greenplum like bears to honey, figured out a year ago that it could not depend on Greenplum for growth, and has diversified its platform support.

What’s left of Pivotal Software is a consulting business, which is fine — all of the big tech companies have consulting arms.  But I doubt that the software assets — Greenplum, Hawq and MADLib — have legs.

In other news, the Apache Software Foundation announces three interesting software releases:

  • Apache AccumuloRelease 1.6.4, a maintenance release.
  • Apache Ignite: Release 1.4.0, a feature release with SSL and log4j2 support, faster JDBC driver implementation and more.
  • Apache Kafka: Release 0.8.2.2, a maintenance release.

Spark

On the MapR blog, Jim Scott takes a “Spark is a fighter jet” metaphor and flies it until the wings fall off.

Spark Performance

Dave Ramel summarizes a paper he thinks is too long for you to read.  That paper, here, written by scientists affiliated with IBM and several universities, reports on detailed performance tests for MapReduce and Spark across four different workloads.  As I noted in a separate blog post, Ramel’s comment that the paper “calls into question” Spark’s record-setting performance on GraySort is wrong.

Spark Appliances

Ordinarily I don’t link sponsored content, but this article from Numascale is interesting.  Numascale, a Norwegian company, offers analytic appliances with lots of memory; there’s an R appliance, a Spark appliance and a database appliance with MonetDB.

Spark on Amazon EMR

On Slideshare, Amazon’s Jonathan Fritz and Manjeet Chayel summarize best practices for data science with Spark on EMR.  The presentation includes an overview of Spark DataFrames, a guide to running Spark on Amazon EMR, customer use cases, tips for optimizing performance and a plug for Zeppelin notebooks.

Use Cases

In Datanami, Alex Woodie describes how Uber uses Spark and Hadoop.

Stitch Fix offers personalized style recommendations to its customers.  Jas Khela describes how the startup uses Spark.   (h/t Hadoop Weekly)

SQL/OLAP/BI

Apache Drill

MapR’s Neeraja Rentachintala, Director of Product Management, rethinks SQL for Big Data.  Without a trace of irony, he explains how to bring SQL to NoSQL datastores.

Apache Hawq

On the Pivotal Big Data blog, Gavin Sherry touts Apache Hawq and Apache MADLib.  Hawq is a SQL engine that federates queries across Greenplum Database and Hadoop; MADLib is a machine learning library.   MADLib was always open source; Hawq, on the other hand, is a product Pivotal tried to sell but failed to do so.  In Datanami, George Leopold reports.

In CIO Today, Jennifer LeClaire speculates that Pivotal is “taking on” Oracle’s traditional database business with this move, which is a colossal pile of horse manure.

At Apache Big Data Europe, Caleb Welton explains Hawq’s architecture in a deep dive.  The endorsement from GE;s Jeffrey Immelt is a bit rich considering GE’s ownership stake in Pivotal, but the rest of the deck is solid.

Apache Phoenix

At Apache Big Data Europe, Nick Dimiduk delivers an overview of Phoenix, a relational database layer for HBase.  Phoenix includes a query engine that transforms SQL into native HBase API calls, a metadata repository and a JDBC driver.  SQL support is broad enough to run TPC benchmark queries.  Dimiduk also introduces Apache Calcite, a Query parser, compiler and planner framework currently in incubation.

Data Blending

On Forbes, Adrian Bridgewater touts the data blending capabilities of ClearStory Data and Alteryx without explaining why data blending is a thing.

Presto

On the AWS Big Data Blog, Songzhi Liu explains how to use Presto and Airpal on EMR.  Airpal is a web-based query tool developed by Airbnb that runs on top of Presto.

Machine Learning

Apache MADLib

MADLib is an open source project for machine learning in SQL.  Developed by people affiliated with Greenplum, MADLib has always been an open source project, but is now part of the Apache community.  Machine learning functionality is quite rich.  Currently, MADLib supports PostgreSQL, Greenplum database and Apache Hawq.  In theory, the software should be able to run in any SQL engine that supports UDFs; since Oracle, IBM and Teradata all have their own machine learning stories, I doubt that we will see MADLib running on those platforms. (h/t Hadoop Weekly)

Apache Spark (SparkR)

On the Databricks blog, Eric Liang and Xiangrui Meng review additions to the R interface in Spark 1.5, including support for Generalized Linear Models.

Apache Spark (MLLib)

On the Cloudera blog, Jose Cambronero explains what he did this summer, which included running K-S tests in Spark.

Apache Zeppelin

At Apache Big Data Europe, Datastax’ Duy Hai Doan explains why you should care about Zeppelin’s web-based notebook for interactive analytics.

H2O and Spark (Sparkling Water)

In a guest post on the Cloudera blog, Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai explain how to create an integrated machine learning pipeline using Spark MLLib, H2O and Sparkling Water, H2O’s interface with Spark.

How Yahoo Does Deep Learning on Spark

Cyprien Noel, Jun Shi, Andy Feng and the Yahoo Big ML Team explain how Yahoo does Deep Learning with Caffe on Spark.  Yahoo adds GPU nodes to its Hadoop clusters; each GPU node has 10X the processing power of a commodity Hadoop node.  The GPU nodes connect to the rest of the cluster through Ethernet, while Infiniband provides high-speed connectivity among the GPUs.

Screen Shot 2015-10-06 at 10.32.39 AM

Caffe is an open source Deep Learning framework developed by the Berkeley Vision and Learning Center (BVLC).  In Yahoo’s implementation, Spark assembles the data from HDFS, launches multiple Caffe learners running on the GPU nodes, then saves the resulting model back to HDFS. (h/t Hadoop Weekly)

Streaming Analytics

Apache Flink

On the MapR blog, Henry Saputra recaps an overview of Flink’s stream and graph processing from a recent Meetup.

Apache Kafka

Cloudera’s Gwen Shapira presents Kafka “worst practices”: true stories that happened to Kafka clusters. (h/t Hadoop Weekly)

Apache Spark Streaming

On the MapR blog, Jim Scott offers a guide to Spark Streaming.