Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

Spark is overhyped! He declared. His evidence? This:

screen-shot-2017-02-09-at-2-58-05-pm

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which begs the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLLib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms.)

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGboost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

screen-shot-2017-02-13-at-12-30-38-pm
Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in AzureHD, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

Big Analytics Roundup (October 26, 2015)

Fourteen stories this week, beginning with an announcement from IBM.  This week, IBM celebrates 14 straight quarters of declining revenue at its IBM Insight conference, appropriately enough at the Mandalay Bay in Vegas, where the restaurants are overhyped and overpriced.

Meanwhile, the first Spark Summit Europe meets in Amsterdam, in the far more interesting setting of the Beurs van Berlage.  There will be a live stream on Wednesday and Thursday — details here.  Sadly, I can’t make this one — the first Spark Summit I’ve missed — but am looking forward to the live stream.

(1) IBM Announces Spark on Bluemix

At its IBM Insight beauty show, IBM announces availability of its Apache Spark cloud service.  Actually, IBM announced it back in July, but that was a public beta.   On ZDNet, Andrew Brust gushes, noting that IBM has DB2, Watson, Netezza, Cognos, TM1, SPSS, Informix and Cloudant in its portfolio.  He fails to note that of those products, exactly one — Cloudant — actually interfaces with Spark.

There were rumors that IBM would have an exciting announcement about Spark at this show, but if this is it — yawn.  Looking at IBM’s “Spark in the cloud” offering, I don’t see anything that sets it apart from other available offerings unless you have a Blue fetish.

Update: Rod Reicks of IBM writes to note that IBM’s new release of SPSS Analytics Server runs processes in Spark.  For the uninitiated, Analytics Server is a product you license from IBM that enables SPSS Modeler user to run selected operations in Hadoop.  Previous versions ran through MapReduce only.  Reicks claims that the latest version runs through Spark when available.

I say “claims” because there is no reference to this feature in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting.  So the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

You’d think that with IBM’s armies of people they could at least find someone to write documentation.

(2) Mahout Book FAIL

Packt announces a book on Clustering with Mahout with an entire chapter devoted to Canopy Clustering, which the Mahout team just deprecated.

(3) Concurrent Adds Spark Support

Concurrent announces Release 2.0 of Driven, its oddly-named performance management software, which now includes support for Apache Spark.

(4) Flink Founder Touts Streaming Analytics

At Big Data Spain, Data Artisans co-founder Kostas Tzoumas argues that streaming is the basis for all analytics, which is a bit over the top: as they say, if all you have is a hammer, the world looks like a nail.  Still, his deck is a nice intro to Flink, which has made some progress this year.

(5) AtScale Announces Release 3.0

AtScale, one of the more interesting startups in the BI space, delivers Release 3.0 of its OLAP-on Hadoop platform.  Rather than introducing a new user interface into the mix, AtScale makes it possible for BI users to work with Hadoop tables without jumping back and forth to programming tools.  The product currently supports Tableau, Excel, Qlik, Spotfire, MicroStrategy and JasperSoft, and runs on CDH, HDP or MapR with Impala, Spark SQL or Hive on Tez.  The new release includes enhanced role-based security, including Kerberos, Username/Password or LDAP.

(6) Neo: Graphs are Eating the World

Graph database leader Neo announces immediate availability of Neo4j 2.3, which includes what it calls “intelligent applications at scale” and Docker support.  Exactly what Neo means by “intelligence applications at scale” means is unclear, but if Neo is claiming that you no longer have to dump a graph into Spark to run a PageRank, I’ll believe it when I see it.

(7) New Notebook Sharing for Databricks 

Databricks announces new notebook sharing capabilities for its eponymous product.  On the Databricks blog, Denise Li and Dave Wang explain.

(8) Teradata: Blah, Blah, Blah, IoT, Blah, Blah Blah

At its annual user conference, Teradata announces that it’s heard about IoT.    Teradata also announces that it will make Aster available on Hadoop, which would have been interesting in 2012.  Aster, for the uninitiated, includes a SQL on MapReduce engine, which is rendered obsolete by fast SQL engines like Presto, which Teradata has just embraced.

(9) Flink Forward Redux

As I noted last week, the first Flink Forward conference met in Berlin two weeks ago.  William Benton records his impressions.

Presentations are here.  Some highlights:

  • Dongwon Kim benchmarks Flink against MR, MR on Tez and Spark.  Flink wins.
  • Kostas Tzoumas outlines the Flink development roadmap through Release 1.0.
  • Martin Junghanns explains graph analytics with Flink.
  • Anwar Rizal demonstrates streaming decision trees with Flink.

Henning Kropp offers resources for diving deeply into Flink.

(10) Pyramid Analytics Lands New Funding

Amsterdam-based BI startup Pyramid Analytics announces a $30 million “B” round to help it try to explain why we need more BI software.

(11) Harte Hanks Switches from CDH to MapR

John Leonard explains why Harte Hanks switched from Cloudera to MapR.  Most likely explanation: they were able to cut a cheaper deal with MapR.

(12) Audience Modeling with Spark

Guest posting on the Databricks blog, Eugene Zhulenev explains audience modeling with Spark ML pipelines.

(13) New Functions in Drill

On the MapR blog, Neeraja Rentachintala describes new capabilities in Drill Release 1.2, including SQL window functions.

(14) Integrating Spark and Redshift

“Redshift is where data goes to die.”  — Rob Ferguson, Spark Summit East

On the Databricks blog, Sameer Wadkar of Axiomine explains how to use the spark-redshift package, first introduced in March of this year and now in version 0.5.2.  So you can yank your data out of Redshift and do something with it. (h/t Hadoop Weekly)

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, Spark SQL graduates from Alpha, new algorithms in MLLib and Spark Streaming, a direct Kafka API for Spark Streaming, plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics With Sparkwrites on tuning Spark jobs, on the Cloudera Engineering blog
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process Terabyte scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on Git.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

H2O

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.   Ryft’s Christian Shrauder blogs about FGPA.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze highly dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab‘s Jiannen Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

RapidMiner

SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that AirBNB has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal is an application that “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

Zementis

Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)

Updated and bumped July 10, 2014.

For a powerpoint version on Slideshare, go here.

Introduction

Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop.  Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014.  According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.

Organizations seeking to implement advanced analytics in Hadoop face two key challenges.  First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.

A second key challenge is the plethora of analytic point solutions in Hadoop.  These include, among others, Mahout for machine learning; Giraph, and GraphLab for graph analytics; Storm and S4 for streaming; or HiveImpala and Stinger for interactive queries.  Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.

Spark directly addresses these challenges.  It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data.  This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.

Second, Spark offers an integrated framework for analytics, including:

A closely related project, Shark, supports fast queries in Hadoop.  Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project.  The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.

Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs.  RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs.  RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence.  They are designed to be fault tolerant, so that if an operation fails it can be reconstructed.

For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase).  Hadoop supports text files, SequenceFiles and any other Hadoop InputFormat.  Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.

Analytic Features

Spark’s machine learning library, MLLib, is rapidly growing.   In Release 1.0.0 (the latest release) it includes:

  • Linear regression
  • Logistic regression
  • k-means clustering
  • Support vector machines
  • Alternating least squares (for collaborative filtering)
  • Decision trees for classification and regression
  • Naive Bayes classifier
  • Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
  • Model evaluation functions
  • L-BFGS optimization primitive

Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization.  MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).

In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark.  Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark.  The Mahout team will maintain existing MapReduce projects.  There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.

Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework.  It enables users to interactively load, transform, and compute on massive graphs.  Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.

Spark Streaming offers an additional abstraction called discretized streams, or DStreams.  DStreams are a continuous sequence of RDDs representing a stream of data.  The user creates DStreams from live incoming data or by transforming other DStreams.  Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.

Currently, Spark supports programming interfaces for Scala, Java and Python;  MLLib algorithms support sparse feature vectors in all three languages.  For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014

There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0.  In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined.   In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.  So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.

Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release for GraphX, enhancements to MLLib and many other enhancements).  Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.

Distribution

Spark is now available in every major Hadoop distribution.  Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks.  (For more on Cloudera’s support for Spark, go here).  In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.

Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.

IBM’s commitment to Spark is unclear.  While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.

In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine.  Datastax will partner with Databricks on this project; availability expected summer 2014.

At the 2014 Spark Summit, SAP announced its support for Spark.  SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.

On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem.   The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.

At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.

Ecosystem

In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.

Databricks offers a certification program for Spark; participants currently include:

In May, Databricks and Concurrent Inc announced a strategic partnership.  Concurrent plans to add Spark support to its Cascading development environment for Hadoop.

Community

In December, the first Spark Summit attracted more than 450 participants from more than 180 companies.  Presentations covered a range of applications such as neuroscienceaudience expansionreal-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).

The 2014 Spark Summit was be held June 30 through July 2 in San Francisco.  The event sold out at more than a thousand participants.  For a summary, see this post.

There is a rapidly growing list of Spark Meetups, including:

Now available for pre-order on Amazon:

Finally, this series of videos provides some good basic knowledge about Spark.