Spark 2.0 Released

The Apache Spark team announces the production release of Spark 2.0.0.  Release notes are here. Read below for details of the new features, together with explanations culled from Spark Summit and elsewhere.

Measured by the number of contributors, Apache Spark remains the most active open source project in the Big Data ecosystem.

The Spark team guarantees API stability for all production releases in the Spark 2.X line.

Highlights

Spark Summit: Matei Zaharia summarizes highlights of the release. Slides here.

— Webinar: Reynold Xin and Jules S. Damji introduce you to Spark 2.0.

— Reynold Xin explains technical details of Spark 2.0.

SQL Processing

Key Changes

New and updated APIs:

  • In Scala and Java, the DataFrame and Dataset APIs are unified.
  • In Python and R, DataFrame remains the main programming interface, because these languages lack compile-time type safety.
  • For the DataFrame API, SparkSession replaces SQLContext and HiveContext.
  • Enhancements to the Accumulator and Aggregator APIs.

Spark 2.0 supports SQL:2003, and runs all 99 TPC-DS queries:

  • Native SQL parser supports ANSI SQL and HiveQL.
  • Native DDL command implementations.
  • Subquery support.
  • View canonicalization support.

Additional new features:

  • Native CSV support
  • Off-heap memory management for caching and runtime.
  • Hive-style bucketing.
  • Approximate summary statistics.

Performance enhancements:

  • Speedups of 2X-10X for common SQL and DataFrame operators.
  • Improved performance with Parquet and ORC.
  • Improvements to Catalyst query optimizer for common workloads.
  • Improved performance for window functions.
  • Automatic file coalescing for native data sources.

Explainers

Spark Summit: Andrew Or explains memory management in Spark 2.0+. Slides here.

Spark Summit: Databricks’ Michael Armbrust explains structured analysis in Spark: DataFrames, Datasets, and Streaming. Slides here.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— On KDnuggets, Paige Roberts explains Project Tungsten.

Sameer Agarwal, Davies Liu, and Reynold Xin dive deeply into Spark 2.0’s second generation Tungsten engine. This paper inspired Tungsten’s design.

Spark Summit: Yin Huai dives deeply into Catalyst, the Spark optimizer. Slides here.

— On the Databricks blog, Davies Liu and Herman van Hövell explain SQL subqueries in Spark 2.0.

Spark Summit: AMPLab’s Ankur Dave explains GraphFrames for graph queries in Spark SQL. Slides here.

Spark Streaming

Key Changes

Spark 2.0 includes an experimental release of Structured Streaming.

Explainers

Spark Summit: Tathagata Das explains Structured Streaming. Slides here.

— In an O’Reilly podcast, Ben Lorica asks Michael Armbrust about Structured Streaming.

— In InfoWorld, Ian Pointer explains Structured Streaming’s significance.

Machine Learning

Key Changes

The DataFrame-based API (previously named Spark ML) is now the primary API for machine learning in Spark; the RDD-based API is now in maintenance mode.

ML persistence is a key new feature, enabling the user to save and load ML models and pipelines in Scala, Java, Python, and R.

Additional techniques supported vary by API:

  • DataFrame-based API: Bisecting k-means clustering, Gaussian Mixture Model (GMM), MaxAbsScaler feature transformer.
  • PySpark: LDA, GMM, generalized linear regression.
  • SparkR: Naïve Bayes, k-means clustering, and survival regression, plus new families and link functions for GLM.

Explainers

Spark Summit: Joseph Bradley previews machine learning in Spark 2.0. Slides here.

— On the Databricks blog, Joseph Bradley explains model persistence in Spark 2.0.

— Tim Hunter, Hossein Falaki, and Joseph Bradley explain approximate algorithms.

SparkR

Key Changes

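SparkR now includes three functions for running user-defined functions: dapply, gapply and spark.lapply. The first two apply a function to each partition or each group of a DataFrame; the last runs a function over a list of elements, which supports hyper-parameter tuning.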
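The three apply patterns can be sketched in plain Python. This is only an illustration of the semantics, not the SparkR API: the helper names `apply_per_partition`, `apply_per_group` and `apply_over_list` are hypothetical, and the data is made up for the example.

```python
# Plain-Python sketch of the three apply patterns; not the SparkR API.
from collections import defaultdict


def apply_per_partition(partitions, func):
    """dapply-like: run func once on each partition (a list of rows)."""
    return [func(part) for part in partitions]


def apply_per_group(rows, key, func):
    """gapply-like: group rows by key, then run func once per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: func(group) for k, group in groups.items()}


def apply_over_list(items, func):
    """lapply-like: run func once per list element, e.g. per parameter value."""
    return [func(item) for item in items]


rows = [
    {"dept": "a", "x": 1.0},
    {"dept": "a", "x": 3.0},
    {"dept": "b", "x": 10.0},
]
partitions = [rows[:2], rows[2:]]

# Row count per partition.
partition_sizes = apply_per_partition(partitions, len)

# Mean of x per department.
group_means = apply_per_group(
    rows, "dept", lambda g: sum(r["x"] for r in g) / len(g)
)

# Hypothetical tuning loop: score each candidate constant by squared error
# against x, then keep the candidate with the lowest loss.
candidates = [1.0, 4.0, 8.0]
losses = apply_over_list(
    candidates, lambda c: sum((r["x"] - c) ** 2 for r in rows)
)
best = candidates[losses.index(min(losses))]
```

In a distributed setting each call to `func` would run on a worker; the list-based variant suits tuning because every candidate setting can be scored independently.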

As noted above, the SparkR API supports additional machine learning techniques and pipeline persistence. The API also supports more DataFrame functionality, including SparkSession, window functions, plus read/write support for JDBC and CSV.

Explainers

Spark Summit: Xiangrui Meng explains the latest developments in SparkR. Slides here.

— Live webinar: Hossein Falaki and Denny Lee demonstrate exploratory analysis with Spark and R.

— UseR 2016: Hossein Falaki and Shivaram Venkataraman deliver a tutorial on SparkR.

Spark 1.5 Released

On September 9, the Spark team announced availability of Release 1.5.  (Release notes here.)  230 developers contributed more than 1,400 commits, making this the largest release to date.  Spark continues to expand its contributor base, the best measure of health for an open source project.


On the Databricks blog, Reynold Xin and Patrick Wendell summarize the key new bits.  Some highlights:

  • Project Tungsten, a set of major changes to Spark’s internal architecture, is now on by default.  Spark 1.5 includes binary processing and a new code generation framework, with more than 100 built-in functions for common tasks.
  • Other performance enhancements include improved Parquet support (with predicate push-down and a faster metadata lookup path), and improved joins.
  • Usability enhancements include visualization of the SQL and DataFrame query plans in the web UI; the ability to connect to multiple versions of Hive metastores and the ability to read several Parquet variants.
  • Spark Streaming adds stability features, backpressure support, load balancing and several Python APIs.
  • The R interface is expanded to include Generalized Linear Models.
  • New machine learning features include eight new transformers, three new estimators (naive Bayes, k-means and isotonic regression) plus three new algorithms (multilayer perceptron classifier, PrefixSpan for sequential pattern mining and FP-Growth for association rule learning).
  • Enhancements to existing algorithms include improvements to LDA, decision tree and ensemble features, an improved Pregel API for GraphX plus an ability to distribute matrix inversions for Gaussian Mixture Models (GMM).
  • Other new machine learning features include model summaries for linear and logistic regression, a splitting tool to define train and validation samples and a multiclass classification evaluator.
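The code generation idea in the first bullet can be illustrated with a toy sketch. The Python below is not Spark code and does not reflect how Tungsten is actually implemented; the tuple-based expression format and the names `interpret`, `to_source` and `generate` are hypothetical. It only contrasts walking an expression tree for every row with compiling the tree once into a specialized function.

```python
# Toy illustration of expression code generation; not Spark's implementation.
# An expression is a nested tuple: ("col", name), ("lit", value),
# or (op, left, right) with op in {"+", "*", ">"}.
import operator

OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def interpret(expr, row):
    """Evaluate expr against one row by walking the tree every time."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the tree into a Python source fragment."""
    kind = expr[0]
    if kind == "col":
        return "row[%r]" % expr[1]
    if kind == "lit":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError("unsupported operator: %r" % kind)
    return "(%s %s %s)" % (to_source(expr[1]), kind, to_source(expr[2]))


def generate(expr):
    """Compile the tree once into a flat function of a row."""
    source = "lambda row: " + to_source(expr)
    return eval(compile(source, "<generated>", "eval"))


# price * quantity > 100
expr = (">", ("*", ("col", "price"), ("col", "quantity")), ("lit", 100))
rows = [{"price": 30, "quantity": 4}, {"price": 5, "quantity": 3}]

compiled = generate(expr)
interpreted_results = [interpret(expr, r) for r in rows]
generated_results = [compiled(r) for r in rows]
```

Both paths return the same answers; the generated function simply avoids the per-row tree traversal and dispatch, which is the kind of CPU overhead that code generation is meant to remove.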

GraphX development has flatlined since the component graduated from Alpha in Spark 1.2.

Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks all participated in release testing.  Note that IBM, for all its marketing hoopla, contributes little or nothing to the project.

Big Analytics Roundup (August 31, 2015)

Top stories for the penultimate week of summer: an excellent SQL-on-Hadoop benchmark; a couple of stories about Gelly, Flink’s graph engine; Apache Ignite goes top-level; a preview of Spark 1.5; and new stuff from RStudio.

Also, on Slideshare, evil mad scientist Paco Nathan presents on “Uber for Education.”

SQL on Hadoop

I missed this story in June, but better late than never.  The folks at Allegro.tech, a Warsaw-based collaborative, published results of an excellent benchmark of SQL-on-Hadoop technologies.  Scope of the analysis included Hive on MapReduce (the “control”), Hive on Tez, Presto, Impala, Drill and Spark SQL.  (The authors note that they wanted to evaluate Hive on Spark, but could not make it work.)

The Allegro team first evaluated Kerberos support, YARN deployment and query fault tolerance, the available UI, JDBC support, UDF and view support as well as support for each of the CSV, JSON, Avro and Parquet formats.  For benchmarking, they used 11 HiveQL queries testing a mix of typical analytic tasks.

Some key findings:

  • Hive on Tez: ran all queries with stable and satisfactory performance
  • Spark SQL: better than average performance overall, but could not run two queries
  • Presto: convenient to use, but performance was disappointing
  • Impala: fastest overall, but could not run one of the queries
  • Drill: very fast, but could not run three queries

Apache Flink/Data Artisans

On Slideshare, Vasia Kalavri presents an overview of Gelly, Flink’s graph engine.  More about Gelly here.

Apache Ignite/GridGain

The Apache Software Foundation promotes Ignite to top-level project status.  SD Times reports.  Ignite is a high-performance integrated and distributed in-memory platform.  Ignite is the open source version of GridGain’s commercial product.

Apache Lens

ASF also promotes Lens to top-level status.  Apache Lens is a “Unified Analytics Platform”, whatever that is.  (h/t Hadoop Weekly)

Apache Spark/Databricks

Patrick Wendell of Databricks presented a preview of Spark 1.5 last Thursday.  Spark 1.5 will be available in mid-September (exact timing depends on the Apache voting process).  Developers from more than 50 companies contributed to the build.  A preview is available in Databricks now.  Key enhancements:

  • Execution concepts will be exposed: tracking memory usage, visualizing DataFrame execution tree
  • Project Tungsten will be on by default: binary processing for memory management, code generation for CPU efficiency
  • Performance optimizations in SQL/DataFrames: Metadata discovery, predicate pushdown in Parquet, outer joins and window functions
  • First class UDAF support
  • Improved interoperability with Hive
  • Read Parquet files encoded by Hive, Impala, Pig, Avro, Thrift, Spark SQL
  • Additional Python interfaces for Spark Streaming
  • R bindings for linear models
  • Python bindings for Power Iteration Clustering
  • New algorithms and transforms for ML Pipelines

There will also be some new packages available concurrently with the 1.5 release, including support for AWS Redshift, Magellan support for spatial analytics and a convex solver package.

On Datanami, George Leopold covers the story.

Alex Woodie interviews some Spark users and discovers that they often use it together with Hadoop.

Jessica Twentyman notes that Spark looks set to replace MapReduce, and inquires into the pace, scope and scale of replacement.  She finds a lot of smart people who are optimistic and a few who urge caution, citing Spark’s immaturity.

Darryl Taft explains how Spark transforms Big Data processing and development.  Spoiler: it’s faster.

In readwrite, Peter Schlampp provides six reasons that Apache Spark isn’t flickering out, thereby answering a question nobody is asking.  For the record, his reasons are: advanced analytics, simplification, support for multiple languages, faster results, Hadoop distribution agnosticism and high-growth adoption.

On the Cloudera blog, Jeff Palmucci of TripAdvisor describes how his team uses Spark.

Google Cloud

Google announces a new release of BigQuery with UDF support.

H2O.ai

On HomeAI, Arno Candel presents a Deep Learning Webinar.

RStudio

RStudio adds a new starter plan for shinyapps.io, a cloud service for Shiny apps.  Roger Oberg reports.

Spark Summit 2015: Preliminary Report

So I guess Spark really is enterprise ready.  Nick Heudecker, call your office.

There are several key themes coming from the Summit:

Spark Continues to Mature

Spark and its contributors deserve a round of applause.  Some key measures of growth since the 2014 Summit:

  • Contributor headcount increased from 255 to 730
  • Committed lines of code increased from 175K to 400K
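As a quick worked calculation of what those two figures imply, the sketch below computes the growth multiple and percentage increase for each. The helper name `growth` is hypothetical; the inputs are the numbers quoted above.

```python
# Growth figures quoted above, 2014 Summit versus 2015 Summit.
def growth(before, after):
    """Return (multiple, percent increase) from before to after."""
    multiple = after / before
    return multiple, (multiple - 1) * 100


contributors = growth(255, 730)
lines_of_code = growth(175_000, 400_000)

print("contributors: %.2fx (+%.0f%%)" % contributors)
print("lines of code: %.2fx (+%.0f%%)" % lines_of_code)
```

Contributor headcount grew to roughly 2.9 times its prior level, and committed code to roughly 2.3 times.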

There is increasing evidence of Spark’s scalability:

  • Largest cluster: 8,000 nodes
  • Largest job: 1 petabyte
  • Top streaming intake: 1TB/hour
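To put the streaming figure in more familiar units, here is a small conversion. It assumes decimal units (1 TB = 10^12 bytes) and treats the quoted intake as a sustained rate; neither assumption is stated in the Summit figures.

```python
# Convert the quoted 1 TB/hour intake into per-second rates.
# Assumes decimal units (1 TB = 10**12 bytes) and a sustained rate.
TB = 10 ** 12
SECONDS_PER_HOUR = 3600

bytes_per_second = 1 * TB / SECONDS_PER_HOUR
megabytes_per_second = bytes_per_second / 10 ** 6
gigabits_per_second = bytes_per_second * 8 / 10 ** 9

print("%.1f MB/s, %.2f Gbit/s" % (megabytes_per_second, gigabits_per_second))
```

Under those assumptions, 1TB/hour works out to a little under 280 MB per second, or somewhat more than 2 Gbit per second.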

Project Tungsten aims to make Spark faster and prepare for the next five years; the project has already accomplished significant performance improvements through better use of memory and CPU.

IBM and Spark

IBM drops the big one with its announcement.  Key bits from the announcement:

  • IBM will build Spark into the core of its analytic and commerce products, including IBM Watson Health Cloud
  • IBM will open source its machine learning library (System ML) and work with Databricks to port it to Spark.
  • IBM will offer Spark as a Cloud service on Bluemix.
  • IBM will commit 3,500 developers to Spark-related projects.
  • IBM (and its partners) will train more than a million people on Spark

I will post separately on this next week.

Spark is Enterprise-Ready

If IBM’s announcement is not sufficient to persuade skeptics, presentations from Adobe, Airbnb, Baidu, Capital One, CIA, NASA/JPL, NBC Universal, Netflix, Thomson Reuters, Toyota and many others demonstrate that Spark already supports enterprise-level workloads.

In one of the breakouts, Arsalan Tavakoli-Shiraji of Databricks presented results from his analysis of more than 150 production deployments of Spark.  As expected, organizations use Spark for BI and advanced analytics; the big surprise is that 60% use non-HDFS data sources.  These organizations use Spark for data consolidation on the fly, decoupling compute from storage, with unification taking place on the processing layer.

Databricks Cloud is GA

Enough said.

SparkR

Spark 1.4 includes R bindings, opening Spark to the large community of R users.  Out of the gate, the R interface enables the R user to leverage Spark DataFrames; the Spark team plans to extend the capability to include machine learning APIs in Spark 1.5.

Spark’s Expanding Ecosystem

Every major Hadoop distributor showed up this year, but there were no major announcements from the distributors (other than IBM’s bombshell).

In other developments:

  • Amazon Web Services announced availability of a new Spark on EMR service
  • Intel announced a new Streaming SQL project for Spark
  • Lucidworks showcased its Fusion product, with Spark embedded
  • Alteryx announced its plans to integrate with Spark in its Release 10

One interesting footnote — while there were a number of presentations about Tachyon last year, there were none this year.

These are just the key themes.  I’ll publish a more detailed story next week.

Spark is Too Big to Fail

Reacting to growing interest in Apache Spark, there is a developing contrarian meme:

  • David Ramel asks: are Spark and Hadoop friends or foes?
  • Jack Vaughan compares Spark to the PDP-11, dismisses it as “just processing.”
  • Doug Henschen praises Spark, pans Databricks
  • Nicole Laskowski complains that Spark Summit East “felt like a Databricks show.”
  • Andrew Oliver thinks Spark needs to grow up
  • Andrew Brust worries that vendors are ahead of customers on Spark
  • IBM’s James Kobielus characterizes Spark as “the shiny new thing”
  • Gartner’s Nick Heudecker asserts that Spark is “not enterprise ready”

Spark skepticism falls into three broad categories:

  • Hadoop Purism: Spark deviates from the MapReduce/HDFS framework, and some people aren’t happy about that
  • Backseat Driving: Some analysts argue that Spark is great but Databricks, the commercial venture behind Spark, should do X, Y or Z
  • FUD: Spark’s competitors — commercial and open source — plant “issues” and “concerns” about Spark with industry analysts

Let’s examine each in turn.

“Spark Competes With Hadoop”

Spark does not compete with Hadoop; it competes with MapReduce.  Hadoop is an ecosystem of projects; there are a few components included in all commercial distributions (e.g. Hive, Pig, HBase), but these aren’t used at every site.  The ability to mix and match components is a strength for Hadoop.

Some software, like Spark, can run co-located in a Hadoop cluster or on clustered machines outside of Hadoop.  This should not surprise anyone; clustering and distributed computing existed before Hadoop.  Why does it matter if a software component can run both ways?  Users and use cases will drive implementation, and if Spark works better with Cassandra than with HDFS, or if a Spark user does not need the other Hadoop bits, so be it.

While there are reports of organizations that have abandoned MapReduce, most organizations will use Spark together with MapReduce; if users are happy with existing MapReduce jobs, there is no need to rewrite them.  For new applications, however, some users will choose Spark over MapReduce for a variety of reasons: better runtime performance, more efficient programming, more built-in features or simply because it’s the latest thing.  Isn’t competition a wonderful thing?

Organizations using standalone instances of Spark likely never considered using MapReduce for the application in question.  For these use cases, Spark competes with SAS, Skytree, H2O, GraphLab or some other machine learning software.

Databricks Envy

Sniping at Databricks is equally unwarranted. (Note: I’m not on the payroll.)  There are only so many ways to build a viable open source business model.   Offering a commercial product with additional bits is one way to do so; that is how Cloudera and MapR operate.  Databricks offers a hosted service for Spark with a few extra bits; if you don’t like Databricks’ offering, you can implement on-premises yourself or get Spark as a service through Amazon Web Services, BlueData, Qubole or elsewhere.

And if you really must have a notebook for Spark, try Zeppelin.

Of course, it’s true that Hortonworks open sources everything.  HDP loses $3.76 for every dollar they sell.  They hope to make it up on volume.

Databricks contributes heavily to the open source Spark project, supporting developers whose sole job is to improve Spark.  Most importantly, Databricks provides leadership and release management, which inspires confidence that Spark will not turn into a muddled mess like Mahout.

The complaint that Spark Summit East “felt like a Databricks show” is odd — one rarely hears complaints that Oracle World “feels like an Oracle show.”  There were thirty-nine presentations on the agenda at Spark Summit East, and one — Ion Stoica’s keynoter — highlighted Databricks Cloud.   In contrast, sponsored sessions accounted for a third of the sessions at the 2015 Strata + Hadoop World in Santa Clara.

“Spark Is Not Enterprise-Ready”

Some of the criticism is silly.   Andrew Oliver is shocked to discover that Release 1.0 of Databricks Cloud’s notebook, currently still in beta release, isn’t as slick as Tableau.  Also, a process he was watching timed out.  But wait!  That might be due to slow hotel wi-fi…

Meanwhile, SecurityTracker reports a major security flaw in IBM’s BigSQL.

Is Spark “enterprise ready?”  The same question could be asked about Hadoop, and conservative enterprises will answer “no” in both cases.  There is no single threshold that determines when a piece of software is “enterprise-ready”.  Use cases matter; the standard for software that will run your ATMs is not the same as the standard for software to be used for genomics research.

According to Gartner’s Heudecker, “actual adopters are mid- and late-stage startups such as Spark pureplay DataBricks, ClearData Story and Paxata, which uses Spark for data preparation. Other companies primarily use Spark to power dashboards.”  Interesting to hear Gartner dismiss the dashboard market; but enterprises are currently using Spark for more than dashboards.  A top global bank uses Spark today for Basel reporting and stress testing; if you’re not familiar with stress testing, suffice it to say that a bank that gets this application wrong is in a heap of trouble.

It’s true that vendors are ahead of customers on Spark.  This is hardly out of the ordinary with new technology; one could have said the same thing about Hive in 2010.  Vendors are always ahead of customers; it’s their job.

Spark is Too Big to Fail 

What are the alternatives to Spark?  Gartner’s Heudecker correctly notes that Spark excels at iterative processing, where MapReduce performance is sandbagged by its need to persist after each pass through the data.  High-performance advanced analytics must run in memory; there are commercial products available from SAS and Skytree, but for open source distributed analytics there are few alternatives to Spark.  Flink and Tez lack Spark’s analytic libraries; Impala can support SQL but lacks capabilities for machine learning, streaming analytics and graph analytics.
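The cost pattern behind the iterative-processing point can be shown with a toy contrast. The sketch below is neither MapReduce nor Spark; the function names and the simple averaging step are hypothetical, chosen only to show an iterative job that round-trips its data through disk on every pass versus one that loads the data once and reuses it in memory.

```python
# Toy contrast between persisting data after every pass and keeping it in
# memory. This illustrates the cost pattern only; it is not MapReduce or Spark.
import json
import os
import tempfile


def one_pass(values, center):
    """One iteration: move the estimate halfway toward the data mean."""
    mean = sum(values) / len(values)
    return center + 0.5 * (mean - center)


def iterate_with_persistence(values, passes):
    """Write the data to disk and read it back around every pass."""
    center, disk_round_trips = 0.0, 0
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        for _ in range(passes):
            with open(path, "w") as f:
                json.dump(values, f)
            with open(path) as f:
                values = json.load(f)
            disk_round_trips += 1
            center = one_pass(values, center)
    finally:
        os.remove(path)
    return center, disk_round_trips


def iterate_in_memory(values, passes):
    """Load once, then reuse the in-memory data for every pass."""
    center = 0.0
    for _ in range(passes):
        center = one_pass(values, center)
    return center, 0


data = [2.0, 4.0, 6.0]
persisted = iterate_with_persistence(data, passes=10)
in_memory = iterate_in_memory(data, passes=10)
```

Both versions converge to the same estimate; the difference is that the first pays one disk round trip per pass, so the overhead grows with the number of iterations, which is exactly where machine learning workloads tend to spend their time.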

Whether or not Spark is fully buttoned down in Release 1.3 is irrelevant; at this point it is a settled matter that Spark is superior to MapReduce for advanced analytics applications.

I am not suggesting that Spark is free of bugs or issues.  Like every other commercial and open source software project, Spark has bugs; unlike some of the commercial products Gartner rates as “Leaders”, the Spark team is transparent about issues and fixes them quickly.   It’s also fair to say that this time next year Spark will have more features than it has today; the community of users and contributors will determine what features need to be added.

Unlike some other open source projects, Spark has strong leadership, a disciplined approach to development and an impressive release cadence.  People build software, and the people behind Spark have proven that they know what they are doing.

The list of Spark users is strong and growing.  I’ve attended every Spark Summit since the first one in 2013 and there is noticeable growth in the number and sophistication of the applications presented.  This is not hype; it is real progress by users who are accomplishing bigger and better things with Spark than they could have accomplished without it.

Spark has already achieved a level of commercial support that ensures it will live up to its promise.  Available in every commercial Hadoop distribution and with Datastax, endorsed by SAP and Oracle, it is inconceivable that these players will let Spark fail.  This is partly because reputations are at stake, and also because there are few other options for open source high-performance advanced analytics inside or outside of Hadoop.

Big Analytics Roundup (April 6, 2015)

Late posting today due to holiday travel.

In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.

The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barrons, a must-read.

Analytic Software

Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot.  I’ll write a more detailed summary later this week.  Quick takes:  Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.

Apache Drill

Apache Drill announces Release 0.8.

Apache Spark

Analysis

In opensource.com, Jen Wike Huger interviews key Spark contributor Reynold Xin.

Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud.  (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)

Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit.  Key takeaways: no, Matei is not a musician, and yes, he likes Nutella. 

Spark has clearly reached a point of inflection when skeptical analysis emerges.  Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and the challenge of making those applications work well in MapReduce.  In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.

  • Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting.  Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East.  Who knew that Hadoop devotees are so sensitive?
  • In DataMill, Nicole Laskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
  • In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11.  His point will be lost on most readers.
  • Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.”   Note to Andrew: you can download the software here.

Spark Core

Matei Zaharia celebrates Spark’s fifth birthday with a brief history.

On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.

Spark Streaming

On the Databricks blog, Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kafka API available in Spark 1.3.

Databricks

Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud.  Case study available here.

Hadoop Ecosystem

In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity.  Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions).  Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS.  Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?

IBM

IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0.  BigInsights includes the usual Hadoop bits, plus:

  • BigSQL, a federation engine for SQL across relational databases and Hadoop
  • BigSheets, a Datameer-like spreadsheet-on-Hadoop tool
  • SystemML, a home-grown machine learning library that runs in MapReduce
  • Text analytics capability
  • Big R, an interface that can push embarrassingly parallel R processing into Hadoop

Streaming and Real-Time Processing

On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.

Big Analytics Roundup (March 2, 2015)

Here is a roundup of some recent Big Analytics news and analysis.

General

  • SiliconAngle covers the Big Data money trail.

Apache Spark

  • Curt Monash writes about Databricks and Spark on his DBMS2 blog.
  • On the Databricks blog, Dave Wang summarizes Spark highlights from Strata + Hadoop World.
  • In this post, Hammer Lab describes how to monitor Spark with Graphite and Grafana.
  • Cloudera announces Hive on Spark beta.
  • InfoWorld covers Spark’s planned support for R in Release 1.3.
  • Qubole announces Spark as a Service.

Dato/GraphLab

  • Dato announces new version of GraphLab Create.

H2O

  • From Strata + Hadoop World, Prithvi Prabhu talks about using H2O.
  • Also from Strata, here is Cliff Click’s presentation on H2O, Spark and Python.
  • On the H2O blog, Arno Candel publishes a performance tuning guide for H2O Deep Learning.