IBM and Spark (Updated)

Updated March 8, 2016.  After publishing this post, I met with several IBM executives at Spark Summit East, who confirmed the accuracy of the original post and provided additional detail, which I’ve included in this version.  Updates are in bold red italics.

IBM also provided the low-resolution image.

IBM has a good story to tell — one in ten contributors to Spark 1.6 was an IBM employee.  But IBM does not tell its story effectively.  Nobody cares that IBM invented the punch card and the floppy disk.  Nobody cares that IBM is so big it can’t tell a straight product story.  Bigness is IBM’s problem.

On June 15, 2015, IBM announced a major commitment to Spark.  As we approach Spark Summit East, I thought it would be fun to check back and see how IBM’s accomplishments compare with the goals stated back in June.

Before we start, I’d like to note that any contribution to Spark moves the project forward, and is a good thing.  Also, simply by endorsing Spark, IBM has changed the conversation.  In early 2015, some analysts and journalists claimed that Spark was overhyped and “not enterprise ready.”  We haven’t heard a peep from this crowd since IBM’s announcement.  For that alone, IBM should get some kind of prize.  🙂

In its announcement, IBM detailed six initiatives:

  • IBM will build Spark into the core of the company’s analytics and commerce platforms.
  • IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
  • IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
  • IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
  • IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Let’s see where things stand.

Spark in IBM Analytics and Commerce Platforms

IBM has an expansive definition of “analytics,” reporting $17.9 billion in business analytics revenue in 2015.  IDC, which tracks the market, credits IBM with $4.5 billion in business analytics software revenue in 2014.  The remaining $13.4 billion, it seems, is services and fluff, neither of which counts when the discussion is about “platforms.”

Of that $4.5 billion, the big dogs are DataStage (InfoSphere), DB2, Netezza (PureData System for Analytics), Cognos (IBM Business Analytics) and SPSS (IBM Predictive Analytics) — so this is where we should look when IBM says it is building Spark into its products.

Currently, Cloudant is the only IBM data source with a published Spark connector.  Want to access DB2 with Spark?  It’s a science project.  Of course, you can always use Spark’s generic JDBC connector if you’re patient, but the standard is a parallel high-speed connector, like those SAS has offered for years.  An IBM insider tells me that there is a project underway to build a Spark connector for Netezza, which will be a good thing when it’s available.
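For the patient, here is roughly what the generic JDBC route looks like: a minimal sketch against the Spark 1.x API, assuming the DB2 JDBC driver jar is on the classpath.  The host, database, credentials and table name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Db2ViaJdbc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("db2-jdbc"))
    val sqlContext = new SQLContext(sc)

    // Spark's generic JDBC data source. Reads are single-threaded unless
    // you also supply partitionColumn, lowerBound, upperBound and
    // numPartitions -- and even then it is no substitute for a native
    // parallel connector.
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:db2://db2host:50000/SAMPLE", // hypothetical server
      "driver"   -> "com.ibm.db2.jcc.DB2Driver",
      "dbtable"  -> "SALES.TRANSACTIONS",              // hypothetical table
      "user"     -> "dbuser",
      "password" -> "secret"
    )).load()

    df.printSchema()
    println(df.count())
    sc.stop()
  }
}
```

Even with the partitioning options set, the JDBC source just issues range queries against a single table; useful, but a far cry from the high-speed parallel transport SAS ships.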

Update: IBM has subsequently added the one-way single-threaded Netezza connector to Spark Packages.  It’s also available on GitHub and the IBM Developer site.

I would emphasize that a one-way single-threaded connector is useful exactly once: when you decommission your Netezza box and move the data elsewhere.  Netezza developed a native multi-threaded connector for SAS in a matter of months, so it’s not clear why it takes IBM so long to deliver something comparable for Spark.

Last October in Budapest, an IBM VP — one of 137 IBM Veeps of “Big Data” — claimed that InfoSphere DataStage supports Spark now.  A search of the documentation for InfoSphere 11.3, the most current version, produces this:

[Screenshot: search of InfoSphere 11.3 documentation for “Spark,” returning no results]

IBM appears to have discovered a new kind of product management where you build features into a product, then omit them from the documentation.

That was the approach taken for IBM SPSS Analytic Server Release 2.1, which was packaged up and shoved out the door so fast that the documentation folks forgot to mention the Spark pushdown.  That said, Release 2.1.0.1 is an improvement: all functions that Analytic Server can push down to MapReduce now push down to Spark, and IBM supports the product on Cloudera and MapR as well as Hortonworks and BigInsights.

It’s not clear, though, why IBM thinks that licensing Analytic Server as a separate product is a smart move.  Most of the value in analytics is at the top of the stack (e.g., SPSS or Cognos), where users can see results.  Spark pushdown is rapidly becoming table stakes for analytics software; the smarter move for IBM is to bundle Analytic Server for free into SPSS, Cognos and BigInsights to build value in those products.

It’s also curious that IBM continues to peddle Analytic Server while donating SystemML to open source.  Why not push down to Spark through SystemML?

So far as Spark is concerned, IBM is leaving SPSS Statistics users out in the cold unless they want to add Modeler and Analytic Server to the stack.  For these customers, Alteryx and RapidMiner look attractive.

A search for Spark in Cognos documentation yields a big fat zero, which explains why Gartner just tossed IBM from the Leaders quadrant in BI.

[Screenshot: search of Cognos documentation for “Spark”]

Update: at my request, an IBM executive shared a list of products that IBM says it has rebuilt around Spark to date.  I’m publishing it verbatim for reference, but note that the list includes:

  1. Double and triple counting (“Watson Content Analytics integrates with Dataworks which integrates with Spark as a Service”).
  2. Products that do not seem to exist (Spark connectors to DB2 and Informix appear in IBM documentation as generic JDBC connections).
  3. Aspirational products (Spark on Z/OS).
  4. Projects that are not products (Watson Discovery Advisor ran a POC with Spark).
  5. Capabilities that require little or no contribution from IBM (Spark runs under Platform Symphony EGO YARN Service). 

Other than that, it’s a good list.

  • IBM BigInsights  ( Version 4.0 included Spark 1.3.1, version 4.1 includes Spark 1.4.1 – GA’ed August 25th
  • EHAAS (BigInsights on Cloud)  – Includes Spark version 1.3.1 – GA’ed June  2015
  • Analytic for Hadoop – Includes Spark version 1.3.1 – Beta . Will be replaced by Pay-go and include Spark 1.4.1
  • Spark-as-a Service  – Beta in July . GA – Oct. Currently uses Spark version 1.3.1. Will move to Spark 1.4.1
  • Dataworks (Only Cloud) – Beta  – Integrated with Spark Service. Uses Spark Version 1.3.1 
  • SPSS Analytic Server and SPSS Modeler – SPSS Modeler will support Spark version 1.4.1. GA planned for end of Q3, 2015
  • Cloudant  – Cloudant includes Spark Connector. This product is already GA.
  • Omni Channel Pricing – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Dynamic Pricing –  Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Mark Down Optimization –   Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Nimbus ETL – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Journey Analytics – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA end of 2015
  • IBM Twitter CDE  On Cloud – Internal only
  • IBM Insights for Twitter Service – On Cloud , Externally Available
  • Internet of Things Real time Analytics on Cloud.  Integrating with Spark 1.3.1  – Open Beta
  • Platform Symphony EGO Service – Integrated with Spark Service for Resource Scheduling and Management. Can also be used with Spark bundled with IBM Open Platform.
  • DB2 Spark Connector
  • Netezza Spark Connector
  • Watson Content Analytics  -> Integrated with Dataworks which is integrating with Spark-as-a-Service
  • Watson Content Services – > Planning to use Spark for Data Ingestion and Enrichment. 
  • Spark on Zlinux  -> Spark enabled on zLinux
  • Spark on ZOS -> Will be available by end of year
  • No response
  • GPFS -GPFS – Spark, as part of BI, runs out of the box on GPFS. In our GPFS Ambari RPM package we changed the Spark service dependency from HDFS to GPFS and added the GPFS Hadoop connector jar to the classpath
  • Informix Connector
  • Watson Discovery Advisor  –  This is a product within Watson Health.  In a small POC that this team has done. they observed that using Scala and Spark , they can reduce their lines of code from 1000s to few hundred lines.” Looking to integrate Spark in early 2016.
  • Cognos – Team indicated that they will be able to submit SPARK SQL queries for getting the results for the data. They would connect to Spark using the JDBC/ODBC driver and then be able to execute Spark SQL queries to generate results for the report . This is planned for 2016.
  • Streams –   Spark MLLib Toolkit for InfoSphere Streams 
  • Watson Research team – Developed a Geospatial RDD on Spark.

Spark in IBM Watson Health Cloud

IBM insiders tell me that Watson relies heavily on Spark.  Okay.  Watson is a completely opaque product, so it’s impossible to verify whether IBM powers Watson with Spark or an army of trained crickets.

SystemML to Open Source

Though I was initially skeptical about SystemML, the more I learn about this software, the more excited I am about its potential.  Rather than simply building an interface to Spark’s native machine learning library (MLlib), I understand that IBM has completely rebuilt the algorithms.  That’s a good thing — some of the folks I know who have tried to use MLlib aren’t impressed.  Without getting into the details of the issues, suffice it to say that it’s good for Spark to have multiple initiatives building functional libraries on top of the Spark core.

Fred Reiss, chief architect at IBM’s Spark Technology Center, is scheduled to present on SystemML at Spark Summit East next week.

Spark in Bluemix

IBM introduced Spark-as-a-Service in Bluemix as a beta in July 2015, with general availability in October.

The service includes Spark, Jupyter notebooks and OpenStack Swift object storage.  It’s a bit lost in the jumble of services available in Bluemix, but the Catalogue has a handy search tool.
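If you’re wondering how a Spark job gets at that object storage, here is a minimal sketch using the stock hadoop-openstack Swift filesystem.  The property names follow the generic connector; the auth endpoint, credentials, container and file are all hypothetical, and the exact keys Bluemix expects may differ from these, so check the credentials your service instance generates.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SwiftRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("swift-read"))
    val hconf = sc.hadoopConfiguration

    // Requires the hadoop-openstack jar on the classpath.
    hconf.set("fs.swift.impl",
      "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")

    // "spark" is just the service name referenced in the swift:// URL below.
    hconf.set("fs.swift.service.spark.auth.url",
      "https://identity.example.com/v2.0/tokens")              // hypothetical
    hconf.set("fs.swift.service.spark.tenant", "myTenant")     // hypothetical
    hconf.set("fs.swift.service.spark.username", "myUser")     // hypothetical
    hconf.set("fs.swift.service.spark.password", "myPassword") // hypothetical

    // URL format: swift://<container>.<service-name>/<object path>
    val lines = sc.textFile("swift://notebooks.spark/events.log")
    println(lines.count())
    sc.stop()
  }
}
```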

As of this writing, Bluemix offers Spark 1.4.  Although that’s two dot releases behind, it is competitive with Qubole Data Service.  Databricks is still the best bet if you want the most up-to-date release.

Update: An IBM executive tells me that Bluemix now uses Spark 1.6.  In the meantime, however, IBM has removed the Spark release version from its Bluemix documentation.  

IBM People to Spark Projects

3,500 researchers and developers.  Wow!  That’s a lot of butts in seats.  Let’s break that down into four categories:

(1) IBM people who actively contribute to Spark.

(2) IBM developers building interfaces from IBM products to Spark.

(3) IBM developers building IBM products on top of Spark.

(4) IBM consultants building custom applications on top of Spark.

Note that of the four categories, only (1) actually moves the Spark project forward.  Of course, anyone who uses Spark has the potential to contribute feedback, but ultimately someone has to cut code.  While IBM tracks the Spark JIRAs to which it contributes, IBM executives could not answer a simple question: of the 248 people who contributed to Spark Release 1.6, how many work for IBM?

I suspect that most of those 3,500 researchers and developers are in categories (3) and (4).

Satheesh Bandaram from the IBM Spark Technology Center replies: 26 people from STC contributed to Spark 1.6, with about 80 code commits.

Additional IBM response: Since June 2015, when IBM announced the Spark Technology Center (STC), engineers in STC have actively contributed to Spark releases v1.4.x, v1.5.x and v1.6.0, as well as releases v1.6.1 and v2.0 (in progress).

As of today (March 2), IBM STC has contributed to over 237 JIRAs and counting.  About 50% are answers to major JIRAs reported in Apache Spark.

What’s in those 237 contributions?

  • 103 of 237 (43%) are deliverables in the Spark SQL area
  • 56 (23%) are in the MLlib module
  • 37 (16%) are in the PySpark module

These top three areas of focus from IBM STC make up 82% of the total contributions as of today.  The rest are in documentation, Spark Core, Streaming and other modules.

You can track progress on this live dashboard on GitHub: http://jiras.spark.tc/

Specific to Spark 1.6, IBM team members made over 80 commits, the majority of them from STC.  A total of 28 team members contributed to the release (25 of them from STC).  Each contributing engineer is a credited contributor in the release notes of Spark 1.6.

For Spark SQL, we contributed:

  • enhancements and fixes in the new Dataset API
  • enhancements and fixes in the DataFrame API and data types
  • UDF and SQL standard compliance, such as adding EXPLAIN and printSchema capability, and support for coalesce and repartition
  • support for the CHAR column data type
  • fixes for type extractor failures on complex data types
  • fixes for DataFrame bugs in saving long-column partitioned Parquet files, and handling of various nullability bugs and optimization issues
  • a fix for the limitation in the ORDER BY clause to comply with the standard
  • a number of UDF code fixes to complete stddev support

For machine learning, the STC team met with key influencers and stakeholders in the Spark community to jointly work on items on the roadmap.  Most of the roadmap items discussed went into 1.6.  The implementation of the LU decomposition algorithm is slated for the upcoming release.

In addition to helping implement the roadmap, here are some notable contributions:

  • We greatly improved PySpark’s distributed matrix algebra by enriching the matrix operations and fixing bugs.
  • We enhanced the Word2Vec algorithm.
  • We added optimized first- through fourth-order summary statistics for DataFrames (technically in Spark SQL, but related to machine learning).
  • We greatly enhanced the PySpark API by adding interfaces to Scala machine learning tools.
  • We made a performance enhancement to the Linear Data Generator, which is critical for unit testing in Spark ML.

The team also addressed major regressions in the DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added a JDBC dialect for Apache Derby.

In addition to the JIRA activities, IBM STC also added JDBC dialect support for DB2 and made the Spark Connector for Netezza v0.1.1 available to the public through Spark Packages and a developer blog on IBM’s external site.
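To make a few of those Spark SQL items concrete, here is a short sketch (mine, not IBM’s) exercising some of the capabilities the response mentions — printSchema, repartition, coalesce, EXPLAIN output and the new stddev aggregate — against the Spark 1.6 API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.stddev

object Spark16Features {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark16-features"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 4.0)))
      .toDF("key", "value")

    df.printSchema()                    // schema printing
    df.select(stddev($"value")).show()  // stddev aggregate, new in 1.6

    val wide = df.repartition(8)        // redistribute with a shuffle
    val narrow = wide.coalesce(1)       // shrink partitions without a shuffle
    narrow.explain()                    // physical plan, as EXPLAIN would show

    sc.stop()
  }
}
```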

Spark Training

Like the Million Man March, “training a million people” sounds like one of those PR-driven claims that nobody expects to take seriously, especially since it’s not time-boxed.

Anyway, the details:

  • AMPLab offers occasional training in the complete BDAS stack under the AMPCamp format.  IBM funds AMPLab, but it does not appear that AMPLab is doing anything now that it wasn’t already doing last June.
  • DataCamp does not offer Spark training.
  • MetiStream offers public and private Spark training with a defined curriculum and service offering.  The training program is certified by Databricks.
  • Galvanize does not offer Spark training.
  • Big Data University offers a two-part MOOC in Spark fundamentals.

The Big Data University courses are free, and four hours apiece, so a million enrollees is plausible, eventually at least.  Interestingly, MetiStream developed the second of the two BDU courses.  So the press release should read “MetiStream and IBM, but mostly MetiStream, will train a million….”

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but SAS will be forced to support it due to partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLlib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this one.
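For readers who haven’t used it, here is a minimal sketch of what micro-batching means in practice; the socket source, host and port are hypothetical stand-ins for a real stream:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")

    // Spark Streaming discretizes the stream into fixed-interval batches:
    // here, one small RDD per second. Records wait for the next batch
    // boundary, which is where micro-batch latency differs from the
    // record-at-a-time model in Storm and Flink.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```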

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLlib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera Solutions, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry, guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the Titanic sinking]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.  Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software is nothing special; there are plenty of open source alternatives, such as Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but is DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, which will cut headcount by half and milk the existing customer base.  There are good people at Teradata; I would advise them all to polish their resumes.