IBM and Spark (Updated)

Updated March 8, 2016.  After publishing this post, I met with several IBM executives at Spark Summit East, who confirmed the accuracy of the original post and provided additional detail, which I’ve included in this version.  Updates are in bold red italics.

IBM also provided the low-resolution image.

IBM has a good story to tell — one out of ten contributors to Spark 1.6 were IBM employees.  But IBM does not tell its story effectively.  Nobody cares that IBM invented the punch card and the floppy disk.  Nobody cares that IBM is so big it can’t tell a straight product story.  Bigness is IBM’s problem.

On June 15, 2015, IBM announced a major commitment to Spark.  As we approach Spark Summit East, I thought it would be fun to check back and see how IBM’s accomplishments compare with the goals stated back in June.

Before we start, I’d like to note that any contribution to Spark moves the project forward, and is a good thing.  Also, simply by endorsing Spark, IBM has changed the conversation.  In early 2015, some analysts and journalists claimed that Spark was overhyped and “not enterprise ready.”  We haven’t heard a peep from this crowd since IBM’s announcement.  For that alone, IBM should get some kind of prize.  🙂

In its announcement, IBM detailed six initiatives:

  • IBM will build Spark into the core of the company’s analytics and commerce platforms.
  • IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
  • IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
  • IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
  • IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Let’s see where things stand.

Spark in IBM Analytics and Commerce Platforms

IBM has an expansive definition of “analytics”, reporting $17.9 billion in business analytics revenue in 2015.  IDC, which tracks the market, credits IBM with $4.5 billion in business analytics software revenue in 2014.  The remaining $13.4 billion, it seems, is services and fluff, neither of which counts when the discussion is “platforms.”

Of that $4.5 billion, the big dogs are DataStage (now InfoSphere), DB2, Netezza (now PureData System for Analytics), Cognos (now IBM Business Analytics) and SPSS (now IBM Predictive Analytics), so this is where we should look when IBM says it is building Spark into its products.

Currently, Cloudant is the only IBM data source with a published Spark connector.  Want to access DB2 with Spark?  It’s a science project.  Of course, you can always use the JDBC connector if you’re patient, but the standard is a parallel high-speed connector, like SAS has offered for years.  An IBM insider tells me that there is a project underway to build a Spark connector for Netezza, which will be a good thing when it’s available.
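For readers who want to see what the JDBC route involves, here is a minimal sketch in Python.  It only builds the option map that Spark's generic JDBC data source accepts; the host, port, database, table and column names are hypothetical placeholders, and the helper function is mine, not anything IBM ships.  Without the partitioning options Spark pulls the table through a single connection; with them, it splits the read into several range queries on a numeric column.  That helps, but it is still generic JDBC, not a native high-speed connector.

```python
# Sketch only: db2_jdbc_options is a hypothetical helper, and every host,
# database, table and column name below is a made-up placeholder.

def db2_jdbc_options(host, port, database, table,
                     partition_column=None, lower_bound=None,
                     upper_bound=None, num_partitions=None):
    """Build the option map for Spark's generic JDBC data source."""
    options = {
        "url": "jdbc:db2://{}:{}/{}".format(host, port, database),
        "driver": "com.ibm.db2.jcc.DB2Driver",
        "dbtable": table,
    }
    partitioning = (partition_column, lower_bound, upper_bound, num_partitions)
    if any(value is not None for value in partitioning):
        # Spark needs all four partitioning options together, or none of them.
        if any(value is None for value in partitioning):
            raise ValueError("partitioned reads need column, bounds and count")
        options.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),
        })
    return options


# Plain JDBC: one connection, one partition.
serial = db2_jdbc_options("db2.example.com", 50000, "SALES", "ORDERS")

# Partitioned JDBC: Spark issues one range query per partition.
parallel = db2_jdbc_options("db2.example.com", 50000, "SALES", "ORDERS",
                            partition_column="ORDER_ID",
                            lower_bound=1, upper_bound=1000000,
                            num_partitions=8)

# With a live Spark 1.x context and the DB2 driver jar on the classpath,
# the read itself would look something like:
#   df = sqlContext.read.format("jdbc").options(**parallel).load()
```

The bounds only control how the column range is sliced into queries; picking a sensible column and range is left to the user, which is part of why this counts as a science project.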

Update: IBM has subsequently added the one-way single-threaded Netezza connector to Spark Packages.  It’s also available on Git and the IBM Developer site.

I would emphasize that a one-way single-threaded connector is useful once, when you decommission your Netezza box and move the data elsewhere.  Netezza developed a native multi-threaded connector for SAS in a matter of months, so it’s not clear why it takes IBM so long to deliver something comparable for Spark.  

 

Last October in Budapest, an IBM VP — one of 137 IBM Veeps of “Big Data”  — claimed that DataStage InfoSphere supports Spark now.  Searching documentation for InfoSphere 11.3, the most current version, produces this:

[Screenshot: documentation search results, captured February 13, 2016]

IBM appears to have discovered a new kind of product management where you build features into a product, then omit them from the documentation.

That was the approach taken for IBM Analytic Server Release 2.1, which was packaged up and shoved out the door so fast the documentation folks forgot to mention the Spark pushdown.  That said, Release 2.1.0.1 is an improvement; all functions that Analytic Server can push down to MapReduce now push down to Spark, and IBM supports the product on Cloudera and MapR as well as Hortonworks and BigInsights.

It’s not clear, though, why IBM thinks that licensing Analytic Server as a separate product is a smart move.  Most of the value in analytics is at the top of the stack (e.g., SPSS or Cognos), where users can see results.  Spark pushdown is rapidly becoming table stakes for analytics software; the smarter move for IBM is to bundle Analytic Server for free into SPSS, Cognos and BigInsights, to build value in those products.

It’s also curious that IBM continues to peddle Analytic Server while donating SystemML to open source.  Why not push down to Spark through SystemML?

So far as Spark is concerned, IBM is leaving SPSS Statistics users out in the cold unless they want to add Modeler and Analytic Server to the stack.  For these customers, Alteryx and RapidMiner look attractive.

A search for Spark in Cognos documentation yields a big fat zero, which explains why Gartner just tossed IBM from the Leaders quadrant in BI.

Update: at my request, an IBM executive shared a list of products that IBM says it has rebuilt around Spark to date.  I’m publishing it verbatim for reference, but note that the list includes:

  1. Double and triple counting (“Watson Content Analytics integrates with Dataworks which integrates with Spark as a Service”).
  2. Products that do not seem to exist (Spark connectors to DB2 and Informix appear in IBM documentation as generic JDBC connections).
  3. Aspirational products (Spark on Z/OS).
  4. Projects that are not products (Watson Discovery Advisor ran a POC with Spark).
  5. Capabilities that require little or no contribution from IBM (Spark runs under Platform Symphony EGO YARN Service). 

Other than that, it’s a good list.

  • IBM BigInsights  ( Version 4.0 included Spark 1.3.1, version 4.1 includes Spark 1.4.1 – GA’ed August 25th
  • EHAAS (BigInsights on Cloud)  – Includes Spark version 1.3.1 – GA’ed June  2015
  • Analytic for Hadoop – Includes Spark version 1.3.1 – Beta . Will be replaced by Pay-go and include Spark 1.4.1
  • Spark-as-a Service  – Beta in July . GA – Oct. Currently uses Spark version 1.3.1. Will move to Spark 1.4.1
  • Dataworks (Only Cloud) – Beta  – Integrated with Spark Service. Uses Spark Version 1.3.1 
  • SPSS Analytic Server and SPSS Modeler – SPSS Modeler will support Spark version 1.4.1. GA planned for end of Q3, 2015
  • Cloudant  – Cloudant includes Spark Connector. This product is already GA.
  • Omni Channel Pricing – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Dynamic Pricing –  Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Mark Down Optimization –   Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Nimbus ETL – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Journey Analytics – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA end of 2015
  • IBM Twitter CDE  On Cloud – Internal only
  • IBM Insights for Twitter Service – On Cloud , Externally Available
  • Internet of Things Real time Analytics on Cloud.  Integrating with Spark 1.3.1  – Open Beta
  • Platform Symphony EGO Service – Integrated with Spark Service for Resource Scheduling and Management. Can also be used with Spark bundled with IBM Open Platform.
  • DB2 Spark Connector
  • Netezza Spark Connector
  • Watson Content Analytics  -> Integrated with Dataworks which is integrating with Spark-as-a-Service
  • Watson Content Services – > Planning to use Spark for Data Ingestion and Enrichment. 
  • Spark on Zlinux  -> Spark enabled on zLinux
  • Spark on ZOS -> Will be available by end of year
  • No response
  • GPFS -GPFS – Spark, as part of BI, runs out of the box on GPFS. In our GPFS Ambari RPM package we changed the Spark service dependency from HDFS to GPFS and added the GPFS Hadoop connector jar to the classpath
  • Informix Connector
  • Watson Discovery Advisor  –  This is a product within Watson Health.  In a small POC that this team has done. they observed that using Scala and Spark , they can reduce their lines of code from 1000s to few hundred lines.” Looking to integrate Spark in early 2016.
  • Cognos – Team indicated that they will be able to submit SPARK SQL queries for getting the results for the data. They would connect to Spark using the JDBC/ODBC driver and then be able to execute Spark SQL queries to generate results for the report . This is planned for 2016.
  • Streams –   Spark MLLib Toolkit for InfoSphere Streams 
  • Watson Research team – Developed a Geospatial RDD on Spark.

Spark in IBM Watson Health Cloud

IBM insiders tell me that Watson relies heavily on Spark.  Okay.  Watson is a completely opaque product, so it’s impossible to verify whether IBM powers Watson with Spark or an army of trained crickets.

SystemML to Open Source

I was initially skeptical about SystemML, but as I learn more about this software I’m more excited about its potential.  Rather than simply building an interface to the native Spark machine learning library (MLlib), I understand that IBM has completely rebuilt the algorithms.  That’s a good thing; some of the folks I know who have tried to use MLlib aren’t impressed.  Without getting into the details of the issues, suffice it to say that it’s good for Spark to have multiple initiatives building functional libraries on top of the Spark core.

IBM’s Fred Reiss, chief architect at IBM’s Spark Technology Center, is scheduled to present about SystemML at Spark Summit East next week.

Spark in Bluemix

IBM introduced Spark-as-a-Service in Bluemix as a beta in July 2015, with general availability in October.

The service includes Spark, Jupyter Notebooks and SWIFT object storage.  It’s a bit lost in the jumble of services available in Bluemix, but the Catalogue has a handy search tool.

As of this writing, Bluemix offers Spark 1.4.  Although two dot releases behind, that is competitive with Qubole Data Service.  Databricks is still the best bet if you want the most up-to-date release.

Update: An IBM executive tells me that Bluemix now uses Spark 1.6.  In the meantime, however, IBM has removed the Spark release version from its Bluemix documentation.  

IBM People to Spark Projects

3,500 researchers and developers.  Wow!  That’s a lot of butts in seats.  Let’s break that down into four categories:

(1) IBM people who actively contribute to Spark.

(2) IBM developers building interfaces from IBM products to Spark.

(3) IBM developers building IBM products on top of Spark.

(4) IBM consultants building custom applications on top of Spark.

Note that of the four categories, only (1) actually moves the Spark project forward.  Of course, anyone who uses Spark has the potential to contribute feedback, but ultimately someone has to cut code.  While IBM tracks the Spark JIRAs to which it contributes, IBM executives could not answer a simple question: of the 248 people who contributed to Spark Release 1.6, how many work for IBM?

I suspect that most of those 3,500 researchers and developers are in categories (3) and (4).

Satheesh Bandaram from the IBM Spark Technology Center replies: 26 people from STC contributed to Spark 1.6, with about 80 code commits.

Additional IBM response: Since June 2015, when IBM announced the Spark Technology Center (STC), engineers in STC have actively contributed to Spark releases v1.4.x, v1.5.x and v1.6.0, as well as releases v1.6.1 and v2.0 (in progress).

As of today, March 2, IBM STC has contributed to over 237 JIRAs and counting.  About 50% are answers to major JIRAs reported in Apache Spark.

What’s in those 237 contributions:
·        103 out of 237 (43%) are deliverables in Spark SQL area
·        56 of them (23%) are in MLlib module
·        37 (16%) are in PySpark module  

These top three areas of focus from IBM STC made up 82% of the total contributions as of today.  The rest are in documentation, Spark Core, Streaming and other modules.
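For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the shares from the counts IBM quotes above.  The counts are IBM's; the variable and function names are mine.

```python
# Counts quoted in IBM's response; nothing here is new data.
TOTAL_JIRAS = 237
by_area = {"Spark SQL": 103, "MLlib": 56, "PySpark": 37}


def share(count, total=TOTAL_JIRAS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {area: share(count) for area, count in by_area.items()}
top3_count = sum(by_area.values())       # JIRAs in the three named areas
top3_share = share(top3_count)
remainder = TOTAL_JIRAS - top3_count     # documentation, Core, Streaming, etc.

for area, count in by_area.items():
    print("{:<10} {:>3} JIRAs  {:.1f}%".format(area, count, shares[area]))
print("{:<10} {:>3} JIRAs  {:.1f}%".format("Top three", top3_count, top3_share))
print("{:<10} {:>3} JIRAs  {:.1f}%".format("Other", remainder, share(remainder)))
```

Computed exactly, the three areas come to about 43.5%, 23.6% and 15.6%, or 82.7% combined, so the rounded figures in IBM's response are each within a point of the exact values.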

You can track progress on this live dashboard on GitHub: http://jiras.spark.tc/

Specific to Spark 1.6, IBM team members have over 80 commits, the majority of them from STC.  A total of 28 team members contributed to the release (25 of them from STC).  Each contributing engineer is a credited contributor in the release notes of Spark 1.6.

For SparkSQL, we contributed:
·        enhancements and fixes in the new DataSet API
·        DataFrame API, Data type
·        UDF and SQL standard compliance, such as adding EXPLAIN and PrintSchema capability, and support coalesce and re-partition etc.
·         We have added support for column datatype of CHAR
·         Fixed the type extractor failures for complex data types
·         Fixed DataFrames bugs in saving long column partitioned parquet file, and handling of various nullability bugs and optimization issues
·        Fixed the limitation in Order by clause to comply with standard.  
·        Contributed to a number of UDF code fixes in completion of Stddev support.  

For Machine Learning, the STC team met with key influencers and stakeholders in the Spark community to jointly work on items on the roadmap.  Most of the roadmap items discussed went into 1.6.  The implementation of the LU Decomposition algorithm is slated for the upcoming release.

In addition to helping implement the roadmap, here are some notable contributions:
·        We greatly improved the PySpark distributed matrix algebra by enriching the matrix operations and fixing bugs.
·        Enhanced the Word2Vec algorithm.
·        We added optimized first through fourth order summary statistics for DataFrames (technically in SparkSQL, but related to machine learning).
·        We greatly enhanced the PySpark API by adding interfaces to Scala machine learning tools.
·        We made a performance enhancement to the Linear Data Generator which is critical for unit testing in Spark ML.

The team also addressed major regressions on DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added JDBC Dialect for Apache Derby.

In addition to the JIRA activities, IBM STC also added JDBC dialect support for DB2 and made the Spark Connector for Netezza v0.1.1 available to the public through Spark Packages and a developer blog on IBM’s external site.

Spark Training

Like the Million Man March, “training a million people” sounds like one of those PR-driven claims that nobody is expected to take seriously, especially since it’s not time-boxed.

Anyway, the details:

  • AMPLab offers occasional training in the complete BDAS stack under the AMPCamp format.  IBM funds AMPLab, but it does not appear that AMPLab is doing anything now that it wasn’t already doing last June.
  • DataCamp does not offer Spark training.
  • MetiStream offers public and private Spark training with a defined curriculum and service offering.  The training program is certified by Databricks.
  • Galvanize does not offer Spark training.
  • Big Data University offers a two-part MOOC in Spark fundamentals.

The Big Data University courses are free, and four hours apiece, so a million enrollees is plausible, eventually at least.  Interestingly, MetiStream developed the second of the two BDU courses.  So the press release should read “MetiStream and IBM, but mostly MetiStream, will train a million….”

11 thoughts on “IBM and Spark (Updated)”

  1. Just in time for the launch of Beam

  2. I am the leader for IBM Spark Technology Center at San Francisco. I want to answer some of your questions… 26 people from STC contributed to Spark 1.6, with about 80 code commits. If you want, I can provide list of people and the JIRA numbers 🙂 Let me know if I can help answer anything else.

  3. If you spoke to Vijay, he is the real head of STC! My job also includes STC, but I also have other teams, like our Hadoop development team and BigInsights organization for Cloud and on-premise deployments.

  4. On the connectors, STC is developing a DB2 connector and a Netezza connector, along with a SWIFT object store connector. All of these should be coming out in the next few weeks, though the first version of the Netezza connector is already available in the community. We are also rolling out support for Spark 1.6 in the Spark Cloud service on IBM Bluemix very soon.

  5. Does IBM InfoSphere DataStage support building on the Spark API underneath?
    Why is it losing its competition in the data integration space against products like Talend?


    Lokesh
