IBM and Spark (Updated)

Updated March 8, 2016.  After publishing this post, I met with several IBM executives at Spark Summit East, who confirmed the accuracy of the original post and provided additional detail, which I’ve included in this version.  Updates are in bold red italics.

IBM also provided the low-resolution image.

IBM has a good story to tell — one out of ten contributors to Spark 1.6 were IBM employees.  But IBM does not tell its story effectively.  Nobody cares that IBM invented the punch card and the floppy disk.  Nobody cares that IBM is so big it can’t tell a straight product story.  Bigness is IBM’s problem.

On June 15, 2015, IBM announced a major commitment to Spark.  As we approach Spark Summit East, I thought it would be fun to check back and see how IBM’s accomplishments compare with the goals stated back in June.

Before we start, I’d like to note that any contribution to Spark moves the project forward, and is a good thing.  Also, simply by endorsing Spark, IBM has changed the conversation.  In early 2015, some analysts and journalists claimed that Spark was overhyped and “not enterprise ready.”  We haven’t heard a peep from this crowd since IBM’s announcement.  For that alone, IBM should get some kind of prize.  🙂

In its announcement, IBM detailed six initiatives:

  • IBM will build Spark into the core of the company’s analytics and commerce platforms.
  • IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
  • IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
  • IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
  • IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Let’s see where things stand.

Spark in IBM Analytics and Commerce Platforms

IBM has an expansive definition of “analytics”, reporting $17.9 billion in business analytics revenue in 2015.  IDC, which tracks the market, credits IBM with $4.5 billion in business analytics software revenue in 2014.  The remaining $13.4 billion, it seems, is services and fluff, neither of which counts when the discussion is “platforms.”

Of that $4.5 billion, the big dogs are DataStage (now InfoSphere), DB2, Netezza (now PureData System for Analytics), Cognos (now IBM Business Analytics) and SPSS (now IBM Predictive Analytics), so this is where we should look when IBM says it is building Spark into its products.

Currently, Cloudant is the only IBM data source with a published Spark connector.  Want to access DB2 with Spark?  It’s a science project.  Of course, you can always use the JDBC connector if you’re patient, but the standard is a parallel high-speed connector, like SAS has offered for years.  An IBM insider tells me that there is a project underway to build a Spark connector for Netezza, which will be a good thing when it’s available.

Update: IBM has subsequently added the one-way single-threaded Netezza connector to Spark Packages.  It’s also available on Git and the IBM Developer site.

I would emphasize that a one-way single-threaded connector is useful once, when you decommission your Netezza box and move the data elsewhere.  Netezza developed a native multi-threaded connector for SAS in a matter of months, so it’s not clear why it takes IBM so long to deliver something comparable for Spark.  
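To make the single-threaded versus parallel distinction concrete, here is a minimal sketch in Python of how a parallel connector might carve one table read into range predicates, one per concurrent reader. It is loosely modeled on the way Spark’s generic JDBC data source splits a read on a numeric column. The function name, the orders table and the order_id column are hypothetical, and this is not the Netezza connector’s code.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split a read on an integer column into one WHERE predicate per reader.

    Illustrative only: the first and last ranges are left open-ended so rows
    outside [lower, upper) are still picked up, and NULLs go to the first reader.
    """
    if num_partitions < 1 or upper - lower < num_partitions:
        raise ValueError("need 1 <= num_partitions <= upper - lower")
    if num_partitions == 1:
        # Single-threaded case: one reader scans the whole table.
        return ["1=1"]

    # Interior cut points that divide the range into roughly equal strides.
    cuts = [lower + (upper - lower) * i // num_partitions
            for i in range(1, num_partitions)]

    predicates = []
    for i in range(num_partitions):
        if i == 0:
            predicates.append(f"{column} < {cuts[0]} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {cuts[-1]}")
        else:
            predicates.append(f"{column} >= {cuts[i - 1]} AND {column} < {cuts[i]}")
    return predicates


# Hypothetical table and column, four concurrent readers.
for predicate in partition_predicates("order_id", 0, 1000, 4):
    print(f"SELECT * FROM orders WHERE {predicate}")
```

With four partitions, four readers can each pull roughly a quarter of the table at the same time. A single-threaded connector is the degenerate case: one predicate, one reader, one long wait.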

 

Last October in Budapest, an IBM VP — one of 137 IBM Veeps of “Big Data”  — claimed that DataStage InfoSphere supports Spark now.  Searching documentation for InfoSphere 11.3, the most current version, produces this:

[Screenshot: search results for Spark in the InfoSphere 11.3 documentation]

IBM appears to have discovered a new kind of product management where you build features into a product, then omit them from the documentation.

That was the approach taken for IBM Analytics Server Release 2.1, which was packaged up and shoved out the door so fast the documentation folks forgot to mention the Spark pushdown.  That said, Release 2.1.0.1 is an improvement; all functions that Analytic Server can push down to MapReduce now push down to Spark, and IBM supports the product on Cloudera and MapR as well as Hortonworks and BigInsights.

It’s not clear, though, why IBM thinks that licensing Analytics Server as a separate product is a smart move.  Most of the value in analytics is at the top of the stack (e.g., SPSS or Cognos), where users can see results.  Spark pushdown is rapidly becoming table stakes for analytics software; the smarter move for IBM is to bundle Analytics Server for free into SPSS, Cognos and BigInsights, to build value in those products.

It’s also curious that IBM simultaneously continues to peddle Analytics Server while donating SystemML to open source.  Why not push down to Spark through SystemML?

So far as Spark is concerned, IBM is leaving SPSS Statistics users out in the cold unless they want to add Modeler and Analytics Server to the stack.  For these customers, Alteryx and RapidMiner look attractive.

A search for Spark in Cognos documentation yields a big fat zero, which explains why Gartner just tossed IBM from the Leaders quadrant in BI.

[Screenshot: search results for Spark in the Cognos documentation]

Update: at my request, an IBM executive shared a list of products that IBM says it has rebuilt around Spark to date.  I’m publishing it verbatim for reference, but note that the list includes:

  1. Double and triple counting (“Watson Content Analytics integrates with Dataworks which integrates with Spark as a Service”).
  2. Products that do not seem to exist (Spark connectors to DB2 and Informix appear in IBM documentation as generic JDBC connections).
  3. Aspirational products (Spark on Z/OS).
  4. Projects that are not products (Watson Discovery Advisor ran a POC with Spark).
  5. Capabilities that require little or no contribution from IBM (Spark runs under Platform Symphony EGO YARN Service). 

Other than that, it’s a good list.

  • IBM BigInsights  ( Version 4.0 included Spark 1.3.1, version 4.1 includes Spark 1.4.1 – GA’ed August 25th
  • EHAAS (BigInsights on Cloud)  – Includes Spark version 1.3.1 – GA’ed June  2015
  • Analytic for Hadoop – Includes Spark version 1.3.1 – Beta . Will be replaced by Pay-go and include Spark 1.4.1
  • Spark-as-a Service  – Beta in July . GA – Oct. Currently uses Spark version 1.3.1. Will move to Spark 1.4.1
  • Dataworks (Only Cloud) – Beta  – Integrated with Spark Service. Uses Spark Version 1.3.1 
  • SPSS Analytic Server and SPSS Modeler – SPSS Modeler will support Spark version 1.4.1. GA planned for end of Q3, 2015
  • Cloudant  – Cloudant includes Spark Connector. This product is already GA.
  • Omni Channel Pricing – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Dynamic Pricing –  Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Mark Down Optimization –   Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Nimbus ETL – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Journey Analytics – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA end of 2015
  • IBM Twitter CDE  On Cloud – Internal only
  • IBM Insights for Twitter Service – On Cloud , Externally Available
  • Internet of Things Real time Analytics on Cloud.  Integrating with Spark 1.3.1  – Open Beta
  • Platform Symphony EGO Service – Integrated with Spark Service for Resource Scheduling and Management. Can also be used with Spark bundled with IBM Open Platform.
  • DB2 Spark Connector
  • Netezza Spark Connector
  • Watson Content Analytics  -> Integrated with Dataworks which is integrating with Spark-as-a-Service
  • Watson Content Services – > Planning to use Spark for Data Ingestion and Enrichment. 
  • Spark on Zlinux  -> Spark enabled on zLinux
  • Spark on ZOS -> Will be available by end of year
  • No response
  • GPFS -GPFS – Spark, as part of BI, runs out of the box on GPFS. In our GPFS Ambari RPM package we changed the Spark service dependency from HDFS to GPFS and added the GPFS Hadoop connector jar to the classpath
  • Informix Connector
  • Watson Discovery Advisor  –  This is a product within Watson Health.  In a small POC that this team has done. they observed that using Scala and Spark , they can reduce their lines of code from 1000s to few hundred lines.” Looking to integrate Spark in early 2016.
  • Cognos – Team indicated that they will be able to submit SPARK SQL queries for getting the results for the data. They would connect to Spark using the JDBC/ODBC driver and then be able to execute Spark SQL queries to generate results for the report . This is planned for 2016.
  • Streams –   Spark MLLib Toolkit for InfoSphere Streams 
  • Watson Research team – Developed a Geospatial RDD on Spark.

Spark in IBM Watson Health Cloud

IBM insiders tell me that Watson relies heavily on Spark.  Okay.  Watson is a completely opaque product, so it’s impossible to verify whether IBM powers Watson with Spark or an army of trained crickets.

SystemML to Open Source

Though initially skeptical about SystemML, as I learn more about this software I’m more excited about its potential.  Rather than simply building an interface to the native Spark machine learning library (MLlib), I understand that IBM has completely rebuilt the algorithms.  That’s a good thing — some of the folks I know who have tried to use MLlib aren’t impressed.  Without getting into the details of the issues, suffice to say that it’s good for Spark to have multiple initiatives building functional libraries on top of the Spark core.

IBM’s Fred Reiss, chief architect at IBM’s Spark Technology Center, is scheduled to present about SystemML at Spark Summit East next week.

Spark in Bluemix

IBM introduced Spark-as-a-Service in Bluemix as a beta in July 2015 and made it generally available in October.

The service includes Spark, Jupyter Notebooks and Swift object storage.  It’s a bit lost in the jumble of services available in Bluemix, but the Catalogue has a handy search tool.

As of this writing, Bluemix offers Spark 1.4.  Although two dot releases behind, that is competitive with Qubole Data Service.  Databricks is still the best bet if you want the most up to date release.

Update: An IBM executive tells me that Bluemix now uses Spark 1.6.  In the meantime, however, IBM has removed the Spark release version from its Bluemix documentation.  

IBM People to Spark Projects

3,500 researchers and developers.  Wow!  That’s a lot of butts in seats.  Let’s break that down into four categories:

(1) IBM people who actively contribute to Spark.

(2) IBM developers building interfaces from IBM products to Spark.

(3) IBM developers building IBM products on top of Spark.

(4) IBM consultants building custom applications on top of Spark.

Note that of the four categories, only (1) actually moves the Spark project forward.  Of course, anyone who uses Spark has the potential to contribute feedback, but ultimately someone has to cut code.  While IBM tracks the Spark JIRAs to which it contributes, IBM executives could not answer a simple question: of the 248 people who contributed to Spark Release 1.6, how many work for IBM?

I suspect that most of those 3,500 researchers and developers are in categories (3) and (4).

Satheesh Bandaram from the IBM Spark Technology Center replies: 26 people from STC contributed to Spark 1.6, with about 80 code commits.
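As a rough check on the “one out of ten” figure, the share can be worked out from the numbers quoted in this post. The sketch below is simple arithmetic in Python; the variable names are illustrative, and the counts are the 248 total contributors cited above, the 26 STC contributors from this reply, and the 28 IBM team members cited in IBM’s additional response.

```python
TOTAL_CONTRIBUTORS = 248   # contributors to Spark 1.6, as cited in this post
STC_CONTRIBUTORS = 26      # figure from the Spark Technology Center reply
IBM_CONTRIBUTORS = 28      # IBM-wide figure from IBM's additional response


def contributor_share(ibm_count, total):
    """Return IBM's fraction of all contributors to the release."""
    return ibm_count / total


stc_share = contributor_share(STC_CONTRIBUTORS, TOTAL_CONTRIBUTORS)
ibm_share = contributor_share(IBM_CONTRIBUTORS, TOTAL_CONTRIBUTORS)

print(f"STC only: {stc_share:.1%} of contributors")
print(f"All IBM:  {ibm_share:.1%} of contributors")
```

Either way the share lands a little above 10%, which squares with roughly one contributor in ten.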

Additional IBM response: Since June 2015, when IBM announced the Spark Technology Center (STC), engineers in STC have actively contributed to Spark releases v1.4.x, v1.5.x and v1.6.0, as well as releases v1.6.1 and v2.0 (in progress).

As of today, March 2, IBM STC has contributed to over 237 JIRAs and counting.  About 50% are answers to major JIRAs reported in Apache Spark.

What’s in those 237 contributions:
·        103 out of 237 (43%) are deliverables in Spark SQL area
·        56 of them (23%) are in MLlib module
·        37 (16%) are in PySpark module  

These top 3 areas of focus from IBM STC made up 82% of the total contributions as of today.   The rest are in the documentation, Spark Core and Streaming etc. modules.
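The percentages in this breakdown are easy to check from the raw counts. The sketch below is plain arithmetic in Python; the variable names are illustrative, and the counts come straight from the response above.

```python
TOTAL_JIRAS = 237
# JIRA counts by area, as quoted in the IBM response.
by_area = {"Spark SQL": 103, "MLlib": 56, "PySpark": 37}


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {area: share(count, TOTAL_JIRAS) for area, count in by_area.items()}
top3_count = sum(by_area.values())
top3_share = share(top3_count, TOTAL_JIRAS)
remainder = TOTAL_JIRAS - top3_count

for area, pct in shares.items():
    print(f"{area}: {by_area[area]} of {TOTAL_JIRAS} = {pct:.1f}%")
print(f"Top three areas: {top3_count} of {TOTAL_JIRAS} = {top3_share:.1f}%")
print(f"Everything else: {remainder} JIRAs")
```

By this arithmetic the three areas account for 196 of the 237 JIRAs, a little under 83%, which leaves 41 for documentation, Spark Core, Streaming and the rest. The quoted percentages appear to be truncated in some places and rounded in others, but the overall picture holds.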

You can track progress on this live dashboard on GitHub: http://jiras.spark.tc/

Specific to Spark 1.6, IBM team members have over 80 commits – Majority of them from STC.  A total of 28 team members contributed to the release (25 of them from STC). Each contributing engineer is a credited contributor in the release notes of Spark 1.6.

For SparkSQL, we contributed:
·        enhancements and fixes in the new DataSet API
·        DataFrame API, Data type
·        UDF and SQL standard compliance, such as adding EXPLAIN and PrintSchema capability, and support coalesce and re-partition etc.
·         We have added support for column datatype of CHAR
·         Fixed the type extractor failures for complex data types
·         Fixed DataFrames bugs in saving long column partitioned parquet file, and handling of various nullability bugs and optimization issues
·        Fixed the limitation in Order by clause to comply with standard.  
·        Contributed to a number of UDF code fixes in completion of Stddev support.  

For Machine Learning, the STC team met with key influencers and stakeholders in the Spark community to jointly work on items on the roadmap.  Most of the roadmap items discussed went into 1.6.  The implementation of the LU Decomposition algorithm is slated for the upcoming release.

In addition to helping implement the roadmap, here are some notable contributions:
·        We greatly improved the Pyspark distributed matrix algebra by enriching the matrix operations and fixing bugs.
·        Enhanced the Word2Vec algorithm.
·        We added optimized first through fourth order summary statistics for DataFrames (technically in SparkSQL, but related to machine learning).
·        We greatly enhanced Pyspark API by adding interfaces to Scala Machine learning tools.
·        We made a performance enhancement to the Linear Data Generator which is critical for unit testing in Spark ML.

The team also addressed major regressions on DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added JDBC Dialect for Apache Derby.

In addition to the JIRA activities, IBM STC also added the JDBC dialect support for DB2 and made Spark Connector for Netezza v0.1.1 available to public through Spark Packages and a developer blog on IBM external site. 

Spark Training

Like the Million Man March, “training a million people” sounds like one of those PR-driven claims that nobody is expected to take seriously, especially since it’s not time-boxed.

Anyway, the details:

  • AMPLab offers occasional training in the complete BDAS stack under the AMPCamp format.  IBM funds AMPLab, but it does not appear that AMPLab is doing anything now that it wasn’t already doing last June.
  • DataCamp does not offer Spark training.
  • MetiStream offers public and private Spark training with a defined curriculum and service offering.  The training program is certified by Databricks.
  • Galvanize does not offer Spark training.
  • Big Data University offers a two-part MOOC in Spark fundamentals.

The Big Data University courses are free, and four hours apiece, so a million enrollees is plausible, eventually at least.  Interestingly, MetiStream developed the second of the two BDU courses.  So the press release should read “MetiStream and IBM, but mostly MetiStream, will train a million….”

Big Analytics Roundup (October 26, 2015)

Fourteen stories this week, beginning with an announcement from IBM.  This week, IBM celebrates 14 straight quarters of declining revenue at its IBM Insight conference, appropriately enough at the Mandalay Bay in Vegas, where the restaurants are overhyped and overpriced.

Meanwhile, the first Spark Summit Europe meets in Amsterdam, in the far more interesting setting of the Beurs van Berlage.  There will be a live stream on Wednesday and Thursday — details here.  Sadly, I can’t make this one — the first Spark Summit I’ve missed — but am looking forward to the live stream.

(1) IBM Announces Spark on Bluemix

At its IBM Insight beauty show, IBM announces availability of its Apache Spark cloud service.  Actually, IBM announced it back in July, but that was a public beta.   On ZDNet, Andrew Brust gushes, noting that IBM has DB2, Watson, Netezza, Cognos, TM1, SPSS, Informix and Cloudant in its portfolio.  He fails to note that of those products, exactly one — Cloudant — actually interfaces with Spark.

There were rumors that IBM would have an exciting announcement about Spark at this show, but if this is it — yawn.  Looking at IBM’s “Spark in the cloud” offering, I don’t see anything that sets it apart from other available offerings unless you have a Blue fetish.

Update: Rod Reicks of IBM writes to note that IBM’s new release of SPSS Analytics Server runs processes in Spark.  For the uninitiated, Analytics Server is a product you license from IBM that enables SPSS Modeler users to run selected operations in Hadoop.  Previous versions ran through MapReduce only.  Reicks claims that the latest version runs through Spark when available.

I say “claims” because there is no reference to this feature in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting.  So the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

You’d think that with IBM’s armies of people they could at least find someone to write documentation.

(2) Mahout Book FAIL

Packt announces a book on Clustering with Mahout with an entire chapter devoted to Canopy Clustering, which the Mahout team just deprecated.

(3) Concurrent Adds Spark Support

Concurrent announces Release 2.0 of Driven, its oddly-named performance management software, which now includes support for Apache Spark.

(4) Flink Founder Touts Streaming Analytics

At Big Data Spain, Data Artisans co-founder Kostas Tzoumas argues that streaming is the basis for all analytics, which is a bit over the top: as they say, if all you have is a hammer, the world looks like a nail.  Still, his deck is a nice intro to Flink, which has made some progress this year.

(5) AtScale Announces Release 3.0

AtScale, one of the more interesting startups in the BI space, delivers Release 3.0 of its OLAP-on-Hadoop platform.  Rather than introducing a new user interface into the mix, AtScale makes it possible for BI users to work with Hadoop tables without jumping back and forth to programming tools.  The product currently supports Tableau, Excel, Qlik, Spotfire, MicroStrategy and JasperSoft, and runs on CDH, HDP or MapR with Impala, Spark SQL or Hive on Tez.  The new release includes enhanced role-based security, with support for Kerberos, username/password or LDAP authentication.

(6) Neo: Graphs are Eating the World

Graph database leader Neo announces immediate availability of Neo4j 2.3, which includes what it calls “intelligent applications at scale” and Docker support.  Exactly what Neo means by “intelligent applications at scale” is unclear, but if Neo is claiming that you no longer have to dump a graph into Spark to run a PageRank, I’ll believe it when I see it.

(7) New Notebook Sharing for Databricks 

Databricks announces new notebook sharing capabilities for its eponymous product.  On the Databricks blog, Denise Li and Dave Wang explain.

(8) Teradata: Blah, Blah, Blah, IoT, Blah, Blah Blah

At its annual user conference, Teradata announces that it’s heard about IoT.    Teradata also announces that it will make Aster available on Hadoop, which would have been interesting in 2012.  Aster, for the uninitiated, includes a SQL on MapReduce engine, which is rendered obsolete by fast SQL engines like Presto, which Teradata has just embraced.

(9) Flink Forward Redux

As I noted last week, the first Flink Forward conference met in Berlin two weeks ago.  William Benton records his impressions.

Presentations are here.  Some highlights:

  • Dongwon Kim benchmarks Flink against MR, MR on Tez and Spark.  Flink wins.
  • Kostas Tzoumas outlines the Flink development roadmap through Release 1.0.
  • Martin Junghanns explains graph analytics with Flink.
  • Anwar Rizal demonstrates streaming decision trees with Flink.

Henning Kropp offers resources for diving deeply into Flink.

(10) Pyramid Analytics Lands New Funding

Amsterdam-based BI startup Pyramid Analytics announces a $30 million “B” round to help it try to explain why we need more BI software.

(11) Harte Hanks Switches from CDH to MapR

John Leonard explains why Harte Hanks switched from Cloudera to MapR.  Most likely explanation: they were able to cut a cheaper deal with MapR.

(12) Audience Modeling with Spark

Guest posting on the Databricks blog, Eugene Zhulenev explains audience modeling with Spark ML pipelines.

(13) New Functions in Drill

On the MapR blog, Neeraja Rentachintala describes new capabilities in Drill Release 1.2, including SQL window functions.

(14) Integrating Spark and Redshift

“Redshift is where data goes to die.”  — Rob Ferguson, Spark Summit East

On the Databricks blog, Sameer Wadkar of Axiomine explains how to use the spark-redshift package, first introduced in March of this year and now in version 0.5.2.  So you can yank your data out of Redshift and do something with it. (h/t Hadoop Weekly)