With its customary PR blitz, IBM announces that it has added Spark integration to several products, including SPSS. IBM gets a small pat on the head for adding Spark support to its Analytic Server software, on the premise that something is better than nothing.
Only a very narrow pool of SPSS users will benefit from this enhancement. Spark integration is available only to the subset of SPSS users who license SPSS Modeler; most SPSS users work with SPSS Statistics. Users must also license SPSS Analytic Server, a product that runs only on Hortonworks HDP or IBM BigInsights.
So, if you’re using the high-end version of the second most popular commercial analytic server, and you’re willing to pay extra to integrate with the third and fourth ranked Hadoop distributions, you’re in luck today.
Analytic Server is a software middle layer installed on Hortonworks HDP or BigInsights; it selectively supports SPSS Modeler operations in Hadoop. Previous versions ran through MapReduce only; IBM claims that the latest version runs through Spark when available, although the product documentation is surprisingly quiet on the subject. There is no reference to Spark in IBM’s Release Notes, Installation Guide or User’s Guide. Spark is mentioned deep in the Administrator Guide, under Troubleshooting; so the good news is that if the product fails, IBM has some tips, one of which should be “Install Spark.”
Analytic Server 2.1 partially supports most Modeler record and field operations. Out of Modeler’s 37 data mining nodes, Analytic Server fully supports 8, partially supports 5 and does not support 24 (about 65 percent). Among the missing:
- Logistic Regression
- Support Vector Machines
- Feature Selection
- Anomaly Detection
Everyone understands that software engineering takes time, but IBM’s priorities are muddled. Logistic regression, k-means, SVM and PCA are all available today in MLlib, Spark’s open source machine learning library; I suspect that IBM figures it can’t justify additional license fees if it points to algorithms that anyone can use for free (*). Clustering, PCA, feature selection and anomaly detection are precisely the kinds of analyses users want to run on all of the data, not on a sample extracted back to a server.
(*) IBM is mistaken on that point, of course. There are a lot of business users who want the power of Spark but don’t want to mess with a programming API. These users would happily pay for a nice business user front end like SPSS Modeler, and they won’t care what happens in the back end.
Assuming that this product actually works (not guaranteed, given the sloppy and incomplete documentation), it is better than the previous version of Analytic Server, but that is a low bar. Spark or no, IBM is way behind SAS in this space; I’m not a great believer in SAS’ proprietary approach to distributed in-memory analytics, but compared to IBM’s offering SAS wins on depth of features and breadth of platform support. There are no published benchmarks, but I suspect that SAS wins on performance as well.
Also, SAS knows how to write documentation, which seems to be a problem for IBM.
To its credit, IBM’s Analytic Server offers more Spark capability than current offerings by Alpine, Alteryx and RapidMiner; but H2O and Skytree offer richer and better engines for serious machine learning.
As for the majority of SPSS users, wouldn’t it be great if SPSS could just connect to a Spark DataFrame? Or if Spark could ingest SPSS datasets?