IBM Adds Spark Support to Analytics Server

With its customary PR blitz, IBM has announced Spark integration for several products, including SPSS. IBM gets a small pat on the head for adding Spark support to its Analytics Server software, on the premise that something is better than nothing.

There is a very narrow pool of SPSS users who will benefit from this enhancement.  Spark integration is only available to the subset of SPSS users who license SPSS Modeler; most SPSS users work with SPSS Statistics.  Users must also license SPSS Analytics Server, a product that only runs on Hortonworks HDP or IBM BigInsights.

So, if you're using the high-end version of the second most popular commercial analytic server, and you're willing to pay extra to integrate with the third- and fourth-ranked Hadoop distributions, you're in luck today.

Analytics Server is a software middle layer installed on Hortonworks or BigInsights; it selectively supports SPSS Modeler operations in Hadoop. Previous versions ran through MapReduce only; IBM claims that the latest version runs through Spark when available, although the product documentation is surprisingly quiet on the subject. There is no reference to Spark in IBM's Release Notes, Installation Guide or User's Guide. Spark is mentioned deep in the Administrator Guide, under Troubleshooting; so the good news is that if the product fails, IBM has some tips, one of which should be "Install Spark."

Analytics Server 2.1 partially supports most Modeler record and field operations. Of Modeler's 37 data mining nodes, Analytics Server fully supports 8, partially supports 5 and does not support the remaining 24. Among the missing:

  • Logistic Regression
  • k-Means
  • Support Vector Machines
  • PCA
  • Feature Selection
  • Anomaly Detection

Everyone understands that software engineering takes time, but IBM's priorities are muddled. Logistic regression, k-means, SVM and PCA are all available today in Spark's open source library; I suspect IBM figures it can't justify additional license fees for algorithms that anyone can use for free (*). Clustering, PCA, feature selection and anomaly detection are precisely the kinds of analyses users want to run on all of the data, not on a sample extracted back to a server.

(*) IBM is mistaken on that point, of course.  There are a lot of business users who want the power of Spark but don’t want to mess with a programming API.  These users would happily pay for a nice business user front end like SPSS Modeler, and they won’t care what happens in the back end.

Assuming that this product actually works — not guaranteed, given the sloppy and incomplete documentation — it is better than the previous version of Analytics Server, but that is a low bar.  Spark or no, IBM is way behind SAS in this space; I’m not a great believer in SAS’ proprietary approach to distributed in-memory analytics, but compared to IBM’s offering SAS wins on depth of features and breadth of platform support.  There are no published benchmarks, but I suspect that SAS wins on performance as well.

Also, SAS knows how to write documentation, which seems to be a problem for IBM.

To its credit, IBM's Analytics Server offers more Spark capability than current offerings from Alpine, Alteryx and RapidMiner; but H2O and Skytree offer richer engines for serious machine learning.

As for the majority of SPSS users, wouldn’t it be great if SPSS could just connect to a Spark DataFrame?  Or if Spark could ingest SPSS datasets?
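
On the second wish, at least, the open source stack can already fake it. The sketch below is my own illustration, not an IBM or SPSS feature: it reads an SPSS .sav file into pandas and hands it to Spark. It assumes the pyreadstat package that backs pandas.read_spss, and the file name is made up.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spss-to-spark").getOrCreate()

# Read the SPSS dataset into pandas (pyreadstat does the parsing),
# then promote it to a Spark DataFrame for distributed processing.
pdf = pd.read_spss("survey.sav")   # hypothetical file
sdf = spark.createDataFrame(pdf)
sdf.printSchema()
```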

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First, some definitions. The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I've written previously about the importance of distributed computing for high-performance predictive analytics, why it's difficult to deliver, and why it's potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor's data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster's built-in analytic functions). While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in. In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they're effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H2O is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters. Current functionality (Release 2.8.4.4) includes Cox proportional hazards modeling, deep learning, generalized linear models, gradient boosted classification and regression, k-means clustering, a Naive Bayes classifier, principal components analysis and Random Forests. The software also includes tooling for data transformation, model assessment and scoring. Users interact with the software through a web interface, a REST API or the h2o package in R. H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.
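
For the curious, here is a minimal sketch of what working with H2O looks like from code, using the h2o Python client (current releases ship one; the release described above is driven mainly through R and the REST API). The file name, response and predictor names are hypothetical.

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()  # connect to (or launch) a local H2O cluster

frame = h2o.import_file("churn.csv")            # hypothetical dataset
frame["churned"] = frame["churned"].asfactor()  # treat response as categorical

model = H2OGeneralizedLinearEstimator(family="binomial")
model.train(x=["tenure", "monthly_charges"], y="churned", training_frame=frame)
print(model.auc(train=True))
```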

H2O.ai provides commercial support for the open source software. There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, PayPal and Nielsen.

MADlib

Distributor: Pivotal Software

MADlib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7). While it is primarily a captive project of Pivotal Software (most of the top contributors are Pivotal or EMC employees), the support for PostgreSQL qualifies it for this list. MADlib offers rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.
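
Because MADlib lives inside the database, a "program" is just SQL. The sketch below is my own illustration, not Pivotal documentation: it trains an in-database linear regression from Python via psycopg2, assuming a PostgreSQL database with MADlib installed; the houses table and its columns are hypothetical.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics")   # hypothetical connection string
cur = conn.cursor()

# MADlib writes results to new tables; clear any previous run first.
cur.execute("DROP TABLE IF EXISTS houses_linregr, houses_linregr_summary")

# Train a linear regression entirely inside the database.
cur.execute("""
    SELECT madlib.linregr_train(
        'houses',                   -- source table
        'houses_linregr',           -- output table
        'price',                    -- dependent variable
        'ARRAY[1, sqft, bedrooms]'  -- independent variables (1 = intercept)
    )
""")
cur.execute("SELECT coef, r2 FROM houses_linregr")
print(cur.fetchone())
conn.commit()
```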

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project incepted in 2011 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community. The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago. Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce. To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark. For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that's worth). Several commercial vendors, including Predixion Software and RapidMiner, leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics: a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics. As of Release 1.2, Spark supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3. Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLlib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers. Thanks to a large and growing contributor community, Spark MLlib's functionality is expanding faster than that of any other open source or commercial software listed in this article.
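
As a taste of the API, here is a minimal sketch of the 1.2-era MLlib interface from Python: logistic regression trained on an RDD of LabeledPoints. The input file and its layout (label first, comma-separated features) are hypothetical.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="mllib-sketch")

def parse(line):
    # expects "label,feature1,feature2,..." on each line
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("training_data.csv").map(parse).cache()
model = LogisticRegressionWithSGD.train(data, iterations=100)

errors = data.filter(lambda p: model.predict(p.features) != p.label).count()
print("Training error: %f" % (errors / float(data.count())))
```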

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual, workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases. Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum database. (Alpine and Greenplum have common roots in the EMC ecosystem.) Usability is the product's key selling point, and the analytic feature set is relatively modest; however, Chorus' collaboration and data cataloguing capabilities are unique. Alpine's customer list is growing; the published list does not yet include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database. Embedded in SQL, analytics may be invoked from a range of applications, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS. The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately). The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes. Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees. The cadence of enhancements for this product is very slow; first released in May 2013, the product has received a single maintenance release since then.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts. For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, a Naive Bayes classifier and two Spark MLlib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and DataStax Enterprise. As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components: Revolution R, an enhanced and commercially supported R distribution; a Windows IDE; integration tools; and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface. A little over a year ago, Revolution released version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce. The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering. Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; in Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC. Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks). In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated "Analysis" nodes within the Hadoop cluster). While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users leverage the algorithms from SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December 2011 with great fanfare. To date, the product lacks a single public reference customer; this could mean that SAS' marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between. As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA's inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture. Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA. SAS prices this software based on the number of cores in the servers upon which it is deployed; while I can't disclose the list price per core, it's fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first-year fee.

Skytree Infinity

Vendor: Skytree

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013. Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark. Distributed algorithms can run on a free-standing cluster or co-located in Hadoop under YARN. The product has a programming interface; the vendor claims the ability to run from R, Weka, C++ and Python. Neither Skytree's modest list of algorithms nor its short list of public reference customers has changed in the past two years.

Automated Predictive Modeling

A colleague asks: can we automate predictive modeling?

How we answer the question depends on the context.   Consider the two variations on the question below, with more precise wording:

  1. Can we completely eliminate the need for expertise in predictive modeling — so that an “ordinary business user” can do it?
  2. Can we make expert analysts more productive by automating certain repetitive tasks?

The first form of the question, the search for "business user" analytics, is a common vision among software marketing folk and industry analysts; it is based on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics. That premise is largely false, for reasons that warrant a separate blog post; for now, let's just stipulate that the answer is no: it is not possible to eliminate human expertise from predictive modeling, for the same reason that robotic surgery does not eliminate the need for surgeons.

However, if we focus on the second form of the question and concentrate on how to make expert analysts more productive, the situation is much more promising. Many data preparation tasks are easy to automate, such as detecting and eliminating zero-variance columns, treating missing values and handling outliers. The most promising area for automation, however, is model testing and assessment.
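
To make that concrete, here is a small sketch (in plain pandas, nothing vendor-specific) of the sort of preparation that is easy to automate: dropping zero-variance columns, imputing missing values and capping outliers. It assumes a numeric DataFrame, and the thresholds are arbitrary.

```python
import pandas as pd

def auto_prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Automate three routine data preparation chores."""
    df = df.copy()

    # 1. Drop zero-variance (constant) columns.
    constant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    df = df.drop(columns=constant)

    # 2. Impute missing values with the column median.
    df = df.fillna(df.median(numeric_only=True))

    # 3. Cap outliers at the 1st and 99th percentiles.
    lower, upper = df.quantile(0.01), df.quantile(0.99)
    return df.clip(lower=lower, upper=upper, axis=1)
```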

Optimizing a predictive model requires experimentation and tuning.  For any given problem, there are many available modeling techniques, and for each technique there are many ways to specify and parameterize a model.  For the most part, trial and error is the only way to identify the best model for a given problem and data set. (The No Free Lunch theorem formalizes this concept.)

Since the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model.  In applied predictive analytics, however, the analyst’s time is strictly limited; a client in the marketing services industry reports an SLA of thirty minutes or less to build a predictive model.  Strict time constraints do not permit much time for experimentation.

Analysts tend to deal with this problem by settling for sub-optimal models, arguing that models need only be “good enough,” or defending use of one technique above all others.  As clients grow more sophisticated, however, these tactics become ineffective.  In high-stakes hard-money analytics — such as trading algorithms, catastrophic risk analysis and fraud detection — small improvements in model accuracy have a bottom line impact, and clients demand the best possible predictions.

Automated modeling techniques are not new.  Before Unica launched its successful suite of marketing automation software, the company’s primary business was advanced analytics, with a particular focus on neural networks.  In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model.   Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four different types of predictive models.  Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.

Two other commercial attempts at automated predictive modeling date from the late 1990s.  The first, MarketSwitch, was less than successful.  MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers”.  Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine and replaced the “automated modeling” capability with outsourced analytic services.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization. The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for marketing analytics, which it attempted to sell directly to C-level executives. This effort was modestly successful, leading to the sale of the company to SAP in 2013 for an estimated $40 million.

In the last several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products.  In 2010, SAS introduced SAS Rapid Modeler, an add-in to SAS Enterprise Miner.  Rapid Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection and model selection.  The user specifies a data set and response measure; Rapid Modeler determines whether the response is continuous or categorical, and uses this information together with other diagnostics to test a range of modeling techniques.  The user can control the scope of techniques to test by selecting basic, intermediate or advanced methods.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster and Auto Numeric nodes.  The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning and variable recasting.   The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules and set limits on model training.

All of the software products discussed so far are commercially licensed. There are two open source projects worth noting: the caret package in open source R and the MLBase project. The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques. The package includes pre-processing tools to support tasks such as dummy coding, detecting zero-variance predictors and identifying correlated predictors, as well as tools to support model training and tuning. The training function in caret currently supports 149 different modeling techniques; it supports parameter optimization within a selected technique, but does not optimize across techniques. To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results.
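
caret is R, but the idea of "search across techniques, then tune within each" is easy to illustrate in Python with scikit-learn. The sketch below is my own analogue, not part of caret or MLBase; the candidate techniques and parameter grids are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One entry per technique: an estimator plus the grid to tune within it.
candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "gbm": (GradientBoostingClassifier(random_state=0),
            {"learning_rate": [0.05, 0.1], "n_estimators": [100, 200]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

best = max(results, key=lambda name: results[name][0])
print(best, results[best])
```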

MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group, is an ambitious effort to develop a scalable machine learning platform on Apache Spark. The ML Optimizer seeks to simplify machine learning problems for end users by automating the model selection task so that the user need only specify a response variable and a set of predictors. The Optimizer project is still in active development, with an alpha release expected in 2014.

What have we learned from various attempts to implement automated predictive modeling?  Commercial startups like KXEN and MarketSwitch only marginally succeeded because they tried to oversell the concept as a means to replace the analyst altogether.  Most organizations understand that human judgement plays a key role in analytics, and they aren’t willing to entrust hard money analytics entirely to a black box.

What will the next generation of automated modeling platforms look like?  There are seven key features that are critical for an automated modeling platform:

  • Automated model-dependent data transformations
  • Optimization across and within techniques
  • Intelligent heuristics to limit the scope of the search
  • Iterative bootstrapping to expedite search
  • Massively parallel design
  • Platform agnostic design
  • Custom algorithms

Some methods require data to be transformed in specific ways; neural nets, for example, typically work with standardized predictors, while Naive Bayes and CHAID require all predictors to be categorical. The analyst should not have to perform these operations manually; instead, the transformation operations should be built into the test plan script and run automatically, which ensures that the maximum number of techniques can be evaluated for any data set.
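
One way to build the transformations into the test plan is to attach the preprocessing each technique needs directly to the candidate model, so it runs automatically whenever that technique is tested. Here is a sketch with scikit-learn pipelines; the estimator choices are stand-ins of my own, not a prescription.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

test_plan = {
    # Neural nets generally want standardized predictors.
    "neural_net": Pipeline([
        ("scale", StandardScaler()),
        ("model", MLPClassifier(max_iter=500)),
    ]),
    # Categorical Naive Bayes wants discrete predictors, so bin them first.
    "naive_bayes": Pipeline([
        ("bin", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")),
        ("model", CategoricalNB()),
    ]),
}

# Each entry fits and cross-validates like any single estimator, e.g.:
#   test_plan["neural_net"].fit(X_train, y_train)
```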

To find the best predictive model, we need to be able to search across techniques and to tune parameters within techniques. Potentially, this means a massive number of model train-and-test cycles; we can use heuristics to limit the scope of techniques to be evaluated based on characteristics of the response measure and the predictors. (For example, a categorical response measure rules out a number of techniques, and a continuous response measure rules out a different set.) Instead of a brute-force search for the best technique and parameterization, a "bootstrapping" approach can use information from early iterations to specify subsequent tests.
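
A crude sketch of the heuristic-plus-bootstrapping idea: choose candidate techniques from the response type, screen them cheaply on a subsample, then spend the full data only on the survivors. The technique lists, thresholds and sample fraction below are arbitrary assumptions, and X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score

def candidate_models(y):
    # Heuristic: few distinct response values -> classification, else regression.
    if len(np.unique(y)) <= 20:
        return {"logistic": LogisticRegression(max_iter=1000),
                "random_forest": RandomForestClassifier(random_state=0)}
    return {"ridge": Ridge(),
            "random_forest": RandomForestRegressor(random_state=0)}

def staged_search(X, y, keep=1, sample_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(len(X) * sample_frac), replace=False)
    models = candidate_models(y)

    # Stage 1: cheap screen on a subsample.
    coarse = {name: cross_val_score(m, X[idx], y[idx], cv=3).mean()
              for name, m in models.items()}
    survivors = sorted(coarse, key=coarse.get, reverse=True)[:keep]

    # Stage 2: full-data evaluation of the survivors only.
    return {name: cross_val_score(models[name], X, y, cv=5).mean()
            for name in survivors}
```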

Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing. Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural fit for the cloud (a point that deserves yet another blog post of its own). The next generation of automated predictive modeling will be in the cloud from its inception.

Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions. Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer that they do not own; hence, as a practical matter, we should assume that the next-generation predictive modeling platform will work with open source machine learning libraries in languages such as R or Python.

We can’t eliminate the need for human expertise from predictive modeling.   But we can build tools that enable analysts to build better models.