IBM Adds Spark Support to Analytics Server

With its customary PR blitz, IBM announces that it has added Spark integration to several products, including SPSS.  IBM gets a small pat on the head for adding Spark support to its Analytics Server software, on the premise that something is better than nothing.

There is a very narrow pool of SPSS users who will benefit from this enhancement.  Spark integration is only available to the subset of SPSS users who license SPSS Modeler; most SPSS users work with SPSS Statistics.  Users must also license SPSS Analytics Server, a product that only runs on Hortonworks HDP or IBM BigInsights.

So, if you’re using the high-end version of the second most popular commercial analytics software, and you’re willing to pay extra to integrate with the third and fourth ranked Hadoop distributions, you’re in luck today.

Analytics Server is a software middle layer installed on Hortonworks or BigInsights; it selectively supports SPSS Modeler operations in Hadoop.  Previous versions ran through MapReduce only; IBM claims that the latest version runs through Spark when available, although the product documentation is surprisingly quiet on the subject.  There is no reference to Spark in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting; so the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

Analytics Server 2.1 partially supports most Modeler record and field operations.  Out of Modeler’s 37 data mining nodes, Analytic Server fully supports 8, partially supports 5 and does not support 24.  Among the missing:

  • Logistic Regression
  • k-Means
  • Support Vector Machines
  • PCA
  • Feature Selection
  • Anomaly Detection

Everyone understands that software engineering takes time, but IBM’s priorities are muddled.  Logistic regression, k-means, SVM and PCA are all available today in Spark’s open source library; I suspect IBM figures it can’t justify additional license fees for algorithms that anyone can use for free (*).  Clustering, PCA, feature selection and anomaly detection are precisely the kind of analyses users want to run on all of the data, not on a sample extracted back to a server.

(*) IBM is mistaken on that point, of course.  There are a lot of business users who want the power of Spark but don’t want to mess with a programming API.  These users would happily pay for a nice business user front end like SPSS Modeler, and they won’t care what happens in the back end.

Assuming that this product actually works — not guaranteed, given the sloppy and incomplete documentation — it is better than the previous version of Analytics Server, but that is a low bar.  Spark or no, IBM is way behind SAS in this space; I’m not a great believer in SAS’ proprietary approach to distributed in-memory analytics, but compared to IBM’s offering SAS wins on depth of features and breadth of platform support.  There are no published benchmarks, but I suspect that SAS wins on performance as well.

Also, SAS knows how to write documentation, which seems to be a problem for IBM.

To its credit, IBM’s Analytic Server offers more Spark capability than current offerings by Alpine, Alteryx and RapidMiner; but H2O and Skytree offer richer and better engines for serious machine learning.

As for the majority of SPSS users, wouldn’t it be great if SPSS could just connect to a Spark DataFrame?  Or if Spark could ingest SPSS datasets?
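
For the record, the plumbing for the first of those wishes is not exotic.  Here is a rough sketch of what a bridge from SPSS data to a Spark DataFrame could look like today, going through pandas; the pyreadstat reader and the file and column names are my own stand-ins for illustration, not anything IBM ships or supports.

```python
# Hypothetical SPSS-to-Spark bridge, not an IBM-supported path.
# Assumes the open source pyreadstat package and the Spark 1.3 Python API;
# the file name and column names are invented for illustration.
import pyreadstat
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spss-to-spark")
sqlContext = SQLContext(sc)

# pyreadstat returns a pandas DataFrame plus SPSS metadata (variable labels, formats)
pdf, meta = pyreadstat.read_sav("survey_responses.sav")

# Spark 1.3's createDataFrame accepts a pandas DataFrame directly
sdf = sqlContext.createDataFrame(pdf)
sdf.printSchema()
sdf.groupBy("region").count().show()
```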

Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not scale (such as Dell Statistica).  Nor does Big Data capability appear to affect the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS alone does not scale without Netezza or BigInsights; SAS only scales if you add one of its distributed in-memory back ends.  These products aren’t listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

Big Analytics Roundup (April 6, 2015)

Late posting today due to holiday travel.

In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.

The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barron’s, a must-read.

Analytic Software

Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot.  I’ll write a more detailed summary later this week.  Quick takes:  Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.

Apache Drill

Apache Drill announces Release 0.8.

Apache Spark

Analysis

In opensource.com, Jen Wike Hugar interviews key Spark contributor Reynold Xin.

Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud.  (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)

Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit.  Key takeaways: no, Matei is not a musician, and yes, he likes Nutella. 

Spark has clearly reached an inflection point when skeptical analyses begin to emerge.  Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and of the challenge of making those applications work well in MapReduce.  In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.

  • Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting.  Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East.  Who knew that Hadoop devotees are so sensitive?
  • In DataMill, Nicole Leskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
  • In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11.  His point will be lost on most readers.
  • Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.”   Note to Andrew: you can download the software here.

Spark Core

Matei Zaharia celebrates Spark’s fifth birthday with a brief history.

On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.

Spark Streaming

On the Databricks blog, Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kafka API available in Spark 1.3.

Databricks

Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud.  Case study available here.

Hadoop Ecosystem

In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity.  Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions).  Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS.  Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?

IBM

IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0.  BigInsights includes the usual Hadoop bits, plus:

  • BigSQL, a federation engine for SQL across relational databases and Hadoop
  • Big Sheets, a Datameer-like spreadsheet-on-Hadoop tool
  • SystemML, a home-grown machine learning library that runs in MapReduce
  • Text analytics capability
  • Big R, an interface that can push embarrassingly parallel R processing into Hadoop

Streaming and Real-Time Processing

On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.

Gartner Advanced Analytics Magic Quadrant 2015

Gartner’s latest Magic Quadrant for Advanced Analytics is out; for reference, the 2014 report is here; analysis from Doug Henschen here.  Key changes from last year:

  • Revolution Analytics moves from Visionary to Niche
  • Alpine and Microsoft move from Niche to Visionary
  • Oracle, Actuate and Megaputer drop out of the analysis
Gartner 2015 Magic Quadrant, Advanced Analytics

Gartner changed its evaluation criteria this year to reflect only “native” (i.e., proprietary) functionality; as a result, Revolution Analytics dropped from Visionary to Niche.  Other vendors, it seems, complained to Gartner that the old criteria were “unfair” to those who don’t leverage open source functionality.  If Gartner applies this same reasoning to other categories, it will have to drop coverage of Hortonworks and evaluate Cloudera solely on the basis of Impala.  🙂

Interestingly, Gartner’s decision to ignore open source functionality did not impact its evaluation of open source vendors RapidMiner and KNIME.

Based on modest product enhancements from Version 4.0 to Version 5.0, Alpine jumped from Niche to Visionary.   Gartner’s inclusion criteria for the category mandate that “a vendor must offer advanced analytics functionality as a stand-alone product…”; this appears to exclude Alpine, which runs in Pivotal Greenplum database (*).  Gartner’s criteria are flexible, however, and I’m sure it’s purely coincidental that Gartner analyst Gareth Herschel flacks for Alpine.

(*) Yes, I know — Alpine supports other databases and Hadoop as well.   The number of Alpine customers who use it in anything other than Pivotal can meet in Starbucks at one of the little tables in the back.

Gartner notes that Alpine “still lacks depth of functionality. Several model techniques are either absent or not fully developed within its tool.”  Well, yes, that does seem important.   Alpine’s promotion to Visionary appears to rest on its Chorus collaboration capability (originally developed by Greenplum).  It seems, however, that customers don’t actually use Chorus very much; as Gartner notes, “adoption is currently slow and the effort to boost it may divert Alpine’s resources away from the core product.”

Microsoft’s reclassification from Niche to Visionary rests purely on Azure Machine Learning (AML), a product still in beta at the time of the evaluation.  Hardly anyone uses MSFT’s “other” offering for analytics (SQL Server Analysis Services, or SSAS), apparently for good reason:

  • “The 2014 edition of SSAS lacks breadth, depth and usability, in comparison with the Leaders’ offerings.”
  • “Microsoft received low scores from SSAS customers for its willingness to incorporate their feedback into future versions of the product.”
  • “SSAS is a low-performing product (with poor features, little data exploration and questionable usability).”

On paper, AML is an attractive product, though it maxes out at 10GB of data; however, it seems optimistic to rate Microsoft as “Visionary” purely on the basis of a beta product.  “Visionary” is a stretch in any case — analytic software that runs exclusively in the cloud is by definition a niche product, as it appeals only to a certain segment of the market.  AML’s most attractive capabilities are its ability to run Python and R; as noted above, these no longer carry any weight with Gartner.

Dropping Actuate and Megaputer from the MQ simply recognizes the obvious.  It’s not clear why these vendors were included last year in the first place.

It appears that Oracle chose not to participate in the MQ this year.  Analytics that run in a single database platform are by definition niche products — you can’t use Oracle Advanced Analytics if you don’t have Oracle Database, and few customers will choose Oracle Database because it has Oracle Advanced Analytics.

 

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First, some definitions.  The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I’ve written previously about the importance of distributed computing for high-performance predictive analytics, why it’s difficult to deliver, and why it is potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor’s data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster’s built-in analytic functions).  While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in.  In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they’re effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H2O is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters.  Current functionality (Release 2.8.4.4) includes Cox Proportional Hazards modeling, Deep Learning, generalized linear models, gradient boosted classification and regression, k-Means clustering, Naive Bayes classifier, principal components analysis, and Random Forests.  The software also includes tooling for data transformation, model assessment and scoring.  Users interact with the software through a web interface, a REST API or the h2o package in R.  H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.
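
For a sense of the interface, here is a minimal sketch of driving H2O from Python; the same operations are available through the R package and the REST API.  The cluster location, file path and column names are stand-ins, and the estimator class reflects the current h2o Python module rather than anything specific to Release 2.8.

```python
# Minimal H2O sketch from Python; equivalent calls exist in the R package
# and the REST API.  File path and column names are invented for illustration.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()  # connect to (or start) an H2O cluster

# Load data into H2O's distributed frame store
churn = h2o.import_file("hdfs://cluster/data/churn.csv")
churn["churned"] = churn["churned"].asfactor()  # binomial target

# Fit a logistic regression (binomial GLM) across the cluster
glm = H2OGeneralizedLinearEstimator(family="binomial")
glm.train(x=["tenure", "monthly_charges", "support_calls"],
          y="churned", training_frame=churn)

print(glm.auc())
```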

H2O.ai provides commercial support for the open source software.  There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, Paypal and Nielsen.

MADLib 

Distributor: Pivotal Software

MADLib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7).  While primarily a captive project of Pivotal Software — most of the top contributors are Pivotal or EMC employees — the support for PostgreSQL qualifies it for this list.    MADLib includes rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.
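
To illustrate what a machine learning library with a SQL interface means in practice, here is a minimal sketch that trains a logistic regression with MADLib, issued from Python through psycopg2.  The connection string, table and column names are invented for illustration; the function signature follows the MADLib documentation.

```python
# Minimal MADLib sketch: the machine learning runs inside the database as
# ordinary SQL function calls.  Connection string, table and column names
# are invented for illustration.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=dba")
cur = conn.cursor()

# Train: MADLib writes the fitted model to the output table
cur.execute("""
    SELECT madlib.logregr_train(
        'patients',                          -- source table
        'patients_logregr',                  -- output (model) table
        'second_attack',                     -- dependent variable
        'ARRAY[1, treatment, trait_anxiety]' -- independent variables
    );
""")
conn.commit()

# Inspect the fitted coefficients
cur.execute("SELECT coef FROM patients_logregr;")
print(cur.fetchone())
```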

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project incepted in 2011 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community.  The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago.  Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce.  To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark.  For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that’s worth).  Several commercial vendors, including Predixion Software and RapidMiner, leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics.  Spark is a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics; currently (Release 1.2) it supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3.  Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLLib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers.  Thanks to a large and growing contributor community, Spark MLLib’s functionality is expanding faster than that of any other open source or commercial software listed in this article.
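
For a sense of the MLLib API, here is a minimal PySpark sketch of logistic regression using the Release 1.2-era RDD interface; the training points are invented for illustration.

```python
# Minimal MLLib sketch (Spark 1.2-era RDD API); training points are invented.
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="mllib-logit-demo")

# Each LabeledPoint is (label, feature vector); in practice this RDD would be
# parsed from files in HDFS and partitioned across the cluster.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.1]),
    LabeledPoint(1.0, [2.0, 1.0]),
    LabeledPoint(0.0, [0.5, 1.3]),
    LabeledPoint(1.0, [2.2, 0.9]),
])

model = LogisticRegressionWithSGD.train(data, iterations=100)
print(model.predict([1.8, 1.0]))
```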

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases.  Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum database.  (Alpine and Greenplum have common roots in the EMC ecosystem.)  Usability is the product’s key selling point, and the analytic feature set is relatively modest; however, Chorus’ collaboration and data cataloguing capabilities are unique.  Alpine’s customer list is growing; the published list does not yet include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database.  Embedded in SQL, analytics may be invoked from a range of applications, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS.  The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately).  The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes.  Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees.  The cadence of enhancements for this product is very slow; first released in May 2013, it has received a single maintenance release since then.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts.  For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, Naive Bayes classifier and two Spark MLLib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and Datastax Enterprise.  As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components, including Revolution R (an enhanced and commercially supported R distribution), a Windows IDE, integration tools and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface.  A little over a year ago, Revolution released version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce.  The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering.  Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; in Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC.  Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks).  In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated “Analysis” nodes within the Hadoop cluster.)  While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users leverage the algorithms from SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December 2011 with great fanfare.  To date the product lacks a single public reference customer; this could mean that SAS’ Marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between.  As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA’s inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture.  Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA.  SAS prices this software based on the number of cores in the servers upon which it is deployed; while I can’t disclose the list price per core, it’s fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first year fee.

Skytree

Product: Skytree Infinity

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013.  Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark.  Distributed algorithms can run as a free-standing cluster or co-located in Hadoop under YARN.  The product has a programming interface; the vendor claims ability to run from R, Weka, C++ and Python.   Neither Skytree’s modest list of algorithms nor its short list of public reference customers has changed in the past two years.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefits of parallel computing are speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.
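
A toy sketch of an embarrassingly parallel task, model scoring, in Python; the coefficients and rows are invented, and multiprocessing stands in for a real cluster.

```python
# Embarrassingly parallel scoring: each worker applies the same fixed model
# to its own chunk of rows, and the final result is just the concatenation
# of the per-worker results.  No communication between workers is needed.
from multiprocessing import Pool

COEF = [0.4, -1.2, 0.03]  # an already-trained model (stand-in)

def score_chunk(rows):
    return [sum(c * x for c, x in zip(COEF, row)) for row in rows]

if __name__ == "__main__":
    rows = [[1.0, 0.2, 30.0], [1.0, 0.9, 12.0], [1.0, 0.4, 55.0], [1.0, 0.1, 41.0]]
    chunks = [rows[:2], rows[2:]]            # pretend these live on two workers
    with Pool(2) as pool:
        scores = sum(pool.map(score_chunk, chunks), [])
    print(scores)
```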

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.
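
A sketch of the weighted-mean trick in Python; the three chunks stand in for data held by three workers.

```python
# Linear parallel example: each worker returns (mean, row count) for its
# chunk; the grand mean is the row-count-weighted mean of the worker means.
from multiprocessing import Pool

def local_stats(chunk):
    return sum(chunk) / len(chunk), len(chunk)

if __name__ == "__main__":
    chunks = [[2.0, 4.0, 6.0], [10.0], [1.0, 3.0]]  # data as it sits on three workers
    with Pool(3) as pool:
        stats = pool.map(local_stats, chunks)
    grand_mean = sum(m * n for m, n in stats) / sum(n for _, n in stats)
    print(grand_mean)  # identical to the mean of all six values
```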

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.
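
A sketch of the data parallel pattern in Python, with a naive moving average standing in for a real forecasting method; the store names and sales figures are invented.

```python
# Data parallel sketch: group the sales history by store, then compute each
# store's forecast independently.  The "model" is a naive three-period moving
# average, purely to show the structure.
from collections import defaultdict
from multiprocessing import Pool

def forecast_one_store(item):
    store_id, sales = item
    return store_id, sum(sales[-3:]) / min(3, len(sales))  # next-period forecast

if __name__ == "__main__":
    rows = [("store_1", 100), ("store_2", 80), ("store_1", 120),
            ("store_2", 90), ("store_1", 110), ("store_2", 85)]

    # The expensive step on a real cluster: shuffle the rows so that each
    # worker holds all of (and only) one store's history.
    by_store = defaultdict(list)
    for store_id, amount in rows:
        by_store[store_id].append(amount)

    with Pool(2) as pool:
        print(dict(pool.map(forecast_one_store, by_store.items())))
```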

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.
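
A toy sketch of complex parallelism in Python: distributed gradient descent for a one-variable least-squares fit.  On every iteration each worker computes a partial gradient on its own chunk, the partial results are combined, and the updated coefficient is sent back to every worker; that repeated exchange is exactly what embarrassingly parallel tasks do not need.  The data points are invented.

```python
# Complex parallelism sketch: iterative gradient descent with a combine step.
chunks = [
    [(1.0, 2.1), (2.0, 3.9)],   # (x, y) pairs held by worker 1
    [(3.0, 6.2), (4.0, 7.8)],   # (x, y) pairs held by worker 2
]

w, learning_rate, n = 0.0, 0.02, 4
for iteration in range(200):
    # Map step: each worker computes the gradient contribution of its chunk
    partial_grads = [sum(2 * x * (w * x - y) for x, y in chunk) for chunk in chunks]
    # Reduce step: combine partial gradients, update w, broadcast it back out
    w -= learning_rate * sum(partial_grads) / n

print(round(w, 3))  # close to 2.0, the slope that generated the data
```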

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADLib should run in any SQL environment, but Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.