Distributed Analytics: A Primer
Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.
The question is important for four reasons:
- Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
- In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
- Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
- Vendors make misleading claims about distributed analytics in the platforms they promote.
First, a quick definition of terms. We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.
The principal benefit of parallel computing is speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say). Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability. In principle, distributed computing can scale out without limit.
The ability to parallelize a task is inherent in the definition of the task itself. Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel. A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.
A second class of tasks requires a little more effort to parallelize. For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker. For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means. We call these tasks linear parallel.
There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way. We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data. For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.
While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider. For a task to be data parallel, the data must be organized in chunks that align with the business problem. Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency. The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred. This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.
There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all. In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence. Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.
Software for complex machine learning tasks must be expressly designed and coded to support distributed processing. While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster. For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node. You may be able to use those results in some way, but you will have to program the combination yourself.
Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing. SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine; nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics). IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.
The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).
Several observations about the contents of this table:
(1) There is currently no software for distributed analytics that runs on all distributed platforms.
(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases. Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.
(3) Some products, such as Netezza and Oracle, aren’t portable at all.
(4) In theory, MADLib should run in any SQL environment, but Pivotal database appears to be the primary platform.
To summarize key points:
— The ability to parallelize a task is inherent in the definition of the task itself.
— Most “learning” tasks in advanced analytics tasks are not embarrassingly parallel.
— Running a piece of software on a distributed platform is not the same as running it in distributed mode. Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.
Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.