Still More Comments on Microsoft and Revolution Analytics

Three full business days post-announcement, and stories continue to roll in.

Stephen Swoyer of TDWI writes an excellent summary of what Microsoft will likely do with Revolution Analytics.  He correctly notes, for example, that Microsoft is unlikely to develop a business user interface for R with code-generating capabilities (comparable to SAS Enterprise Guide).  This is difficult to do, and the demand is low; people who care about R tend to like working in a programming environment and value the ability to write their own code.  Business users, on the other hand, tend to be indifferent to the underlying code generated by the application.

Since Revolution’s Windows-based IDE requires some investment to keep it competitive, the most likely scenario is that Microsoft will add R to the Visual Studio suite.

Mr. Swoyer also notes that popular data warehouses (such as Oracle, IBM Netezza and Teradata Aster) can run R scripts in-database.  While this is true, what these databases cannot do is run R scripts in distributed mode, which limits the capability to embarrassingly parallel tasks.  Enabling R scripts to run in distributed databases — necessary for Big Data — is a substantial development project, which is why Revolution Analytics completed only two such ports (one to Hadoop and one to Teradata).

While Microsoft’s deep pockets give Revolution Analytics the means to support more platforms, it still needs the active collaboration of database vendors.  Oracle and Pivotal have their own strategies for R, so partnerships with those vendors are unlikely.

For some time now, commercial database vendors have attempted to differentiate their products by including machine learning engines.  Teradata was the first, in 1987, followed by IBM DB2 in 1992; SQL Server followed in the late 1990s, and Oracle acquired what was left of Thinking Machines in 1999 primarily so it could build the Darwin predictive analytics software into Oracle Database.  None of these efforts has gained much traction with working analysts, for several reasons: (1) database vendors generally sell to the IT organization and not to an organization’s end users; (2) as a result, most organizations do not link the purchase decisions for databases and analytics; (3) users of predictive analytics tend to be few in number compared to SQL and BI users, and their needs tend to get overlooked.

Bottom line: I think it is doubtful that Microsoft will pursue enabling R to run in relational databases other than SQL Server, and that it will drop Revolution’s “Write Once Deploy Anywhere” tagline, as it is impossible to deliver.

Elsewhere, Mr. Dan Woods doubles down on his argument that Microsoft should emulate Tibco, which is like arguing that the Seattle Seahawks should emulate the Jacksonville Jaguars.  Sorry, JAX; it just wasn’t your year.

 

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefits of parallel computing are speed and scalability: if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose constraints on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.
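To make this concrete, here is a toy Python sketch of an embarrassingly parallel scoring task.  The model (an intercept plus two coefficients) and the field values are invented for illustration, and local processes stand in for database nodes; the point is simply that each chunk is scored with no knowledge of any other chunk.

```python
from multiprocessing import Pool

def score_chunk(rows):
    # Apply a fixed (hypothetical) linear model to one chunk of (balance, tenure) rows.
    # No information from any other chunk is needed.
    return [-1.2 + 0.0004 * balance - 0.03 * tenure for balance, tenure in rows]

if __name__ == "__main__":
    chunks = [                                    # partitions of a toy customer table
        [(1200.0, 14), (300.0, 2)],
        [(8600.0, 30), (40.0, 1)],
        [(5100.0, 9), (720.0, 22)],
    ]
    with Pool(3) as pool:
        per_chunk = pool.map(score_chunk, chunks)  # chunks scored independently, in parallel
    scores = [s for chunk in per_chunk for s in chunk]   # result = simple concatenation
    print(scores)
```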

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a column in a distributed database by computing the mean and row count independently on each worker, then computing the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.
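A toy Python sketch of the grand-mean calculation: each “worker” sees only its own chunk of data and reports back just two numbers, a mean and a row count.

```python
def local_stats(chunk):
    # Each worker computes statistics on its own chunk only.
    return sum(chunk) / len(chunk), len(chunk)          # (local mean, local row count)

def grand_mean(worker_results):
    # Combine the per-worker results as a weighted average of the local means.
    total_rows = sum(n for _, n in worker_results)
    return sum(mean * n for mean, n in worker_results) / total_rows

chunks = [[2.0, 4.0, 6.0], [10.0], [1.0, 3.0]]          # data as the workers happen to hold it
print(grand_mean([local_stats(c) for c in chunks]))     # 4.333..., identical to the global mean
```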

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.
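Here is a minimal Python sketch of the same idea, with a naive four-week moving average standing in for a real forecasting model and invented store data; the interesting part is the regrouping step that gives each worker one store’s complete history.

```python
from collections import defaultdict
from multiprocessing import Pool

def forecast_one_store(store_and_history):
    # A worker holds all of the data for exactly one store; the "model" is a toy
    # moving average of the most recent weeks, just to keep the sketch self-contained.
    store_id, weekly_sales = store_and_history
    return store_id, sum(weekly_sales[-4:]) / min(len(weekly_sales), 4)

if __name__ == "__main__":
    raw_rows = [  # (store_id, weekly_sales) as the rows arrive, not grouped by store
        (101, 230.0), (205, 95.0), (101, 260.0), (333, 410.0),
        (205, 120.0), (101, 245.0), (333, 395.0), (205, 110.0),
    ]
    # The "shuffle": reorganize the data so each chunk holds one store's complete history.
    by_store = defaultdict(list)
    for store_id, sales in raw_rows:
        by_store[store_id].append(sales)

    with Pool() as pool:
        forecasts = pool.map(forecast_one_store, by_store.items())
    print(dict(forecasts))   # one independent forecast per store
```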

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  The first is that for a task to be data parallel, the data must be organized in chunks that align with the business problem; data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a step that adds latency.  The second is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred, which rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.
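To see why such tasks resist simple distribution, consider the following conceptual Python sketch of logistic regression fit by gradient descent over chunked data.  It is not any vendor’s implementation; it simply shows the repeated compute-combine-broadcast cycle that embarrassingly parallel tasks never need.

```python
import math

def partial_gradient(weights, chunk):
    # One worker's contribution: the gradient of the logistic loss on its chunk only.
    grad = [0.0] * len(weights)
    for features, label in chunk:
        z = sum(w * x for w, x in zip(weights, features))
        error = 1.0 / (1.0 + math.exp(-z)) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad

def fit(chunks, n_features, learning_rate=0.1, iterations=50):
    weights = [0.0] * n_features
    total_rows = sum(len(c) for c in chunks)
    for _ in range(iterations):                        # repeated rounds of communication
        partials = [partial_gradient(weights, c) for c in chunks]            # workers compute
        combined = [sum(g[j] for g in partials) / total_rows                 # coordinator combines
                    for j in range(n_features)]
        weights = [w - learning_rate * g for w, g in zip(weights, combined)] # broadcast new weights
    return weights

chunks = [
    [([1.0, 0.2], 0), ([1.0, 0.9], 1)],   # worker 1's data (toy values)
    [([1.0, 0.4], 0), ([1.0, 0.8], 1)],   # worker 2's data
]
print(fit(chunks, n_features=2))
```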

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression, you will end up with twenty-four logistic regression models, each developed separately on one node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support High-Performance Analytics (HPA) in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADlib should run in any SQL environment, but the Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Latest Forrester Analytics “Wave”

Forrester’s latest assessment of predictive analytics vendors is available here;  news reports summarizing the report are here, here and here.

A few comments:

(1) While the “Wave” analysis purports to be an assessment of “Predictive Analytics Solutions for Big Data”, it is actually an assessment of vendor scale.  You can save yourself $2,495 by rank-ordering vendors by revenue and get similar results.

(2) The assessment narrowly focuses on in-memory tools, which arbitrarily excludes some of the most powerful analytic tools in the market today.    Forrester claims that in-database analytics “tend to be oriented toward technical users and require programming or SQL”.  This is simply not true.  Oracle Data Mining and Teradata Warehouse Miner have excellent user interfaces, and IBM SPSS Modeler provides an excellent front-end to in-database analytics across a number of databases (including IBM Netezza, DB2, Oracle and Teradata).  Alpine Miner is a relatively new entrant that also has an excellent UI.

(3) Forrester exaggerates SAS’ experience and market presence in Big Data.  Most SAS customers work primarily with foundation products that do not scale effectively to Big Data; this is precisely why SAS developed the high performance products demonstrated in the analyst beauty show.   SAS has exactly one public reference customer for its new in-memory high-performance analytics software.

(4) SAS Enterprise Miner may be “easy to learn”, but it is a stretch to say that it has the capability to “run analytics in-database or distributed clusters to handle Big Data”.

(5) Designation of SAP as a “leader” in predictive analytics is also a stretch.  SAP’s Predictive Analysis Library is a new product with some interesting capabilities; however, it is largely unproven and SAP lacks a track record in this area.

(6) The omission of Oracle Data Mining from this report makes no sense at all.

(7) Forrester’s scorecard gives every vendor the same score for pricing and licensing.  That’s like arguing that a Bentley and a Chevrolet are equally suitable as family cars, but the Bentley is preferable because it has leather seats.    TCO matters.  As I reported here, a firm that tested SAS High Performance Analytics and reported good results did not actually buy the product because, as a company executive notes, “this stuff is expensive.”

Advanced Analytics in Hadoop, Part Two

In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop.    But what if the analytic methods you need are not implemented in the current Mahout release?  The short answer is that you are either going to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.

Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout.  Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.

Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout but without significant enhancement.   The exception is IBM, whose InfoSphere BigInsights Hadoop distribution incorporates a suite of text mining features nicely demonstrated in this series of videos.  IBM Research has also developed System ML, a suite of machine learning algorithms written in MapReduce, although as of this writing System ML is a research project and not generally available software.

To simplify program development in MapReduce for analysts, Revolution Analytics launched its RHadoop open source project earlier this year.  RHadoop’s rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics.   This example shows how an rmr user can implement k-means clustering in 28 lines of code; a comparable procedure, run in Hortonworks with a combination of Python, Pig and Java, requires 100 lines of code.
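The examples cited above are written in R and in Python/Pig/Java, respectively; as a language-neutral illustration of why k-means maps so naturally onto MapReduce, here is a toy Python sketch of the two steps, iterated locally: the map step assigns each point to its nearest centroid, and the reduce step recomputes each centroid as the mean of its assigned points.

```python
def nearest(point, centroids):
    # Index of the centroid closest to this point (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # "Map": emit (centroid_index, point) pairs.
        assignments = [(nearest(p, centroids), p) for p in points]
        # "Reduce": group by centroid index and average the points in each group.
        new_centroids = []
        for i in range(len(centroids)):
            members = [p for idx, p in assignments if idx == i]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[i])   # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]]
print(kmeans(points, centroids=[[0.0, 0.0], [6.0, 6.0]]))
```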

For analytic use cases where the primary concern is to implement scoring in Hadoop, Zementis offers the Universal PMML Plug-In(TM) for Datameer.  This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer.   According to Michael Zeller, CEO of Zementis, the Plug-In can actually be deployed into any Hadoop distribution.  There is an excellent video about this product from the Hadoop Summit at this link.

Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources.  It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth.  Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.

Finally, a word on what SAS is doing with Hadoop.  Prior to January 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website.   In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program.  While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS.  There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture incent it to preach moving the data to the analytics and not the other way around.

Advanced Analytics in Hadoop, Part One

This is the first of a two-part post on the current state of advanced analytics in Hadoop.  In this post, I’ll cover some definitions, the business logic of advanced analytics in Hadoop, and summarize the current state of Mahout.  In a second post, I’ll cover some alternatives to Mahout, currently available and in the pipeline.

For starters, a few definitions.

I use the term advanced analytics to cover machine learning tools (including statistical methods) for the discovery and deployment of useful patterns in data.   Discovery means the articulation of patterns as rules or mathematical expressions;  deployment means the mobilization of discovered patterns to improve a business process.  Advanced analytics may include supervised learning or unsupervised learning, but not queries, reports or other analysis where the user specifies the pattern of interest in advance.  Examples of advanced analytic methods include decision trees, neural networks, clustering, association rules and similar methods.

By “In Hadoop” I mean that the complete analytic cycle (from discovery to deployment) runs in the Hadoop environment, with no data movement outside of Hadoop.

Analysts can and do code advanced analytics directly in MapReduce.  For some insight into the challenges this approach poses, review the slides from a recent presentation at Strata by Allstate and Revolution Analytics.

The business logic for advanced analytics in Hadoop is similar to the logic for in-database analytics.   External memory-based analytic software packages (such as SAS or SPSS) provide easy-to-use interfaces and rich functionality but they require the user to physically extract data from the datastore.  This physical data movement takes time and effort, and may force the analyst to work with a sample of the data or otherwise modify the analytic approach.  Moreover, once the analysis is complete, deployment back into the datastore may require a complete extract and reload of the data, custom programming or both.  The end result is an extended analytic discovery-to-deployment cycle.

Eliminating data movement radically reduces analytic cycle time.  This is true even when actual run time for model development in an external memory-based software package is faster, because the time needed for data movement and model deployment tends to be much greater than the time needed to develop and test models in the first place.  This means that advanced analytics running in Hadoop do not need to be faster than external memory-based analytics; in fact, they can run slower than external analytic software and still reduce cycle time since the front-end and back-end integration tasks are eliminated.

Ideal use cases for advanced analytics in Hadoop have the following profile:

  • Source data is already in Hadoop
  • Applications that consume the analytics are also in Hadoop
  • Business need to use all available data (i.e., sampling is not acceptable)
  • Business need for minimal analytic cycle time; this is not the same as a need for minimal score latency, which can be accomplished without updating the model itself

The best use cases for advanced analytics running in Hadoop are dynamic applications where the solution itself must be refreshed constantly.  These include microclustering, where there is a business need to update the clustering scheme whenever a new entity is added to the datastore; and recommendation engines, where each new purchase by a customer can produce new recommendations.

Apache Mahout is an open source project to develop scalable machine learning libraries whose core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.   Mahout currently supports classification, clustering, association, dimension reduction, recommendation and lexical analysis use cases.   Consistent with the ideal use cases described above, the recommendation engines and clustering capabilities are the most widely used in commercial applications.

As of Release 0.7 (June 16, 2012), the following algorithms are implemented:

Classification: Logistic Regression, Bayesian, Random Forests, Online Passive Aggressive and Hidden Markov Models

Clustering: Canopy, K-Means, Fuzzy K-Means, Expectation Maximization, Mean Shift, Hierarchical, Dirichlet Process, Latent Dirichlet Allocation, Spectral, Minhash, and Top Down

Association: Parallel FP-Growth

Dimension Reduction: Singular Value Decomposition and Stochastic Singular Value Decomposition

Recommenders: Distributed Item-Based Collaborative Filtering and Collaborative Filtering with Parallel Matrix Factorization

Lexical Analysis: Collocations

For a clever introduction to machine learning and Mahout, watch this video.

For more detail, review this presentation on Slideshare.

There are no recently released books on Mahout.  This book is two releases out of date, but provides a good introduction to the project.

Mahout is currently used for commercial applications by Amazon, Buzzlogic, Foursquare, Twitter and Yahoo, among others.   Check the Powered by Mahout page for an extended list.

Next post: Alternatives to Mahout, some partial solutions and enablers, and projects in the pipeline.

Fact-Check: SAS and Greenplum

Does SAS run “inside” Greenplum?  Can existing SAS programs run faster in Greenplum without modification?  Clients say that their EMC rep makes such claims.

The first claim rests on confusion about EMC Greenplum’s product line.  It’s important to distinguish between Greenplum Database and Greenplum DCA.  Greenplum DCA is a rack of commodity blade servers which can be configured with Greenplum Database running on some of the blades and SAS running on the other blades.  For most customers, a single DCA blade provides insufficient computing power to support SAS, so EMC and SAS typically recommend deployment on multiple blades, with SAS Grid Manager implemented for workload management.   This architecture is illustrated in this white paper on SAS’ website.

As EMC’s reference architecture clearly illustrates, SAS does not run “inside” Greenplum database (or any other database); it simply runs on server blades that are co-located in the same physical rack as the database.  The SAS instance installed on the DCA rack works just like any other SAS instance installed on freestanding servers.  SAS interfaces with Greenplum Database through a SAS/ACCESS interface, which is exactly the same way that SAS interacts with other databases.

Does co-locating SAS and the database in the same rack offer any benefits?  Yes, because when data moves back and forth between SAS and Greenplum Database, it does so over a dedicated 10Gb Ethernet connection.   However, this is not a unique benefit — customers can implement a similar high-speed connection between a free-standing instance of SAS and any data warehouse appliance, such as IBM Netezza.

To summarize, SAS does not run “inside” Greenplum Database or any other database; moreover, SAS’  interface with Greenplum is virtually the same as SAS’ interface with any other supported database.  EMC offers customers the ability to co-locate SAS in the same rack of servers as the Greenplum Database, which expedites data movement between SAS and the database, but this is a capability that can be replicated cheaply in other ways.

The second claim — that SAS programs run faster in Greenplum DCA without modification — requires more complex analysis.   For starters, though, keep in mind that SAS programs always require at least some modification when moved from one SAS instance to another, if only to update SAS libraries and adjust for platform-specific options.  Those modifications are small, however, so let’s set them aside and grant EMC some latitude for sales hyperbole.

To understand how existing SAS programs will perform inside DCA, we need to consider the building blocks of those existing programs:

  1. SAS DATA Steps
  2. SAS PROC SQL
  3. SAS Database-Enabled PROCs
  4. SAS Analytic PROCs (PROC LOGISTIC, PROC REG, and so forth)

Here’s how SAS will handle each of these workloads within DCA:

(1) SAS DATA Steps: SAS attempts to translate SAS DATA Step statements into SQL.   When this translation succeeds, SAS submits the SQL expression to Greenplum Database, which runs the query and returns the result set to SAS.  Since SAS DATA Step programming includes many concepts that do not translate well to SQL, in most cases SAS will extract all required data from the database and run the required operations as a single-threaded process on one of the SAS nodes.

(2) SAS PROC SQL: SAS submits the embedded SQL to Greenplum Database, which runs the query and returns the result set to SAS.   The SAS user must verify that the embedded SQL expression is syntactically correct for Greenplum.

(3) SAS Database-Enabled PROCs: SAS converts the user request to database-specific SQL and submits it to Greenplum Database, which runs the query and returns the result set to SAS.

(4) SAS Analytic PROCs:  In most cases, SAS runs the PROC on one of the server blades.  A limited number of SAS PROCs are automatically enabled for Grid Computing; these PROCs will run multi-threaded.

In each case, the SAS workload runs in the same way inside DCA as it would if implemented in a free-standing SAS instance with comparable computing power.   Existing SAS programs are not automatically enabled to leverage Greenplum’s parallel processing; the SAS user must explicitly modify the SAS program to exploit Greenplum Database just as they would when using SAS with other databases.

So, returning to the question: will existing SAS programs run faster in Greenplum DCA without modification?  Setting aside minor changes when moving any SAS program, the performance of existing programs when run in DCA will be no better than what would be achieved when SAS is deployed on competing hardware with comparable computing specifications.

SAS users can only realize radical performance improvements when they explicitly modify their programs to take advantage of in-database processing.   Greenplum has no special advantage in this regard; conversion effort is similar for all databases supported by SAS.

Deconstructing SAS Analytics Accelerator

Now and then I get queries from clients about SAS Analytics Accelerator, an in-database product that SAS supports exclusively with Teradata database.  SAS does not publicize sales figures for individual products, so we don’t know for sure how widely Analytics Accelerator is used.  There are some clues, however.

  • Although SAS released this product in 2008, it has published no customer success stories.  Follow the customer success link from SAS’ overview for Analytics Accelerator and read the stories; none of them describes using this product.
  • SAS has never expanded platform support beyond Teradata.  SAS is a customer-driven company that does not let partner and channel considerations outweigh customer needs, so the absence of a broader product rollout implies an absence of market demand.
  • Unlike the rest of its product line, SAS has not significantly enhanced Analytics Accelerator since the product was introduced four years ago.

SAS supports seven in-database Base SAS PROCs (FREQ, MEANS, RANK, REPORT, SORT, SUMMARY and TABULATE) on many databases, including Aster, DB2, Greenplum, Netezza, Oracle and Teradata.  Analytics Accelerator supports seven SAS/STAT PROCs (CORR, CANCORR,  FACTOR, PRINCOMP, REG, SCORE, and VARCLUS), one SAS/ETS PROC (TIMESERIES), three Enterprise Miner PROCs (DMDB, DMINE and DMREG) plus macros for sampling and binning for use with Enterprise Miner.   Customers must license SAS/STAT, SAS/ETS and SAS Enterprise Miner to use the in-database capabilities.

On the surface, these appear to be features that offer SAS users significantly better integration with Teradata than with other databases.  When we dig beneath the surface, however, the story is different.

Anyone familiar with SAS understands from a quick look at the supported PROCs that it’s an odd list;  it includes some rarely used PROCs and omits PROCs that are frequently used in business analytics (such as PROC LOGISTIC and PROC GLM).  I generally ask SAS clients to list the PROCs they use most often; when they do so, they rarely list any of the PROCs supported by Analytics Accelerator.

The SAS/STAT PROCs supported by Analytics Accelerator do not actually run in Teradata; the PROC itself runs on a SAS server, just like any other SAS PROC.   Instead, SAS passes a request to Teradata to build a Sum of Squares Cross Product (SSCP) matrix.  SAS then pulls the SSCP matrix over to the SAS server, loads it into memory and proceeds with the computational algorithm.

This is a significant performance enhancement, since MPP databases are well suited to matrix operations and the volume of data moved over the network is reduced to a minimum.   But here’s the kicker: any SAS user can construct SSCP matrices in an MPP database (such as IBM Netezza) and import them into SAS/STAT.  You don’t need to license SAS Analytics Accelerator; every SAS customer who licenses SAS/STAT already has this capability.
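A quick numerical illustration of the point, using Python and numpy as a stand-in for what the database does across its nodes: the full SSCP matrix is simply the sum of the SSCP matrices computed independently on each slice of rows, so every node can build its piece and the partial results just add up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # pretend this table is spread across database nodes
chunks = np.array_split(X, 4)         # four "nodes", each holding a slice of the rows

sscp_full = X.T @ X                                      # single-machine answer
sscp_distributed = sum(chunk.T @ chunk for chunk in chunks)   # per-node pieces, summed

print(np.allclose(sscp_full, sscp_distributed))          # True: the partial SSCPs add up exactly
```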

This explains, in part, the unusual selection of PROCs:  SAS chose PROCs that could be included with minimal R&D investment.  This is a smart strategy for SAS, but says little about the value of the product for users.

Since SAS/STAT does not currently export PMML documents for downstream integration, in-database support for PROC SCORE is intriguing; once again, though,  the devil is in the details.  Analytics Accelerator converts the SAS model to a SQL expression and submits it to the database; unfortunately, this translation only supports linear models.  SAS users can score with models developed in thirteen different SAS PROCs  (ACECLUS, CALIS, CANCORR, CANDISC, DISCRIM, FACTOR, PRINCOMP, ORTHOREG, QUANTREG, REG, ROBUSTREG, TCALIS and VARCLUS), but with the exception of PROC REG these are rarely used in predictive modeling for business analytics.  SAS seems to have simply selected those PROCs whose output is easy to implement in SQL, regardless of whether or not these PROCs are useful.
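To see why only linear models survive this translation, consider what the generated SQL must look like: a linear model collapses to a single weighted-sum expression that any SQL engine can evaluate.  The Python sketch below, with hypothetical table and column names and made-up coefficients, shows the kind of expression such a translation produces.

```python
# Turn a fitted linear model's coefficients into a SQL scoring expression.
# Table, column names and coefficient values are hypothetical.
coefficients = {"intercept": 12.5, "income": 0.0031, "tenure_months": -0.42}

terms = [f"{coefficients['intercept']}"]
terms += [f"({value} * {column})" for column, value in coefficients.items()
          if column != "intercept"]
score_sql = f"SELECT customer_id, {' + '.join(terms)} AS score FROM customers"
print(score_sql)
# SELECT customer_id, 12.5 + (0.0031 * income) + (-0.42 * tenure_months) AS score FROM customers
```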

Overall, Analytics Accelerator lacks a guiding design approach, and reflects little insight into actual use cases; instead, SAS has cobbled together a collection of features that are easy to implement.   When clients consider the tasks they actually want to do in SAS, this product offers little value.

Leverage the In-Database Capabilities of Analytic Software

Many analysts have a strong preference for commercial analytic workbenches such as SAS or SPSS.  Both packages are widely used, respected by analysts, and each has strong advocates.  The purpose of this article is to point out that analytic users can benefit from the performance and simplicity of IBM Netezza in-database analytics without abandoning their preferred interface.

Let’s start with SAS.  One of the most frequent complaints from IT organizations about SAS users is the propensity for users to require significant amounts of storage space for SAS data sets.  A leading credit card issuer, for example, reports that users have more than one hundred terabytes of SAS files – and the volume is growing rapidly.

But SAS users can store data tables in the Netezza appliance and run data preparation steps against those tables using the SAS Pass-Through Facility.  In addition to centralizing storage, reducing data movement and simplifying security, users can realize 100X improvements in program runtime.

In-database PROCs are another SAS feature.  SAS currently enables in-database execution of FREQ, MEANS, RANK, REPORT, SORT, SUMMARY, and TABULATE in a number of databases and data warehouse appliances, including Netezza.   For the user, database-enabled PROCs operate like any other SAS PROC — but instead of running on the server, the PROC runs in the database.

SAS supports a number of other in-database capabilities through SAS/ACCESS, including the ability to pass functions and formats to Netezza, the ability to create temporary tables, and the ability to leverage Netezza’s bulk load/unload facility.

SAS users can make calls to Netezza in-database functions by invoking Netezza In-Database Analytics through PROC SQL.  In-database functions are far more efficient for building analytic data sets, data cleansing and enhancement.  Customers who have implemented this approach have observed remarkable improvements in overall runtime: jobs that ran in hours now run in minutes.

SAS customers using SAS Enterprise Miner or SAS Model Manager can also benefit from SAS Scoring Accelerator, which enables an Enterprise Miner user to export a scoring function that runs in Netezza.  This capability helps the organization avoid a custom programming task, and enables the analyst to easily hand off model scoring to a production operation.

IBM SPSS Modeler also offers the capability to work directly with database tables in Netezza; like SAS, it can be configured to minimize storage on the SPSS server.  Modeler also offers Pushback SQL capabilities, which enable the user to perform functions within the Netezza appliance, including table joins, aggregation, selections, sorting, field derivation, field projection and scoring.  While the in-database functional capabilities of the two packages are similar, SPSS accomplishes this entirely within the graphical environment of the Stream canvas.

As with SAS, SPSS Modeler users can leverage Netezza in-database analytics to build, score and store predictive models, either through custom nodes or out-of-the box integration in Release 15.0.  Again, a key difference between SAS and SPSS is that while SPSS Modeler surfaces Netezza in-database analytics through the graphical user environment, SAS users must have programming and SQL skills.

To summarize, leading commercial software packages like SAS and SPSS already offer the ability to manage files, perform data preparation, build models and run scoring processes entirely within the Netezza appliance.  Users of these tools can significantly improve runtime performance by leveraging these existing capabilities.