Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not (such as Dell Statistica).  Nor does Big Data capability appear to affect the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS does not scale on its own without Netezza or BigInsights; SAS scales only if you add one of its distributed in-memory back ends.  None of these products is listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

Big Analytics Roundup (March 23, 2015)

This week, Spark Summit East produced a deluge of news and analysis on Apache Spark and Databricks.  Also in the news: a couple of ventures landed funding, SAP released software and SAS soft-launched something new for SAS Visual Analytics.

Analytic Startups

Venture Capital Dispatch on WSJ.D reports that Andreessen Horowitz has invested $7.5 million in AMPLab spinout Tachyon Nexus.  Tachyon Nexus supports the eponymous Tachyon project, a memory-centric storage layer that runs underneath Apache Spark or independently.

Social media mining venture Dataminr pulls in $130 million in Series D financing, demonstrating that the real money in analytics is in applications, not algorithms.

Apache Flink

On the Flink project blog, Fabian Hueske posts an excellent article that describes how joins work in Flink.

Apache Spark

ADTMag rehashes the tired debate about whether Spark and Hadoop are “friends” or “foes”.  Sounds like teens whispering in the hallways of Silicon Valley High.  Spark works with HDFS, and it works with other datastores; it all depends on your use case.  If that means a little less buzz for Hadoop purists, get over it.

To that point, Matt Kalan explains how to use Spark with MongoDB on the Databricks blog.

A paper published by a team at Berkeley summarizes results from Spark benchmark testing and draws surprising conclusions.

In other commentary about Spark:

  • TechCrunch reports on the growth of Spark.
  • TechRepublic wonders if anything can dim Spark.
  • InfoWorld lists five reasons to use Spark for Big Data.

In VentureBeat, Sharmila Mulligan relates how ClearStory Data’s big bet on Spark paid off, without ever explaining the nature of the payoff.  ClearStory has a nice product, but it seems a bit too early for a victory lap.

On the Spark blog, Justin Kestelyn describes exactly-once Spark Streaming with Apache Kafka, a new feature in Spark 1.3.

Databricks

Doug Henschen chides Ion Stoica for plugging Databricks Cloud at Spark Summit East, hinting darkly that some Big Data vendors are threatened by Spark and trying to plant FUD about it.  Vendors planting FUD about competitors that threaten them: who knew that people did such things?  It’s not clear what revenue model Henschen thinks Databricks should pursue; as Hortonworks’ numbers show, “contributing to open source” alone is not a viable business model.  If those Big Data vendors are unhappy that Databricks Cloud competes with what they offer, there is nothing to stop them from embracing Spark and standing up their own cloud service.

In other news:

  • On the Databricks blog, the folks from Uncharted Software describe PanTera, cool visualization software that runs in Databricks Cloud.
  • Rob Marvin of SD Times rounds up new product announcements from Spark Summit East.
  • In PCWorld, Joab Jackson touts the benefits of Databricks Cloud.
  • ConsumerElectronicsNet recaps Databricks’ announcement of the Jobs feature for Databricks Cloud, plus other news from Spark Summit East.
  • On ZDNet, Toby Wolpe reviews the new Jobs feature for production workloads in Databricks Cloud.
  • On the Databricks blog, Abi Mehta announces that Tresata’s TEAK application for AML will be implemented on Databricks Cloud.  Media coverage here, here and here.

Geospatial

MemSQL announced geospatial capabilities for its distributed in-memory NewSQL database.

J. Andrew Rogers asks why geospatial databases are hard to build, then answers his own question.

RapidMiner

Butler Analytics publishes a favorable review of RapidMiner.

SAP

SAP released a new on-premises version of Lumira Edge for visualization, adding to the list of software that is not as good as Tableau.  SAP also released Predictive Analytics 2.0, a product that marries the toylike SAP Predictive Analysis with KXEN InfiniteInsight, acquired in 2013.  According to SAP, Predictive Analytics 2.0 is a “single, unified analytics product” with two work environments, which sounds like SAP has packaged two different code bases into a marketing bundle with a common datastore.  Going for a “three-fer”, SAP adds Lumira Edge to the bundle as well.

SAS

American Banker reports that SAS has “launched” SAS Transaction Monitoring Optimization for AML scenario testing; in this case, “launch” means that marketing collateral is available.  The product is said to run on top of SAS Visual Analytics, which itself runs on top of SAS LASR Server, SAS’ “other” distributed in-memory platform.

Automated Predictive Modeling

A colleague asks: can we automate predictive modeling?

How we answer the question depends on the context.   Consider the two variations on the question below, with more precise wording:

  1. Can we completely eliminate the need for expertise in predictive modeling — so that an “ordinary business user” can do it?
  2. Can we make expert analysts more productive by automating certain repetitive tasks?

The first form of the question — the search for “business user” analytics — is a common vision among software marketing folk and industry analysts; it is based on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics.   That premise is largely false, for reasons that warrant a separate blog post; for now, let’s just stipulate that the answer is no, it is not possible to eliminate human expertise from predictive modeling, for the same reason that robotic surgery does not eliminate the need for cardiologists.

However, if we focus on the second form of the question and concentrate on how to make expert analysts more productive, the situation is much more promising.  Many data preparation tasks are easy to automate; these include such tasks as detecting and eliminating zero-variance columns, treating missing values and handling outliers.  The most promising area for automation, however, is in model testing and assessment.
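To make this concrete, here is a minimal sketch in R of this kind of automated data preparation.  The median imputation and the 1.5 * IQR winsorizing rule are illustrative choices, not recommendations:

    # Sketch of automated data preparation (illustrative rules)
    auto_prep <- function(df) {
      # Drop zero-variance columns (one distinct non-missing value, or none)
      zv <- sapply(df, function(x) length(unique(na.omit(x))) <= 1)
      df <- df[, !zv, drop = FALSE]
      for (col in names(df)) {
        if (is.numeric(df[[col]])) {
          # Impute missing values with the column median
          df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
          # Winsorize outliers beyond 1.5 * IQR from the quartiles
          q  <- quantile(df[[col]], c(0.25, 0.75))
          lo <- q[[1]] - 1.5 * IQR(df[[col]])
          hi <- q[[2]] + 1.5 * IQR(df[[col]])
          df[[col]] <- pmin(pmax(df[[col]], lo), hi)
        }
      }
      df
    }

Nothing here requires human judgment; rules like these can run unattended at the front of any test plan.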

Optimizing a predictive model requires experimentation and tuning.  For any given problem, there are many available modeling techniques, and for each technique there are many ways to specify and parameterize a model.  For the most part, trial and error is the only way to identify the best model for a given problem and data set.  (The No Free Lunch theorem formalizes this concept.)

Since the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model.  In applied predictive analytics, however, the analyst’s time is strictly limited; a client in the marketing services industry reports an SLA of thirty minutes or less to build a predictive model.  Such constraints leave little room for experimentation.

Analysts tend to deal with this problem by settling for sub-optimal models, arguing that models need only be “good enough,” or defending the use of one technique above all others.  As clients grow more sophisticated, however, these tactics become ineffective.  In high-stakes hard-money analytics — such as trading algorithms, catastrophic risk analysis and fraud detection — small improvements in model accuracy have a bottom-line impact, and clients demand the best possible predictions.

Automated modeling techniques are not new.  Before Unica launched its successful suite of marketing automation software, the company’s primary business was advanced analytics, with a particular focus on neural networks.  In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model.   Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four different types of predictive models.  Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.

Two other commercial attempts at automated predictive modeling date from the late 1990s.  The first, MarketSwitch, was less than successful.  MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers”.  Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine and replaced the “automated modeling” capability with outsourced analytic services.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization.  The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for marketing analytics, which it attempted to sell directly to C-level executives.  This effort was modestly successful, leading to the sale of the company to SAP in 2013 for an estimated $40 million.

In the last several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products.  In 2010, SAS introduced SAS Rapid Modeler, an add-in to SAS Enterprise Miner.  Rapid Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection and model selection.  The user specifies a data set and response measure; Rapid Modeler determines whether the response is continuous or categorical, and uses this information together with other diagnostics to test a range of modeling techniques.  The user can control the scope of techniques to test by selecting basic, intermediate or advanced methods.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster and Auto Numeric nodes.  The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning and variable recasting.   The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules and set limits on model training.

All of the software products discussed so far are commercially licensed.  There are two open source projects worth noting: the caret package in open source R, and the MLBase project.  The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques.  The package includes pre-processing tools to support tasks such as dummy coding, detecting zero-variance predictors and identifying correlated predictors, as well as tools to support model training and tuning.  The training function in caret currently supports 149 different modeling techniques; it supports parameter optimization within a selected technique, but does not optimize across techniques.  To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results, as in the sketch below.
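A minimal version of such a script might look like this; it assumes the rpart and randomForest packages are installed (caret supplies its own knn implementation):

    library(caret)

    ctrl <- trainControl(method = "cv", number = 5)
    methods <- c("rpart", "knn", "rf")   # three candidate techniques

    # Train each candidate with caret's built-in parameter tuning
    fits <- lapply(methods, function(m) {
      set.seed(42)   # same cross-validation folds for every technique
      train(Species ~ ., data = iris, method = m,
            trControl = ctrl, tuneLength = 3)
    })

    # Capture the best cross-validated accuracy for each technique
    accuracy <- sapply(fits, function(f) max(f$results$Accuracy))
    names(accuracy) <- methods
    print(sort(accuracy, decreasing = TRUE))

Each call to train() tunes parameters within one technique; the surrounding script supplies the missing optimization across techniques.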

MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group, is an ambitious effort to develop a scalable machine learning platform on Apache Spark.  The ML Optimizer seeks to simplify machine learning problems for end users by automating the model selection task, so that the user need only specify a response variable and a set of predictors.  The Optimizer is still in active development, with an alpha release expected in 2014.

What have we learned from these various attempts to implement automated predictive modeling?  Commercial startups like KXEN and MarketSwitch succeeded only marginally because they oversold the concept as a means to replace the analyst altogether.  Most organizations understand that human judgment plays a key role in analytics, and they aren’t willing to entrust hard-money analytics entirely to a black box.

What will the next generation of automated modeling platforms look like?  There are seven key features that are critical for an automated modeling platform:

  • Automated model-dependent data transformations
  • Optimization across and within techniques
  • Intelligent heuristics to limit the scope of the search
  • Iterative bootstrapping to expedite search
  • Massively parallel design
  • Platform agnostic design
  • Custom algorithms

Some methods require data to be transformed in specific ways; neural nets, for example, typically work with standardized predictors, while Naive Bayes and CHAID require all predictors to be categorical.  The analyst should not have to perform these operations manually; instead, the transformation operations should be built into the test plan script and run automatically, ensuring that the maximum number of techniques can be evaluated for any given data set.
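A sketch of what this might look like in code; the mapping from technique to transformation is an illustrative assumption, and prep_for is a hypothetical helper, not part of any package:

    # Sketch: apply the transformations a given technique requires
    prep_for <- function(method, df, response) {
      preds <- setdiff(names(df), response)
      num <- preds[sapply(df[preds], is.numeric)]
      if (method %in% c("nnet", "knn")) {
        # Neural nets and k-NN: standardize numeric predictors
        df[num] <- scale(df[num])
      } else if (method %in% c("nb", "chaid")) {
        # Naive Bayes and CHAID: discretize numeric predictors
        df[num] <- lapply(df[num], function(x) cut(x, breaks = 4))
      }
      df
    }

    iris_nn <- prep_for("nnet", iris, response = "Species")
    iris_nb <- prep_for("nb", iris, response = "Species")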

To find the best predictive model, we need to be able to search across techniques and to tune parameters within techniques.  Potentially, this means a massive number of model train-and-test cycles; we can use heuristics to limit the scope of techniques to be evaluated based on characteristics of the response measure and the predictors.  (For example, a categorical response measure rules out a number of techniques, and a continuous response measure rules out a different set.)  Instead of a brute-force search for the best technique and parameterization, a “bootstrapping” approach can use information from early iterations to specify subsequent tests.
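Here is a sketch of both ideas in R with caret: a heuristic filter keyed to the response type, followed by a coarse-then-fine search that uses the first pass to narrow the second.  The method lists and grid sizes are illustrative assumptions:

    library(caret)

    # Heuristic: rule out techniques based on the type of the response
    candidates <- function(y) {
      if (is.factor(y)) c("rpart", "knn", "rf")   # classification techniques
      else c("lm", "rpart", "rf")                 # regression techniques
    }

    ctrl <- trainControl(method = "cv", number = 5)

    # Stage 1: coarse search over a small grid
    set.seed(42)
    coarse <- train(Species ~ ., data = iris, method = "rf",
                    trControl = ctrl, tuneLength = 3)

    # Stage 2: refine the grid around the best value found in stage 1
    best <- coarse$bestTune$mtry
    fine <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl,
                  tuneGrid = data.frame(mtry = unique(pmax(1, best + (-1:1)))))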

Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing.  Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural fit for the cloud (a point that deserves yet another blog post of its own).  The next generation of automated predictive modeling will be in the cloud from its inception.
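On a single machine, caret already illustrates the principle: register a parallel backend and the resampling loop runs concurrently.  A sketch, assuming the doParallel and randomForest packages are installed:

    library(caret)
    library(doParallel)

    cl <- makeCluster(4)      # four local worker processes
    registerDoParallel(cl)

    # With a parallel backend registered, caret runs resampling in parallel
    fit <- train(Species ~ ., data = iris, method = "rf",
                 trControl = trainControl(method = "cv", number = 5,
                                          allowParallel = TRUE))
    stopCluster(cl)

The same train-and-test cycles are embarrassingly parallel at cluster scale, which is what makes the cloud argument compelling.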

Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions.  Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer they do not own; as a practical matter, we should assume that the next-generation predictive modeling platform will work with open source machine learning libraries in languages such as R or Python.

We can’t eliminate the need for human expertise from predictive modeling.   But we can build tools that enable analysts to build better models.

What’s Next for SAS?

First, some background.

— SAS is a privately held company.  Founder and CEO Jim Goodnight owns a controlling interest.

— Goodnight is 71 years old.

— Goodnight’s children are not engaged in management of the business.

Within the next few years, SAS faces a dual transition of management and ownership.   This should be a concern for customers and prospective customers; due to SAS’ proprietary architecture, building on the SAS platform necessarily means a long-term bet on the future of the company.  Suppose, for example, that IBM acquires SAS: will SAS continue to support interfaces to Oracle and Teradata?

Succession is a problem for any business;  it is especially so for a founder-managed business, where ownership must change as well as management.   Goodnight may be interested in SAS as a going concern, but his heirs are more likely to want its cash value, especially when the IRS calls to collect estate taxes.

Large founder-managed firms typically struggle with two key issues.  First, the standards of corporate governance in public companies differ markedly from those that apply to private companies.  The founder’s personal business may be closely intermingled with corporate business in a manner that is not acceptable in a public company.

For example, suppose (hypothetically) that Goodnight or one of his personal entities owns the land occupied by SAS headquarters in Cary, North Carolina; as a transaction between related parties, such a relationship is problematic for a public company.   Such interests must be unwound before an IPO or sale to a public company can proceed; failure to do so can lead to serious consequences, as the Rigas brothers discovered when Adelphia Communications went public.

The other key issue is that founders may clash with senior executives who demonstrate independent thought and leadership.  Over the past fifteen years, a number of strong executives with industry and public company experience have joined SAS through acquisition or hire; most exited within two years.  The present SAS management team consists primarily of long-term SAS employees whose leadership skills are well adapted to survival under Goodnight’s management style.  How well this team will perform once out from under Goodnight is anyone’s guess.

SAS flirted with an IPO in 1999, at the height of the tech-driven stock market boom, and hired ex-Oracle executive Andre Boisvert as COO to lead the transition.  Preparations for the IPO proceeded slowly; Boisvert clashed with Goodnight and left.  SAS shelved the IPO soon thereafter.

After this episode, Goodnight told USA Today that talk of an IPO was never serious, that he had pursued an IPO for the benefit of the employees, and that he abandoned the move because employees were against it.  In the story, USA Today noted that this claim appeared to be at odds with Goodnight’s previous public statements.  The reader is left to wonder whether the real reason had something to do with Goodnight’s personal finances, or whether he simply did not want to let go of the company.  In any case, it’s not surprising that many SAS employees opposed an IPO, since Boisvert reportedly told employees at a company meeting that headcount reduction would follow public ownership.

Since then, there have been opportunities to sell the company in whole or in part.  IBM tried to acquire the company twice.  Acquisition by IBM makes a lot of sense: SAS built its business on the strength of its IBM technology partnership, and SAS still earns a large share of its revenue from software running on IBM hardware.  Both companies have a conservative approach to technology, preferring to wait until innovations are proven before introducing them to blue-chip customers.

But Goodnight rebuffed IBM’s overtures and bragged about doing so, claiming an exaggerated value for SAS of $20 billion, around ten times sales at the time.  It’s not unknown for two parties to disagree about the value of a company.   But according to a SAS insider, Goodnight demanded that IBM agree to his price “without due diligence”, which no acquiring company can ever agree to do.  That seems like the behavior of a man who simply does not want to sell to anyone, under any circumstances.

Is SAS really worth ten times revenue?  Certainly not.  SAS’ compound annual revenue growth rate over the past twenty years is around 10%, which suggests a revenue multiplier of a little under 4X at current valuations (see graph below).  Of course, that assumes SAS’ past revenue growth rate is a good indicator of its future growth, which is a stretch when you consider the saturation of its market, increased competition and limited customer response to “game-changing” new products.

[Chart: software industry revenue growth and revenue multiples.  Source: Yahoo Finance; market capitalization and revenue for publicly owned software companies.]

One obstacle to sale of the company is Goodnight’s stated unwillingness to sell to buyers who might cut headcount.  SAS’ company culture is the subject of business school case studies and the like, but the unfortunate truth is that SAS’ revenue per employee badly lags the IT industry, as shown in the table below.  SAS appears to be significantly overstaffed relative to revenue compared to other companies in the industry, and markedly so compared to any likely acquirer.

[Table: revenue per employee for selected software companies.  Source: Yahoo Finance; SAS website.]

One could speculate about the causes of this relatively low revenue per employee — I won’t — but an acquiring company will expect this to improve.  Flogging the business for more sales seems like pushing on a string — according to company insiders, SAS employs more people in its Marketing organization than in its Research and Development organization.  An acquirer will likely examine SAS’ product line, which consists of a few strong performers — the “Legacy” SAS software, such as Base and STAT — and a long list of other products, many of which do not seem to be widely used.  Rationalization of the SAS product line — and corresponding headcount — will likely be Job One for an acquirer.

So what’s ahead for SAS?

One option: Goodnight can simply donate his ownership interest in SAS to a charitable trust, which would continue to manage the business much the way Hershey Trust manages Hershey Foods.   This option would be least disruptive to customers and employees, and the current management team would likely stay in place (if the Board is stacked with insiders, locals and friends).    It’s anyone’s guess how likely this is; such a move would be consistent with Goodnight’s public statements about philanthropy, but unlike Larry Ellison, Goodnight hasn’t signed Warren Buffett’s Giving Pledge.

But if Goodnight needs the cash, or wants his heirs to inherit something, a buyer must be found.  Another plausible option consistent with Goodnight’s belief in the virtues of private ownership would be a private equity led buyout.  The problem here is that while private equity investors might be willing to put up with either low sales growth or low employee productivity, they won’t tolerate both at the same time.    A private equity investor would likely treat the Legacy SAS software as a cash cow, kill off or spin off the remaining products, and shed assets.   The rock collection and the culinary farm will be among the first to go.

There are a limited number of potential corporate buyers.  IBM, H-P, Oracle, Dell and Intel all sell hardware that supports SAS software, and all have a vested interest in SAS, but it seems unlikely that any of these will step up and buy the company.   Twice rebuffed, IBM has moved on from SAS, reporting double-digit growth in business analytics revenue while SAS struggles to put up single digits.   H-P and Dell have other issues at the moment.  Oracle could easily put up $10 billion in cash to buy SAS, and Oracle’s analytic story would benefit if SAS were added to the mix, but I suspect that Oracle doesn’t think it needs a better analytics story.

SAP has the resources to acquire SAS; a weak dollar favors acquirers from outside of the United States.  Such a transaction would add to SAP’s credibility in analytics, which isn’t strong (the recently announced acquisition of KXEN notwithstanding).   Until recently, there was no formal partnership between the two companies, and SAS executives spent the better part of the last SAS Global Forum strutting around the stage sniping at SAP HANA.  It will be interesting to see how this alliance develops.

Update

A reader on Twitter asks: what about employee ownership?  Well, yes, but if Goodnight wants to sell the company, the employees would need to come up with the market price of $10-11 billion; spread across roughly 14,000 employees, that works out to about $750,000 each.  There are investors who would consider lending the capital necessary for an employee-led buyout, but they would subject the business and its management to the same level of scrutiny as an independent buyer.

SAP Buys KXEN

SAP announced today that it plans to acquire KXEN in a deal that will close in the fourth quarter.  No purchase price was announced.  Since one recently laid-off employee characterized the company’s prospects as “circling the toilet”, this seems like a case of bottom-feeding by SAP.

KXEN has struggled to position and sell its InfiniteInsight analytic software.  The vendor’s black-boxy approach has little appeal for hard-core analysts, who prefer tooling that offers greater control over the analytics process.  At the other end of the value chain, business executives are interested not in analytics, but in business solutions.

Hence, KXEN is neither fish nor fowl as a standalone company, but its technology is worth something to an enterprise vendor such as SAP, which says it will embed KXEN in applications for managing operations, customer relationships, supply chains, risk and fraud.

KXEN has never been terribly forthcoming about the details of its technology.  The software is server-based, with database integration primarily through ODBC and PMML.  KXEN has an established partnership with SAP Sybase, but for model scoring only, in a “run-beside” architecture.  SAP says it will integrate KXEN with HANA, but I suspect that will also be a run-beside architecture, since KXEN adds little to SAP’s in-database Predictive Analysis Library.

Update:  Several analysts have commented on SAP’s move, including Curt Monash.  Monash correctly distinguishes between analytic programming languages (such as SAS or R) and analytic applications such as KXEN’s InfiniteInsight.  (There is a third category, which I call the analytic workbench, designed for users who have some understanding of analytics but would rather not program.  SPSS Modeler is an example.)

Monash also rightly throws cold water on SAP’s ability to embed KXEN in business solutions, pointing out InfiniteInsight’s lack of the tooling needed for risk applications.  I’d go further and say that KXEN has no credibility outside of marketing campaign management, where SAP CRM is sadly stuck behind IBM/Unica, SAS, Neolane, Teradata Aprimo, Oracle and Pitney Bowes.