Big Analytics Roundup (March 2, 2015)

Here is a roundup of some recent Big Analytics news and analysis.

General

  • SiliconAngle covers the Big Data money trail.

Apache Spark

  • Curt Monash writes about Databricks and Spark on his DBMS2 blog.
  • On the Databricks blog, Dave Wang summarizes Spark highlights from Strata + Hadoop World.
  • In this post, Hammer Lab describes how to monitor Spark with Graphite and Grafana.
  • Cloudera announces Hive on Spark beta.
  • InfoWorld covers Spark’s planned support for R in Release 1.3.
  • Qubole announces Spark as a Service.

Dato/GraphLab

  • Dato announces new version of GraphLab Create.

H2O

  • From Strata + Hadoop World, Prithvi Pravu talks about using H2O.
  • Also from Strata, here is Cliff Click’s presentation on H2O, Spark and Python.
  • On the H2O blog, Arno Candel publishes a performance tuning guide for H2O Deep Learning.

 

 

2015: Predictions for Big Analytics

First, a review of last year’s predictions:

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon.  At my lunch table, every single person said his company is currently evaluating Spark.  There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.

(2) “Co-location” will be the latest buzzword.

Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop.  YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.

(3) Graph engines will be hot.

Graph engines did not take off in 2014.  Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production.  However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.

(4) R approaches parity with SAS in the commercial job market.

As of early 2014, when Bob Muenchen last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.

Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists.  I asked Linda what analytic languages hiring managers seek when they hire quants.  “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says.  “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion.”

(5) SAP emerges as the company most likely to buy SAS.

After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014.  The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.

(6) Competition heats up for “easy to use” predictive analytics.

Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch.  Alpine, Alteryx, and RapidMiner all gained market presence in 2014; Dell’s acquisition of StatSoft gives that company the deep pockets it needs for a makeover.  In easy-to-use cloud analytics, StatWing has added functionality, and IBM Watson Analytics emerged from beta.

Four out of six ain’t bad.  Now looking ahead:

(1) Apache Spark usage will explode.

While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists.  That will change in 2015, for several reasons:

  • The R interface planned for release in Q1 opens the platform to a large and engaged community of users
  • Alteryx, Alpine and other easy to use analytics tools currently support or plan to support Spark RDDs as a data source
  • Databricks Cloud offers an easy way to spin up a Spark cluster

As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.

(2) Analytics in the cloud will take off.

Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud.  And yet, all of the top ten data breaches in 2014 defeated an on-premises security system.  Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.

Cloud is eating the analytics world for three big reasons:

  • Analytic workloads tend to be lumpy and difficult to predict
  • Analytic projects often need to get up and running quickly
  • Analytic service providers operate in a variable cost world, with limited capital for infrastructure

Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others.  For the business user, StatWing and IBM Watson Analytics provide compelling independent cloud-based platforms.

Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services.  Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS, lacks a native access engine for Redshift, and does not support its Hadoop interface with EMR.

(3) Python will continue to gain on R as the preferred open source analytics platform.

The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome.  As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base.  For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up.  According to a recent poll by KDnuggets, more people switch from R to Python than the other way ’round.

Both languages have their virtues.  The sheer volume of analytic features in R is much greater than in Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge.  Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.

Python has two key advantages over R.  As a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R.  Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use.  There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.

(4) H2O will continue to win respect and customers in the Big Analytics market.

If you’re interested in scalable analytics but haven’t checked out H2O, you should.  H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau.  H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.

While the software is freely available, H2O offers support and services for an attractive price.  The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and PayPal.

(5) SAS customers will continue to seek alternatives.

SAS once had an almost religious loyalty from its customers.  This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software.  While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration.  While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.

While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives.  This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important.  Even among big banks and pharma companies, though, SAS user headcount is declining.

Spark 1.1 Update

For an overview of Spark, see the Apache Spark Page.

On September 11, the Spark team announced release of Spark 1.1.   This latest version of Spark includes a number of significant enhancements:

  • As announced at the Spark Summit, Shark is now converged with Spark SQL.  Databricks has migrated its Shark workloads to Spark, and reports 2X-5X performance improvement.
  • The team has added a library of basic statistics for exploratory analysis, including correlations and hypothesis testing.  There are also new tools for stratified sampling and random generation.
  • Also new to MLlib: utilities for feature extraction for text mining and feature transformation.  Feature extraction techniques include Word2Vec and TF-IDF;  transformation techniques include normalization and scaling.
  • New MLlib algorithms include non-negative matrix factorization and singular value decomposition (SVD) using the Lanczos algorithm.  The combination of feature extraction capabilities and a robust SVD gives Spark a strong foundation for text mining.
  • For Spark Streaming, the team has added support for Amazon Kinesis and a streaming linear regression algorithm.
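
To make the feature extraction and transformation techniques named in the list above concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by L2 normalization.  It is not MLlib's API; the function names, the toy documents and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; the +1 terms are an assumption made for this sketch.
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "spark adds streaming regression".split(),
    "spark adds text mining".split(),
    "text mining needs features".split(),
]
vectors = [l2_normalize(v) for v in tf_idf(docs)]
```

Terms that appear in fewer documents (such as "streaming") receive more weight than terms shared across documents (such as "spark"), which is the property that makes TF-IDF useful as input to methods like SVD.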

There are also many bug fixes, as well as performance and usability improvements.  With ~175 contributors for this release, Spark continues to be one of the most active projects in the Hadoop ecosystem.

Since release of Spark 1.0, Databricks has announced certification for three additional Spark distributions:

  • Bluedata, a pioneer in big data private cloud.
  • Guavus, an operational intelligence platform.
  • Stratio, a commercially supported open source “Pure Spark” distribution.

In related news, Databricks and O’Reilly Media recently announced a certification program, which will be launched October 15-17 at Strata NY + Hadoop World.  More information here, here, here and here.

SAS and Hadoop

SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities.    Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.

Prior to January, 2012, a search for the words “Hadoop” or “MapReduce” returned no results on the SAS marketing and support websites, which says something about SAS’ leadership in this area.  In March 2012, SAS announced support for Hadoop connectivity;  since then, SAS has gradually expanded the features it supports with Hadoop.

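As of today, there are four primary ways that a SAS user can leverage Hadoop:

  • “Legacy SAS” with SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server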

Let’s take a look at each option.

“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc.) that are used primarily from a programming interface.  SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing.  It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user.  For more detailed information, read the manual.

SAS/ACCESS also supports six “Hadoop-enabled” PROCs (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop.  If all you need to do is run frequency distributions, simple statistics and summary reports, then SAS offers everything you need for analytics in Hadoop.  If that is all you want to do, of course, you can use Datameer or BigSheets and save on SAS licensing fees.

A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps.  (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce).  SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document);  if SAS cannot implicitly translate a DATA Step into SQL/HiveQL, it copies the data back to the SAS server, without warning, and performs the operation there.

SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through.  Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”.  The bottom line is that since the user needs to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, the user might as well submit jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.

SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.

SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera.  Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes”, which means that in practice most customers must rebuild existing predictive models to take advantage of the product.  Customers who already use SAS Enterprise Miner can export the models in PMML and use them in any PMML-enabled database or decision engine and spend less on SAS licensing fees.

Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server.   These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture.  That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.

That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative.  In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time.  SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine.  Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode.  This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.

While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format.   To import the data into SASHDAT, you will need to license SAS Data Integration Server.

A single in-memory node supported by a 16-core/256GB machine can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes.  SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations.  Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.
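
The sizing arithmetic above can be sketched in a few lines of Python.  This is only a back-of-the-envelope check: the helper name is hypothetical, “terabyte-sized” is read as 1,000GB, and the sketch assumes the table divides evenly across nodes with no overhead.

```python
import math

def nodes_needed(dataset_gb: float, table_gb_per_node: float) -> int:
    """Smallest whole number of nodes whose combined table capacity holds the dataset."""
    return math.ceil(dataset_gb / table_gb_per_node)

DATASET_GB = 1000  # assumption: "terabyte-sized" read as 1,000 GB

# Per-node table capacity range quoted in the post (75-100 GB per node).
best_case = nodes_needed(DATASET_GB, 100)
worst_case = nodes_needed(DATASET_GB, 75)
print(f"{best_case} to {worst_case} nodes")
```

Under these assumptions the 100GB end of the range gives the 10 nodes cited above, while the 75GB end comes out a little higher, at 14.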

SAS seems to be doing a little better selling SAS VA/LASR Server; they have a big push on in 2013 to sell 2,000 copies of VA and heavily promote a one node version on a big H-P machine for $100K.  Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.

While SAS has struggled to implement its in-memory software in Hadoop to date,  YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop.  Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.

SAS and H-P Close the Curtains

Michael Kinsley wrote:

It used to be, there was truth and there was falsehood. Now there is spin and there are gaffes. Spin is often thought to be synonymous with falsehood or lying, but more accurately it is indifference to the truth. A politician engaged in spin is saying what he or she wishes were true, and sometimes, by coincidence, it is. Meanwhile, a gaffe, it has been said, is when a politician tells the truth — or more precisely, when he or she accidentally reveals something truthful about what is going on in his or her head. A gaffe is what happens when the spin breaks down.

Hence, a Kinsley gaffe means “accidentally telling the truth”.

Back in April, an H-P engineer committed a Kinsley gaffe by publishing a white paper that describes in some detail issues encountered by SAS and H-P on implementations of SAS Visual Analytics.  I blogged about this at the time here.

Some choice bits:

— “Needed pre-planning does not occur and the result is weeks to months of frantic activity to address those issues which should and could have been addressed earlier and in a more orderly fashion.”

— “(Data and management networks) are typically overlooked and are the cause of most issues and delays encountered during implementation.”

— “Since a switch with 100s to 1000s of ports is required to achieve the consolidation of network traffic, list price can start at about US$500,000 and be into the millions of dollars.”

And my personal favorite:

— “The potential exists, with even as few as 4 servers, for a Data Storm to occur.”

If you’re wondering what a Data Storm is, let’s just say that it’s not a good thing.

Since I published the blog post, SAS has withdrawn the paper from its website.   This is not too surprising, since every other paper on “SAS and Big Data” is also hidden from view.   Fortunately, I downloaded a copy of the paper for my records.   H-P can claim copyright, so I can’t upload the whole thing, but I’ve attached a few screen shots below so you can see that this paper is real.

You might wonder why SAS feels compelled to keep its “Big Data” stories under wraps.  Keep in mind that we’re not talking about software design or any other intellectual property that warrants protection; in this case, the vendors don’t want you to know the truth about implementation because it conflicts with the hype.  As the paper’s author puts it, “this sounds very scary and expensive.”  “Very scary” and “expensive” don’t mix with “buy this product now.”

If you’re evaluating SAS Visual Analytics ask your SAS rep for a copy of Paper 466-2013.  And ask if they’ve done anything about those Data Storms.

[Screenshot: Hp1]

[Screenshot: Hp2]

Notes from Strata 2013

Last week I attended the O’Reilly Strata 2013 Conference.    Here are some notes on presentations pertinent to analytics, in four categories:

  • Vendors
  • Users
  • Technical
  • Thought Provokers

Vendor Presentations

We wouldn’t have trade shows without sponsors, and the big ones get ten minutes of fame.  Some used their time well, others not so much.  I’ll refrain from shaming the bloviators, but will single out three for applause:

  • John Schroeder of MapR did a nice preso on the business case for Hadoop, with a refreshing focus on measurable revenue impact and cost reduction;
  • Girish Juneja from Intel delivered a thoughtful summary of Intel’s participation in open source projects.  Not a lot of sizzle, but refreshingly free of hype;
  • Charles Zedlewski of Cloudera provided a terrific explanation of the history and direction of Hadoop and made a compelling case for the platform.

Someone should tell O’Reilly that Skytree is a vendor.  Skytree managed to get a 45-minute slot in a non-vendor track, and Alexander Gray of Skytree used the time to say stuff that data miners learned years ago.

User Presentations

Several presenters spoke about how their organization uses analytics.   In any conference, presentations like this are often the most compelling.

  • Rajat Taneja from Electronic Arts spoke about the depth of information captured by gaming companies, and how they use this information to improve the gaming experience.  Good presentation, with great visuals.
  • Eric Colson of Stitch Fix (and formerly with Netflix) spoke about recommendation engines.  Stitch Fix sends bundles of new clothing to buyers on spec, and they have finely tuned the bundling process using a mix of machine learning and human decisions.  Eric spoke about the respective strengths of machine and human decisioning, and how to use them together effectively.
  • Michael Bailey of Facebook gets credit for truth in packaging for “Introduction to Forecasting”.  His presentation covered very basic content, the sort of thing covered in Stat 101, and he did a fine job presenting that.  Michael hinted at Facebook’s complex forecasting problem — they have to simultaneously forecast eyeballs and ad placements — and it would be great to hear more about that in a future presentation.

Technical Presentations

It’s tough to deliver detailed content in a short session; most of the presenters I saw struck the right balance.

  • Sharmila Shahani-Mulligan and others from ClearStory Data presented to an overflow audience interested in learning more about Spark and Shark.  Spark is an open source in-memory distributed computational engine that runs on top of Hadoop.  It is designed to support iterative algorithms, and supports Java, Scala and Python.  Shark is compatible with Hive, runs on Spark, and offers a SQL interface.
  • Dr. Vijay Srinivas Agneeswaran of Impetus Technologies delivered what I thought was the best presentation in the show.   He summarized the limits of legacy analytics, discussed analytics in Hadoop (such as Mahout),  and spoke about a third wave of distributed analytics based on technologies like Spark, HaLoop, Twister, Apache Hama and GraphLab.
  • Jayant Shekhar of Cloudera delivered a very detailed presentation on how to build a recommendation engine.

Thought Provokers

Several presenters spoke on broad conceptual topics, with mixed results.

  • James Hendler of RPI spoke on the subject of “broad data”.  His presentation seemed thoughtful, but to be honest he lost me.
  • Nathan Marz of Twitter has co-authored a book on Big Data coming out soon.   After listening to his short preso on data modeling (“Human Fault Tolerance”) , I added the book to my wish list.
  • Kate Crawford of Microsoft presented on the subject of hidden biases in big data.  Although her presentation covered material well known to seasoned analysts (“hey, did you know that your data may be biased?”), it was excellent and full of good examples.

Overall, an excellent show.

Notes From #BigDataMN

Analytics conferences tend to be held in places like Orlando or Las Vegas, where it’s sunny and warm all of the time and there are copious incidental pleasures to fill the off hours.  I can’t speak to the incidental pleasures of Minneapolis in January, but warm it is not; peak temperature on Monday had a minus sign in front of it, and that’s in Fahrenheit.

Nevertheless, a sellout crowd for the MinneAnalytics #BigDataMN event filled the rooms at the Carlson School of Management in Minneapolis.  MinneAnalytics is one of the more visible regional analytic user groups, and their events are well-organized and content rich.

Vendors present at #BigDataMN included the usual suspects, including IBM, EMC, Teradata Aster, Cloudera and several others.   SAS was conspicuous by its absence, which is noteworthy because MinneAnalytics is operated by the Twin Cities Area SAS Users Group.  It seems that SAS does not wish to appear at events where R is discussed favorably.   Those crafty strategists at SAS corporate headquarters know a threat when they see it.

At least a third of the presentations featured open source analytics.   Some highlights:

  • Erik Iverson, chair of the local R User Group, presented two excellent overviews of R.  The second of these, an introduction to R basics, drew an overflow audience of all ages; about 90% of these, by show of hands, had no prior experience with R.  In his first presentation, a balanced “flyover” of R from a business perspective, Erik made the excellent point that prospective analysts entering the labor force today have all grown up with R; and so, by inference, we can expect that perceived R learning curve issues will decline as this cohort matures.
  • Winston Chang introduced RStudio’s new Shiny server for R web applications, a tool that gives the lie to the notion that R is suitable for academic research but little more.  This presentation had some impact.  As I stood in the back of the room, I could see a number of participants download and install RStudio then and there.
  • Luba Gloukov of Revolution Analytics offered an excellent interactive demonstration of how she uses Revolution R together with YouTube and Google Maps to identify and map emerging artists.  This was a fun and lively presentation.  One does not often associate the words “fun” and “lively” with an analytics conference.

Mark Pitts from United Health offered a balanced overview of SAS High Performance Analytics, based on his organization’s ongoing assessment of HPA and alternatives.  Mark nicely presented what HPA does well (it’s extremely fast with large data sets) together with its limitations (functionality is limited relative to standard single-threaded SAS).  Mark did not mention cost of ownership of this product, which exceeds the GNP of some countries.  🙂

The format of this event, which provides most speakers with slots of twenty to twenty-five minutes, is excellent.  The short time slots prevent bloviation, and if a speaker is less than inspired the audience doesn’t have to decide between a catnap or checking email.  Conference presentations should be like speed dates: get in, make your point quickly, and if there’s a fit you can follow up afterwards.

Book Review: Big Data Big Analytics

Big Data Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, by Michael Minelli, Michele Chambers and Ambiga Dhiraj.

Books on Big Data tend to fall into two categories: they are either “strategic” and written at a very high level, or they are cookbooks that tell you how to set up a Hadoop cluster.  Moreover, many of these books focus narrowly on data management — an interesting subject in its own right for those who specialize in the discipline, but yawn-inducing for managers in Sales, Marketing, Risk Management, Merchandising or Operations who have businesses to run.

Hey, we can manage petabytes of data.  Thank you very much.  Now go away.

Big Data Big Analytics appeals to business-oriented readers who want a deeper understanding of Big Data, but aren’t necessarily interested in writing MapReduce code.   Moreover, this is a book about analytics — not just how we manage data, but what we do with it and how we make it drive value.

The authors of this book — Michael Minelli, Michele Chambers and Ambiga Dhiraj — combine in-depth experience in enterprise software and data warehousing with real-world experience delivering analytics for clients.  Building on interviews with a cross-section of subject matter experts — there are 58 people listed in the acknowledgements — they map out the long-cycle trends behind the explosion of data in our economy, and the expanding tools to manage and learn from that data.  They also point to some of the key bottlenecks and constraints enterprises face as they attempt to deal with the tsunami of data, and provide sensible thinking about how to address these constraints.

Big Data Big Analytics includes rich and detailed examples of working applications.  This is refreshing; books in this category tend to push case studies to the back of the book, or focus on one or two niche applications.  This book documents the disruptive nature of Big Data analytics across numerous vertical and horizontal applications, including Consumer Products, Digital Media, Marketing, Advertising, Fraud and Risk Management, Financial Markets and Health Care.

The book includes excellent chapters that describe the technology of Big Data, as well as chapters on Information Management, Business Analytics, and Human Factors (people, process, organization and culture).  The final chapter is a good summary of Privacy and Ethics.

The Conclusion aptly summarizes this book: it’s not how much data you have, it’s what you do with it that matters.  Big Data Big Analytics will help you get started.

Latest Forrester Analytics “Wave”

Forrester’s latest assessment of predictive analytics vendors is available here;  news reports summarizing the report are here, here and here.

A few comments:

(1) While the “Wave” analysis purports to be an assessment of “Predictive Analytics Solutions for Big Data”, it is actually an assessment of vendor scale.  You can save yourself $2,495 by rank-ordering vendors by revenue and get similar results.

(2) The assessment narrowly focuses on in-memory tools, which arbitrarily excludes some of the most powerful analytic tools in the market today.    Forrester claims that in-database analytics “tend to be oriented toward technical users and require programming or SQL”.  This is simply not true.  Oracle Data Mining and Teradata Warehouse Miner have excellent user interfaces, and IBM SPSS Modeler provides an excellent front-end to in-database analytics across a number of databases (including IBM Netezza, DB2, Oracle and Teradata).  Alpine Miner is a relatively new entrant that also has an excellent UI.

(3) Forrester exaggerates SAS’ experience and market presence in Big Data.  Most SAS customers work primarily with foundation products that do not scale effectively to Big Data; this is precisely why SAS developed the high performance products demonstrated in the analyst beauty show.   SAS has exactly one public reference customer for its new in-memory high-performance analytics software.

(4) SAS Enterprise Miner may be “easy to learn”, but it is a stretch to say that it has the capability to “run analytics in-database or distributed clusters to handle Big Data”.

(5) Designation of SAP as a “leader” in predictive analytics is also a stretch.  SAP’s Predictive Analytics Library is a new product with some interesting capabilities; however, it is largely unproven and SAP lacks a track record in this area.

(6) The omission of Oracle Data Mining from this report makes no sense at all.

(7) Forrester’s scorecard gives every vendor the same score for pricing and licensing.  That’s like arguing that a Bentley and a Chevrolet are equally suitable as family cars, but the Bentley is preferable because it has leather seats.    TCO matters.  As I reported here, a firm that tested SAS High Performance Analytics and reported good results did not actually buy the product because, as a company executive notes, “this stuff is expensive.”