2015: Predictions for Big Analytics

First, a review of last year’s predictions:

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon.  At my lunch table, every single person said his company is currently evaluating Spark.  There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.

(2) “Co-location” will be the latest buzzword.

Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop.  YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.

(3) Graph engines will be hot.

Graph engines did not take off in 2014.  Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production.  However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.

(4) R approaches parity with SAS in the commercial job market.

As of early 2014, when Bob Muenchen last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.

Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists.  I asked Linda what analytic languages hiring managers seek when they hire quants.  “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says.  “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion.”

(5) SAP emerges as the company most likely to buy SAS.

After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014.  The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.

(6) Competition heats up for “easy to use” predictive analytics.

Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch.  Alpine, Alteryx, and RapidMiner all gained market presence in 2014; Dell’s acquisition of StatSoft gives that company the deep pockets it needs for a makeover.  In easy to use cloud analytics, StatWing has added functionality, and IBM Watson Analytics emerged from beta.

Four out of six ain’t bad.  Now looking ahead:

(1) Apache Spark usage will explode.

While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists.  That will change in 2015, for several reasons:

  • The R interface planned for release in Q1 opens the platform to a large and engaged community of users
  • Alteryx, Alpine and other easy to use analytics tools currently support or plan to support Spark RDDs as a data source
  • Databricks Cloud offers an easy way to spin up a Spark cluster

As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.

(2) Analytics in the cloud will take off.

Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud.  And yet, all of the top ten data breaches in 2014 defeated an on-premises security system.  Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.

Cloud is eating the analytics world for three big reasons:

  • Analytic workloads tend to be lumpy and difficult to predict
  • Analytic projects often need to get up and running quickly
  • Analytic service providers operate in a variable cost world, with limited capital for infrastructure

Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others.  For the business user, StatWing and IBM Watson Analytics provide compelling independent cloud-based platforms.

Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services.  Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS, lacks a native access engine for Redshift, and does not support its Hadoop interface with EMR.

(3) Python will continue to gain on R as the preferred open source analytics platform.

The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome.  As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base.  For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up.  According to a recent poll by KDNuggets, more people switch from R to Python than the other way ’round.

Both languages have their virtues. The sheer volume of analytic features in R is much greater than in Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge.  Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.

Python has two key advantages over R.  First, as a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R.  Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use.  There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.

(4) H2O will continue to win respect and customers in the Big Analytics market.

If you’re interested in scalable analytics but haven’t checked out H2O, you should.  H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau.  H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.

While the software is freely available, H2O offers support and services for an attractive price.  The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and PayPal.

(5) SAS customers will continue to seek alternatives.

SAS once had an almost religious loyalty from its customers.  This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software.  While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration.  While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.

While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives.  This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important.  Even among big banks and pharma companies, though, SAS user headcount is declining.

2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014.  Thought it would be fun to validate how well the crystal ball works.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with product certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite.  But the Mahout team has announced that all new projects must use a standard DSL that runs the job in Spark.

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Co-locating your analytics in the Hadoop cluster is less attractive than integrating your analytics with Hadoop.  With Spark fully integrated with Hadoop storage APIs, co-located solutions seem much less attractive.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security, rely on graph engines for graph-parallel analytics.

Graph analysis is really useful in the right hands, but organizations are still trying to figure out what to do with it.  That is why we still see posts like this; when something is hot, nobody writes articles about what to do with it; everyone knows what to do with it.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and the method tends to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops.  Tez isn’t really comparable to Giraph and GraphLab.  And right after I wrote this, the GraphLab open source project pretty much died.  GraphLab Inc., the venture founded to commercialize the open source project, is fiddling around with other stuff.  Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchen updates his analysis for 2014.  But the trend is your friend:

[Figure: R vs. SAS trend chart, fig_1b_rvsas_2014-2-23]

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

Speaking with enterprise customers, I like to ask why they switched from SAS to R.  The #1 response: the people we hire know R already, not SAS.  SAS’ free “University Edition” is an attempt to stem the bleeding that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs.  SAS has delivered an ACCESS engine to HANA but not much else considering the talk about joint solutions.  SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, which is an improvement over 2013 when they didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.

SAS Visual Analytics: FAQ (Updated 1/2014)

SAS charged its sales force with selling 2,000 licenses for Visual Analytics in 2013; the jury is still out on whether they met this target.  There’s lots of marketing action lately from SAS about this product, so here’s an FAQ.

Update:  SAS recently announced 1,400 sites licensed for Visual Analytics.  In SAS lingo, a site corresponds roughly to one machine, but one license can include multiple sites; so the actual number of licenses sold in 2013 is less than 1,400.  In April 2013 SAS executives claimed two hundred customers for the product.   In contrast, Tableau reports that it added seven thousand customers in 2013 bringing its total customer count to 17,000.

What is SAS Visual Analytics?

Visual Analytics is an in-memory visualization and reporting tool.

What does Visual Analytics do?

SAS Visual Analytics creates reports and graphs that are visually compelling.  You can view them on mobile devices.

VA is now in its fifth dot release.  Why do they call it Release 6.3?

SAS Worldwide Marketing thinks that if they call it Release 6.3, you will think it’s a mature product.  It’s one of the games software companies play.

Is Visual Analytics an in-memory database, like SAP HANA?

No.  HANA is a standards-based in-memory database that runs on many different brands of hardware and supports a range of end-user tools.  VA is a proprietary architecture available on a limited choice of hardware platforms.  It cannot support anything other than the end-user applications SAS chooses to develop.

What does VA compete with?

SAS claims that Visual Analytics competes with Tableau, Qlikview and Spotfire.  Internally, SAS leadership refers to the product as its “Tableau-killer” but as the reader can see from the update at the top of this page, Tableau is alive and well.

How well does it compare?

You will have to decide for yourself whether VA reports are prettier than those produced by Tableau, Qlikview or Spotfire.  On paper, Tableau has more functionality.

VA runs in memory.  Does that make it better than conventional BI?

All analytic applications perform computations in memory.  Tableau runs in memory, and so does Base SAS.   There’s nothing unique about that.

What makes VA different from conventional BI applications is that it loads the entire fact table into memory.  By contrast, BI applications like Tableau query a back-end database to retrieve the necessary data, then perform computations on the result set.

Performance of a conventional BI application depends on how fast the back-end database can retrieve the data.  With a high-performance database the performance is excellent, but in most cases it won’t be as fast as it would if the data were held in memory.
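To make the contrast concrete, here is a minimal Python sketch.  It is an illustration only, not a depiction of how VA or Tableau are actually built.  It uses the standard-library sqlite3 module as a stand-in for the back-end database; the `sales` table, its columns and the function names are all hypothetical.  The first function copies the whole fact table into application memory and aggregates there; the second pushes the aggregation down to the database and retrieves only the result set.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical 'sales' fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 1.5)],
    )
    return conn


def totals_load_everything(conn):
    """In-memory style: copy every row into application memory, then aggregate."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    # Second value: how many rows had to be moved and held in memory.
    return dict(totals), len(rows)


def totals_push_down(conn):
    """Conventional BI style: the database aggregates; only the result set moves."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)
```

Both functions return the same totals.  The difference is how many rows must be moved and held in memory: the whole fact table in the first case, a handful of summary rows in the second.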

So VA is faster?  Is there a downside?

There are two.

First, since conventional BI systems don’t need to load the entire fact table into memory, they can support usage with much larger datastores.  The largest H-P Proliant box for VA maxes out at about 10 terabytes; the smallest Netezza appliance supports 30 terabytes, and scales to petabytes.

The other downside is cost; memory is still much more expensive than other forms of storage, and the machines that host VA are far more expensive than data warehouse appliances that can host far more data.

VA is for Big Data, right?

SAS and H-P appear to be having trouble selling VA in larger sizes, and are positioning a small version that can handle 75-100 Gigabytes of data.  That’s tiny.

The public references SAS has announced for this product don’t seem particularly large.  See below.

How does data get into VA?

VA can load data from a relational database or from a proprietary SASHDAT file.  SAS cautions that loading data from a relational database is only a realistic option when VA is co-located in a Teradata Model 720 or Greenplum DCA appliance.

To use SASHDAT files, you must first create them using SAS.

Does VA work with unstructured data?

VA works with structured data, so unstructured data must be structured first, then loaded either to a co-located relational database or to SAS’ proprietary SASHDAT format.

Unlike products like Datameer or IBM Big Sheets, VA does not support “schema on read”, and it lacks built-in tools for parsing unstructured text.
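For readers unfamiliar with the term, “schema on read” means the raw data is stored as-is and structure is imposed only when a query runs.  The following Python sketch shows the general idea under simple assumptions; the log lines, field names and function are hypothetical and do not represent how Datameer, Big Sheets or VA work internally.

```python
import json

# Hypothetical raw log lines, stored exactly as they arrived.
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "referrer": "ad"}',
    "not json at all",
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: parse each raw line, keep the requested fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that do not parse
        if not isinstance(record, dict):
            continue  # skip valid JSON that is not an object
        # Fields missing from a record come back as None.
        yield {name: record.get(name) for name in fields}
```

The same raw lines can be read with a different list of fields tomorrow without reloading anything, which is the flexibility a load-time schema gives up.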

But wait, SAS says VA works with Hadoop.  What’s up with that?

A bit of Marketing sleight of hand.  VA can load SASHDAT files that are stored in the Hadoop File System (HDFS); but first, you have to process the data in SAS, then load it back into HDFS.  In other words, you can’t visualize and write reports from the data that streams in from machine-generated sources — the kind of live BI that makes Hadoop really cool.  You have to batch the data, parse it, structure it, then load it with SAS to VA’s staging area.

Can VA work with streaming data?

SAS sells tools that can capture streaming data and load it to a VA data source, but VA works with structured data at rest only.

With VA, can my users track events in real time?

Don’t bet on it.  To be usable, data requires significant pre-processing before it is loaded into VA’s memory.  Moreover, once the data is loaded it can’t be updated; updating the data in VA requires a full truncate and reload.  Thus, however fast VA is in responding to user requests, your users won’t be tracking clicks on their iPads in real time; they will be looking at yesterday’s data.
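A rough sense of why truncate-and-reload hurts can be had from a toy simulation.  The Python sketch below is an assumption-laden illustration, not a model of VA itself: both store classes and the `simulate` function are hypothetical, and “cost” is simply the number of rows copied into memory.

```python
class ReloadOnlyStore:
    """Hypothetical store that can only be replaced wholesale."""

    def __init__(self):
        self.rows = []
        self.rows_loaded = 0  # total rows copied into memory over time

    def refresh(self, full_dataset):
        self.rows = list(full_dataset)  # truncate, then reload everything
        self.rows_loaded += len(self.rows)


class AppendableStore:
    """Hypothetical store that accepts incremental batches."""

    def __init__(self):
        self.rows = []
        self.rows_loaded = 0

    def refresh(self, new_rows):
        self.rows.extend(new_rows)  # only the new rows are copied
        self.rows_loaded += len(new_rows)


def simulate(initial_rows, batch_size, n_batches):
    """Return (rows copied by reload-only store, rows copied by appendable store)."""
    source = list(range(initial_rows))
    reload_store, append_store = ReloadOnlyStore(), AppendableStore()
    reload_store.refresh(source)
    append_store.refresh(source)
    for _ in range(n_batches):
        start = len(source)
        batch = list(range(start, start + batch_size))
        source.extend(batch)
        reload_store.refresh(source)  # full truncate and reload
        append_store.refresh(batch)   # incremental update
    return reload_store.rows_loaded, append_store.rows_loaded
```

Starting from 1,000 rows and adding three batches of 10, `simulate(1000, 10, 3)` copies 4,060 rows for the reload-only store versus 1,030 for the appendable one.  The gap widens with every refresh, which is why a reload-only design pushes you toward infrequent batch updates.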

Does VA do predictive analytics?

Visual Analytics 6.1 can perform correlation, fit bivariate trend lines to plots and do simple forecasting.  That’s no better than Tableau.  Surprisingly, given the hype, Tableau actually supports more analysis functions.

While SAS claims that VA is better than SAP HANA because “HANA is just a database”, the reality is that SAP supports more analytics through its Predictive Analytics Library than SAS supports in VA.

Has anyone purchased VA?

A SAS executive claimed 200 customers in early 2013, a figure that should be taken with a grain of salt.  If there are that many customers for this product, they are hiding.

There are five public references, all of them outside the US:

SAS has also recently announced selection (but not implementation) by

OfficeMax has also purchased the product, according to this SAS blog.

As of January 2014, the four customers who announced selection or purchase are not cited as reference customers.

What about implementation?  This is an appliance, right?

Wrong.  SAS considers an implementation that takes a month to be wildly successful.  Implementation tasks include the same tasks you would see in any other BI project, such as data requirements, data modeling, ETL construction and so forth.  All of the back end feeds must be built to put data into a format that VA can load.

Bottom line, does it make sense to buy SAS Visual Analytics?

Again, you will have to decide for yourself whether the SAS VA reports look better than Tableau or the many other options in this space.  BI beauty shows are inherently subjective.

You should also demand that SAS prove its claims to performance in a competitive POC.  Despite the theoretical advantage of an in-memory architecture, actual performance is influenced by many factors.  Visitors to the recent Gartner BI Summit who witnessed a demo were unimpressed; one described it to me as “dog slow”.  She didn’t mean that as a compliment.

The high cost of in-memory platforms means that VA and its supporting hardware will be much more expensive for any given quantity of data than Tableau or equivalent products. Moreover, its proprietary architecture means you will be stuck with a BI silo in your organization unless you are willing to make SAS your exclusive BI provider.  That makes this product very good for SAS; the question is whether it is good for you.

The early adopters for this product appear to be very SAS-centric organizations (with significant prior SAS investment).  They also appear to be fairly small.  If you have very little data, money to burn and are willing to experiment with a relatively new product, VA may be for you.

A Few Interesting Things About SAS Visual Analytics

This post is more than two years old, but remains popular.  For an updated discussion, read How to Buy SAS Visual Analytics on this blog.

Thanks to a white paper recently published by an H-P engineer, we now have a better idea about what it takes to implement SAS Visual Analytics, SAS’ in-memory BI and visualization platform.

(Note: SAS has taken down the white paper since this post was published).

(Updated again June 27:   SAS has reposted an edited version of the white paper, with interesting parts removed.  The paper currently posted at this link is not the original.)

It’s an interesting picture.

A few key points:

(1) Implementation is a science project.  

Quoting from the paper:

…too often the needed pre-planning does not occur and the result is weeks to months of frantic activity to address those issues which should and could have been addressed earlier and in a more orderly fashion.

Someone should explain to SAS and H-P that vendors are supposed to provide customers with honest guidance about how to implement a product.  If “needed pre-planning does not occur”, it’s likely because the customer wasn’t told it was necessary.  Of course, some customers ignore vendor guidance; but if the issues described in this white paper are systematic (as the author suggests), there are only two plausible explanations: (a) the vendors don’t know how to implement the product, or (b) they’re positioning the product as an “appliance” that is “ready to run.”

Elsewhere in the paper, the author notes that a response to a key question about how to monitor this application is “evolving”.  In other words, thirteen months and three releases into production and the vendors still don’t have an answer.

(2) Pre-Release testing?  What’s that?

Quoting:

Experience on initial installations is showing that networking is proving to be one of the biggest challenges and impediments to a timely implementation.

Well, duh.  This is the sort of thing ordinarily revealed in something called “system testing” and “benchmarking”, the product of which is something called “reference architecture”.   Smart people generally think it’s a good idea to do this sort of testing and benchmarking before you release a product rather than figuring it out in the course of “initial installations”.

Pity the early adopters for this product.

(Data and management networks) are typically overlooked and are the cause of most issues and delays encountered during implementation.

Well, why are they overlooked?   Ordinarily when implementing a product one starts with something called an “architecture review” where you — I’m talking to you, SAS and H-P — tell the customer how the product works and point out these network thingies and why it’s a good idea to provision them and not leave them just hanging out there.

(3) It’s not an appliance.

The author refers to the hardware VA sits on as an “appliance”; word on the street is that SAS reps position VA as an alternative to appliances (such as IBM PureData, Teradata Aster or EMC Greenplum DCA).   Well, caveat emptor on that.  The author goes into an extended discussion of data networking, with much interesting detail on such topics as the type of cables you will need to wire this thing together (copper or glass fiber).

It seems that you need lots of cable, and for good performance you need good cable.

(4) Implement with care.

The potential exists, with even as few as 4 servers, for a Data Storm to occur.

No, he does not mean the Amiga game.    For those not hip to the lingo, a Data Storm is a Really Bad Thing that you don’t want to happen in your IT environment, and vendors generally design products that don’t set off Data Storms.

A Data Storm is to Business Intelligence what train wrecks are to travel.

(5) Infrastructure requirements may be daunting.

After an extended discussion of IP addresses and the load this product places on your network, the author writes:

Since a switch with 100s to 1000s of ports is required to achieve the consolidation of network traffic, list price can start at about US$500,000 and be into the millions of dollars.

And after more extended discussion about networking and cabling:

…while all this sounds very scary and expensive….

Dude, you have no idea.

…there can be assistance from vendors during the hardware ordering process that makes this simpler and clearer to comprehend.

No doubt.  Hardware vendors will be happy to explain all of this to you.

(6)  High Availability?  Not exactly.

The author opens the High Availability section with a Readiness Checklist detailing whether or not you should even attempt to cold swap a SAS Head Node.  The answer: it depends.  Left unanswered: what to do if the Readiness Checklist says Do Not Attempt.  I presume that the answer is “call H-P”, but I’m just guessing.

This leads to the next section:

How to transplant a SAS Head Node and survive the experience (you hope) in a SAS VA configuration.

Nine steps follow, closing with “Assuming that everything seems to be in working order…”

(7) Finding people to keep this running will be a challenge.

The author is a twenty-year veteran of SAS and H-P (with four acronyms after his name), and his paper is littered with words like “scary”, “problematic”  and “crisis”.

SAS Analyst Conference: Take Two

Analyst comments about SAS’ 24th annual analyst conference continue to dribble out.   Ordinarily, events like this produce a storm of Google alerts, but this year the quiet speaks volumes.   Yesterday, Tony Cosentino of Ventana Research published his perspective on the conference, writing at length about SAS Visual Analytics; link here.

Here are a few quotes from Mr. Cosentino’s post, with my embedded comments.

“For SAS, the competitive advantage in Big Data rests in predictive analytics…”

…a capability that is completely absent from the current version of SAS Visual Analytics, the software featured in Mr. Cosentino’s article.  The big “news” of the analyst conference is that SAS says they plan to add some toylike predictive analytics to Visual Analytics this year, which will give the application functional parity with, say, MicroStrategy vintage 1999.  I don’t completely understand why this is news at all, since SAS said they would do this at the analyst conference last year, but spent 2012 attempting to sell their other in-memory architecture without visible success.

“…according to our benchmark research into predictive analytics, 55 percent of businesses say the challenge of architectural integration is a top obstacle to rolling out predictive analytics in the organization.”

No doubt this is true, and SAS’ proprietary server-based architecture is one reason why this is a problem.  SAS/STAT, for example, is still one of the most widely used SAS products, and it exports predictive models to nothing other than SAS.  SAS Visual Analytics simply adds to the clutter by introducing an entirely new architecture into the mix that is hard to integrate with legacy SAS products in the same category.  For more details about the data integration challenges posed by SAS Visual Analytics, see my previous post.

“Integration of analytics is particularly daunting in a big-data-driven world, since analytics processing has traditionally taken place on a platform separate from where the data is stored…”

A trend that continues with SAS Visual Analytics, which is deployed on a platform separate from where the data is stored.

Jim Goodnight, the company’s founder and plainspoken CEO, says he saw the industry changing a few years ago. He speaks of a large bank doing a heavy analytical risk computation that took upwards of 18 hours, which meant that the results of the computation were not ready in time for the next trading day.

Banks have suffered serious performance issues with analytics for more than “a few years”.    And 18 hours is pretty good compared to some; there are organizations with processes that take days and weeks to run in SAS.

Goodnight also discussed the fact that building these parallelizing statistical models is no easy task. One of the biggest hurdles is getting the mathematicians and data scientists that are building these elaborate models to think in terms of the new parallelized architectural paradigm.

Really?  Parallelized algorithms for statistics and data mining are hardly new, and commercial versions first appeared on the market in 1994.  There are companies with a fraction of SAS’ headcount that are able to roll out parallelized algorithms without complaining about how hard it is to do.  A few examples:  Alpine Data Labs,  Fuzzy Logix,   Revolution Analytics (my current employer) and Skytree.

The biggest threat to SAS today is the open source movement, which offers big data analytic approaches such as Mahout and R.

If this is true, SAS Visual Analytics is not an effective response because it caters to a completely different user persona.  The biggest threats to SAS today are IBM, SAP and Oracle, who have the analytic tooling, deep pockets and credibility to challenge SAS in the enterprise analytics market.  SAS Visual Analytics seems more like an attempt to compete with SAP HANA.

At the same time, SAS possesses blueprints for major analytic processes across different industries as well as horizontal analytic deployments, and it is working to move these to a parallelized environment. This may prove to be a differentiator in the battle versus R, since it is unclear how quickly the open source R community, which is still primarily academic, will undertake the parallelization of R’s algorithms.

Actually, it’s already done.

SAS Analyst Conference

SAS held its annual Analyst Conference in Steamboat Springs, Colorado last week, an event that drew scant buzz from persons not on the SAS payroll.   For a good summary of major news from the event, check Cindi Howson’s post on the BI Scorecard blog (link here).

A few key points:

(1) SAS isn’t talking about SAS High Performance Analytics Server, its marquee in-memory software for predictive analytics.   This product went into production fourteen months ago and has no public reference customers to date.  Given the full-court marketing press SAS gave to this product last year, the implication is that (a) nobody’s buying it; (b) it doesn’t work; or both.

(2) SAS continues to provide a public speaking platform for a “customer” who sings the praises of High Performance Analytics Server but hasn’t actually purchased the product.  Note to analysts: if some guy tells you how much he likes the product, ask him if he bought it.

(3) On the other hand, SAS is aggressively promoting its other in-memory software (Visual Analytics), which seems to be selling smartly.  (SAS has a target to sell 1,000 licenses in North America this year.)  Visual Analytics is a slick in-memory BI tool that currently does not support predictive analytics.

(4) SAS plans to add some simple predictive analytics to Visual Analytics in 2013.

(5) SAS’ BI revenues grew only 3.2% in 2012, compared to the double digit growth reported by other vendors.   This quote from Gartner’s most recent BI Magic Quadrant offers insight into why:

References continue to report that SAS is very difficult to implement and use — it was the No. 3 vendor in both categories. Aggravating this, although it has a worldwide network of support centers and an extensive list of service partners, SAS’s customer experience and product support are in the lower quartile of vendors in the Magic Quadrant. A revision of user interfaces and an enhancement of product integration is under way to help improve the customer experience, but SAS must also improve its level of service — including level of expertise, response time and time to resolution.

(6)  As Howson notes, Visual Analytics “…may offer a more modern and appealing interface, but only when data has been loaded into memory on the SAS LASR Server.”  And there’s the rub, because it turns out that loading data into Visual Analytics is not exactly a day in the park.

(7) According to SAS product documentation,  there are exactly two ways to load data into VA: from a registered table in a relational database or from a SASHDAT file stored in HDFS.   According to SAS, the first option is “appropriate for smaller data sets because the data must be transferred over the network.  If the table is unloaded or the server stops, the data must be transferred over the network again.”   So if you’re working with Big Data, the only way to load data into VA is to first create a file in SAS’ proprietary SASHDAT format, store the file in HDFS, then load it into VA.  And by the way, you must use SAS Data Integration Server to create a SASHDAT file.  Surprise!  More SAS software to buy.

(8) Howson misses the obvious point, though, that SAS Visual Analytics cannot read data from Hadoop unless it has been previously extracted and reloaded in SAS’ closed format.  Which misses the point of using Hadoop in the first place.

(9) The other big announcement is that SAS now says it will support public cloud.  Yay.  I’m reminded of this article from November 2011, in which SAS CEO Jim Goodnight declared that cloud computing is “a lot of hype.”   Color me shocked.  It seems that when Jim Goodnight makes public statements about SAS’ product direction, we really can’t take him seriously.

Analytic Applications (Part Three): Operational Analytics

This is the third post in a series on analytic applications organized by how analytic work product is used within the enterprise.

  • The first post, linked here, covers Strategic Analytics (defined as analytics for the C-Suite)
  • The second post, linked here, covers Managerial Analytics (defined as analytics to measure and optimize the performance of value-delivering units such as programs, products, stores or factories).

This post covers Operational Analytics, defined as analytics that improve the efficiency or effectiveness of a business process.  The distinction between Managerial and Operational analytics can be subtle, and generally boils down to the level of aggregation and frequency of the analysis.  For example, the CMO is interested in understanding the performance and ROI of all Marketing programs, but is unlikely to be interested in the operational details of any one program.  The manager of that program, however, may be intensely interested in its operational details, but have no interest in the performance of other programs.

Differences in level of aggregation and frequency lead to qualitative differences in the types of analytics that are pertinent.  A CMO’s interest in Marketing programs is typically at a level of “keep or kill”;  continue funding the program if it is effective, kill it if it is not.  This kind of problem is well-suited to dashboard-style BI combined with solid revenue attribution, activity based costing and ROI metrics.  The Program Manager, on the other hand, is intensely interested in a range of metrics that shed insight not simply on how well the program is performing, but why it is performing as it is and how to improve it.  Moreover, the Program Manager in this example will be deeply involved in operational decisions such as selecting the target audience, determining which offers to assign, handling response exceptions and managing delivery to schedule and budget.  This is the realm of Operational Analytics.

While any BI package can handle different levels of aggregation and cadence, the problem is made more challenging due to the very diverse nature of operational detail across business processes.   A social media marketing program relies on data sources and operational systems that are entirely different from web media or email marketing programs; pre-approved and non-pre-approved credit card acquisition programs do not use the same systems to assign credit lines; some or all of these processes may be outsourced.  Few enterprises have successfully rationalized all of their operational data into a single enterprise data store (nor is it likely they will ever do so).  As a result, it is very rare that a common BI system comprehensively supports both Managerial and Operational analytic needs.  More typically, one system supports Managerial Analytics (for one or more disciplines), while diverse systems and ad hoc analysis support Operational Analytics.

At this level, questions tend to be domain-specific and analysts highly specialized in that domain.  Hence, an analyst who is an expert in search engine optimization will typically not be seen as qualified to perform credit risk analysis.  This has little to do with the analytic methods used, which tend to be similar across business disciplines, and more to do with the language and lingo used in the discipline as well as domain-specific technology and regulatory issues.  A biostatistician must understand common health care data formats and HIPAA regulations; a consumer credit risk analyst must understand FICO scores, FISERV formats and FCRA.  In both cases, the analyst must have or develop a deep understanding of the organization’s business processes, because this is essential to recognizing opportunities for improvement and prioritizing analytic projects.

While there is a plethora of different ways that analytics improve business processes, most applications fall into one of three categories:

(1) Applied decision systems supporting business processes such as customer-requested line increases or credit card transaction authorizations.  These applications improve the business process by applying consistent data-driven rules designed to balance risks and rewards.  Analytics embedded in such systems help the organization optimize the tradeoff between “loose” and “tight” criteria, and ensure that decision criteria reflect actual experience.  An analytics-driven decisioning system performs in a faster and more consistent way than systems based on human decisions, and can take more information into account than a human decision-maker.

(2) Targeting and routing systems (such as a text-processing system that reads incoming email and routes it to a customer service specialist).  While applied decision systems in the first category tend to recommend categorical yes/no, approve/decline decisions in a stream of transactions, a targeting system selects from a larger pool of candidates, and may make qualitative decisions among a large number of alternate routings.   The business benefit from this kind of system is improved productivity and reduced processing time as, for example, the organization no longer requires a team to read every email and route it to the appropriate specialist.  Applied analytics make these systems possible.

(3) Operational forecasting (such as a system that uses projected store traffic to determine staffing levels).   These systems enable the organization to operate more efficiently through better alignment of operations to customer demand.  Again, applied analytics make such systems possible; while it is theoretically possible to build such a system without an analytic forecasting component, it is inconceivable that any management would risk the serious customer service issues that would be created without one.  Unlike the first two applications, forecasting systems often work with aggregate data rather than atomic data.
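
To make the three categories above concrete, here is a minimal Python sketch with one toy function per category.  Everything in it is an assumption made for illustration: the function names, the expected-value rule, the keyword lists and the customers-per-associate ratio are all hypothetical, and a real system would use fitted models rather than hand-written rules.

```python
from math import ceil


def decide_line_increase(p_default, expected_margin, expected_loss):
    """Category 1: applied decision system (hypothetical rule).

    Approves when the expected reward outweighs the expected risk.
    p_default would come from a credit risk model in practice.
    """
    expected_value = (1 - p_default) * expected_margin - p_default * expected_loss
    return "approve" if expected_value > 0 else "decline"


# Category 2: hypothetical keyword lists standing in for a text model.
ROUTING_KEYWORDS = {
    "billing": {"invoice", "charge", "refund", "payment"},
    "technical": {"error", "login", "password", "crash"},
}


def route_email(text, default="general"):
    """Category 2: routing system.

    Scores each queue by keyword overlap and picks the best match.
    Splitting on whitespace ignores punctuation, a deliberate simplification.
    """
    words = set(text.lower().split())
    scores = {queue: len(words & kws) for queue, kws in ROUTING_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def staff_needed(projected_traffic, customers_per_associate, minimum=1):
    """Category 3: operational forecasting.

    Converts an aggregate traffic forecast into a staffing level,
    rounding up and never going below a minimum crew.
    """
    return max(minimum, ceil(projected_traffic / customers_per_associate))


print(decide_line_increase(0.02, 100, 2000))
print(route_email("I need a refund for a duplicate charge"))
print(staff_needed(130, 25))
```

Note that the first two functions work on one transaction or message at a time, while the third works on an aggregate forecast, which mirrors the atomic-versus-aggregate distinction drawn above.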

For analytic reporting, the ability to flexibly ingest data from operational data sources (internal and external) is critical, as is the ability to publish reports into a broad-based reporting and BI presentation system.

Deployability is the key requirement for predictive analytics; the analyst must be able to publish a predictive model as a PMML (Predictive Model Markup Language) document or as executable code in a choice of programming languages.
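
As a rough illustration of the "executable code" route, the sketch below turns the coefficients of a fitted linear or logistic model into a SQL scoring expression that could run inside a database.  The function name, the column names and the coefficient values are hypothetical, the column names are not sanitized, and this is not a PMML export; it only shows the general idea of publishing a model as code.

```python
def linear_model_to_sql(intercept, coefficients, logistic=False):
    """Render a linear or logistic model as a SQL scoring expression.

    coefficients maps column names (assumed to be valid, trusted SQL
    identifiers) to fitted weights.
    """
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        terms.append(f"({float(weight)!r} * {column})")
    linear = " + ".join(terms)
    if logistic:
        # Wrap the linear predictor in the logistic function.
        return f"1.0 / (1.0 + EXP(-({linear})))"
    return linear


# Hypothetical coefficients for a toy credit risk score.
model = {"utilization": 1.5, "late_payments": 0.8}
print("SELECT account_id, " + linear_model_to_sql(-2.0, model, logistic=True)
      + " AS risk_score FROM accounts")
```

The table and column names in the SELECT statement are placeholders; the point is that the analyst hands over a scoring expression rather than a model object tied to one analytic package.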

In the next post, I will cover the most powerful and disruptive form of analytics, what I call Customer-Enabling Analytics: analytics that differentiate your products and services and deliver value to the customer.

Analytic Applications (Part Two): Managerial Analytics

This is the second in a four-part taxonomy of analytics based on how the analytic work product is used.  In the first post of this series, I covered Strategic Analytics, or analytics that support the C-suite.  In this post, I will cover Managerial Analytics: analytics that support middle management, including functional and regional line managers.

At this level, questions and issues are functionally focused:

  • What is the best way to manage our cash?
  • Is product XYZ performing according to expectations?
  • How effective are our marketing programs?
  • Where can we find the best opportunities for new retail outlets?

There are differences in nomenclature across functions, as well as distinct opportunities for specialized analytics (retail store location analysis, marketing mix analysis, new product forecasting), but managerial questions and issues tend to fall into three categories:

  • Measuring the results of existing entities (products, programs, stores, factories)
  • Optimizing the performance of existing entities
  • Planning and developing new entities

Measuring existing entities with reports, dashboards, drill-everywhere (etc.) is the sweet spot for enterprise business intelligence systems.  Such systems are highly effective when the data is timely and credible, reports are easy to use and the system reflects a meaningful assessment framework.  This means that metrics (activity, revenue, costs, profits) reflect the goals of the business function and are standardized to enable comparison across entities.

Given the state of BI technology, analysis teams within functions (Marketing, Underwriting, Store Operations etc.) spend a surprisingly large amount of time preparing routine reports for managers.  (For example, an insurance client asked my firm to perform an assessment of actual work performed by a group of more than one hundred SAS users.  The client was astonished to learn that 80% of the SAS usage could be done in Cognos, which the client also owned.)

In some cases, this is simply due to a lack of investment by the organization in the necessary tools and enablers, a problem that is easily fixed.  More often than not, though, the root cause is the absence of consensus within the function of what is to be measured and how performance should be compared across entities.   In organizations that lack measurement discipline, assessment is a free-for-all where individual program and product managers seek out customized reports that show their program or product to the best advantage; in this environment, every program or product is a winner and analytics lose credibility with management.  There is no technical “fix” for this problem; it takes leadership for management to set out clear goals for the organization and build consensus for an assessment framework.

Functional analysts often complain that they spend so much time preparing routine reports that they have little or no time to perform analytics that optimize the performance of existing entities.  Optimization technology is not new, but tends to be used more pervasively in Operational Analytics (which I will discuss in the next post in this series).   Functionally focused optimization tools for management decisions have been available for well over a decade, but adoption is limited for several reasons:

  • First, an organization stuck in the “ad hoc” trap described in the previous paragraph will never build the kind of history needed to optimize anything.
  • Second, managers at this level tend to be overly optimistic about the value of their own judgment in business decisions, and resist efforts to replace intuitive judgment with systematic and metrics-based optimization.
  • Finally, in areas such as Marketing Mix decisions, constrained optimization necessarily means choosing one entity over another for resources; this is inherently a leadership decision, so unless functional leadership understands and buys into the optimization approach it will not be used.
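
To show why constrained optimization forces a choice between entities, here is a toy Python sketch that splits a fixed budget across programs.  It assumes, purely for illustration, that each program's revenue follows a diminishing-returns curve of the form scale * sqrt(spend); the function name, the curve and the numbers are hypothetical, and a real marketing mix model would estimate response curves from history.

```python
from math import sqrt


def allocate_budget(response_scale, total_budget, step):
    """Greedy allocation under an assumed curve revenue = scale * sqrt(spend).

    Each increment of budget goes to the program with the largest
    marginal revenue, so funding one program means not funding another.
    """
    spend = {name: 0.0 for name in response_scale}
    remaining = total_budget
    while remaining >= step:
        def gain(name):
            current = spend[name]
            return response_scale[name] * (sqrt(current + step) - sqrt(current))

        best = max(spend, key=gain)
        spend[best] += step
        remaining -= step
    return spend


# Hypothetical programs: email responds three times as strongly as search.
print(allocate_budget({"email": 3.0, "search": 1.0}, total_budget=100, step=10))
```

Under these assumptions most of the budget flows to the stronger program, which is exactly the kind of outcome functional leadership has to accept before such a tool will be used.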

Analytics for planning and developing new entities (such as programs, products or stores) usually require information from outside of the organization, and may also require skills not present in existing staff.  For both reasons, analytics for this purpose are often outsourced to providers with access to pertinent skills and data.  For analysts inside the organization, technical requirements look a lot like those for Strategic Analytics: the ability to rapidly ingest data from any source combined with a flexible and agile programming environment and functional support for a wide range of generic analytic problems.

In the next post in this series, I’ll cover Operational Analytics, defined as analytics whose purpose is to improve the efficiency or effectiveness of a business process.

Analytic Applications (Part One)

Conversations about analytics tend to get muddled because the word describes everything from a simple SQL query to climate forecasting.  There are several different ways to classify analytic methods, but in this post I propose a taxonomy of analytics based on how the results are used.

Before we can define enterprise best practices for analytics, we need to understand how they add value to the organization.  One should not lump all analytics together because, as I will show, the generic analytic applications have fundamentally different requirements for people, processes and tooling.

There are four generic analytic applications:

  • Strategic Analytics
  • Managerial Analytics
  • Operational Analytics
  • Customer-Enabling Analytics

In today’s post, I’ll address Strategic Analytics; the rest I’ll cover in subsequent posts.

Strategic Analytics directly address the needs of the C-suite.  This includes answering non-repeatable questions, performing root-cause analysis and supporting make-or-break decisions (among other things).   Some examples:

  • “How will Hurricane Sandy impact our branch banks?”
  • “Why does our top-selling SUV turn over so often?”
  • “How will a merger with XYZ Co. impact our business?”

Strategic issues are inherently not repeatable and fall outside of existing policy; otherwise the issue would be delegated.   Issues are often tinged with a sense of urgency, and a need for maximum credibility; when a strategic decision must be taken, time is of the essence, and the numbers must add up.   Answers to strategic questions frequently require data that is not readily accessible and may be outside of the organization.

Conventional business intelligence systems do not address the needs of Strategic Analytics, due to the ad hoc and sui generis nature of the questions and supporting data requirements.   This does not mean that such systems add no value to the organization; in practice, the enterprise BI system may be the first place an analyst will go to seek an answer.  But no matter how good the enterprise BI system is, it will never be sufficiently complete to provide all of the answers needed by the C-suite.

The analyst is key to the success of Strategic Analytics.  This type of work tends to attract the best and most capable analysts, who are able to work rapidly and accurately under pressure.  Backgrounds tend to be eclectic: an insurance company I’ve worked with, for example, has a strategic analysis team that includes an anthropologist, an economist, an epidemiologist and a graduate of the local community college who worked her way up in the Claims Department.

Successful strategic analysts develop domain, business and organizational expertise that lends credibility to their work.  Above all, the strategic analyst takes a skeptical approach to the data, and demonstrates the necessary drive and initiative to get answers.  This often means doing hard stuff, such as working with programming tools and granular data to get to the bottom of a problem.

More often than not, the most important contribution of the IT organization to Strategic Analytics is to stay out of the way.  Conventional IT production standards are a bug, not a feature, in this kind of work, where the sandbox environment is the production environment.  Smart IT organizations recognize this, and allow the strategic analysts some latitude in how they organize and manage data.   Dumb IT organizations try to force the strategic analysis team into a “Production” framework.  This simply inhibits agility, and encourages top executives to outsource strategic issues to outside consultants.

Analytic tooling tends to reflect the diverse backgrounds of the analysts, and can be all over the map.  Strategic analysts use SAS, R, Stata, StatSoft, or whatever to do the work, and drop the results into PowerPoint.  One of the best strategy analysts I’ve ever worked with used nothing other than SQL and Excel.  Since strategic analysis teams tend to be small, there is little value in demanding use of a single tool set; moreover, most strategic analysts want to use the best tool for the job, and prefer to use niche tools that are optimized for a single problem.

The most important common requirement is the capability to rapidly ingest and organize data from any source and in any format.  For many organizations, this has historically meant using SAS.  (A surprisingly large number of analytic teams use SAS to ingest and organize the data, but perform the actual analysis using other tools).    Growing data volumes, however, pose a performance challenge for the conventional SAS architecture, so analytic teams increasingly look to data warehouse appliances like IBM Netezza, to Hadoop, or a combination of the two.

In the next post, I’ll cover Managerial Analytics, which includes analytics designed to monitor and optimize the performance of programs and products.

Recent Books on Analytics

For your Christmas gift list,  here is a brief roundup of four recently published books on analytics.

Business Intelligence in Plain Language by Jeremy Kolb (Kindle Edition only) is a straightforward and readable summary of conventional wisdom about Business Intelligence.  Unlike many guides to BI, this book devotes some time and attention to data mining.  As an overview, however, Mr. Kolb devotes too little attention to the most commonly used techniques in predictive analytics, and too much attention to more exotic methods.  There is nothing wrong with this per se, but given the author’s conventional approach to implementation it seems eccentric.  At $6.99, though, even an imperfect book is a pretty good value.

Tom Davenport’s original Harvard Business Review article Competing on Analytics is one of the ten most-read articles in HBR’s history; Google Trends shows a spike in search activity for the term “analytics” concurrent with its publication, and steady growth in interest since then.  Mr. Davenport’s latest book Enterprise Analytics: Optimize Performance, Process, and Decisions Through Big Data is a collection of essays by Mr. Davenport and members of the International Institute of Analytics, a commercial research organization funded in part by SAS.   (Not coincidentally, SAS is the most frequently mentioned analytics vendor in the book.)  Mr. Davenport defines enterprise analytics in the negative, e.g. not “sequestered into several small pockets of an organization — market research, or actuarial or quality management”.    Ironically, though, the best essays in this book are about narrowly focused applications, while the worst essay, The Return on Investments in Analytics, is little more than a capital budgeting primer for first-year MBA students, with the word “analytics” inserted.  This book would benefit from a better definition of enterprise analytics, the value of “unsequestering” analytics from departmental silos, and more guidance on exactly how to make that happen.

Jean-Paul Isson and Jesse Harriott have hit a home run with Win with Advanced Business Analytics: Creating Business Value from Your Data, an excellent survey of the world of Business Analytics.   This book combines an overview of traditional topics in business analytics (with a practical “what works/what does not work” perspective) with timely chapters on emerging areas such as social media analytics, mobile analytics and the analysis of unstructured data.  A valuable contribution to the business library.

The “analytical leaders” featured in Wayne Eckerson’s  Secrets of Analytical Leaders: Insights from Information Insiders — Eric Colson, Dan Ingle, Tim Leonard, Amy O’Connor, Ken Rudin, Darren Taylor and Kurt Thearling — are executives who have actually done this stuff, which distinguishes them from many of those who write and speak about analytics.  The practical focus of this book is apparent from its organization — departing from the conventional wisdom of how to talk about analytics, Eckerson focuses on how to get an analytics initiative rolling, and keep it rolling.  Thus, we read about how to get executive support for an analytics program, how to gain momentum, how to hire, train and develop analysts, and so forth.  Instead of writing about “enterprise analytics” from a top-down perspective, Eckerson writes about how to deploy analytics in an enterprise — which is the real problem that executives need to solve.