Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week beating the drum for its partnership with Hewlett-Packard Enterprise.  The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow.  We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.


In The Morning Paper, Adrian Colyer explains trajectory data mining.

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates all on the state of the Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender on Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.


We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from Capital One, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and gives it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL and offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.

2014 Predictions: Advanced Analytics

A few predictions for the coming year.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation.  Organizations will increasingly question the value of “point solutions” for Hadoop analytics versus Spark’s integrated platform for machine learning, streaming, graph engines and fast queries.

At least one commercial software vendor will release software using Spark as a foundation.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

(2) “Co-location” will be the latest buzzword.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.

YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  In practice, that means it will be possible to stand up co-located server-based analytics (e.g. SAS) on a few nodes with expanded memory inside Hadoop.  This asymmetric architecture adds some latency (since data moves from the HDFS data nodes to the analytic nodes), but not as much as when data moves outside of Hadoop entirely.  For most analytic use cases, the cost of data movement will be more than offset by the improved performance of in-memory iterative processing.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.


(3) Graph engines will be hot.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security, rely on graph engines for graph-parallel analytics.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

(4) R approaches parity with SAS in the commercial job market.

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  Job postings for R programmers, however, are growing rapidly, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

(5) SAP emerges as the company most likely to buy SAS.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.  A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely than the others to agree to Jim Goodnight’s inflated price and terms.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

(6) Competition heats up for “easy to use” predictive analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and StatSoft’s Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

What’s Next for SAS?

First, some background.

— SAS is a privately held company.  Founder and CEO Jim Goodnight owns a controlling interest.

— Goodnight is 71 years old.

— Goodnight’s children are not engaged in management of the business.

Within the next few years, SAS faces a dual transition of management and ownership.   This should be a concern for customers and prospective customers; due to SAS’ proprietary architecture, building on the SAS platform necessarily means a long-term bet on the future of the company.  Suppose, for example, that IBM acquires SAS: will SAS continue to support interfaces to Oracle and Teradata?

Succession is a problem for any business;  it is especially so for a founder-managed business, where ownership must change as well as management.   Goodnight may be interested in SAS as a going concern, but his heirs are more likely to want its cash value, especially when the IRS calls to collect estate taxes.

Large founder-managed firms typically struggle with two key issues.  First, the standards of corporate governance in public companies differ markedly from those that apply to private companies.  The founder’s personal business may be closely intermingled with corporate business in a manner that is not acceptable in a public company.

For example, suppose (hypothetically) that Goodnight or one of his personal entities owns the land occupied by SAS headquarters in Cary, North Carolina; as a transaction between related parties, such a relationship is problematic for a public company.   Such interests must be unwound before an IPO or sale to a public company can proceed; failure to do so can lead to serious consequences, as the Rigas brothers discovered when Adelphia Communications went public.

The other key issue is that founders may clash with senior executives who demonstrate independent thought and leadership.  Over the past fifteen years, a number of strong executives with industry and public company experience have joined SAS through acquisition or hire; most exited within two years.  The present SAS management team consists primarily of long-term SAS employees whose leadership skills are well adapted to survival under Goodnight’s management style.  How well this management team will perform when out from under Goodnight is anyone’s guess.

SAS flirted with an IPO in 1999, at the height of the tech-driven stock market boom, and hired ex-Oracle executive Andre Boisvert as COO to lead the transition.  Preparations for the IPO proceeded slowly; Boisvert clashed with Goodnight and left.  SAS shelved the IPO soon thereafter.

Subsequent to this episode, Goodnight told USA Today that talk about an IPO was never serious, that he had pursued an IPO for the benefit of the employees, and abandoned the move because employees were against it.    In the story, USA Today noted that this claim appeared to be at odds with Goodnight’s previous public statements.  The reader is left to wonder whether the real reason has something to do with Goodnight’s personal finances, or if he simply did not want to let go of the company.  In any case, it’s not surprising that many SAS employees opposed an IPO, since Boisvert reportedly told employees at a company meeting that headcount reduction would follow public ownership.

Since then, there have been opportunities to sell the company in whole or in part.  IBM tried to acquire the company twice.  Acquisition by IBM makes a lot of sense: SAS built its business on the strength of its IBM technology partnership, and it still earns a large share of its revenue from software running on IBM hardware.  Both companies have a conservative approach to technology, preferring to wait until innovations are proven before introducing them to blue chip customers.

But Goodnight rebuffed IBM’s overtures and bragged about doing so, claiming an exaggerated value for SAS of $20 billion, around ten times sales at the time.  It’s not unknown for two parties to disagree about the value of a company.   But according to a SAS insider, Goodnight demanded that IBM agree to his price “without due diligence”, which no acquiring company can ever agree to do.  That seems like the behavior of a man who simply does not want to sell to anyone, under any circumstances.

Is SAS really worth ten times revenue?  Certainly not.  SAS’ compound annual revenue growth rate over the past twenty years is around 10%, which suggests a revenue multiplier of a little under 4X at current valuations (see graph below).  Of course, that assumes SAS’ past revenue growth rate is a good indicator of its future growth, which is a stretch when you consider the saturation of its market, increased competition and limited customer response to “game-changing” new products.

[Graph: software industry revenue growth rates vs. revenue multiples.  Source: Yahoo Finance; market capitalization and revenue for publicly owned software companies.]

One obstacle to sale of the company is Goodnight’s stated unwillingness to sell to buyers who might cut headcount.  SAS’ company culture is the subject of business school case studies and the like, but the unfortunate truth is that SAS’ revenue per employee badly lags the IT industry, as shown in the table below.  SAS appears to be significantly overstaffed relative to revenue compared to other companies in the industry, and markedly so compared to any likely acquirer.

[Table: revenue per employee, SAS vs. selected IT companies.  Source: Yahoo Finance; SAS website.]

One could speculate about the causes of this relatively low revenue per employee — I won’t — but an acquiring company will expect this to improve.  Flogging the business for more sales seems like pushing on a string — according to company insiders, SAS employs more people in its Marketing organization than in its Research and Development organization.  An acquirer will likely examine SAS’ product line, which consists of a few strong performers — the “Legacy” SAS software, such as Base and STAT — and a long list of other products, many of which do not seem to be widely used.  Rationalization of the SAS product line — and corresponding headcount — will likely be Job One for an acquirer.

So what’s ahead for SAS?

One option: Goodnight can simply donate his ownership interest in SAS to a charitable trust, which would continue to manage the business much the way Hershey Trust manages Hershey Foods.   This option would be least disruptive to customers and employees, and the current management team would likely stay in place (if the Board is stacked with insiders, locals and friends).    It’s anyone’s guess how likely this is; such a move would be consistent with Goodnight’s public statements about philanthropy, but unlike Larry Ellison, Goodnight hasn’t signed Warren Buffett’s Giving Pledge.

But if Goodnight needs the cash, or wants his heirs to inherit something, a buyer must be found.  Another plausible option consistent with Goodnight’s belief in the virtues of private ownership would be a private equity led buyout.  The problem here is that while private equity investors might be willing to put up with either low sales growth or low employee productivity, they won’t tolerate both at the same time.    A private equity investor would likely treat the Legacy SAS software as a cash cow, kill off or spin off the remaining products, and shed assets.   The rock collection and the culinary farm will be among the first to go.

There are a limited number of potential corporate buyers.  IBM, H-P, Oracle, Dell and Intel all sell hardware that supports SAS software, and all have a vested interest in SAS, but it seems unlikely that any of these will step up and buy the company.   Twice rebuffed, IBM has moved on from SAS, reporting double-digit growth in business analytics revenue while SAS struggles to put up single digits.   H-P and Dell have other issues at the moment.  Oracle could easily put up $10 billion in cash to buy SAS, and Oracle’s analytic story would benefit if SAS were added to the mix, but I suspect that Oracle doesn’t think it needs a better analytics story.

SAP has the resources to acquire SAS; a weak dollar favors acquirers from outside of the United States.  Such a transaction would add to SAP’s credibility in analytics, which isn’t strong (the recently announced acquisition of KXEN notwithstanding).   Until recently, there was no formal partnership between the two companies, and SAS executives spent the better part of the last SAS Global Forum strutting around the stage sniping at SAP HANA.  It will be interesting to see how this alliance develops.


A reader on Twitter asks: what about employee ownership?  Well, yes, but if Goodnight wants to sell the company, the employees would need to come up with the market price of $10-11 billion.  That works out to about $750,000 for each employee.  There are investors who would consider lending the capital necessary for an employee-led buyout, but they would subject the business and its management to the same level of scrutiny as an independent buyer.
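The arithmetic behind that per-employee figure is easy to check.  The $10-11 billion price comes from the discussion above; the headcount of roughly 14,000 is my assumption for illustration, not a figure from this post:

```python
# Back-of-the-envelope check: price per employee in an employee-led buyout.
price_low, price_high = 10e9, 11e9   # $10-11 billion, per the post
headcount = 14_000                   # assumed SAS headcount; illustrative only

midpoint_per_employee = (price_low + price_high) / 2 / headcount
print(f"${midpoint_per_employee:,.0f} per employee")  # $750,000 per employee
```

At any plausible headcount in that neighborhood, each employee would need to raise roughly three-quarters of a million dollars, which is why outside financing would be unavoidable.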

SAS Visual Analytics: FAQ (Updated 1/2014)

SAS charged its sales force with selling 2,000 licenses for Visual Analytics in 2013; the jury is still out on whether they met this target.  There’s lots of marketing action lately from SAS about this product, so here’s an FAQ.

Update:  SAS recently announced 1,400 sites licensed for Visual Analytics.  In SAS lingo, a site corresponds roughly to one machine, but one license can include multiple sites, so the actual number of licenses sold in 2013 is less than 1,400.  In April 2013 SAS executives claimed two hundred customers for the product.  In contrast, Tableau reports that it added seven thousand customers in 2013, bringing its total customer count to 17,000.

What is SAS Visual Analytics?

Visual Analytics is an in-memory visualization and reporting tool.

What does Visual Analytics do?

SAS Visual Analytics creates reports and graphs that are visually compelling.  You can view them on mobile devices.

VA is now in its fifth dot release.  Why do they call it Release 6.3?

SAS Worldwide Marketing thinks that if they call it Release 6.3, you will think it’s a mature product.  It’s one of the games software companies play.

Is Visual Analytics an in-memory database, like SAP HANA?

No.  HANA is a standards-based in-memory database that runs on many different brands of hardware and supports a range of end-user tools.  VA is a proprietary architecture available on a limited choice of hardware platforms.  It cannot support anything other than the end-user applications SAS chooses to develop.

What does VA compete with?

SAS claims that Visual Analytics competes with Tableau, Qlikview and Spotfire.  Internally, SAS leadership refers to the product as its “Tableau-killer” but as the reader can see from the update at the top of this page, Tableau is alive and well.

How well does it compare?

You will have to decide for yourself whether VA reports are prettier than those produced by Tableau, Qlikview or Spotfire.  On paper, Tableau has more functionality.

VA runs in memory.  Does that make it better than conventional BI?

All analytic applications perform computations in memory.  Tableau runs in memory, and so does Base SAS.   There’s nothing unique about that.

What makes VA different from conventional BI applications is that it loads the entire fact table into memory.  By contrast, BI applications like Tableau query a back-end database to retrieve the necessary data, then perform computations on the result set.

Performance of a conventional BI application depends on how fast the back-end database can retrieve the data.  With a high-performance database, performance is excellent, but in most cases it won’t be as fast as it would be if the data were held in memory.
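The contrast between the two patterns can be sketched in a few lines.  This is a toy example: SQLite stands in for the back-end database, and the table and column names are invented for illustration:

```python
import sqlite3

# A tiny in-memory fact table standing in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 75.0)])

# Conventional BI pattern: push the aggregation down to the database
# and retrieve only the small result set.
result_set = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# In-memory pattern (VA-style): pull the entire fact table into memory
# first, then compute locally.
fact_table = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in fact_table:
    totals[region] = totals.get(region, 0.0) + amount

print(result_set)              # [('East', 150.0), ('West', 75.0)]
print(sorted(totals.items()))  # [('East', 150.0), ('West', 75.0)]
```

Either path produces the same answer; the difference is where the work happens and how much data moves: the first query ships two summary rows, while the second ships the whole fact table.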

So VA is faster?  Is there a downside?

There are two.

First, since conventional BI systems don’t need to load the entire fact table into memory, they can support much larger datastores.  The largest H-P ProLiant box for VA maxes out at about 10 terabytes; the smallest Netezza appliance supports 30 terabytes and scales to petabytes.

The other downside is cost; memory is still much more expensive than other forms of storage, and the machines that host VA are far more expensive than data warehouse appliances that can host far more data.

VA is for Big Data, right?

SAS and H-P appear to be having trouble selling VA in larger sizes, and are positioning a small version that can handle 75-100 Gigabytes of data.  That’s tiny.

The public references SAS has announced for this product don’t seem particularly large.  See below.

How does data get into VA?

VA can load data from a relational database or from a proprietary SASHDAT file.  SAS cautions that loading data from a relational database is only a realistic option when VA is co-located in a Teradata Model 720 or Greenplum DCA appliance.

To use SASHDAT files, you must first create them using SAS.

Does VA work with unstructured data?

VA works with structured data, so unstructured data must be structured first, then loaded either to a co-located relational database or to SAS’ proprietary SASHDAT format.

Unlike products like Datameer or IBM BigSheets, VA does not support “schema on read”, and it lacks built-in tools for parsing unstructured text.

But wait, SAS says VA works with Hadoop.  What’s up with that?

A bit of Marketing sleight of hand.  VA can load SASHDAT files that are stored in the Hadoop File System (HDFS); but first, you have to process the data in SAS, then load it back into HDFS.  In other words, you can’t visualize and write reports from the data that streams in from machine-generated sources — the kind of live BI that makes Hadoop really cool.  You have to batch the data, parse it, structure it, then load it with SAS to VA’s staging area.

Can VA work with streaming data?

SAS sells tools that can capture streaming data and load it to a VA data source, but VA works with structured data at rest only.

With VA, can my users track events in real time?

Don’t bet on it.  To be usable, data requires significant pre-processing before it is loaded into VA’s memory.  Moreover, once the data is loaded it can’t be updated; updating the data in VA requires a full truncate and reload.  Thus, however fast VA is in responding to user requests, your users won’t be tracking clicks on their iPads in real time; they will be looking at yesterday’s data.

Does VA do predictive analytics?

Visual Analytics 6.1 can perform correlation, fit bivariate trend lines to plots and do simple forecasting.  That’s no better than Tableau; in fact, surprisingly given the hype, Tableau supports more analysis functions.

While SAS claims that VA is better than SAP HANA because “HANA is just a database”, the reality is that SAP supports more analytics through its Predictive Analytics Library than SAS supports in VA.

Has anyone purchased VA?

A SAS executive claimed 200 customers in early 2013, a figure that should be taken with a grain of salt.  If there are that many customers for this product, they are hiding.

There are five public references, all of them outside the US:

SAS has also recently announced selection (but not implementation) by

OfficeMax has also purchased the product, according to this SAS blog.

As of January 2014, the four customers who announced selection or purchase are not cited as reference customers.

What about implementation?  This is an appliance, right?

Wrong.  SAS considers an implementation that takes a month to be wildly successful.  Implementation tasks include the same tasks you would see in any other BI project, such as data requirements, data modeling, ETL construction and so forth.  All of the back end feeds must be built to put data into a format that VA can load.

Bottom line, does it make sense to buy SAS Visual Analytics?

Again, you will have to decide for yourself whether the SAS VA reports look better than Tableau or the many other options in this space.  BI beauty shows are inherently subjective.

You should also demand that SAS prove its claims to performance in a competitive POC.  Despite the theoretical advantage of an in-memory architecture, actual performance is influenced by many factors.  Visitors to the recent Gartner BI Summit who witnessed a demo were unimpressed; one described it to me as “dog slow”.  She didn’t mean that as a compliment.

The high cost of in-memory platforms means that VA and its supporting hardware will be much more expensive for any given quantity of data than Tableau or equivalent products. Moreover, its proprietary architecture means you will be stuck with a BI silo in your organization unless you are willing to make SAS your exclusive BI provider.  That makes this product very good for SAS; the question is whether it is good for you.

The early adopters for this product appear to be very SAS-centric organizations (with significant prior SAS investment).  They also appear to be fairly small.  If you have very little data, money to burn and are willing to experiment with a relatively new product, VA may be for you.

SAS and H-P Close the Curtains

Michael Kinsley wrote:

It used to be, there was truth and there was falsehood. Now there is spin and there are gaffes. Spin is often thought to be synonymous with falsehood or lying, but more accurately it is indifference to the truth. A politician engaged in spin is saying what he or she wishes were true, and sometimes, by coincidence, it is. Meanwhile, a gaffe, it has been said, is when a politician tells the truth — or more precisely, when he or she accidentally reveals something truthful about what is going on in his or her head. A gaffe is what happens when the spin breaks down.

Hence, a Kinsley gaffe means “accidentally telling the truth”.

Back in April, an H-P engineer committed a Kinsley gaffe by publishing a white paper that describes in some detail issues encountered by SAS and H-P on implementations of SAS Visual Analytics.  I blogged about this at the time here.

Some choice bits:

— “Needed pre-planning does not occur and the result is weeks to months of frantic activity to address those issues which should and could have been addressed earlier and in a more orderly fashion.”

— “(Data and management networks) are typically overlooked and are the cause of most issues and delays encountered during implementation.”

— “Since a switch with 100s to 1000s of ports is required to achieve the consolidation of network traffic, list price can start at about US$500,000 and be into the millions of dollars.”

And my personal favorite:

— “The potential exists, with even as few as 4 servers, for a Data Storm to occur.”

If you’re wondering what a Data Storm is, let’s just say that it’s not a good thing.

Since I published the blog post, SAS has withdrawn the paper from its website.   This is not too surprising, since every other paper on “SAS and Big Data” is also hidden from view.   Fortunately, I downloaded a copy of the paper for my records.   H-P can claim copyright, so I can’t upload the whole thing, but I’ve attached a few screen shots below so you can see that this paper is real.

You might wonder why SAS feels compelled to keep its “Big Data” stories under wraps.  Keep in mind that we’re not talking about software design or any other intellectual property that warrants protection; in this case, the vendors don’t want you to know the truth about implementation because it conflicts with the hype.  As the paper’s author puts it, “this sounds very scary and expensive.”  “Very scary” and “expensive” don’t mix with “buy this product now.”

If you’re evaluating SAS Visual Analytics ask your SAS rep for a copy of Paper 466-2013.  And ask if they’ve done anything about those Data Storms.