2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014. I thought it would be fun to check how well the crystal ball is working.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with product certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite. But the Mahout team has announced that all new development must use a standard DSL that runs jobs on Spark.
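Mahout’s new DSL is beyond the scope of this post, but for a sense of what “running machine learning jobs on Spark” means in practice, here is a minimal logistic regression sketch against the Spark 1.x MLlib API. The data path, file layout and application name are hypothetical; the point is simply that the model trains directly against data in HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

object ChurnModel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ChurnModel"))

    // Each line: label followed by comma-separated numeric features (hypothetical layout).
    val points = sc.textFile("hdfs:///data/churn/training.csv").map { line =>
      val cols = line.split(',').map(_.toDouble)
      LabeledPoint(cols.head, Vectors.dense(cols.tail))
    }.cache()

    // Train directly against data in HDFS; no extract-and-move step.
    val model = LogisticRegressionWithSGD.train(points, 100)
    println("Model weights: " + model.weights)
    sc.stop()
  }
}
```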

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Simply co-locating analytics in the Hadoop cluster turns out to be much less attractive than integrating analytics with Hadoop. With Spark fully integrated with Hadoop’s storage APIs, co-located solutions have lost much of their appeal.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic. No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security, rely on graph engines for graph-parallel analytics.

Graph analysis is genuinely useful in the right hands, but organizations are still trying to figure out what to do with it. That is why we still see posts like this one: when something is truly hot, nobody writes articles explaining what to do with it, because everyone already knows.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and the method tends to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops. Tez isn’t really comparable to Giraph and GraphLab. And right after I wrote this, the GraphLab open source project pretty much died. GraphLab Inc., the commercial venture founded to commercialize the open source project, is fiddling around with other stuff. Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.
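For readers who want to kick the tires, a minimal GraphX sketch looks like the following. The HDFS path is hypothetical and an existing SparkContext is assumed; the calls are from the Spark 1.x GraphX API.

```scala
import org.apache.spark.graphx.GraphLoader

// Assumes an existing SparkContext named sc.
// Load an edge list (one "srcId dstId" pair per line) from HDFS.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/graph/edges.txt")

// Run PageRank to convergence, then list the ten highest-ranked vertices.
val ranks = graph.pageRank(0.0001).vertices
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)
```

The appeal here is less the algorithm itself than the fact that the graph computation shares a runtime, and a dataset, with everything else running in Spark.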

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchen updates his analysis for 2014. But the trend is your friend:

[Figure: R vs. SAS job trends, February 2014]

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings. However, job postings for R programmers are growing rapidly, while SAS postings are declining. New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

When speaking with enterprise customers, I like to ask why they switched from SAS to R. The #1 response: the people we hire already know R, not SAS. SAS’ free “University Edition” is an attempt to stem the bleeding, one that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs. SAS has delivered an ACCESS engine for HANA, but not much else considering all the talk about joint solutions. SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, an improvement over 2013, when it didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate. But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale. SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray. Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.

Machine Learning in Hadoop: Part One

Much has changed since I last blogged on this subject a year ago (here and here).  This is the first of a three-part blog covering the current state of play for machine learning in Hadoop.  I use the term “machine learning” deliberately, to refer to tools that can learn from data in an automated or semi-automated manner; this includes traditional statistical modeling plus supervised and unsupervised machine learning.  For convenience, I will not cover fast query tools, BI applications, graph engines or streaming analytics; all of those are important, and deserve separate treatment.

Every analytics vendor claims the ability to work with Hadoop.  In Part One, we cover five things to consider when evaluating how well a particular machine learning tool integrates with Hadoop: deployment topology, hardware requirements, workload integration, data integration, and the user interface.  Of course, these are not the only things an organization should consider when evaluating software; other features, such as support for specific analytic methods, required authentication protocols and other needs specific to the organization may be decisive.

Deployment Topology

Where does the machine learning software reside relative to the Hadoop TaskTracker and Data Nodes (“worker nodes”)?  Is it (a) distributed among the Hadoop worker nodes; (b) deployed on special purpose “analytic” nodes or (c) deployed outside the Hadoop cluster?

Distribution among the worker nodes offers the best performance; under any other topology, data movement will impair performance.  If end users tend to work with relatively small snippets of data sampled from the data store, “beside” architectures may be acceptable, but fully distributed deployment is essential for very large datasets.

Deployment on special purpose “analytic” nodes is a compromise architecture, usually motivated either by a desire to reduce software licensing fees or avoid hardware upgrades for worker node servers.  There is nothing wrong with saving money, but clients should not be surprised if performance suffers under anything other than a fully distributed architecture.

Hardware Requirements

If the machine learning software supports distributed deployment on the Hadoop worker nodes, can it run effectively on standard Hadoop node servers?  The definition of a “standard” node server is a moving target; Cloudera, for example, recognizes that the appropriate hardware spec depends on planned workload.  Machine learning, as a rule, benefits from a high memory spec, but some machine learning software tools are more efficient than others in the way they use memory.

Clients are sometimes reluctant to implement a fully distributed machine learning architecture in Hadoop because they do not want to replace or upgrade a large number of node servers.  This reluctance is natural, but the problem is attributable in part to a gap in planning and rapidly changing technology.  Trading off performance for cost reduction may be the right thing to do, but it should be a deliberate decision.

Workload Integration

If the machine learning software can be distributed among the worker nodes, how well does it co-exist with other MapReduce and non-MapReduce applications?  The gold standard is the ability to run under Apache YARN, which supports resource management across MapReduce and non-MapReduce applications.   Machine learning software that pushes commands down to MapReduce is also acceptable, since the generated MapReduce jobs run under existing Hadoop workload management.

Software that effectively takes over the Hadoop cluster and prevents other jobs from running is only acceptable if the cluster will be dedicated to the machine learning application.   This is not completely unreasonable if the Hadoop cluster replaces a conventional standalone analytic server and file system; the TCO for a Hadoop cluster is very favorable relative to a dedicated high-end analytic server.  Obviously, clients should know how they plan to use the cluster when considering this.
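As a concrete illustration, here is roughly what submitting an analytic job under YARN looks like from Spark. The application name, executor counts and memory settings are hypothetical; the configuration keys are standard Spark-on-YARN properties.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A nightly scoring job submitted under YARN, so the ResourceManager can
// arbitrate its containers alongside MapReduce and other applications.
val conf = new SparkConf()
  .setAppName("NightlyScoring")
  .setMaster("yarn-client")              // "yarn-cluster" for detached batch submission
  .set("spark.executor.instances", "10") // containers requested from YARN
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
val sc = new SparkContext(conf)
```

The key point is that the cluster’s scheduler, not the analytic tool, ultimately decides how much memory and how many cores the job receives.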

Data Integration

Ideally, machine learning software should be able to work with every data format supported in Hadoop; most machine learning tools are more limited in what they can read and write. The ability to work with uncompressed text in HDFS is table stakes; more sophisticated tools can work with sequence files as well, and support popular compression formats such as Snappy and Bzip/Gzip.  There is also growing interest in use of Apache Avro.   Users may also want to work with data in HBase, Hive or Impala.

There is wide variation in the data formats supported by machine learning software; clients are well advised to tailor assessments to the actual formats they plan to use.
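Using Spark as one concrete example (any tool with native HDFS support behaves similarly), the gap between “table stakes” and broader format support looks something like this. The paths are hypothetical and an existing SparkContext is assumed.

```scala
import org.apache.hadoop.io.{LongWritable, Text}

// Assumes an existing SparkContext named sc.
// Plain or gzip-compressed text is table stakes; the codec is inferred from
// the file extension.
val text = sc.textFile("hdfs:///data/events/*.gz")

// Sequence files require naming the key and value Writable classes.
val seq = sc.sequenceFile("hdfs:///data/events.seq", classOf[LongWritable], classOf[Text])
            .map { case (_, value) => value.toString }
```

Formats such as Avro typically require additional input format libraries, which is exactly the kind of detail worth checking against the formats you actually plan to use.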

User Interface

There are many aspects of the user interface that matter to clients when evaluating software, but here we consider just one aspect:  Does the machine learning software require the user to specify native MapReduce commands, or does it effectively translate user requests to run in Hadoop behind the scenes?

If the user must specify MapReduce, Hive or Pig commands, it raises an obvious question: why not just perform the task directly in MapReduce, Hive or Pig?

In Part Two, we will examine current open source alternatives for machine learning in Hadoop. 

2014 Predictions: Advanced Analytics

A few predictions for the coming year.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation.  Organizations will increasingly question the value of “point solutions” for Hadoop analytics versus Spark’s integrated platform for machine learning, streaming, graph engines and fast queries.

At least one commercial software vendor will release software using Spark as a foundation.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

(2) “Co-location” will be the latest buzzword.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.

YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  In practice, that means it will be possible to stand up co-located server-based analytics (e.g. SAS) on a few nodes with expanded memory inside Hadoop.  This asymmetric architecture adds some latency (since data moves from the HDFS data nodes to the analytic nodes), but not as much as when data moves outside of Hadoop entirely.  For most analytic use cases, the cost of data movement will be more than offset by the improved performance of in-memory iterative processing.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

[Figure: SAS and HDP]

(3) Graph engines will be hot.

Not that long ago, graph engines were exotic. No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security, rely on graph engines for graph-parallel analytics.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

(4) R approaches parity with SAS in the commercial job market.

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings. However, job postings for R programmers are growing rapidly, while SAS postings are declining. New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

(5) SAP emerges as the company most likely to buy SAS.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

(6) Competition heats up for “easy to use” predictive analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.