2015: Predictions for Big Analytics
First, a review of last year’s predictions:
(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.
At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon. At my lunch table, every single person said his company is currently evaluating Spark. There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.
(2) “Co-location” will be the latest buzzword.
Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop. YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.
(3) Graph engines will be hot.
Graph engines did not take off in 2014. Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production. However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.
(4) R approaches parity with SAS in the commercial job market.
As of early 2014, when Bob Muenchen last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.
Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists. I asked Linda what analytic languages hiring managers seek when they hire quants. “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says. “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion.”
(5) SAP emerges as the company most likely to buy SAS.
After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014. The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.
(6) Competition heats up for “easy to use” predictive analytics.
Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch. Alpine, Alteryx, and RapidMiner all gained market presence in 2014; Dell’s acquisition of StatSoft gives StatSoft the deep pockets it needs for a makeover. In easy-to-use cloud analytics, StatWing has added functionality, and IBM Watson Analytics emerged from beta.
Four out of six ain’t bad. Now looking ahead:
(1) Apache Spark usage will explode.
While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists. That will change in 2015, for several reasons:
- The R interface planned for release in Q1 opens the platform to a large and engaged community of users
- Alteryx, Alpine and other easy-to-use analytics tools currently support or plan to support Spark RDDs as a data source
- Databricks Cloud offers an easy way to spin up a Spark cluster
As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.
(2) Analytics in the cloud will take off.
Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud. And yet, all of the top ten data breaches in 2014 defeated an on-premises security system. Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.
Cloud is eating the analytics world for three big reasons:
- Analytic workloads tend to be lumpy and difficult to predict
- Analytic projects often need to get up and running quickly
- Analytic service providers operate in a variable cost world, with limited capital for infrastructure
Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others. For the business user, StatWing and IBM Watson Analytics provide compelling independent cloud-based platforms.
Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services. Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS, lacks a native access engine for Redshift, and does not support its Hadoop interface with EMR.
(3) Python will continue to gain on R as the preferred open source analytics platform.
The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome. As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base. For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up. According to a recent poll by KDnuggets, more people switch from R to Python than the other way ’round.
Both languages have their virtues. The sheer volume of analytic features in R is much greater than in Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge. Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.
Python has two key advantages over R. First, as a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R. Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use. There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.
(4) H2O will continue to win respect and customers in the Big Analytics market.
If you’re interested in scalable analytics but haven’t checked out H2O, you should. H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau. H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.
While the software is freely available, H2O offers support and services for an attractive price. The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and PayPal.
(5) SAS customers will continue to seek alternatives.
SAS once had an almost religious loyalty from its customers. This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software. While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration. While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.
While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives. This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important. Even among big banks and pharma companies, though, SAS user headcount is declining.