2015: Predictions for Big Analytics

First, a review of last year’s predictions:

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon.  At my lunch table, every single person said his company is currently evaluating Spark.  There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.

(2) “Co-location” will be the latest buzzword.

Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop.  YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.

(3) Graph engines will be hot.

Graph engines did not take off in 2014.  Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production.  However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.

(4) R approaches parity with SAS in the commercial job market.

As of early 2014, when Bob Muenchen last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.

Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists.  I asked Linda what analytic languages hiring managers seek when they hire quants.  “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says.  “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion.”

(5) SAP emerges as the company most likely to buy SAS.

After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014.  The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.

(6) Competition heats up for “easy to use” predictive analytics.

Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch.  Alpine, Alteryx, and RapidMiner all gained market presence in 2014; Dell’s acquisition of StatSoft gives that company the deep pockets it needs for a makeover.  In easy to use cloud analytics, Statwing has added functionality, and IBM Watson Analytics emerged from beta.

Four out of six ain’t bad.  Now looking ahead:

(1) Apache Spark usage will explode.

While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists.  That will change in 2015, for several reasons:

  • The R interface planned for release in Q1 opens the platform to a large and engaged community of users
  • Alteryx, Alpine and other easy to use analytics tools currently support or plan to support Spark RDDs as a data source
  • Databricks Cloud offers an easy way to spin up a Spark cluster

As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.

(2) Analytics in the cloud will take off.

Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud.  And yet, all of the top ten data breaches in 2014 defeated an on-premises security system.  Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.

Cloud is eating the analytics world for three big reasons:

  • Analytic workloads tend to be lumpy and difficult to predict
  • Analytic projects often need to get up and running quickly
  • Analytic service providers operate in a variable cost world, with limited capital for infrastructure

Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others.  For the business user, Statwing and IBM Watson Analytics provide compelling independent cloud-based platforms.

Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services.  Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS, lacks a native access engine for Redshift, and does not support its Hadoop interface with EMR.

(3) Python will continue to gain on R as the preferred open source analytics platform.

The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome.  As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base.  For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up.  According to a recent poll by KDnuggets, more people switch from R to Python than the other way ’round.

Both languages have their virtues.  The sheer volume of analytic features in R is much greater than in Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge.  Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.

Python has two key advantages over R.  First, as a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R.  Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use.  There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.

(4) H2O will continue to win respect and customers in the Big Analytics market.

If you’re interested in scalable analytics but haven’t checked out H2O, you should.  H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau.  H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.

While the software is freely available, H2O offers support and services for an attractive price.  The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and PayPal.

(5) SAS customers will continue to seek alternatives.

SAS once had an almost religious loyalty from its customers.  This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software.  While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration.  While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.

While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives.  This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important.  Even among big banks and pharma companies, though, SAS user headcount is declining.

Statwing: A Review

In every enterprise that uses analytics, there are a few power users who need the most advanced tools all of the time, and an army of casual users who need to do simple analysis now and then.  For the latter group, cloud-based analytics make perfect sense; users get the tools they need when they need them, and the organization gets out of the business of licensing, hosting, distributing and maintaining infrequently used software.

Statwing launched in 2012, and has recently scored some buzz and funding; it wants to “make your data…dreams come true.”  A review of the service seems timely.

Registration is simple.  No credit card is needed for a trial license; just plug in your email address and go.  Statwing lets you try its Silver plan for fourteen days at no charge; after that, you can pay $25 per month to stay on the Silver plan, upgrade to the Gold plan for $100 per month, or downgrade to the free public plan.  The Silver and Gold plans keep your data private and let you share charts; the Gold plan lets you upload more data.

I tried uploading a few data sets.  The 1998 KDD Cup data was too large for the Silver plan, but a couple of other smaller data sets uploaded quickly, in seconds.

If you don’t have any data to work with, no problem: Statwing offers a few of its own to try:

[Screenshot: Statwing's sample data sets]

Once you select a data set, Statwing displays the variables in your data set on the left, with most of the screen available for your charts.  There is a video tutorial narrated by a robot, which is marginally useful and not really necessary since the application is very intuitive and easy to use.  (Statwing: is it all that expensive to hire someone to read the script?)

[Screenshot: Statwing workspace, with variables listed on the left]

Statwing does two things well: one-way profiles and two-way tests of correlation.  (Statwing claims to do crosstabs, but after watching the video and reading the available help, I can’t figure out how.)

Univariate profiles offer the user a nice graphic and the option to toggle statistical measures on or off:

[Screenshot: univariate profile with toggled statistical measures]

Bivariate analysis gives the user a “plain English” interpretation of the statistical tests, which is helpful.

[Screenshot: bivariate analysis with “plain English” interpretation]

Like any other statistical package, Statwing discovered a statistically significant relationship between two columns of random numbers in one of my test data sets.  This simply illustrates that making analytics “easy” isn’t helpful unless the users have an actual clue about what they are doing:

[Screenshot: significant relationship reported between two columns of random numbers]
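This is expected behavior for any tool that tests at a fixed 95% confidence level: run enough tests on pure noise and roughly one in twenty will come back "significant."  Here is a minimal Python sketch of that effect.  It is my own illustration, not Statwing's method; the function names, the trial counts and the use of the Fisher z-transform as a normal approximation to the correlation test are all assumptions made to keep the example within the standard library.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value for r using the Fisher z-transform.

    This is a normal approximation (an assumption of this sketch),
    not the exact t-test a statistics package would typically run.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(trials=2000, rows=100, alpha=0.05, seed=42):
    """Share of independent random column pairs flagged as significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Two columns of unrelated uniform random numbers.
        xs = [rng.random() for _ in range(rows)]
        ys = [rng.random() for _ in range(rows)]
        if approx_p_value(pearson_r(xs, ys), rows) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"Share of random pairs flagged: {false_positive_rate():.3f}")
```

The share printed should land near `alpha`, which is the point: a "significant" result between two random columns says nothing about the tool and everything about how many relationships the user went looking for.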

Some caveats:

  • Statwing does not handle time series data — a problem since many enterprise users work with time series
  • By default, Statwing treats coded variables as numeric variables; the user can override this, but see my comment about users having a clue
  • Statwing lacks even the most basic tools for data processing, so you will need to prepare your data table in some other tool
  • Significance tests appear to be hard coded at 95% confidence, which is relatively “tight” for commercial work

Overall, this service is well implemented and easy to use.  It does very little that other tools can’t do; for example, if you use SurveyMonkey or a similar tool to conduct an online survey you can simply do the analysis there and forget about Statwing.  Given its limited functionality, Statwing is seriously overpriced.  The Gold plan will run you $1,200 per year ($800 if billed in advance); at that price, there are a number of alternatives that are just as easy to use.

To crack the enterprise market, Statwing will need to add more analytic features to current capabilities and offer enterprise licensing with concurrent user pricing.