Spark Summit 2015: Preliminary Report

So I guess Spark really is enterprise ready.  Nick Heudecker, call your office.

There are several key themes coming from the Summit:

Spark Continues to Mature

Spark and its contributors deserve a round of applause.  Some key measures of growth since the 2014 Summit:

  • Contributor headcount increased from 255 to 730
  • Committed lines of code increased from 175K to 400K

There is increasing evidence of Spark’s scalability:

  • Largest cluster: 8,000 nodes
  • Largest job: 1 petabyte
  • Top streaming intake: 1TB/hour

Project Tungsten aims to make Spark faster and prepare for the next five years; the project has already accomplished significant performance improvements through better use of memory and CPU.

IBM and Spark

IBM drops the big one.  Key bits from the announcement:

  • IBM will build Spark into the core of its analytic and commerce products, including IBM Watson Health Cloud.
  • IBM will open source its machine learning library (SystemML) and work with Databricks to port it to Spark.
  • IBM will offer Spark as a cloud service on Bluemix.
  • IBM will commit 3,500 developers to Spark-related projects.
  • IBM and its partners will train more than a million people on Spark.

I will post separately on this next week.

Spark is Enterprise-Ready

If IBM's announcement is not sufficient to persuade skeptics, presentations from Adobe, Airbnb, Baidu, Capital One, CIA, NASA/JPL, NBC Universal, Netflix, Thomson Reuters, Toyota and many others demonstrate that Spark already supports enterprise-level workloads.

In one of the breakouts, Arsalan Tavakoli-Shiraji of Databricks presented results from his analysis of more than 150 production deployments of Spark.  As expected, organizations use Spark for BI and advanced analytics; the big surprise is that 60% use non-HDFS data sources.  These organizations use Spark for data consolidation on the fly, decoupling compute from storage, with unification taking place on the processing layer.

Databricks Cloud is GA

Enough said.

SparkR

Spark 1.4 includes R bindings, opening Spark to the large community of R users.  Out of the gate, the R interface enables the R user to leverage Spark DataFrames; the Spark team plans to extend the capability to include machine learning APIs in Spark 1.5.

Spark’s Expanding Ecosystem

Every major Hadoop distributor showed up this year, but there were no major announcements from the distributors (other than IBM’s bombshell).

In other developments:

  • Amazon Web Services announced availability of a new Spark on EMR service
  • Intel announced a new Streaming SQL project for Spark
  • Lucidworks showcased its Fusion product, with Spark embedded
  • Alteryx announced its plans to integrate with Spark in its Release 10

One interesting footnote — while there were a number of presentations about Tachyon last year, there were none this year.

These are just the key themes.  I’ll publish a more detailed story next week.

Spark 1.4 Released

On June 11, the Spark team announced availability of Release 1.4.  More than 210 contributors from 70 different organizations contributed more than 1,000 patches.  Spark continues to expand its contributor base, the best measure of health for an open source project.


Spark Core

The Spark team continues to improve Spark operability, performance and compatibility.  Key enhancements include:

  • The first phase in Project Tungsten performance improvements, a cache-friendly sort algorithm
  • Also for improved performance, serialized shuffle output
  • For the Spark UI, visualization for Spark DAGs and operational monitoring
  • A REST API for application information, such as job, stage, task and storage status (see the sketch after this list)
  • For Python users, support for Python 3.x, plus external spilling for Python groupByKey operations
  • Two YARN enhancements: support for YARN on EC2 and security for long-running YARN applications
  • Two Mesos enhancements: Docker support and cluster mode
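The REST API makes it straightforward to pull application status programmatically.  A minimal sketch in Python against a locally running application, assuming Spark's default UI port:

    import requests

    # Query the v1 monitoring API of a running application;
    # localhost:4040 is Spark's default application UI address
    base = "http://localhost:4040/api/v1"
    for app in requests.get(base + "/applications").json():
        jobs = requests.get("%s/applications/%s/jobs" % (base, app["id"])).json()
        print(app["name"], "-", len(jobs), "jobs")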

DataFrames and SQL

This release extends the analytic functions available for DataFrames, adds operational utilities for Spark SQL, and adds support for the ORCFile format.

A complete list of enhancements to the DataFrame API is here.
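Window functions are the most visible of the new analytic functions.  A minimal PySpark sketch, computing a per-customer moving average over a hypothetical transactions DataFrame (the column names and sample rows are illustrative):

    from pyspark.sql import HiveContext
    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    # Window functions require HiveContext in the 1.x line;
    # sc is an existing SparkContext
    sqlContext = HiveContext(sc)
    df = sqlContext.createDataFrame(
        [("alice", "2015-06-01", 20.0), ("alice", "2015-06-02", 40.0),
         ("alice", "2015-06-03", 60.0), ("bob", "2015-06-01", 10.0)],
        ["customer", "date", "amount"])

    # A window covering the current row and the two preceding rows,
    # partitioned by customer and ordered by date
    w = Window.partitionBy("customer").orderBy("date").rowsBetween(-2, 0)

    df.select("customer", "date",
              F.avg("amount").over(w).alias("moving_avg")).show()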

R Interface

AMPLab released a developer version of SparkR in January 2014.  In June 2014, Alteryx and Databricks announced a partnership to lead development of this component.  In March 2015, SparkR officially merged into Spark.

SparkR offers an interface to use Apache Spark from R.  In Spark 1.4, SparkR supports operations like selection, filtering and aggregation on large datasets.  It’s important to note that as of this release SparkR does not support an interface to MLLib, Streaming or GraphX.

Machine Learning

In Spark 1.4, ML pipelines graduate from alpha, adding new feature transformers (VectorAssembler, StringIndexer, Bucketizer and others) and a Python API.
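A minimal sketch of the new transformers in PySpark, assuming a hypothetical DataFrame with a categorical country column and numeric age and income columns:

    from pyspark.sql import SQLContext
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext, sc
    df = sqlContext.createDataFrame(
        [("US", 34, 60000.0), ("DE", 45, 52000.0), ("US", 23, 31000.0)],
        ["country", "age", "income"])

    # StringIndexer learns a mapping from category labels to numeric indices
    indexed = StringIndexer(inputCol="country",
                            outputCol="country_idx").fit(df).transform(df)

    # VectorAssembler packs several columns into the single feature vector
    # expected by the ML estimators
    assembler = VectorAssembler(inputCols=["country_idx", "age", "income"],
                                outputCol="features")
    assembler.transform(indexed).show()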

There appears to be an effort under way to rebuild MLLib's supervised learning algorithms in ML.

There is a single enhancement to GraphX in Spark 1.4, a personalized PageRank.  Spark’s graph analytics capabilities are comparatively static.

Streaming

The enhancements to Spark Streaming include improvements to the UI, enhanced support for Kafka and Kinesis, and a pluggable interface for write-ahead logs.  Enhanced Kafka support includes better error reporting, support for Kafka 0.8.2.1 and Kafka with Scala 2.11, input rate tracking and a Python API for Kafka direct mode.
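A minimal sketch of the new Python API for Kafka direct mode; the topic name and broker address here are hypothetical:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaDirect")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Direct mode tracks Kafka offsets itself, with no receivers and
    # no write-ahead log required
    stream = KafkaUtils.createDirectStream(
        ssc, topics=["events"],
        kafkaParams={"metadata.broker.list": "broker1:9092"})

    stream.map(lambda kv: kv[1]).count().pprint()  # records per batch

    ssc.start()
    ssc.awaitTermination()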

Spark is Too Big to Fail

Reacting to growing interest in Apache Spark, there is a developing contrarian meme:

  • David Ramel asks: are Spark and Hadoop friends or foes?
  • Jack Vaughan compares Spark to the PDP-11, dismisses it as “just processing.”
  • Doug Henschen praises Spark, pans Databricks
  • Nicole Laskowski complains that Spark Summit East “felt like a Databricks show.”
  • Andrew Oliver thinks Spark needs to grow up
  • Andrew Brust worries that vendors are ahead of customers on Spark
  • IBM's James Kobielus characterizes Spark as "the shiny new thing"
  • Gartner’s Nick Heudecker asserts that Spark is “not enterprise ready”

Spark skepticism falls into three broad categories:

  • Hadoop Purism: Spark deviates from the MapReduce/HDFS framework, and some people aren’t happy about that
  • Backseat Driving: Some analysts argue that Spark is great but Databricks, the commercial venture behind Spark, should do X, Y or Z
  • FUD: Spark’s competitors — commercial and open source — plant “issues” and “concerns” about Spark with industry analysts

Let’s examine each in turn.

“Spark Competes With Hadoop”

Spark does not compete with Hadoop; it competes with MapReduce.  Hadoop is an ecosystem of projects; a few components are included in all commercial distributions (e.g. Hive, Pig, HBase), but these aren't used at every site.  The ability to mix and match components is a strength for Hadoop.

Some software, like Spark, can run co-located in a Hadoop cluster or on clustered machines outside of Hadoop.  This should not surprise anyone; clustering and distributed computing existed before Hadoop.  Why does it matter if a software component can run both ways?  Users and use cases will drive implementation, and if Spark works better with Cassandra than with HDFS, or if a Spark user does not need the other Hadoop bits, so be it.

While there are reports of organizations that have abandoned MapReduce, most organizations will use Spark together with MapReduce; if users are happy with existing MapReduce jobs, there is no need to rewrite them.  For new applications, however, some users will choose Spark over MapReduce for a variety of reasons: better runtime performance, more efficient programming, more built-in features or simply because it's the latest thing.  Isn't competition a wonderful thing?

Organizations using standalone instances of Spark likely never considered using MapReduce for the application in question.  For these use cases, Spark competes with SAS, Skytree, H2O, GraphLab or some other machine learning software.

Databricks Envy

Sniping at Databricks is equally unwarranted. (Note: I’m not on the payroll.)  There are only so many ways to build a viable open source business model.   Offering a commercial product with additional bits is one way to do so; that is how Cloudera and MapR operate.  Databricks offers a hosted service for Spark with a few extra bits; if you don’t like Databricks’ offering, you can implement on-premises yourself or get Spark as a service through Amazon Web Services, BlueData, Qubole or elsewhere.

And if you really must have a notebook for Spark, try Zeppelin.

Of course, it’s true that Hortonworks open sources everything.  HDP loses $3.76 for every dollar they sell.  They hope to make it up on volume.

Databricks contributes heavily to the open source Spark project, supporting developers whose sole job is to improve Spark.  Most importantly, Databricks provides leadership and release management, which inspires confidence that Spark will not turn into a muddled mess like Mahout.

The complaint that Spark Summit East "felt like a Databricks show" is odd; one rarely hears complaints that Oracle World "feels like an Oracle show."  There were thirty-nine presentations on the agenda at Spark Summit East, and exactly one, Ion Stoica's keynote, highlighted Databricks Cloud.  In contrast, sponsored sessions accounted for a third of the sessions at the 2015 Strata + Hadoop World in Santa Clara.

“Spark Is Not Enterprise-Ready”

Some of the criticism is silly.  Andrew Oliver is shocked to discover that the Databricks Cloud notebook, still in beta release, isn't as slick as Tableau.  Also, a process he was watching timed out.  But wait!  That might be due to slow hotel wi-fi…

Meanwhile, SecurityTracker reports a major security flaw in IBM’s BigSQL.

Is Spark "enterprise-ready"?  The same question could be asked about Hadoop, and conservative enterprises will answer "no" in both cases.  There is no single threshold that determines when a piece of software is "enterprise-ready".  Use cases matter; the standard for software that will run your ATMs is not the same as the standard for software to be used for genomics research.

According to Gartner's Heudecker, "actual adopters are mid- and late-stage startups such as Spark pureplay Databricks, ClearStory Data and Paxata, which uses Spark for data preparation. Other companies primarily use Spark to power dashboards."  Interesting to hear Gartner dismiss the dashboard market; but enterprises are currently using Spark for more than dashboards.  A top global bank uses Spark today for Basel reporting and stress testing; if you're not familiar with stress testing, suffice it to say that a bank that gets this application wrong is in a heap of trouble.

It's true that vendors are ahead of customers on Spark.  This is hardly out of the ordinary with new technology; one could have said the same thing about Hive in 2010.  Vendors are always ahead of customers; it's their job.

Spark is Too Big to Fail 

What are the alternatives to Spark?  Gartner's Heudecker correctly notes that Spark excels at iterative processing, where MapReduce performance is sandbagged by its need to persist intermediate results to disk after each pass through the data.  High-performance advanced analytics must run in memory; there are commercial products available from SAS and Skytree, but for open source distributed analytics there are few alternatives to Spark.  Flink and Tez lack Spark's analytic libraries; Impala can support SQL but lacks capabilities for machine learning, streaming analytics and graph analytics.

Whether or not Spark is fully buttoned down in Release 1.3 is irrelevant; at this point it is a settled matter that Spark is superior to MapReduce for advanced analytics applications.

I am not suggesting that Spark is free of bugs or issues.  Like every other commercial and open source software project, Spark has bugs; unlike some of the commercial products Gartner rates as “Leaders”, the Spark team is transparent about issues and fixes them quickly.   It’s also fair to say that this time next year Spark will have more features than it has today; the community of users and contributors will determine what features need to be added.

Unlike some other open source projects, Spark has strong leadership, a disciplined approach to development and an impressive release cadence.  People build software, and the people behind Spark have proven that they know what they are doing.

The list of Spark users is strong and growing.  I’ve attended every Spark Summit since the first one in 2013 and there is noticeable growth in the number and sophistication of the applications presented.  This is not hype; it is real progress by users who are accomplishing bigger and better things with Spark than they could have accomplished without it.

Spark has already achieved a level of commercial support that ensures it will live up to its promise.  It is available in every commercial Hadoop distribution and from DataStax, and it is endorsed by SAP and Oracle; it is inconceivable that these players will let Spark fail.  This is partly because reputations are at stake, and also because there are few other options for open source high-performance advanced analytics inside or outside of Hadoop.

Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not (such as Dell Statistica).  Nor does Big Data capability appear to impact the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS does not scale without Netezza or BigInsights; SAS only scales if you add one of its distributed in-memory back ends.  These products aren't listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

Spark Summit East: A Report (Updated)

Updated with links to slides where available.  Some links are broken; conference organizers have been notified.

Spark Summit East 2015 met on March 18 and 19 at the Sheraton Times Square in New York City.  Conference organizers announced another sellout (like the last two Spark Summits on the West Coast).

Competition for speaking slots at Spark events is heating up.  There were 170 submissions for 30 speaking slots at this event, compared to 85 submissions for 50 slots at Spark Summit 2014.  Compared to the last Spark Summit, presentations in the Applications Track, which I attended, were more polished, and demonstrate real progress in putting Spark to work.

The “father” of Spark, Matei Zaharia, kicked off the conference with a review of Spark progress in 2014 and planned enhancements for 2015.  Highlights of 2014 include:

  • Growth in contributors, from 150 to 500
  • Growth in the code base, from 190K lines to 370K lines
  • More than 500 known production instances at the close of 2014

Spark remains the most active project in the Hadoop ecosystem.

Also, in 2014, a team at Databricks smashed the Daytona GreySort record for petabyte-scale sorting.  The previous record, set in 2013, used MapReduce running on 2,100 machines to complete the task in 72 minutes.  The new record, set by Databricks with Spark running in the cloud, used 207 machines to complete the task in 23 minutes.

Key enhancements projected for 2015 include:

  • DataFrames, which are similar to frames in R, already released in Spark 1.3
  • R interface, which currently exists as SparkR, an independent project, targeted to be merged into Spark 1.4 in June
  • Enhancements to machine learning pipelines, which are sequences of tasks linked together into a process
  • Continued expansion of smart interfaces to external data sources, pushing logic into the sources
  • Spark packages — a repository for third-party packages (comparable to CRAN)

Databricks CEO Ion Stoica followed with a pitch for Databricks Cloud, which included brief testimonials from myfitnesspal, Automatic, Zoomdata, Uncharted Software and Tresata.

Additional keynoters included Brian Schimpf of Palantir, Matthew Glickman of Goldman Sachs and Peter Wang of Continuum Analytics.

Spark contributors presented detailed views on the current state of Spark:

  • Michael Armbrust, Spark SQL lead developer, presented on the new DataFrames API and other enhancements to Spark SQL.
  • Tathagata Das delivered a talk on the current state and future of Spark Streaming.
  • Joseph Bradley covered MLLib, focusing on the Pipelines capability added in Spark 1.2.
  • Ankur Dave offered an overview of GraphX, Spark’s graph engine.

Several observations from the Applications track:

(1) Geospatial applications had a strong presence.

  • Automatic, Tresata and Uncharted all showed live demonstrations of marketable products with geospatial components running on Spark
  • Mansour Raad of ESRI followed his boffo performance at Strata/Hadoop World last October with a virtuoso demonstration of Spark with massive spatial and temporal datasets and the ESRI open source GIS stack

(2) Spark provides a great platform for recommendation engines.

  • Comcast uses Spark to serve personalized recommendations based on analysis of billions of machine-generated events
  • Gilt Groupe uses Spark for a similar real-time application supporting flash sale events, where products are available for a limited time and in limited quantities
  • Leah McGuire of Salesforce described her work building a recommendation system using Spark

(3) Spark is gaining credibility in retail banking.

  • Sandy Ryza of Cloudera presented on Value at Risk (VaR) computations in Spark, a critical element in Basel reporting and stress testing
  • Startup Tresata demonstrated its application for Anti Money Laundering, which is built on a social graph constructed in Spark

(4) Spark has traction in the life sciences.

  • Jeremy Freeman of HHMI Janelia Research Center, a regular presenter at Spark Summits, covered Spark’s unique capability for streaming machine learning.
  • David Tester of Novartis presented plans to build a trillion-edge graph for genomic integration
  • Timothy Danford of Berkeley's AMPLab delivered a presentation on next-generation genomics with Spark and ADAM
  • Kevin Mader of ETH Zurich spoke about turning big hairy 3D images into simple, robust, reproducible numbers without resorting to black boxes or magic

Also in the applications track: presenters from Baidu, myfitnesspal and Shopify.

Spark Release 1.3.0 Goes Live

On Friday, March 13, the Apache Spark team announced availability of Release 1.3.0.  See Databricks' announcement here; additional coverage here.

Spark continues to maintain its rapid cadence of enhancements, with 175 contributors and more than 1,000 commits.  The new DataFrame API, previously called SchemaRDD, is a key feature of the new release.  DataFrames are a prerequisite for the long-awaited R interface to Spark, which was originally envisioned for Release 1.3 at the earliest; the Spark team now expects to include it in Release 1.4.

While there is strong developer interest in Spark Core, the APIs and the SQL, MLLib and Streaming libraries, interest in GraphX remains low.  Together with the low confidence in GraphX among users (surveyed recently by Typesafe and Databricks), this raises questions about the future of the module.

Here is a list of new features in this release:

Spark Core

  • Multi-level aggregation trees to speed up reduce operations (see the sketch after this list)
  • Improved error reporting for certain operations
  • Spark’s Jetty dependency is now shaded
  • Support for SSL encryption
  • Support for real-time GC metrics and record counts added to the UI
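The multi-level aggregation appears in the API as treeReduce and treeAggregate.  A minimal PySpark sketch:

    from pyspark import SparkContext

    sc = SparkContext(appName="TreeAggregation")
    rdd = sc.parallelize(range(1000000), 200)  # 200 partitions

    # treeReduce combines partial results in log-depth stages on the
    # executors, instead of shipping every partition's result straight
    # to the driver
    total = rdd.treeReduce(lambda a, b: a + b, depth=3)
    print(total)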

DataFrame API

  • New capability, previously called SchemaRDD
  • Includes named fields along with schema information
  • Common interchange format among Spark components as well as import/export
  • Build from Hive tables, JSON data, JDBC databases or any of Spark's data source APIs (sketched below)
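A minimal PySpark sketch of the last point, building a DataFrame from a hypothetical people.json file and applying the common DataFrame verbs:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext, sc

    # Build a DataFrame from JSON; field names and types are inferred
    df = sqlContext.jsonFile("people.json")
    df.printSchema()

    # The same DataFrame verbs work regardless of the source
    df.filter(df.age > 21).select("name", "age").groupBy("age").count().show()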

Spark SQL

  • Graduates from alpha
  • Backward compatibility for HiveQL and stable APIs
  • Support for writing tables in data sources
  • New JDBC data source enables interface with MySQL, Postgres and other RDBMS systems (see the sketch after this list)
  • Ability to merge compatible schemas in Parquet
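A minimal sketch of the JDBC data source, using the 1.3-era load() API; the connection details are hypothetical, and the MySQL JDBC driver must be on Spark's classpath:

    # Continuing with the sqlContext from the sketch above
    orders = sqlContext.load(source="jdbc",
                             url="jdbc:mysql://dbhost:3306/sales",
                             dbtable="orders")

    # A JDBC-backed DataFrame can be queried like any other
    orders.registerTempTable("orders")
    sqlContext.sql(
        "SELECT customer_id, SUM(total) AS spend "
        "FROM orders GROUP BY customer_id").show()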

Spark MLLib

New algorithms in this release include latent Dirichlet allocation (LDA), Gaussian mixture models, multinomial logistic regression, FP-growth frequent pattern mining, power iteration clustering and isotonic regression.

Spark has also added an initial capability to import and export models for some algorithms using a Spark-specific format.  The team plans to add import/export capability for additional models in the future, as well as PMML support.  Design document here.

Also new in 1.3.0: performance improvements for k-Means and ALS; Python API for the ML Pipeline, Gradient-Boosted Trees and Gaussian Mixture Model; and support for DataFrames.
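A minimal sketch of the new Python API for the ML Pipeline, fitting a toy text classifier (the training rows are illustrative):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.classification import LogisticRegression

    # Hypothetical training data: a DataFrame of 'text' and 'label' columns
    training = sqlContext.createDataFrame(
        [("spark is fast", 1.0), ("hadoop mapreduce", 0.0)],
        ["text", "label"])

    # A pipeline chains feature transformers and an estimator into one unit
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    model = pipeline.fit(training)  # fits the whole chain at once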

Spark Streaming

This release adds a direct API for Kafka with exactly-once delivery semantics, plus a first Python API for Kafka.

Spark GraphX

Updates to GraphX include several utility functions, including a tool to transform a graph into a canonical edge graph.

Gartner Advanced Analytics Magic Quadrant 2015

Gartner’s latest Magic Quadrant for Advanced Analytics is out; for reference, the 2014 report is here; analysis from Doug Henschen here.  Key changes from last year:

  • Revolution Analytics moves from Visionary to Niche
  • Alpine and Microsoft move from Niche to Visionary
  • Oracle, Actuate and Megaputer drop out of the analysis
Gartner 2015 Magic Quadrant, Advanced Analytics

Gartner changed its evaluation criteria this year to reflect only "native" (i.e. proprietary) functionality; as a result, Revolution Analytics dropped from Visionary to Niche.   Other vendors, it seems, complained to Gartner that the old criteria were "unfair" to those who don't leverage open source functionality.  If Gartner applies this same reasoning to other categories, it will have to drop coverage of Hortonworks and evaluate Cloudera solely on the basis of Impala.  🙂

Interestingly, Gartner’s decision to ignore open source functionality did not impact its evaluation of open source vendors RapidMiner and KNIME.

Based on modest product enhancements from Version 4.0 to Version 5.0, Alpine jumped from Niche to Visionary.   Gartner’s inclusion criteria for the category mandate that “a vendor must offer advanced analytics functionality as a stand-alone product…”; this appears to exclude Alpine, which runs in Pivotal Greenplum database (*).  Gartner’s criteria are flexible, however, and I’m sure it’s purely coincidental that Gartner analyst Gareth Herschel flacks for Alpine.

(*) Yes, I know — Alpine supports other databases and Hadoop as well.   The number of Alpine customers who use it in anything other than Pivotal can meet in Starbucks at one of the little tables in the back.

Gartner notes that Alpine “still lacks depth of functionality. Several model techniques are either absent or not fully developed within its tool.”  Well, yes, that does seem important.   Alpine’s promotion to Visionary appears to rest on its Chorus collaboration capability (originally developed by Greenplum).  It seems, however, that customers don’t actually use Chorus very much; as Gartner notes, “adoption is currently slow and the effort to boost it may divert Alpine’s resources away from the core product.”

Microsoft's reclassification from Niche to Visionary rests purely on Azure Machine Learning (AML), a product still in beta at the time of the evaluation.  Hardly anyone uses MSFT's "other" offering for analytics (SQL Server Analysis Services, or SSAS), apparently for good reason:

  • “The 2014 edition of SSAS lacks breadth, depth and usability, in comparison with the Leaders’ offerings.”
  • “Microsoft received low scores from SSAS customers for its willingness to incorporate their feedback into future versions of the product.”
  • "SSAS is a low-performing product (with poor features, little data exploration and questionable usability)."

On paper, AML is an attractive product, though it maxes out at 10GB of data; even so, it seems optimistic to rate Microsoft as "Visionary" purely on the basis of a beta product.  "Visionary" is a stretch in any case; analytic software that runs exclusively in the cloud is by definition a niche product, as it appeals only to a certain segment of the market.  AML's most attractive capabilities are its ability to run Python and R, and, as we noted above, these no longer carry any weight with Gartner.

Dropping Actuate and Megaputer from the MQ simply recognizes the obvious.  It’s not clear why these vendors were included last year in the first place.

It appears that Oracle chose not to participate in the MQ this year.  Analytics that run in a single database platform are by definition niche products — you can’t use Oracle Advanced Analytics if you don’t have Oracle Database, and few customers will choose Oracle Database because it has Oracle Advanced Analytics.

 

SAS Misses 2014 Growth Forecast

At the beginning of 2014, SAS EVP and CMO Jim Davis predicted double-digit revenue growth for 2014; in October, CEO Jim Goodnight walked that back to 5%, citing a challenging business climate in Europe.  Today, SAS announced 2014 revenue of $3.09 billion, up 2.3%.

Meanwhile, IBM reported growth in analytics revenue of 7% in Q4.

The challenge for SAS is that the US market is saturated: virtually every enterprise that ever will use SAS already does so, and there are limits to the number of new products one can add to the stack.  Much of SAS’ growth comes from overseas, and a strong dollar impairs SAS’ ability to sell in foreign markets.

On the positive side, SAS reports a total of 3,400 sites for SAS Visual Analytics, its "Tableau-killer", compared to 1,400 sites announced last year, for net growth of 2,000 sites.  (In SAS' parlance, a "site" is roughly equivalent to a server.)  Tableau has not yet released its 2014 results, but in Q3 Tableau reported that it added 2,500 customer accounts.

SAS also reports 24% revenue growth for its cloud services.   IT analyst firm Synergy Research Group reports that the cloud market is growing at a 49% annualized rate, although AWS, Microsoft, IBM and Google are all growing much faster than that.

In other news, the WSJ reports that Big Data analytics startup Palantir is now valued at $15 billion, which is about the same as what it would cost an acquirer to buy SAS at 5X revenue.

Still More Comments on Microsoft and Revolution Analytics

Three full business days post-announcement, and stories continue to roll in.

Stephen Swoyer of TDWI writes an excellent summary of what Microsoft will likely do with Revolution Analytics.  He correctly notes, for example, that Microsoft is unlikely to develop a business user interface for R with code-generating capabilities (comparable to SAS Enterprise Guide, for example).  This is difficult to do, and the demand is low; people who care about R tend to like working in a programming environment, and value the ability to write their own code.  Business users, on the other hand, tend to be indifferent to the underlying code generated by the application.

Since Revolution’s Windows-based IDE requires some investment to keep it competitive, the most likely scenario is that Microsoft will add R to the Visual Studio suite.

Mr. Swoyer also notes that popular data warehouses (such as Oracle, IBM Netezza and Teradata Aster) can run R scripts in-database.  While this is true, what these databases cannot do is run R scripts in distributed mode, which limits the capability to embarrassingly parallel tasks.  Enabling R scripts to run in distributed databases, necessary for Big Data, is a substantial development project, which is why Revolution Analytics completed only two such ports (one to Hadoop and one to Teradata).

While Microsoft's deep pockets give Revolution Analytics the means to support more platforms, they still need the active collaboration of database vendors.  Oracle and Pivotal have their own strategies for R, so partnerships with those vendors are unlikely.

For some time now, commercial database vendors have attempted to differentiate their products by including machine learning engines.  Teradata was the first, in 1987, followed by IBM DB2 in 1992; SQL Server followed in the late 1990s, and Oracle acquired what was left of Thinking Machines in 1999 primarily so it could build its Darwin predictive analytics software into Oracle Database.  None of these efforts has gained much traction with working analysts, for several reasons: (1) database vendors generally sell to the IT organization and not to an organization's end users; (2) as a result, most organizations do not link the purchase decision for databases and analytics; (3) users of predictive analytics tend to be few in number compared to SQL and BI users, and their needs tend to get overlooked.

Bottom line: I think it is doubtful that Microsoft will pursue enabling R to run in relational databases other than SQL Server, and they will drop Revolution’s “Write Once Deploy Anywhere” tagline, as it is impossible to deliver.

Elsewhere, Mr. Dan Woods doubles down on his argument that Microsoft should emulate Tibco, which is like arguing that the Seattle Seahawks should emulate the Jacksonville Jaguars.  Sorry, JAX; it just wasn’t your year.

 

More Comments on Microsoft + Revolution Analytics

My inbox continues to fill with Google Alerts about Microsoft’s announced purchase of Revolution Analytics — too numerous to link.

Most of these stories simply repackage the Microsoft announcement.

Clint Boulton of the WSJ’s CIO Journal writes one of the best analyses:

Microsoft is betting on the timeliness of its acquisition as more businesses adopt analytics. Revolution’s software helps companies use R, an open source programming language that more than two million programmers use daily to build predictive models. R is popular among university computer science students, many of whom continue to use it in their careers as data scientists.

Data scientists extract data from a data warehouse or Hadoop processing system, use R to slice and dice it for insights, and visualize the results. But businesses analyzing financial, social media and other data often need to scale the analytics across clusters of computers.

Several analysts pass along the factoid that two million people use R.   The truth is that nobody has any idea how many people use R; we don’t even know how many have downloaded the software.  The New York Times pointed out the difficulty in its piece five years ago:

While it is difficult to calculate exactly how many people use R, those most familiar with the software estimate that close to 250,000 people work with it regularly.

It's possible that R has gained 1,750,000 users in the intervening five years.  It's also possible that R has gained 10,000,000 users.  "Those most familiar with the software" are simply guessing.

While most analysts are neutral to positive on Microsoft’s move, Mr. Dan Woods takes a contrary view.  In an article published in Forbes and cross-posted on multiple platforms, Mr. Woods argues that Microsoft was wrong to buy Revolution Analytics, and instead should buy Tibco.   (That is the implication of his argument that Microsoft should “emulate” Tibco, since the only way to “emulate” Tibco is to own the clump of software Tibco packages up as TERR.)

Mr. Woods is a “content specialist”, as freelance writers call themselves today, and his expertise in analytics is exemplified by his most recent book, Wikis for Dummies, published in 2007.  One suspects that the private equity firm that acquired Tibco in September is peddling the pieces, and has engaged “content specialists” to bang the drum.

Mr. Woods gets two things right.  It’s true that R is a mess, and it is also true that the GPL license makes R difficult to commercialize.  R’s messiness is a byproduct of crowdsourced development; it is a feature to its devotees and a bug for everyone else.  (For those who simply cannot tolerate R’s messiness there is a simple solution: use Python.)  Under the GPL license, any enhancements become part of the free distribution, so if you distribute a product built with R you must share the source code of your product as well.

At the crux of his argument, though, Mr. Woods gets it wrong:

Revolution Analytics has made a business, like many open source-based companies, of supporting Open Source R.

This is factually incorrect.  Revolution only recently started to offer a consulting service for open source R users; for most of its history, its business was built around Revolution R Enterprise, a commercially supported enhanced R distribution.  This is not a trivial distinction.  Cloudera Hadoop, for example, is based on Apache Hadoop, but it is not the same thing; while many enterprises use commercially supported Hadoop distributions (from vendors like Cloudera, Hortonworks or MapR), hardly anyone uses open source Apache Hadoop in production.

The same is true for R; while many enterprises balk at using open source R, they are willing to deploy commercially supported R distributions (such as Oracle R or Revolution R).  This is the business Microsoft enters by acquiring Revolution Analytics.

Regarding Mr. Woods’ point about the need to rebuild R from the ground up, that is neither possible nor necessary.  The GPL license prevents anyone from “rebuilding” R as a commercial venture; if anyone “rebuilds” the language it will be the open source development team itself.

In any case, one need not "make R scale"; one need only provide an R API to other platforms (such as Apache Spark or dbLytix) that can scale, so that R users can interface with them.   This is the approach taken by Revolution Analytics' ScaleR software, which is actually written in C but includes an interface from the R programming language.  By building this component into Azure, Microsoft can offer those who use R locally a scalable back end.

Update: Mr. Woods doubles down here.