The Year in Machine Learning (Part One)

This is the first installment in a four-part review of 2016 in machine learning and deep learning.

In the first post, we look back at ML/DL news organized in five high-level topic areas:

  • Concerns about bias
  • Interpretable models
  • Deep learning accelerates
  • Supercomputing goes mainstream
  • Cloud platforms build ML/DL stacks

In Part Two, we cover developments in each of the leading open source machine learning and deep learning projects.

Parts Three and Four will review the machine learning and deep learning moves of commercial software vendors.

Concerns About Bias

As organizations expand the use of machine learning for profiling and automated decisions, there is growing concern about the potential for bias. In 2016, reports in the media documented racial bias in predictive models used for criminal sentencing, discriminatory pricing in automated auto insurance quotes, an image classifier that learned “whiteness” as an attribute of beauty, and hidden stereotypes in Google’s word2vec algorithm.

Two bestsellers were published in 2016 that address the issue. The first, Cathy O’Neil’s Weapons of Math Destruction, is a candidate for the National Book Award. In a review for The Wall Street Journal, Jo Craven McGinty summarizes O’Neil’s arguments as “algorithms aren’t biased, but the people who build them may be.”

A second book, Virtual Competition, written by Ariel Ezrachi and Maurice Stucke, focuses on the ways that machine learning and algorithmic decisions can promote price discrimination and collusion. Burton Malkiel notes in his review that the work “displays a deep understanding of the internet world and is outstandingly researched. The polymath authors illustrate their arguments with relevant case law as well as references to studies in economics and behavioral psychology.”

Most working data scientists are deeply concerned about bias in the work they do. Bias, after all, is a form of error, and a biased algorithm is an inaccurate algorithm. The organizations that employ data scientists, however, may not commit the resources needed for testing and validation, which is how we detect and correct bias. Moreover, people in business suits often exaggerate the accuracy and precision of predictive models or promote their use for inappropriate applications.
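As a minimal sketch of what that testing can look like, the snippet below compares false positive rates across groups on a validation set. The records, the group labels, and the function name false_positive_rate_by_group are all made up for illustration; a real audit would use held-out data and more than one error measure.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """Return {group: false positive rate} for (group, actual, predicted) records."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 0:                      # only true negatives can yield false positives
            negatives[group] += 1
            if predicted == 1:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items()}

# Hypothetical validation records: (group, actual outcome, model prediction)
records = [
    ("A", 0, 0), ("A", 0, 1), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 1), ("B", 0, 0), ("B", 1, 1),
]

rates = false_positive_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap between groups does not prove the model is unfair, but it is the kind of signal that validation is supposed to surface before deployment.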

In Europe, GDPR creates an incentive for organizations that use machine learning to take the potential for bias more seriously. We’ll be hearing more about GDPR in 2017.

Interpretable Models

Speaking of GDPR, beginning in 2018, organizations that use machine learning to drive automated decisions must be prepared to explain those decisions to the affected subjects and to regulators. As a result, in 2016 we saw considerable interest in efforts to develop interpretable machine learning algorithms.

— The MIT Computer Science and Artificial Intelligence Laboratory announced progress in developing neural networks that deliver explanations for their predictions.

— At the International Joint Conference on Artificial Intelligence, David Gunning summarized work to date on explainability.

— MIT selected machine learning startup Rulex as a finalist in its Innovation Showcase. Rulex implements a technique called Switching Neural Networks to learn interpretable rule sets for classification and regression.

— In O’Reilly Radar, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin explained Local Interpretable Model-Agnostic Explanations (LIME), a technique that explains the predictions of any machine learning classifier.

The Wall Street Journal reported on an effort by Capital One to develop machine learning techniques that account for the reasoning behind their decisions.

In Nautilus, Aaron M. Bornstein asked: Is artificial intelligence permanently inscrutable?  There are several issues, including a lack of clarity about what “interpretability” means.

It is important to draw a distinction between “interpretability by inspection” and “functional” interpretability. We do not evaluate an automobile by disassembling its engine and examining the parts; we get behind the wheel and take it for a drive. At some point, we’re all going to have to get behind the idea that you evaluate machine learning models by how they behave and not by examining their parts.
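One way to picture the "test drive" is a simple behavioral probe: nudge one input at a time and watch how the output moves, without ever looking inside the model. The sketch below assumes a hypothetical black_box scoring function and made-up feature names; it illustrates the idea only and is not any particular published technique.

```python
def sensitivity(model, example, feature, delta):
    """Change in the model's output when one feature is nudged by delta."""
    perturbed = dict(example)
    perturbed[feature] = perturbed[feature] + delta
    return model(perturbed) - model(example)

# Hypothetical stand-in for an opaque model: we only call it, never inspect it.
def black_box(x):
    return 0.5 * x["income"] - 2.0 * x["debt"]

applicant = {"income": 40.0, "debt": 10.0}   # made-up input
effects = {f: sensitivity(black_box, applicant, f, 1.0) for f in applicant}
print(effects)
```

The probe treats the model purely as a function from inputs to outputs, which is the sense of "functional" interpretability used here.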

Deep Learning Accelerates

In a September Fortune article, Roger Parloff explains why deep learning is suddenly changing your life. Neural networks and deep learning are not new techniques; we see practical applications emerge now for three reasons:

— Computing power is cheap and getting cheaper; see the discussion below on supercomputing.

— Deep learning works well in “cognitive” applications, such as image classification, speech recognition, and language translation.

— Researchers are finding new ways to design and train deep learning models.

In 2016, the field of DL-driven cognitive applications reached new milestones:

— A Microsoft team developed a system that recognizes conversational speech as well as humans do. The team used convolutional and long short-term memory (LSTM) neural networks built with Microsoft Cognitive Toolkit (CNTK).

— On the Google Research Blog, a Google Brain team announced the launch of the Google Neural Machine Translation System, a system based on deep learning that is currently used for 18 million translations per day.

— In TechCrunch, Ken Weiner reported on advances in DL-driven image recognition and how they will transform business.

Venture capitalists aggressively funded startups that leverage deep learning in applications, especially those that can position themselves in the market for cognitive solutions:

Affectiva, which uses deep learning to read facial expressions in digital video, closed on a $14 million “D” round led by Fenox Venture Capital.

Clarifai, a startup that offers a DL-driven image and video recognition service, landed a $30 million Series B round led by Menlo Ventures.

Zebra Medical Vision, an Israeli startup, uses DL to examine medical images and diagnose diseases of the bones, brain, cardiovascular system, liver, and lungs. Zebra disclosed a $12 million venture round led by Intermountain Health.

There is an emerging ecosystem of startups that are building businesses on deep learning. Here are six examples:

Deep Genomics, based in Toronto, uses deep learning to understand diseases, disease mutations and genetic therapies.

Cybersecurity startup Deep Instinct uses deep learning to predict, prevent, and detect threats to enterprise computing systems.

Ditto Labs uses deep learning to identify brands and logos in images posted to social media.

Enlitic offers DL-based patient triage, disease screening, and clinical support to make medical professionals more productive.

Gridspace provides conversational speech recognition systems based on deep learning.

Indico offers DL-driven tools for text and image analysis in social media.

And, in a sign that commercial development of deep learning isn’t all hype and bubbles, NLP startup Idibon ran out of money and shut down. We can expect further consolidation in the DL tools market as major vendors with deep pockets ramp up their programs. The greatest opportunity for new entrants will be in specialized applications, where the founders can deliver domain expertise and packaged solutions to well-defined problems.

Supercomputing Goes Mainstream

To make deep learning practical, you need a lot of computing horsepower. In 2016, hardware vendors introduced powerful new platforms that are purpose-built for machine learning and deep learning.

While GPUs are currently in the lead, there is a serious debate under way about the relative merits of GPUs and FPGAs for deep learning. Anand Joshi explains the FPGA challenge. In The Next Platform, Nicole Hemsoth describes the potential of a hybrid approach that leverages both types of accelerators. During the year, Microsoft announced plans to use Altera FPGAs, and Baidu said it intends to standardize on Xilinx FPGAs.

NVIDIA Launches the DGX-1

NVIDIA had a monster 2016, tripling its market value in the course of the year. The company released the DGX-1, a deep learning supercomputer. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

NVIDIA also revealed a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow, and Theano.

Tech media salivated:

MIT Technology Review interviewed NVIDIA CEO Jen-Hsun Huang, who is now Wall Street’s favorite tech celebrity.

Separately, Karl Freund reports on NVIDIA’s announcements at the SC16 supercomputing show.

Early users of the DGX-1 include BenevolentAI, Partners HealthCare, Argonne and Oak Ridge Labs, New York University, Stanford University, the University of Toronto, SAP, Fidelity Labs, Baidu, and the Swiss National Supercomputing Centre. Nicole Hemsoth explains how NVIDIA supports cancer research with its deep learning supercomputers.

Cray Releases the Urika-GX

Cray launched the Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high-performance network interconnect. Cray ships 16, 32 or 48 nodes in a rack in the third quarter, larger configurations later in the year.

Intel Responds

The headline on the Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. Intel acquired Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reported a price tag of $408 million. The customary tech media unicorn story storm ensues.

Intel said it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Paul Alcorn offers additional detail on Intel’s new Xeon CPU and Deep Learning Inference Accelerator. In Fortune, Aaron Pressman argues that Intel’s strategy for machine learning and AI is smart, but lags NVIDIA. Nicole Hemsoth describes Intel’s approach as “war on GPUs.”

Separately, Intel acquired Movidius, the folks who put a deep learning chip on a memory stick.

Cloud Platforms Build ML/DL Stacks

Machine learning use cases are inherently well-suited to cloud platforms. Workloads are ad hoc and project oriented; model training requires huge bursts of computing power for a short period. Inference workloads are a different matter, which is one of many reasons one should always distinguish between training and inference when choosing platforms.

Amazon Web Services

After a head fake earlier in the year when it published DSSTNE, a deep learning project that nobody wants, AWS announced that it would standardize on MXNet for deep learning. Separately, AWS launched three new machine learning managed services:

Rekognition, for image recognition

Polly, for text to speech

Lex, a conversational chatbot development platform

In 2014, AWS was first to market among the cloud platforms with GPU-accelerated computing services. In 2016, AWS added P2 instances with up to 16 Tesla K80 GPUs.

Microsoft Azure

Microsoft rebranded its deep learning framework, released in 2015 as CNTK, as Microsoft Cognitive Toolkit and shipped Version 2.0, with a new Python API and many other enhancements. The company also launched 22 cognitive APIs in Azure for vision, speech, language, knowledge, and search. Separately, MSFT released its managed service for Spark in Azure HDInsight and continued to enhance Azure Machine Learning.

MSFT also announced the Azure N-Series compute instances powered by NVIDIA GPUs for general availability in December.

Azure is one part of MSFT’s overall strategy in advanced analytics, which I’ll cover in Part Three of this review.

Google Cloud

In February, Google released TensorFlow Serving, an open source inference engine that deploys models after training and manages their lifetime.  On the Google Research Blog, Noah Fiedel explained.

Later in the Spring, Google announced that it was building its own deep learning chips, or Tensor Processing Units (TPUs). In Forbes, HPC expert Karl Freund dissected Google’s announcement. Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs.

Google launched a dedicated team in October to drive Google Cloud Machine Learning, and announced a slew of enhancements to its services:

Google Cloud Jobs API provides businesses with capabilities to find, match and recommend jobs to candidates. Currently available in a limited alpha.

Cloud Vision API now runs on Google’s custom Tensor Processing Units; prices reduced by 80%.

Cloud Translation API will be available in two editions, Standard and Premium.

Cloud Natural Language API graduates to general availability.

In 2017, GPU-accelerated instances will be available for the Google Compute Engine and Google Cloud Machine Learning. Details here.

IBM Cloud

In 2016, IBM contributed heavily to the growing volume of fake news.

At the Spark Summit in June, IBM announced a service called the IBM Data Science Experience to great fanfare. Experienced observers found the announcement puzzling; the press release described a managed service for Apache Spark with a Jupyter IDE, but IBM already had a managed service for Apache Spark with a Jupyter IDE.

In November, IBM quietly released the service without a press release, which is understandable since there was nothing to crow about. Sure enough, it’s a Spark service with a Jupyter IDE, but also includes an R service with RStudio, some astroturf “community” documents and “curated” data sources that are available for free from a hundred different places. Big Whoop.

In IBM’s other big machine learning move, the company rebranded an existing SPSS service as Watson Machine Learning. Analysts fell all over themselves raving about the new service, apparently without actually logging in and inspecting it.


Of course, IBM says that it has big plans to enhance the service. It’s nice that IBM has plans. We should all aspire to bigger and better things, but keep in mind that while IBM is very good at rebranding stuff other people built, it has never in its history developed a commercially successful software product for advanced analytics.

IBM Cloud is part of a broader strategy for IBM, so I’ll have more to say about the company in Part Three of this review.

Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week beating the drum for its partnership with Hewlett-Packard Enterprise.  The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow.  We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.

Explainers

In The Morning Paper, Adrian Colyer explains trajectory data mining.

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates everyone on the state of the Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with DataTorrent Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender with the Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.

Perspectives

We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from Capital One, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and gives it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL and offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for H2O.ai.

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.
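The Hortonworks cash figures cited above invite a quick back-of-the-envelope check. The sketch below uses only the numbers in the item ($100 million spent over nine months, $25 million left) and assumes, purely for illustration, that spending continues at the same average rate with no new financing.

```python
# Figures quoted in the Hortonworks item; the spend is stated as "more than"
# $100 million, so the burn computed here is a lower bound.
spent = 100e6        # dollars spent in the first nine months of 2015
months = 9
cash_left = 25e6     # dollars remaining in the bank

monthly_burn = spent / months               # average dollars per month
runway_months = cash_left / monthly_burn    # months of cash at that pace

print(f"burn: ${monthly_burn / 1e6:.1f}M per month, runway: {runway_months:.2f} months")
```

Because the actual spend exceeded $100 million, the runway at an unchanged burn rate would be at most this figure, well short of the twelve months cited in the 10-Q.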

Teradata Lays Another Egg

Teradata reports Q3 revenue of $606 million, down 3% in “constant” dollars, down 9% in actual dollars, the kind you can spend.  Product revenue, from selling software and boxes, declined 14%.

In a brutal call with analysts, CEO Mike Koehler noted: “revenue was not what we expected.”  It could have been a recorded message.

Teradata executives tried to blame the weak revenue on the strong dollar.  When pressed, however, they admitted that deferred North American sales drove the shortfall, as companies put off investments in Teradata’s big box solutions.

In other words, the dogs don’t like the dog food.

From the press release:

Teradata is in the process of making transformational changes to improve the long-term performance of the company, including offering more flexibility and options in the way customers buy Teradata products such as a software-only version of Teradata as well as making Teradata accessible in the public cloud. The initial cloud version of Teradata will be available on Amazon’s Web Services in the first quarter of 2016.

An analyst asked about expected margins in the software-only business; Teradata executives clammed up.  The answer is zero.  Teradata without a box is a bladeless knife without a handle, competing directly with open source databases, such as Greenplum.

Another analyst asked about Teradata on AWS, noting that Teradata executives previously declared that their customers would never use AWS.  Response from the executives was more mush.  HP just shuttered its cloud business; Teradata’s move to AWS implies that Teradata Cloud is toast.

Koehler also touted Teradata’s plans to offer Aster on Hadoop, citing “100 pre-built applications”.  Good luck with that.  Aster on Hadoop is a SQL engine that still runs through MapReduce; in other words it’s obsolete, a point reinforced by Teradata’s plans to move forward with Presto.  Buying an analytic database with pre-built applications is like buying a car with pre-built rides.

More from the press release:

“We remain confident in Teradata’s technology, our roadmaps and competitive leadership position in the market and we are taking actions to increase shareholder value.  We are making transformative changes to the company for longer term success, and are also aligning our cost structure for near term improvement,” said Mike Koehler, chief executive officer, Teradata Corporation. 

In other words, expect more layoffs.

“Our Marketing Applications team has made great progress this year, and has market leading solutions. As part of our business transformation, we determined it best to exclusively focus our investments and attention on our core Data and Analytics business.  We are therefore selling our Marketing Applications business. As we go through this process, we will work closely with our customers and employees for continued success.

“We overpaid for Aprimo five years ago, so now we’re looking for some greater fool to buy this dog.”

In parallel, we are launching key transformation initiatives to better align our Data and Analytics solutions and services with the evolving marketplace and to meet the needs of the new Teradata going forward.”

Update your resumes.

During the quarter, Teradata purchased approximately 8.5 million shares of its stock worth approximately $250 million.  Year to date through September 30, Teradata purchased 15.5 million shares, worth approximately $548 million.

“We have no vision for how to invest in our business, so we’re buying back the stock.”

In early trading, Teradata’s stock plunges.

In 2012, five companies led the data warehousing platform market: Oracle, IBM, Microsoft, Teradata and SAP.  Here’s how their stocks have fared since then:

  • Oracle: Up 24%
  • IBM: Down 29%
  • Microsoft: Up 77%
  • Teradata: Down 61%
  • SAP: Up 22%

Nice work, Teradata!  Making IBM look good…

Spark Summit 2015: Preliminary Report

So I guess Spark really is enterprise ready.  Nick Heudecker, call your office.

There are several key themes coming from the Summit:

Spark Continues to Mature

Spark and its contributors deserve a round of applause.  Some key measures of growth since the 2014 Summit:

  • Contributor headcount increased from 255 to 730
  • Committed lines of code increased from 175K to 400K

There is increasing evidence of Spark’s scalability:

  • Largest cluster: 8,000 nodes
  • Largest job: 1 petabyte
  • Top streaming intake: 1TB/hour

Project Tungsten aims to make Spark faster and prepare for the next five years; the project has already accomplished significant performance improvements through better use of memory and CPU.

IBM and Spark

IBM drops the big one with its announcement.  Key bits from the announcement:

  • IBM will build Spark into the core of its analytic and commerce products, including IBM Watson Health Cloud.
  • IBM will open source its machine learning library (System ML) and work with Databricks to port it to Spark.
  • IBM will offer Spark as a Cloud service on Bluemix.
  • IBM will commit 3,500 developers to Spark-related projects.
  • IBM (and its partners) will train more than a million people on Spark.

I will post separately on this next week.

Spark is Enterprise-Ready

If IBM’s announcement is not sufficient to persuade skeptics, presentations from Adobe, Airbnb, Baidu, Capital One, CIA, NASA/JPL, NBC Universal, Netflix, Thomson Reuters, Toyota and many others demonstrate that Spark already supports enterprise-level workloads.

In one of the breakouts, Arsalan Tavakoli-Shiraji of Databricks presented results from his analysis of more than 150 production deployments of Spark.  As expected, organizations use Spark for BI and advanced analytics; the big surprise is that 60% use non-HDFS data sources.  These organizations use Spark for data consolidation on the fly, decoupling compute from storage, with unification taking place on the processing layer.

Databricks Cloud is GA

Enough said.

SparkR

Spark 1.4 includes R bindings, opening Spark to the large community of R users.  Out of the gate, the R interface enables the R user to leverage Spark DataFrames; the Spark team plans to extend the capability to include machine learning APIs in Spark 1.5.

Spark’s Expanding Ecosystem

Every major Hadoop distributor showed up this year, but there were no major announcements from the distributors (other than IBM’s bombshell).

In other developments:

  • Amazon Web Services announced availability of a new Spark on EMR service
  • Intel announced a new Streaming SQL project for Spark
  • Lucidworks showcased its Fusion product, with Spark embedded
  • Alteryx announced its plans to integrate with Spark in its Release 10

One interesting footnote — while there were a number of presentations about Tachyon last year, there were none this year.

These are just the key themes.  I’ll publish a more detailed story next week.

SAS Misses 2014 Growth Forecast

At the beginning of 2014, SAS EVP and CMO Jim Davis predicted double-digit revenue growth for 2014; in October, CEO Jim Goodnight walked that back to 5%, citing a challenging business climate in Europe.  Today, SAS announced 2014 revenue of $3.09 billion, up 2.3%.

Meanwhile, IBM reported growth in analytics revenue of 7% in Q4.

The challenge for SAS is that the US market is saturated: virtually every enterprise that ever will use SAS already does so, and there are limits to the number of new products one can add to the stack.  Much of SAS’ growth comes from overseas, and a strong dollar impairs SAS’ ability to sell in foreign markets.

On the positive side, SAS reports a total of 3,400 sites for SAS Visual Analytics, its “Tableau-killer”, compared to 1,400 sites announced last year, for a net growth of 2,000 sites.  (In SAS’ parlance, a “site” is roughly equivalent to a server.)  Tableau has not yet released its 2014 results, but in Q3 Tableau reports that it added 2,500 customer accounts.

SAS also reports 24% revenue growth for its cloud services.   IT analyst Synergy Research Group reports that the cloud market is growing at a 49% annualized rate, although AWS, Microsoft, IBM and Google are all growing much faster than that.

In other news, the WSJ reports that Big Data analytics startup Palantir is now valued at $15 billion, which is about the same as what it would cost an acquirer to buy SAS at 5X revenue.
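The comparison is easy to check from the figures quoted above. The sketch below takes the reported 2014 revenue and growth rate at face value; the 5X multiple is the rough rule of thumb used in the text, not a formal valuation.

```python
revenue_2014 = 3.09e9   # SAS 2014 revenue as announced, in dollars
growth = 0.023          # reported year-over-year growth (2.3%)
multiple = 5            # revenue multiple assumed in the text

implied_2013 = revenue_2014 / (1 + growth)   # prior-year revenue implied by the growth rate
price_at_multiple = multiple * revenue_2014  # acquisition price at 5X revenue

print(f"implied 2013 revenue: ${implied_2013 / 1e9:.2f}B")
print(f"price at {multiple}X revenue: ${price_at_multiple / 1e9:.2f}B")
```

At 5X revenue the price comes to roughly $15.45 billion, which is in line with the $15 billion valuation reported for Palantir.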

Microsoft Buys Revolution Analytics

On Friday, January 23, Microsoft announced an agreement to acquire Revolution Analytics.  Coverage of the announcement in the media is extensive, with stories by TechCrunch, Wired, ZDNet, VentureBeat and many others.

Microsoft did not disclose the negotiated purchase price; Revolution’s total capitalization is around $40 million.  Given Revolution’s scale of operations, the acquisition will have minimal impact on Microsoft’s near-term revenue and profit.

Many analysts follow Microsoft, but few have heard of Revolution Analytics, and most seem to be stumped by this move.  An example:

Question: What is the significance of Microsoft acquiring Revolution Analytics?

Answer: I am not sure.

Microsoft gets four things with this deal:

  • Instant credibility with the growing open source analytics community
  • Consulting and support skills to help enterprise customers adopt R
  • A capable engineering organization (conveniently located in Seattle)
  • Software bits that should integrate well with the Microsoft stack

In addition to its primary offering, Revolution R Enterprise, Revolution distributes Revolution R Open, an enhanced free distribution of open source R; and Revolution R Cloud, an elastic offering on the AWS Marketplace.  Revolution R Open is equivalent in many respects to Oracle R Distribution, which is also compiled with the Intel Math Kernel Library.  Revolution R Plus is commercially supported, and includes additional software bits for enterprise integration; this product is comparable to Oracle R Enterprise.

Revolution Analytics’ other key software assets include ScaleR, a distributed out-of-memory back end with a strong R interface; DeployR, a component that supports enterprise deployment of web-based applications; and DevelopR, a Windows-based IDE.

While the IDE has a number of useful features, it requires significant investment to compete effectively with RStudio, which has won the hearts and minds of R users.  Upgrading software simply to make it competitive with a “free” competitor strikes me as a dubious commercial move; it seems more likely that Microsoft will add an R capability to the Visual Studio suite.

Revolution’s ScaleR back end enables R users to leverage a platform for distributed analytics.  ScaleR already runs on Windows Server HPC clusters, which should make integration with Azure a straightforward matter.  This is important for Microsoft, since Azure Machine Learning currently maxes out at around 10 GB.

ScaleR’s integration with Hadoop currently runs through MapReduce; competing best-in-class Hadoop analytics (such as Spark, H2O, Skytree and SAS) run in memory for better performance.  Microsoft’s deep pockets give Revolution the means to make this product competitive.