Notes on a Watson FAIL

A little over a year ago, on February 17, 2017, the Houston Chronicle reported that the University of Texas’ MD Anderson Cancer Center had halted an AI project for cancer diagnostics. The story revealed that MD Anderson spent $62 million over four years to build a system called the Oncology Expert Advisor (OEA), based on IBM Watson. As envisioned by its champions, OEA would help community oncologists provide quality care to patients unable to seek treatment directly from MD Anderson physicians.

A cascade of stories about the failed project ensued: in Forbes, The Wall Street Journal, MIT Technology Review, Medscape, The Cancer Letter, Health News Review, Health IT and CIO Review, Ars Technica, and many others. Four themes emerged from the reporting:

(1) OEA was a poster child for bad project management.

An audit report published in November 2016 by the University of Texas Audit Office identified numerous exceptions to standard project management practice. According to the Audit Office, project leadership:

  • Did not use proper contracting and procurement procedures
  • Failed to follow IT Governance processes for project approval
  • Did not effectively monitor vendor contract delivery
  • Overspent pledged donor funds by $12 million

Other than that, the project was well-managed. 🙂

In a response to the audit report, project leader Lynda Chin argued that she was not required to follow IT Governance policies because the effort was a “research” project. This strikes me as a silly and self-serving argument. If you’re spending $62 million on a project intended for clinical use, you need to practice good project management. Calling the project “research” does not absolve you of that responsibility.

(2) Scope changes inflated project costs.

MD Anderson signed agreements with IBM and PwC in June and July 2012, respectively. Under these initial agreements, the scope of OEA included lower risk myelodysplastic syndrome (MDS) leukemia patients. The system would digest a broad range of whole exome, tissue, and other clinical data, produce new insight, and deliver physician decision-support services. The budget: just under $5 million.

Beginning in early 2013, MD Anderson radically expanded the scope of the project:

— Diseases added to OEA’s diagnostic capabilities: $23 million.

— Onboarding two partners to pilot the system: $29 million.

— Additional data sources: $5 million. 
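As a quick sanity check, the figures above can be added up. The sketch below uses only the rounded numbers quoted in this post; the initial budget of "just under $5 million" is approximated as 5, and the variable names are mine.

```python
# Rounded figures quoted above, in millions of USD.
# The initial budget was "just under $5 million"; 5 is an approximation.
initial_budget = 5
scope_additions = {
    "additional diseases": 23,
    "onboarding two pilot partners": 29,
    "additional data sources": 5,
}

# Total spend is the original budget plus every scope expansion.
total = initial_budget + sum(scope_additions.values())
growth_multiple = total / initial_budget

print(f"Approximate total: ${total} million")
print(f"Growth over initial budget: about {growth_multiple:.1f}x")
```

With these rounded inputs the total lands at roughly $62 million, which matches the reported spend and is about twelve times the original budget.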

Over the four-year life of the project, MD Anderson executed 7 agreements and 8 amendments with IBM and PwC. The auditors note that the contract value for many of these agreements was just below the threshold for Board approval, which suggests deliberate structuring to avoid scrutiny.

Interestingly, the massive expansion in project scope coincides with a $50 million pledge from “billionaire party boy” Low Taek Jho.

(3) OEA is not integrated with MD Anderson’s electronic health records (EHR) system.

IBM and its partners integrated the system with data from ClinicStation, the EHR system MD Anderson used previously. However, MD Anderson now uses Epic Systems for EHR; without live updates, OEA is unavailable for clinical use.

So, for $62 million, IBM and its partners built a custom demo.

(4) MD Anderson could not sell the system to partner hospitals.

MD Anderson planned from the beginning to use OEA as a way to provide high-quality cancer diagnostics to patients unable to seek treatment with MD Anderson physicians. Hence, business success of the project depended on MD Anderson’s ability to forge agreements with healthcare partners who would use the system. It was unable to do so.

Project leader Lynda Chin told the audit team that several factors prevented piloting with external partners, including “time needed for compliance and information security reviews of the cloud-based data repository,” and “lack of engagement or interest by network partners.” That’s two very different reasons. The former implies bad project estimating, poor delivery, or both; the latter implies an inability to sell the system.

Reports and analysis in the press raise as many questions as they answer.

Q. Why can’t Watson connect to MD Anderson’s EHR system?

Epic Systems is the leading EHR provider. A tool for medical diagnostics that cannot integrate with Epic is like a tool for optimizing logistics that cannot integrate with SAP.

MD Anderson began the search for a new EHR system in late 2012 and announced that it had selected Epic in early 2013. IBM and its partners knew that OEA would have to integrate with Epic before it could go into production. Moreover, they knew this very early in the OEA development cycle.

IBM announced a partnership with Epic in 2015. Interestingly, MD Anderson is not among the 14 collaborating cancer centers.

Integrating Watson with Epic Systems, MD Anderson’s current EHR system, may be easy or it may be hard. It does not matter. It was a necessary step for OEA to go into production. IBM and PwC knew this.

Yet, they kept building on. Like an AWS commercial, but without brains.

Q. Why did MD Anderson contract the project piecemeal?

MD Anderson knew from the beginning that this project would cost a lot more to deliver than the $16 million budget approved by the Board in early 2013. Otherwise, why solicit a restricted gift of $50 million? Or are we expected to believe that Jho Low just happened to come up with that number by chance?

“Thank you for your interest, Mr. Jho. Our budget for OEA is $16 million.”

“Great! Here’s a check for $50 million.”

Moreover, MD Anderson also knew from the beginning that piloting OEA with partners was critical to success. So why wasn’t this task built into the original project plan? Expanding scope to cover this task nearly doubled the project budget.

There can be good reasons to contract a project in phases. It may be difficult to accurately estimate the cost of later phases before early phases are complete. Contracting serially keeps vendors “honest” and introduces the potential for competition in later phases.

Of course, MD Anderson did not keep IBM and PwC “honest.” No vendor other than IBM and PwC performed work on this project. The cancer center awarded $51.4 million in contract fees to the two vendors under non-competitive procurement. Moreover, per the audit report, it appears that MD Anderson paid IBM and PwC for work they did not do.

Q. Why couldn’t MD Anderson secure partners for the project?

IBM wants us to believe that Watson worked well and that OEA would be in use today if MD Anderson chose to continue the project. If that’s true, why couldn’t MD Anderson interest partners in piloting the system?

OEA may be the greatest breakthrough in medicine since the discovery of penicillin. There’s only one problem: nobody wants it.

Hello, sir, I just sunk a pile of money into this gold-plated veeblefetzer. Would you like to buy one?

IBM claims that OEA agrees with experts 90% of the time. That sounds impressive, but isn’t; for all we know, “community oncologists” perform as well or better.

For more than 90% of Super Bowl LII, the Eagles didn’t sack Tom Brady.

That 10% kills you every time.

A smart organization would gauge the market for partners before sinking money into OEA. Instead, MD Anderson built it, Field of Dreams style, and hoped that partners would come.

Here are a few closing thoughts and observations.

One failed project says little about a technology, product, or company.

Case in point: plenty of ERP projects went sidewise, sometimes with dire results. A botched ERP go-live in 1999 prevented Hershey from shipping $100 million in orders for inventory it had on hand. Despite this, firms continue to invest in ERP, for good reasons.

One failed project does not mean that Watson has no value, nor does it mean that IBM cannot successfully deliver solutions based on Watson. However, it highlights that Watson projects are high-risk IT projects. Customers must exercise good vendor and project management.

Much of the blame for this FAIL rests with MD Anderson.

OEA is a lock for the Pantheon of Bad Project Management. MD Anderson failed to practice competent vendor, contract, and project management. One can hardly blame IBM and PwC for feasting at the trough.

That said, are vendors responsible for customers’ bad project governance? The answer is an emphatic “yes” — as a matter of ethics, and as good business practice. Ethical vendors do not accept contracts that violate a customer’s procurement policies. They also do not initiate or continue projects that they know will fail to deliver the promised solution.

Don’t kid yourself. IBM and PwC knew their contracts violated MD Anderson’s procurement policies. Both vendors embed themselves deeply in organizations; they often know the customer’s policies better than the executives they serve.

They also knew that the project was a train wreck. They couldn’t possibly have not known.

Big expensive AI projects, like any other project, require a sound business case.

Do we still need to bang this drum? Apparently so. MD Anderson, it seems, thought it was smart to build OEA first and figure out a business case later. Oops.

Successful AI projects require solid data architecture.

AI without live data is worthless. Build your data platform first. That is all.

IBM’s claims about the project were <ahem> “aspirational.”

In early 2013, IBM announced in a press release that MD Anderson “is using the IBM Watson cognitive computing system for its mission to eradicate cancer.”

It all depends on what the meaning of the word ‘is’ is.

Later that year, IBM planted a story in Scientific American reporting that “M. D. Anderson Cancer Center is using Watson to help doctors match patients with clinical trials, observe and fine-tune treatment plans, and assess risks.”

There’s that pesky word “is” again.

In October 2014, IBM Watson Health CTO Rob High wrote that “Doctors at the MD Anderson Cancer Center in Houston are using Watson to drive a software tool called the Oncology Expert Advisor, which serves as both a live reference manual and a virtual expert advisor for practicing clinicians.”

IBM continued to speak “aspirationally” about the project after it was stone cold dead.

In September 2016, IBM ended work on OEA and declared it “not ready for human investigational or clinical use, and its use in the treatment of patients is prohibited.” Two months later, IBM Watson Health’s Chief Health Officer Kyu Rhee touted Watson Health’s “collaboration with the world-leading MD Anderson Cancer Center in Houston, Texas. This project involves the rapid analysis of genomic information from cancer cells to provide personalized treatment for individuals.”

I guess Dr. Rhee didn’t get the memo.

The next time you hear IBM tout Watson, you may wonder if those claims, too, are “aspirational.”

Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis? It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

Spark is overhyped! He declared. His evidence? This:


One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which begs the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
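Here is a minimal single-machine sketch of that idea. It is not Spark code: a thread pool from Python's standard library stands in for the cluster, and `run_experiment` is a hypothetical placeholder for training and scoring one model. The point is only that each experiment is independent, so the whole grid can be mapped out to workers rather than run in a loop.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(params):
    """Hypothetical stand-in for training and scoring one model.

    Returns (params, score); higher is better. The toy score peaks
    at depth=4, learning_rate=0.1.
    """
    depth, learning_rate = params
    score = -((depth - 4) ** 2) - (learning_rate - 0.1) ** 2
    return params, score


# Every combination of hyperparameters is an independent experiment.
grid = list(product([2, 4, 6, 8], [0.01, 0.1, 1.0]))

# Hand the experiments to a pool of workers instead of looping serially.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, grid))

# Pick the best-scoring parameter combination.
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

On a real cluster the same shape applies: the grid becomes a distributed collection and each worker runs one experiment, so wall-clock time shrinks roughly with the number of workers available.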

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and made many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLlib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms).

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGBoost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

The Year in SQL Engines

As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run SQL; it runs HiveQL, an SQL-like language that closely approximates SQL. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, Spark, Drill, HAWQ, and Presto also work with a wide variety of other data storage systems.

Unlike relational databases, SQL engines operate independently of the data storage system. In contrast, relational databases bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling them, on the other hand, provides greater flexibility, though at the potential loss of performance.
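A toy sketch may make the decoupling concrete. Nothing here comes from a real SQL engine; the two "stores", the adapter functions, and `total_units_by_item` are all hypothetical names. The "engine" only knows how to aggregate rows, and the adapters know how to read each storage format.

```python
import csv
import io
import json

# Two hypothetical "storage systems" holding similar records
# in different formats.
CSV_STORE = "item,units\nwings,120\nburgers,95\n"
JSON_STORE = '[{"item": "wings", "units": 80}, {"item": "salad", "units": 40}]'


def scan_csv(text):
    """Storage adapter: yield rows from CSV text as dicts."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"item": row["item"], "units": int(row["units"])}


def scan_json(text):
    """Storage adapter: yield rows from a JSON array as dicts."""
    yield from json.loads(text)


def total_units_by_item(*scans):
    """Toy 'query engine': GROUP BY item, SUM(units) over any row sources."""
    totals = {}
    for scan in scans:
        for row in scan:
            totals[row["item"]] = totals.get(row["item"], 0) + row["units"]
    return totals


result = total_units_by_item(scan_csv(CSV_STORE), scan_json(JSON_STORE))
print(result)
```

The flexibility is visible in that a new storage format only needs a new adapter. The cost is visible too: the engine sees plain rows and cannot exploit indexes or layout tricks that a tightly coupled database would control.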

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries; mentions in online discussions; job offers; mentions in professional profiles, and tweets.

Figure 1

Source: DB-Engines, January 2017

Although Impala, Spark SQL, Drill, HAWQ, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap to the other engines. Most of the Hive alternatives launched in 2012 when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, and Drill ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2

Source: DB-Engines, January 2017

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3

Source: Open Hub

In 2016, Cloudera, Hortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks that Tony Baer summarizes. I’m sure that you will be shocked to learn that each vendor’s preferred SQL engine outperformed the others in that vendor’s own study, which begs the question: are benchmarks bullshit?

AtScale’s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL backends. The company’s software is engine-neutral — it seeks to run on as many as possible — and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes several improvements to Hive-on-Spark, plus performance, usability, supportability and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012, as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries.

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014 Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala; Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure, and in October delivered Release 2.7.0, its first Apache release. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, in an Alpha release as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
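To illustrate the model, here is a plain-Python sketch. It does not use Spark or its API, and the data and names (`STORE_REGION`, `update_counts`) are hypothetical. A stream arrives as small batches, each batch is joined against a static lookup table, and a running result is updated after every batch. It deliberately leaves out checkpointing and write-ahead logging, which are what provide the fault-tolerance guarantees.

```python
from collections import Counter

# Static reference data (hypothetical): maps a store id to its region.
STORE_REGION = {"s1": "east", "s2": "west"}


def update_counts(running, batch):
    """Fold one micro-batch of (store_id, amount) events into per-region totals."""
    for store_id, amount in batch:
        # Join each streaming event with the static lookup table.
        region = STORE_REGION.get(store_id, "unknown")
        running[region] += amount
    return running


# Hypothetical stream, delivered as small batches.
batches = [
    [("s1", 10), ("s2", 5)],
    [("s1", 7)],
    [("s3", 2), ("s2", 1)],
]

running_totals = Counter()
for batch in batches:
    update_counts(running_totals, batch)
    # The "result table" is refreshed after each batch arrives.
    print(dict(running_totals))
```

The query logic (join, then sum by region) is written once and applied to each new slice of data, which is the sense in which a streaming source can be queried like a static one.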

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.


Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases, and proprietary file systems (such as Amazon Web Services’ S3). Presto queries can federate data from multiple sources. Users can submit queries from C, Java, Node.js, PHP, Python, R, and Ruby.
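Federation is easier to see in miniature. The sketch below uses SQLite's `ATTACH DATABASE` from the Python standard library as a stand-in for two independent data sources. That substitution is an assumption made purely for illustration; this is not Presto, and the table names, columns, and the `clicks_by_country` helper are hypothetical. The point is only that a single SQL statement can join tables that live in different stores.

```python
import sqlite3

# Two separate in-memory databases stand in for two independent data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE main.clicks (user_id INTEGER, page TEXT)")
conn.execute("CREATE TABLE warehouse.users (user_id INTEGER, country TEXT)")
conn.executemany("INSERT INTO main.clicks VALUES (?, ?)",
                 [(1, "home"), (1, "pricing"), (2, "home")])
conn.executemany("INSERT INTO warehouse.users VALUES (?, ?)",
                 [(1, "US"), (2, "DE")])


def clicks_by_country(connection):
    """One SQL statement that joins tables living in different sources."""
    return connection.execute(
        """
        SELECT u.country, COUNT(*) AS n
        FROM main.clicks AS c
        JOIN warehouse.users AS u ON u.user_id = c.user_id
        GROUP BY u.country
        ORDER BY u.country
        """
    ).fetchall()


federated = clicks_by_country(conn)
```

Presto addresses each source through its own connector and qualifies table names by source in a similar spirit, but the connectors do the heavy lifting that SQLite sidesteps here.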

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project.  Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and ZoomData, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.
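For readers who have not met a rule-based planner, here is a minimal sketch of the idea in Python. It does not use Calcite's API; the `Scan`, `Filter`, and `Join` node types and the `push_filter_below_join` rule are hypothetical names made up for this example. A query is represented as a small relational algebra tree, and a rule rewrites the tree into an equivalent one that should be cheaper to run, in this case by filtering one input before the join instead of after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple


@dataclass(frozen=True)
class Filter:
    column: str
    value: object
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    key: str


def output_columns(node):
    """Columns produced by a plan node."""
    if isinstance(node, Scan):
        return set(node.columns)
    if isinstance(node, Filter):
        return output_columns(node.child)
    return output_columns(node.left) | output_columns(node.right)


def push_filter_below_join(node):
    """Rule: move a filter under the join when only one input has its column."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        in_left = node.column in output_columns(join.left)
        in_right = node.column in output_columns(join.right)
        if in_left and not in_right:
            return Join(Filter(node.column, node.value, join.left), join.right, join.key)
        if in_right and not in_left:
            return Join(join.left, Filter(node.column, node.value, join.right), join.key)
    return node  # rule does not apply; leave the plan unchanged


orders = Scan("orders", ("order_id", "user_id"))
users = Scan("users", ("user_id", "country"))
plan = Filter("country", "US", Join(orders, users, "user_id"))
optimized = push_filter_below_join(plan)
```

A cost-based optimizer goes one step further: it generates several equivalent plans like these and picks the one with the lowest estimated cost.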

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to production. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine — I can see the facepalms — so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”

The Year in Machine Learning (Part Four)

This is the fourth installment in a four-part review of 2016 in machine learning and deep learning.

— Part One covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms.

— Part Two surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

— Part Three reviewed the machine learning and deep learning initiatives of Big Tech Brands, industry leaders with significant budgets for software development and marketing.

In Part Four, I profile eleven startups in the machine learning and deep learning space. A search for “machine learning” in Crunchbase yields 2,264 companies. This includes companies, such as MemSQL, who offer absolutely no machine learning capability but hype it anyway because Marketing; it also includes application software and service providers, such as Zebra Medical Imaging, who build machine learning into the services they provide.

All of the companies profiled in this post provide machine learning tools as software or services for data scientists or for business users. Within that broad definition, the firms are highly diverse:

Continuum Analytics, Databricks, and H2O.ai drive open source projects (Anaconda, Apache Spark, and H2O, respectively) and deliver commercial support.

Alpine Data, Dataiku, and Domino Data Lab offer commercially licensed collaboration tools for data science teams. All three run on top of an open source platform.

KNIME and RapidMiner originated in Europe, where they have large user communities. Both combine a business user interface with the ability to work with Big Data platforms.

Fuzzy Logix and Skytree provide specialized capabilities primarily for data scientists.

DataRobot delivers a fully automated workflow for predictive analytics that appeals to data scientists and business users. It runs on an open source platform.

Four companies deserve an “honorable mention” but I haven’t profiled them in depth:

— Two startups, BigML and SkyMind, are still in seed funding stage. I don’t profile them below, but they are worth watching. BigML is a cloud-based machine learning service; SkyMind drives the DL4J open source project for deep learning.

— Two additional companies aren’t startups because they’ve been in business for more than thirty years. Salford Systems developed the original software for CART and Random Forests; the company has added more techniques to its suite over time and has a loyal following. Statistica, recently jettisoned by Dell, delivers a statistical package with broad capabilities; the company consistently performs well in user satisfaction surveys.

I’d like to take a moment to thank those who contributed tips and ideas for this series, including Sri Ambati, Betty Candel, Leslie Miller, Bob Muenchen, Thomas Ott, Peter Prettenhofer, Jesus Puente, Dan Putler, David Smith, and Oliver Vagner.

Alpine Data

In 2016, the company formerly known as Alpine Data Labs changed its name and CEO. Alpine dropped the “Labs” from its brand — I guess they didn’t want to be confused with companies that test stool samples — so now it’s just Alpine Data. And, ex-CEO Joe Otto is now an “Advisor,” replaced by Dan Udoutch, a “seasoned executive” with 30+ years of experience in business and zero years of experience in machine learning or advanced analytics. The company also dropped its CFO and head of Sales during the year, presumably because the investors were extremely happy with Alpine’s business results.

Alpine originally built its software to run in Greenplum database, then ported some of its algorithms to MapReduce in early 2013. Riding a wave of Hadoop buzz, Alpine closed on a venture round in November 2013, just in time for everyone to realize that MapReduce sucks for machine learning. The company quickly turned to Spark (Databricks certified Alpine on Spark in 2014) and has gradually ported its analytics operators to the new framework.


It seems that rebuilding on Spark has been a bit of a slog because Alpine hasn’t raised a fresh round of capital since 2013. As a general rule, startups that make their numbers get fresh rounds every 12-24 months; companies that don’t get fresh funding likely aren’t making their numbers. Investors aren’t stupid and, like the dog that did not bark, a venture capital round that does not happen says a lot about a company’s prospects.

In product news, the company announced Chorus 6, a major release, in May, and Chorus 6.1 in September. Enhancements in the new releases include:

— Integration with Jupyter notebooks.

— Additional machine learning operators.

— Spark auto-tuning. Chorus pushes processing to Spark, and Alpine has developed an optimizer to tune the generated Spark code.

— PFA support for model export. This is excellent, a cutting edge feature.

— Runtime performance improvements.

— Tweaks to the user experience.

Lawrence Spracklen, Alpine’s VP of Engineering, will speak about Spark auto-tuning at the Spark Summit East in Boston.

Prospective users and customers should look for evidence that Alpine is a viable company, such as a new funding round, or audited financials that show positive cash flow.

Continuum Analytics

Continuum Analytics develops and supports Anaconda, an open source Python distribution for data science. The core Anaconda bundle includes Navigator, a desktop GUI that manages applications, packages, environments and channels; 150 Python packages that are widely used in data science; and performance optimizations. Continuum also offers commercially licensed extensions to Anaconda for scalability, high performance and ease of use.


Anaconda 2.5, announced in February, introduced performance optimization with the Intel® Math Kernel Library. Beginning with this release, Continuum bundled Anaconda with Microsoft R Open, an enhanced free R distribution.

In 2016, Continuum introduced two major additions to the Anaconda platform:

Anaconda Enterprise Notebooks, an enhanced version of Jupyter notebooks

Anaconda Mosaic, a tool for cataloging heterogeneous data

The company also announced partnerships with Cloudera, Intel, and IBM. In September, Continuum disclosed $4 million in equity financing. The company was surprisingly quiet about the round — there was no press release — possibly because it was undersubscribed.

Continuum’s AnacondaCon 2017 conference meets in Austin February 7-9.


Databricks

Databricks leads the development of Apache Spark (profiled in Part Two of this review) and offers a cloud-based managed service built on Spark. The company also offers training and certification, and organizes the Spark Summits.

The team that originally developed Spark founded Databricks in 2013. Company employees continue to play a key role in Apache Spark, holding a plurality of the seats on the Project Management Committee and contributing more new code to the project than any other company.


In 2016, Databricks added a dashboarding tool and a RESTful interface for job and cluster management to its core managed service. The company made major enhancements to the Databricks security framework, completed SOC 2 Type 1 certification for enterprise security, and announced HIPAA compliance and availability in Amazon Web Services’ GovCloud for sensitive data and regulated workloads.

Databricks also launched a free Community edition and a five-part series of free MOOCs, completed its annual survey of the Spark user community, and organized three Spark Summits.

In December, Databricks announced a $60 million “C” round of venture capital. New Enterprise Associates led the round; Andreessen Horowitz participated.


Dataiku

Dataiku develops and markets Data Science Studio (DSS), a workflow and collaboration environment for machine learning and advanced analytics. Users interact with the software through a drag-and-drop interface; DSS pushes processing down to Hadoop and Spark. The product includes connectors to a wide variety of file systems, SQL platforms, cloud data stores and NoSQL databases.


In 2016, Dataiku delivered Releases 3.0 and 3.1. Major new capabilities include H2O integration (through Sparkling Water); additional data sources (IBM Netezza, SAP HANA, Google BigQuery, and Microsoft Azure Data Warehouse); added support for Spark MLlib algorithms; performance improvements; and many other enhancements.

In October, Dataiku closed on a $14 million “A” round of venture capital. FirstMark Capital led the financing, with participation from Serena Capital.


DataRobot

DataRobot, a Boston-based startup founded by insurance industry veterans, offers an automated machine learning platform that combines built-in expertise with a test-and-learn approach. Leveraging an open source back end, the company’s eponymous software searches through combinations of algorithms, pre-processing steps, features, transformations and tuning parameters to identify the best model for a particular problem.


The company has a team of Kaggle-winning data scientists and leverages this expertise to identify new machine learning algorithms, feature engineering techniques, and optimization methods. In 2016, DataRobot added several new capabilities to its product, including support for Hadoop deployment, deep learning with TensorFlow, reason codes that explain predictions, feature impact analysis, and additional capabilities for model deployment.

DataRobot also announced major alliances with Alteryx and Cloudera. Cloudera awarded the company its top-level certification: the software integrates with Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels.

Earlier in the year, DataRobot closed on $33 million in Series B financing. New Enterprise Associates led the round; Accomplice, Intel Capital, IA Ventures, Recruit Strategic Partners, and New York Life also participated.

Domino Data Lab

Domino Data Lab offers the Domino Data Science Platform (DDSP), a scalable collaboration environment that runs on-premises, in virtual private clouds, or hosted on Domino’s AWS infrastructure.


DDSP provides data scientists with a shared environment for managing projects, scalable computing with a variety of open source and commercially licensed software, job scheduling and tracking, and publication through Shiny and Flask. Domino supports rollbacks, revision history, version control, and reproducibility.

In November, Domino announced that it closed a $10.5 million “A” round led by Sequoia Capital. Bloomberg Beta, In-Q-Tel, and Zetta Venture Partners also participated.

Fuzzy Logix

Fuzzy Logix markets DB Lytix, a library of more than eight hundred functions for machine learning and advanced analytics.  Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database) and in Hadoop through Hive.

Users invoke DB Lytix functions from SQL or R, through BI tools, or from custom web interfaces. Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis. All functions support native in-database scoring. The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.
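If in-database scoring sounds abstract, the following sketch shows the general pattern using SQLite from the Python standard library. It is not DB Lytix and does not use any Fuzzy Logix API; the `score` function, the coefficients, and the `customers` table are hypothetical, and a scalar function is simpler than the table functions DB Lytix relies on. The idea it illustrates is that the model runs where the data lives and is invoked from an ordinary SQL statement.

```python
import math
import sqlite3

# Hypothetical coefficients for a fitted logistic model (illustrative only).
INTERCEPT, COEF_AGE, COEF_VISITS = -3.0, 0.04, 0.5


def score(age, visits):
    """Logistic score, evaluated row by row by the database engine."""
    z = INTERCEPT + COEF_AGE * age + COEF_VISITS * visits
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")
conn.create_function("score", 2, score)  # expose the Python function to SQL
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, visits REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 25, 0), (2, 50, 4)])

# The model is called from SQL; no rows are exported to a separate scoring tool.
scored = conn.execute(
    "SELECT id, score(age, visits) FROM customers ORDER BY id"
).fetchall()
```

The appeal for large tables is that nothing has to be extracted and reloaded just to apply a model.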

In April, the company announced the availability of DB Lytix on Teradata Aster Analytics, a development that excited all three of the people who think Aster has legs.

H2O.ai

H2O.ai develops and supports H2O, the open source machine learning project I profiled in Part Two of this review. As I noted in Part Two, H2O.ai updated Sparkling Water, its Spark integration, for Spark 2.0; released Steam, a model deployment framework, to production; and previewed Deep Water, an interface to GPU-accelerated back ends for deep learning.


In 2016, H2O.ai added 3,200 enterprise organizations and over 43,000 users to its roster, bringing its open source community to over 8,000 enterprises and nearly 70,000 users worldwide. In the annual KDnuggets poll of data scientists, reported usage tripled. New customers include Kaiser Permanente, Progressive, Comcast, HCA, McKesson, Macy’s, and eBay.

KNIME

KNIME AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a GPL license with an exception permitting third parties to use the API for proprietary extensions. The KNIME Analytics Platform features a graphical user interface with a workflow metaphor. Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch.


KNIME offers commercially licensed extensions for scalability, integration with data platforms, collaboration, and productivity. The company provides technical support for the extension software.

During the year, KNIME delivered two dot releases and three maintenance releases. The new features added to the open source edition in Releases 3.2 and 3.3 include Workflow Coach, a recommender based on community usage statistics; streaming execution; feature selection; ensembles of trees and gradient boosted trees; deep learning with DL4J, and many other enhancements. In June, KNIME launched the KNIME Cloud Analytics Platform on Microsoft Azure.

KNIME held its first Summit in the United States in September and announced an online training course available through O’Reilly Media.


RapidMiner

RapidMiner, Inc. of Cambridge, Massachusetts, develops and supports RapidMiner, an easy-to-use package for business analysis, predictive analytics, and optimization. The company launched in 2006 (under the corporate name of Rapid-I) to drive development, support, and distribution for the RapidMiner software project. The company moved its headquarters to the United States in 2013.


The desktop version of the software, branded as RapidMiner Studio, is available in free and commercially licensed editions.  RapidMiner also offers a commercially licensed Server edition, and Radoop, an extension that pushes processing down to Hive, Pig, Spark, and H2O.

RapidMiner introduced Release 7.x in 2016 with an updated user interface. Other enhancements in Releases 7.0 through 7.3 include a new data import facility, Tableau integration, parallel cross-validation, and H2O integration (featuring deep learning, gradient boosted trees and generalized linear models).

The company also introduced a feature called Single Process Pushdown. This capability enables RapidMiner users to supplement native Spark and H2O algorithms with RapidMiner pipelines for execution in Hadoop. RapidMiner supports Spark 2.0 as of Release 7.3.

In January 2016, RapidMiner closed a $16 million equity round led by Nokia Growth Partners. Ascent Venture Partners, Earlybird Venture Capital, Longworth Venture Partners, and OpenOcean also participated.


Skytree

Skytree Inc. develops and markets an eponymous commercially licensed software package for machine learning. Its founders launched the venture in 2012 to monetize an academic machine learning project (Georgia Tech’s FastLab).


The company landed an $18 million venture capital round in 2013 and hasn’t secured any new funding since then. (Read my comments under Alpine Data to see what that indicates.) Moreover, the underlying set of algorithms does not seem to have changed much since then, though Skytree has added and dropped several different add-ons and wrappers.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs or a browser-based GUI. Output includes explanations of the model in plain English. Skytree has a grid search feature for parameterization, which it trademarks as AutoModel, labels as “ground-breaking” and is attempting to patent. Analysts who don’t know anything about grid search think this is amazing.
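For anyone who hasn’t seen grid search, here is roughly all there is to it. This is a generic sketch in Python, not Skytree’s AutoModel; the `grid_search` helper, the `toy_score` function, and the parameter names are hypothetical. You list candidate values for each tuning parameter, try every combination, and keep the one that scores best.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = evaluate(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical stand-in for "train a model and return its validation score".
def toy_score(params):
    return -(params["depth"] - 4) ** 2 - (params["rate"] - 0.1) ** 2


best_params, best_score = grid_search(
    toy_score, {"depth": [2, 4, 8], "rate": [0.01, 0.1, 1.0]}
)
```

In practice `evaluate` would train and validate a model for each combination, which is why the cost grows quickly with the number of parameters and candidate values.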

In 2016, Skytree introduced a freemium edition, branded as Skytree Express. Hold out another six months and they’ll pay you to try it.

As is the case with Alpine Data, if you like Skytree’s technology, wait for another funding round or ask the company to provide evidence of positive cash flow.

The Year in Machine Learning (Part Three)

This is the third installment in a four-part review of 2016 in machine learning and deep learning. In Part One, I covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms. In Part Two, I surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

In this installment, we will review the machine learning and deep learning initiatives of Big Tech Brands — industry leaders with big budgets for software development and marketing. Big Tech Brands fall into three groups:

— SAS is the software revenue leader in predictive analytics. It has a unique business model and falls into its own category.

— Companies such as IBM, Microsoft, Oracle, SAP, and Teradata all have strong franchises in the data warehousing market, and all except Teradata offer widely used business intelligence software. These companies have the financial strength to develop, market and cross-sell machine learning software to their existing customer base, and can impact the market if they choose to do so.

— Dell and HPE dabbled in advanced analytics and exited the market in 2016.

I covered Google and Amazon Web Services in Part One. Although neither company has a strong position in business analytics at present, they are making moves in that direction. Google set up Google Cloud Machine Learning as a distinct product group this year to service that market, and Amazon introduced QuickSight, a business analytics service.

Regular readers know that I favor open source software — as do most data scientists. Among the companies covered in this installment, IBM and Microsoft are making substantial commitments to the open source model, including direct contributions to open source software projects. They deserve kudos for that. Teradata is investing in Presto SQL, for which they get polite applause. Oracle and SAP leverage open source software in their solutions but make no significant contributions. SAS embraces open source the way a cat embraces a porcupine.

In Part Four, I will survey machine learning startups, and deliver results from the Bottom Story of the Year poll.


SAS

SAS leads the market in licensing revenue for advanced and predictive analytics software, according to IDC. The company has a loyal following among statisticians, actuaries, life scientists and others whose work depends on statistical analysis.

Partnering with IBM, SAS built its business in the 1970s on the strength of its software for the IBM System/360 mainframe. IBM promoted the software to its enterprise customers to increase adoption and use of its hardware. SAS software still runs on the mainframe, and the company continues to earn a significant share of its revenue on that platform. IBM has mainframe customers who use the big box exclusively for SAS.

In the 1990s, SAS successfully transitioned to a multi-vendor architecture and rebuilt its software to run on many different hardware platforms and operating systems. During this period, SAS established a reputation for industrial-strength and enterprise-grade software — in contrast to vendors like SPSS, who focused on building easy-to-use software for the desktop.

On the face of it, SAS has struggled to transition from server-based computing to the contemporary world of distributed architecture and cloud platforms. In the past ten years, the company has announced multiple initiatives to improve the performance and scalability of its products, with mixed success. In April, SAS announced Viya, its third attempt to deliver advanced analytics in a distributed MPP architecture.

What is SAS Viya? How does it differ from SAS’ previous attempts at high-performance design? Let’s peruse the brochure:

Cloud-ready, elastic and scalable


SAS Viya is built to be elastic and scalable for both private and public clouds. Analytical, in-memory computations are optimized for unconstrained environments, but they can also adjust for constrained environments. The elastic processing automatically adapts to needs and available resources – spinning up or winding down computing capacity as needed. Elastic scalability lets you quickly experiment with different scenarios and apply more complex approaches to larger amounts of streaming data.

Ahem. Any software is “cloud-ready,” in the sense that a Linux instance is a Linux instance whether it runs on-premises or in the cloud. And any software is elastic when you deploy it in a virtual appliance, such as an Amazon Machine Image. That includes SAS 9.4, which SAS touted as “cloud-ready” in 2014, and previous versions of SAS, which you could deploy in AWS even though SAS did not formally support the platform.

If you want to spin up software instances, however, you need software licenses. With open source software, such as Python, R, or Spark, that’s not an issue — you can spin up as many instances as you like without violating license agreements. Commercial software is more complicated since you need to pay for the licenses you want to spin up. Some vendors, like HPE and Teradata, tried to address this problem by marketing their own cloud platforms to compete with Amazon Web Services; they failed miserably. Others, like Oracle, partner with AWS to deliver their software in the cloud — either as a bundled managed service or on a “Bring Your Own License” (BYOL) model.

You can’t have elastic computing with commercial software without a flexible licensing model. Pay-for-what-you-use licensing poses a problem for vendors like SAS, because if customers only pay for what they use, they invariably pay a lot less than they do under term licensing. Most commercial software customers are over-licensed — they’re paying for a lot of software they don’t use. That is why revenue from on-premises software licensing is declining much faster than revenue from cloud-based subscriptions is rising. In the cloud, you can do more with less.

The bottom line is this: unless Viya is available under an elastic pricing model, nobody cares that it is “cloud-ready, elastic and scalable.”

If you want to have a little fun, the next time your SAS rep touts Viya’s elasticity, ask him what it will cost per hour to license the software. Watch him squirm.

Open analytics coding environment


Empower your data scientists with SAS Analytics that are easily available from a variety of programming languages. Whether it’s a Python notebook, Java client, Lua scripting interface or SAS, your modelers and data scientists can easily access the power of SAS for data manipulation, advanced analytics and analytical reporting.

We’ve all been waiting for the ability to run SAS from Lua.

Resilient architecture with guaranteed failover


For answers you depend on, you need analytical processing power you can count on. You need all your analytical computations to finish processing without interruption. The fault-tolerant design of SAS Viya automatically detects server failure, even in multiplatform processing environments, and redistributes processing as needed. It also manages several copies of data on the processing cluster. If a machine in the cluster becomes unavailable or fails, the required data is retrieved from another block to quickly continue processing. These self-healing mechanisms ensure high availability for uninterrupted processing and automated recovery.

“It runs on Hadoop.”

Interviewed in Forbes, SAS CEO Jim Goodnight speaks at length about Viya:

We are ready for big data…(we) just released our first version of our new Viya architecture, which is massively parallel computing where we spread the data out over dozens of servers and then use all the cores inside those servers to process the data in parallel. So we might have 500 cores working on the data all at once in parallel, and that allows it to handle some really, really big problems that we’ve never even thought of before. Things like logistic regression.

Someone should feed Dr. G. better talking points. Just for the record, commercially available software for logistic regression running in a massively parallel (MPP) environment first hit the market in 1989. Distributed logistic regression is currently available in multiple software packages, including one introduced by SAS five years ago.

Logistic regression (a non-linear model) is an iterative process. Essentially, you’re trying to estimate the parameters in the model, and so you take a guess, you’ve got to run through the data using that guess, then to refine it and do another guess and run through the data again, and you keep doing this over and over and over until the parameters converged or they don’t change much at all anymore. That can take 25 to 30 passes of the data. Now, in the old days, we used to have to read the data that many times. Now, it’s in memory. We put it in memory and it stays in memory. It’s spread out over 500 cores and then each one just does a little piece of the work, and so we can do those 25 iterations in just a few minutes, whereas it used to take hours.

It’s just like Spark, but with a license key.
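The loop Goodnight describes is not exotic. Here is a minimal sketch in plain Python, with some loud assumptions: it uses simple gradient ascent rather than the Newton-style updates that statistical packages often use, so its pass counts are not comparable to the 25 to 30 he mentions; the function names, the toy data, and the learning rate are all made up for illustration; and the "cores" are just list slices processed one after another. What it does show is the pattern: guess the parameters, make a full pass over the data with each partition contributing a partial sum, refine the guess, and stop when the parameters barely move.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(rows, weights):
    """Gradient of the log-likelihood over one partition of the data."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (label - pred) * x
    return grad


def fit_logistic(partitions, n_features, rate=0.1, tol=1e-6, max_passes=10000):
    """Repeat full passes over the data until the weights stop changing."""
    weights = [0.0] * n_features
    for passes in range(1, max_passes + 1):
        # Each partition could be handled by a different core; here they run in turn.
        grads = [partial_gradient(rows, weights) for rows in partitions]
        total = [sum(g[j] for g in grads) for j in range(n_features)]
        new_weights = [w + rate * g for w, g in zip(weights, total)]
        change = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights
        if change < tol:
            break
    return weights, passes


# Six toy rows of (features, label); the first feature is a constant intercept term.
data = [([1.0, x], y) for x, y in zip([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])]
partitions = [data[:3], data[3:]]  # pretend each slice lives on a different core
weights, passes = fit_logistic(partitions, n_features=2, rate=0.05)
```

Because the partial gradients simply add up, splitting the rows across more workers changes where the arithmetic happens, not the answer.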

(Viya’s) really our third generation of massively parallel computing. We’ve been working on this problem for seven years, and this is our third major crack at doing it, and this time we’ve got everything figured out.

In 2018 he’ll be talking about a fourth crack in nine years.

It’s possible that Viya works better than SAS’ previous cracks at high-performance analytics. That is a weak hurdle, however; SAS needs to demonstrate that its high-cost proprietary distributed framework is better than Apache Spark, which is rapidly emerging as the standard enterprise platform for Big Data.

While SAS supports machine learning techniques in several different products, it lags in deep learning. The SAS Marketing team created some helpful content about deep learning, but look carefully at that page — you won’t find an actual product for deep learning. Yes, I know that SAS Enterprise Miner supports multilayer perceptrons; but SAS does not support GPUs, Xeon Phi, Intel Nervana or any other high-performance architecture that will make it possible for you to train a deep neural net while you’re young.

If you think that an eighteen-year-old product running on one server is sufficient for your deep learning project, you should definitely talk to SAS. Keep in mind, though, that there is a reason that NVIDIA’s DGX-1 GPU-accelerated deep learning box has the power of 250 conventional servers: you actually need that kind of horsepower.

The rest of SAS’ business seems to be chugging along well enough. A combination of renewals, upgrades and upsells in existing accounts should produce low single-digit revenue growth for 2016, which is not a bad track record when you consider the declines reported by IBM, Oracle, and Teradata.

Business Analytics Leaders

The five companies in this group sell at least a billion dollars a year in business analytics software, according to IDC’s most recent worldwide software market share report. However, most of their revenue comes from data warehousing and business intelligence software; they all trail SAS in predictive analytics revenue.

Software licensing revenue is a misleading measure, however, due to the growing presence of open source software. IBM, Microsoft, and Oracle, for example, actively use open source machine learning software to extend the reach of their data warehousing and business intelligence platforms, where they all have strong entries. IBM uses Spark as a foundation for many of its products; Microsoft has integrated R with SQL Server and PowerBI, and actively promotes the use of R for its enterprise customers. Oracle has taken a similar approach.


Unlike SAS, declining tech giant IBM never invested in a proprietary distributed framework for SPSS, its flagship software for advanced analytics. Instead, the company chose to leverage in-database engines (DB2, Netezza, and Oracle) and open source frameworks (MapReduce and Spark).

IBM contributes to Apache Spark, which it uses in several products, and also to Apache SystemML. IBM Research developed the core of SystemML, which IBM donated to Apache in 2015. IBM has also visibly contributed to the Spark community through its efforts in education and training.

In 2016, IBM continued to market SPSS Statistics and SPSS Modeler, software brands it acquired in 2009. Release 18 of SPSS Modeler, announced in March, includes such things as support for machine learning in DB2 and support for IBM’s General Parallel File System (GPFS) in BigInsights. There aren’t too many data scientists who care about such things, but they appeal to the 150 or so enterprises with CIOs who still believe that nobody ever got fired for buying IBM.

In Part One of this review, I covered IBM’s machine learning moves in IBM Cloud, which I would characterize as Shakespearean, as in Much Ado About Nothing.


Microsoft had quite a year in machine learning and deep learning. As I noted in Parts One and Two, in 2016 MSFT launched cognitive APIs in Azure for vision, speech, language, knowledge, and search; a managed service for Spark in Azure HDInsight; enhancements to Azure Machine Learning; and Version 2.0 of its deep learning framework, rebranded as Microsoft Cognitive Toolkit.

That’s just for starters.

In January, Microsoft announced Microsoft R Server, a rebranding of the product it acquired with Revolution Analytics in 2015. Microsoft R Server includes an enhanced R distribution, a scalable back-end, and integration tools. During the year, Microsoft shipped two major releases for R Server. In Release 8, the company added push-down integration with Spark. Release 9 updated the Spark integration for Spark 2.0, and added MicrosoftML, a new R package for machine learning.

Microsoft announced SQL Server 2016 in March with embedded SQL Server R Services. On the Revolutions blog, David Smith reports on the launch. Tomaž Kaštrun explains what you can do with R services in SQL Server.

In November, after an extended preview, Microsoft announced the general availability of R Server for Azure HDInsight, a scale-out implementation of R integrated with Spark clusters created from HDInsight.

Also in Azure, Microsoft added a Linux version of the Data Science Virtual Machine (DSVM). Previously available as a Windows instance, DSVM includes Revolution R Open, Anaconda, Visual Studio Community Edition, PowerBI Desktop, SQL Server Express and the Azure SDK.

PowerBI, Microsoft’s powerful visualization tool, added R support in August. In ComputerWorld, Sharon Machlis, an R user, enthused. More here, on the Revolutions blog.

R Tools for Visual Studio launched to public preview in March, and to general availability in September. Also in September, Microsoft released the Microsoft R Client, a free data science tool that works with Microsoft R Open and the ScaleR distributed back end.

Microsoft data scientists Gopi Krishna Kumar, Hang Zhang and Jacob Spoelstra developed a methodology for data science, which they presented at the Microsoft Machine Learning and Data Science Summit 2016 in September. David Smith reports. The method, which the authors call Team Data Science Process, includes a standard directory structure for managing project artifacts using a system such as Git. It also includes open source utilities to support the process.

Other than that, it was a quiet year in Redmond.


Oracle has a surprisingly robust set of machine learning tools that appeal to Oracle-centric organizations. They include:

Oracle Data Mining (ODM), a suite of machine learning algorithms that run as native SQL functions in Oracle Database.

Oracle Data Miner, a client application for ODM with a business user interface.

Oracle R Distribution (ORD), an enhanced free R distribution.

Oracle R Enterprise (ORE), Oracle R Distribution packaged with tools to integrate R with Oracle Database.

Oracle R Advanced Analytics for Hadoop (ORAAH), a set of R bindings with native algorithms and an interface to Spark.

Oracle claims that ORAAH’s native algorithms are faster than Spark, but ORAAH has only two algorithms, so nobody cares. Oracle OEMs Cloudera, so the Spark release is at least one major release behind the rest of the world.

Other than some dot releases for the components cited above, I don’t see a lot of movement for Oracle in 2016.


SAP introduced an update to its predictive analytics capabilities, now branded as SAP Business Objects Predictive Analytics 3.0. This product includes two separate automation capabilities, one branded as Predictive Factory, the second as HANA Automated Predictive Library. Predictive Factory, like SAS Factory Miner, is a scripting tool that enables a data scientist to create a modeling pipeline and schedule it for execution; it does not automate the data science process itself.

HANA Automated Predictive Library is a set of function calls that users can include in SQL scripts. It’s a product that might appeal to SAP HANA bigots and nobody else.

SAP acquired KXEN and its InfiniteInsight software in 2014. Customer satisfaction promptly dropped through the floor, and SAP trails all other advanced analytics vendors rated in a Gartner survey. Legacy InfiniteInsight customers fall into two camps: (a) those whose IT organizations are heavily invested in SAP, and (b) everyone else. The former seem to be sticking with the software as SAP integrates it into its product line; the latter are heading for the exits.


Declining data warehouse vendor Teradata thinks of itself as an analytics powerhouse. In reality, most of its revenue comes from data warehousing, where the company gets high marks from analysts like Gartner.

You could say that Teradata has a commanding position at the bottom of the analytics stack.

Teradata’s executive leadership — if you can call it that — completely missed the implications of Hadoop and cloud computing. Instead, they bet that the Teradata brand was beloved by IT executives, who would keep on buying boxes in bulk. As a result of that blinkered view of the world, the company today is worth a third of what it was worth five years ago. Its product sales have declined for ten straight quarters, seven in a row at double digits.

After a dismal first quarter, Teradata’s board fired (sorry, accepted the resignation of) CEO Mike Koehler; longtime board member Victor Lund stepped into the breach. In September, at the Teradata Partners conference, Lund announced that Teradata would reposition itself as an “analytics solutions” firm.

That may not sit well with SAS, Teradata’s primary partner for advanced analytics software, which also views itself as an “analytic solutions” firm. The difference, of course, is that SAS has been delivering solutions for a long time and has street cred with executives because it actually has sophisticated business solutions, with actual software and intellectual property, while Teradata appears to have little more than big ideas and PowerPoint.

Pro tip for Teradata management: just because you want to move up the value chain does not mean that you have the ability to do so.

In other developments, the company announced that Aster finally supports Spark, two years after anyone might have cared. Teradata also announced that Aster’s analytics are now available for deployment in Hadoop. Aster on Hadoop is a bladeless knife without a handle — a commercial machine learning library that competes with umpteen open source libraries. Aster also competes with another Teradata partner, Fuzzy Logix, whose dbLytix library is six times richer and more mature.

If someone proposes to bet that “solutions” and unbundled Aster will reverse Teradata’s decline, take the under.

Other Tech Giants

We mention two remaining giants, Dell and HPE, only to note their passing from the scene.


HPE announced the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also granted equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets. The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for firing people (sorry, cutting costs), so if you’re working for Haven or Vertica, this may be a good time to dust off your resume.
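The arithmetic behind those figures is easy to check. The snippet below is a back-of-the-envelope sketch using only the rounded numbers quoted above; the variable names are mine, and the $20 billion purchase cost is the "almost $20 billion" from the text, so the recovery ratio is approximate.

```python
# All amounts in billions of U.S. dollars, as quoted in the text.
cash = 2.5
equity = 6.3
revenue_multiple = 2.4
purchase_cost = 20.0  # "almost $20 billion", so treat the ratio below as rough

deal_value = cash + equity                      # total consideration
implied_revenue = deal_value / revenue_multiple # revenue implied by the multiple
recovered_share = deal_value / purchase_cost    # share of purchase cost recovered

print(f"Deal value:       ${deal_value:.1f}B")
print(f"Implied revenue:  ${implied_revenue:.2f}B")
print(f"Recovered share:  {recovered_share:.0%}")
```

In other words, a deal worth $8.8 billion at 2.4 times revenue implies a business with a bit under $3.7 billion in annual revenue, sold for less than half of what HPE paid to assemble it.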

In March, HPE announced Haven OnDemand, available on Microsoft Azure. Haven is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, initially branded as HAVEn and announced by HP in June 2013.  In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So the March announcement is a re-re-release of the software.

Three years into its product life cycle, Haven hasn’t exactly caught on with data scientists. Just 2 out of 2,895 respondents to the KDnuggets 2016 Data Science Software Usage poll and none in the O’Reilly 2016 Data Science Salary Survey said they use the software. Adding insult to injury, Haven failed to make KDnuggets’ list of the top 50 machine learning APIs, a list that includes the likes of Ersatz, Hutoma, and Skyttle.

Vertica still has some traction with data lovers whose analysis needs are simple enough to satisfy with SQL. Currently, it’s the 28th most popular relational database, according to DB-Engines, which is about on par with Netezza and Greenplum and a lot better than Aster. Expect this ranking to drop like a stone in the hands of Micro Focus.


Dell entered the advanced analytics business by acquiring Statsoft in 2014, a move that impressed nobody. In 2016, Dell exited by selling its software division to private equity investors.

Goodbye, Dell. We hardly knew ye.

How to Steal a Predictive Model

In the Proceedings of the 25th USENIX Security Symposium, Florian Tramèr et al. describe how to “steal” machine learning models via Prediction APIs. This finding won’t surprise anyone in the business, but Andy Greenberg at Wired and Thomas Claburn at The Register express their amazement.

Here’s how you “steal” a model:

— The prediction API tells you what variables the model uses; the packaging for a prediction API will say something like “submit X1 and X2, we will return a prediction for Y”; so you know that X1 and X2 are the variables in the model. The developer can try to fool you by directing you to submit a hundred variables even though it only needs two, but that’s not likely; most developers make the prediction API as parsimonious as possible.

— Use an experimental design to create test records with a range of values for each variable in the model. You won’t need many records; the number depends on the number of variables in the model and the degree of granularity you want.

— Now, ping the API with each test record and collect the results.

— With the data you just collected, you can estimate a model that approximates the model behind the prediction API.
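The steps above can be sketched in a few lines. This is a simplified illustration, not the paper's code: it assumes the target is a logistic regression and that the API returns a probability rather than just a class label, in which case the simplest possible design (the origin plus one record per variable) recovers the coefficients exactly. The function `make_prediction_api` is a hypothetical stand-in for a remote service, and the coefficients are made up.

```python
import math


def make_prediction_api(intercept, weights):
    """Hypothetical stand-in for a remote prediction API.

    The caller sees only predict(record) -> probability; the coefficients
    stay hidden inside the closure.
    """
    def predict(record):
        z = intercept + sum(w * x for w, x in zip(weights, record))
        return 1.0 / (1.0 + math.exp(-z))
    return predict


def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


def extract_logistic_model(predict, n_features):
    """Recover intercept and weights with n_features + 1 API calls."""
    # Design: the all-zeros record isolates the intercept.
    origin = [0.0] * n_features
    intercept = logit(predict(origin))

    # One record per variable, changing a single input at a time.
    weights = []
    for j in range(n_features):
        record = origin.copy()
        record[j] = 1.0
        weights.append(logit(predict(record)) - intercept)
    return intercept, weights


# The "victim" model: the API documentation tells us it takes X1 and X2.
api = make_prediction_api(intercept=-1.0, weights=[0.8, -2.5])

stolen_intercept, stolen_weights = extract_logistic_model(api, n_features=2)
clone = make_prediction_api(stolen_intercept, stolen_weights)

print(stolen_intercept, stolen_weights)
print(api([0.3, 1.7]), clone([0.3, 1.7]))
```

Three API calls recover all three hidden coefficients up to floating-point rounding. A model that returns only labels, or one that is not linear in its inputs, takes more test records and a fitted approximation rather than an exact solution, but the workflow is the same.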

The authors of the USENIX paper tested this approach with BigML and Amazon Machine Learning, succeeding in both cases. BigML objects; Amazon sleeps.

Legally, it may not be stealing. Model coefficients are intellectual property. If someone hacks into your model repository and steals the model file, or bribes one of your data scientists into providing the coefficients, that is theft. But while IP owners can assert a right over their actual code, it is much harder to assert a right to an application’s observable behavior. Reverse-engineering is legal in the U.S. and the European Union so long as the party that performs the work has legal possession of the relevant artifacts. If someone lawfully purchases predictions from your prediction API, they can reverse-engineer your model.

Restrictive licenses offer limited protection. Intellectual property owners can assert a claim against reverse-engineering if the predictions are under an end-user license that prohibits the practice. The fine print will please your Legal department, but is virtually impossible to enforce. Predictions, unlike other forms of intellectual property, aren’t watermarked; they’re just numbers.

Pricing plays a role. While it may be technically feasible to reverse-engineer a predictive model, it may be prohibitively expensive to do so. Models that predict behavior with financial implications, such as consumer credit risk models, are expensive. Arguably, the best way to prevent reverse-engineering is to charge a non-cancellable annual subscription fee for access to the API rather than selling predictions by the record. In any event, the risk of reverse-engineering should be a consideration in pricing.

Encryption may be necessary. If you want to do business with trusted parties over an open API, an encryption algorithm can scramble the prediction so that anyone without the key sees only noise, which makes reverse-engineering impractical. Of course, the customer must be able to decrypt the prediction at their end of the transaction, with a key transmitted separately or derived from a common random seed.

Access control is key. The key point of the USENIX authors is that if your prediction API is available “in the wild,” you might as well call it an open source model because reverse-engineering is easy to do. Of course, if you are in the business of selling predictions, you already have some form of access control so you can meter usage and charge an account. Bad actors, however, have credit cards; so, if you are concerned about your predictive model’s IP, you’re going to have to establish tighter control over access to the prediction API.

Databricks Releases Spark Survey

In a press release and blog post, Databricks announces results from its 2016 Spark Survey. Databricks surveyed 1,615 Spark users and prospective users in July 2016. Respondents include data engineers, data scientists, architects, technical managers, and academics.

Key findings from the survey:

  • Spark SQL remains the most widely used component.
    • 88% use Spark SQL
    • 71% use Spark Streaming
    • 71% use MLlib (machine learning)
  • Respondents value Spark’s performance and advanced analytics.
    • 91% rate performance very important
    • 82% rate advanced analytics very important
    • 76% rate ease of programming very important
    • 69% rate ease of deployment very important
    • 51% rate real-time streaming very important
  • Production use has increased markedly since 2015.
    • 40% use SQL in production, up from 24%
    • 38% use DataFrames in production, up from 15%
    • 22% use streaming in production, up from 14%
    • 18% use machine learning, up from 13%
  • So has usage in the public cloud.
    • 61% said they use Spark in the public cloud, up from 51% in 2015.
  • Usage of Spark deployed on-premises has declined.
    • 42% use Spark in a standalone deployment, down from 48%
    • 36% use Spark under YARN, down from 40%
    • 7% use Spark on Apache Mesos, down from 11%
  • The Scala API remains the most popular, followed closely by the Python API.
    • 65% use Scala, down from 71% in 2015
    • 62% use Python, up from 58%
    • 44% use SQL, up from 36%
    • 29% use Java, down from 31%
    • 20% use R, up from 18%
  • While Linux remains the most popular OS, Mac and Windows usage is growing rapidly.
    • 74% use Linux/Unix, down from 75% in 2015
    • 32% use Windows, up from 23%
    • 22% use Mac OSX, up from 14%

The report also includes statistics about the Spark community at large.

— Databricks reports growth in the contributor base from 600 in 2015 to 1,000 in 2016, a figure that does not seem to square with the statistics reported in OpenHub.

— Spark Meetup membership grew from 66,000 in 2015 to 225,000 in 2016.

— Spark Summit attendance grew from 3,912 to 5,100.

For a copy of the report and an infographic, go here.

Spark 2.0 Released

The Apache Spark team announces the production release of Spark 2.0.0.  Release notes are here. Read below for details of the new features, together with explanations culled from Spark Summit and elsewhere.

Measured by the number of contributors, Apache Spark remains the most active open source project in the Big Data ecosystem.

The Spark team guarantees API stability for all production releases in the Spark 2.X line.


Spark Summit: Matei Zaharia summarizes highlights of the release. Slides here.

— Webinar: Reynold Xin and Jules S. Damji introduce you to Spark 2.0.

— Reynold Xin explains technical details of Spark 2.0.

SQL Processing

Key Changes

New and updated APIs:

  • In Scala and Java, the DataFrame and DataSet APIs are unified.
  • In Python and R, DataFrame is the main programming interface, because neither language offers the compile-time type safety that Datasets rely on.
  • For the DataFrame API, SparkSession replaces SQLContext and HiveContext.
  • Enhancements to the Accumulator and Aggregator APIs.

Spark 2.0 supports SQL:2003, and runs all 99 TPC-DS queries:

  • Native SQL parser supports ANSI SQL and HiveQL.
  • Native DDL command implementations.
  • Subquery support.
  • View canonicalization support.

Additional new features:

  • Native CSV support
  • Off-heap memory management for caching and runtime.
  • Hive-style bucketing.
  • Approximate summary statistics.

Performance enhancements:

  • Speedups of 2X-10X for common SQL and DataFrame operators.
  • Improved performance with Parquet and ORC.
  • Improvements to Catalyst query optimizer for common workloads.
  • Improved performance for window functions.
  • Automatic file coalescing for native data sources.


Spark Summit: Andrew Or explains memory management in Spark 2.0+. Slides here.

Spark Summit: Databricks’ Michael Armbrust explains structured analysis in Spark: DataFrames, Datasets, and Streaming. Slides here.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— On KDnuggets, Paige Roberts explains Project Tungsten.

 Sameer Agarwal, Davies Liu, and Reynold Xin dive deeply into Spark 2.0’s second generation Tungsten engine. This paper inspired Tungsten’s design.

Spark Summit: Yin Huai dives deeply into Catalyst, the Spark optimizer. Slides here.

— On the Databricks blog, Davies Liu and Herman van Hövell explain SQL subqueries in Spark 2.0.

Spark Summit: AMPLab’s Ankur Dave explains GraphFrames for graph queries in Spark SQL. Slides here.

Spark Streaming

Key Changes

Spark 2.0 includes an experimental release of Structured Streaming.


Spark Summit: Tathagata Das explains Structured Streaming. Slides here.

— In an O’Reilly podcast, Ben Lorica asks Michael Armbrust about Structured Streaming.

— In InfoWorld, Ian Pointer explains Structured Streaming’s significance.

Machine Learning

Key Changes

The DataFrame-based API (previously named Spark ML) is now the primary API for machine learning in Spark; the RDD-based API remains in maintenance.

ML persistence is a key new feature, enabling the user to save and load ML models and pipelines in Scala, Java, Python, and R.

Additional techniques supported vary by API:

  • DataFrames-based API: Bisecting k-means clustering, Gaussian Mixture Model (GMM), MaxAbsScaler feature transformer.
  • PySpark: LDA, GMM, Generalized linear regression
  • SparkR: Naïve Bayes, k-means clustering, and survival regression, plus new families and link functions for GLM.


Spark Summit: Joseph Bradley previews machine learning in Spark 2.0. Slides here.

— On the Databricks blog, Joseph Bradley explains model persistence in Spark 2.0.

— Tim Hunter, Hossein Falaki, and Joseph Bradley explain approximate algorithms.


Key Changes

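SparkR now includes three ways to run user-defined functions: dapply, gapply and lapply (exposed in the API as spark.lapply). The first two apply a function to each partition or each group of a DataFrame; the last distributes a function over a list of inputs, which makes it useful for hyper-parameter tuning.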

As noted above, the SparkR API supports additional machine learning techniques and pipeline persistence. The API also supports more DataFrame functionality, including SparkSession, window functions, plus read/write support for JDBC and CSV.


Spark Summit: Xiangrui Meng explains the latest developments in SparkR. Slides here.

— Live webinar: Hossein Falaki and Denny Lee demonstrate exploratory analysis with Spark and R.

— UseR 2016: Hossein Falaki and Shivaram Venkataraman deliver a tutorial on SparkR.

Databricks’ 2016 Spark Survey

Databricks is running a short survey to understand the needs of Apache Spark users. If you haven’t taken the survey yet, do so today.

For results of the 2015 survey, look here. Last year’s survey produced a number of interesting findings; here’s what I wrote back in September when Databricks released its report:


Databricks released results of its 2015 Spark Survey, available here (registration required); an infographic is here.  The report is an informative mashup of survey findings, plus other information, such as the headcount from Spark Summits.  (Spoiler: it’s increasing.)  On the Databricks blog, Matei Zaharia, Patrick Wendell and Denny Lee summarize key points.  See additional analysis here, here, here, here, here, here, here and here.

Analysts, loving controversy, note that Spark users slightly prefer standalone configurations over Spark-on-YARN (e.g. co-located in Hadoop).  Andrew Oliver, for example, commenting on Cloudera’s One Platform  announcement earlier this month, argues that Databricks is actively marketing against Spark-on-YARN, citing results of this survey.  But if you compare these results to the Typesafe/Databricks Spark survey published in January, you will note that respondents to the 2015 survey are slightly less likely to run Spark in a standalone cluster this year compared to last year.

Other analysts, like Tony Baer, note that 11% of respondents run Spark on Mesos, hinting darkly that since the AMPLab team developed both Spark and Mesos, there must be some sort of conspiracy against Hadoop.  But in the earlier survey, 26% of respondents said they run on Mesos, so if someone is organizing a secret cabal to compete against Spark-on-YARN, it’s not working out too well.

The biggest news in the survey is the rapid growth of users who use the Python API, from 22% to 58%, and the corresponding decline among those who use Scala or Java.  The SQL and R interfaces are too new to compare to the previous survey, but it’s worth noting that in 2015 more respondents use the SQL interface than the Java interface.

Big Analytics Roundup (June 13, 2016)

Spark Summit 2016 met last week in SFO. There were many cool things; I will publish a separate report when presentations and videos are available.

KDnuggets releases results of its annual poll on data science software. Key findings:

  • Python use is up 51%, almost catches up to R, the #1 choice.
  • Excel and Tableau usage are up 47% and 49%, respectively.
  • Spark usage is up 91%, overtakes Hadoop.
  • SAS is down big time, drops from the top ten.

Meanwhile, Alex Woodie wraps statistics on Spark adoption, and Qubole’s Ari Amster reports on Spark usage among Qubole users.

Tim Spann recaps the week in Hadoop.

Spark Summit: Roundup of Roundups

— On the Databricks blog, Wayne Chan, Dave Wang, Jules Damji and Denny Lee recap highlights from the Summit.

— Jessica Davis rounds up the highlights.

— Jack Vaughan surrounds the story, quotes some old guy.

— Sam Dean summarizes what you need to know.

— Alex Handy collects the key bits.

— Andrew Brust separately corrals Day One and Day Two.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

Spark Summit Europe, Brussels, October 25-27 (closing date TBA)

Top Read

Adrian Colyer summarizes a paper on identifying architectural debt in software.


— Deenar Torasker explains the new capabilities of HDFS.

— Ron Bodkin explains key considerations when designing continuous apps, in the second of a three part series. Part one is here.

— On his eponymous blog, Jesse Steinweg-Woods explains Gradient Boosted Trees with XGBoost in Python.

— Adam Warski explains how Kafka Streams fits into the stream processing landscape.


— H2O.ai’s Vinod Iyengar objects to what he calls the fragmentation of Spark support, correctly noting that Cloudera and Hortonworks support different versions of Spark in their distributions. Of course, nobody is obligated to use Spark with Cloudera and Hortonworks.

— From the Spark Summit on YouTube: Ben Lorica leads a panel discussion of incredibly smart and distinguished people, plus some old guy.

— Altiscale’s Barbara Lewis presents ten use cases for Big Data.

— Tim Wallis believes that AI will relieve boredom.

— Sam Dean touts Grappa, Drill and Kafka as successors to Spark. Grappa is going nowhere. Drill is great if all you want to do is SQL, and Kafka is great if all you want to do is streaming. Pro tip: there are no real-world analytic applications where all you want to do is streaming.

— Allen Downey opines that statistical tests are inflexible and opaque. Funny, my college roommate said the same thing when he flunked his Stat 101 mid-term.

Open Source Announcements

— LinkedIn announces release of PhotonML, a machine learning library for Spark. Feature detail here.

— Google releases TensorFlow 0.9.0, with iOS support. Speculation about deep learning on your phone ensues.

— Twitter donates DistributedLog to Apache.

Commercial Announcements

— Databricks announces general availability for the Databricks Community Edition, and completion of the first phase of Databricks Enterprise Security framework.

— Microsoft announces general availability for its managed Spark service in HDInsight, and summer availability for the Spark pushdown capability in R Server. The company also announced PowerBI support for Spark Streaming, which is confusing for those who thought PowerBI already supported Spark Streaming.

— IBM announces limited preview of a managed service branded as the Data Science Experience. IBM is coy about the details; the service definitely includes Spark, Jupyter, RStudio, H2O and “curated data sets”, and may include other bits. The service itself looks promising, but IBM’s claim to offer the “first development environment for Apache Spark” is BS.

— In an oddly opaque press release, H2O announces that it is “working with” IBM. H2O is open source software, and IBM requires no permission from H2O.ai for use or distribution; presumably, H2O will offer support contracts to users. H2O.ai did not respond to request for comment.

— Splice Machine announces plans to go open source; a company insider says they plan to donate the software to Apache. Dave Ramel reports.