Note to Readers

Many thanks to all who follow this blog. I plan to repurpose this site as a purely personal blog. All of my writing about the machine learning and AI business is now on LinkedIn, where I encourage you to connect.

If you don’t want to read about my garden and my late son, feel free to unfollow the blog.

Most of the old AI/ML posts will remain up for a while.

Thanks again for reading.

— Thomas

Is AI Failing?

Nobody believes that every AI project succeeds. Just ask MD Anderson. Anderson blew $60 million on a Watson project before pulling the plug.

That project was a clown show. A report published by University of Texas auditors found that project leadership:

  • Did not use proper contracting and procurement procedures
  • Failed to follow IT Governance processes for project approval
  • Did not effectively monitor vendor contract delivery
  • Overspent pledged donor funds by $12 million

IT personnel working on the project hesitated to report exceptions because the project leader’s husband was MD Anderson’s President. Project scope grew like kudzu. MD Anderson executed 15 contracts and amendments in a series of incremental expansions. The budget for many of these was just below the threshold for Board approval, which suggests deliberate structuring to avoid scrutiny.

Interestingly, the massive expansion in project scope coincided with a $50 million pledge from “billionaire party boy” Low Taek Jho. (Jho recently cut a deal with the US government to avoid prosecution on charges related to the 1MDB scandal.)

So it’s not news that some AI projects fail. 

Last week, Fast Company published this piece with the clickbait title of Why AI is Failing Business. The authors, an economist and the two co-founders of a tiny startup, want you to believe that failure is the norm for AI projects. 

The article exemplifies a genre I call Everyone is Stupid Except Us. Practitioners of this approach paint a dire picture of current practices. The implicit message is that they have a magic bean that will set things straight. 

Citing an IDC report, the authors write that “most organizations reported failures among their AI projects, with a quarter of them reporting up to a 50% failure rate.”

Wow. Fifty fucking percent.

That number sounds fishy, so I pulled the report and checked with the author. Here’s the pertinent page:

The first part of the authors’ claim is correct. About 92% of the organizations surveyed by IDC reported one or more AI project failures.

The rest is misconstrued. About 2% of respondents reported failure rates as high as 50%. 21% reported a failure rate of more than 30%.

Most respondents report a failure rate below 30%.

In an ideal world, no AI project would fail. But put that failure rate in context. According to a report from the Project Management Institute, only about 70% of all projects completed in 2017 met original goals and business intent.

In other words, AI projects are no more or less likely to fail than any other IT project.

The authors of the Fast Company piece bloviate for another 11 paragraphs about why AI projects fail. They could have just shifted their eyeballs to the right on the page they misquote, where IDC tabulates the reasons for AI project failure. The top five cited by respondents are, in descending order:

  1. AI technology didn’t perform as expected or as promised
  2. Lacked staff with the necessary expertise
  3. Unrealistic expectations
  4. The business case wasn’t well enough understood
  5. Lack of follow-up from the business units

That first reason needs unpacking. Projects rarely fail because technology does not do what it is supposed to do. Projects fail because the buyer wants something the technology isn’t designed to deliver, or the organization cuts corners on implementation. In most cases, the customer and vendor share responsibility for that failure. The vendor may make misleading or exaggerated claims, the customer may fail to define requirements, or the customer may not perform the necessary due diligence.

It’s easier to blame the technology, though.

AI projects are the same as ERP projects or any other IT project. They succeed or fail based on the organization’s project management processes.

Next time you’re at a trade show when some AI vendor starts braying about their magic bean, do yourself a favor. Move on to the next booth.

How to Write Good

Break rules. That is the first principle of good writing. Conventional style and predictable prose will bore your audience. There is no greater sin.

You think I don’t know the difference between good and well, and this blog will be a train wreck. Or you think the headline is a joke, and this blog will be fun to read. Either way, you’re reading this blog and not something else. Which proves my point.

In a different medium, Beethoven understood the principle. His Eroica Symphony begins with a simple tune in the key of E-flat. It’s the sort of tune that, in the hands of Beethoven’s contemporaries, such as Bocklet, Gänsbacher, Hüttenbrenner, or Schenk, would remain firmly in the key of E-flat. That’s the rule. You begin in one key, you stay in that key. At least until you prepare a modulation and introduce a new tune in B-flat.

Seven bars in, however, Beethoven breaks the rule. The music veers into…something strange. Definitely not in the key of E-flat:

German musicologists try to explain this gaffe. “It’s an unprepared modulation!” “It’s a chromatic passing tone!”

I’d insert a joke about German musicologists here, but I don’t want to offend my friends at KNIME and RapidMiner.

There is a simpler explanation. Beethoven broke the rules.

He broke them deliberately. Imagine the surprised faces when Beethoven premiered the work at the Palais Lobkowitz in 1804. Vienna’s petty aristocrats did not like revolutionary thought, flies in the strudel, or wrong notes in symphonies. They preferred the music of Bocklet, Gänsbacher, Hüttenbrenner, or Schenk, composers who eschewed wrong notes.

You never heard of Bocklet, Gänsbacher, Hüttenbrenner, or Schenk? I rest my case.

By the way, if “Beethoven” evokes nothing other than a large St. Bernard dog, you need to get out more.

Beethoven also demonstrates the second principle of good writing: break rules sparingly.

If you break rules too often, people assume that you don’t know the rules. Or they figure you’re a loon. Most of Beethoven’s work conforms to classical style; his “wrong” notes stand out. Anton von Webern, on the other hand, wrote nothing but wrong notes, which is why you’ve never heard of Anton von Webern.

The third principle: use a &^%$# grammar checker. People send me writing samples, blog posts, press releases, white papers, and so forth. I drop the text into Grammarly and oops. Overused words, passive voice, unclear antecedents, you name it.

This is what happens to those writing samples.

What to do with bad writing.

When a writing sample fails the Grammarly test, it means the author is too lazy to check their work.

It reminds me of the story about the executive who went to a Mercedes dealer to check out a new S560. (Stop me if you’ve heard this before.) The exec admires the car in the showroom and chats with the sales rep.

“Can I take it for a test drive?” she asks.

“Certainly!” says the rep. “Wait here, I’ll bring one around.”

After a few minutes, the rep pulls up out front in a Mercedes S560 that is completely covered with bird shit.

The executive recoils. “This car is filthy!”

The rep shrugs. “It’s just bird shit. Isn’t this a beautiful car?”

Don’t expect me to appreciate your ideas if your text is covered with bird shit grammar and style issues.

For the record, Grammarly does not pay me to shill for them. But they should.

The next principle: omit needless words. Yeah, I know. It’s not original. Strunk and White #13:

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that every word tell.

That paragraph is a thing of beauty.

So what? you say. Strunk wrote that a hundred years ago. Don’t you have any new suggestions?

If Strunk and White #13 is obvious, why do we see so much bloated and tumescent business prose?

Think of words as a tax on the reader’s brain. Every syllable needs a neuron. More syllables –> more neurons –> more work for the reader. Your reader has other things to do, like trolling people or binging on Netflix. Tax your reader’s brain too much and you lose them.

You know who really knew how to omit needless words? The Spartans.

According to Herodotus (The Histories, Book 3, Section 46), a delegation from Samos traveled to Sparta to seek food. There, before the magistrates, they delivered a long and passionate plea for help.

The magistrates turned them down flat. “We can no longer remember the first half of your speech, and thus can make nothing of the remainder.”

Regrouping, the Samians secured another audience. This time, they made a shorter speech.

Again, the magistrates dismissed them. “Too many words.”

Desperate, the Samians returned for a final audience. This time, instead of a speech, they held out an empty bag with a sign: this bag wants bread.

The Spartans agreed to help the Samians but admonished them: they could have omitted the words ‘this bag’.

Learn to write like a fucking Spartan.

That bit about the Spartans demonstrates the next principle: tell stories.

If you lack confidence in your story-telling abilities, cheer up: in business, there is only one story. It goes like this.

  1. There is a shining city on the hill, where everything is perfect. Let’s go there.
  2. Oops, there’s a dragon under the bridge who eats people.
  3. Wouldn’t it be great if there was something that could kill dragons?
  4. Fortunately, (product) kills dragons.
  5. Here’s proof.
  6. You can kill dragons and go to the shining city. All you need is (product). Here’s how to learn more.

Forget outlining. No battle plan ever survived the first shot, said Napoleon. I say: no outline ever survived the first sentence. Just keep your story in mind while you write. Works every time.

The penultimate(*) principle: revise, revise, revise, revise. You want to write good? You have to revise and rewrite, rewrite and revise. Until you have something good enough to publish.

I’ve revised this post 25 times. It’s brilliant, right? Greatest blog post ever. But if I look at it again tomorrow, I’ll find something else to revise. It’s like detailing your car. There’s always one more spot that needs a buff.

What’s that? You have no time to revise because you’re on deadline?

Fuck your deadline. There are very few real emergencies in business. In Six-Sigma factories, anyone who spots a defect can stop the assembly line. More often than not, the “deadline” comes from some asshat who wants to juke the monthly eyeballs and needs you to create “content” on the spot. Plan ahead, and you will have time to revise all you want.

Good writing tomorrow is better than shitty writing today. If anyone tells you otherwise, find another platform.

(*) look it up, dummy.

The last principle of good writing: close with a bang. Don’t write like Wagner. Wagner dragged everything out.

You’ve heard the expression: it ain’t over ’til the fat lady sings. It’s not true. In Wagner’s Der Ring des Nibelungen, the fat lady is Brünnhilde. When she starts singing in Act II of Die Walküre, she doesn’t stop for sixteen hours. Except for a few short breaks, like when Wotan puts her to sleep and surrounds her with a ring of fire.

Even when she’s done singing, it’s not over. There’s another last gasp of Wagnerian mush while the Rhine overflows, Valhalla burns, Hagen tries to grab the Ring, the Rhinemaidens stop him, they grab the Ring, Hagen drowns, and Brünnhilde burns to a crisp. You waited ten years and paid a couple grand for the lamest seats in Bayreuth. You’re not going to run for the parking lot as soon as the fat lady stops singing. You’re going to wait to see the whole sorry mess collapse.

Stravinsky, on the other hand, knew how to close.

In the final section of Le Sacre du Printemps, after an ecstatic dance, the Chosen One collapses, dead. The orchestra delivers an enormous splat. Now that is an ending.

Notes on a Watson FAIL

A little over a year ago, on February 17, 2017, the Houston Chronicle reported that the University of Texas’ MD Anderson Cancer Center had halted an AI project for cancer diagnostics. The story revealed that MD Anderson spent $62 million over four years to build a system called the Oncology Expert Advisor (OEA), based on IBM Watson. As envisioned by its champions, OEA would help community oncologists provide quality care to patients unable to seek treatment directly from MD Anderson physicians.

A cascade of stories about the failed project ensued: in Forbes, The Wall Street Journal, MIT Technology Review, Medscape, The Cancer Letter, Health News Review, Health IT and CIO Review, Ars Technica, and many others. Four themes emerged from the reporting:

(1) OEA was a poster child for bad project management.

An audit report published in November 2016 by the University of Texas Audit Office identified numerous exceptions to standard project management practice. According to the Audit Office, project leadership:

  • Did not use proper contracting and procurement procedures
  • Failed to follow IT Governance processes for project approval
  • Did not effectively monitor vendor contract delivery
  • Overspent pledged donor funds by $12 million

Other than that, the project was well-managed. 🙂

In a response to the audit report, project leader Lynda Chin argued that she was not required to follow IT Governance policies because the effort was a “research” project. This strikes me as a silly and self-serving argument. If you’re spending $62 million on a project intended for clinical use, you need to practice good project management. Calling the project “research” does not absolve you of that responsibility.

(2) Scope changes inflated project costs.

MD Anderson signed agreements with IBM and PwC in June and July 2012, respectively. Under these initial agreements, the scope of OEA included lower risk myelodysplastic syndrome (MDS) leukemia patients. The system would digest a broad range of whole exome, tissue, and other clinical data, produce new insight, and deliver physician decision-support services. The budget: just under $5 million.

Beginning in early 2013, MD Anderson radically expanded the scope of the project:

— Diseases added to OEA’s diagnostic capabilities: $23 million.

— Onboarding two partners to pilot the system: $29 million.

— Additional data sources: $5 million. 
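As a quick sanity check, the line items above roughly account for the $62 million total. Here is a minimal tally in Python. It treats the initial budget as a round $5 million (the actual figure was "just under" that), and the variable names are mine, not the auditors':

```python
# Figures in millions of USD, taken from the text above.
# Assumption: the initial budget is rounded up to 5 ("just under $5 million").
initial_budget = 5

scope_additions = {
    "additional diseases": 23,
    "partner pilots": 29,
    "additional data sources": 5,
}

# Total project cost is the initial budget plus every scope expansion.
total = initial_budget + sum(scope_additions.values())

# How many times larger the final cost is than the initial budget.
growth = total / initial_budget

print(f"Total: ${total} million")
print(f"Growth over initial budget: {growth:.1f}x")
```

On these rounded numbers, the project ended up costing roughly twelve times its initial budget.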

Over the four-year life of the project, MD Anderson executed 7 agreements and 8 amendments with IBM and PwC. The auditors note that the contract value for many of these agreements was just below the threshold for Board approval, which suggests deliberate structuring to avoid scrutiny.

Interestingly, the massive expansion in project scope coincides with a $50 million pledge from “billionaire party boy” Low Taek Jho.

(3) OEA is not integrated with MD Anderson’s electronic health records (EHR) system.

IBM and its partners integrated the system with data from ClinicStation, the EHR system MD Anderson used previously. However, MD Anderson now uses Epic Systems for EHR; without live updates, OEA is unavailable for clinical use.

So, for $62 million, IBM and its partners built a custom demo.

(4) MD Anderson could not sell the system to partner hospitals.

MD Anderson planned from the beginning to use OEA as a way to provide high-quality cancer diagnostics to patients unable to seek treatment with MD Anderson physicians. Hence, business success of the project depended on MD Anderson’s ability to forge agreements with healthcare partners who would use the system. It was unable to do so.

Project leader Lynda Chin told the audit team that several factors prevented piloting with external partners, including “time needed for compliance and information security reviews of the cloud-based data repository,” and “lack of engagement or interest by network partners.” That’s two very different reasons. The former implies bad project estimating, poor delivery, or both; the latter implies an inability to sell the system.

Reports and analysis in the press raise as many questions as they answer.

Q. Why can’t Watson connect to MD Anderson’s EHR system?

Epic Systems is the leading EHR provider. A tool for medical diagnostics that cannot integrate with Epic is like a tool for optimizing logistics that cannot integrate with SAP.

MD Anderson began the search for a new EHR system in late 2012 and announced that it had selected Epic in early 2013. IBM and its partners knew that OEA would have to integrate with Epic before it could go into production. Moreover, they knew this very early in the OEA development cycle.

IBM announced a partnership with Epic in 2015. Interestingly, MD Anderson is not among the 14 collaborating cancer centers.

Integrating Watson with Epic Systems, MD Anderson’s current EHR system, may be easy or it may be hard. It does not matter. It was a necessary step for OEA to go into production. IBM and PwC knew this.

Yet, they kept building on. Like an AWS commercial, but without brains.

Q. Why did MD Anderson contract the project piecemeal?

MD Anderson knew from the beginning that this project would cost a lot more to deliver than the $16 million budget approved by the Board in early 2013. Otherwise, why solicit a restricted gift of $50 million? Or are we expected to believe that Jho Low just happened to come up with that number by chance?

“Thank you for your interest, Mr. Jho. Our budget for OEA is $16 million.”

“Great! Here’s a check for $50 million.”

Moreover, MD Anderson also knew from the beginning that piloting OEA with partners was critical to success. So why wasn’t this task built into the original project plan? Expanding scope to cover this task nearly doubled the project budget.

There can be good reasons to contract a project in phases. It may be difficult to accurately estimate the cost of later phases before early phases are complete. Contracting serially keeps vendors “honest” and introduces the potential for competition in later phases.

Of course, MD Anderson did not keep IBM and PwC “honest.” No vendor other than IBM and PwC performed work on this project. The cancer center awarded $51.4 million in contract fees to the two vendors under non-competitive procurement. Moreover, per the audit report, it appears that MD Anderson paid IBM and PwC for work they did not do.

Q. Why couldn’t MD Anderson secure partners for the project?

IBM wants us to believe that Watson worked well and that OEA would be in use today if MD Anderson chose to continue the project. If that’s true, why couldn’t MD Anderson interest partners in piloting the system?

OEA may be the greatest breakthrough in medicine since the discovery of penicillin. There’s only one problem: nobody wants it.

Hello, sir, I just sunk a pile of money into this gold-plated veeblefetzer. Would you like to buy one?

IBM claims that OEA agrees with experts 90% of the time. That sounds impressive, but isn’t; for all we know, “community oncologists” perform as well or better.

For more than 90% of Super Bowl LII, the Eagles didn’t sack Tom Brady.

That 10% kills you every time.

A smart organization would gauge the market for partners before sinking money into OEA. Instead, MD Anderson built it, Field of Dreams style, and hoped that partners would come.

Here are a few closing thoughts and observations.

One failed project says little about a technology, product, or company.

Case in point: plenty of ERP projects went sidewise, sometimes with dire results. A botched ERP go-live in 1999 prevented Hershey from shipping $100 million in orders for inventory it had on hand. Despite this, firms continue to invest in ERP, for good reasons.

One failed project does not mean that Watson has no value, nor does it mean that IBM cannot successfully deliver solutions based on Watson. However, it highlights that Watson projects are high-risk IT projects. Customers must exercise good vendor and project management.

Much of the blame for this FAIL rests with MD Anderson.

OEA is a lock for the Pantheon of Bad Project Management. MD Anderson failed to practice competent vendor, contract, and project management. One can hardly blame IBM and PwC for feasting at the trough.

That said, are vendors responsible for customers’ bad project governance? The answer is an emphatic “yes” — as a matter of ethics, and as good business practice. Ethical vendors do not accept contracts that violate a customer’s procurement policies. They also do not initiate or continue projects that they know will fail to deliver the promised solution.

Don’t kid yourself. IBM and PwC knew their contracts violated MD Anderson’s procurement policies. Both vendors embed themselves deeply in organizations; they often know the customer’s policies better than the executives they serve.

They also knew that the project was a train wreck. They couldn’t possibly have not known.

Big expensive AI projects, like any other project, require a sound business case.

Do we still need to bang this drum? Apparently so. MD Anderson, it seems, thought it was smart to build OEA first and figure out a business case later. Oops.

Successful AI projects require solid data architecture.

AI without live data is worthless. Build your data platform first. That is all.

IBM’s claims about the project were <ahem> “aspirational.”

In early 2013, IBM announced in a press release that MD Anderson “is using the IBM Watson cognitive computing system for its mission to eradicate cancer.”

It all depends on what the meaning of the word ‘is’ is.

Later that year, IBM planted a story in Scientific American reporting that “M. D. Anderson Cancer Center is using Watson to help doctors match patients with clinical trials, observe and fine-tune treatment plans, and assess risks.”

There’s that pesky word “is” again.

In October 2014, IBM Watson Health CTO Rob High wrote that “Doctors at the MD Anderson Cancer Center in Houston are using Watson to drive a software tool called the Oncology Expert Advisor, which serves as both a live reference manual and a virtual expert advisor for practicing clinicians.”

IBM continued to speak “aspirationally” about the project after it was stone cold dead.

In September 2016, IBM ended work on OEA and declared it “not ready for human investigational or clinical use, and its use in the treatment of patients is prohibited.” Two months later, IBM Watson Health’s Chief Health Officer Kyu Rhee touted Watson Health’s “collaboration with the world-leading MD Anderson Cancer Center in Houston, Texas. This project involves the rapid analysis of genomic information from cancer cells to provide personalized treatment for individuals.”

I guess Dr. Rhee didn’t get the memo.

The next time you hear IBM tout Watson, you may wonder if those claims, too, are “aspirational.”

How GDPR Affects Data Science

Adapted from a post originally published on the Cloudera VISION Blog.

If your organization collects data about citizens of the European Union (EU), you probably already know about the General Data Protection Regulation (GDPR). GDPR defines and strengthens data protection for consumers and harmonizes data security rules within the EU. The European Parliament approved the measure on April 27, 2016. It goes into effect in less than a year, on May 25, 2018.

Much of the commentary about GDPR focuses on how the new rules affect the collection and management of personally identifiable information (PII) about consumers. However, GDPR will also change how organizations practice data science. That is the subject of this blog post.

One caveat before we begin. GDPR is complicated. In some areas, GDPR defines high-level outcomes, but delegates detailed compliance rules to a new entity, the European Data Protection Board. GDPR regulations intersect with many national laws and regulations; organizations that conduct business in the United Kingdom must also assess the unknown impacts of Brexit. Organizations subject to GDPR should engage expert management and legal counsel to assist in developing a compliance plan.  

GDPR and Data Science

GDPR affects data science practice in three areas. First, GDPR imposes limits on data processing and consumer profiling. Second, for organizations that use automated decision-making, GDPR creates a “right to an explanation” for consumers. Third, GDPR holds firms accountable for bias and discrimination in automated decisions.  

Data processing and profiling. GDPR imposes controls on data processing and consumer profiling; these rules supplement the requirements for data collection and management. GDPR defines profiling as:

Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular, to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements.

In general, organizations may process personal data when they can demonstrate a legitimate business purpose (such as a customer or employment relationship) that does not conflict with the consumer’s rights and freedoms. Organizations must inform consumers about profiling and its consequences, and provide them with the opportunity to opt out.

The Right to an Explanation. GDPR grants consumers the right “not to be subject to a decision…which is based solely on automated processing and which produces legal effects (on the subject).” Experts characterize this rule as a “right to an explanation.” GDPR does not precisely define the scope of decisions covered by this section. The United Kingdom’s Information Commissioner’s Office (ICO) says that the right is “very likely” to apply to credit applications, recruitment, and insurance decisions. Other agencies, law courts or the European Data Protection Board may define the scope differently.

Bias and Discrimination. When organizations use automated decision-making, they must prevent discriminatory effects based on racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect. Moreover, they may not use specific categories of personal data in automated decisions except under defined circumstances.

How GDPR Affects Data Science Practice

How will the new rules affect the way data science teams do their work? Let’s examine the impact in three key areas.

Data Processing and Profiling. The new rules allow organizations to process personal data for specific business purposes, fulfill contractual commitments, and comply with national laws. A credit card issuer may process personal data to determine a cardholder’s available credit; a bank may screen transactions for money laundering as directed by regulators. Consumers may not opt out of processing and profiling performed under these “safe harbors.”

However, organizations may not use personal data for a purpose other than the original intent without securing additional permission from the consumer. This requirement could limit the amount of data available for exploratory data science.

GDPR’s constraints on data processing and profiling apply only to data that identifies an individual consumer.

The principles of data protection should therefore not apply to … personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

The clear implication is that organizations subject to GDPR must build robust anonymization into data engineering and data science processes.
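To make "build anonymization into the process" concrete, here is a minimal sketch in plain Python. It drops direct identifiers, coarsens quasi-identifiers, and then measures the smallest group of records that share the same coarsened values (the "k" in k-anonymity). The column names, the generalization rules, and the use of k-anonymity as a yardstick are all my assumptions for illustration. GDPR does not prescribe a technique, and whether a given dataset counts as anonymous under the regulation is a question for your counsel.

```python
from collections import Counter

# Hypothetical column names for fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email"}


def generalize(record):
    """Drop direct identifiers and coarsen quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace exact age with a ten-year band, e.g. 34 -> "30-39".
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    # Keep only the first two characters of the postcode.
    out["postcode"] = record["postcode"][:2] + "***"
    return out


def smallest_group(records, keys=("age", "postcode")):
    """Size of the smallest set of records sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Toy data, invented for this example.
customers = [
    {"name": "A", "email": "a@example.com", "age": 34, "postcode": "10115", "spend": 120},
    {"name": "B", "email": "b@example.com", "age": 37, "postcode": "10117", "spend": 80},
    {"name": "C", "email": "c@example.com", "age": 52, "postcode": "80331", "spend": 200},
    {"name": "D", "email": "d@example.com", "age": 58, "postcode": "80333", "spend": 150},
]

anonymized = [generalize(c) for c in customers]
k = smallest_group(anonymized)
print(f"Smallest group size (k): {k}")
```

A small k means some records are still close to unique. In a real pipeline you would generalize further or suppress those records before handing the data to analysts.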

Explainable Decisions. There is some controversy about the impact of this provision. Some cheer it; others disapprove; still others deny that GDPR creates such a right. One expert in EU law argues that the requirement may force data scientists to stop using opaque techniques (such as deep learning), which can be hard to explain and interpret.

There is no question that GDPR will affect how organizations handle certain decisions. The impact on data scientists, however, may be exaggerated:

— The “right to an explanation” is limited in scope. As noted above, one regulator interprets the law to cover credit applications, recruitment, and insurance decisions. Other regulators or law courts may interpret the rules differently, but it’s clear that the right applies in specific settings. It does not apply to every automated decision.

— In many jurisdictions, a “right to an explanation” already exists and has existed for years. For example, regulations governing credit decisions in the United Kingdom are similar to those in the United States, where issuers must provide an explanation for adverse credit decisions based on credit bureau information. GDPR expands the scope of these rules, but tools for compliance are commercially available today.

— Most businesses that decline some customer requests understand that adverse decisions should be explained to customers. This is already common practice in the lending and insurance industries. Smart businesses treat adverse decisions as an opportunity to position an alternate product.

— The need to deliver an explanation affects decision engines but need not influence the choice of methods for model training. Techniques available today make it possible to “reverse-engineer” interpretable explanations for model scores even if the data scientist uses an opaque method to train the model.
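The last point can be sketched in a few lines of Python. The idea: treat the model as a black box, swap each input for a reference value one at a time, and report which swaps would have improved the score the most. Everything here is hypothetical, including the scoring function, the feature names, and the baseline applicant. One-at-a-time perturbation is also a crude stand-in for more careful attribution methods, because it ignores interactions between features.

```python
def opaque_score(applicant):
    """Stand-in for a black-box model; any callable that returns a score works."""
    score = 0.5
    score += 0.3 if applicant["income"] > 50_000 else -0.2
    score -= 0.25 * applicant["late_payments"]
    score += 0.1 if applicant["years_at_address"] >= 3 else 0.0
    return score


# Hypothetical reference applicant used as the comparison point.
BASELINE = {"income": 60_000, "late_payments": 0, "years_at_address": 5}


def reason_codes(model, applicant, baseline=BASELINE):
    """Rank features by how much the score rises when each is set to baseline."""
    base = model(applicant)
    gains = {}
    for feature, reference in baseline.items():
        # Copy the applicant, replacing exactly one feature with the baseline value.
        probe = dict(applicant, **{feature: reference})
        gains[feature] = model(probe) - base
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


applicant = {"income": 40_000, "late_payments": 3, "years_at_address": 1}
reasons = reason_codes(opaque_score, applicant)

for feature, gain in reasons:
    print(f"{feature}: score would rise by {gain:.2f}")
```

The function never looks inside the model, so the same approach works whether the model was trained with logistic regression or a deep neural network.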

Nevertheless, there are good reasons for data scientists to consider using interpretable techniques. Financial services giant Capital One considers them to be a potent weapon against hidden bias (discussed below.) But one should not conclude that GDPR will force data scientists to limit the techniques they use to train predictive models.

Bias and Discrimination. GDPR requires organizations to avoid discriminatory effects in automated decisions. This rule places an extra burden of due diligence on data scientists who build predictive models, and on the procedures organizations use to approve predictive models for production.

Organizations that use automated decision-making must:

  • Ensure fair and transparent processing
  • Use appropriate mathematical and statistical procedures
  • Establish measures to ensure the accuracy of subject data employed in decisions

GDPR expressly prohibits the use of personal characteristics (such as age, race, ethnicity, and other enumerated classes) in automated decisions. However, it is not sufficient to just avoid using this data. The mandate against discriminatory outcomes means data scientists must also take steps to prevent indirect bias from proxy variables, multicollinearity or other causes. For example, an automated decision that uses a seemingly neutral characteristic, such as a consumer’s residential neighborhood, may inadvertently discriminate against ethnic minorities.
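One simple way to look for indirect bias is to audit outcomes rather than inputs: compare approval rates across groups, even though the model never saw the group label. The sketch below uses invented data in which neighborhood lines up perfectly with group membership. The field names and the ratio metric are my assumptions. GDPR does not define a numeric threshold for discrimination, so treat the output as a prompt for investigation, not a verdict.

```python
from collections import defaultdict


def approval_rates(decisions, group_key):
    """Approval rate per group. The group label is used for auditing only."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for d in decisions:
        totals[d[group_key]] += 1
        approved[d[group_key]] += d["approved"]
    return {g: approved[g] / totals[g] for g in totals}


def impact_ratio(rates):
    """Lowest group approval rate divided by the highest (assumes the highest is non-zero)."""
    return min(rates.values()) / max(rates.values())


# Toy audit sample, invented for this example. The model used "neighborhood",
# not "group", but the two are perfectly aligned here.
decisions = (
    [{"group": "A", "neighborhood": "north", "approved": 1}] * 8
    + [{"group": "A", "neighborhood": "north", "approved": 0}] * 2
    + [{"group": "B", "neighborhood": "south", "approved": 1}] * 4
    + [{"group": "B", "neighborhood": "south", "approved": 0}] * 6
)

rates = approval_rates(decisions, "group")
ratio = impact_ratio(rates)

print(f"Approval rates by group: {rates}")
print(f"Impact ratio: {ratio:.2f}")
```

A ratio well below 1.0 says the two groups get very different outcomes. It does not say why, so the next step is to find out which inputs act as proxies.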

Data scientists must also take affirmative steps to confirm that the data they use when they develop predictive models is accurate; “garbage in/garbage out,” or GIGO, is not a defense. They must also consider whether biased training data on past outcomes can bias models. As a result, data scientists will need to concern themselves with data lineage, to trace the flow of data through all processing steps from source to target. GDPR will also drive greater concern for reproducibility, or the ability to accurately replicate a predictive modeling project.

Your Next Steps

If you do business in the European Union, now is the time to start planning for GDPR. There is much to be done: evaluating the data you collect, implementing compliance procedures, assessing your processing operations and so forth. If you are currently using machine learning for profiling and automated decisions, there are four things you need to do now.

Limit access to personally identifiable information (PII) about consumers.

Implement robust anonymization, so that by default analytic users cannot access PII. Define an exception process that permits access to PII in exceptional cases under proper security.  
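
As a minimal sketch of "no PII by default," the snippet below replaces PII fields with keyed-hash tokens before a record reaches analytic users. The field names, the key, and the helpers pseudonymize and analyst_view are hypothetical. Keyed hashing is pseudonymization rather than full anonymization, so treat it as one layer of a broader design, not a complete answer.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would live in a secrets manager,
# out of reach of analytic users.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical list of fields treated as PII.
PII_FIELDS = {"name", "email"}

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed-hash token for a PII value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def analyst_view(record: dict) -> dict:
    """Copy a record, replacing PII fields with tokens."""
    return {
        field: pseudonymize(str(value)) if field in PII_FIELDS else value
        for field, value in record.items()
    }

record = {"name": "Jane Doe", "email": "jane@example.com", "basket_total": 42.50}
masked = analyst_view(record)
print(masked)
```

Because the same input always yields the same token, analysts can still join and count by customer. Only the exception process, which holds the key, can link a token back to a person.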

Identify predictive models that currently use PII.

In each case, ask:

  • Is this data analytically necessary?
  • Does the PII provide unique and irreplaceable information value?
  • Does the predictive model support a permitted use case?

Inventory consumer-facing automated decisions.

  • Identify decisions that require explanations.
  • Implement procedures to handle consumer questions and concerns.

Establish a data science process that minimizes the risk of errors and bias.

  • Implement a workflow that ensures proper model development and testing.
  • Consider the possibility of bias “built in” to training data.
  • Rigorously test and validate predictive models.
  • Implement peer review for an independent assessment of every model.

Even if your organization is not subject to GDPR, consider implementing these practices anyway. It’s the right way to do business.

Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

“Spark is overhyped!” he declared. His evidence? This:


One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which begs the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
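
The pattern is easy to sketch: each experiment is independent, so a grid of parameter settings can be fanned out to a pool of workers and the best result collected at the end. The example below uses a thread pool from Python's standard library as a stand-in for a Spark cluster. The function run_experiment, its toy scoring rule, and the parameter grid are hypothetical; on a real cluster the same function would be mapped over the grid by Spark executors instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_experiment(params):
    """Hypothetical stand-in for training and scoring one model.
    The toy score peaks at depth=4, rate=0.1."""
    depth, rate = params
    score = -((depth - 4) ** 2) - (rate - 0.1) ** 2
    return score, params

# Every combination of the two hypothetical parameters: 12 experiments.
grid = list(product([2, 4, 6, 8], [0.01, 0.1, 1.0]))

# Run the experiments concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, grid))

# Tuples compare by score first, so max() picks the best experiment.
best_score, best_params = max(results)
print(best_params)
```

With hundreds of experiments and real training jobs, the wall-clock time shrinks roughly in proportion to the number of workers, which is the argument for using a cluster even when each dataset fits on one machine.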

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLlib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms.)

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGBoost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with Spark ML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

The Year in SQL Engines

As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run SQL; it runs HiveQL, an SQL-like language that closely approximates SQL. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, Spark, Drill, HAWQ, and Presto also work with a wide variety of other data storage systems.

Unlike relational databases, SQL engines operate independently of the data storage system. In contrast, relational databases bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling them, on the other hand, provides greater flexibility, though at the potential loss of performance.

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-Engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries, mentions in online discussions, job offers, mentions in professional profiles, and tweets.

Figure 1

Source: DB-Engines, January 2017

Although Impala, Spark SQL, Drill, HAWQ, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap to the other engines. Most of the Hive alternatives launched in 2012 when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, and Drill ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2

Source: DB-Engines, January 2017

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3

Source: Open Hub

In 2016, Cloudera, Hortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks that Tony Baer summarizes. I’m sure that you will be shocked to learn that the vendor’s preferred SQL engine outperformed the others in each of these studies, which begs the question: are benchmarks bullshit?

AtScale’s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL backends. The company’s software is engine-neutral — it seeks to run on as many as possible — and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes several improvements to Hive-on-Spark, plus performance, usability, supportability and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012, as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries.

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014 Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala;  Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure and delivered Release 2.7.0, its first Apache release, in October. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, in an Alpha release as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.


Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases or proprietary file systems (such as Amazon Web Services’ S3.)  Presto queries can federate data from multiple sources.  Users can submit queries from C, Java, Node.js, PHP, Python, R and Ruby.
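
To make the idea of a federated query concrete, here is a toy analogy in Python's built-in sqlite3 module rather than in Presto itself. Two separate in-memory databases stand in for two data sources, and a single SQL statement joins across them. The table names, the schema alias crm, and the data are hypothetical. Presto does this at scale across genuinely different systems through its connectors.

```python
import sqlite3

# First "source": the main in-memory database.
conn = sqlite3.connect(":memory:")
# Second, separate "source": another in-memory database attached as crm.
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (2, 35.0), (1, 5.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "east"), (2, "west")],
)

# One query that reads from both sources and combines the results.
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
print(rows)
```

The point for the analyst is that the join happens in the query engine, so nobody has to copy one source into the other before asking the question.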

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project.  Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and ZoomData, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top-level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to production. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine — I can see the facepalms — so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”

The Year in Machine Learning (Part Four)

This is the fourth installment in a four-part review of 2016 in machine learning and deep learning.

— Part One covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms.

— Part Two surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

— Part Three reviewed the machine learning and deep learning initiatives of Big Tech Brands, industry leaders with significant budgets for software development and marketing.

In Part Four, I profile eleven startups in the machine learning and deep learning space. A search for “machine learning” in Crunchbase yields 2,264 companies. This includes companies, such as MemSQL, who offer absolutely no machine learning capability but hype it anyway because Marketing; it also includes application software and service providers, such as Zebra Medical Imaging, who build machine learning into the services they provide.

All of the companies profiled in this post provide machine learning tools as software or services for data scientists or for business users. Within that broad definition, the firms are highly diverse:

Continuum Analytics, Databricks, and H2O.ai drive open source projects (Anaconda, Apache Spark, and H2O, respectively) and deliver commercial support.

Alpine Data, Dataiku, and Domino Data Lab offer commercially licensed collaboration tools for data science teams. All three run on top of an open source platform.

KNIME and RapidMiner originated in Europe, where they have large user communities. Both combine a business user interface with the ability to work with Big Data platforms.

Fuzzy Logix and Skytree provide specialized capabilities primarily for data scientists.

DataRobot delivers a fully automated workflow for predictive analytics that appeals to data scientists and business users. It runs on an open source platform.

Four companies deserve an “honorable mention” but I haven’t profiled them in depth:

— Two startups, BigML and SkyMind, are still in seed funding stage. I don’t profile them below, but they are worth watching. BigML is a cloud-based machine learning service; SkyMind drives the DL4J open source project for deep learning.

— Two additional companies aren’t startups because they’ve been in business for more than thirty years. Salford Systems developed the original software for CART and Random Forests; the company has added more techniques to its suite over time and has a loyal following. Statistica, recently jettisoned by Dell, delivers a statistical package with broad capabilities; the company consistently performs well in user satisfaction surveys.

I’d like to take a moment to thank those who contributed tips and ideas for this series, including Sri Ambati, Betty Candel, Leslie Miller, Bob Muenchen, Thomas Ott, Peter Prettenhofer, Jesus Puente, Dan Putler, David Smith, and Oliver Vagner.

Alpine Data

In 2016, the company formerly known as Alpine Data Labs changed its name and CEO. Alpine dropped the “Labs” from its brand — I guess they didn’t want to be confused with companies that test stool samples — so now it’s just Alpine Data. And, ex-CEO Joe Otto is now an “Advisor,” replaced by Dan Udoutch, a “seasoned executive” with 30+ years of experience in business and zero years of experience in machine learning or advanced analytics. The company also dropped its CFO and head of Sales during the year, presumably because the investors were extremely happy with Alpine’s business results.

Alpine originally built its software to run in Greenplum database, then ported some of its algorithms to MapReduce in early 2013. Riding a wave of Hadoop buzz, Alpine closed on a venture round in November 2013, just in time for everyone to realize that MapReduce sucks for machine learning. The company quickly turned to Spark — Databricks certified Alpine on Spark in 2014 — and has gradually ported its analytics operators to the new framework.


It seems that rebuilding on Spark has been a bit of a slog because Alpine hasn’t raised a fresh round of capital since 2013. As a general rule, startups that make their numbers get fresh rounds every 12-24 months; companies that don’t get fresh funding likely aren’t making their numbers. Investors aren’t stupid and, like the dog that did not bark, a venture capital round that does not happen says a lot about a company’s prospects.

In product news, the company announced Chorus 6, a major release, in May, and Chorus 6.1 in September. Enhancements in the new releases include:

— Integration with Jupyter notebooks.

— Additional machine learning operators.

— Spark auto-tuning. Chorus pushes processing to Spark, and Alpine has developed an optimizer to tune the generated Spark code.

— PFA support for model export. This is excellent, a cutting-edge feature.

— Runtime performance improvements.

— Tweaks to the user experience.

Lawrence Spracklen, Alpine’s VP of Engineering, will speak about Spark auto-tuning at the Spark Summit East in Boston.

Prospective users and customers should look for evidence that Alpine is a viable company, such as a new funding round, or audited financials that show positive cash flow.

Continuum Analytics

Continuum Analytics develops and supports Anaconda, an open source Python distribution for data science. The core Anaconda bundle includes Navigator, a desktop GUI that manages applications, packages, environments and channels; 150 Python packages that are widely used in data science; and performance optimizations. Continuum also offers commercially licensed extensions to Anaconda for scalability, high performance and ease of use.


Anaconda 2.5, announced in February, introduced performance optimization with the Intel® Math Kernel Library. Beginning with this release, Continuum bundled Anaconda with Microsoft R Open, an enhanced free R distribution.

In 2016, Continuum introduced two major additions to the Anaconda platform:

Anaconda Enterprise Notebooks, an enhanced version of Jupyter notebooks

Anaconda Mosaic, a tool for cataloging heterogeneous data

The company also announced partnerships with Cloudera, Intel, and IBM. In September, Continuum disclosed $4 million in equity financing. The company was surprisingly quiet about the round — there was no press release — possibly because it was undersubscribed.

Continuum’s AnacondaCon 2017 conference meets in Austin February 7-9.


Databricks

Databricks leads the development of Apache Spark (profiled in Part Two of this review) and offers a cloud-based managed service built on Spark. The company also offers training and certification, and organizes the Spark Summits.

The team that originally developed Spark founded Databricks in 2013. Company employees continue to play a key role in Apache Spark, holding a plurality of the seats on the Project Management Committee and contributing more new code to the project than any other company.


In 2016, Databricks added a dashboarding tool and a RESTful interface for job and cluster management to its core managed service. The company made major enhancements to the Databricks security framework, completed SOC 2 Type 1 certification for enterprise security, and announced HIPAA compliance and availability in Amazon Web Services’ GovCloud for sensitive data and regulated workloads.

Databricks also launched a free Community edition and a five-part series of free MOOCs, completed its annual survey of the Spark user community, and organized three Spark Summits.

In December, Databricks announced a $60 million “C” round of venture capital. New Enterprise Associates led the round; Andreessen Horowitz participated.


Dataiku

Dataiku develops and markets Data Science Studio (DSS), a workflow and collaboration environment for machine learning and advanced analytics. Users interact with the software through a drag-and-drop interface; DSS pushes processing down to Hadoop and Spark. The product includes connectors to a wide variety of file systems, SQL platforms, cloud data stores and NoSQL databases.


In 2016, Dataiku delivered Releases 3.0 and 3.1. Major new capabilities include H2O integration (through Sparkling Water); additional data sources (IBM Netezza, SAP HANA, Google BigQuery, and Microsoft Azure Data Warehouse); support for Spark MLlib algorithms; performance improvements; and many other enhancements.

In October, Dataiku closed on a $14 million “A” round of venture capital. FirstMark Capital led the financing, with participation from Serena Capital.


DataRobot

DataRobot, a Boston-based startup founded by insurance industry veterans, offers an automated machine learning platform that combines built-in expertise with a test-and-learn approach. Leveraging an open source back end, the company’s eponymous software searches through combinations of algorithms, pre-processing steps, features, transformations and tuning parameters to identify the best model for a particular problem.


The company has a team of Kaggle-winning data scientists and leverages this expertise to identify new machine learning algorithms, feature engineering techniques, and optimization methods. In 2016, DataRobot added several new capabilities to its product, including support for Hadoop deployment, deep learning with TensorFlow, reason codes that explain predictions, feature impact analysis, and additional capabilities for model deployment.

DataRobot also announced major alliances with Alteryx and Cloudera. Cloudera awarded the company its top-level certification: the software integrates with Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels.

Earlier in the year, DataRobot closed on $33 million in Series B financing. New Enterprise Associates led the round; Accomplice, Intel Capital, IA Ventures, Recruit Strategic Partners, and New York Life also participated.

Domino Data Lab

Domino Data Lab offers the Domino Data Science Platform (DDSP), a scalable collaboration environment that runs on-premises, in virtual private clouds or hosted on Domino’s AWS infrastructure.


DDSP provides data scientists with a shared environment for managing projects, scalable computing with a variety of open source and commercially licensed software, job scheduling and tracking, and publication through Shiny and Flask. Domino supports rollbacks, revision history, version control, and reproducibility.

In November, Domino announced that it closed a $10.5 million “A” round led by Sequoia Capital. Bloomberg Beta, In-Q-Tel, and Zetta Venture Partners also participated.

Fuzzy Logix

Fuzzy Logix markets DB Lytix, a library of more than eight hundred functions for machine learning and advanced analytics.  Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database) and in Hadoop through Hive.

Users invoke DB Lytix functions from SQL, from R, through BI tools, or from custom web interfaces. Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis. All functions support native in-database scoring. The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

In April, the company announced the availability of DB Lytix on Teradata Aster Analytics, a development that excited all three of the people who think Aster has legs.

H2O.ai

H2O.ai develops and supports H2O, the open source machine learning project I profiled in Part Two of this review. As I noted in Part Two, H2O.ai updated Sparkling Water, its Spark integration, for Spark 2.0; released Steam, a model deployment framework, to production; and previewed Deep Water, an interface to GPU-accelerated back ends for deep learning.


In 2016, H2O.ai added 3,200 enterprise organizations and over 43,000 users to its roster, bringing its open source community to over 8,000 enterprises and nearly 70,000 users worldwide. In the annual KDnuggets poll of data scientists, reported usage tripled. New customers include Kaiser Permanente, Progressive, Comcast, HCA, McKesson, Macy’s, and eBay.

KNIME

KNIME AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a GPL license with an exception permitting third parties to use the API for proprietary extensions. The KNIME Analytics Platform features a graphical user interface with a workflow metaphor. Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch.


KNIME offers commercially licensed extensions for scalability, integration with data platforms, collaboration, and productivity. The company provides technical support for the extension software.

During the year, KNIME delivered two dot releases and three maintenance releases. The new features added to the open source edition in Releases 3.2 and 3.3 include Workflow Coach, a recommender based on community usage statistics; streaming execution; feature selection; ensembles of trees and gradient boosted trees; deep learning with DL4J, and many other enhancements. In June, KNIME launched the KNIME Cloud Analytics Platform on Microsoft Azure.

KNIME held its first Summit in the United States in September and announced an online training course available through O’Reilly Media.


RapidMiner

RapidMiner, Inc. of Cambridge, Massachusetts, develops and supports RapidMiner, an easy-to-use package for business analysis, predictive analytics, and optimization. The company launched in 2006 (under the corporate name of Rapid-I) to drive development, support, and distribution for the RapidMiner software project. The company moved its headquarters to the United States in 2013.


The desktop version of the software, branded as RapidMiner Studio, is available in free and commercially licensed editions.  RapidMiner also offers a commercially licensed Server edition, and Radoop, an extension that pushes processing down to Hive, Pig, Spark, and H2O.

RapidMiner introduced Release 7.x in 2016 with an updated user interface. Other enhancements in Releases 7.0 through 7.3 include a new data import facility, Tableau integration, parallel cross-validation, and H2O integration (featuring deep learning, gradient boosted trees and generalized linear models).

The company also introduced a feature called Single Process Pushdown. This capability enables RapidMiner users to supplement native Spark and H2O algorithms with RapidMiner pipelines for execution in Hadoop. RapidMiner supports Spark 2.0 as of Release 7.3.

In January 2016, RapidMiner closed a $16 million equity round led by Nokia Growth Partners. Ascent Venture Partners, Earlybird Venture Capital, Longworth Venture Partners, and OpenOcean also participated.


Skytree

Skytree Inc. develops and markets an eponymous commercially licensed software package for machine learning. Its founders launched the venture in 2012 to monetize an academic machine learning project (Georgia Tech’s FastLab).


The company landed an $18 million venture capital round in 2013 and hasn’t secured any new funding since then. (Read my comments under Alpine Data to see what that indicates.) Moreover, the underlying set of algorithms does not seem to have changed much since then, though Skytree has added and dropped several different add-ons and wrappers.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs or a browser-based GUI. Output includes explanations of the model in plain English. Skytree has a grid search feature for parameterization, which it trademarks as AutoModel, labels as “ground-breaking” and is attempting to patent. Analysts who don’t know anything about grid search think this is amazing.
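For readers who haven’t run into it, grid search is nothing exotic: you list candidate values for each tuning parameter, score every combination, and keep the winner. Here is a minimal sketch in plain Python. It is not Skytree’s AutoModel, whose internals I haven’t seen; `grid_search`, `toy_score`, and the parameter names are hypothetical, and the toy scoring function stands in for "train a model and return a validation score."

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Score every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    # product() enumerates the full Cartesian grid of candidate values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "fit a model, return its validation score".
# It peaks at depth=4, learning_rate=0.1 by construction.
def toy_score(depth, learning_rate):
    return -((depth - 4) ** 2) - (learning_rate - 0.1) ** 2


best_params, best_score = grid_search(
    toy_score,
    {"depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 1.0]},
)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (nine model fits here), which is why the technique is easy to implement and expensive to run.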

In 2016, Skytree introduced a freemium edition, branded as Skytree Express. Hold out another six months and they’ll pay you to try it.

As is the case with Alpine Data, if you like Skytree’s technology, wait for another funding round, or ask the company to provide evidence of positive cash flow.

The Year in Machine Learning (Part Three)

This is the third installment in a four-part review of 2016 in machine learning and deep learning. In Part One, I covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms. In Part Two, I surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

In this installment, we will review the machine learning and deep learning initiatives of Big Tech Brands — industry leaders with big budgets for software development and marketing. Big Tech Brands fall into three groups:

— SAS is the software revenue leader in predictive analytics. It has a unique business model and falls into its own category.

— Companies such as IBM, Microsoft, Oracle, SAP, and Teradata all have strong franchises in the data warehousing market, and all except Teradata offer widely used business intelligence software. These companies have the financial strength to develop, market and cross-sell machine learning software to their existing customer base, and can impact the market if they choose to do so.

— Dell and HPE dabbled in advanced analytics and exited the market in 2016.

I covered Google and Amazon Web Services in Part One. Although neither company has a strong position in business analytics at present, they are making moves in that direction. Google set up Google Cloud Machine Learning as a distinct product group this year to service that market, and Amazon introduced QuickSight, a business analytics service.

Regular readers know that I favor open source software — as do most data scientists. Among the companies covered in this installment, IBM and Microsoft are making substantial commitments to the open source model, including direct contributions to open source software projects. They deserve kudos for that. Teradata is investing in Presto SQL, for which they get polite applause. Oracle and SAP leverage open source software in their solutions but make no significant contributions. SAS embraces open source the way a cat embraces a porcupine.

In Part Four, I will survey machine learning startups, and deliver results from the Bottom Story of the Year poll.


SAS

SAS leads the market in licensing revenue for advanced and predictive analytics software, according to IDC. The company has a loyal following among statisticians, actuaries, life scientists and others whose work depends on statistical analysis.

Partnering with IBM, SAS built its business in the 1970s on the strength of its software for the IBM System/360 mainframe. IBM promoted the software to its enterprise customers to increase adoption and use of its hardware. SAS software still runs on the mainframe, and the company continues to earn a significant share of its revenue on that platform. IBM has mainframe customers who use the big box exclusively for SAS.

In the 1990s, SAS successfully transitioned to a multi-vendor architecture and rebuilt its software to run on many different hardware platforms and operating systems. During this period, SAS established a reputation for industrial-strength and enterprise-grade software — in contrast to vendors like SPSS, who focused on building easy-to-use software for the desktop.

On the face of it, SAS has struggled to transition from server-based computing to the contemporary world of distributed architecture and cloud platforms. In the past ten years, the company has announced multiple initiatives to improve the performance and scalability of its products, with mixed success. In April, SAS announced Viya, its third attempt to deliver advanced analytics in a distributed MPP architecture.

What is SAS Viya? How does it differ from SAS’ previous attempts at high-performance design? Let’s peruse the brochure:

Cloud-ready, elastic and scalable


SAS Viya is built to be elastic and scalable for both private and public clouds. Analytical, in-memory computations are optimized for unconstrained environments, but they can also adjust for constrained environments. The elastic processing automatically adapts to needs and available resources – spinning up or winding down computing capacity as needed. Elastic scalability lets you quickly experiment with different scenarios and apply more complex approaches to larger amounts of streaming data.

Ahem. Any software is “cloud-ready,” in the sense that a Linux instance is a Linux instance whether it runs on-premises or in the cloud. And any software is elastic when you deploy it in a virtual appliance, such as an Amazon Machine Image. That includes SAS 9.4, which SAS touted as “cloud-ready” in 2014, and previous versions of SAS, which you could deploy in AWS even though SAS did not formally support the platform.

If you want to spin up software instances, however, you need software licenses. With open source software, such as Python, R, or Spark, that’s not an issue — you can spin up as many instances as you like without violating license agreements. Commercial software is more complicated since you need to pay for the licenses you want to spin up. Some vendors, like HPE and Teradata, tried to address this problem by marketing their own cloud platforms to compete with Amazon Web Services; they failed miserably. Others, like Oracle, partner with AWS to deliver their software in the cloud — either as a bundled managed service or on a “Bring Your Own License” (BYOL) model.

You can’t have elastic computing with commercial software without a flexible licensing model. Pay-for-what-you-use licensing poses a problem for vendors like SAS, because if customers only pay for what they use, they invariably pay a lot less than they do under term licensing. Most commercial software customers are over-licensed — they’re paying for a lot of software they don’t use. That is why revenue from on-premises software licensing is declining much faster than revenue from cloud-based subscriptions is rising. In the cloud, you can do more with less.

The bottom line is this: unless Viya is available under an elastic pricing model, nobody cares that it is “cloud-ready, elastic and scalable.”

If you want to have a little fun, the next time your SAS rep touts Viya’s elasticity, ask him what it will cost per hour to license the software. Watch him squirm.

Open analytics coding environment


Empower your data scientists with SAS Analytics that are easily available from a variety of programming languages. Whether it’s a Python notebook, Java client, Lua scripting interface or SAS, your modelers and data scientists can easily access the power of SAS for data manipulation, advanced analytics and analytical reporting.

We’ve all been waiting for the ability to run SAS from Lua.

Resilient architecture with guaranteed failover


For answers you depend on, you need analytical processing power you can count on. You need all your analytical computations to finish processing without interruption. The fault-tolerant design of SAS Viya automatically detects server failure, even in multiplatform processing environments, and redistributes processing as needed. It also manages several copies of data on the processing cluster. If a machine in the cluster becomes unavailable or fails, the required data is retrieved from another block to quickly continue processing. These self-healing mechanisms ensure high availability for uninterrupted processing and automated recovery.

“It runs on Hadoop.”

Interviewed in Forbes, SAS CEO Jim Goodnight speaks at length about Viya:

We are ready for big data…(we) just released our first version of our new Viya architecture, which is massively parallel computing where we spread the data out over dozens of servers and then use all the cores inside those servers to process the data in parallel. So we might have 500 cores working on the data all at once in parallel, and that allows it to handle some really, really big problems that we’ve never even thought of before. Things like logistic regression.

Someone should feed Dr. G. better talking points. Just for the record, commercially available software for logistic regression running in a massively parallel (MPP) environment first hit the market in 1989. Distributed logistic regression is currently available in multiple software packages, including one introduced by SAS five years ago.

Logistic regression (a non-linear model) is an iterative process. Essentially, you’re trying to estimate the parameters in the model, and so you take a guess, you’ve got to run through the data using that guess, then to refine it and do another guess and run through the data again, and you keep doing this over and over and over until the parameters converged or they don’t change much at all anymore. That can take 25 to 30 passes of the data. Now, in the old days, we used to have to read the data that many times. Now, it’s in memory. We put it in memory and it stays in memory. It’s spread out over 500 cores and then each one just does a little piece of the work, and so we can do those 25 iterations in just a few minutes, whereas it used to take hours.

It’s just like Spark, but with a license key.
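The loop Goodnight describes (guess, pass through the data, refine, repeat until the estimates stop moving) can be sketched in a few lines. This is a minimal single-machine illustration using Newton’s method on one predictor plus an intercept; it is not SAS code, and the function name, tolerance, and toy data are my own assumptions.

```python
import math


def fit_logistic(xs, ys, tol=1e-8, max_passes=50):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method."""
    b0, b1 = 0.0, 0.0  # the initial guess
    for n_pass in range(1, max_passes + 1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        # One full pass over the data: accumulate the gradient of the
        # log-likelihood (g) and the information matrix (h).
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system h * delta = g for the parameter update.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        # Stop when the parameters barely change between passes.
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1, n_pass


# Toy data (made up for illustration); the classes overlap, so the
# maximum likelihood estimates are finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1, n_passes = fit_logistic(xs, ys)
print(b0, b1, n_passes)
```

Each trip through the inner loop is one pass over the data. The sums can be computed separately on each partition and then added together, which is what makes the job easy to spread across cores, and why keeping the data in memory between passes pays off.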

(Viya’s) really our third generation of massively parallel computing. We’ve been working on this problem for seven years, and this is our third major crack at doing it, and this time we’ve got everything figured out.

In 2018 he’ll be talking about a fourth crack in nine years.

It’s possible that Viya works better than SAS’ previous cracks at high-performance analytics. That is a weak hurdle, however; SAS needs to demonstrate that its high-cost proprietary distributed framework is better than Apache Spark, which is rapidly emerging as the standard enterprise platform for Big Data.

While SAS supports machine learning techniques in several different products, it lags in deep learning. The SAS Marketing team created some helpful content about deep learning, but look carefully at that page — you won’t find an actual product for deep learning. Yes, I know that SAS Enterprise Miner supports multilayer perceptrons; but SAS does not support GPUs, Xeon Phi, Intel Nervana or any other high-performance architecture that will make it possible for you to train a deep neural net while you’re young.

If you think that an eighteen-year-old product running on one server is sufficient for your deep learning project, you should definitely talk to SAS. Keep in mind, though, that there is a reason that NVIDIA’s DGX-1 GPU-accelerated deep learning box has the power of 250 conventional servers: you actually need that kind of horsepower.

The rest of SAS’ business seems to be chugging along well enough. A combination of renewals, upgrades and upsells in existing accounts should produce low single-digit revenue growth for 2016, which is not a bad track record when you consider the declines reported by IBM, Oracle, and Teradata.

Business Analytics Leaders

The five companies in this group sell at least a billion dollars a year in business analytics software, according to IDC’s most recent worldwide software market share report. However, most of their revenue comes from data warehousing and business intelligence software; they all trail SAS in predictive analytics revenue.

Software licensing revenue is a misleading measure, however, due to the growing presence of open source software. IBM, Microsoft, and Oracle, for example, actively use open source machine learning software to extend the reach of their data warehousing and business intelligence platforms, where they all have strong entries. IBM uses Spark as a foundation for many of its products; Microsoft has integrated R with SQL Server and PowerBI, and actively promotes the use of R for its enterprise customers. Oracle has taken a similar approach.


IBM

Unlike SAS, declining tech giant IBM never invested in a proprietary distributed framework for SPSS, its flagship software for advanced analytics. Instead, the company chose to leverage in-database engines (DB2, Netezza, and Oracle) and open source frameworks (MapReduce and Spark).

IBM contributes to Apache Spark, which it uses in several products, and also to Apache SystemML. IBM Research developed the core of SystemML, which IBM donated to Apache in 2015. IBM has also visibly contributed to the Spark community through its efforts in education and training.

In 2016, IBM continued to market SPSS Statistics and SPSS Modeler, software brands it acquired in 2007. Release 18 of SPSS Modeler, announced in March, includes such things as support for machine learning in DB2 and support for IBM’s General Parallel File System (GPFS) in BigInsights. There aren’t too many data scientists who care about such things, but they appeal to the 150 or so enterprises with CIOs who still believe that nobody ever got fired for buying IBM.

In Part One of this review, I covered IBM’s machine learning moves in IBM Cloud, which I would characterize as Shakespearean, as in Much Ado About Nothing.


Microsoft

Microsoft had quite a year in machine learning and deep learning. As I noted in Parts One and Two, in 2016 MSFT launched cognitive APIs in Azure for vision, speech, language, knowledge, and search; a managed service for Spark in Azure HDInsight; enhancements to Azure Machine Learning; and Version 2.0 of its deep learning framework, rebranded as Microsoft Cognitive Toolkit.

That’s just for starters.

In January, Microsoft announced Microsoft R Server, a rebranding of the product it acquired with Revolution Analytics in 2015. Microsoft R Server includes an enhanced R distribution, a scalable back-end, and integration tools. During the year, Microsoft delivered two major releases for R Server. In Release 8, the company added push-down integration with Spark. Release 9 updated the Spark integration for Spark 2.0, and added MicrosoftML, a new R package for machine learning.

Microsoft announced SQL Server 2016 in March with embedded SQL Server R Services. On the Revolutions blog, David Smith reports on the launch. Tomaž Kaštrun explains what you can do with R services in SQL Server.

In November, after an extended preview, Microsoft announced the general availability of R Server for Azure HDInsight, a scale-out implementation of R integrated with Spark clusters created from HDInsight.

Also in Azure, Microsoft added a Linux version of the Data Science Virtual Machine (DSVM). Previously available as a Windows instance, DSVM includes Revolution R Open, Anaconda, Visual Studio Community Edition, PowerBI Desktop, SQL Server Express and the Azure SDK.

PowerBI, Microsoft’s powerful visualization tool, added R support in August. In ComputerWorld, Sharon Machlis, an R user, enthused. More here, on the Revolutions blog.

R Tools for Visual Studio launched to public preview in March, and to general availability in September. Also in September, Microsoft released the Microsoft R Client, a free data science tool that works with Microsoft R Open and the ScaleR distributed back end.

Microsoft data scientists Gopi Krishna Kumar, Hang Zhang and Jacob Spoelstra developed a methodology for data science, which they presented at the Microsoft Machine Learning and Data Science Summit 2016 in September. David Smith reports. The method, which the authors call Team Data Science Process, includes a standard directory structure for managing project artifacts using a system such as Git. It also includes open source utilities to support the process.

Other than that, it was a quiet year in Redmond.


Oracle

Oracle has a surprisingly robust set of machine learning tools that appeal to Oracle-centric organizations. They include:

Oracle Data Mining (ODM), a suite of machine learning algorithms that run as native SQL functions in Oracle Database.

Oracle Data Miner, a client application for ODM with a business user interface.

Oracle R Distribution (ORD), an enhanced free R distribution.

Oracle R Enterprise (ORE), Oracle R Distribution packaged with tools to integrate R with Oracle Database.

Oracle R Advanced Analytics for Hadoop (ORAAH), a set of R bindings with native algorithms and an interface to Spark.

Oracle claims that ORAAH’s native algorithms are faster than Spark, but ORAAH has only two algorithms, so nobody cares. Oracle OEMs Cloudera, so the Spark release is at least one major release behind the rest of the world.

Other than some dot releases for the components cited above, I don’t see a lot of movement for Oracle in 2016.


SAP

SAP introduced an update to its predictive analytics capabilities, now branded as SAP Business Objects Predictive Analytics 3.0. This product includes two separate automation capabilities, one branded as Predictive Factory, the second as HANA Automated Predictive Library. Predictive Factory, like SAS Factory Miner, is a scripting tool that enables a data scientist to create a modeling pipeline and schedule it for execution; it does not automate the data science process itself.

HANA Automated Predictive Library is a set of functional calls that users can include in SQL scripts. It’s a product that might appeal to SAP HANA bigots and nobody else.

SAP acquired KXEN and its InfiniteInsight software in 2014. Customer satisfaction promptly dropped through the floor, and SAP trails all other advanced analytics vendors rated in a Gartner survey. Legacy InfiniteInsight customers fall into two camps: (a) those whose IT organizations are heavily invested in SAP, and (b) everyone else. The former seem to be sticking with the software as SAP integrates it into its product line; the latter are heading for the exits.


Teradata

Declining data warehouse vendor Teradata thinks of itself as an analytics powerhouse. In reality, most of its revenue comes from data warehousing, where the company gets high marks from analysts like Gartner.

You could say that Teradata has a commanding position at the bottom of the analytics stack.

Teradata’s executive leadership — if you can call it that — completely missed the implications of Hadoop and cloud computing. Instead, they bet that the Teradata brand was beloved by IT executives, who would keep on buying boxes in bulk. As a result of that blinkered view of the world, the company today is worth a third of what it was worth five years ago. Its product sales have declined for ten straight quarters, seven in a row at double digits.

After a dismal first quarter, Teradata’s board accepted the resignation of CEO Mike Koehler; longtime board member Victor Lund stepped into the breach. In September, at the Teradata Partners conference, Lund announced that Teradata would reposition itself as an “analytics solutions” firm.

That may not sit well with SAS, Teradata’s primary partner for advanced analytics software, which also views itself as an “analytic solutions” firm. The difference, of course, is that SAS has been delivering solutions for a long time and has street cred with executives because it actually has sophisticated business solutions, with actual software and intellectual property, while Teradata appears to have little more than big ideas and PowerPoint.

Pro tip for Teradata management: just because you want to move up the value chain does not mean that you have the ability to do so.

In other developments, the company announced that Aster finally supports Spark, two years after anyone might have cared. Teradata also announced that Aster’s analytics are now available for deployment in Hadoop. Aster on Hadoop is a bladeless knife without a handle — a commercial machine learning library that competes with umpteen open source libraries. Aster also competes with another Teradata partner, Fuzzy Logix, whose DB Lytix library is six times richer and more mature.

If someone proposes to bet that “solutions” and unbundled Aster will reverse Teradata’s decline, take the under.

Other Tech Giants

We mention two remaining giants, Dell and HPE, only to note their passing from the scene.


HPE announced the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also granted equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets. The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for firing people, or rather cutting costs, so if you’re working for Haven or Vertica, this may be a good time to dust off your resume.
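For readers who want to check the arithmetic, here is a back-of-envelope sketch in Python. It uses only the figures quoted above; the implied revenue is derived from the 2.4x multiple rather than taken from any filing, and "almost $20 billion" is rounded to 20 as an assumption.

```python
# Back-of-envelope check of the deal figures quoted above (all in $ billions).
cash = 2.5               # cash paid by Micro Focus
equity = 6.3             # soft valuation of equity granted to HPE shareholders
revenue_multiple = 2.4   # "about 2.4 times revenue"
acquisition_cost = 20.0  # assumption: "almost $20 billion" rounded to 20

deal_value = cash + equity
implied_revenue = deal_value / revenue_multiple   # derived, not a reported figure
recovered_share = deal_value / acquisition_cost

print(f"Deal value:      ${deal_value:.1f}B")
print(f"Implied revenue: ${implied_revenue:.2f}B")
print(f"Share recovered: {recovered_share:.0%}")
```

On these numbers, HPE shareholders get back well under half of what the company spent assembling the assets.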

In March, HPE announced Haven OnDemand, available on Microsoft Azure. Haven is a loose bundle of software assets salvaged from the train wrecks of Autonomy, Vertica, ArcSight and the HP Operations Management machine learning suite, initially branded as HAVEn and announced by HP in June 2013. In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So the March announcement is a re-re-release of the software.

Three years into its product life cycle, Haven hasn’t exactly caught on with data scientists. Just 2 out of 2,895 respondents to the KDnuggets 2016 Data Science Software Usage poll and none in the O’Reilly 2016 Data Science Salary Survey said they use the software. Adding insult to injury, Haven failed to make KDnuggets’ list of the top 50 machine learning APIs, a list that includes the likes of Ersatz, Hutoma, and Skyttle.

Vertica still has some traction with data lovers whose analysis needs are simple enough to satisfy with SQL. Currently, it’s the 28th most popular relational database, according to DB-Engines, which is about on par with Netezza and Greenplum and a lot better than Aster. Expect this ranking to drop like a stone in the hands of Micro Focus.


Dell entered the advanced analytics business by acquiring StatSoft in 2014, a move that impressed nobody. In 2016, Dell exited by selling its software division to private equity investors.

Goodbye, Dell. We hardly knew ye.

The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is more agile when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. R ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDnuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles of a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.


Python

Among data scientists surveyed in the 2016 KDnuggets poll, 46% said they had used Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Package Index (PyPI). Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDnuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.
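A toy illustration of the steady-state argument, as a minimal sketch rather than anything Flink-specific: the process below (y = 2 + 3x plus noise) is a made-up example, and the names fit_batch and StreamingFit are hypothetical. For a simple linear model, a fit updated event by event lands on the same coefficients as a batch fit over the same data, and a model trained once on the older half of the events is already close to both.

```python
import random


def fit_batch(xs, ys):
    """Least-squares fit of y = intercept + slope * x over a full batch."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


class StreamingFit:
    """The same fit, updated one event at a time from running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        slope = (self.n * self.sxy - self.sx * self.sy) / (
            self.n * self.sxx - self.sx ** 2
        )
        return (self.sy - slope * self.sx) / self.n, slope


# Hypothetical steady-state process: y = 2 + 3x plus noise, unchanged over time.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

early_fit = fit_batch(xs[:2500], ys[:2500])  # trained once on older events
full_fit = fit_batch(xs, ys)                 # retrained on everything

stream = StreamingFit()
for x, y in zip(xs, ys):
    stream.update(x, y)
stream_fit = stream.coefficients()

print("early batch:", early_fit)
print("full batch: ", full_fit)
print("streaming:  ", stream_fit)
```

If the later events came from a different process, the early and streaming fits would diverge, which is exactly the case where neither model should be trusted to predict what comes next.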

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which raises the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises, if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.


H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.


XGBoost

A project of the University of Washington’s Distributed Machine Learning Community (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.


Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under the GPL license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.


TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.


Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like Microsoft Cognitive Toolkit (formerly CNTK) and TensorFlow, Theano represents neural networks as a symbolic graph.
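For readers who haven’t met the idea, here is a minimal sketch of what "symbolic graph" means, in plain Python. This is not Theano’s API; the Node, Var, Const, Add and Mul classes are hypothetical names invented for illustration. The point is that building an expression only records a graph; values and derivatives are computed later by walking it.

```python
class Node:
    """A node in a symbolic expression graph; nothing runs until eval()."""

    def __add__(self, other):
        return Add(self, wrap(other))

    def __radd__(self, other):
        return Add(wrap(other), self)

    def __mul__(self, other):
        return Mul(self, wrap(other))

    def __rmul__(self, other):
        return Mul(wrap(other), self)


class Var(Node):
    def __init__(self, name):
        self.name = name

    def eval(self, env):
        return env[self.name]

    def grad(self, wrt):
        return Const(1.0 if self is wrt else 0.0)


class Const(Node):
    def __init__(self, value):
        self.value = value

    def eval(self, env):
        return self.value

    def grad(self, wrt):
        return Const(0.0)


class Add(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

    def grad(self, wrt):
        return self.a.grad(wrt) + self.b.grad(wrt)


class Mul(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

    def grad(self, wrt):
        # Product rule, itself expressed as a new graph.
        return self.a.grad(wrt) * self.b + self.a * self.b.grad(wrt)


def wrap(value):
    """Turn plain numbers into constant nodes."""
    return value if isinstance(value, Node) else Const(float(value))


# Define a one-neuron linear model symbolically: y = w * x + b.
w, x, b = Var("w"), Var("x"), Var("b")
y = w * x + b
dy_dw = y.grad(w)  # another graph, not a number

env = {"w": 2, "x": 3, "b": 1}
print(y.eval(env), dy_dw.eval(env))
```

Real frameworks add the parts that matter in practice, such as graph optimization and compilation for GPUs, but the define-first, evaluate-later structure is the same.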

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.


MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Community (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, MATLAB, and JavaScript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.


Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.
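To make the abstraction concrete, here is a toy sketch in plain Python. It is not Keras code: LoopBackend, FsumBackend, Dense and run are hypothetical names, and the two "back ends" are deliberately trivial. The model is described once, in terms of operations a back end must supply, and any back end that supplies them can run it.

```python
import math


class LoopBackend:
    """Hypothetical back end: plain Python loops."""

    def dot(self, inputs, weights):
        total = 0.0
        for value, weight in zip(inputs, weights):
            total += value * weight
        return total

    def relu(self, value):
        return value if value > 0 else 0.0


class FsumBackend:
    """Hypothetical back end: same operations, different implementation."""

    def dot(self, inputs, weights):
        return math.fsum(v * w for v, w in zip(inputs, weights))

    def relu(self, value):
        return max(value, 0.0)


class Dense:
    """A layer defined only in terms of back-end operations."""

    def __init__(self, weights, bias):
        self.weights = weights  # one weight vector per output unit
        self.bias = bias

    def __call__(self, backend, inputs):
        return [
            backend.relu(backend.dot(inputs, unit_weights) + unit_bias)
            for unit_weights, unit_bias in zip(self.weights, self.bias)
        ]


def run(model, backend, inputs):
    """Push inputs through every layer using the chosen back end."""
    for layer in model:
        inputs = layer(backend, inputs)
    return inputs


# The architecture is written once...
model = [
    Dense(weights=[[1.0, -1.0], [0.5, 0.5]], bias=[0.0, 0.1]),
    Dense(weights=[[1.0, 2.0]], bias=[-0.5]),
]

# ...and runs unchanged on either back end.
out_loop = run(model, LoopBackend(), [2.0, 1.0])
out_fsum = run(model, FsumBackend(), [2.0, 1.0])
print(out_loop, out_fsum)
```

Swapping the back end touches one argument, not the model definition, which is the property that made the move from Theano to TensorFlow cheap.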


Deeplearning4j

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. It is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.


Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC), is a deep learning framework released under an open source BSD license. Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++. Users interact with Caffe through a Python API or through a command line interface. Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.