The Year in Machine Learning (Part Four)

This is the fourth installment in a four-part review of 2016 in machine learning and deep learning.

— Part One covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms.

— Part Two surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

— Part Three reviewed the machine learning and deep learning initiatives of Big Tech Brands, industry leaders with significant budgets for software development and marketing.

In Part Four, I profile eleven startups in the machine learning and deep learning space. A search for “machine learning” in Crunchbase yields 2,264 companies. This includes companies, such as MemSQL, who offer absolutely no machine learning capability but hype it anyway because Marketing; it also includes application software and service providers, such as Zebra Medical Imaging, who build machine learning into the services they provide.

All of the companies profiled in this post provide machine learning tools as software or services for data scientists or for business users. Within that broad definition, the firms are highly diverse:

Continuum Analytics, Databricks, and H2O.ai drive open source projects (Anaconda, Apache Spark, and H2O, respectively) and deliver commercial support.

Alpine Data, Dataiku, and Domino Data Lab offer commercially licensed collaboration tools for data science teams. All three run on top of an open source platform.

KNIME and RapidMiner originated in Europe, where they have large user communities. Both combine a business user interface with the ability to work with Big Data platforms.

Fuzzy Logix and Skytree provide specialized capabilities primarily for data scientists.

DataRobot delivers a fully automated workflow for predictive analytics that appeals to data scientists and business users. It runs on an open source platform.

Four companies deserve an “honorable mention” but I haven’t profiled them in depth:

— Two startups, BigML and SkyMind, are still in seed funding stage. I don’t profile them below, but they are worth watching. BigML is a cloud-based machine learning service; SkyMind drives the DL4J open source project for deep learning.

— Two additional companies aren’t startups because they’ve been in business for more than thirty years. Salford Systems developed the original software for CART and Random Forests; the company has added more techniques to its suite over time and has a loyal following. Statistica, recently jettisoned by Dell, delivers a statistical package with broad capabilities; the company consistently performs well in user satisfaction surveys.

I’d like to take a moment to thank those who contributed tips and ideas for this series, including Sri Ambati, Betty Candel, Leslie Miller, Bob Muenchen, Thomas Ott, Peter Prettenhofer, Jesus Puente, Dan Putler, David Smith, and Oliver Vagner.

Alpine Data

In 2016, the company formerly known as Alpine Data Labs changed its name and CEO. Alpine dropped the “Labs” from its brand — I guess they didn’t want to be confused with companies that test stool samples — so now it’s just Alpine Data. And, ex-CEO Joe Otto is now an “Advisor,” replaced by Dan Udoutch, a “seasoned executive” with 30+ years of experience in business and zero years of experience in machine learning or advanced analytics. The company also dropped its CFO and head of Sales during the year, presumably because the investors were extremely happy with Alpine’s business results.

Originally built to run in Greenplum database, the company ported some of its algorithms to MapReduce in early 2013. Riding a wave of Hadoop buzz, Alpine closed on a venture round in November 2013, just in time for everyone to realize that MapReduce sucks for machine learning. The company quickly turned to Spark — Databricks certified Alpine on Spark in 2014 — and has gradually ported its analytics operators to the new framework.

screen-shot-2016-12-08-at-3-17-32-pm

It seems that rebuilding on Spark has been a bit of a slog because Alpine hasn’t raised a fresh round of capital since 2013. As a general rule, startups that make their numbers get fresh rounds every 12-24 months; companies that don’t get fresh funding likely aren’t making their numbers. Investors aren’t stupid and, like the dog that did not bark, a venture capital round that does not happen says a lot about a company’s prospects.

In product news, the company announced Chorus 6, a major release, in May, and Chorus 6.1 in September. Enhancements in the new releases include:

— Integration with Jupyter notebooks.

— Additional machine learning operators.

— Spark auto-tuning. Chorus pushes processing to Spark, and Alpine has developed an optimizer to tune the generated Spark code.

PFA support for model export. This is excellent, a cutting edge feature.

— Runtime performance improvements.

— Tweaks to the user experience.

Lawrence Spracklen, Alpine’s VP of Engineering, will speak about Spark auto-tuning at the Spark Summit East in Boston.

Prospective users and customers should look for evidence that Alpine is a viable company, such as a new funding round, or audited financials that show positive cash flow.

Continuum Analytics

Continuum Analytics develops and supports Anaconda, an open source Python distribution for data science. The core Anaconda bundle includes Navigator, a desktop GUI that manages applications, packages, environments and channels; 150 Python packages that are widely used in data science; and performance optimizations. Continuum also offers commercially licensed extensions to Anaconda for scalability, high performance and ease of use.

fusion

Anaconda 2.5, announced in February, introduced performance optimization with the Intel® Math Kernel Library. Beginning with this release, Continuum bundled Anaconda with Microsoft R Open, an enhanced free R distribution.

In 2016, Continuum introduced two major additions to the Anaconda platform:

Anaconda Enterprise Notebooks, an enhanced version of Jupyter notebooks

Anaconda Mosaic, a tool for cataloging heterogeneous data

The company also announced partnerships with Cloudera, Intel, and IBM. In September, Continuum disclosed $4 million in equity financing. The company was surprisingly quiet about the round — there was no press release — possibly because it was undersubscribed.

Continuum’s AnacondaCon 2017 conference meets in Austin February 7-9.

Databricks

Databricks leads the development of Apache Spark (profiled in Part Two of this review) and offers a cloud-based managed service built on Spark. The company also offers training, certification, and organizes the Spark Summits.

The team that originally developed Spark founded Databricks in 2013. Company employees continue to play a key role in Apache Spark, holding a plurality of the seats on the Project Management Committee and contributing more new code to the project than any other company.

visualizations-in-databricks

In 2016, Databricks added a dashboarding tool and a RESTful interface for job and cluster management to its core managed service. The company made major enhancements to the Databricks security framework, completed SOC 2 Type 1 certification for enterprise security, announced HIPAA compliance and availability in Amazon Web Services’ GovCloud for sensitive data and regulated workloads.

Databricks also launched a free Community edition; a five-part series of free MOOCs; completed its annual survey of the Spark user community, and organized three Spark Summits.

In December, Databricks announced a $60 million “C” round of venture capital. New Enterprise Associates led the round; Andreessen Horowitz participated.

Dataiku

Dataiku develops and markets Data Science Studio (DSS), a workflow and collaboration environment for machine learning and advanced analytics. Users interact with the software through a drag-and-drop interface; DSS pushes processing down to Hadoop and Spark. The product includes connectors to a wide variety of file systems, SQL platforms, cloud data stores and NoSQL databases.

dataiku

In 2016, Dataiku delivered Releases 3.0 and 3.1. Major new capabilities include H2O integration (through Sparkling Water); additional data sources (IBM Netezza, SAP HANA, Google BigQuery, and Microsoft Azure Data Warehouse); added support for Spark MLLib algorithms; performance improvements, and many other enhancements.

In October, Dataiku closed on a $14 million “A” round of venture capital. FirstMark Capital led the financing, with participation from Serena Capital.

DataRobot

DataRobot, a Boston-based startup founded by insurance industry veterans, offers an automated machine learning platform that combines built-in expertise with a test-and-learn approach.  Leveraging an open source back end, the company’s eponymous software searches through combinations of algorithms, pre-processing steps, features, transformations and tuning parameters to identify the best model for a particular problem.

cugrnjwxeaaking

The company has a team of Kaggle-winning data scientists and leverages this expertise to identify new machine learning algorithms, feature engineering techniques, and optimization methods. In 2016, DataRobot added several new capabilities to its product, including support for Hadoop deployment, deep learning with TensorFlow, reason codes that explain prediction, feature impact analysis, and additional capabilities for model deployment.

DataRobot also announced major alliances with Alteryx and Cloudera. Cloudera awarded the company its top-level certification: the software integrates with Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels.

Earlier in the year, DataRobot closed on $33 million in Series B financing. New Enterprise Associates led the round; Accomplice, Intel Capital, IA Ventures, Recruit Strategic Partners, and New York Life also participated.

Domino Data Lab

Domino Data Lab offers the Domino Data Science Platform (DDSP) a scalable collaboration environment that runs on-premises, in virtual private clouds or hosted on Domino’s AWS infrastructure.

collab-screen

DDSP provides data scientists with a shared environment for managing projects, scalable computing with a variety of open source and commercially licensed software, job scheduling and tracking, and publication through Shiny and Flask. Domino supports rollbacks, revision history, version control, and reproducibility.

In November, Domino announced that it closed a $10.5 million “A” round led by Sequoia Capital. Bloomberg Beta, In-Q-Tel, and Zetta Venture Partners also participated.

Fuzzy Logix

Fuzzy Logix markets DB Lytix, a library of more than eight hundred functions for machine learning and advanced analytics.  Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database) and in Hadoop through Hive.

Users invoke DB Lytix functions from SQL, R, through BI tools or from custom web interfaces.  Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis.  All functions support native in-database scoring.  The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

In April, the company announced the availability of DB Lytix on Teradata Aster Analytics, a development that excited all three of the people who think Aster has legs.

H2O.ai

H2O.ai develops and supports H2O, the open source machine learning project I profiled in Part Two of this review. As I noted in Part Two, H2O.ai updated Sparkling Water, its Spark integration for Spark 2.0; released Steam, a model deployment framework, to production, and previewed Deep Water, an interface to GPU-accelerated back ends for deep learning.

maxresdefault

In 2016, H2O.ai added 3,200 enterprise organizations and over 43,000 users to its roster, bringing its open source community to over 8,000 enterprises and nearly 70,000 users worldwide. In the annual KDnuggets poll of data scientists, reported usage tripled. New customers include Kaiser Permanente, Progressive, Comcast, HCA, McKesson, Macy’s, and eBay.

KNIME

KNIME.com AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a GPL license with an exception permitting third parties to use the API for proprietary extensions. The KNIME Analytics Platform features a graphical user interface with a workflow metaphor.  Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch.

knime_screenshot

KNIME offers commercially licensed extensions for scalability, integration with data platforms, collaboration, and productivity. The company provides technical support for the extension software.

During the year, KNIME delivered two dot releases and three maintenance releases. The new features added to the open source edition in Releases 3.2 and 3.3 include Workflow Coach, a recommender based on community usage statistics; streaming execution; feature selection; ensembles of trees and gradient boosted trees; deep learning with DL4J, and many other enhancements. In June, KNIME launched the KNIME Cloud Analytics Platform on Microsoft Azure.

KNIME held its first Summit in the United States in September and announced the availability of an online training course available through O’Reilly Media.

RapidMiner

RapidMiner, Inc. of Cambridge, Massachusetts, develops and supports RapidMiner, an easy-to-use package for business analysis, predictive analytics, and optimization. The company launched in 2006 (under the corporate name of Rapid-I) to drive development, support, and distribution for the RapidMiner software project. The company moved its headquarters to the United States in 2013.

rm7_process

The desktop version of the software, branded as RapidMiner Studio, is available in free and commercially licensed editions.  RapidMiner also offers a commercially licensed Server edition, and Radoop, an extension that pushes processing down to Hive, Pig, Spark, and H2O.

RapidMiner introduced Release 7.x in 2016 with an updated user interface. Other enhancements in Releases 7.0 through 7.3 include a new data import facility, Tableau integration, parallel cross-validation, and H2O integration (featuring deep learning, gradient boosted trees and generalized linear models).

The company also introduced a feature called Single Process Pushdown. This capability enables RapidMiner users to supplement native Spark and H2O algorithms with RapidMiner pipelines for execution in Hadoop. RapidMiner supports Spark 2.0 as of Release 7.3.

In January 2016, RapidMiner closed a $16 million equity round led by Nokia Growth Partners. Ascent Venture Partners, Earlybird Venture Capital, Longworth Venture Partners, and OpenOcean also participated.

Skytree

Skytree Inc. develops and markets an eponymous commercially licensed software package for machine learning. Its founders launched the venture in 2012 to monetize an academic machine learning project (Georgia Tech’s FastLab).

figure_09a_tuning_results_chart_9_way_grid

The company landed an $18 million venture capital round in 2013 and hasn’t secured any new funding since then. (Read my comments under Alpine Data to see what that indicates.) Moreover, the underlying set of algorithms does not seem to have changed much since then, though Skytree has added and dropped several different add-ons and wrappers.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs or a browser-based GUI. Output includes explanations of the model in plain English. Skytree has a grid search feature for parameterization, which it trademarks as AutoModel, labels as “ground-breaking” and is attempting to patent. Analysts who don’t know anything about grid search think this is amazing.

In 2016, Skytree introduced a freemium edition, branded as Skytree Express. Hold out another six months and they’ll pay you to try it.

As is the case with Alpine Data, if you like Skytree’s technology wait for another funding round, or ask the company to provide evidence of positive cash flow.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.   You can get a free copy here from RapidMiner (registration required.)  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, H2O.ai, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model, like H2O.ai.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

— Accenture makes the analysis this year, according to Gartner, because it acquired Milan-based i4C Analytics, a tiny little privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing;  hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

Alpine Data Labs declines a lot in “Ability to Deliver,” which makes sense since they appear to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum Pivotal and EMC Dell, and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users,” which is spot on — anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’ scores are up for customer satisfaction and delivering business value, which suggests that whoever it is at Gartner that decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

— Dell: Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

— KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analytic Services are less than thrilled with that product.

Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required. 

RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

SAS declines slightly in Ability to Deliver.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.