The Year in Machine Learning (Part Four)

This is the fourth installment in a four-part review of 2016 in machine learning and deep learning.

— Part One covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms.

— Part Two surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

— Part Three reviewed the machine learning and deep learning initiatives of Big Tech Brands, industry leaders with significant budgets for software development and marketing.

In Part Four, I profile eleven startups in the machine learning and deep learning space. A search for “machine learning” in Crunchbase yields 2,264 companies. This includes companies, such as MemSQL, who offer absolutely no machine learning capability but hype it anyway because Marketing; it also includes application software and service providers, such as Zebra Medical Imaging, who build machine learning into the services they provide.

All of the companies profiled in this post provide machine learning tools as software or services for data scientists or for business users. Within that broad definition, the firms are highly diverse:

Continuum Analytics, Databricks, and H2O.ai drive open source projects (Anaconda, Apache Spark, and H2O, respectively) and deliver commercial support.

Alpine Data, Dataiku, and Domino Data Lab offer commercially licensed collaboration tools for data science teams. All three run on top of an open source platform.

KNIME and RapidMiner originated in Europe, where they have large user communities. Both combine a business user interface with the ability to work with Big Data platforms.

Fuzzy Logix and Skytree provide specialized capabilities primarily for data scientists.

DataRobot delivers a fully automated workflow for predictive analytics that appeals to data scientists and business users. It runs on an open source platform.

Four companies deserve an “honorable mention” but I haven’t profiled them in depth:

— Two startups, BigML and SkyMind, are still in seed funding stage. I don’t profile them below, but they are worth watching. BigML is a cloud-based machine learning service; SkyMind drives the DL4J open source project for deep learning.

— Two additional companies aren’t startups because they’ve been in business for more than thirty years. Salford Systems developed the original software for CART and Random Forests; the company has added more techniques to its suite over time and has a loyal following. Statistica, recently jettisoned by Dell, delivers a statistical package with broad capabilities; the company consistently performs well in user satisfaction surveys.

I’d like to take a moment to thank those who contributed tips and ideas for this series, including Sri Ambati, Betty Candel, Leslie Miller, Bob Muenchen, Thomas Ott, Peter Prettenhofer, Jesus Puente, Dan Putler, David Smith, and Oliver Vagner.

Alpine Data

In 2016, the company formerly known as Alpine Data Labs changed its name and CEO. Alpine dropped the “Labs” from its brand — I guess they didn’t want to be confused with companies that test stool samples — so now it’s just Alpine Data. And, ex-CEO Joe Otto is now an “Advisor,” replaced by Dan Udoutch, a “seasoned executive” with 30+ years of experience in business and zero years of experience in machine learning or advanced analytics. The company also dropped its CFO and head of Sales during the year, presumably because the investors were extremely happy with Alpine’s business results.

Originally built to run in Greenplum database, the company ported some of its algorithms to MapReduce in early 2013. Riding a wave of Hadoop buzz, Alpine closed on a venture round in November 2013, just in time for everyone to realize that MapReduce sucks for machine learning. The company quickly turned to Spark — Databricks certified Alpine on Spark in 2014 — and has gradually ported its analytics operators to the new framework.

screen-shot-2016-12-08-at-3-17-32-pm

It seems that rebuilding on Spark has been a bit of a slog because Alpine hasn’t raised a fresh round of capital since 2013. As a general rule, startups that make their numbers get fresh rounds every 12-24 months; companies that don’t get fresh funding likely aren’t making their numbers. Investors aren’t stupid and, like the dog that did not bark, a venture capital round that does not happen says a lot about a company’s prospects.

In product news, the company announced Chorus 6, a major release, in May, and Chorus 6.1 in September. Enhancements in the new releases include:

— Integration with Jupyter notebooks.

— Additional machine learning operators.

— Spark auto-tuning. Chorus pushes processing to Spark, and Alpine has developed an optimizer to tune the generated Spark code.

PFA support for model export. This is excellent, a cutting edge feature.

— Runtime performance improvements.

— Tweaks to the user experience.

Lawrence Spracklen, Alpine’s VP of Engineering, will speak about Spark auto-tuning at the Spark Summit East in Boston.

Prospective users and customers should look for evidence that Alpine is a viable company, such as a new funding round, or audited financials that show positive cash flow.

Continuum Analytics

Continuum Analytics develops and supports Anaconda, an open source Python distribution for data science. The core Anaconda bundle includes Navigator, a desktop GUI that manages applications, packages, environments and channels; 150 Python packages that are widely used in data science; and performance optimizations. Continuum also offers commercially licensed extensions to Anaconda for scalability, high performance and ease of use.

fusion

Anaconda 2.5, announced in February, introduced performance optimization with the Intel® Math Kernel Library. Beginning with this release, Continuum bundled Anaconda with Microsoft R Open, an enhanced free R distribution.

In 2016, Continuum introduced two major additions to the Anaconda platform:

Anaconda Enterprise Notebooks, an enhanced version of Jupyter notebooks

Anaconda Mosaic, a tool for cataloging heterogeneous data

The company also announced partnerships with Cloudera, Intel, and IBM. In September, Continuum disclosed $4 million in equity financing. The company was surprisingly quiet about the round — there was no press release — possibly because it was undersubscribed.

Continuum’s AnacondaCon 2017 conference meets in Austin February 7-9.

Databricks

Databricks leads the development of Apache Spark (profiled in Part Two of this review) and offers a cloud-based managed service built on Spark. The company also offers training, certification, and organizes the Spark Summits.

The team that originally developed Spark founded Databricks in 2013. Company employees continue to play a key role in Apache Spark, holding a plurality of the seats on the Project Management Committee and contributing more new code to the project than any other company.

visualizations-in-databricks

In 2016, Databricks added a dashboarding tool and a RESTful interface for job and cluster management to its core managed service. The company made major enhancements to the Databricks security framework, completed SOC 2 Type 1 certification for enterprise security, announced HIPAA compliance and availability in Amazon Web Services’ GovCloud for sensitive data and regulated workloads.

Databricks also launched a free Community edition; a five-part series of free MOOCs; completed its annual survey of the Spark user community, and organized three Spark Summits.

In December, Databricks announced a $60 million “C” round of venture capital. New Enterprise Associates led the round; Andreessen Horowitz participated.

Dataiku

Dataiku develops and markets Data Science Studio (DSS), a workflow and collaboration environment for machine learning and advanced analytics. Users interact with the software through a drag-and-drop interface; DSS pushes processing down to Hadoop and Spark. The product includes connectors to a wide variety of file systems, SQL platforms, cloud data stores and NoSQL databases.

dataiku

In 2016, Dataiku delivered Releases 3.0 and 3.1. Major new capabilities include H2O integration (through Sparkling Water); additional data sources (IBM Netezza, SAP HANA, Google BigQuery, and Microsoft Azure Data Warehouse); added support for Spark MLLib algorithms; performance improvements, and many other enhancements.

In October, Dataiku closed on a $14 million “A” round of venture capital. FirstMark Capital led the financing, with participation from Serena Capital.

DataRobot

DataRobot, a Boston-based startup founded by insurance industry veterans, offers an automated machine learning platform that combines built-in expertise with a test-and-learn approach.  Leveraging an open source back end, the company’s eponymous software searches through combinations of algorithms, pre-processing steps, features, transformations and tuning parameters to identify the best model for a particular problem.

cugrnjwxeaaking

The company has a team of Kaggle-winning data scientists and leverages this expertise to identify new machine learning algorithms, feature engineering techniques, and optimization methods. In 2016, DataRobot added several new capabilities to its product, including support for Hadoop deployment, deep learning with TensorFlow, reason codes that explain prediction, feature impact analysis, and additional capabilities for model deployment.

DataRobot also announced major alliances with Alteryx and Cloudera. Cloudera awarded the company its top-level certification: the software integrates with Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels.

Earlier in the year, DataRobot closed on $33 million in Series B financing. New Enterprise Associates led the round; Accomplice, Intel Capital, IA Ventures, Recruit Strategic Partners, and New York Life also participated.

Domino Data Lab

Domino Data Lab offers the Domino Data Science Platform (DDSP) a scalable collaboration environment that runs on-premises, in virtual private clouds or hosted on Domino’s AWS infrastructure.

collab-screen

DDSP provides data scientists with a shared environment for managing projects, scalable computing with a variety of open source and commercially licensed software, job scheduling and tracking, and publication through Shiny and Flask. Domino supports rollbacks, revision history, version control, and reproducibility.

In November, Domino announced that it closed a $10.5 million “A” round led by Sequoia Capital. Bloomberg Beta, In-Q-Tel, and Zetta Venture Partners also participated.

Fuzzy Logix

Fuzzy Logix markets DB Lytix, a library of more than eight hundred functions for machine learning and advanced analytics.  Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database) and in Hadoop through Hive.

Users invoke DB Lytix functions from SQL, R, through BI tools or from custom web interfaces.  Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis.  All functions support native in-database scoring.  The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

In April, the company announced the availability of DB Lytix on Teradata Aster Analytics, a development that excited all three of the people who think Aster has legs.

H2O.ai

H2O.ai develops and supports H2O, the open source machine learning project I profiled in Part Two of this review. As I noted in Part Two, H2O.ai updated Sparkling Water, its Spark integration for Spark 2.0; released Steam, a model deployment framework, to production, and previewed Deep Water, an interface to GPU-accelerated back ends for deep learning.

maxresdefault

In 2016, H2O.ai added 3,200 enterprise organizations and over 43,000 users to its roster, bringing its open source community to over 8,000 enterprises and nearly 70,000 users worldwide. In the annual KDnuggets poll of data scientists, reported usage tripled. New customers include Kaiser Permanente, Progressive, Comcast, HCA, McKesson, Macy’s, and eBay.

KNIME

KNIME.com AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a GPL license with an exception permitting third parties to use the API for proprietary extensions. The KNIME Analytics Platform features a graphical user interface with a workflow metaphor.  Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch.

knime_screenshot

KNIME offers commercially licensed extensions for scalability, integration with data platforms, collaboration, and productivity. The company provides technical support for the extension software.

During the year, KNIME delivered two dot releases and three maintenance releases. The new features added to the open source edition in Releases 3.2 and 3.3 include Workflow Coach, a recommender based on community usage statistics; streaming execution; feature selection; ensembles of trees and gradient boosted trees; deep learning with DL4J, and many other enhancements. In June, KNIME launched the KNIME Cloud Analytics Platform on Microsoft Azure.

KNIME held its first Summit in the United States in September and announced the availability of an online training course available through O’Reilly Media.

RapidMiner

RapidMiner, Inc. of Cambridge, Massachusetts, develops and supports RapidMiner, an easy-to-use package for business analysis, predictive analytics, and optimization. The company launched in 2006 (under the corporate name of Rapid-I) to drive development, support, and distribution for the RapidMiner software project. The company moved its headquarters to the United States in 2013.

rm7_process

The desktop version of the software, branded as RapidMiner Studio, is available in free and commercially licensed editions.  RapidMiner also offers a commercially licensed Server edition, and Radoop, an extension that pushes processing down to Hive, Pig, Spark, and H2O.

RapidMiner introduced Release 7.x in 2016 with an updated user interface. Other enhancements in Releases 7.0 through 7.3 include a new data import facility, Tableau integration, parallel cross-validation, and H2O integration (featuring deep learning, gradient boosted trees and generalized linear models).

The company also introduced a feature called Single Process Pushdown. This capability enables RapidMiner users to supplement native Spark and H2O algorithms with RapidMiner pipelines for execution in Hadoop. RapidMiner supports Spark 2.0 as of Release 7.3.

In January 2016, RapidMiner closed a $16 million equity round led by Nokia Growth Partners. Ascent Venture Partners, Earlybird Venture Capital, Longworth Venture Partners, and OpenOcean also participated.

Skytree

Skytree Inc. develops and markets an eponymous commercially licensed software package for machine learning. Its founders launched the venture in 2012 to monetize an academic machine learning project (Georgia Tech’s FastLab).

figure_09a_tuning_results_chart_9_way_grid

The company landed an $18 million venture capital round in 2013 and hasn’t secured any new funding since then. (Read my comments under Alpine Data to see what that indicates.) Moreover, the underlying set of algorithms does not seem to have changed much since then, though Skytree has added and dropped several different add-ons and wrappers.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs or a browser-based GUI. Output includes explanations of the model in plain English. Skytree has a grid search feature for parameterization, which it trademarks as AutoModel, labels as “ground-breaking” and is attempting to patent. Analysts who don’t know anything about grid search think this is amazing.

In 2016, Skytree introduced a freemium edition, branded as Skytree Express. Hold out another six months and they’ll pay you to try it.

As is the case with Alpine Data, if you like Skytree’s technology wait for another funding round, or ask the company to provide evidence of positive cash flow.

The Year in Machine Learning (Part Three)

This is the third installment in a four-part review of 2016 in machine learning and deep learning. In Part One, I covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms. In Part Two, I surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

In this installment, we will review the machine learning and deep learning initiatives of Big Tech Brands — industry leaders with big budgets for software development and marketing. Big Tech Brands fall into three groups:

— SAS is the software revenue leader in predictive analytics. It has a unique business model and falls into its own category.

— Companies such as IBM, Microsoft, Oracle, SAP, and Teradata have all have strong franchises in the data warehousing market, and all except Teradata offer widely used business intelligence software. These companies have the financial strength to develop, market and cross-sell machine learning software to their existing customer base, and can impact the market if they choose to do so.

Dell and HPE dabbled in advanced analytics and exited the market in 2016.

I covered Google and Amazon Web Services in Part One. Although neither company has a strong position in business analytics at present, they are making moves in that direction. Google set up Google Cloud Machine Learning as a distinct product group this year to service that market, and Amazon introduced QuickSight, a business analytics service.

Regular readers know that I favor open source software — as do most data scientists. Among the companies covered in this installment, IBM and Microsoft are making substantial commitments to the open source model, including direct contributions to open source software projects. They deserve kudos for that. Teradata is investing in Presto SQL, for which they get polite applause. Oracle and SAP leverage open source software in their solutions but make no significant contributions. SAS embraces open source the way a cat embraces a porcupine.

In Part Four, I will survey machine learning startups, and deliver results from the Bottom Story of the Year poll.

SAS

SAS leads the market in licensing revenue for advanced and predictive analytics software, according to IDC. The company has a loyal following among statisticians, actuaries, life scientists and others whose work depends on statistical analysis.

Partnering with IBM, SAS built its business in the 1970s on the strength of its software for the IBM System/360 mainframe. IBM promoted the software to its enterprise customers to increase adoption and use of its hardware. SAS software still runs on the mainframe, and the company continues to earn a significant share of its revenue on that platform. IBM has mainframe customers who use the big box exclusively for SAS.

In the 1990s, SAS successfully transitioned to a multi-vendor architecture and rebuilt its software to run on many different hardware platforms and operating systems. During this period, SAS established a reputation for industrial-strength and enterprise-grade software — in contrast to vendors like SPSS, who focused on building easy-to-use software for the desktop.

On the face of it, SAS has struggled to transition from server-based computing to the contemporary world of distributed architecture and cloud platforms. In the past ten years, the company has announced multiple initiatives to improve the performance and scalability of its products, with mixed success. In April, SAS announced Viya, its third attempt to deliver advanced analytics in a distributed MPP architecture.

What is SAS Viya? How does it differ from SAS’ previous attempts at high-performance design? Let’s peruse the brochure:

Cloud-ready, elastic and scalable

 

SAS Viya is built to be elastic and scalable for both private and public clouds. Analytical, in-memory computations are optimized for unconstrained environments, but they can also adjust for constrained environments. The elastic processing automatically adapts to needs and available resources – spinning up or winding down computing capacity as needed. Elastic scalability lets you quickly experiment with different scenarios and apply more complex approaches to larger amounts of streaming data.

Ahem. Any software is “cloud-ready,” in the sense that a Linux instance is a Linux instance whether it runs on-premises or in the cloud. And any software is elastic when you deploy it in a virtual appliance, such as an Amazon Machine Image. That includes SAS 9.4, which SAS touted as “cloud-ready” in 2014, and previous versions of SAS, which you could deploy in AWS even though SAS did not formally support the platform.

If you want to spin up software instances, however, you need software licenses. With open source software, such as Python, R, or Spark, that’s not an issue — you can spin up as many instances as you like without violating license agreements. Commercial software is more complicated since you need to pay for the licenses you want to spin up. Some vendors, like HPE and Teradata, tried to address this problem by marketing their own cloud platforms to compete with Amazon Web Services; they failed miserably. Others, like Oracle, partner with AWS to deliver their software in the cloud — either as a bundled managed service or on a “Bring Your Own License” (BYOL) model.

You can’t have elastic computing with commercial software without a flexible licensing model. Pay-for-what-you-use licensing poses a problem for vendors like SAS, because if customers only pay for what they use, they invariably pay a lot less than they do under term licensing. Most commercial software customers are over-licensed — they’re paying for a lot of software they don’t use. That is why revenue from on-premises software licensing is declining much faster than revenue from cloud-based subscriptions is rising. In the cloud, you can do more with less.

The bottom line is this: unless Viya is available under an elastic pricing model, nobody cares that it is “cloud-ready, elastic and scalable.”

If you want to have a little fun, the next time your SAS rep touts Viya’s elasticity, ask him what it will cost per hour to license the software. Watch him squirm.

Open analytics coding environment

 

Empower your data scientists with SAS Analytics that are easily available from a variety of programming languages. Whether it’s a Python notebook, Java client, Lua scripting interface or SAS, your modelers and data scientists can easily access the power of SAS for data manipulation, advanced analytics and analytical reporting.

We’ve all been waiting for the ability to run SAS from Lua.

Resilient architecture with guaranteed failover

 

For answers you depend on, you need analytical processing power you can count on. You need all your analytical computations to finish processing without interruption. The fault-tolerant design of SAS Viya automatically detects server failure, even in multiplatform processing environments, and redistributes processing as needed. It also manages several copies of data on the processing cluster. If a machine in the cluster becomes unavailable or fails, the required data is retrieved from another block to quickly continue processing. These self-healing mechanisms ensure high availability for uninterrupted processing and automated recovery.

“It runs on Hadoop.”

Interviewed in Forbes, SAS CEO Jim Goodnight speaks at length about Viya:

We are ready for big data…(we) just released our first version of our new Viya architecture, which is massively parallel computing where we spread the data out over dozens of servers and then use all the cores inside those servers to process the data in parallel. So we might have 500 cores working on the data all at once in parallel, and that allows it to handle some really, really big problems that we’ve never even thought of before. Things like logistic regression.

Someone should feed Dr. G. better talking points. Just for the record, commercially available software for logistic regression running in a massively parallel (MPP) environment first hit the market in 1989. Distributed logistic regression is currently available in multiple software packages, including one introduced by SAS five years ago.

Logistic regression (a non-linear model) is an iterative process. Essentially, you’re trying to estimate the parameters in the model, and so you take a guess, you’ve got to run through the data using that guess, then to refine it and do another guess and run through the data again, and you keep doing this over and over and over until the parameters converged or they don’t change much at all anymore. That can take 25 to 30 passes of the data. Now, in the old days, we used to have to read the data that many times. Now, it’s in memory. We put it in memory and it stays in memory. It’s spread out over 500 cores and then each one just does a little piece of the work, and so we can do those 25 iterations in just a few minutes, whereas it used to take hours.

It’s just like Spark, but with a license key.

(Viya’s) really our third generation of massively parallel computing. We’ve been working on this problem for seven years, and this is our third major crack at doing it, and this time we’ve got everything figured out.

In 2018 he’ll be talking about a fourth crack in nine years.

It’s possible that Viya works better than SAS’ previous cracks at high-performance analytics. That is a weak hurdle, however; SAS needs to demonstrate that its high-cost proprietary distributed framework is better than Apache Spark, which is rapidly emerging as the standard enterprise platform for Big Data.

While SAS supports machine learning techniques in several different products, it lags in deep learning. The SAS Marketing team created some helpful content about deep learning, but look carefully at that page — you won’t find an actual product for deep learning. Yes, I know that SAS Enterprise Miner supports multilayer perceptrons; but SAS does not support GPUs, Xeon Phi, Intel Nervana or any other high-performance architecture that will make it possible for you to train a deep neural net while you’re young.

If you think that an eighteen-year-old product running on one server is sufficient for your deep learning project, you should definitely talk to SAS. Keep in mind, though, that there is a reason that NVIDIA’s DGX-1 GPU-accelerated deep learning box has the power of 250 conventional servers: you actually need that kind of horsepower.

The rest of SAS’ business seems to be chugging along well enough. A combination of renewals, upgrades and upsells in existing accounts should produce low single-digit revenue growth for 2016, which is not a bad track record when you consider the declines reported by IBM, Oracle, and Teradata.

Business Analytics Leaders

The five companies in this group sell at least a billion dollars a year in business analytics software, according to IDC’s most recent worldwide software market share report. However, most of their revenue comes from data warehousing and business intelligence software; they all trail SAS in predictive analytics revenue.

Software licensing revenue is a misleading measure, however, due to the growing presence of open source software. IBM, Microsoft, and Oracle for example, actively use open source machine learning software to extend the reach of their data warehousing and business intelligence platforms, where they both have strong entries. IBM uses Spark as a foundation for many of its products; Microsoft has integrated R with SQL Server and PowerBI, and actively promotes the use of R for its enterprise customers. Oracle has taken a similar approach.

IBM

Unlike SAS, declining tech giant IBM never invested in a proprietary distributed framework for SPSS, its flagship software for advanced analytics. Instead, the company chose to leverage in-database engines (DB2, Netezza, and Oracle) and open source frameworks (MapReduce and Spark.)

IBM contributes to Apache Spark, which it uses in several products, and also to Apache SystemML. IBM Research developed the core of SystemML, which IBM donated to Apache in 2015. IBM has also visibly contributed to the Spark community through its efforts in education and training.

In 2016, IBM continued to market SPSS Statistics and SPSS Modeler, software brands it acquired in 2007. Release 18 of SPSS Modeler, announced in March, includes such things as support for machine learning in DB2 and support for IBM’s General Parallel File System (GPFS) in BigInsights. There aren’t too many data scientists who care about such things, but they appeal to the 150 or so enterprises with CIOs who still believe that nobody ever got fired for buying IBM.

In Part One of this review, I covered IBM’s machine learning moves in IBM Cloud, which I would characterize as Shakespearean, as in Much Ado About Nothing.

Microsoft

Microsoft had quite a year in machine learning and deep learning. As I noted in Parts One and Two, in 2016 MSFT launched cognitive APIs in Azure for vision, speech, language, knowledge, and search; a managed service for Spark in Azure HDInsight; enhancements to Azure Machine Learning and Version 2.0 of its deep learning framework, rebranded as Microsoft Cognitive Toolkit.

That’s just for starters.

In January, Microsoft announced Microsoft R Server, a rebranding of the product it acquired with Revolution Analytics in 2015. Microsoft R Server includes an enhanced R distribution, a scalable back-end, and integration tools. During the year, Microsoft two major releases for R Server. In Release 8, the company added push-down integration with Spark. Release 9 updated the Spark integration for Spark 2.0, and added MicrosoftML, a new R package for machine learning.

Microsoft announced SQL Server 2016 in March with embedded SQL Server R Services. On the Revolutions blog, David Smith reports on the launch. Tomaž Kaštrun explains what you can do with R services in SQL Server.

In November, after an extended preview, Microsoft announced the general availability of R Server for Azure HDInsight, a scale-out implementation of R integrated with Spark clusters created from HDInsight.

Also in Azure, Microsoft added a Linux version of the Data Science Virtual Machine (DSVM). Previously available as a Windows instance, DSVM includes Revolution R Open, Anaconda, Visual Studio Community Edition, PowerBI Desktop, SQL Server Express and the Azure SDK.

PowerBI, Microsoft’s powerful visualization tool, added R support in August. In ComputerWorld, Sharon Machlis, an R user, enthused. More here, on the Revolutions blog.

R Tools for Visual Studio launched to public preview in March, and to general availability in September. Also in September, Microsoft released the Microsoft R Client, a free data science tool that works with Microsoft R Open and the ScaleR distributed back end.

Microsoft data scientists Gopi Krishna Kumar, Hang Zhang and Jacob Spoelstra developed a methodology for data science, which they presented at the Microsoft Machine Learning and Data Science Summit 2016 in September. David Smith reports. The method, which the authors call Team Data Science Process, includes a standard directory structure for managing project artifacts using a system such as Git. It also includes open source utilities to support the process.

Other than that, it was a quiet year in Redmond.

Oracle

Oracle has a surprisingly robust set of machine learning tools that appeal to Oracle-centric organizations. They include:

Oracle Data Mining (ODM), a suite of machine learning algorithms that run as native SQL functions in Oracle Database.

Oracle Data Miner, a client application for ODM with a business user interface.

Oracle R Distribution (ORD), an enhanced free R distribution.

Oracle R Enterprise (ORE), Oracle R Distribution packaged with tools to integrate R with Oracle Database.

Oracle R Advanced Analytics for Hadoop (ORAAH), a set of R bindings with native algorithms and an interface to Spark.

Oracle claims that ORAAH’s native algorithms are faster than Spark, but ORAAH has only two algorithms, so nobody cares. Oracle OEMs Cloudera, so the Spark release is at least one major release behind the rest of the world.

Other than some dot releases for the components cited above, I don’t see a lot of movement for Oracle in 2016.

SAP

SAP introduced an update to its predictive analytics capabilities, now branded as SAP Business Objects Predictive Analytics 3.0. This product includes two separate automation capabilities, one branded as Predictive Factory, the second as HANA Automated Predictive Library. Predictive Factory, like SAS Factory Miner, is a scripting tool that enables a data scientist to create a modeling pipeline and schedules it for execution; it does not automate the data science process itself.  HANA Automated Predictive Library is a set of functional calls that users can include in SQL scripts.

HANA Automated Predictive Library is a set of functional calls that users can include in SQL scripts. It’s a product that might appeal to SAP HANA bigots and nobody else.

SAP acquired KXEN and its InfiniteInsight software in 2014. Customer satisfaction promptly dropped through the floor, and SAP trails all other advanced analytics vendors rated in a Gartner survey. Legacy InfiniteInsight customers fall into two camps: (a) those whose IT organizations are heavily invested in SAP, and (b) everyone else. The former seem to be sticking with the software as SAP integrates it into its product line; the latter are heading for the exits.

Teradata

Declining data warehouse vendor Teradata thinks of itself as an analytics powerhouse. In reality, most of its revenue comes from data warehousing, where the company gets high marks from analysts like Gartner.

You could say that Teradata has a commanding position at the bottom of the analytics stack.

Teradata’s executive leadership — if you can call it that — completely missed the implications of Hadoop and cloud computing. Instead, they bet that the Teradata brand was beloved by IT executives, who would keep on buying boxes in bulk. As a result of that blinkered view of the world, the company today is worth a third of what it was worth five years ago. Its product sales have declined for ten straight quarters, seven in a row at double digits.

After a dismal first quarter, Teradata’s board fired accepted the resignation of CEO Mike Koehler; longtime board member Victor Lund stepped into the breach. In September, at the Teradata Partners conference, Lund announced that Teradata would reposition itself as an “analytics solutions” firm.

That may not sit well with SAS, Teradata’s primary partner for advanced analytics software, which also views itself as an “analytic solutions” firm. The difference, of course, is that SAS has been delivering solutions for a long time and has street cred with executives because it actually has sophisticated business solutions, with actual software and intellectual property, while Teradata appears to have little more than big ideas and PowerPoint.

Pro tip for Teradata management: just because you want to move up the value chain does not mean that you have the ability to do so.

In other developments, the company announced that Aster finally supports Spark, two years after anyone might have cared. Teradata also announced that Aster’s analytics are now available for deployment in Hadoop. Aster on Hadoop is a bladeless knife without a handle — a commercial machine learning library that competes with umpteen open source libraries. Aster also competes with another Teradata partner, Fuzzy Logix, whose dbLytix library is six times richer and more mature.

If someone proposes to bet that “solutions” and unbundled Aster will reverse Teradata’s decline, take the under.

Other Tech Giants

We mention two remaining giants, Dell and HPE, only to note their passing from the scene.

HPE

HPE announced the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also granted equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets. The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for firing people cutting costs, so if you’re working for Haven or Vertica, this may be a good time to dust off your resume.

In March, HPE announced Haven OnDemand, available on Microsoft Azure. Haven is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, initially branded as HAVEn and announced by HP in June 2013.  In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So the March announcement is a re-re-release of the software.

Three years into its product life cycle, Haven hasn’t exactly caught on with data scientists. Just 2 out of 2,895 respondents to the KDnuggets 2016 Data Science Software Usage poll and none in the O’Reilly 2016 Data Science Salary Survey said they use the software. Adding insult to injury, Haven failed to make KDnuggets’ list of the top 50 machine learning APIs, a list that includes the likes of Ersatz, Hutoma, and Skyttle.

Vertica still has some traction with data lovers whose analysis needs are simple enough to satisfy with SQL. Currently, it’s the 28th most popular relational database, according to DB-Engines, which is about on par with Netezza and Greenplum and a lot better than Aster. Expect this ranking to drop like a stone in the hands of Micro Focus.

Dell/EMC

Dell entered the advanced analytics business by acquiring Statsoft in 2014, a move that impressed nobody. In 2016, Dell exited by selling its software division to private equity investors.

Goodbye, Dell. We hardly knew ye.

The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is agiler when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDNuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles to a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDNuggets poll, 46% said they use Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Packaging Index (PyPI.) Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDNuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which begs the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team has delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as like an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed Machine Learning Common (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under GPU license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Common (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and Javascript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. IT is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC) is a deep learning framework released under an open source BSD license.  Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++.  Users interact with Caffe through a Python API or through a command line interface.  Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

The Year in Machine Learning (Part One)

This is the first installment in a four-part review of 2016 in machine learning and deep learning.

In the first post, we look back at ML/DL news organized in five high-level topic areas:

  • Concerns about bias
  • Interpretable models
  • Deep learning accelerates
  • Supercomputing goes mainstream
  • Cloud platforms build ML/DL stacks

In Part Two, we cover developments in each of the leading open source machine learning and deep learning projects.

Parts Three and Four will review the machine learning and deep learning moves of commercial software vendors.

Concerns About Bias

As organizations expand the use of machine learning for profiling and automated decisions, there is growing concern about the potential for bias. In 2016, reports in the media documented racial bias in predictive models used for criminal sentencing, discriminatory pricing in automated auto insurance quotes, an image classifier that learned “whiteness” as an attribute of beauty, and hidden stereotypes in Google’s word2vec algorithm.

Two bestsellers were published in 2016 that address the issue. The first, Cathy O’Neil’s Weapons of Math Destruction, is a candidate for the National Book Award. In a review for The Wall Street Journal, Jo Craven McGinty summarizes O’Neil’s arguments as “algorithms aren’t biased, but the people who build them may be.”

A second book, Virtual Competition, written by Ariel Ezrachi and Maurice Stucke, focuses on the ways that machine learning and algorithmic decisions can promote price discrimination and collusion. Burton Malkiel notes in his review that the work “displays a deep understanding of the internet world and is outstandingly researched. The polymath authors illustrate their arguments with relevant case law as well as references to studies in economics and behavioral psychology.”

Most working data scientists are deeply concerned about bias in the work they do. Bias, after all, is a form of error, and a biased algorithm is an inaccurate algorithm. The organizations that employ data scientists, however, may not commit the resources needed for testing and validation, which is how we detect and correct bias. Moreover, people in business suits often exaggerate the accuracy and precision of predictive models or promote their use for inappropriate applications.

In Europe, GDPR creates an incentive for organizations that use machine learning to take the potential for bias more seriously. We’ll be hearing more about GDPR in 2017.

Interpretable Models

Speaking of GDPR, beginning in 2018, organizations that use machine learning to drive automated decisions must be prepared to explain those decisions to the affected subjects and to regulators. As a result, in 2016 we saw considerable interest in efforts to develop interpretable machine learning algorithms.

— The MIT Computer Science and Artificial Intelligence Laboratory announced progress in developing neural networks that deliver explanations for their predictions.

— At the International Joint Conference on Artificial Intelligence, David Gunning summarized work to date on explainability.

— MIT selected machine learning startup Rulex as a finalist in its Innovation Showcase. Rulex implements a technique called Switching Neural Networks to learn interpretable rule sets for classification and regression.

— In O’Reilly Radar, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin explained Local Interpretable Model-Agnostic Explanations (LIME), a technique that explains the predictions of any machine learning classifier.

The Wall Street Journal reported on an effort by Capital One to develop machine learning techniques that account for the reasoning behind their decisions.

In Nautilus, Aaron M. Bornstein asked: Is artificial intelligence permanently inscrutable?  There are several issues, including a lack of clarity about what “interpretability” means.

It is important to draw a distinction between “interpretability by inspection” versus “functional” interpretability. We do not evaluate an automobile by disassembling its engine and examining the parts; we get behind the wheel and take it for a drive. At some point, we’re all going to have to get behind the idea that you evaluate machine learning models by how they behave and not by examining their parts.

Deep Learning Accelerates

In a September Fortune article, Roger Parloff explains why deep learning is suddenly changing your life. Neural networks and deep learning are not new techniques; we see practical applications emerge now for three reasons:

— Computing power is cheap and getting cheaper; see the discussion below on supercomputing.

— Deep learning works well in “cognitive” applications, such as image classification, speech recognition, and language translation.

— Researchers are finding new ways to design and train deep learning models.

In 2016, the field of DL-driven cognitive applications reached new milestones:

— A Microsoft team developed a system that recognizes conversational speech as well as humans do. The team used convolutional and long short-term memory (LSTM) neural networks built with Microsoft Cognitive Toolkit (CNTK).

— On the Google Research Blog, a Google Brain team announced the launch of the Google Neural Machine Translation System, a system based on deep learning that is currently used for 18 million translations per day.

— In TechCrunch, Ken Weiner reported on advances in DL-driven image recognition and how they will transform business.

Venture capitalists aggressively funded startups that leverage deep learning in applications, especially those that can position themselves in the market for cognitive solutions:

Affectiva, which uses deep learning to read facial expressions in digital video, closed on a $14 million “D” round led by Fenox Venture Capital.

Clarifai, a startup that offers a DL-driven image and video recognition service, landed a $30 million Series B round led by Menlo Ventures.

Zebra Medical Vision, an Israeli startup, uses DL to examine medical images and diagnose diseases of the bones, brain, cardiovascular system, liver, and lungs. Zebra disclosed a $12 million venture round led by Intermountain Health.

There is an emerging ecosystem of startups that are building businesses on deep learning. Here are six examples:

Deep Genomics, based in Toronto, uses deep learning to understand diseases, disease mutations and genetic therapies.

— Cybersecurity startup Deep Instinct uses deep learning to predict, prevent, and detect threats to enterprise computing systems.

Ditto Labs uses deep learning to identify brands and logos in images posted to social media.

Enlitic offers DL-based patient triage, disease screening, and clinical support to make medical professionals more productive.

— Gridspace provides conversational speech recognition systems based on deep learning.

Indico offers DL-driven tools for text and image analysis in social media.

And, in a sign that commercial development of deep learning isn’t all hype and bubbles, NLP startup Idibon ran out of money and shut down. We can expect further consolidation in the DL tools market as major vendors with deep pockets ramp up their programs. The greatest opportunity for new entrants will be in specialized applications, where the founders can deliver domain expertise and packaged solutions to well-defined problems.

Supercomputing Goes Mainstream

To make deep learning practical, you need a lot of computing horsepower. In 2016, hardware vendors introduced powerful new platforms that are purpose-built for machine learning and deep learning.

While GPUs are currently in the lead, there is a serious debate under way about the relative merits of GPUs and FPGAs for deep learning. Anand Joshi explains the FPGA challenge. In The Next Platform, Nicole Hemsoth describes the potential of a hybrid approach that leverages both types of accelerators. During the year, Microsoft announced plans to use Altera FPGAs, and Baidu said it intends to standardize on Xilinx FPGAs.

NVIDIA Launches the DGX-1

NVIDIA had a monster 2016, tripling its market value in the course of the year. The company released the DGX-1, a deep learning supercomputer. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

NVIDIA also revealed a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow, and Theano.

Tech media salivated:

MIT Technology Review interviewed NVIDIA CEO Jen-Hsun Huang, who is now Wall Street’s favorite tech celebrity.

Separately, Karl Freund reports on NVIDIA’s announcements at the SC16 supercomputing show.

Early users of the DGX-1 include BenevolentAI, PartnersHealthCare, Argonne and Oak Ridge Labs, New York University, Stanford University, the University of Toronto, SAP, Fidelity Labs, Baidu, and the Swiss National Supercomputing Centre. Nicole Hemsoth explains how NVIDIA supports cancer research with its deep learning supercomputers.

Cray Releases the Urika-GX

Cray launched the Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high-performance network interconnect. Cray ships 16, 32 or 48 nodes in a rack in the third quarter, larger configurations later in the year.

Intel Responds

The headline on the Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. Intel acquired Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reported a price tag of $408 million. The customary tech media unicorn story storm ensues.

Intel said it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Paul Alcorn offers additional detail on Intel’s new Xeon CPU and Deep Learning Inference Accelerator. In Fortune, Aaron Pressman argues that Intel’s strategy for machine learning and AI is smart, but lags NVIDIA. Nicole Hemsoth describes Intel’s approach as “war on GPUs.”

Separately, Intel acquired Movidius, the folks who put a deep learning chip on a memory stick.

Cloud Platforms Build ML/DL Stacks

Machine learning use cases are inherently well-suited to cloud platforms. Workloads are ad hoc and project oriented; model training requires huge bursts of computing power for a short period. Inference workloads are a different matter, which is one of many reasons one should always distinguish between training and inference when choosing platforms.

Amazon Web Services

After a head fake earlier in the year when it publishing DSSTNE, a deep learning project that nobody wants, AWS announces that it will standardize on MXNet for deep learning. Separately, AWS launched three new machine learning managed services:

Rekognition, for image recognition

Polly, for text to speech

Lex, a conversational chatbot development platform

In 2014, AWS was first to market among the cloud platforms with GPU-accelerated computing services. In 2016, AWS added P2 instances with up to 16 Tesla K8- GPUs.

Microsoft Azure

Released in 2015 as CNTK, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit and released Version 2.0, with a new Python API and many other enhancements. The company also launched 22 cognitive APIs in Azure for vision, speech, language, knowledge, and search. Separately, MSFT released its managed service for Spark in Azure HDInsight and continued to enhance Azure Machine Learning.

MSFT also announced the Azure N-Series compute instances powered by NVIDIA GPUs for general availability in December.

Azure is one part of MSFT’s overall strategy in advanced analytics, which I’ll cover in Part Three of this review.

Google Cloud

In February, Google released TensorFlow Serving, an open source inference engine that handles model deployment after training and manages their lifetime.  On the Google Research Blog, Noah Fiedel explained.

Later in the Spring, Google announced that it was building its own deep learning chips, or Tensor Processing Units (TPUs). In Forbes, HPC expert Karl Freund dissected Google’s announcement. Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs.

Google launched a dedicated team in October to drive Google Cloud Machine Learning, and announced a slew of enhancements to its services:

— Google Cloud Jobs API provides businesses with capabilities to find, match and recommend jobs to candidates. Currently available in a limited alpha.

Cloud Vision API now runs on Google’s custom Tensor Processing Units; prices reduced by 80%.

Cloud Translation API will be available in two editions, Standard and Premium.

Cloud Natural Language API graduates to general availability.

In 2017, GPU-accelerated instances will be available for the Google Compute Engine and Google Cloud Machine Learning. Details here.

IBM Cloud

In 2016, IBM contributed heavily to the growing volume of fake news.

At the Spark Summit in June, IBM announced a service called the IBM Data Science Experience to great fanfare. Experienced observers found the announcement puzzling; the press release described a managed service for Apache Spark with a Jupyter IDE, but IBM already had a managed service for Apache Spark with a Jupyter IDE.

In November, IBM quietly released the service without a press release, which is understandable since there was nothing to crow about. Sure enough, it’s a Spark service with a Jupyter IDE, but also includes an R service with RStudio, some astroturf “community” documents and “curated” data sources that are available for free from a hundred different places. Big Whoop.

In IBM’s other big machine learning move, the company rebranded an existing SPSS service as Watson Machine Learning. Analysts fell all over themselves raving about the new service, apparently without actually logging in and inspecting it.

screen-shot-2016-10-30-at-11-05-33-am

Of course, IBM says that it has big plans to enhance the service. It’s nice that IBM has plans. We should all aspire to bigger and better things, but keep in mind that while IBM is very good at rebranding stuff other people built, it has never in its history developed a commercially successful software product for advanced analytics.

IBM Cloud is part of a broader strategy for IBM, so I’ll have more to say about the company in Part Three of this review.

How to Steal a Predictive Model

In the Proceedings of the 25th USENIX Security Symposium, Florian Tramer et. al. describe how to “steal” machine learning models via Prediction APIs. This finding won’t surprise anyone in the business, but Andy Greenberg at Wired and Thomas Claburn at The Register express their amazement.

Here’s how you “steal” a model:

— The prediction API tells you what variables the model uses; the packaging for a prediction API will say something like “submit X1 and X2, we will return a prediction for Y”; so you know that X1 and X2 are the variables in the model. The developer can try to fool you by directing you to submit a hundred variables even though it only needs two, but that’s not likely; most developers make the prediction API as parsimonious as possible.

— Use an experimental design to create test records with a range of values for each variable in the model. You won’t need many records; the number depends on the number of variables in the model and the degree of granularity you want.

— Now, ping the API with each test record and collect the results.

— With the data you just collected, you can estimate a model that approximates the model behind the prediction API.

The authors of the USENIX paper tested this approach with BigML and Amazon Machine Learning, succeeding in both cases. BigML objects; Amazon sleeps.

Legally, it may not be stealing. Model coefficients are intellectual property. If someone hacks into your model repository and steals the model file, or bribes one of your data scientists into providing the coefficients, that is theft. But while IP owners can assert a right over their actual code, it is much harder to assert a right to an application’s observable behavior. Reverse-engineering is legal in the U.S. and the European Union so long as the party that performs the work has legal possession of the relevant artifacts. If someone lawfully purchases predictions from your prediction API, they can reverse-engineer your model.

Restrictive licenses offer limited protection. Intellectual property owners can assert a claim against reverse-engineering if the predictions are under an end-user license that prohibits the practice. The fine print will please your Legal department, but is virtually impossible to enforce. Predictions, unlike other forms of intellectual property, aren’t watermarked; they’re just numbers.

Pricing plays a role. While it may be technically feasible to reverse-engineer a predictive model, it may be prohibitively expensive to do so. Models that predict behavior with financial implications, such as consumer credit risk models, are expensive. Arguably, the best way to prevent reverse-engineering is to charge a non-cancellable annual subscription fee for access to the API rather than selling predictions by the record. In any event, the risk of reverse-engineering should be a consideration in pricing.

Encryption may be necessary. If you want to do business with trusted parties over an open API, a hashing algorithm can scramble the prediction in a way that makes reverse-engineering impossible. Of course, the customer must be able to decrypt the prediction at their end of the transaction, with a key transmitted separately or from a common random seed.

Access control is key. The key point of the USENIX authors is that if your prediction API is available “in the wild,” you might as well call it an open source model because reverse-engineering is easy to do. Of course, if you are in the business of selling predictions, you already have some form of access control so you can meter usage and charge an account. Bad actors, however, have credit cards; so, if you are concerned about your predictive model’s IP, you’re going to have to establish tighter control over access to the prediction API.

Big Analytics Roundup (April 11, 2016)

Top story of the week is NVIDIA’s new DGX-1 deep learning chip; scroll down for more on that.

We have three roundups from Strata + Hadoop World, Rashomon style:

  • Alex Woodie reports six takeaways: Kafka, Spark, Hadoop, Cloud, machine learning, mainframes.
  • Jessica Davis recalls four things: comedian Paula Poundstone, MapR, public data sets, AI.
  • Nik Rouda recaps five things: Spark, machine learning, data warehousing, user interfaces, cloud.

— H2O.ai CTO and co-founder Cliff Click departs H2O, joins Neurensic, a firm that specializes in compliance analytics. Neurensic has a team of surname-eschewing executives that is surprisingly large considering it has no visible funding.

— Machine learning startup Context Relevant announces the appointment of Joseph Polverari as CEO, replacing board member Chris Kelley, who replaced founder Stephen Purpura in July, 2015, a month after the latter wrote a meditation on failure. Kelley’s major accomplishment: firing people. Appears that Context Relevant isn’t the next unicorn.

— One of the 76 IBM executives with the title of “CTO” touts cognitive computing. My take:

Screen Shot 2016-04-10 at 7.52.54 AM

— Forrester publishes its 2016 “Wave” for Big Data Streaming Analytics. You can go here and buy it for $2,495, get a free copy here, or just look at the picture below.

Screen Shot 2016-04-10 at 3.52.54 PM

— Spiderbook’s Aman Naimat examines data gleaned by trolling through billions of publicly available documents, identifies 2,680 companies that are using Hadoop at any level of maturity, and another 3,500 that are just learning. That’s out of a total universe of 500,000 companies worldwide. I’m thinking that trolling through billions of public documents may understate the actual incidence of Hadoop usage.

— Crowdflower, a data enrichment platform, surveys data scientists and publishes the results. The report does not disclose how data scientists were identified and sampled, which is key to interpreting surveys like this. Respondents report that they spend a lot of time mucking around with data, which won’t surprise anyone, since Crowdflower sells a service that helps data scientists spend less time mucking with data.

NVIDIA Unveils Deep Learning Chip

— NVIDIA announces June availability for the DGX-1, a deep learning supercomputer on a chip. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

— NVIDIA also reveals a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow and Theano.

— Selected media reports:

— MIT Technology Review interviews NVIDIA CEO Jen-Hsun Huang.

Explainers

— Ian Pointer explains Structured Streaming, coming up in Spark 2.0.

— Till Rohrmann introduces Complex Event Processing (CEP) with Flink.

— Maxime Beauchemin explains Caravel, Airbnb’s data exploration platform.

— LinkedIn’s Akshay Rai explains Dr. Elephant, a newly open-sourced self-service performance tuning package for Hadoop and Spark.

— In a guest post on the Cloudera Engineering Blog, engineers from Wargaming.net explain how they built their real-time recommendation engine with Spark, Kafka, HBase and Drools.

— Katrin Leinweber et. al. explain how to analyze an assay of bacteria-induced biofilm formation the freshwater diatom Achnanthidium minutissimum with KNIME. In case you’re wondering, Achnanthidium minutissimum is a kind of algae.

Perspectives

— On LinkedIn, George Hill of The Cyclist nicely critiques the 2011 McKinsey Big Data report, offering a point by point assessment.

— Mauricio Prinzlau of Cloudwards.net opines, without data, that the five languages paving the future of machine learning are MATLAB/Octave, R, Python, “Java-family/C-family” and Extreme Learning Machines (ELM). What was that last one again? Personally, I’ve never seen anyone lump Java and C into a single category, but whatever.

— In InfoWorld, “internationally recognized industry expert and thought leader” David Linthicum ventures into the machine learning discussion by arguing that it’s mostly BS.

— John Dunn demonstrates his ignorance of fraud by asking if machine learning can help banks detect it. As if they haven’t been doing that for years. Also, the “hard decline” he describes at the beginning of the article is rare; most false positives produce “soft declines,”, where the merchant is asked to request identification or speak with the call center.

— In IBT, Ian Allison wonders if financial analysts will lose their jobs to intelligent trading machines. If he watched Billions, he would know that financial analysts spend their time procuring inside information.

— Timo Elliott argues that BI is dead. I have to wonder if it was ever alive.

— Confluent CTO Neha Narkhede opines on stream processing. She’s in favor of it.

— Brandon Butler interviews AWS’ Matt Wood, who chats about competing with Google and Microsoft.

— On Forbes, Robert Hof interviews Cloudera CEO Tom Reilly.

Open Source Announcements

— Qubole releases SQL optimizer Quark to open source.

— Flink releases version 1.0.1, a maintenance release.

— Apache Lens, a “unified analytics interface,” releases version 2.5.0 to beta.

— Airbnb open sources Caravel, a data exploration package.

— Apache Tajo announces Release 0.11.2, which should please its user.

— LinkedIn releases Dr. Elephant to open source.

Commercial Announcements

— Databricks announces the agenda for Spark Summit 2016 in SFO.

— Cloudera announces Cloudera Enterprise 5.7. New analytic bits include Hive-on-Spark GA, support for the HBase-Spark module, support for Spark 1.6 and support for Impala 2.5.

— MapR announces availability of Apache Drill 1.6 as the unified SQL layer for the MapR Converged Data Platform.

2016 Big Analytics Predictions Roundup

Before publishing my own predictions for 2016 later this week, I thought it would be fun to round up published predictions on analytics and Big Data.  Looking through this list, I see a few patterns:

— Streaming is hot.  Analysts do not seem to understand distinctions between streaming data, streaming analytics and real-time decisioning.

— “Data Science” continues to be a term that means whatever you like.

— Security and anti-fraud analytics will be a thing in 2016.  (They were also a thing in 2015.)

— Industry analysts are divided about whether or not the analytics talent crunch will persist.

— IoT is a great concept for selling data management tools, but few know how to make sense of it.

On ZDNet, Andrew Brust summarizes 60 predictions from 17 executives and sees the following:

  1. Increased adoption of streaming analytics
  2. Maturation of IoT technologies
  3. Value and maturity in Big Data products
  4. Increased deployment of artificial intelligence and machine learning

On KDnuggets, Gregory Piatetsky reports on five predictions for 2016 from Tom Davenport of the International Institute of Analytics.  (Webinar replay here.)

  1. Cognitive technology will be the next thing after automated analytics.
  2. Analytical microservices will facilitate embedded analytics.
  3. Data Science and predictive analytics will merge.
  4. The analytics talent crunch will ease due to increased enrollment in graduate programs.
  5. Analytics will focus on data curation and management.

Davenport is smoking something if he thinks cognitive computing will be a thing in 2016.

In Forbes, Gil Press synthesizes the IIA’s predictions (above) with predictions from Forrester, IDC and Gartner to get six predictions:

  1. Analytics will be embedded everywhere.
  2. Machine learning will replace manual data wrangling.
  3. The shortage of analytics talent will persist.
  4. Analytics projects will be riskier than typical IT projects.
  5. Cognitive computing will be the next buzzword.  (Press clearly does not agree with Davenport).
  6. Data monetization will take off.

Predictions (2) and (3) conflict with one another; since analysts spend 80% of their time data wrangling, tooling that automates this step will relieve the talent shortage.

On Datanami, Alex Woodie wades through “dozens” of predictions and publishes the 33 most interesting.  Many of these are self-serving, obvious or nonsensical, so I will do the work Woodie’s editor did not do and distill the list to five:

  1. Streaming analytics will mature and prove its worth.
  2. Apache Kafka will be an essential integration point in enterprise infrastructure.
  3. Business user access to Hadoop data will improve.
  4. Spark will significantly displace MapReduce for Hadoop workloads.
  5. Spark processing outside of Hadoop will also increase significantly.

Teryn O’Brien of Silicon Angle reports on a webinar hosted by Alteryx that included Bob Laurent of Alteryx, Clarke Patterson of Cloudera and Francois Ajenstat of Tableau.  The panel offered three predictions:

  1. Analyst jobs will be hot and analysts will be everyday heroes.
  2. Spark, the cloud and IoT will be big in 2016.
  3. Advanced analytics will play a key role in the Presidential election.

On ITPortal, Dell’s Todd O’Brien predicts three things for 2016:

  1. The role of Citizen Data Scientists will expand and evolve.  (Me: WTF?)
  2. Analytics will significantly affect vertical markets, especially manufacturing.
  3. All innovation will trace back to analytics

On the first point, I think that O’Brien is trying to say that companies should buy analytics software that is easy to use, like what Dell offers.

On the FICO blog, FICO’s chief analytics officer Scott Zoldi offers five predictions for 2016:

  1. Streaming analytics will come of age in 2016.
  2. “Prescriptive analytics” (his term for anomaly detection) will be a must-have security technology.
  3. “Lifestyle analytics” (predictions embedded in consumer interactions) will integrate prescriptive analytics into daily life.
  4. Businesses will rethink Big Data governance.
  5. Fake data scientists will emerge.

On a SAS blog, Polly Mitchell-Guthrie predicts five things:

  1. Machine learning (will be) established in the enterprise.
  2. IOT hype hits reality.
  3. Big Data moves beyond hype.
  4. Analytics improve cybersecurity.
  5. Analytics drives increased industry-academic interaction.

It’s standard practice at SAS to call any new IT trend “hype.”

In a press release, the health analytics vendor SCIO Health Analytics makes four predictions for 2016:

  1. Greater focus on educating health consumers.
  2. Demand for more precision in health analytics.
  3. More time will be spent on reimbursement strategies.
  4. The need for data and transparency across domains will increase.

Prediction #1 may be true, but it’s not really about health analytics.

On the Talend blog, CMO Ashley Stirrup predicts four things:

  1. Real-time analytics will take center stage
  2. New business threats will emerge
  3. CIO turnover will accelerate
  4. Businesses will retool

#2 and #4 aren’t really predictions, they simply state the obvious.

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, Spark SQL graduates from Alpha, new algorithms in MLLib and Spark Streaming, a direct Kafka API for Spark Streaming, plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics With Sparkwrites on tuning Spark jobs, on the Cloudera Engineering blog
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process Terabyte scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on Git.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

H2O

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.   Ryft’s Christian Shrauder blogs about FGPA.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze highly dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab‘s Jiannen Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

RapidMiner

SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that AirBNB has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal is an application that “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

Zementis

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First some definition.  The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I’ve written previously about the importance of distributed computing for high-performance predictive analytics, why it’s difficult to deliver and potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor’s data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster‘s built-in analytic functions.)  While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in.  In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they’re effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H20 is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters. Current functionality (Release 2.8.4.4) includes Cox Proportional Hazards modeling, Deep Learning, generalized linear models, gradient boosted classification and regression, k-Means clustering, Naive Bayes classifier, principal components analysis, and Random Forests. The software also includes tooling for data transformation, model assessment and scoring.   Users interact with the software through a web interface, a REST API or the h2o package in R.  H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.

H2O.ai provides commercial support for the open source software.  There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, Paypal and Nielsen.

MADLib 

Distributor: Pivotal Software

MADLib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7).  While primarily a captive project of Pivotal Software — most of the top contributors are Pivotal or EMC employees — the support for PostgreSQL qualifies it for this list.    MADLib includes rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project incepted in 2011 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community.  The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago.   Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce.  To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark.  For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that’s worth).  Several commercial vendors, including Predixion Software and RapidMiner leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics.  Spark is a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics; currently (Release 1.2) it supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3.  Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLLib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers.  Thanks to large and growing contributor community, Spark MLLib’s functionality is expanding faster than any other open source or commercial software listed in this article.

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases.  Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum database.  (Alpine and Greenplum have common roots in the EMC ecosystem).   Usability is the product’s key selling point, and the analytic feature set is relatively modest; however, Chorus’ collaboration and data cataloguing capabilities are unique.  Alpine’s customer list is growing; the list does not include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database.  Embedded in SQL, analytics may be invoked from a range of application, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS.   The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately).  The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes.  Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees.  The cadence of enhancements for this product is very slow; first released in May 2013, IBM has released a single maintenance release since then.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts.  For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, Naive Bayes classifier and two Spark MLLib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and Datastax Enterprise.  As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components, including Revolution R, an enhanced and commercially supported R distribution, a Windows IDE, integration tools and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface.  A little over a year ago, Revolution released its version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce.   The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering.   Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC.  Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks).  In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated “Analysis” nodes within the Hadoop cluster.)  While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users leverage the algorithms from SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December, 2011 with great fanfare.  To date the product lacks a single public reference customer; this could mean that SAS’ Marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between.  As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA’s inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture. Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA.  SAS prices this software based on the number of cores in the servers upon which it is deployed; while I can’t disclose the list price per core, it’s fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first year fee.

Skytree

Product: Skytree Infinity

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013.  Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark.  Distributed algorithms can run as a free-standing cluster or co-located in Hadoop under YARN.  The product has a programming interface; the vendor claims ability to run from R, Weka, C++ and Python.   Neither Skytree’s modest list of algorithms nor its short list of public reference customers has changed in the past two years.

2015: Predictions for Big Analytics

First, a review of last year’s predictions:

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon.  At my lunch table, every single person said his company is currently evaluating Spark.  There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.

(2) “Co-location” will be the latest buzzword.

Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop.  YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.

(3) Graph engines will be hot.

Graph engines did not take off in 2014.  Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production.  However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.

(4) R approaches parity with SAS in the commercial job market.

As of early 2014, when Bob Muenchin last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.

Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists.  I asked Linda what analytic languages hiring managers seek when they hire quants.  “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says.   “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion. ”

 (5) SAP emerges as the company most likely to buy SAS.

After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014.  The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.

(6) Competition heats up for “easy to use” predictive analytics.

Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch.  Alpine, Alteryx, and Rapid Miner all gained market presence in 2014; Dell’s acquisition of Statsoft gives that company the deep pockets they need for a makeover.  In easy to use cloud analytics, StatWing has added functionality, and IBM Watson Analytics emerged from beta.

Four out of six ain’t bad.  Now looking ahead:

(1) Apache Spark usage will explode.

While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists.  That will change in 2015, for several reasons:

  • The R interface planned for release in Q1 opens the platform to a large and engaged community of users
  • Alteryx, Alpine and other easy to use analytics tools currently support or plan to support Spark RDDs as a data source
  • Databricks Cloud offers an easy way to spin up a Spark cluster

As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.

(2) Analytics in the cloud will take off.

Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud.  And yet, all of the top ten data breaches in 2014 defeated an on-premises security system.  Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.

Cloud is eating the analytics world for three big reasons:

  • Analytic workloads tend to be lumpy and difficult to predict
  • Analytic projects often need to get up and running quickly
  • Analytic service providers operate in a variable cost world, with limited capital for infrastructure

Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others.  For the business user, StatWing and IBM Watson Analytics provide compelling independent cloud-based platforms.

Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services.  Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS,  lacks a native access engine for RedShift, and does not support its Hadoop interface with EMR.

(3) Python will continue to gain on R as the preferred open source analytics platform.

The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome.  As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base.  For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up.  According to a recent poll by KDNuggets, more people switch from R to Python than the other way ’round.

Both languages have their virtues. The sheer volume of analytic features in R is much greater than Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge.  Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.

Python has two key advantages over R.  As a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R.  Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use.  There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.

(4) H2O will continue to win respect and customers in the Big Analytics market.

If you’re interested in scalable analytics but haven’t checked out H2O, you should.  H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau.  H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.

While the software is freely available, H2O offers support and services for an attractive price.  The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and Paypal.

(5) SAS customers will continue to seek alternatives.

SAS once had an almost religious loyalty from its customers.  This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software.  While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration.  While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.

While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives.  This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important.  Even among big banks and pharma companies, though, SAS user headcount is declining.