The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub. In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is more agile when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive Toolkit, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDnuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages. Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles of a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDnuggets poll, 46% said they had used Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Package Index (PyPI). Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December. The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDnuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.
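
To make the steady-state argument concrete, here is a minimal sketch in plain Python. It fits the same simple linear regression twice: once over the whole data set, and once by updating running statistics one event at a time. The names (batch_fit, StreamingFit) and the synthetic data are assumptions made for illustration, and none of this reflects how Flink or any other engine implements learning. Because the simulated process never changes, the two fits agree to within floating-point error.

```python
import random


def batch_fit(xs, ys):
    """Ordinary least squares for y = a + b*x over the full data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


class StreamingFit:
    """The same regression, updated one event at a time (Welford-style)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # running sum of squared deviations of x
        self.sxy = 0.0  # running co-moment of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x                # deviation from the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.sxx += dx * (x - self.mean_x)  # old deviation times new deviation
        self.sxy += dx * (y - self.mean_y)  # uses the updated mean of y

    def coefficients(self):
        slope = self.sxy / self.sxx
        return self.mean_y - slope * self.mean_x, slope


# Synthetic steady-state process (an assumption): y = 2 + 0.5*x + noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]

batch_a, batch_b = batch_fit(xs, ys)

stream = StreamingFit()
for x, y in zip(xs, ys):
    stream.update(x, y)
stream_a, stream_b = stream.coefficients()

print("batch:    ", batch_a, batch_b)
print("streaming:", stream_a, stream_b)
```

If the process drifted partway through the stream, the streaming estimate would drift with it, which is exactly the case where the model can no longer be trusted to predict future events.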

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which raises the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks. It includes a stochastic gradient descent algorithm for model training.
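
For readers who have not seen stochastic gradient descent written out, the following is a generic single-machine sketch on a one-variable linear model. It is not SINGA code and says nothing about how SINGA distributes training; the function name sgd_linear, the learning rate, and the noise-free toy data are all assumptions chosen so the example stays short.

```python
import random


def sgd_linear(data, lr=0.01, epochs=50, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)          # visit the examples in a new random order
        for x, y in data:
            err = (w * x + b) - y  # prediction error on this single example
            w -= lr * err * x      # gradient of 0.5*err**2 with respect to w
            b -= lr * err          # gradient of 0.5*err**2 with respect to b
    return w, b


# Toy data (an assumption): points that lie exactly on y = 3x + 1.
rng = random.Random(1)
points = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(200))]

w, b = sgd_linear(points)
print(w, b)
```

Each update uses the gradient from a single example rather than the full data set, which is what makes the method "stochastic" and what makes it practical to spread across many workers.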

The team delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed Machine Learning Community (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.
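
The core idea behind gradient boosting is easy to sketch, even though XGBoost itself adds regularization, second-order gradient information, and distributed execution that are not shown here. The toy below fits a sequence of one-split regression trees ("stumps") to the residuals of a squared-error model on one-dimensional data. All names (fit_stump, boost, predict) and the data are hypothetical; this is plain gradient boosting, not the XGBoost algorithm or its API.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (an assumption): a smooth curve sampled at 40 points.
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]

model = boost(xs, ys)
mse_baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mse(ys, [predict(model, x) for x in xs])
print(mse_baseline, mse_boosted)
```

Each round removes part of the error left by the previous rounds, so the training error of the boosted model falls well below that of the constant baseline.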

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under the GPL license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

Python 3.5 support
iOS support
Microsoft Windows support (selected functions)
CUDA 8 support
HDFS support
k-Means clustering
WALS matrix factorization
Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.
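
A symbolic graph is easier to grasp with a toy in hand. The sketch below builds an expression as a graph of Python objects, evaluates it against concrete values, and derives a gradient by walking the graph. The class names (Var, Const, Add, Mul) are made up for this illustration and are not Theano’s API; the sketch also skips the graph optimization and compilation steps that Theano performs.

```python
class Node:
    """Base class: arithmetic operators build new graph nodes."""

    def __add__(self, other):
        return Add(self, wrap(other))

    def __mul__(self, other):
        return Mul(self, wrap(other))

    __radd__ = __add__  # addition and multiplication are commutative here
    __rmul__ = __mul__


class Const(Node):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value

    def grad(self, var):
        return Const(0.0)


class Var(Node):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]

    def grad(self, var):
        return Const(1.0 if var is self else 0.0)


class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

    def grad(self, var):
        return self.left.grad(var) + self.right.grad(var)


class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)

    def grad(self, var):
        # Product rule, expressed as a new symbolic graph.
        return self.left.grad(var) * self.right + self.left * self.right.grad(var)


def wrap(value):
    """Turn plain numbers into Const nodes so they can join the graph."""
    return value if isinstance(value, Node) else Const(float(value))


# Define the expression first; no numbers are involved yet.
x, w, b = Var("x"), Var("w"), Var("b")
linear = w * x + b
squared = linear * linear

# Gradients are themselves graphs, built before any evaluation.
d_dw = squared.grad(w)
d_db = squared.grad(b)

env = {"x": 3.0, "w": 2.0, "b": 1.0}
print(squared.evaluate(env), d_dw.evaluate(env), d_db.evaluate(env))
```

The point of the design is the separation: the expression is defined once as data, and only then evaluated or differentiated, which is what gives a framework room to optimize the graph before running it.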

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Community (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and JavaScript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.
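
The separation Chollet describes can be illustrated with a small sketch. Below, the model code (Dense, Sequential) calls only three operations on a back-end object, so swapping the back end changes how the arithmetic is done without touching the model definition. Every name here is hypothetical and the two back ends are deliberately trivial; this is the general pattern, not Keras’s actual internals.

```python
class LoopBackend:
    """Toy back end that does its arithmetic with explicit loops."""

    def matvec(self, matrix, vector):
        out = []
        for row in matrix:
            total = 0.0
            for weight, value in zip(row, vector):
                total += weight * value
            out.append(total)
        return out

    def add(self, a, b):
        out = []
        for x, y in zip(a, b):
            out.append(x + y)
        return out

    def relu(self, vector):
        out = []
        for v in vector:
            out.append(v if v > 0 else 0.0)
        return out


class FunctionalBackend:
    """Second toy back end: the same three operations, written differently."""

    def matvec(self, matrix, vector):
        return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

    def add(self, a, b):
        return [x + y for x, y in zip(a, b)]

    def relu(self, vector):
        return [max(v, 0.0) for v in vector]


class Dense:
    """A fully connected layer that knows nothing about how math is executed."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def apply(self, backend, inputs):
        z = backend.add(backend.matvec(self.weights, inputs), self.bias)
        return backend.relu(z)


class Sequential:
    def __init__(self, layers):
        self.layers = layers

    def predict(self, backend, inputs):
        for layer in self.layers:
            inputs = layer.apply(backend, inputs)
        return inputs


# One model definition, two interchangeable back ends.
model = Sequential([
    Dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    Dense([[2.0, 1.0]], [0.5]),
])

out_loop = model.predict(LoopBackend(), [2.0, 3.0])
out_functional = model.predict(FunctionalBackend(), [2.0, 3.0])
print(out_loop, out_functional)
```

Because the architecture never refers to a specific back end, adding a new one means implementing the handful of operations the layers rely on, which is why a port of this kind can be relatively painless.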

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. It is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC), is a deep learning framework released under an open source BSD license. Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++. Users interact with Caffe through a Python API or through a command line interface. Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

16 responses to “The Year in Machine Learning (Part Two)”

Chris Nicholson

January 2, 2017 at 1:55 pm

Hi Thomas –

Thanks for including us in your roundup.

The mission statement about bringing Google’s deep learning to the rest of the world dates from 2014, and they only released TensorFlow in late 2015. We actually don’t say that anymore. To put a finer point on it, Google has open-sourced some tools. But the infrastructure that TensorFlow requires to run fast, the proprietary TPUs, and the massive datasets Google uses to train its nets are not public and may never be. Our framework is optimized to run on publicly available chips, and is faster than TensorFlow on multiple GPUs. We also offer a higher-level API. TensorFlow, as you may know, is relatively low-level, like Theano, and works best with a framework like Keras on top.

What you term a tiny venture round was actually a decent-sized seed round, the first of what may be many, and larger, funding rounds. Everybody’s got to start somewhere. 😉

Seven of our employees have created profiles on Crunchbase, where I assume you got that data. We actually have 16, and they are listed on our team page: https://skymind.ai/about Our team includes PhDs, a former principal software architect at Cloudera, and an engineer who helped design chips at NVIDIA, among others.

You’re right to point out that data scientists prefer Python and R. We will be giving them a Python API soon, and currently they can import models to DL4J from Python frameworks: https://deeplearning4j.org/model-import-keras

Businesses need an AI stack that provides the right tools for at least three teams: data engineers building the pipelines and storage, data scientists, and DevOps. Only the second group works predominantly in Python or R. The first and third groups use a lot of JVM-based tools such as Hadoop, Spark, Kafka, ElasticSearch, Pig, Hive and HBase. We give all three a cross-team solution that allows data scientists to work in Python and then deploy to production on the JVM, where Deeplearning4j is the dominant deep-learning framework. We’re number 6 overall.

Deep learning frameworks need to optimize for other problems beyond easy prototyping. This includes integrations and containerization. We’re the only DL framework that is certified on CDH and HDP. We’re dockerized to run on any OS. And we’ve worked with the teams at NVIDIA, Intel and IBM to run fast and well on GPUs, CPUs and Power chips.

If you believe updates and activity are indicators of a healthy project, please check out our Github and Gitter pages, which are pretty active:

https://github.com/deeplearning4j
https://gitter.im/deeplearning4j/deeplearning4j

Chris Nicholson
Skymind.ai

1. Thomas W. Dinsmore
  
  January 2, 2017 at 3:32 pm
  
  Chris,
  
  Thank you for reading. I appreciate the clarifications. A few comments:
  
  — Crunchbase characterizes your 2016 funding as a venture round, which makes sense since most of the investors in the latest round rarely do seed rounds, and Skymind did a couple of seed rounds in 2015.
  
  — I agree with you that Google has designed TensorFlow to drive inference business to its cloud platform. Where I disagree with you is in thinking that there is a great benefit for organizations to build out their own deep learning back ends rather than training in the cloud.
  
  — Can you provide a link to support the claim that DL4J is faster than TensorFlow? If it’s a legitimate benchmark, I’ll publish it. A word of caution, though — unless DL4J is an order of magnitude faster, nobody will care.
  
  — A Python API will be great when you deliver it. You are correct that data scientists are just one part of the team. They are, however, the most important part, and machine learning projects that lack R or Python APIs will go nowhere.
  
  — DL4J’s Hadoop integration is a strength for inference, but not a compelling strength for training. Model training workloads are moving to the cloud; frameworks that support this will thrive, others will lag behind.
  
  — A higher-level API is a good thing if you take it to the logical conclusion and abstract it completely from the back end, as Keras has done. That is the direction of deep learning.
  
  Every entrepreneur believes that his solution is the killer app. While I can see that Skymind has a distinct approach to deep learning, I’m not convinced that it’s a winning approach. But time will tell.
  
  1. Chris Nicholson
    
    January 2, 2017 at 5:02 pm
    
    Hi Thomas –
    
    Thanks for your thoughtful response.
    
    > Crunchbase characterizes your 2016 funding as a venture round, which makes sense since most of the investors in the latest round rarely do seed rounds, and Skymind did a couple of seed rounds in 2015.
    
    Crunchbase is a crowd-sourced database, so not all the information in the profiles is up to date or initially correct. I’ve fixed the profile now. In April we raised money from a group of angels and smaller funds, including Ron Conway’s seed fund SV Angel, as well as some strategics and larger funds as you noted. The round came at the end of YC. I’d qualify the 2015 money as pre-seed, which is not an available tag in Crunchbase, but that’s just splitting hairs.
    
    > I agree with you that Google has designed TensorFlow to drive inference business to its cloud platform. Where I disagree with you is in thinking that there is a great benefit for organizations to build out their own deep learning back ends rather than training in the cloud.
    
    It has been widely written that about 85% of all workloads run on prem at the moment, even though many of them will move to cloud eventually. So yes, cloud is the future, but on prem is the present, even if those figures are overblown, and companies want deep learning now, which means it has to be on prem. The best solution will be platform neutral, as we are, running on prem, cloud and hybrid during the intermediate stages. In our case, the backend is already built and works on top of technology they’ve adopted, like Hadoop, Spark, Kafka etc. TensorFlow doesn’t pay a lot of attention to the JVM stack.
    
    One of the things that has tripped up ML/DL in the cloud, especially as a service, is data gravity. The companies with budgets have very large datasets, and those are costly to move by definition. Any time you have to choose between moving your data to an algorithm, or an algorithm to your data, the latter is more efficient, all other things being equal. You probably also know that large organizations apply different security measures to various tiers of data. Some of it they’ll send to the cloud soon. Other datasets they don’t even want connected to the Internet. So there will be data to process on prem for a long time to come, and the tools that customers adopt there they will bring with them to the cloud.
    
    > Can you provide a link to support the claim that DL4J is faster than TensorFlow? If it’s a legitimate benchmark, I’ll publish it. A word of caution, though — unless DL4J is an order of magnitude faster, nobody will care.
    
    Here are the results and specs for the benchmarks we’ve run. https://github.com/deeplearning4j/dl4j-benchmark They’re reproducible, but obviously not conducted by an objective third party, which would be preferable. I mostly agree with you that only order of magnitude gains matter in speed, at least at the practitioner level. That said, the efficiency of deep-learning software has a huge impact on cost whether you are running on the cloud or buying your own hardware. So it’s not merely a matter of time but of budget, where gains of less than 10x can have an impact for the people with the purse strings.
    
    > A Python API will be great when you deliver it. You are correct that data scientists are just one part of the team. They are, however, the most important part, and machine learning projects that lack R or Python APIs will go nowhere.
    
    We will be using Keras as our Python API, hooking it up to the DL4J backend with py4j. Work will restart soon, with the holidays over: https://github.com/deeplearning4j/deeplearning4j/issues/2556
    
    > DL4J’s Hadoop integration is a strength for inference, but not a compelling strength for training. Model training workloads are moving to the cloud; frameworks that support this will thrive, others will lag behind.
    
    For training, we use Spark as a data access layer for fast ETL, pulling data out of HDFS. Spark is inefficient as a computation layer, but very useful for orchestrating multiple host threads over many chips.
    
    > A higher-level API is a good thing if you take it to the logical conclusion and abstract it completely from the back end, as Keras has done. That is the direction of deep learning.
    
    We have done that already with our Java and Scala APIs. If you are curious to see them, Java examples are here: https://github.com/deeplearning4j/dl4j-examples and the Scala API mirrors Keras/Torch https://github.com/deeplearning4j/scalnet
    
    > Every entrepreneur believes that his solution is the killer app. While I can see that Skymind has a distinct approach to deep learning, I’m not convinced that it’s a winning approach. But time will tell.
    
    We believe we occupy an important niche, and that niche is where deep learning meets the production environment. Many other libraries are very good tools, but in a sense they are just libraries. The makers of TensorFlow, Theano, Caffe, Keras and MXNet do not offer commercial support or sign SLAs, and given their penchant for research, they are unlikely to do so. The organizations purporting to support those libraries did not create them, and will have a harder time maintaining and extending them. In open source, the biggest committers have the most authority, and the companies working with TF have committed very little. We created DL4J and support it, and for enterprise customers seeking to de-risk tech adoption, this is an important consideration. Our model is “Cloudera for deep learning”: the classic open-source playbook of support, services and training. Google doesn’t like to get its hands messy with enterprise support.
    
    The other deep-learning libraries are “just libraries” in another sense as well, in that they focus on the training stage. As I’m sure you know, a lot of the work that goes into building deep learning solutions is the data pipeline. We’ve built an open-source data preprocessing library called DataVec (https://github.com/deeplearning4j/DataVec) that handles binary data well, and can normalize, clean and vectorize most major data types, like images and video, sound and voice, text, and time series. It’s like Trifacta’s Wrangler but open source. With DataVec, you can create persistent and reusable data pipelines for both training and inference, rather than one-off jobs each time. We created a fast tensor library for n-dimensional arrays using ND4J, JavaCPP and libnd4j to bridge the gap between the JVM, C++ and academic hardware acceleration (https://github.com/deeplearning4j/nd4j). We also built a model evaluation tool called Arbiter (https://github.com/deeplearning4j/Arbiter).
    
    Together, these libraries are more like an integrated deep learning environment (IDLE is an unfortunate acronym, but there you go…). That means Kafka connectors, and packaging these libs with DCOS and Mesos, among other things. For the inference stage, we use Lightbend’s Lagom to expose neural net models as a micro-service that autoscales elastically with traffic, and communicates via a REST API. A lot of libraries aren’t solving for all the problems that surround model training.
    
    Best,
    
    Chris
Thomas W. Dinsmore

January 2, 2017 at 6:45 pm

Chris,

I’ve updated the story based on your feedback. Let me know if you think it’s a fair assessment.

Some additional comments:

— 85% of all workloads may be on-premises, but the relevant measure is analytic workloads, and machine learning workloads in particular. I have not seen good statistics showing the percentage of machine learning workloads that run in the cloud, but anecdotal evidence says it’s extremely high. For many firms, model training in the cloud isn’t AN option — it’s the ONLY option.

— Cloud is the primary platform for model training because the workloads are inherently lumpy and unpredictable. You need a huge amount of processing power for a short period, and most training projects are ad hoc and time-boxed. For most organizations, it simply does not make sense to build out an on-premises platform for model training, even with data center virtualization.

— The data gravity and security arguments against cloud are red herrings. All data moves at some point in the process, and the cost of data movement is just one factor among many to consider. Data is more secure in the public cloud than it is on premises, and there are work practices data scientists can adopt (such as hashing and anonymization) to reduce or eliminate the risk of data breaches.
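
As a rough illustration of the hashing practice mentioned above, here is a minimal Python sketch that replaces direct identifiers with keyed hashes before a data set leaves the premises. The field names, the sample record, and the `SECRET_KEY` value are hypothetical, and keyed hashing is strictly pseudonymization rather than full anonymization, so treat this as a starting point under those assumptions rather than a complete security control.

```python
import hashlib
import hmac

# Hypothetical secret; the assumption is that it stays on premises
# and is never uploaded with the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (hex string) of an identifier."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, id_fields: tuple = ("customer_id", "email")) -> dict:
    """Copy a record, replacing the named identifier fields with keyed hashes."""
    return {
        field: pseudonymize(str(value)) if field in id_fields else value
        for field, value in record.items()
    }


# Hypothetical training row: identifiers are hashed, model features are untouched.
row = {"customer_id": "C-1001", "email": "pat@example.com", "balance": 2500.0}
safe_row = anonymize_record(row)
```

Because the same key always maps an identifier to the same digest, records can still be joined on the hashed fields in the cloud, while the raw identifiers stay behind.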

As I note in the updated story, I’m skeptical that your strategy will succeed, but it’s a credible strategy.

Regards,

Thomas

Chris Nicholson

January 2, 2017 at 7:01 pm

Many thanks, Thomas. We see a lot of training happening on prem, although you may be right that the vast majority is running on the cloud. We work there, too. On prem, many lines of business have access to the resources of a Hadoop cluster managed by central IT, usually CPUs with the occasional GPU mixed in, and we let them train models on those resources by sending in the model config as a Hadoop job/JAR file. For those teams, who don’t control their company’s hardware purchases or cloud access/budget, this is their only option to experiment with new DL tools. I don’t think the data gravity argument is a red herring. MLaaS will probably succeed first with customers already generating and storing their data in the same cloud where the service is offered. Data security may be a red herring, but you’re talking about risk-averse organizations that are path dependent on some tech commitments they made long ago and have invested heavily in. I think the cloud vendors have a long way to go to reassure people about security, and there’s some exciting work going on in ML around learning from encrypted data. Unfortunately, that comes with additional computation costs.

Thomas W. Dinsmore

January 2, 2017 at 7:32 pm

Chris,

Anything can work in the cloud, but you should be seeking out partners who can offer a managed service in the cloud, so it becomes an integral part of your strategy.

Data science teams with clout choose their platforms. Show me a data science team that is stuck with whatever IT provides, and I’ll show you a data science team that hasn’t figured out how to deliver business value. Teams like that will never succeed with deep learning.

If data gravity mattered, we’d be hearing about all of the people who use IBM Intelligent Miner in DB2 and Teradata Warehouse Miner in Teradata. Training models inside a data store only eliminates data movement if all of the data you need is already in the data store, which is never the case. Nobody runs deep learning on production databases — all data moves at least once, so unless you believe that the IT organization perfectly anticipates business needs when it creates its data lake/data warehouse/data mart, every analysis project requires data movement. Believe that and I have a bridge to sell you.

You make a good point about ML in the cloud working with data sources in the cloud, but a growing volume of data sources ARE in the cloud, especially in fields like digital marketing.

Risk averse? I work with banks and insurance companies that have shifted all or most of their model training to the cloud. If an organization is so risk-averse and platform-centric that they aren’t willing to work in the cloud, what are they doing with deep learning? They should stick with linear regression; it’s much safer.

Re security, you don’t actually have to train the models on the encrypted data to ensure good security, but that’s a different discussion.

Adam Gibson

January 2, 2017 at 8:18 pm

Hi Thomas, Chris’ cofounder here.
>> Anything can work in the cloud, but you should be seeking out partners who can offer a managed service in the cloud, so it becomes an integral part of your strategy.
We will be doing that with Microsoft here soon.

>> If data gravity mattered, …

It does to a certain extent for things like cyber security applications. While you’re saying data moves once, indeed it does: from Oracle to HDFS. Actually, it’s often replicated there. Batch scoring for predictive analytics is a very common use case. Data scientists do all of their analytics in a data lake. That lake still has security controls in place.

>> but a growing volume of data sources ARE in the cloud, especially in fields like digital marketing.

While this is important, it will still take time for this to work. One thing I would counter with here is that a lot of the machine-learning-as-a-service offerings located where the data sources are tend to have a lot of lock-in. This is where the competition will heat up, though. A multi-cloud offering is actually very appealing to reduce lock-in. This keeps the components neutral. We see the importance of this by having one of our primary deployment mechanisms be containers.

>> I work with banks and insurance companies that have shifted all or most of their model training to the cloud. If an organization is so risk-averse and platform-centric that they aren’t willing to work in the cloud, what are they doing with deep learning? They should stick with linear regression; it’s much safer.

Half of it is bureaucracy. A lot of our market share is actually in Asia, where data controls are very different from western markets. Many of these organizations are on rules engines. One point I’d make (and I am biased for saying this, so feel free to ignore) is that deep learning is enough of an accuracy win with representation learning that some of these organizations will actually consider deep learning solutions. This is some of what we’re seeing now.

>> Security

I think we’re mainly talking about data access at this point. By that I mean anything with roles and an access control list. Training on encrypted data is an interesting idea though. Hope that helps!

1. Thomas W. Dinsmore
  
  January 2, 2017 at 9:19 pm
  
  Adam,
  
  Great news about your plans for a partnership with MSFT; I’ll be watching for an announcement.
  
  Re: data movement. As I’ve noted at least twice in this thread, model training and model scoring are two completely different things. It’s logical to implement data scoring in the data store. It’s not logical to implement model training in the data store. Model training workloads are moving to the cloud. Cyber security applications are not an exception to that rule.
  
  Re: data sources in the cloud. This already HAS happened. In digital marketing, it happened years ago.
  
  Also, you should not confuse MLaaS with the broader practice of provisioning ML in the cloud. Lots of folks are standing up their data science platforms on Amazon instances, and they could move their platform to Azure or Google or whatever tomorrow without too much effort; Linux is Linux, after all. (Qubole’s data science-as-a-service platform runs in all three.) It’s hard to argue that cloud poses a general risk of lock-in when all of the cloud vendors support open standards and open source software. Of course, if you use Oracle on Amazon, you’re stuck with Oracle.
  
