The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is agiler when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDNuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles to a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDNuggets poll, 46% said they use Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Packaging Index (PyPI.) Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDNuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which begs the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team has delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as like an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed Machine Learning Common (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under GPU license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Common (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and Javascript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. IT is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC) is a deep learning framework released under an open source BSD license.  Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++.  Users interact with Caffe through a Python API or through a command line interface.  Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Big Analytics Roundup (August 15, 2016)

In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, who can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.

DA Cover

Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et. al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to OpenHUB, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what the Apache Beam has accomplished in the first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gillad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches.(h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces Release 0.10.0.1, with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (June 27, 2016)

We have announcements from BlueData, Databricks, and DataStax this week, plus a nice crop of explainers. Also, a bit of catch-up, something from May that I missed: Bob Hayes publishes an interesting summary of his recent survey of data scientists. Includes an infographic and slides.

Thiemo Fetzer asks: did the weather affect the Brexit vote? Spoiler: he says no.

Presented without comment: Medical Information Records, Inc, says it uses Microsoft Azure Cloud to reduce postoperative nausea and vomiting.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

Explainers

— On the Databricks blog, Denny Lee and Jules Damji explain key Spark terms.

— Adrian Colyer explains chatbots.

— Aaron Schumacher explains how to get started with TensorFlow.

— Allan Engelhardt explains Microsoft’s R-based analytics capabilities.

— ThinkReactive’s Deenar Torasker explains visualization using HTML5, SVG, CSS, D3 and Javascript InfoVis Toolkit.

— Brandon Butler explains what’s inside Cisco’s Tetration analytics platform.

— On the BlueData blog, Anant Chintamaneni explains BlueData in the public cloud.

— Manjeet Chayel explains how to analyze streaming data from Kinesis with Spark Streaming and Zeppelin.

— More Spark Streaming: on the Cloudera blog, Jam Kunigk explains how to detect web traffic anomalies with Flume, Spark Streaming, and Impala.

Perspectives

— Robert Hof interviews Hortonworks’ CEO Rob Bearden but does not ask him about the company’s market value, currently about a third of what it was at IPO.

— GridGain plants a piece by CEO Dmitriy Setrakyan suggesting that Apache Spark and Apache Ignite work well together.

— Dave Ramel touts something called Koverse, which does everything.

— In The Register, Billy MacInnes assesses the new IBM. He doesn’t approve.

— In a surprisingly ill-informed piece, Serdar Yegulalp argues that four languages pose a challenge to Python: Swift, Go, Julia, and R.

— In Forbes, Bernard Marr summarizes a study which proposes to explain why analytics investments have yet to pay off. The study does not live up to its premise, as it fails to show that analytics investments have not paid off.

— Srini Penchikala reviews Big Data Analytics with Spark and interviews the author.

Commercial Announcements

— Databricks announces a strategic partnership agreement and investment from In-Q-Tel, a not-for-profit organization that supports the U.S. Intelligence Community.

— DataStax announces DataStax Enterprise 5.0, with new stuff. I don’t see anything really exciting not previously announced.

— BlueData announces availability of its EPIC Big-Data-as-a-Service on public cloud — AWS, Azure, Google and “other”.

Big Analytics Roundup (May 23, 2016)

Google announces that it has designed an application-specific integrated circuit (ASIC) expressly for deep neural nets. Tech press goes bananas. The chips, branded Tensor Processing Units (TPUs) require fewer transistors per operation, so Google can fit more operations per second into the chip. In about a year of operation, Google has achieved an order of magnitude improvement in performance per watt for machine learning.

Google’s Felipe Hoffa summarizes Mark Litwintschik’s work benchmarking different platforms with the New York City Taxi and Limo Commission’s public dataset of 1.1 billion trips. So far, Mark has tested PostgreSQL on AWS, ElasticSearch on AWS, Spark on AWS EMR, Redshift, Google BigQuery, Presto on AWS and Presto on Cloud Dataproc. Results make Google look good, but you should read Mark’s original posts.

Meanwhile, IBM fires more people. More here and here.

Open Data Science Conference

The second annual Open Data Science Conference (ODSC) East met in Boston over the weekend. Attendance doubled from last year, to 2,400.

Registration was a snafu, because the conference organizers did not accurately predict walk-in traffic or staffing needs. The jokes write themselves.

Content was excellent. Keynoters included Stefan Karpinski (Julia co-creator), Kirk Borne of Booz Allen Hamilton, Ingo Mierswa, CTO of RapidMiner and Lukas Biewald, CEO of Crowdflower. Track leaders included JJ Allaire and Joe Cheng of RStudio, Usama Fayyad of Barclays and John Thompson of the US Census Bureau. Sponsors included Basis Technology, CartoDB, CrowdFlower, Dataiku, DataRobot, Dato, Exaptive, Facebook, H2O.ai, MassMutual, McKinsey, Metis, Microsoft, RapidMiner, SFL Scientific and Wayfair.

Prompted by a tweet, I stopped at the Dataiku table. The conversation went like this:

  • Me: What does Dataiku do, in 25 words or less?
  • Dataiku: DataRobot.
  • Me: What?
  • Dataiku: We do what DataRobot does.

At this point, it was clear to me that Mr. Dataiku either did not know what DataRobot does, or thought I don’t know what DataRobot does. So I changed the subject.

The next ODSC event is in October, in London.

Explainers

— Michael Armbrust and Tathagata Das explain Structured Streaming in Spark 2.0

— Adrian Colyer goes 5 for 5 for the week:

— Tim Hunter, Hossein Falaki and Joseph Bradley explain HyperLogLog and Quantiles in Spark.

— Microsoft’s Raymond Laghaeian explains how to use Azure ML predictions in Google Spreadsheet.

Perspectives

— Serdar Yegulalp cites PayScale data in noting that if you know Scala, Go, Python and Spark you can expect to make more money.

— Tim Spann weighs the advantages of Java and Scala, and explains DL4J.

— Sam Dean celebrates Drill’s first anniversary.

— Taylor Goetz delivers a brief history of Apache Storm.

Open Source Announcements

— MongoDB releases a new Spark Connector.

— Apache Tajo announces Release 0.11.3, with five bug fixes.

— Apache Mahout announces Release 0.12.1, a maintenance release that resolves an issue with Flink integration.

Commercial Announcements

— RedPoint Global snags a $12 million “C” round.

— TIBCO announces something called Accelerator for Apache Spark, a bundle of tools that connect TIBCO products with open source packages. While TIBCO refers to this component as open source, the software is available only to TIBCO customers, which means it isn’t Free and Open Source.

— MapR applauds itself.

Big Analytics Roundup (May 2, 2016)

Movidius ups the ante for trade show trinkets by releasing what journos describe as supercomputing, neural computing power, vision processing, deep learning, and artificial intelligence on a USB drive.  Roundup here.

Movidius-Fathom-Key-Product-shot

Last November, IBM’s Paul Zikopoulos snarked at Cloudera for not supporting SparkR. Cloudera’s Sean Owen, responding to a query in the Cloudera Community, notes that SparkR “does not work with other resource managers,” and does not work unless R is installed on the data nodes. Sean also notes that Cloudera cannot redistribute R because it is under GPL license. Data scientist Iraklis Tsatsoulis explains how to make SparkR work in Cloudera. Cloudera’s response isn’t completely satisfactory — the GPL license does not prohibit Cloudera from redistributing R, for example — but it is based on actual working experience with the product, which IBM clearly does not have.

Turning to important matters, a group at the Technical University of Munich has a machine learning engine that predicts who will die in Game of Thrones. Not very well, it seems; they blew it on Roose Bolton. Oops, spoiler.

Screen Shot 2016-05-02 at 1.21.19 PM

Explainers

— Adrian Colyer explains GeePS, a Deep Learning framework for clusters of GPUs. Put that on a thumb drive and we can talk.

— On the Altiscale blog Professor Jimmy Lin compares local installations, virtual machine, IaaS providers and Altiscale’s Hadoop-as-a-Service offering for teaching students about Big Data. Spoiler: he likes Altiscale.

— Two benchmarks from the Cloudera Engineering Blog:

  • Devadutta Ghat et.al. explain results from benchmarking Impala 2.5 with TPC queries. They claim an average speedup of 4.35X over Impala 2.3 for TPC-DS.
  • Allstate’s Don Drake explains results of a test comparing Spark 1.6 performance with Avro and Parquet, with CSV as a baseline. Drake ran a multi-step benchmark with a narrow table and a wide table. Results: the Spark job ran faster with Parquet than Avro, markedly so for the wide data set, which makes sense since it’s columnar. Also, performance with CSV sucked.

— Three items from MapR’s Converge blog:

  • Nick Amato explains how to predict Airbnb listing prices with scikit-learn and Spark.
  • Mathieu Dumoulin explains Deep Learning with the CaffeOnSpark package.
  • Nicolas A Perez explains how to do Twitter sentiment analysis with Spark Streaming.

— Corentin Kerisit explains RDD partitioning in Spark.

Perspectives

— An anonymous blogger at CBInsights notes that big tech companies are paying big bucks for AI companies, so if you’re running a startup make sure you put AI in the name.

— Alexander Wissner-Gross weighs in on the “datasets versus algorithms” debate. My take: data trumps algorithms.

— Google streams engineer Tyler Akidau discusses streaming systems versus batch processing, which is like asking Mr. Fox for his perspective on chickens.

— David Weldon continues his series of interviews with people at Strata + Hadoop: Ravi Dharnikota of SnapLogic, who heard a lot of talk about streaming, Spark and data lakes.

— Alan Earls touts Amazon Machine Learning without understanding it.

Jack Vaughan interviews eBay’s Debashis Saha, who discusses Kylin and other stuff.

Open Source Announcements

— The Apache Software Foundation announces that Apache Apex has graduated to top level status. Apex, for streaming analytics, is the open source version of DataTorrent. Jessica Davis reports.

— North Bridge and Black Duck release their tenth annual survey of people who like open source.

— Apache Flink 1.0.2 ships with bug fixes and a new capability to integrate with RocksDB. So now, you can have Flink on Rocks.

Commercial Announcements

— Google’s DeepMind AI unit announces that they will use TensorFlow instead of Torch for all future work.

— Three guys exit Pivotal, start a company named SnappyData, land a tiny “A” round from Pivotal and GE Digital and propose to build something like GemFire, but on Spark. More here.

— Levyx announces a small “A” round. Levyx offers a version of Spark optimized to run on solid state/Flash memory.

— Tiny consulting firm Xentaurs announces a partnership with Mesosphere. And not just any partnership; it’s a strategic partnership. Actually, they just joined the DC/OS community.

Big Analytics Roundup (November 16, 2015)

Just three main stories this week: possible trouble for a pair of analytic startups; Google releases TensorFlow to open source; and H2O delivers new capabilities at its annual meeting.

In other news, the Spark team announces Release 1.5.2, a maintenance release; and the Mahout guy announces Release 0.11.1, with bug fixes and performance improvements. (h/t Hadoop Weekly)

Two items of note from the Databricks blog:

— Darin McBeath describes Elsevier’s Spark use case and introduces spark-xml-utils, a Spark package contributed by his team.  The package enables the Spark user to filter documents based on an Path expression, return specific nodes for an Path/XQuery expression and transform documents using an XLST stylesheet.

— Rachit Agarwal and Anurag Khandelwal of Berkeley’s AMPLab introduce Succinct, a distributed datastore for queries on compressed data.   They announce release of Succinct Spark, a Spark package that enables search, count, range and random access queries on compressed RDDs.  The authors claim a 75X performance advantage over native Spark using Succinct as a document store,

Three interesting stories on streaming data:

  • In a podcast, Data Artisans CTO Stephan Ewen discusses Flink, Spark and the Kappa architecture.
  • Techalpine’s Kaushik Pal compares Spark and Flink for streaming data.
  • Will McGinnis helps you get started with Python and Flink.

(1) Analytic Startups in Trouble

In The Information, Steve Nellis and Peter Schulz explain why startups return to the funding well frequently — and why those that don’t may be in trouble.  Venture funding isn’t a perfect indicator of success, but is often the only indicator available.  On the list: Skytree Software and Alpine Data Labs.

(2) Google Releases TensorFlow for Machine Learning

On the Google Research blog, Google announces open source availability of TensorFlow.  TensorFlow is Google’s second generation machine learning system; it supports Deep Learning as well as any computation that can be expressed as a flow graph.   Read this white paper for details of the system.  At present, there are Python and C++ APIs;  Google notes that the C++ API may offer some performance advantages.

Video intro here.

In Wired, Cade Metz reports; Erik T. Mueller dismisses; and Metz returns to note that Deep Learning can leverage GPUs, and that AI’s future is in data, as if we didn’t know these things already.

On Slate, Will Oremus feels the buzz.

On his eponymous blog, Sachin Joglekar explains how to do k-means clustering with TensorFlow.

Separately, in VentureBeat, Jordan Novet rounds up open source frameworks for Deep Learning.

(3) H2O.ai Releases Steam

It’s not a metaphor.  At its second annual H2O World event, H2O releases Steam, an open source data science hub that bundles model selection, model management and model scoring into a single container for elastic deployment.

On the H2O Blog, Yotam Levy wraps Day One, Day Two and Day Three of the H2O World event.  Speaker videos are here, slides here.  (Registration required.)  Some notable presentations:

— H2O: Tomas Nykodym on GLM; Mark Landry on GBM and Random Forests; Arno Candel on Deep Learning; Erin LaDell on Ensemble Modeling.

— Michal Malohlava of H2O and Richard Garris of Databricks explain how to run H2O on Databricks Cloud.  Separately, Michal demonstrates Sparkling Water, a Spark package that enables a Spark user to call H2O algorithms; Nidhi Mehta leads a hands-on with PySparkling Water;  and Xavier Tordoir of Data Fellas exhibits Interactive Genomes Clustering with Sparkling Water on the Spark Notebook.

— Szilard Pafka of Epoch summarizes his work to date benchmarking R, Python, Vowpal Wabbit, H2O, xgboost and Spark MLLib.  As reported previously, Pafka’s benchmarks show that H2O and xgboost are the best performers; they are faster and deliver more accurate models.

As reported in last week’s roundup, H2O.ai also announces a $20 million “B” round.