The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is more agile when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive Toolkit, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. R ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDnuggets poll; and first in the Rexer survey. It ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles of a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDnuggets poll, 46% said they had used Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Package Index (PyPI). Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. In Databricks’ annual user survey, 18% of respondents reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDnuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.
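To make the steady-state argument concrete, here is a minimal sketch in Python. It is not tied to Flink or any other streaming engine. It fits a one-variable least-squares line two ways: once on the full batch, and once by updating running sums one event at a time. The data-generating process, the function and class names, and the parameters are all hypothetical, chosen only for illustration.

```python
import random


def fit_batch(xs, ys):
    """Least-squares fit of y = intercept + slope * x on the full batch."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


class StreamingFit:
    """The same model, maintained from running sums one event at a time."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xy = 0.0
        self.sum_xx = 0.0

    def update(self, x, y):
        # Each event only touches the running sums; no history is stored.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_xx += x * x

    def coefficients(self):
        n = self.n
        slope = (n * self.sum_xy - self.sum_x * self.sum_y) / (
            n * self.sum_xx - self.sum_x ** 2
        )
        intercept = (self.sum_y - slope * self.sum_x) / n
        return intercept, slope


def simulate_steady_state(n, seed=0):
    """Hypothetical steady-state process: y = 2 + 0.5 * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


xs, ys = simulate_steady_state(5000)
batch_intercept, batch_slope = fit_batch(xs, ys)

stream = StreamingFit()
for x, y in zip(xs, ys):
    stream.update(x, y)
stream_intercept, stream_slope = stream.coefficients()

print(batch_intercept, batch_slope)
print(stream_intercept, stream_slope)
```

Under these assumptions the two fits agree up to floating-point rounding: when the process is stable, the order and timing in which events arrive add no information to the model.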

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which raises the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.
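For readers unfamiliar with the term, stochastic gradient descent updates a model's parameters using the gradient computed on one example (or a small batch) at a time. The Python sketch below illustrates the idea on a one-variable linear model. It is not SINGA's implementation; the function name, learning rate, and toy data are assumptions made for illustration.

```python
import random


def sgd_linear(data, learning_rate=0.01, epochs=50, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on 0.5 * error**2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the examples in a new random order
        for x, y in samples:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Hypothetical noise-free toy data drawn from y = 3x + 1.
toy_data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(-20, 21)]
w, b = sgd_linear(toy_data)
print(w, b)
```

Each update touches a single example, which is what makes the method practical for the large training sets typical of deep learning.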

The team delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed Machine Learning Community (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.
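As a rough illustration of gradient boosting, the general technique that XGBoost optimizes, the sketch below fits a sequence of one-split regression trees (stumps) to the residuals of a squared-error loss. This is plain gradient boosting in pure Python, not XGBoost's algorithm or API; every name, parameter, and data point here is hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold on x that best splits the residuals (least squares)."""
    best = None
    # Excluding the largest x guarantees both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=100, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical toy data: y = x squared on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]

model = boost(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each round corrects a fraction of the error left by the previous rounds, so many weak learners add up to a strong one.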

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under the GNU General Public License (GPL). Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.
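To illustrate what representing a computation as a symbolic graph means, here is a toy sketch in pure Python. It does not use Theano's API; the classes `Var`, `Const`, `Add`, and `Mul` are hypothetical. Expressions are first built as a graph of nodes and only afterwards evaluated or differentiated, which is the property that lets a framework inspect and optimize the graph before running it.

```python
class Node:
    """Base class: operators build new graph nodes instead of computing values."""

    def __add__(self, other):
        return Add(self, wrap(other))

    def __radd__(self, other):
        return Add(wrap(other), self)

    def __mul__(self, other):
        return Mul(self, wrap(other))

    def __rmul__(self, other):
        return Mul(wrap(other), self)


def wrap(value):
    """Turn plain numbers into constant nodes."""
    return value if isinstance(value, Node) else Const(value)


class Const(Node):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value

    def grad(self, var):
        return Const(0.0)


class Var(Node):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]

    def grad(self, var):
        return Const(1.0 if var is self else 0.0)


class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

    def grad(self, var):
        # Sum rule: the result is itself a new symbolic graph.
        return Add(self.left.grad(var), self.right.grad(var))


class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)

    def grad(self, var):
        # Product rule, again expressed as graph construction.
        return Add(
            Mul(self.left.grad(var), self.right),
            Mul(self.left, self.right.grad(var)),
        )


x, y = Var("x"), Var("y")
expr = x * x + 3 * x * y      # builds a graph; nothing is computed yet
d_expr_dx = expr.grad(x)      # symbolic derivative with respect to x
d_expr_dy = expr.grad(y)      # symbolic derivative with respect to y

env = {"x": 2.0, "y": 5.0}
value = expr.evaluate(env)
slope_x = d_expr_dx.evaluate(env)
slope_y = d_expr_dy.evaluate(env)
print(value, slope_x, slope_y)
```

Because the derivative is just another graph, the same machinery that evaluates the model can evaluate its gradients, which is what training a neural network requires.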

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for cuDNN v5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed Machine Learning Community (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, MATLAB, and JavaScript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. It is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC), is a deep learning framework released under an open source BSD license. Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++. Users interact with Caffe through a Python API or through a command line interface. Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Big Analytics Roundup (July 5, 2016)

Quite a few open source announcements this week. One of the most interesting is Apache Bahir, which includes a number of bits spun out from Apache Spark. It’s another indicator of the size and strength of Spark, in case anyone needs a reminder.

In other news, Altiscale and H2O.ai concurrently develop time travel: both vendors claim to support Spark 2.0, which isn’t generally available yet. The currently available Spark 2.0 preview release is not a stable release and the Spark team does not guarantee API stability. So at minimum anyone claiming to support Spark 2.0 will have to retest with the GA release.

Andrew Brust summarizes news from Hadoop Summit.

Microsoft’s Bill Jacobs explains Apache Spark integration through Microsoft R Server.  (Short version: Microsoft R previously pushed processing down to MapReduce, and now pushes down to Spark.) In a test, Microsoft found that shifting from MapReduce to Spark produced a 6X speedup, which is similar to what IBM achieved when it did the same thing with SPSS Analytics Server. Bill’s claim of 125X speedup is suspicious — he compares the performance of Microsoft R’s ScaleR distributed GLM algorithm running in a five-node Spark cluster with running GLM with an unspecified CRAN package on a single machine.

Owen O’Malley benchmarks file formats, concludes nothing. But it was fun!  Pro tip: if you’re going to spend time running benchmarks, use a standard TPC protocol.

Denny Lee introduces Databricks’ new Guide to getting started with Spark on Databricks.

Top Read/Watch

On YouTube and SlideShare: Slim Baltagi, Director of Enterprise Architecture at Capital One, presents his analysis of major trends in big analytics at Hadoop Summit.

Explainers

— In the second of a three-part series, Databricks’ Bill Chambers explains how to build data science applications on Databricks. Part one is here.

— William Lyon explains graph analysis with Neo4j and Game of Thrones, concludes that Lancel Lannister isn’t very important to the narrative.

— On the AWS Big Data Blog, Sai Sriparasa explains how to transfer data from EMR to RDS with Sqoop.

— In part one of a series, LinkedIn’s Kartik Paramasivam disses Lambda, explains how to solve hard problems in stream processing with Apache Samza.

— Hortonworks’ Vinay Shukla and others explain the roadmap for Apache Zeppelin.

— Rajat Jaiswal explains Azure Machine Learning in the first of a multi-part series. It’s on DZone, which means the content was ripped from some other source, but I can’t find the original.

— A blogger named junkcharts explains the importance of simplicity in visualization.

Perspectives

— Roger Schank, who wrote the book on cognitive computing, parses IBM’s claims for Watson. He isn’t impressed.

— Werther Krause offers some pretty good recommendations for building a data science team.

Open Source Announcements

— The Apache Software Foundation announces Apache Bahir as a top-level project. Bahir aims to curate extensions for distributed analytic platforms. Initial bits include toolkits for streaming akka, streaming mqtt, streaming twitter and streamingmq. The team includes 16 committers from Databricks, 4 from UC Berkeley, 3 from Cloudera and 13 others. Sam Dean reports.

— H2O.ai announces Sparkling Water 2.0. Sparkling Water is an H2O API for Spark, and a registered Spark package. Stories here, here, here, and here. Among the claimed enhancements:

  • Support for Apache Spark 2.0 and “backward compatibility with all previous versions.”
  • The ability to run Apache Spark and Scala through H2O’s web-based Flow UI.
  • Support for the Apache Zeppelin notebook.
  • H2O feature improvements and visualizations for MLlib algorithms, including the ability to score feature importance.
  • The ability to build Ensembles using H2O plus MLlib algorithms.
  • The power to export MLlib models as POJOs (Plain Old Java Objects).

— Alluxio (née Tachyon) announces Release 1.1. (Alluxio is an open source project for in-memory virtual distributed storage). Key bits include performance improvements, including master metadata scalability, worker scalability and better support for random I/O; improved access control features; usability improvements; and integration with Google Compute Engine.

— Apache Drill announces Release 1.7.0, with bug fixes and minor improvements.

— Qubole announces Quark, an open source project that optimizes SQL across storage platforms.

— MongoDB releases its own connector for Spark, supplementing the existing package developed by Stratio.

Commercial Announcements

— Altiscale claims support for Spark 2.0.

— AtScale announces a reseller agreement with Hortonworks.

— GridGain Systems announces Professional Edition 1.6, the commercially licensed enhanced version of Apache Ignite. Release 1.6 includes native support for Apache Cassandra.

— Hortonworks announces Microsoft Azure HDInsight as its premier cloud solution. They should have noted that Azure is Hortonworks’ only cloud solution.

— Zoomdata announces certification on the MapR Converged Data Platform.

Big Analytics Roundup (June 20, 2016)

Light news this week — everyone is catching up from Spark Summit, it seems. We have a nice crop of explainers, and some thoughts on IBM’s “Data Science Experience” announcement.

On his personal blog, Michael Malak recaps the Spark Summit.

Teradata releases a Spark connector for Aster, so Teradata is ready for 2014.

On KDnuggets, Gregory Piatetsky publishes a follow-up to results of his software poll, this time analyzing which tools tend to be used together.

In Datanami, Alex Woodie asks if Spark is overhyped, quoting extensively from some old guy. Woodie notes that it’s difficult to track the number of commercial vendors who have incorporated Spark into their products. Actually, it isn’t:

[Screenshot]

And yes, there are a few holdouts in the lower left quadrants.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

IBM Data Science Experience

Unless you attended the recent Spark Summit with a bag over your head, you’re aware that IBM announced something. An IBM executive wants to know if I heard the announcement.  The answer is yes, I saw the press release and the planted stories, but IBM’s announcements are — shall we say — aspirational: IBM is announcing a concept. The service isn’t in limited release, and IBM has not revealed a date when the service will be available.

It’s hard to evaluate a service that IBM hasn’t defined. Media reports and the press release are inconsistent — all stories mention Spark, Jupyter, RStudio and R; some stories mention H2O, others mention Cplex and other products. Insiders at IBM are in the dark about what components will be included in the first release.

Evaluating the release conceptually:

  • IBM already offers a managed service for Spark; it’s less flexible than Databricks or Qubole, and not as rich as Altiscale or Domino Data.
  • Unlike Qubole and Databricks, IBM plans to use Jupyter notebooks and RStudio rather than creating an integrated development environment of its own.
  • R and RStudio in the cloud are already available in AWS, Azure and Domino. If IBM plans to use a vanilla R distribution, it will be less capable than Microsoft’s enhanced R distribution available in Azure.
  • A managed service for H2O is a good thing, if it happens. There is no formal partnership between IBM and H2O.ai, and insiders at H2O seem surprised by IBM’s announcement. Of course, it’s already possible to implement H2O in any IaaS cloud environment, and H2O has users on AWS, Azure and Google Cloud platforms already.

Bottom line: IBM’s “Data Science Experience” is a marketing wrapper around an existing service, with the possibility of adding new services that may or may not be as good as offerings already in the marketplace. We’ll take another look when IBM actually releases something.

Explainers

— Davies Liu and Herman van Hovell explain SQL subqueries in Spark 2.0.

— On the MapR blog, Ellen Friedman explains SQL queries on mixed schema data with Apache Drill.

— Bill Chambers publishes the first of three parts on writing Spark applications in Databricks.

— In TechRepublic, Hope Reese explains machine learning to smart people. For everyone else, there’s this.

— Carla Schroder explains how Verizon Labs built a 600-node bare metal Mesos cluster in two weeks.

— On YouTube, H2O.ai’s Arno Candel demonstrates TensorFlow deep learning on an H2O cluster.

— Jessica Davis compiles a listicle of Tech Giants who embrace open source.

— Microsoft’s Dmitry Pechyoni reports results from an analysis of 600 million taxi rides using Microsoft R Server on a single instance of the Data Science Virtual Machine in Azure.

Perspectives

— InformationWeek’s Jessica Davis wonders if Microsoft will keep LinkedIn’s commitment to open source. LinkedIn’s donations to open source have less to do with its “commitment”, and more to do with its understanding that software is not its core business.

— Arthur Cole wonders if open source software will come to rule the enterprise data center as a matter of course. The answer is: it’s already happening.

Open Source Announcements

— Apache Beam (incubating) announces version 0.1.0. Key bits: SDK for Java and runners for Apache Flink, Apache Spark and Google Cloud Dataflow.

— Apache Mahout announces version 0.12.2, a maintenance release.

— Apache SystemML (incubating) announces release 0.10.0.

Commercial Announcements

— Altiscale announces the Real-Time Edition of Altiscale Insight Cloud, which includes Apache HBase and Spark Streaming.

— Databricks announces availability of its managed Spark service on AWS GovCloud (US).

— Qubole announces QDS HBase-as-a-Service on AWS.

Big Analytics Roundup (June 13, 2016)

Spark Summit 2016 met last week in SFO. There were many cool things; I will publish a separate report when presentations and videos are available.

KDnuggets releases results of its annual poll on data science software. Key findings:

  • Python use is up 51%, almost catches up to R, the #1 choice.
  • Excel and Tableau usage are up 47% and 49%, respectively.
  • Spark usage is up 91%, overtakes Hadoop.
  • SAS is down big time, drops from the top ten.

Meanwhile, Alex Woodie wraps statistics on Spark adoption, and Qubole’s Ari Amster reports on Spark usage among Qubole users.

Tim Spann recaps the week in Hadoop.

Spark Summit: Roundup of Roundups

— On the Databricks blog, Wayne Chan, Dave Wang, Jules Damji and Denny Lee recap highlights from the Summit.

— Jessica Davis rounds up the highlights.

— Jack Vaughan surrounds the story, quotes some old guy.

— Sam Dean summarizes what you need to know.

— Alex Handy collects the key bits.

— Andrew Brust separately corrals Day One and Day Two.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

Spark Summit Europe, Brussels, October 25-27 (closing date TBA)

Top Read

Adrian Colyer summarizes a paper on identifying architectural debt in software.

Explainers

— Deenar Torasker explains the new capabilities of HDFS.

— Ron Bodkin explains key considerations when designing continuous apps, in the second of a three part series. Part one is here.

— On his eponymous blog, Jesse Steinweg-Woods explains Gradient Boosted Trees with XGBoost in Python.

— Adam Warski explains how Kafka Streams fits into the stream processing landscape.

Perspectives

— H2O.ai’s Vinod Iyengar objects to what he calls the fragmentation of Spark support, correctly noting that Cloudera and Hortonworks support different versions of Spark in their distributions. Of course, nobody is obligated to use Spark with Cloudera and Hortonworks.

— From the Spark Summit on YouTube: Ben Lorica leads a panel discussion of incredibly smart and distinguished people, plus some old guy.

— Altiscale’s Barbara Lewis presents ten use cases for Big Data.

— Tim Wallis believes that AI will relieve boredom.

— Sam Dean touts Grappa, Drill and Kafka as successors to Spark. Grappa is going nowhere. Drill is great if all you want to do is SQL, and Kafka is great if all you want to do is streaming. Pro tip: there are no real-world analytic applications where all you want to do is streaming.

— Allen Downey opines that statistical tests are inflexible and opaque. Funny, my college roommate said the same thing when he flunked his Stat 101 mid-term.

Open Source Announcements

— LinkedIn announces release of PhotonML, a machine learning library for Spark. Feature detail here.

— Google releases TensorFlow 0.9.0, with iOS support. Speculation about deep learning on your phone ensues.

— Twitter donates DistributedLog to Apache.

Commercial Announcements

— Databricks announces general availability for the Databricks Community Edition, and completion of the first phase of Databricks Enterprise Security framework.

— Microsoft announces general availability for its managed Spark service in HDInsight, and summer availability for the Spark pushdown capability in R Server. The company also announced PowerBI support for Spark Streaming, which is confusing for those who thought PowerBI already supported Spark Streaming.

— IBM announces limited preview of a managed service branded as the Data Science Experience. IBM is coy about the details; the service definitely includes Spark, Jupyter and RStudio, H2O and “curated data sets”, and may include other bits. The service itself looks promising, but IBM’s claim to offer the “first development environment for Apache Spark” is BS.

— In an oddly opaque press release, H2O announces that it is “working with” IBM. H2O is open source software, and IBM requires no permission from H2O.ai for use or distribution; presumably, H2O will offer support contracts to users. H2O.ai did not respond to request for comment.

— Splice Machine announces plans to go open source; a company insider says they plan to donate the software to Apache. Dave Ramel reports.

Big Analytics Roundup (May 2, 2016)

Movidius ups the ante for trade show trinkets by releasing what journos describe as supercomputing, neural computing power, vision processing, deep learning, and artificial intelligence on a USB drive.  Roundup here.

Last November, IBM’s Paul Zikopoulos snarked at Cloudera for not supporting SparkR. Cloudera’s Sean Owen, responding to a query in the Cloudera Community, notes that SparkR “does not work with other resource managers,” and does not work unless R is installed on the data nodes. Sean also notes that Cloudera cannot redistribute R because it is under GPL license. Data scientist Iraklis Tsatsoulis explains how to make SparkR work in Cloudera. Cloudera’s response isn’t completely satisfactory — the GPL license does not prohibit Cloudera from redistributing R, for example — but it is based on actual working experience with the product, which IBM clearly does not have.

Turning to important matters, a group at the Technical University of Munich has a machine learning engine that predicts who will die in Game of Thrones. Not very well, it seems; they blew it on Roose Bolton. Oops, spoiler.

Explainers

— Adrian Colyer explains GeePS, a Deep Learning framework for clusters of GPUs. Put that on a thumb drive and we can talk.

— On the Altiscale blog, Professor Jimmy Lin compares local installations, virtual machines, IaaS providers and Altiscale’s Hadoop-as-a-Service offering for teaching students about Big Data. Spoiler: he likes Altiscale.

— Two benchmarks from the Cloudera Engineering Blog:

  • Devadutta Ghat et al. explain results from benchmarking Impala 2.5 with TPC queries. They claim an average speedup of 4.35X over Impala 2.3 for TPC-DS.
  • Allstate’s Don Drake explains results of a test comparing Spark 1.6 performance with Avro and Parquet, with CSV as a baseline. Drake ran a multi-step benchmark with a narrow table and a wide table. Results: the Spark job ran faster with Parquet than Avro, markedly so for the wide data set, which makes sense since Parquet is columnar. Also, performance with CSV sucked.
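Why a columnar format helps most on wide tables can be shown with a toy model. The sketch below is not how Spark, Avro or Parquet are implemented, and it assumes (the benchmark summary above does not say) that the query needs only a couple of columns. The table shape, the `wanted` columns and both function names are hypothetical.

```python
# Hypothetical wide table: 1,000 rows by 200 columns of small integers.
N_ROWS, N_COLS = 1_000, 200
rows = [[r * N_COLS + c for c in range(N_COLS)] for r in range(N_ROWS)]

# Column-oriented copy of the same data: one list per column.
columns = [list(col) for col in zip(*rows)]

wanted = [3, 57]  # assumption: the query needs only two of the 200 columns


def scan_row_layout(rows, wanted):
    """Row layout: each record is read whole, then the wanted fields are kept."""
    values_read = 0
    result = []
    for record in rows:
        values_read += len(record)
        result.append([record[c] for c in wanted])
    return result, values_read


def scan_column_layout(columns, wanted):
    """Column layout: only the wanted columns are read at all."""
    selected = [columns[c] for c in wanted]
    values_read = sum(len(col) for col in selected)
    result = [list(vals) for vals in zip(*selected)]
    return result, values_read


row_result, row_cost = scan_row_layout(rows, wanted)
col_result, col_cost = scan_column_layout(columns, wanted)
print(f"values read, row layout:    {row_cost}")
print(f"values read, column layout: {col_cost}")
```

Both scans return the same answer, but the row layout touches every value in the table while the column layout touches only the two columns requested. The gap grows with the number of columns, which is consistent with the wide table showing the larger difference.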

— Three items from MapR’s Converge blog:

  • Nick Amato explains how to predict Airbnb listing prices with scikit-learn and Spark.
  • Mathieu Dumoulin explains Deep Learning with the CaffeOnSpark package.
  • Nicolas A Perez explains how to do Twitter sentiment analysis with Spark Streaming.

— Corentin Kerisit explains RDD partitioning in Spark.

Perspectives

— An anonymous blogger at CBInsights notes that big tech companies are paying big bucks for AI companies, so if you’re running a startup make sure you put AI in the name.

— Alexander Wissner-Gross weighs in on the “datasets versus algorithms” debate. My take: data trumps algorithms.

— Google streams engineer Tyler Akidau discusses streaming systems versus batch processing, which is like asking Mr. Fox for his perspective on chickens.

— David Weldon continues his series of interviews with people at Strata + Hadoop: Ravi Dharnikota of SnapLogic, who heard a lot of talk about streaming, Spark and data lakes.

— Alan Earls touts Amazon Machine Learning without understanding it.

— Jack Vaughan interviews eBay’s Debashis Saha, who discusses Kylin and other stuff.

Open Source Announcements

— The Apache Software Foundation announces that Apache Apex has graduated to top level status. Apex, for streaming analytics, is the open source version of DataTorrent. Jessica Davis reports.

— North Bridge and Black Duck release their tenth annual survey of people who like open source.

— Apache Flink 1.0.2 ships with bug fixes and a new capability to integrate with RocksDB. So now, you can have Flink on Rocks.

Commercial Announcements

— Google’s DeepMind AI unit announces that they will use TensorFlow instead of Torch for all future work.

— Three guys exit Pivotal, start a company named SnappyData, land a tiny “A” round from Pivotal and GE Digital and propose to build something like GemFire, but on Spark. More here.

— Levyx announces a small “A” round. Levyx offers a version of Spark optimized to run on solid state/Flash memory.

— Tiny consulting firm Xentaurs announces a partnership with Mesosphere. And not just any partnership; it’s a strategic partnership. Actually, they just joined the DC/OS community.

Big Analytics Roundup (April 18, 2016)

In hard news this week, Storm hits a milestone with Release 1.0, Google releases TensorFlow 0.8 with distributed computing support, and DataStax announces DataStax Enterprise Graph. And, following on NVIDIA’s DGX-1 announcement last week there are a number of items on Deep Learning featured below.

Deep Learning

— Adrian Colyer summarizes a paper that summarizes 900 other papers on Deep Learning.

— Data Science Central compiles a slew of links on Deep Learning.

— Nicole Hemsoth interviews NVIDIA Veep Marc Hamilton, who ruminates on the convergence of supercomputing and Deep Learning.

Explainers

— On the Pivotal Big Data blog, Alexey Grishchenko explains what’s up with Apache Hawq, the SQL-on-Hadoop-and-Greenplum engine that is now an Apache Incubator project. According to OpenHub, there’s a lot of activity on Hawq, and contributions are up sharply since it went Apache.

— In KDnuggets, Microsoft’s Brandon Rohrer publishes a handy pocket guide to data science.

— Nicholas A. Perez explains custom streaming sources in Spark.

— Ian Pointer explains Apache Beam, and how it aspires to be the uber-API.

— Abie Reifer explains Microsoft Azure HDInsight.

— Yong Feng of IBM’s Spark Technology Center explains results of a test run with Spark on Mesos.

— Gopal Wunnava explains geospatial intelligence with SparkR on Amazon EMR.

— IBM’s Fred Reiss explains SystemML, for those who missed his presentation at Spark Summit East.

— For masochistic sabermetricians, Nick Amato explains baseball statistics with Hive and Pig.

Perspectives

— Serdar Yegulalp reviews Apache Storm 1.0. He likes it.

— DataArtisans’ Kostas Tzoumas explains counting in streams, then touts Flink.

— Timothy Prickett Morgan reports on HPE’s efforts to put Spark on a Superdome. Results are interesting. But as with IBM running Spark on a mainframe, such efforts overlook a key benefit of Hadoop and Spark: the ability to avoid dealing with the likes of HPE and IBM.

— Katharine Kearnan interviews Nick Pentreath, one of the two Spark Committers IBM has hired. He predicts that in Spark 2.0, the ML pipeline API approaches parity with the MLlib API. Interestingly, he doesn’t expect a lot from SparkR.

— In Forbes, Chris Wilder recaps his visit to Google Cloud Platform NEXT 2016.

— Andrew Brust summarizes Hortonworks’ recent announcements, sees an emerging duopoly of Cloudera and Hortonworks. I’m not inclined to dismiss MapR and AWS so easily.

— Craig Stedman comments on Pivotal’s exit from the Hadoop distribution market, quotes some old guy wondering how much longer IBM will keep BigInsights alive. My take on Pivotal: honestly, I thought they exited a year ago.

— Cloud platform Altiscale’s Raymie Stata surveys Hadoop’s history, sees movement to the cloud.

— James Nunns wonders if the top Hadoop distributors can steal the show from Spark at Hadoop Summit 2016. If you count the number of times the word “Spark” appears in Hortonworks’ announcement, the answer is no.

— Ajay Khanna opines that absent data quality and metadata management, your data lake will turn into a data swamp.

— Nick Bishop interviews MSFT’s research chief, who assures him that AI is too stupid to wipe us out. I worry more about the chemtrails.

Open Source Announcements

— Apache Storm announces Release 1.0.0, with many enhancements. According to OpenHub, Storm is picking up steam, with 127 active contributors in the past 12 months.

— Google announces TensorFlow 0.8, with distributed computing support and new libraries for user-defined distributed models.

— Apache Mahout announces release of Mahout 0.12.0, with Flink bindings to the Samsara engine. Contributors from DataArtisans did most of the work, as most other contributors have long since exited this project.

Commercial Announcements

— DataStax announces DataStax Enterprise Graph (DSE Graph), built on Apache Cassandra and Apache TinkerPop (a graph computing framework). A year ago, DataStax acquired Aurelius, the commercial venture behind Titan, an open source distributed graph database; Titan uses Cassandra as a back end. DSE Graph includes extensions found in DataStax Enterprise, including security, search, analytics and monitoring tools. Alex Handy reports.

— Databricks announces new content for its Community Edition:

— Hortonworks previews HDP 2.4.2. Key bits:

  • Spark 1.6.1.
  • Spark SQL certified with ODBC.
  • Bug fixes for Spark/Oozie connection for Kerberos-enabled clusters.
  • Spark Streaming with Apache Kafka in a Kerberos-enabled cluster.
  • Spark SQL with ORC performance improvements.
  • Final technical preview of Apache Zeppelin with Kerberos, LDAP and identity propagation.

— Hortonworks also announces that Pivotal HDP is officially dead. Pivotal announces nothing.

— Teradata announces that its Think Big subsidiary is expanding its data lake and managed service offerings using Apache Spark. This is good news for the eight consultants at Think Big with Spark credentials, as it means less time spent on the bench. Meanwhile, Think Big contributes a distributed K-Modes implementation in PySpark to open source, the first such contribution since 2014. For some reason, they did not contribute it to Spark packages.

— Atigeo, a “compassionate technology company”, announces that it has added Spark 1.6 to its xPatterns platform.

— Lucidworks announces release of Lucidworks View, a component that simplifies development of applications on Solr and Spark.

— DataRPM, a “Cognitive Data Science” company with very little money, announces a partnership with Tamr, a data integration company with lots of money.

Big Analytics Roundup (April 4, 2016)

Strata + Hadoop World sparks a number of commercial announcements: AtScale has a new release, Microsoft previews R Server on HDInsight, and IBM puts Spark on a mainframe, FWIW. We also have a nice harvest of explainers and perspectives.

Slides from Strata available here.

The folks at Domino Data ask: Is XGBoost 10X faster than H2O? We’ll never know the answer, since they took down the post. I’m guessing the answer is “no.”

Databricks offers a collection of popular blog posts on Apache Spark as an eBook.

Explainers

On the Google Cloud Big Data Blog, Eric Anderson and Marian Dvorsky compare autoscaling in Dataflow/Beam to Spark and Hadoop. (h/t William Vambenepe)

Miles Yucht and Reynold Xin explain DeepSpark, a convolutional neural network that automates software development processes, such as writing test cases, fixing bugs and so forth.

Databricks’ Jules Damji explains how to process JSON data with Spark Datasets and DataFrames.

On the Airbnb engineering blog, Ricardo Bion explains how to scale data science with R.

Eduardo Ariño De La Rubia explains how The Climate Corporation created a high-throughput data science machine.

DataArtisans’ Kostas Tzoumas explains Flink internals, and how Flink counts elements in streams.

On the Insight Data Engineering blog, Daniel Blazevski explains Flink quadtrees.

H2O.ai’s Erin LeDell explains scalable ensemble learning with H2O. Also at Strata, Arno Candel explains why Deep Learning is eating your lunch.

On the Dataiku blog, someone named Margot explains automated model deployment with Data Science Studio.

On the DataTorrent blog, David Yan explains latency calculations in Apache Apex.

Christopher Crosbie explains SparkR on EMR, on the AWS Big Data blog.

Perspectives

Jack Vaughan notes the prominence of streaming analytics at Strata, quotes some old guy who thinks streaming is a thing.

On the Cloudera Vision Blog, Dan Sturman describes Cloudera’s response to what he characterizes as a software quality challenge.

Cloud vendor Altiscale’s Raymie Stata asks which is best for Spark and Hadoop: cloud or on-premises. Spoiler: he thinks you should choose cloud.

On LinkedIn, consultant Rick van der Lans touts Apache Drill.

Wikibon releases forecasts of Spark adoption and the Big Data market. You can either pay Wikibon for a subscription, or read George Leopold’s summary here or Mike Wheatley’s summary here.

Alex Woodie recaps Doug Cutting’s keynote at Strata+Hadoop.

On the tech blog for Berlin-based online retailer Zalando, Javier Lopez and Mihail Vieru recap a recently completed Flink versus Spark bakeoff. They like Flink’s low latency which, as a fashion retailer, they totally think they need. The bottom line, though, seems to be that DataArtisans is just a few stops away on the U-Bahn, so they chose Flink.

Brandon Butler summarizes the Microsoft and Google challenges to Amazon in the cloud.

InfoWorld’s Martin Heller reviews Databricks’ Spark service, likes it.

In TechCrunch, Josh Klahr lists seven things to watch for at Strata + Hadoop World, which is still worth reading even though the show came and went.

Talend CMO Ashley Stirrup suggests you sharpen your customer reflexes with Apache Spark. If you want to improve your actual reflexes, read this.

Open Source Announcements

ASF announces Apache NiFi 0.6.0, with Kerberos authentication for its REST API and support for Amazon Kinesis, AWS Lambda, Splunk, and Apache Cassandra. (h/t Hadoop Weekly)

Commercial Announcements

OLAP-on-Hadoop vendor AtScale announces release 4.0. Key new bits: fine-grained security that links every query to an end user, and an intelligent query optimizer that pushes down either as SQL or as MDX depending on the end user tool. AtScale has also added to its platform integration; it now supports Business Objects, Cognos, Excel, Jaspersoft, Qlik, MicroStrategy, PowerBI, Spotfire, and Tableau on CDH, HDP, HDInsight and MapR with Hive/Tez, Impala and Spark SQL, plus an impressive list of data storage formats. Mike Wheatley reports.

Data integration startup Tamr announces “compatibility” with Spark. The press release does not specify whether that means connectivity, push-down integration or something else. Tamr is not certified by Databricks, and has not published anything on Spark Packages.

Pouring new wine into old bottles, IBM delivers Spark on a mainframe, as promised last July.  IBM touts this as a way to perform analysis of your data “in place”, which is great if all of your data is stuck on a mainframe.

IBM partners with Lightbend, the company formerly known as Typesafe, to deliver Scala training through the Big Data University.

Altiscale announces partnership with Tableau, will add visualization to its managed service for Big Data.

Databricks announces availability of APIs to automate Spark infrastructure. On the Databricks blog, Dave Wang explains.

Microsoft announces preview of R Server for HDInsight and an update to Apache Spark for Azure HDInsight. R Server for HDInsight is a rebranded version of Revolution Analytics’ ScaleR, acquired last year. R Server is a distributed machine learning platform with an R API and push-down integration to MapReduce and Spark.

Flink promoter DataArtisans announces a 5.5 million Euro Series A financing round led by Intel Capital.

Dataiku announces a new release of Data Science Studio. The press release touts some new features, but I’ll refrain from commenting until the company posts release notes.

Big Analytics Roundup (March 28, 2016)

Microsoft’s chatbot fail wins the internet this week, but the most important story is Google’s new managed service for machine learning. Also leading the week: Mesosphere’s new funding round led by Microsoft and HPE, and more funding for Domo.

— Google Cloud Platform (GCP) adds the Google Cloud Machine Learning Platform to its suite of managed machine learning services, which already includes Google Cloud Vision API (Beta); Google Cloud Speech API (Limited Preview); and Google Cloud Translate API. GCP still offers the Prediction API, but it’s no longer a top-level service. The Machine Learning platform, currently in Limited Preview, works with TensorFlow models that you train offline and Dataflow for preprocessing, so you can work with data in Google Cloud Storage, BigQuery and other sources. It’s an impressive stack. A cloud of speculation and navel-gazing ensues.

— Mesosphere announces that it has closed a $73.5 million Series C round, with Microsoft and Hewlett Packard Enterprise taking lead roles. Mesosphere also announces version 1.0 of Marathon, a container orchestration service for DCOS, and a new product for source code management called Velocity.

— Domo announces that it has reached $100 million in “billings” and raised another $131 million on its Series D round at a sustained valuation of $2 billion. (Billings typically exceed GAAP revenue due to the effect of prepayments on multi-year contracts.)
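The parenthetical about billings versus GAAP revenue is easier to see with numbers. The sketch below uses a made-up contract, not anything Domo has disclosed, and assumes the simplest case: the customer prepays the whole contract and revenue is recognized evenly over the term. The function name and figures are hypothetical.

```python
def first_year_billings_and_revenue(contract_value: float, term_years: int):
    """Return (billings, recognized revenue) for year one of a contract
    that is prepaid in full and recognized evenly over its term."""
    billings = contract_value               # the full amount is invoiced up front
    revenue = contract_value / term_years   # assumption: straight-line recognition
    return billings, revenue


# Hypothetical contract: $300,000 prepaid for a three-year term.
billed, recognized = first_year_billings_and_revenue(300_000, 3)
deferred = billed - recognized  # sits on the balance sheet as deferred revenue

print(f"Year-one billings:  ${billed:,.0f}")
print(f"Year-one revenue:   ${recognized:,.0f}")
print(f"Deferred revenue:   ${deferred:,.0f}")
```

Under these assumptions the company reports $300,000 in billings but only $100,000 in revenue for the year, which is why a billings figure can run well ahead of revenue when multi-year prepayments are common.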

Explainers

— In the MIT Technology Review, Rachel Metz explains the Microsoft chatbot fail.

— Facebook’s Arun Sharma explains Dragon, a distributed graph query engine.

— Frances Perry and Tyler Akidau explain runners in Apache Beam.

— On the Netflix Tech Blog, Ben Schmaus et al. explain Mantis, a streaming analytics platform that drives alerts and dashboards.

— At a Flink Meetup in Sao Paulo, Slim Baltagi presents real-world use cases for streaming analytics.

— Two interesting posts on PySpark:

  • On the AWS Big Data Blog, Veronika Megler explains anomaly detection using PySpark, Hive and Hue.
  • On the MapR blog, Ben Sadeghi explains churn prediction using PySpark, MLlib and ML.

Perspectives

— Eric Kavanagh delivers a nice overview of the history of open source analytics.

— On the Qubole Blog, MediaMath’s Rory Sawyer describes the benefits of cloud-based data science infrastructure.

— In a somewhat turgid essay, Stitch Fix’s Jeff Magnusson argues that data scientists are thinkers and engineers are doers, then argues that engineers (the “doers”) should not do ETL, an argument that rebuts itself.

— Ian Allison profiles Seldon, an open source machine learning platform that specializes in content and product recommenders.

— In Datanami, Alex Woodie writes a confused piece on ‘overcoming Spark performance challenges’ that appears to be mostly about touting some new products.

— Ted Dunning previews his Strata presentation on streaming. Spoiler: he likes it.

— James Haight of Blue Hill Research offers an article teasing five things to watch for at Strata, but only details four. I feel cheated.

— Sam Charrington summarizes insights from Cloudera’s third annual analyst day. If you follow him on Twitter, you’ve already read this.

Open Source Announcements

— Airbnb donates Airflow, a workflow automation system, to Apache.

— KeystoneML, a machine learning pipeline framework that runs on Spark, releases version 0.3, with new solvers, new operators and a number of performance improvements. I continue to wonder why this AMPLab project isn’t part of the Spark ML library.

— Several Apache projects have new releases:

  • Apache Mahout 0.11.2 updates Spark support, includes performance enhancers and bug fixes.
  • BSP framework Apache Hama releases version 0.7.1 with bug fixes and a new scheduler.
  • OLAP-on-Hadoop project Apache Kylin delivers releases 1.3 and 1.5 in quick succession, skipping release 1.4. On the Apache Kylin technical blog, Hongbin Ma details the new bits in Release 1.3, and Li Yang explains Release 1.5.
  • SQL engine MRQL releases version 0.6, with new features for incremental query processing.

Commercial Announcements

— Altiscale announces the Altiscale Insight Cloud, an analytics-as-a-service platform that runs on top of the Altiscale Data Cloud. The service combines a number of popular tools, including Spark, Hive, Pig, Python, R, Mahout, Matlab and H2O. Altiscale also claims to include Revolution R, which is curious since Microsoft acquired and rebranded the product.

— Alteryx and Microsoft announce a partnership, which makes sense for both parties. Alteryx, a Windows-based product, fills a gap in Microsoft’s product line, and Azure greatly expands Alteryx’s market reach.

— DataRobot announces that it is certified on Cloudera, claims to be the only Cloudera partner that is certified on all of Cloudera’s bits, including Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels. George Leopold reports.

— Sense announces that it has been acquired by Cloudera. I’m struggling to understand why I should care.

Big Analytics Roundup (March 21, 2016)

Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.

— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.

— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:

  • Most popular technologies for math and data: Python and SQL.
  • Top paying technologies: Spark and Scala.
  • Top paying tech for data scientists: Scala, Spark and Hadoop.
  • Top tech stack for data scientists: Python + R + SQL.
  • Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
  • Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
  • Biggest challenge at work (all respondents): Unrealistic expectations.
  • Purchasing power of developers in South Africa: 25,713 Big Macs per year.

— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.

— Daniel Chalef of Domino Data analyzes data from Google Trends and StackOverflow, discovers that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchen’s blog on the popularity of analytics software. Search interest is one data point; Bob’s work with job postings offers a better picture of the actual state of the market.

— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.

Explainers

— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.

— This week from the morning paper:

  • Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
  • Colyer explains social engineering attacks and potential defenses.
  • Colyer explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scalability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.
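To put the 32-node figure in perspective, a common yardstick is parallel efficiency: observed speedup divided by node count, where 1.0 means perfect linear scaling. The short sketch below applies that definition to the numbers quoted above; the function name is hypothetical, and 12X is used as an upper bound because the text says "a little less than 12X".

```python
def parallel_efficiency(speedup_factor: float, nodes: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup_factor / nodes


nodes = 32
observed_speedup = 12.0  # upper bound taken from the item above
efficiency = parallel_efficiency(observed_speedup, nodes)

print(f"Parallel efficiency on {nodes} nodes: at most {efficiency:.1%}")
```

In other words, each added node contributes a bit over a third of what it would under ideal scaling, which is what "sublinear" means in practice here.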

— MapR’s Tugdual Grall explains what Spark is, what it does, and what sets it apart.

— In SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.

— On the Databricks blog, Denny Lee et al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.

— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.

— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.

Perspectives

— Jack Vaughan interviews some old guy who thinks Spark is a thing.

— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.

— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.

— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design.  Pure Hadoop is simple: MapReduce and HDFS.  Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.

— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python.  He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.

— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.

— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.

— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”

— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.

— In Martech Advisor, Ankush Gupta Mar interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.

— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.

— Srini Penchikala interviews the authors of Spark in Action, an upcoming book from Manning.

Teradata Watch

— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.

— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”

Open Source Announcements

— Three announcements from Apache projects:

  • Apex announces release 3.3.1 of the Malhar library, a maintenance release.
  • Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
  • Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.

Commercial Announcements

— SAP announces general availability for SAP HANA Vora, a tool that enables HANA users to query data in Hadoop and other distributed storage platforms through Spark. In CIO, Thor Olavsrud reports.

— Dataiku announces that it has hired two new Veeps to drive expansion in North America.

— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.

— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.

Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week beating the drum for its partnership with Hewlett-Packard Enterprise.  The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow.  We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.

Explainers

In the morning paper, Adrian Colyer explains trajectory data mining.

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates everyone on the state of the Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with DataTorrent Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender with the Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.

Perspectives

We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from CapitalOne, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and gives it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL and offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for H2O.ai.

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.