The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One, here, covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in OpenHub.  In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is more agile when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDnuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages.  Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles of a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDnuggets poll, 46% said they had used Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Package Index (PyPI). Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDnuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.
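To make the steady-state argument concrete, here is a minimal Python sketch of my own; it has nothing to do with Flink's or Spark's actual code. It fits a one-variable least-squares line two ways: once over the whole batch, and once by updating running sums one event at a time. The data generator, the class and function names, and the coefficients 2.0 and 0.5 are all assumptions chosen for illustration.

```python
import random


def fit_batch(points):
    """Ordinary least squares for y = a + b*x, computed over the whole batch."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


class StreamingFit:
    """The same model, updated one event at a time from running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        slope = (self.n * self.sxy - self.sx * self.sy) / (
            self.n * self.sxx - self.sx ** 2
        )
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Hypothetical steady-state process: y = 2.0 + 0.5*x plus noise that never changes.
rng = random.Random(42)
events = []
for _ in range(5000):
    x = rng.uniform(0.0, 10.0)
    events.append((x, 2.0 + 0.5 * x + rng.gauss(0.0, 1.0)))

batch_intercept, batch_slope = fit_batch(events)

stream = StreamingFit()
for x, y in events:
    stream.update(x, y)
stream_intercept, stream_slope = stream.coefficients()
```

The two fits agree up to floating-point rounding, which is the point: when the process is stable, the order in which the events arrive adds no information.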

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.
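As a rough sketch of what that kind of scoring involves, the Python below parses a tiny hand-written PMML regression model and applies it to a stream of records. The model, its field names (income, age) and its coefficients are made up for illustration. A real PMML file also carries a data dictionary, a mining schema, and often transformations, all of which this sketch ignores. None of this is Flink code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down PMML document exported by some other tool.
PMML_TEXT = """
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="income" coefficient="0.002"/>
      <NumericPredictor name="age" coefficient="-0.01"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def load_linear_model(pmml_text):
    """Return (intercept, {field: coefficient}) from a PMML RegressionTable."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find(".//pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score_stream(records, intercept, coefficients):
    """Lazily score each incoming record; no training happens here."""
    for record in records:
        yield intercept + sum(
            coef * record[name] for name, coef in coefficients.items()
        )


intercept, coefficients = load_linear_model(PMML_TEXT)

# Stand-in for an unbounded event stream.
incoming = iter([{"income": 1000.0, "age": 50.0}, {"income": 0.0, "age": 0.0}])
scores = list(score_stream(incoming, intercept, coefficients))
```

The model is built once, somewhere else, and the streaming side only evaluates it, which is why the enhancement looks straightforward.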

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADlib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which raises the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADlib

MADlib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed (Deep) Machine Learning Community (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.
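For readers who have not met gradient boosting, here is a minimal sketch of the general idea in plain Python: fit a sequence of one-split regression "stumps," each to the residuals left by the ensemble so far. This is not XGBoost's implementation or its API; XGBoost adds regularization, second-order gradient information and a great deal of systems engineering. The function names, learning rate and toy data are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Skip the largest x so the right-hand side of a split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(base, stumps, learning_rate, x):
    """Sum the base prediction and the shrunken contribution of every stump."""
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """For squared loss, the negative gradient is simply the residual."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [
            y - predict(base, stumps, learning_rate, x) for x, y in zip(xs, ys)
        ]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mse(xs, ys, base, stumps, learning_rate):
    errors = [(y - predict(base, stumps, learning_rate, x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Toy data: a smooth curve that no single stump can fit.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

LEARNING_RATE = 0.3
base, stumps = boost(xs, ys, rounds=50, learning_rate=LEARNING_RATE)
initial_error = mse(xs, ys, base, [], LEARNING_RATE)
final_error = mse(xs, ys, base, stumps, LEARNING_RATE)
```

Each stump is a weak model by itself; the training error falls because every round corrects what the previous rounds got wrong.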

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under the GNU General Public License (GPL). Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like Microsoft Cognitive Toolkit (formerly CNTK) and TensorFlow, Theano represents neural networks as a symbolic graph.
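To illustrate what "symbolic graph" means here, the toy Python below builds an expression as a graph of nodes first, and only evaluates or differentiates it afterward. It is a sketch of the concept, not Theano's API; the class and method names are invented for illustration.

```python
class Node:
    """A node in a symbolic expression graph; nothing is computed at build time."""

    def __add__(self, other):
        return Add(self, other)

    def __mul__(self, other):
        return Mul(self, other)


class Var(Node):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]

    def grad(self, wrt):
        return Const(1.0 if self is wrt else 0.0)


class Const(Node):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value

    def grad(self, wrt):
        return Const(0.0)


class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

    def grad(self, wrt):
        return self.left.grad(wrt) + self.right.grad(wrt)


class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)

    def grad(self, wrt):
        # Product rule, returned as a new graph rather than a number.
        return self.left.grad(wrt) * self.right + self.left * self.right.grad(wrt)


# Define the graph for a single linear unit: y = w * x + b.
w, x, b = Var("w"), Var("x"), Var("b")
y = w * x + b

# The derivative is itself a graph, built before any values are supplied.
dy_dw = y.grad(w)

# Only now do numbers enter the picture.
env = {"w": 3.0, "x": 2.0, "b": 1.0}
y_value = y.evaluate(env)
dy_dw_value = dy_dw.evaluate(env)
```

Because the whole expression exists as data before it runs, a framework can optimize it, differentiate it, or compile it for a GPU, which is the appeal of the approach.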

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed (Deep) Machine Learning Community (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, MATLAB, and JavaScript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. It is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC), is a deep learning framework released under an open source BSD license. Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++. Users interact with Caffe through a Python API or through a command line interface. Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R ranks fifth; Matlab, fourteenth; Scala, fifteenth; Julia thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release).

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegulalp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Big Analytics Roundup (July 25, 2016)

We have some more summer reading this week; plus, Splice Machine announces availability of its open source Community Edition, and Google launches two new machine learning APIs. There are so many Spark stories I’ve created a special section for them. Plus we have the usual explainers, perspectives, and news.

Quant headhunter Linda Burtch repeats her survey of working analysts in her network. Preference for using SAS has steadily declined over the three years she has conducted the poll; this year a clear majority chose R or Python over SAS. Preference for open source correlates with education; the more you know, the less likely you are to use SAS.

Oracle, IBM, SAP, and Microsoft have all reported Q2 revenue and earnings, but Teradata is still crunching the numbers. I’ll do a general earnings roundup when TDC gets around to reporting its numbers. TDC’s stock price has outperformed the others since June 30, which suggests the market expects a good second quarter. Meanwhile, TDC acquires another consultancy and reveals who bought Aprimo.

Summer Reading

Adrian Colyer lists his five favorite papers from the past several months and outlines his philosophy, which you must read. And here is another link to last week’s top paper on data bazaars versus data cathedrals.

Splice Machine Shifts to Open Core

Hadoop-based RDBMS vendor Splice Machine announces general availability for its open source community edition and offers a sandbox hosted on AWS.  Sam Dean approves; Andrew Brust reports; Dave Ramel explains. Jack Germain describes Splice Machine’s changing business model.

Spark Stories

— Databricks’ Spark survey is still accepting responses. Go and fill it out if you have not done so already.

— The Spark PMC has voted favorably on a release candidate for Spark 2.0, which is now in packaging for general availability.

— On the Databricks blog, Jules Damji corrals Spark news from the past two weeks.

— Alex Woodie touts LevyxSpark, an enhanced Spark distribution based on open source Apache Spark. LevyxSpark includes some open source enhancements, plus Levyx Helium, an SSD-based key-value store.

— In a webcast, Alexander Ulanov summarizes options for deep learning on Spark.

— Sam Weaver explains how to use the new MongoDB connector for Spark.

Explainers

— Nita Dembla and Gopal Vijayaraghavan explain improvements in Hive 2.1.

— Siddharth Anand introduces Apache Airflow (Incubating), a platform to author, schedule, and monitor DAGs. Sounds like Apache Beam.

— Data Artisans’ Stephan Ewen explains savepoints in Apache Flink.

Perspectives

— Jack Clark profiles Google’s land grab in deep learning. Short version: TensorFlow is blowing away Caffe, Torch, Theano, dl4j, CNTK, and DSSTNE.

— Greg Satell theorizes about Google’s open source strategy as if a “razor and blades” strategy is something new and brilliant.

— In Fortune, Barb Darrow profiles cloud computing’s disruptive impact.

— Sam Dean confuses machine learning with artificial intelligence.

— Syncsort’s Paige Roberts interviews Dr. Ellen Friedman.

— Drew Breunig poses a theory about the business implications of machine learning.

— BuzzFeed’s Adam Kelleher attempts to explain bias, fails.

— IBM exec Rob Thomas co-authors a blog about machine learning. It’s about what you would expect from an IBM exec.

Open Source News

— Open source columnar storage engine Apache Kudu graduates to top-level status.

— Apache Chukwa announces Release 0.8, with security bug fixes, FWIW. Chukwa captures logs from distributed systems for monitoring and analysis. No, I never heard of it either.

Commercial Announcements

— Google announces open beta for its Cloud Natural Language and Cloud Speech APIs.

Hardware News

— Inspur, which claims to be China’s largest server manufacturer, announces availability of the Memory1 line of servers for big analytics. Inspur uses high-capacity flash DIMMs and memory expansion software to deliver up to 2TB of memory per server and up to 80TB per rack.

— Startup Wave Computing announces plans for a family of deep learning computers. Good luck to them. The history of computing isn’t kind to special purpose machines, which tend to eventually get buried by general purpose machines.

Funding News

— Redis Labs lands a $14 million “C” round led by Bain Capital and Carmel Ventures. Redis claims 6,200 enterprise customers and 55,000 accounts for its cloud service.

— Sift Security emerges from stealth, announces $3.25 million in angel funding. Sift uses graph analytics running on Spark and TitanDB to identify linked threats and incidents.

Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from GraphLab Dato Turi.

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. The conference is November 14-16; the CFP closes September 9.

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

[Image: screenshot from the OpsClarity survey summary]

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.

Performance is cool. But the current spate of streaming engines will not be resolved by performance tests. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies, with quarterly earnings reports. (SAS is privately held). Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to do real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

— Turi, formerly Dato and before that GraphLab, announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.

Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents conducted among a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report.

Top Read of the Week

Guoqiang Jerry Chen et al. explain real-time data processing at Facebook. Adrian Colyer summarizes.

Explainers

— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.

Perspectives

— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost in 2014. At that rate, Talend may break even in 2020, if nothing else happens in the interim.

Big Analytics Roundup (June 13, 2016)

Spark Summit 2016 met last week in SFO. There were many cool things; I will publish a separate report when presentations and videos are available.

KDnuggets releases results of its annual poll on data science software. Key findings:

  • Python use is up 51%, almost catches up to R, the #1 choice.
  • Excel and Tableau usage are up 47% and 49%, respectively.
  • Spark usage is up 91%, overtakes Hadoop.
  • SAS is down big time, drops from the top ten.

Meanwhile, Alex Woodie wraps statistics on Spark adoption, and Qubole’s Ari Amster reports on Spark usage among Qubole users.

Tim Spann recaps the week in Hadoop.

Spark Summit: Roundup of Roundups

— On the Databricks blog, Wayne Chan, Dave Wang, Jules Damji and Denny Lee recap highlights from the Summit.

— Jessica Davis rounds up the highlights.

— Jack Vaughan surrounds the story, quotes some old guy.

— Sam Dean summarizes what you need to know.

— Alex Handy collects the key bits.

— Andrew Brust separately corrals Day One and Day Two.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

Spark Summit Europe, Brussels, October 25-27 (closing date TBA)

Top Read

Adrian Colyer summarizes a paper on identifying architectural debt in software.

Explainers

— Deenar Torasker explains the new capabilities of HDFS.

— Ron Bodkin explains key considerations when designing continuous apps, in the second of a three-part series. Part one is here.

— On his eponymous blog, Jesse Steinweg-Woods explains Gradient Boosted Trees with XGBoost in Python.

— Adam Warski explains how Kafka Streams fits into the stream processing landscape.

Perspectives

— H2O.ai’s Vinod Iyengar objects to what he calls the fragmentation of Spark support, correctly noting that Cloudera and Hortonworks support different versions of Spark in their distributions. Of course, nobody is obligated to use Spark with Cloudera and Hortonworks.

— From the Spark Summit on YouTube: Ben Lorica leads a panel discussion of incredibly smart and distinguished people, plus some old guy.

— Altiscale’s Barbara Lewis presents ten use cases for Big Data.

— Tim Wallis believes that AI will relieve boredom.

— Sam Dean touts Grappa, Drill and Kafka as successors to Spark. Grappa is going nowhere. Drill is great if all you want to do is SQL, and Kafka is great if all you want to do is streaming. Pro tip: there are no real-world analytic applications where all you want to do is streaming.

— Allen Downey opines that statistical tests are inflexible and opaque. Funny, my college roommate said the same thing when he flunked his Stat 101 mid-term.

Open Source Announcements

— LinkedIn announces release of PhotonML, a machine learning library for Spark. Feature detail here.

— Google releases TensorFlow 0.9.0, with iOS support. Speculation about deep learning on your phone ensues.

— Twitter donates DistributedLog to Apache.

Commercial Announcements

— Databricks announces general availability for the Databricks Community Edition, and completion of the first phase of Databricks Enterprise Security framework.

— Microsoft announces general availability for its managed Spark service in HDInsight, and summer availability for the Spark pushdown capability in R Server. The company also announced PowerBI support for Spark Streaming, which is confusing for those who thought PowerBI already supported Spark Streaming.

— IBM announces limited preview of a managed service branded as the Data Science Experience. IBM is coy about the details; the service definitely includes Spark, Jupyter and RStudio, H2O and “curated data sets”, and may include other bits. The service itself looks promising, but IBM’s claim to offer the “first development environment for Apache Spark” is BS.

— In an oddly opaque press release, H2O announces that it is “working with” IBM. H2O is open source software, and IBM requires no permission from H2O.ai for use or distribution; presumably, H2O will offer support contracts to users. H2O.ai did not respond to request for comment.

— Splice Machine announces plans to go open source; a company insider says they plan to donate the software to Apache. Dave Ramel reports.

Big Analytics Roundup (May 9, 2016)

The big news this week: Teradata’s CEO Mike Koehler walks the plank. TDC stock rises 21% on dismal numbers, which demonstrates how much Wall Street values leadership.

CRN releases its fourth annual Big Data 100 in listicle form to maximize clicks. Criteria for inclusion are “editor’s picks”, so whatever. I got through the As before giving up.

Dave Ramel details five leading Apache Big Data projects: Spark, Tez, Bigtop, REEF and Storm. What? It’s a nice summary of each, but Ramel is a slave to Apache’s silly classifications.

Bullshit Benchmarks

Here are four rules for benchmarks.

  1. Use a standard test protocol, such as TPC-DS.
  2. When there is no available standard, test multiple use cases. Make a decent effort to try a variety of workloads.
  3. Communicate with sponsors for all benchmarked software, or communicate with none of them.
  4. Publish your code and your data. (There’s this thing called GitHub….)

The ironically named Mammoth Data (current headcount: 15) violates all four rules in a Google-commissioned “study,” which concludes that Cloud Dataflow runs one use case faster than Spark. Professional cat herder Andrew Oliver replaces his Mammoth CEO hat with his analyst hat and touts the results.

Go to the back of the class, Andrew. Run more use cases, discuss results with the Spark team as well as the Google team, then let us know what you learned. I don’t doubt that Dataflow is a nifty tool, and look forward to seeing a benchmark we can trust.

Explainers

— Adrian Colyer focuses on time series:

  • Gorilla: a fast, scalable, in-memory time series database.
  • BTrDB (Berkeley Tree Database), optimized storage for time series processing.
  • The Tarzan algorithm, a technique that discovers surprising patterns in a time series database. (Fixed link — h/t Oliver Vagner).

— On BrightTalk, Databricks’ Reynold Xin explains the new bits in Spark 2.0, to be released soon.

— On the DataRobot blog, Quantopian’s Thomas Wiecki explains how to predict out-of-sample performance for trading algorithms.

— Indeed.com’s Preetha Appan explains algorithms and architecture for recommendation engines.

— In a webcast, Sean Owen and Yann Delacourt explain real-time analytics with Spark.

— Microsoft’s Lixun Zhang explains the differences among open source R, Microsoft R Open and Microsoft R Server.

Perspectives

— In Datanami, George Leopold profiles DataRobot, a machine learning startup. One point he gets wrong: DataRobot runs on Hadoop in the cloud, and it also runs on Hadoop on premises.

— On the Google Cloud blog, Tyler Akidau offers Google’s perspective on why they moved Cloud Dataflow development to Apache Beam. DataArtisans chirps support. Here’s what OpenHub has to say about Apache Beam:

[Screenshot: OpenHub summary for Apache Beam]

— In WSJ’s CIO Journal, Steven Norton interviews Airbnb’s Mike Curtis, who name-drops Apache Spark. In the same venue, Clint Boulton previously reported that Airbnb uses Spark in its Aerosolve project.

— Jim O’Reilly offers a summary of the differences among AWS, Azure and Google Cloud.

— On the Qubole blog, Monique Chmiel tries to summarize the pros and cons of Python, R and Scala for Big Data, and largely fails. None of the three is suitable for Big Data on its own, so you have to evaluate them for their APIs to scalable platforms like Spark. As of today, the Spark APIs for Scala and Python are clearly superior to the R API.

Commercial Announcements

News from commercial software providers, as well as commercial vendors that operate on an open source software model.

— Hortonworks announces that it lost $1.59 for every dollar it sold in Q1, which is slightly better than the $1.85 it lost in Q1 of 2015. At that rate, look for HDP to break even in 2018 or so, unless they run out of cash first. Wall Street drives stock down 18%.

— Teradata fires CEO, Wall Street celebrates. Don’t party too hard, guys; the numbers still stink.

Stuff I Really Don’t Care About

— Basho releases Riak TS to open source.

Big Analytics Roundup (April 25, 2016)

Mesosphere wins the internet this week with its announcement that it has open sourced DC/OS, its datacenter virtualization project built around Apache Mesos. While not an “analytics” project per se, DC/OS has the potential to transform how organizations provision and deploy their analytics platforms.

In a nutshell, Apache Mesos distributes workloads across physical IT resources. DC/OS adds a container orchestration platform; installation, management and monitoring tools; and improvements to networking, security, load balancing and other areas. For more details about DC/OS and why it matters, read this white paper by Benjamin Hindman and Edward Hsu of Mesosphere.

Mesosphere has assembled an alliance of 61 launch partners, including tech vendors, systems integrators and potential users. Big brands include Accenture, Capgemini, Cisco, EMC, HPE, MapR, Microsoft and Verizon. Notable startups include Alluxio, Canonical, Confluent, Lightbend and MemSQL.

Analysts chime in:

  • Gavin Clarke thinks Google forced Mesosphere’s hand by open sourcing Kubernetes.
  • Mike Wheatley notes that many of the components were already open source.
  • On TechCrunch, Frederic Lardinois reports and comments.
  • In Computerworld, John Ribeiro reports.
  • Janakiram MSV wonders if DC/OS will emerge as an alternative to Kubernetes.
  • Sam Dean surveys the project and interviews Ben Hindman.
  • George Leopold notes the scope of the DC/OS ecosystem.
  • Joao Lima reports.

DC/OS ships with more than 30 open source packages ready to install as DC/OS services. Notable among them: Cassandra, Elasticsearch, Kafka, MemSQL, Spark, Storm and Zeppelin.

Explainers

— Andrie de Vries explains how he scraped CRAN to trace the growth in R packages.

— On the Cloudera Engineering blog, David Alves explains how to use Impala and Kudu for analytic workloads.

— Michael Hunger and William Lyon explain how they analyzed the Panama Papers with Neo4j.

— On the Microsoft Azure blog, Liam Cavanagh explains how to optimize document search in Azure.

— Adrian Colyer of the morning paper summarizes five papers on word vectors, reviews Global Vectors for Word Representation, delivers an overview of Deep Learning and covers ImageNet classification with deep convolutional neural networks.

— Mario Inchiosa and Roni Burd explain how Microsoft R Server delivers an R interface to Spark in HDInsight.

Perspectives

— In MIT Technology Review, Tom Simonite interviews Google’s Jeff Dean, contributor to Spanner, Translate, BigTable, MapReduce, Google Brain, LevelDB and TensorFlow. They discuss the future of machine learning.

— David Weldon went to Strata and interviewed some people:

  • Ali Hodroj of GigaSpaces, a cloud enabling company. Hodroj is bullish on cloud.
  • H2O.ai’s Arno Candel, who is surprised that so many people are talking about Spark.
  • Nikita Ivanov of GridGain, who says that people are excited about in-memory computing.
  • DataArtisans’ Kostas Tzoumas, who thinks that more people would use Flink if they were better educated.

— Alex Woodie touts Apache Beam, the open source implementation of Google’s Cloud Dataflow, which aspires to unify everything.

— James Nunns surveys ten Big Analytics startups: Confluent, H2O.ai, AtScale, Interana, Tamr, Wavefront, BlueTalon, Cazena, DataTorrent and Databricks.

— In Silicon Angle, Wikibon’s Paul Gillin interviews Wikibon’s George Gilbert, who is bullish on Spark.

— John Leonard ruminates on Hadoop, noting the proliferation of cute animal logos, and the challenges of the open source business model.

— Sam Dean notices that there are quite a few new open source tools for machine learning.

— Jack Vaughan summarizes the educational challenges posed by machine learning.

Commercial Announcements

— Dataiku announces availability of Data Science Studio on Microsoft Azure.

— GridGain announces availability of a support package for Apache Ignite that includes its Professional Edition — essentially the same as Apache Ignite, with more frequent maintenance releases and some LGPL libraries.

— MemSQL announces closing on a $36 million “C” round. All existing investors participated, plus two new investors.

Big Analytics Roundup (April 4, 2016)

Strata + Hadoop World sparks a number of commercial announcements: AtScale has a new release, Microsoft previews R Server on HDInsight, and IBM puts Spark on a mainframe, FWIW. We also have a nice harvest of explainers and perspectives.

Slides from Strata available here.

The folks at Domino Data ask: Is XGBoost 10X faster than H2O? We’ll never know the answer, since they took down the post. I’m guessing the answer is “no.”

Databricks offers a collection of popular blog posts on Apache Spark as an eBook.

Explainers

On the Google Cloud Big Data Blog, Eric Anderson and Marian Dvorsky compare autoscaling in Dataflow/Beam to Spark and Hadoop. (h/t William Vambenepe)

Miles Yucht and Reynold Xin explain DeepSpark, a convolutional neural network that automates software development processes, such as writing test cases, fixing bugs and so forth.

Databricks’ Jules Damji explains how to process JSON data with Spark Datasets and DataFrames.

On the Airbnb engineering blog, Ricardo Bion explains how to scale data science with R.

Eduardo Ariño De La Rubia explains how The Climate Corporation created a high-throughput data science machine.

DataArtisans’ Kostas Tzoumas explains Flink internals, and how Flink counts elements in streams.

On the Insight Data Engineering blog, Daniel Blazevski explains Flink quadtrees.

H2O.ai’s Erin LeDell explains scalable ensemble learning with H2O. Also at Strata, Arno Candel explains why Deep Learning is eating your lunch.

On the Dataiku blog, someone named Margot explains automated model deployment with Data Science Studio.

On the DataTorrent blog, David Yan explains latency calculations in Apache Apex.

Christopher Crosbie explains SparkR on EMR, on the AWS Big Data blog.

Perspectives

Jack Vaughan notes the prominence of streaming analytics at Strata, quotes some old guy who thinks streaming is a thing.

On the Cloudera Vision Blog, Dan Sturman describes Cloudera’s response to what he characterizes as a software quality challenge.

Cloud vendor Altiscale’s Raymie Stata asks which is best for Spark and Hadoop: cloud or on-premises. Spoiler: he thinks you should choose cloud.

On LinkedIn, consultant Rick van der Lans touts Apache Drill.

Wikibon releases forecasts of Spark adoption and the Big Data market. You can either pay Wikibon for a subscription, or read George Leopold’s summary here or Mike Wheatley’s summary here.

Alex Woodie recaps Doug Cutting’s keynote at Strata+Hadoop.

On the tech blog for Berlin-based online retailer Zalando, Javier Lopez and Mihail Vieru recap a recently completed Flink versus Spark bakeoff. They like Flink’s low latency which, as a fashion retailer, they totally think they need. The bottom line, though, seems to be that DataArtisans is just a few stops away on the U-Bahn, so they chose Flink.

Brandon Butler summarizes the Microsoft and Google challenges to Amazon in the cloud.

InfoWorld’s Martin Heller reviews Databricks’ Spark service, likes it.

In TechCrunch, Josh Klahr lists seven things to watch for at Strata + Hadoop World, which is still worth reading even though the show came and went.

Talend CMO Ashley Stirrup suggests you sharpen your customer reflexes with Apache Spark. If you want to improve your actual reflexes, read this.

Open Source Announcements

ASF announces Apache NiFi 0.6.0, with Kerberos authentication for its REST API and support for Amazon Kinesis, AWS Lambda, Splunk, and Apache Cassandra. (h/t Hadoop Weekly)

Commercial Announcements

OLAP-on-Hadoop vendor AtScale announces release 4.0. Key new bits: fine-grained security that links every query to an end user, and an intelligent query optimizer that pushes down either as SQL or as MDX depending on the end user tool. AtScale has also extended its platform integration; it now supports Business Objects, Cognos, Excel, Jaspersoft, Qlik, MicroStrategy, PowerBI, Spotfire, and Tableau on CDH, HDP, HDInsight and MapR, with Hive/Tez, Impala and Spark SQL, plus an impressive list of data storage formats. Mike Wheatley reports.

Data integration startup Tamr announces “compatibility” with Spark. The press release does not specify whether that means connectivity, push-down integration or something else. Tamr is not certified by Databricks, and has not published anything on Spark Packages.

Pouring new wine into old bottles, IBM delivers Spark on a mainframe, as promised last July.  IBM touts this as a way to perform analysis of your data “in place”, which is great if all of your data is stuck on a mainframe.

IBM partners with Lightbend, the company formerly known as Typesafe, to deliver Scala training through the Big Data University.

Altiscale announces partnership with Tableau, will add visualization to its managed service for Big Data.

Databricks announces availability of APIs to automate Spark infrastructure. On the Databricks blog, Dave Wang explains.

Microsoft announces preview of R Server for HDInsight and an update to Apache Spark for Azure HDInsight. R Server for HDInsight is a rebranded version of Revolution Analytics’ ScaleR, which Microsoft acquired last year. R Server is a distributed machine learning platform with an R API and push-down integration to MapReduce and Spark.

Flink promoter DataArtisans announces a 5.5 million Euro Series A financing round led by Intel Capital.

Dataiku announces a new release of Data Science Studio. The press release touts some new features, but I’ll refrain from commenting until the company posts release notes.

Big Analytics Roundup (March 21, 2016)

Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.

— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.

— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:

  • Most popular technologies for math and data: Python and SQL.
  • Top paying technologies: Spark and Scala.
  • Top paying tech for data scientists: Scala, Spark and Hadoop.
  • Top tech stack for data scientists: Python + R + SQL.
  • Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
  • Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
  • Biggest challenge at work (all respondents): Unrealistic expectations.
  • Purchasing power of developers in South Africa: 25,713 Big Macs per year.

— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.

— Daniel Chalef of Domino Data analyzes data from Google Trends and StackOverflow, and discovers that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchen’s blog on the popularity of analytics software. Search interest is one data point; Bob’s work with job postings offers a better picture of the actual state of the market.

— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.

Explainers

— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.

— This week from the morning paper:

  • Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
  • … explains social engineering attacks and potential defenses.
  • … explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scalability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.
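
To put the sublinear scaling figure in perspective, here is a minimal sketch of the arithmetic. It uses the standard definition of parallel efficiency (speedup divided by node count); the function name is hypothetical, and 12.0 is an upper bound, since the reported speedup is "a little less than" 12X.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return the fraction of ideal (linear) speedup actually achieved."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Figures from the summary above: 32 nodes, speedup just under 12X.
# 12.0 is used as an upper bound on the reported speedup.
efficiency = parallel_efficiency(12.0, 32)
print(f"Parallel efficiency at 32 nodes: at most {efficiency:.1%}")
```

In other words, each added node contributes well under half of what perfectly linear scaling would deliver.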

— MapR’s Tugdual Grall explains what Spark is, what it does, and what sets it apart.

— In SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.

— On the Databricks blog, Denny Lee et al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.

— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.

— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.

Perspectives

— Jack Vaughan interviews some old guy who thinks Spark is a thing.

— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.

— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.

— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design.  Pure Hadoop is simple: MapReduce and HDFS.  Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.

— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python.  He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.

— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.

— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.

— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”

— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.

— In Martech Advisor, Ankush Gupta Mar interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.

— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.

— Srini Penchikala interviews the authors of Spark in Action, an upcoming book from Manning.

Teradata Watch

— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.

— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”

Open Source Announcements

— Three announcements from Apache projects:

  • Apex announces release 3.3.1 of the Malhar library, a maintenance release.
  • Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
  • Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.

Commercial Announcements

— SAP announces general availability for SAP HANA Vora, a tool that enables HANA users to query data in Hadoop and other distributed storage platforms through Spark. In CIO, Thor Olavsrud reports.

— Dataiku announces that it has hired two new Veeps to drive expansion in North America.

— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.

— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.