Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis? It’s an interesting question, and it requires a little parsing. Nobody, not even its most ardent proponents, believes that Spark alone is the future of data analysis. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

“Spark is overhyped!” he declared. His evidence? This:

[Image: Gartner Hype Cycle chart]

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which raises the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
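To make the serial-versus-parallel point concrete, here is a minimal sketch in plain Python. It is not Spark code: a local thread pool stands in for a cluster, and run_experiment is a hypothetical scoring function with a made-up formula that happens to peak at depth 4 and rate 0.1. The only point is that each experiment is independent, so the whole grid can be handed to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(params):
    """Hypothetical stand-in for training and scoring one model.

    The score is a made-up formula, not a real model evaluation.
    """
    depth, rate = params
    score = -((depth - 4) ** 2) - 100 * (rate - 0.1) ** 2
    return score, params


def grid_search(depths, rates, workers=4):
    # Every (depth, rate) pair is an independent experiment.
    grid = list(product(depths, rates))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_experiment, grid))
    # Keep the experiment with the highest score.
    return max(results, key=lambda result: result[0])


best_score, best_params = grid_search([2, 4, 6, 8], [0.01, 0.1, 0.5])
print(best_params)
```

On a real cluster the same function could be mapped over the grid with PySpark, for example sc.parallelize(grid).map(run_experiment).collect(), assuming a SparkContext named sc is available; that variant is not shown or tested here.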

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, the Spark team has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and made many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLlib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms).

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGBoost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and Power BI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with Spark ML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

The Year in SQL Engines

As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run SQL; it runs HiveQL, an SQL-like language that closely approximates SQL. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, Spark, Drill, HAWQ, and Presto also work with a wide variety of other data storage systems.

Unlike relational databases, SQL engines operate independently of the data storage system. Relational databases bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling them provides greater flexibility, though at a potential cost in performance.

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-Engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries, mentions in online discussions, job offers, mentions in professional profiles, and tweets.

Figure 1

Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

Although Impala, Spark SQL, Drill, HAWQ, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap with the other engines. Most of the Hive alternatives launched in 2012, when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, Drill, et al. ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2

Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3

Source: Open Hub https://www.openhub.net/

In 2016, Cloudera, Hortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks that Tony Baer summarizes. I’m sure that you will be shocked to learn that the vendor’s preferred SQL engine outperformed the others in each of these studies, which raises the question: are benchmarks bullshit?

AtScale’s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL backends. The company’s software is engine-neutral (it seeks to run on as many engines as possible), and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes several improvements to Hive-on-Spark, plus performance, usability, supportability and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012, as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries.

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014 Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala; Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure and delivered Release 2.7.0, its first Apache release, in October. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, in an Alpha release as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
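For readers who want to see why checkpoints plus a write-ahead log add up to exactly-once results, here is a toy model in plain Python. It is emphatically not Spark’s implementation: CheckpointedCounter, its fields, and the crash_in_batch switch are all hypothetical names invented for this sketch. The idea it illustrates is that each batch’s range is logged before any work happens, and the offset and running counts are committed together, so a batch interrupted by a crash is replayed over the same range and counted only once.

```python
import copy


class CheckpointedCounter:
    """Toy running count; a hypothetical class, not part of Spark."""

    def __init__(self):
        # Pretend both of these live on durable storage and survive a crash.
        self.checkpoint = {"offset": 0, "counts": {}}
        self.wal = []  # write-ahead log of planned batches as (start, end)

    def _plan_batch(self, start, total, batch_size):
        # If the last logged batch was never committed, replay that exact range.
        if self.wal and self.wal[-1][0] == start:
            return self.wal[-1]
        plan = (start, min(start + batch_size, total))
        self.wal.append(plan)  # log the plan before doing any work
        return plan

    def run(self, events, batch_size, crash_in_batch=None):
        # Always resume from the last committed checkpoint.
        state = copy.deepcopy(self.checkpoint)
        while state["offset"] < len(events):
            start, end = self._plan_batch(state["offset"], len(events), batch_size)
            for word in events[start:end]:
                state["counts"][word] = state["counts"].get(word, 0) + 1
            if crash_in_batch == (start, end):
                raise RuntimeError("simulated crash before commit")
            # Commit the new offset and the counts together in one step.
            state["offset"] = end
            self.checkpoint = copy.deepcopy(state)
        return self.checkpoint["counts"]


events = ["a", "b", "a", "c", "a", "b"]
job = CheckpointedCounter()
try:
    job.run(events, batch_size=2, crash_in_batch=(2, 4))
except RuntimeError:
    pass  # the uncommitted counts for batch (2, 4) die with the process

counts = job.run(events, batch_size=2)
print(counts)
```

The second run picks up at offset 2, replays the logged batch (2, 4), and finishes with the same totals a crash-free run would produce. Committing the offset separately from the counts would break that guarantee, which is why the sketch writes them as one checkpoint.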

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ 2.0.0.0 in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.

Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases, and proprietary file systems (such as Amazon Web Services’ S3). Presto queries can federate data from multiple sources. Users can submit queries from C, Java, Node.js, PHP, Python, R, and Ruby.
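A federated query is simply one SQL statement that joins tables living in different systems. The sketch below shows the shape of such a query using Python’s built-in sqlite3 module as a stand-in: two separate in-memory SQLite databases play the role of two sources. This is not Presto, and the table names, the crm alias, and the sample rows are invented for illustration.

```python
import sqlite3

# First "source": an in-memory database holding orders.
conn = sqlite3.connect(":memory:")
# Second, separate "source": another in-memory database, attached as crm.
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (2, 5.0), (1, 7.5)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# One query, two databases: join orders to customers and total by name.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
conn.close()
```

In Presto the same pattern applies across genuinely different systems, with each table addressed by a catalog, schema, and table name rather than by an attached database alias.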

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project. Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June 2016, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and Zoomdata, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top-level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to production. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine — I can see the facepalms — so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”

The Year in Machine Learning (Part Two)

This is the second installment in a four-part review of 2016 in machine learning and deep learning. Part One covered general trends. In Part Two, we review the year in open source machine learning and deep learning projects. Parts Three and Four will cover commercial machine learning and deep learning software and services.

There are thousands of open source projects on the market today, and we cannot cover them all. We’ve selected the most relevant projects based on usage reported in surveys of data scientists, as well as development activity recorded in Open Hub. In this post, we limit the scope to projects with a non-profit governance structure, and those offered by commercial ventures that do not also provide licensed software. Part Three will include software vendors who offer open source “community” editions together with commercially licensed software.

R and Python maintained their leadership as primary tools for open data science. The Python versus R debate continued amid an emerging consensus that data scientists should consider learning both. R has a stronger library of statistics and machine learning techniques and is more agile when working with small data. Python is better suited to developing applications, and the Python open source license is less restrictive for commercial application development.

Not surprisingly, deep learning frameworks were the most dynamic category, with TensorFlow, Microsoft Cognitive Toolkit, and MXNet taking leadership away from more mature tools like Caffe and Torch. It’s remarkable that deep learning tools introduced as recently as 2014 now seem long in the tooth.

The R Project

The R user community continued to expand in 2016. It ranked second only to SQL in the 2016 O’Reilly Data Science Salary Survey; first in the KDNuggets poll; and first in the Rexer survey. R ranked fifth in the IEEE Spectrum ranking.

R functionality grew at a rapid pace. In April, Microsoft’s Andrie de Vries reported that there were more than 8,000 packages in CRAN, R’s primary repository for contributed packages. As of mid-December, there are 9,737 packages. Machine learning packages in CRAN continued to grow in number and functionality.

The R Consortium, a Collaborative Project of the Linux Foundation, made some progress in 2016. IBM and ESRI joined the Consortium, whose membership now also includes Alteryx, Avant, DataCamp, Google, Ketchum Trading, Mango Solutions, Microsoft, Oracle, RStudio, and TIBCO. There are now three working groups and eight funded projects.

Hadley Wickham had a good year. One of the top contributors to the R project, Wickham co-wrote R for Data Science and released tidyverse 1.0.0 in September. In The tidy tools manifesto, Wickham explained the four basic principles of a tidy API.

Max Kuhn, the author of Applied Predictive Modeling and developer of the caret package for machine learning, joined RStudio in November. RStudio previously hired Joseph Rickert away from Microsoft.

AT&T Labs is doing some impressive work with R, including the development of a distributed back-end for out-of-core processing with Hadoop and other data platforms. At the UseR! Conference, Simon Urbanek presented a summary.

It is impossible to enumerate all of the interesting analysis performed in R this year. David Robinson’s analysis of Donald Trump’s tweets resonated; using tidyverse, tidytext, and twitteR, Robinson was able to distinguish between the candidate’s “voice” and that of his staffers on the same account.

On the Revolutions blog, Microsoft’s David Smith surveyed the growing role of women in the R community.

Microsoft and Oracle continued to support enhanced R distributions; we’ll cover these in Part Three of this survey.

Python

Among data scientists surveyed in the 2016 KDNuggets poll, 46% said they had used Python for analytics, data mining, data science or machine learning projects in the past twelve months. That figure was up from 30% in 2015, and second only to R. In the 2016 O’Reilly Data Science Salary Survey, Python ranked third behind SQL and R.

The Python Software Foundation (PSF) expanded the number and dollar value of its grants. PSF awarded many small grants to groups around the world that promote Python education and training. Other larger grants went to projects such as the design of the Python in Education site, improvements to the packaging ecosystem (see below), support for the Python 3.6 beta 1 release sprint, and support for major Python conferences.

The Python Packaging Authority launched the Warehouse project to replace the existing Python Package Index (PyPI). Goals of the project include updating the visual identity, making packages more discoverable and improving support for package users and maintainers.

PSF released Python 3.6.0 and Python 2.7.13 in December.  The scikit-learn team released Version 0.18 with many enhancements and bug fixes; maintenance release Version 0.18.1 followed soon after that.

Many of the key developments for machine learning in Python were in the form of Python APIs to external packages, such as Spark, TensorFlow, H2O, and Theano. We cover these separately below.

Continuum Analytics expanded its commercial support for Python during the year and added commercially licensed software extensions which we will cover in Part Three.

Apache Software Foundation

There are ten Apache projects with machine learning capabilities. Of these, Spark has the most users, active contributors, commits, and lines of code added. Flink is a close second in active development, although most Flink devotees care more about its event-based streaming than its machine learning capabilities.

Top-Level Projects

There are four top-level Apache projects with machine learning functionality: Spark, Flink, Mahout, and OpenNLP.

Apache Spark

The Spark team delivered Spark 2.0, a major release, and six maintenance releases. Key enhancements to Spark’s machine learning capabilities in this release included additional algorithms in the DataFrames-based API, in PySpark and in SparkR, as well as support for saving and loading ML models and pipelines. The DataFrames-based API is now the primary interface for machine learning in Spark, although the team will continue to support the RDD-based API.

GraphX, Spark’s graph engine, remained static. Spark 2.0 included many other enhancements to Spark’s SQL and Streaming capabilities.

Third parties added 24 machine learning packages to Spark Packages in 2016.

The Spark user community continued to expand. Databricks reported 30% growth in Spark Summit attendees and 240% growth in Spark Meetup members. 18% of respondents to Databricks’ annual user survey reported using Spark’s machine learning library in production, up from 13% in 2015. Among data scientists surveyed in the 2016 KDnuggets poll, 22% said they use Spark; in the 2016 O’Reilly Data Science Salary Survey, 21% of the respondents reported using Spark.

The Databricks survey also showed that 61% of users work with Spark in the public cloud, up from 51% in 2015. As of December 2016, there are Spark services available from each of the major public cloud providers (AWS, Microsoft, IBM and Google), plus value-added managed services for data scientists from Databricks, Qubole, Altiscale and Domino Data.

Apache Flink

dataArtisans’ Mike Winters reviewed Flink’s accomplishments in 2016 without using the words “machine learning.” That’s because Flink’s ML library is still pretty limited, no doubt because Flink’s streaming runtime is the primary user attraction.

While there are many use cases for scoring data streams with predictive models, there are few real-world use cases for training predictive models on data streams. Machine learning models are useful when they generalize to a population, which is only possible when the process that creates the data is in a steady state. If a process is in a steady state, it makes no difference whether you train on batched data or streaming data; the latest event falls into the same mathematical space as previous events. If recent events produce major changes to the model, the process is not in a steady state, so we can’t rely on the model to predict future events.
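To make the steady-state argument concrete, here is a minimal Python sketch. It assumes a hypothetical process that draws every event from one fixed distribution, and it uses a running mean as a stand-in for a model; the data and the function name are invented for illustration and do not come from Flink or any other library. Under those assumptions, updating the estimate one event at a time lands on the same answer as fitting the whole batch.

```python
import random
import statistics


def streaming_mean(events):
    """Update the estimate one event at a time, as a streaming learner would."""
    mean, n = 0.0, 0
    for x in events:
        n += 1
        mean += (x - mean) / n  # incremental update; no history is stored
    return mean


# Hypothetical steady-state process: every event comes from the same distribution.
rng = random.Random(42)
events = [rng.gauss(10.0, 2.0) for _ in range(10_000)]

batch_estimate = statistics.fmean(events)   # "train" on the full batch
stream_estimate = streaming_mean(events)    # "train" event by event

# The two estimates agree up to floating-point rounding.
print(abs(batch_estimate - stream_estimate) < 1e-9)
```

If the distribution shifted partway through the stream, the two estimates would still agree with each other, but neither would describe the events still to come, which is the point of the paragraph above.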

Flink does not yet support PMML model import, a relatively straightforward enhancement that would enable users to generate predictions on streaming data with models built elsewhere. Most streaming engines support this capability.

There may be use cases where Flink’s event-based streaming is superior to Spark’s micro-batching. For the most part, though, Flink strikes me as an elegant solution looking for a problem to solve.

Apache Mahout

The Mahout team released four double-dot releases. Key enhancements include the Samsara math environment and support for Flink as a back end. Most of the single machine and MapReduce algorithms are deprecated, so what’s left is a library of matrix operators for Spark, H2O, and Flink.

Apache OpenNLP

OpenNLP is a machine learning toolkit for processing natural language text. It’s not dead; it’s just resting.

Incubator Projects

In 2016, two machine learning projects entered the Apache Incubator, while no projects graduated, leaving six in process at the end of the year: SystemML, PredictionIO, MADLib, SINGA, Hivemall, and SAMOA. SystemML and Hivemall are the best bets to graduate in 2017.

Apache SystemML

SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research beginning in 2010. IBM donated the code to Apache in 2015; since then, IBM has committed resources to developing the project. All of the major contributors are IBM employees, which raises the question: what is the point of open-sourcing software if you don’t attract a community of contributors?

The team delivered three releases in 2016, adding algorithms and other features, including deep learning and GPU support. Given the support from IBM, it seems likely that the project will hit Release 1.0 this year and graduate to top-level status.

Usage remains light among people not employed by IBM. There is no “Powered By SystemML” page, which implies that nobody else uses it. IBM added SystemML to BigInsights this year, which expands the potential reach to IBM-loyal enterprises, if there are any of those left. It’s possible that IBM uses the software in some of its other products.

Apache PredictionIO

PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch. An eponymous startup began work on the project in 2013; Salesforce acquired the company earlier this year and donated the assets to Apache. Apache PredictionIO entered the Apache Incubator in May.

Apache PredictionIO includes many templates for “prebuilt” applications that use machine learning. These include an assortment of recommenders, lead scoring, churn prediction, electric load forecasting, sentiment analysis, and many others.

Since entering the Incubator, the team has delivered several minor releases. Development activity is light, however, which suggests that Salesforce isn’t doing much with this.

Apache SINGA

SINGA is a distributed deep learning project originally developed at the National University of Singapore and donated to Apache in 2015. The platform currently supports feed-forward models, convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks.  It includes a stochastic gradient descent algorithm for model training.

The team has delivered three versions in 2016, culminating with Release 1.0.0 in September. The release number suggests that the team thinks the project will soon graduate to top-level status; they’d better catch up with paperwork, however, since they haven’t filed status reports with Apache in eighteen months.

Apache MADLib

MADLib is a library of machine learning functions that run in PostgreSQL, Greenplum Database and Apache HAWQ (incubating). Work began in 2010 as a collaboration between researchers at UC-Berkeley and data scientists at EMC Greenplum (now Pivotal Software). Pivotal donated the software assets to the Apache Software Foundation in 2015, and the project entered Apache incubator status.

In 2016, the team delivered three minor releases. The active contributor base is tiny, averaging three contributors per month.

According to a survey conducted by the team, most users have deployed the software on Greenplum database. Since Greenplum currently ranks 35th in the DB-Engines popularity ranking and is sinking fast, this project doesn’t have anywhere to go unless the team can port it to a broader set of platforms.

Apache Hivemall

Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team organized in September 2016 and plans an initial release in Q1 2017.

Given the relatively mature state of the code, large installed base for Hive, and high representation of Spark committers on the PMC, Hivemall is a good bet for top-level status in 2017.

Apache SAMOA

SAMOA entered the Apache Incubator two years ago and died. It’s a set of distributed streaming machine learning algorithms that run on top of S4, Storm, and Samza.

As noted above, under Flink, there isn’t much demand for streaming machine learning. S4 is moribund, Storm is old news and Samza is going nowhere; so, you can think of SAMOA as an Estate Wagon built on an Edsel chassis. Unless the project team wants to port the code to Spark or Flink, this project is toast.

Machine Learning Projects

This category includes general-purpose machine learning platforms that support an assortment of algorithms for classification, regression, clustering and association. Based on reported usage and development activity, we cover H2O, XGBoost, and Weka in this category.

Three additional projects are worth noting, as they offer graphical user interfaces and appeal to business users. KNIME and RapidMiner provide open-source editions of their software together with commercially licensed versions; we cover these in Part Three of this survey. Orange is a project of the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Vowpal Wabbit gets an honorable mention. Known to Kaggleists as a fast and efficient learner, VW’s user base is currently too small to warrant full coverage. The project is now domiciled at Microsoft Research. It will be interesting to see if MSFT does anything with it.

H2O

H2O is an open source machine learning project of H2O.ai, a commercial venture. (We’ll cover H2O.ai’s business accomplishments in Part Three of this report.)

In 2016, the H2O team updated Sparkling Water for compatibility with Spark 2.0. Sparkling Water enables data scientists to combine Spark’s data ingestion and ETL capabilities with H2O machine learning algorithms. The team also delivered the first release of Steam, a component that supports model management and deployment at scale, and a preview of Deep Water for deep learning.

For 2017, H2O.ai plans to add an automated machine learning capability and deliver a production release of Deep Water, with support for TensorFlow, MXNet and Caffe back ends.

According to H2O.ai, H2O more than doubled its user base in 2016.

XGBoost

A project of the University of Washington’s Distributed (Deep) Machine Learning Community (DMLC), XGBoost is an optimized distributed gradient boosting library used by top data scientists, who appreciate its scalability and accuracy. Tianqi Chen and Carlos Guestrin published a paper earlier this year describing the algorithm. Machine learning startups DataRobot and Dataiku added XGBoost to their platforms in 2016.

Weka

Weka is a collection of machine learning algorithms written in Java, developed at the University of Waikato in New Zealand and distributed under the GPL license. Pentaho and RapidMiner include the software in their commercial products.

We include Weka in this review because it is still used by a significant minority of data scientists; 11% of those surveyed in the annual KDnuggets poll said they use the software. However, reported usage is declining rapidly, and development has virtually flatlined in the past few years, which suggests that this project may go the way of the eponymous flightless bird.

Deep Learning Frameworks

We include in this category software whose primary purpose is deep learning. Many general-purpose machine learning packages also support deep learning, but the packages listed here are purpose-built for the task.

Since they were introduced in late 2015, Google’s TensorFlow and Microsoft’s Cognitive Toolkit have rocketed from nothing to leadership in the category. With backing from Amazon and others, MXNet is coming on strong, while Theano and Keras have active communities in the Python world. Meanwhile, older and more mature frameworks, such as Caffe, DL4J, and Torch, are getting buried by the new kids on the block.

Money talks; commercial support matters. It’s a safe bet that projects backed by Google, Microsoft and Amazon will pull away from the pack in 2017.

TensorFlow

TensorFlow is the leading deep learning framework, measured by reported usage or by development activity. Launched in 2015, Google’s deep learning platform went from zero to leadership in record time.

In April, Google released TensorFlow 0.8, with support for distributed processing. The development team shipped four additional releases during the year, with many additional enhancements, including:

  • Python 3.5 support
  • iOS support
  • Microsoft Windows support (selected functions)
  • CUDA 8 support
  • HDFS support
  • k-Means clustering
  • WALS matrix factorization
  • Iterative solvers for linear equations, linear least squares, eigenvalues and singular values

Also in April, DeepMind, Google’s AI research group, announced plans to switch from Torch to TensorFlow.

Google released its image captioning model in TensorFlow in September. The Google Brain team reported that this model correctly identified 94% of the images in the ImageNet 2012 benchmark.

In December, Constellation Research selected TensorFlow as 2016’s best innovation in enterprise software, citing its extensive use in projects throughout Google and strong developer community.

Microsoft Cognitive Toolkit

In 2016, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit (MCT) and released Version 2.0 to beta, with a new Python API and many other enhancements. In VentureBeat, Jordan Novet reports.

At the Neural Information Processing Systems (NIPS) Conference in early December, Cray announced that it successfully ran MCT on a Cray XC50 supercomputer with more than 1,000 NVIDIA Tesla P100 GPU accelerators.

Separately, Microsoft and NVIDIA announced a collaborative effort to support MCT on Tesla GPUs in Azure or on-premises, and on the NVIDIA DGX-1 supercomputer with Pascal GPUs.

Theano

Theano, a project of the Montreal Institute for Learning Algorithms at the University of Montreal, is a Python library for computationally intensive scientific investigation. It allows users to efficiently define, optimize and evaluate mathematical expressions with multi-dimensional arrays. (Reference here.) Like CNTK and TensorFlow, Theano represents neural networks as a symbolic graph.

The team released Theano 0.8 in March, with support for multiple GPUs. Two additional double-dot releases during the year added support for CuDNN v.5 and fixed bugs.

MXNet

MXNet, a scalable deep learning library, is another project of the University of Washington’s Distributed (Deep) Machine Learning Community (DMLC). It runs on CPUs, GPUs, clusters, desktops and mobile phones, and supports APIs for Python, R, Scala, Julia, Matlab, and JavaScript.

The big news for MXNet in 2016 was its selection by Amazon Web Services. Craig Matsumoto reports; Serdar Yegulalp explains; Eric David dives deeper; Martin Heller reviews.

Keras

Keras is a high-level neural networks library that runs on TensorFlow or Theano. Originally authored by Google’s Francois Chollet, Keras had more than 200 active contributors in 2016.

In the Huffington Post, Chollet explains how Keras differs from other DL frameworks. Short version: Keras abstracts deep learning architecture from the computational back end, which made it easy to port from Theano to TensorFlow.

DL4J

Updated, based on comments from Skymind CEO Chris Nicholson.

Deeplearning4j (DL4J) is a project of Skymind, a commercial venture. It is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J runs on distributed GPUs and CPUs. Skymind benchmarks well against Caffe, TensorFlow, and Torch.

While Amazon, Google, and Microsoft promote deep learning on their cloud platforms, Skymind seeks to deliver deep learning on standard enterprise architecture, for organizations that want to train models on premises. I’m skeptical that’s a winning strategy, but it’s a credible strategy. Skymind landed a generous seed round in September, which should keep the lights on long enough to find out. Intel will like a deep learning framework that runs on Xeon boxes, so there’s a possible exit.

Skymind proposes to use Keras for a Python API, which will make the project more accessible to data scientists.

Caffe

Caffe, a project of the Berkeley Vision and Learning Center (BVLC), is a deep learning framework released under an open source BSD license. Stemming from BVLC’s work in vision and image recognition, Caffe’s core strength is its ability to model a Convolutional Neural Network (CNN). Caffe is written in C++. Users interact with Caffe through a Python API or through a command line interface. Deep learning models trained in Caffe can be compiled for operation on most devices, including Windows.

I don’t see any significant news for Caffe in 2016.

Databricks Releases Spark Survey

In a press release and blog post, Databricks announces results from its 2016 Spark Survey. Databricks surveyed 1,615 Spark users and prospective users in July 2016. Respondents include data engineers, data scientists, architects, technical managers, and academics.

Key findings from the survey:

  • Spark SQL remains the most widely used component.
    • 88% use Spark SQL
    • 71% use Spark Streaming
    • 71% use MLlib (machine learning)
  • Respondents value Spark’s performance and advanced analytics.
    • 91% rate performance very important
    • 82% rate advanced analytics very important
    • 76% rate ease of programming very important
    • 69% rate ease of deployment very important
    • 51% rate real-time streaming very important
  • Production use has increased markedly since 2015.
    • 40% use SQL in production, up from 24%
    • 38% use DataFrames in production, up from 15%
    • 22% use streaming in production, up from 14%
    • 18% use machine learning in production, up from 13%
  • So has usage in the public cloud.
    • 61% said they use Spark in the public cloud, up from 51% in 2015.
  • Usage of Spark deployed on-premises has declined.
    • 42% use Spark in a standalone deployment, down from 48%
    • 36% use Spark under YARN, down from 40%
    • 7% use Spark on Apache Mesos, down from 11%
  • The Scala API remains the most popular, followed closely by the Python API.
    • 65% use Scala, down from 71% in 2015
    • 62% use Python, up from 58%
    • 44% use SQL, up from 36%
    • 29% use Java, down from 31%
    • 20% use R, up from 18%
  • While Linux remains the most popular OS, Mac and Windows usage is growing rapidly.
    • 74% use Linux/Unix, down from 75% in 2015
    • 32% use Windows, up from 23%
    • 22% use Mac OS X, up from 14%

The report also includes statistics about the Spark community at large.

— Databricks reports growth in the contributor base from 600 in 2015 to 1,000 in 2016, a figure that does not seem to square with the statistics reported in OpenHub.

— Spark Meetup membership grew from 66,000 in 2015 to 225,000 in 2016.

— Spark Summit attendance grew from 3,912 to 5,100.
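As a quick sanity check on the community figures, the growth rates can be worked out directly from the Meetup and Summit counts above. This is a small Python sketch; the helper name `pct_growth` is made up for illustration.

```python
def pct_growth(old, new):
    """Percentage growth from old to new."""
    return (new - old) / old * 100


meetup_growth = pct_growth(66_000, 225_000)   # Spark Meetup membership
summit_growth = pct_growth(3_912, 5_100)      # Spark Summit attendance

print(f"Meetup membership growth: {meetup_growth:.1f}%")
print(f"Summit attendance growth: {summit_growth:.1f}%")
```

Meetup membership works out to growth of roughly 241%, and Summit attendance to roughly 30%.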

For a copy of the report and an infographic, go here.

Big Analytics Roundup (August 15, 2016)

In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, which can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.

Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to OpenHUB, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what Apache Beam has accomplished in the first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gillad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches. (h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces Release 0.10.0.1, with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R ranks fifth; Matlab, fourteenth; Scala, fifteenth; Julia, thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release).

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.
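The fraud detection example can be sketched in a few lines of plain Python. This is only an illustration of the idea of enriching a stream with static data, not Spark code: the customer table, the transaction fields, the `limit` threshold and the flagging rule are all hypothetical.

```python
# Static reference data (hypothetical): one profile per customer.
customers = {
    "c1": {"home_country": "US"},
    "c2": {"home_country": "DE"},
}


def transaction_stream():
    """Stand-in for a live feed of transactions (hypothetical events)."""
    yield {"customer": "c1", "amount": 40.0, "country": "US"}
    yield {"customer": "c2", "amount": 900.0, "country": "BR"}


def flag_suspicious(stream, customers, limit=500.0):
    """Join each streaming event to the static profile, then apply a toy rule."""
    for txn in stream:
        profile = customers.get(txn["customer"], {})
        abroad = txn["country"] != profile.get("home_country")
        yield {**txn, "suspicious": abroad and txn["amount"] > limit}


flags = [t["suspicious"] for t in flag_suspicious(transaction_stream(), customers)]
print(flags)
```

The rule needs both halves: the transaction alone says nothing about where the customer normally shops, and the customer table alone says nothing about the purchase in flight.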

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

More stories here, here, here, here, here, here, here, here, and here.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegulalp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Spark 2.0 Released

The Apache Spark team announces the production release of Spark 2.0.0.  Release notes are here. Read below for details of the new features, together with explanations culled from Spark Summit and elsewhere.

Measured by the number of contributors, Apache Spark remains the most active open source project in the Big Data ecosystem.

The Spark team guarantees API stability for all production releases in the Spark 2.X line.

Highlights

Spark Summit: Matei Zaharia summarizes highlights of the release. Slides here.

— Webinar: Reynold Xin and Jules S. Damji introduce you to Spark 2.0.

— Reynold Xin explains technical details of Spark 2.0.

SQL Processing

Key Changes

New and updated APIs:

  • In Scala and Java, the DataFrame and Dataset APIs are unified.
  • In Python and R, DataFrame is the main programming interface (because those languages lack compile-time type safety).
  • For the DataFrame API, SparkSession replaces SQLContext and HiveContext.
  • Enhancements to the Accumulator and Aggregator APIs.

Spark 2.0 supports SQL 2003, and runs all 99 TPC-DS queries:

  • Native SQL parser supports ANSI SQL and HiveQL.
  • Native DDL command implementations.
  • Subquery support.
  • View canonicalization support.

Additional new features:

  • Native CSV support
  • Off-heap memory management for caching and runtime.
  • Hive-style bucketing.
  • Approximate summary statistics.

Performance enhancements:

  • Speedups of 2X-10X for common SQL and DataFrame operators.
  • Improved performance with Parquet and ORC.
  • Improvements to Catalyst query optimizer for common workloads.
  • Improved performance for window functions.
  • Automatic file coalescing for native data sources.

Explainers

Spark Summit: Andrew Or explains memory management in Spark 2.0+. Slides here.

Spark Summit: Databricks’ Michael Armbrust explains structured analysis in Spark: DataFrames, Datasets, and Streaming. Slides here.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— On KDnuggets, Paige Roberts explains Project Tungsten.

Sameer Agarwal, Davies Liu, and Reynold Xin dive deeply into Spark 2.0’s second generation Tungsten engine. This paper inspired Tungsten’s design.

Spark Summit: Yin Huai dives deeply into Catalyst, the Spark optimizer. Slides here.

— On the Databricks blog, Davies Liu and Herman van Hövell explain SQL subqueries in Spark 2.0.

Spark Summit: AMPLab’s Ankur Dave explains GraphFrames for graph queries in Spark SQL. Slides here.

Spark Streaming

Key Changes

Spark 2.0 includes an experimental release of Structured Streaming.

Explainers

Spark Summit: Tathagata Das explains Structured Streaming. Slides here.

— In an O’Reilly podcast, Ben Lorica asks Michael Armbrust about Structured Streaming.

— In InfoWorld, Ian Pointer explains Structured Streaming’s significance.

Machine Learning

Key Changes

The DataFrame-based API (previously named Spark ML) is now the primary API for machine learning in Spark; the RDD-based API remains in maintenance mode.

ML persistence is a key new feature, enabling the user to save and load ML models and pipelines in Scala, Java, Python, and R.

Additional techniques supported vary by API:

  • DataFrame-based API: bisecting k-means clustering, Gaussian Mixture Model (GMM), and the MaxAbsScaler feature transformer.
  • PySpark: LDA, GMM, and generalized linear regression.
  • SparkR: Naïve Bayes, k-means clustering, and survival regression, plus new families and link functions for GLM.

Explainers

Spark Summit: Joseph Bradley previews machine learning in Spark 2.0. Slides here.

— On the Databricks blog, Joseph Bradley explains model persistence in Spark 2.0.

— Tim Hunter, Hossein Falaki, and Joseph Bradley explain approximate algorithms.

SparkR

Key Changes

SparkR now includes three functions for running user-defined code: dapply, gapply, and spark.lapply. The first two apply a function to each partition or each group of a DataFrame; the last distributes a function over a list, which supports hyper-parameter tuning.
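To make the distinction among the three apply-style functions concrete, here is a rough single-machine analogy in plain Python. It is not the SparkR API: the helper names (apply_partitions, apply_groups, apply_list), the sample data, and the toy scoring function are all hypothetical, and nothing here is distributed. It only sketches the shape of each operation: once per partition, once per group, and once per list element.

```python
from collections import defaultdict
from statistics import mean


def apply_partitions(partitions, func):
    """dapply-style: run func once per partition and concatenate the results."""
    out = []
    for part in partitions:
        out.extend(func(part))
    return out


def apply_groups(rows, key, func):
    """gapply-style: group rows by a key column, then run func once per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: func(g) for k, g in groups.items()}


def apply_list(items, func):
    """lapply-style: run func once per list element, e.g. one model fit per setting."""
    return [func(item) for item in items]


# Made-up data: two partitions of (store, sales) rows.
partitions = [
    [{"store": "A", "sales": 10}, {"store": "B", "sales": 20}],
    [{"store": "A", "sales": 30}],
]
rows = [row for part in partitions for row in part]

# Per-partition function: add a derived column to every row in the partition.
with_doubled = apply_partitions(
    partitions, lambda part: [dict(r, doubled=r["sales"] * 2) for r in part]
)

# Per-group function: average sales for each store.
avg_by_store = apply_groups(rows, "store", lambda g: mean(r["sales"] for r in g))


# Hyper-parameter sweep: score a toy "model" at each candidate setting.
def toy_score(penalty):
    return -(penalty - 3) ** 2  # hypothetical objective that peaks at 3


settings = [1, 2, 3, 4]
scores = apply_list(settings, toy_score)
best = max(zip(scores, settings))[1]  # setting with the highest score
```

In Spark, each call to the user function would run on a worker rather than in a local loop, which is why the list-based variant suits tuning: every candidate setting is an independent task.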

As noted above, the SparkR API supports additional machine learning techniques and pipeline persistence. The API also supports more DataFrame functionality, including SparkSession, window functions, and read/write support for JDBC and CSV.

Explainers

Spark Summit: Xiangrui Meng explains the latest developments in SparkR. Slides here.

— Live webinar: Hossein Falaki and Denny Lee demonstrate exploratory analysis with Spark and R.

— UseR 2016: Hossein Falaki and Shivaram Venkataraman deliver a tutorial on SparkR.

Big Analytics Roundup (July 25, 2016)

We have some more summer reading this week; plus, Splice Machine announces availability of its open source Community Edition, and Google launches two new machine learning APIs. There are so many Spark stories I’ve created a special section for them. Plus we have the usual explainers, perspectives, and news.

Quant headhunter Linda Burtch repeats her survey of working analysts in her network. Preference for using SAS has steadily declined over the three years she has conducted the poll; this year a clear majority chose R or Python over SAS. Preference for open source correlates with education; the more you know, the less likely you are to use SAS.

Oracle, IBM, SAP, and Microsoft have all reported Q2 revenue and earnings, but Teradata is still crunching the numbers. I’ll do a general earnings roundup when TDC gets around to reporting its numbers. TDC’s stock price has outperformed the others since June 30, which suggests the market expects a good second quarter. Meanwhile, TDC acquires another consultancy and reveals who bought Aprimo.

Summer Reading

Adrian Colyer lists his five favorite papers from the past several months and outlines his philosophy, which you must read. And here is another link to last week’s top paper on data bazaars versus data cathedrals.

Splice Machine Shifts to Open Core

Hadoop-based RDBMS vendor Splice Machine announces general availability for its open source community edition and offers a sandbox hosted on AWS.  Sam Dean approves; Andrew Brust reports; Dave Ramel explains. Jack Germain describes Splice Machine’s changing business model.

Spark Stories

— Databricks’ Spark survey is still accepting responses. Go and fill it out if you have not done so already.

— The Spark PMC has voted favorably on a release candidate for Spark 2.0, which is now in packaging for general availability.

— On the Databricks blog, Jules Damji corrals Spark news from the past two weeks.

— Alex Woodie touts LevyxSpark, an enhanced Spark distribution based on open source Apache Spark. LevyxSpark includes some open source enhancements, plus Levyx Helium, an SSD-based key-value store.

— In a webcast, Alexander Ulanov summarizes options for deep learning on Spark.

— Sam Weaver explains how to use the new MongoDB connector for Spark.

Explainers

— Nita Dembla and Gopal Vijayaraghavan explain improvements in Hive 2.1.

— Siddharth Anand introduces Apache Airflow (Incubating), a platform to author, schedule, and monitor DAGs. Sounds like Apache Beam.

— Data Artisans’ Stephan Ewen explains savepoints in Apache Flink.

Perspectives

— Jack Clark profiles Google’s land grab in deep learning. Short version: TensorFlow is blowing away Caffe, Torch, Theano, dl4j, CNTK, and DSSTNE.

— Greg Satell theorizes about Google’s open source strategy as if a “razor and blades” strategy is something new and brilliant.

— In Fortune, Barb Darrow profiles cloud computing’s disruptive impact.

— Sam Dean confuses machine learning with artificial intelligence.

— Syncsort’s Paige Roberts interviews Dr. Ellen Friedman.

— Drew Breunig poses a theory about the business implications of machine learning.

— BuzzFeed’s Adam Kelleher attempts to explain bias, fails.

— IBM exec Rob Thomas co-authors a blog about machine learning. It’s about what you would expect from an IBM exec.

Open Source News

— Open source columnar storage engine Apache Kudu graduates to top-level status.

— Apache Chukwa announces Release 0.8, with security bug fixes, FWIW. Chukwa captures logs from distributed systems for monitoring and analysis. No, I never heard of it either.

Commercial Announcements

— Google announces open beta for its Cloud Natural Language and Cloud Speech APIs.

Hardware News

— Inspur, which claims to be China’s largest server manufacturer, announces availability of the Memory1 line of servers for big analytics. Inspur uses high-capacity flash DIMMs and memory expansion software to deliver up to 2TB of memory per server and up to 80TB per rack.

— Startup Wave Computing announces plans for a family of deep learning computers. Good luck to them. The history of computing isn’t kind to special purpose machines, which tend to eventually get buried by general purpose machines.

Funding News

— Redis Labs lands a $14 million “C” round led by Bain Capital and Carmel Ventures. Redis claims 6,200 enterprise customers and 55,000 accounts for its cloud service.

— Sift Security emerges from stealth, announces $3.25 million in angel funding. Sift uses graph analytics running on Spark and TitanDB to identify linked threats and incidents.

Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from GraphLab Dato Turi.

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. Conference is November 14-16; CFP closes September 9.

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

[Screenshot: excerpt from the OpsClarity survey report]

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.
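A back-of-the-envelope sketch of that argument, in Python. The model is deliberately simplified and is my assumption, not a description of any particular engine: events wait until their batch window closes, then the batch takes a fixed time to process. The function names and all of the numbers (a one-second batch interval, 200 ms of processing, a 30-second requirement) are illustrative only.

```python
def micro_batch_latency_ms(arrival_ms, batch_interval_ms, processing_ms):
    """Delay between an event arriving and its result being available.

    Simplified model: events are collected until the current batch window
    closes, then the whole batch is processed in processing_ms.
    """
    # Ceiling division finds the end of the window the event falls into.
    batch_close_ms = -(-arrival_ms // batch_interval_ms) * batch_interval_ms
    return (batch_close_ms - arrival_ms) + processing_ms


def meets_requirement(batch_interval_ms, processing_ms, requirement_ms):
    """True if even the unluckiest event, one that arrives just after a
    window opens, is handled within the latency requirement."""
    worst_case_ms = batch_interval_ms + processing_ms
    return worst_case_ms <= requirement_ms


# An event arriving 250 ms into a 1,000 ms window waits 750 ms, then 200 ms.
example_latency = micro_batch_latency_ms(250, 1000, 200)

# A 30-second requirement is easily met; a 500 ms requirement is not.
ok_for_30s = meets_requirement(1000, 200, 30_000)
ok_for_500ms = meets_requirement(1000, 200, 500)
```

Under these assumptions the batch interval only matters once the requirement drops below roughly the interval plus processing time, which is the subsecond territory most organizations say they do not need.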

Performance is cool. But performance tests will not settle the contest among the current spate of streaming engines. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies that report earnings quarterly. (SAS is privately held.) Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to do real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

— GraphLab Dato Turi announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.

Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents drawn from a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report.

Top Read of the Week

Guoqiang Jerry Chen et al. explain real-time data processing at Facebook. Adrian Colyer summarizes.

Explainers

— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.

Perspectives

— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost in 2014. At that rate, Talend may break even in 2020, if nothing else happens in the interim.