Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

“Spark is overhyped!” he declared. His evidence? This:

(Image: Gartner Hype Cycle slide)

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which begs the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
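The pattern of running experiments in parallel can be sketched in a few lines. The example below is a minimal illustration, not Spark code: it uses a local thread pool from Python's standard library as a stand-in for a cluster, and the `validation_error` function and its parameter grid are hypothetical placeholders for a real training run. On a Spark cluster, the same pattern would spread the grid across executors rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_error(params):
    """Hypothetical stand-in for training a model and scoring it.

    A real experiment would fit a model with these settings and
    return its error on held-out data.
    """
    learning_rate, depth = params
    # Toy error surface with its minimum at learning_rate=0.1, depth=6.
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def run_experiments(grid, max_workers=4):
    """Evaluate every parameter combination concurrently; return the best."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(validation_error, grid))
    best_error, best_params = min(zip(errors, grid))
    return best_params, best_error


# Each combination is an independent experiment, so they can run in any order.
grid = list(product([0.01, 0.1, 1.0], [2, 4, 6, 8]))
best_params, best_error = run_experiments(grid)
print(best_params, best_error)
```

Threads are used here only to keep the sketch self-contained; the point is that the experiments share no state, which is what makes them easy to distribute.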

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general-purpose data processing framework. Driven by user feedback, the Spark team has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and made many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLLib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms).

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGBoost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

The Year in SQL Engines

As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run SQL; it runs HiveQL, an SQL-like language that closely approximates SQL. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, Spark, Drill, HAWQ, and Presto also work with a wide variety of other data storage systems.

Unlike relational databases, SQL engines operate independently of the data storage system. In contrast, relational databases bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling them, on the other hand, provides greater flexibility, though at the potential loss of performance.
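One way to picture the decoupling: the query logic depends only on a stream of rows, and each storage format supplies its own reader. The sketch below is a toy illustration using Python's standard library; the `read_csv_rows`, `read_json_rows`, and `total_by_key` names and the sample data are hypothetical and are not drawn from any of the engines reviewed here.

```python
import csv
import io
import json
from collections import defaultdict


def read_csv_rows(text):
    """Storage adapter: yield rows (as dicts) from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"region": row["region"], "sales": float(row["sales"])}


def read_json_rows(text):
    """Storage adapter: yield rows (as dicts) from JSON Lines text."""
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {"region": record["region"], "sales": float(record["sales"])}


def total_by_key(rows, key, value):
    """The 'engine': a GROUP BY / SUM that knows nothing about storage."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


csv_data = "region,sales\neast,10\nwest,5\neast,2.5\n"
json_data = (
    '{"region": "east", "sales": 10}\n'
    '{"region": "west", "sales": 5}\n'
    '{"region": "east", "sales": 2.5}\n'
)

# The same query runs unchanged against either back end.
from_csv = total_by_key(read_csv_rows(csv_data), "region", "sales")
from_json = total_by_key(read_json_rows(json_data), "region", "sales")
print(from_csv, from_json)
```

The flexibility has the cost noted above: a generic reader cannot exploit storage-specific layout the way a tightly coupled database can.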

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-Engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries; mentions in online discussions; job offers; mentions in professional profiles; and tweets.

Figure 1

Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

Although Impala, Spark SQL, Drill, HAWQ, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap to the other engines. Most of the Hive alternatives launched in 2012, when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, Drill, et al. ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2

Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3

Source: Open Hub https://www.openhub.net/

In 2016, Cloudera, Hortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks, which Tony Baer summarizes. I’m sure that you will be shocked to learn that each vendor’s preferred SQL engine outperformed the others in its own study, which begs the question: are benchmarks bullshit?

AtScale’s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL back ends. The company’s software is engine-neutral (it seeks to run on as many engines as possible), and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes several improvements to Hive-on-Spark, plus performance, usability, supportability and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012, as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries.

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014, Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala;  Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure, and in October delivered Release 2.7.0, its first Apache release. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, in an Alpha release as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
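The incremental model can be illustrated with a toy example. The sketch below is plain Python, not Spark's API: it keeps a running aggregate, folds in each micro-batch as it arrives, and joins streaming events to a static lookup table. All names (`RunningQuery`, `PRODUCT_CATEGORY`, the event fields) are hypothetical, and it makes no attempt to model checkpointing or fault tolerance.

```python
from collections import Counter

# Static source: a lookup table that does not change while the query runs.
PRODUCT_CATEGORY = {"wings": "chicken", "tenders": "chicken", "burger": "beef"}


class RunningQuery:
    """Maintains units sold per category, updated as batches arrive."""

    def __init__(self, lookup):
        self.lookup = lookup
        self.totals = Counter()

    def process_batch(self, events):
        """Fold one micro-batch of events into the running result."""
        for event in events:
            # Join the streaming event to the static table.
            category = self.lookup.get(event["product"], "unknown")
            self.totals[category] += event["units"]
        return dict(self.totals)


query = RunningQuery(PRODUCT_CATEGORY)
first = query.process_batch([{"product": "wings", "units": 3},
                             {"product": "burger", "units": 1}])
second = query.process_batch([{"product": "tenders", "units": 2},
                              {"product": "salad", "units": 4}])
print(first)
print(second)
```

Each call returns the full, updated result rather than a result for that batch alone, which mirrors the idea of a query whose answer is refreshed as new data arrives.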

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ 2.0.0.0 in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.

Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases, and proprietary file systems (such as Amazon Web Services’ S3). Presto queries can federate data from multiple sources. Users can submit queries from C, Java, Node.js, PHP, Python, R and Ruby.
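A federated query joins tables that live in different systems within a single SQL statement. The sketch below imitates that idea with Python's built-in `sqlite3` module by attaching two separate in-memory databases to one connection; it is an analogy, not Presto itself, and the database, table, and column names are hypothetical. In Presto, the corresponding mechanism is a connector per data source, with tables addressed as `catalog.schema.table`.

```python
import sqlite3

# One connection plays the role of the query engine.
conn = sqlite3.connect(":memory:")

# Two independent in-memory databases stand in for separate data sources.
conn.execute("ATTACH DATABASE ':memory:' AS orders_db")
conn.execute("ATTACH DATABASE ':memory:' AS crm_db")

conn.execute("CREATE TABLE orders_db.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm_db.customers (id INTEGER, name TEXT)")

conn.executemany("INSERT INTO orders_db.orders VALUES (?, ?)",
                 [(1, 20.0), (2, 5.0), (1, 7.5)])
conn.executemany("INSERT INTO crm_db.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])

# A single query joins data across both sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders_db.orders AS o
    JOIN crm_db.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
conn.close()
```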

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project. Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and Zoomdata, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top-level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to production. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine — I can see the facepalms — so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”

Big Analytics Roundup (May 16, 2016)

This week we have more insight into Spark 2.0, scheduled for release just before Spark Summit 2016. (Yes, I’m going.) Also, kudos to BI-on-Hadoop startup AtScale for a new round of funding; Amazon releases YADLF (Yet Another Deep Learning Framework); and there are a number of new faces at H2O.ai.

Plus, we have an extended review of the Palantir story.

Buzzfeed on Palantir

Last week, I deemed Buzzfeed’s story on Palantir too dumb to link. (“Forget it, Jake. It’s Buzzfeed.”) Buzzfeed “news” reporter William Alden, who was all over a story about maggots in Facebook lunches, breathlessly mines a cache of “secret internal documents” and discovers:

  • Palantir expects employee turnover of around 20% for 2016.
  • Palantir lost some clients.
  • Palantir books more work than it bills.

Does Palantir have an employee turnover problem?  No. A 20% turnover rate is slightly above the 17% reported for all industries in 2015, and about on track for Silicon Valley. (There are companies in SV with 100% turnover rates.) On Glassdoor, employees give Palantir high marks.

Does Palantir have a client retention problem? Not exactly. The story cites four clients (American Express, Coca-Cola, Kimberly-Clark and Nasdaq) who engaged Palantir to conduct a pilot, then decided not to proceed with a long-term contract. In other words, these were lost sales, not cancelled contracts. The document Buzzfeed obtained is Palantir’s won/lost analysis, which shows that the company is attempting to learn from its lost sales.

Does Palantir have a revenue problem? No. Palantir’s 2015 revenue was up 50% from the previous year. Buzzfeed obsesses over the difference between Palantir’s bookings of $1.7 billion and its revenue of $420 million. A high book-to-bill ratio  is typical for consultancies that pursue large multi-year projects; it is a sign of strong demand for the company’s services. Under GAAP accounting, companies can accrue revenue only as work is performed, even if they bill the work in advance. Note that consulting giant Accenture’s bookings exceed its revenue for its most recent quarter.

Does Palantir have a profitability problem? Possibly. Buzzfeed reports that the company lost $80 million last year on revenue of $420 million. Consulting margins tend to be fairly high, so a loss means that Palantir is “investing” in a lot of unbillable work. It’s hard to say if these “investments” will pay off. Palantir closed another round of funding in December, 2015, so people with more and better information than Buzzfeed obviously think they will, and are backing up their belief with cash.
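The figures cited above can be checked with simple arithmetic. The snippet below uses only the numbers Buzzfeed reported (bookings of $1.7 billion, revenue of $420 million, a loss of $80 million); the variable names are illustrative.

```python
# Figures as reported in the Buzzfeed story, in millions of dollars.
bookings = 1700.0
revenue = 420.0
net_loss = 80.0

# Book-to-bill: how much work was booked per dollar of revenue recognized.
book_to_bill = bookings / revenue

# Net margin: the loss expressed as a share of revenue.
net_margin = -net_loss / revenue

print(f"book-to-bill: {book_to_bill:.2f}")
print(f"net margin: {net_margin:.1%}")
```

A ratio of roughly four means Palantir booked about four dollars of work for each dollar of revenue it recognized, while the loss works out to a little under a fifth of revenue.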

By the way, you know who has an actual revenue problem? Buzzfeed.

Roger Peng attempts to draw lessons for data scientists from the Buzzfeed story, without questioning its premises. He should stick to Biostatistics.

Spark 2.0

— Databricks announces preview of Apache Spark 2.0 on Databricks Community Edition.

— From last week: Reynold Xin explains what’s new in Spark 2.0.

— Dave Ramel summarizes the new features, including faster SQL; consolidation of the Dataset and DataFrame APIs; support for ANSI (2003) SQL; and Structured Streaming, an integrated view of tables and streams.

— Now that Spark 2.0 is in preview, MapR offers Spark 1.6.1.

Explainers

— Four from Adrian Colyer:

— Richard Williamson explains how to build a streaming prediction engine with Spark, MADlib, Kudu and Impala.

— On the Cloudera Vision blog, Santosh Kumar explains Hive-on-Spark.

— DataStax’ Dani Traphagen explains data processing with Spark and Cassandra.

— In ZDNet, Andrew Brust explains Microsoft’s R strategy, and gets it right.

Perspectives

— For a planted article in Linux.com, Pam Baker interviews IBM’s Mike Breslin, who answers questions nobody is asking about using Spark and Cloudant.

— Joyce Wells recaps a presentation by Booz Allen’s Jair Aguirre, who touts Apache Drill.

— Alex Woodie attends the Apache: Big Data 2016 conference and discovers open source projects.

— In Business Insider, Sam Shead describes FBLearnerFlow, a workbench for machine learning and AI.

— Leslie D’Monte describes some ways companies use machine learning in their operations.

Open Source Announcements

— Google announces release to open source of SyntaxNet, a framework for natural language understanding. Included in the release: an English parser dubbed Parsey McParseface. Journalists respond to the latter like dogs to a squirrel.

— Amazon releases yet another deep learning framework, this one branded as “Deep Scalable Sparse Tensor Network Engine (DSSTNE)” or “Destiny”. Stephanie Condon reports.

— Salesforce donates PredictionIO to Apache.

— Apache Storm announces two new maintenance releases:

  • Storm 0.10.1 has bug fixes.
  • Storm 1.0.1 has performance improvements and bug fixes.

— Apache Flink announces Release 1.0.3, with bug fixes and improved documentation.

— Apache Apex pushes a release to resolve a security issue.

Commercial Announcements

— BI-on-Hadoop startup AtScale announces an $11 million “B” round. Media coverage here.

— H2O.ai announces new hires with a strong orientation towards visualization, suggesting the company plans to add a more robust user interface to its best-in-class machine learning engine.

Big Analytics Roundup (April 4, 2016)

Strata + Hadoop World sparks a number of commercial announcements: AtScale has a new release, Microsoft previews R Server on HDInsight, and IBM puts Spark on a mainframe, FWIW. We also have a nice harvest of explainers and perspectives.

Slides from Strata available here.

The folks at Domino Data ask: Is XGBoost 10X faster than H2O? We’ll never know the answer, since they took down the post. I’m guessing the answer is “no.”

Databricks offers a collection of popular blog posts on Apache Spark as an eBook.

Explainers

On the Google Cloud Big Data Blog, Eric Anderson and Marian Dvorsky compare autoscaling in Dataflow/Beam to Spark and Hadoop. (h/t William Vambenepe)

Miles Yucht and Reynold Xin explain DeepSpark, a convolutional neural network that automates software development processes, such as writing test cases, fixing bugs and so forth.

Databricks’ Jules Damji explains how to process JSON data with Spark Datasets and DataFrames.

On the Airbnb engineering blog, Ricardo Bion explains how to scale data science with R.

Eduardo Ariño De La Rubia explains how The Climate Corporation created a high-throughput data science machine.

DataArtisans’ Kostas Tzoumas explains Flink internals, and how Flink counts elements in streams.

On the Insight Data Engineering blog, Daniel Blazevski explains Flink quadtrees.

H2O.ai’s Erin LeDell explains scalable ensemble learning with H2O. Also at Strata, Arno Candel explains why Deep Learning is eating your lunch.

On the Dataiku blog, someone named Margot explains automated model deployment with Data Science Studio.

On the DataTorrent blog, David Yan explains latency calculations in Apache Apex.

Christopher Crosbie explains SparkR on EMR, on the AWS Big Data blog.

Perspectives

Jack Vaughan notes the prominence of streaming analytics at Strata, quotes some old guy who thinks streaming is a thing.

On the Cloudera Vision Blog, Dan Sturman describes Cloudera’s response to what he characterizes as a software quality challenge.

Cloud vendor Altiscale’s Raymie Stata asks which is best for Spark and Hadoop: cloud or on-premises. Spoiler: he thinks you should choose cloud.

On LinkedIn, consultant Rick van der Lans touts Apache Drill.

Wikibon releases forecasts of Spark adoption and the Big Data market. You can either pay Wikibon for a subscription, or read George Leopold’s summary here or Mike Wheatley’s summary here.

Alex Woodie recaps Doug Cutting’s keynote at Strata + Hadoop World.

On the tech blog for Berlin-based online retailer Zalando, Javier Lopez and Mihail Vieru recap a recently completed Flink versus Spark bakeoff. They like Flink’s low latency which, as a fashion retailer, they totally think they need. The bottom line, though, seems to be that DataArtisans is just a few stops away on the U-Bahn, so they chose Flink.

Brandon Butler summarizes the Microsoft and Google challenges to Amazon in the cloud.

InfoWorld’s Martin Heller reviews Databricks’ Spark service, likes it.

In TechCrunch, Josh Klahr lists seven things to watch for at Strata + Hadoop World, which is still worth reading even though the show came and went.

Talend CMO Ashley Stirrup suggests you sharpen your customer reflexes with Apache Spark. If you want to improve your actual reflexes, read this.

Open Source Announcements

ASF announces Apache NiFi 0.6.0, with Kerberos authentication for its REST API and support for Amazon Kinesis, AWS Lambda, Splunk, and Apache Cassandra. (h/t Hadoop Weekly)

Commercial Announcements

OLAP-on-Hadoop vendor AtScale announces release 4.0. Key new bits: fine-grained security that links every query to an end user, and an intelligent query optimizer that pushes down either as SQL or as MDX depending on the end-user tool. AtScale has also expanded its platform integration: it now supports Business Objects, Cognos, Excel, Jaspersoft, Qlik, MicroStrategy, PowerBI, Spotfire, and Tableau on CDH, HDP, HDInsight and MapR with Hive/Tez, Impala and Spark SQL, plus an impressive list of data storage formats. Mike Wheatley reports.

Data integration startup Tamr announces “compatibility” with Spark. The press release does not specify whether that means connectivity, push-down integration or something else. Tamr is not certified by Databricks, and has not published anything on Spark Packages.

Pouring new wine into old bottles, IBM delivers Spark on a mainframe, as promised last July.  IBM touts this as a way to perform analysis of your data “in place”, which is great if all of your data is stuck on a mainframe.

IBM partners with Lightbend, the company formerly known as Typesafe, to deliver Scala training through the Big Data University.

Altiscale announces partnership with Tableau, will add visualization to its managed service for Big Data.

Databricks announces availability of APIs to automate Spark infrastructure. On the Databricks blog, Dave Wang explains.

Microsoft announces preview of R Server for HDInsight and an update to Apache Spark for Azure HDInsight. R Server for HDInsight is a rebranded version of Revolution Analytics’ ScaleR, which Microsoft acquired last year. R Server is a distributed machine learning platform with push-down integration to MapReduce and Spark and an R API.

Flink promoter DataArtisans announces a 5.5 million Euro Series A financing round led by Intel Capital.

Dataiku announces a new release of Data Science Studio. The press release touts some new features, but I’ll refrain from commenting until the company posts release notes.

Big Analytics Roundup (March 21, 2016)

Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.

— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.

— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:

  • Most popular technologies for math and data: Python and SQL.
  • Top paying technologies: Spark and Scala.
  • Top paying tech for data scientists: Scala, Spark and Hadoop.
  • Top tech stack for data scientists: Python + R + SQL.
  • Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
  • Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
  • Biggest challenge at work (all respondents): Unrealistic expectations.
  • Purchasing power of developers in South Africa: 25,713 Big Macs per year.

— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.

— Daniel Chalef of Domino Data analyzes data from Google Trends and StackOverflow, and discovers that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchen’s blog on the popularity of analytics software. Search interest is one data point; Bob’s work with job postings offers a better picture of the actual state of the market.

— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.

Explainers

— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.

— This week from the morning paper:

  • Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
  • … explains social engineering attacks and potential defenses.
  • … explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scalability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.
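
To put that sublinear scaling in numbers, here is a minimal sketch in Python. It takes the roughly 12X speedup on 32 nodes quoted above at face value; the function name `parallel_efficiency` is made up for illustration and is not from the paper.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return speedup per node; 1.0 would mean perfectly linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Figures quoted above: a bit under 12X on 32 nodes, so this is an upper bound.
efficiency = parallel_efficiency(speedup=12.0, nodes=32)
print(f"Parallel efficiency: at most {efficiency:.1%}")
```

In other words, each added node contributes well under half of what a perfectly linear system would deliver.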

— MapR’s Tugdual Grall explains what Spark is, what it does, and what sets it apart.

— In SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.

— On the Databricks blog, Denny Lee et al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.

— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.

— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.

Perspectives

— Jack Vaughan interviews some old guy who thinks Spark is a thing.

— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.

— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.

— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design.  Pure Hadoop is simple: MapReduce and HDFS.  Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.

— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python.  He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.

— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.

— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.

— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”

— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.

— In Martech Advisor, Ankush Gupta Mar interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.

— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.

— Srini Penchikala interviews the authors of Spark in Action, an upcoming book from Manning.

Teradata Watch

— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.

— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”

Open Source Announcements

— Three announcements from Apache projects:

  • Apex announces release 3.3.1 of the Malhar library, a maintenance release.
  • Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
  • Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.

Commercial Announcements

— SAP announces general availability for SAP HANA Vora, a tool that enables HANA users to query data in Hadoop and other distributed storage platforms through Spark. In CIO, Thor Olavsrud reports.

— Dataiku announces that it has hired two new Veeps to drive expansion in North America.

— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.

— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.

Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week beating the drum for its partnership with Hewlett-Packard Enterprise.  The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow.  We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.

Explainers

In the morning paper, Adrian Colyer explains trajectory data mining.

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates all on the state of the Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with DataTorrent Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender with the Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.

Perspectives

We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from CapitalOne, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and gives it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL, offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for H2O.ai.

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.
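
A back-of-the-envelope check on those figures, sketched in Python. It assumes a constant burn rate and no new revenue or financing, which is a simplification of mine and not anything from the filing; since the company spent more than $100 million, the result is an upper bound.

```python
def months_of_runway(cash_on_hand: float, spent: float, months: float) -> float:
    """Estimate months of cash left, assuming the burn rate stays constant."""
    if spent <= 0 or months <= 0:
        raise ValueError("spent and months must be positive")
    monthly_burn = spent / months
    return cash_on_hand / monthly_burn


# Figures from the paragraph above, in millions of dollars.
runway = months_of_runway(cash_on_hand=25.0, spent=100.0, months=9.0)
print(f"Runway at a constant burn rate: about {runway:.2f} months")
```

Under those assumptions the cash lasts a little over two months, which is some distance from twelve.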

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.

Big Analytics Roundup (February 29, 2016)

Happy Leap Day.  Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week.  The Alluxio announcement has inspired big thinkers to share big thoughts.  And, we have a nice crop of explainers.  Scroll down to the bottom for another SQL on Hadoop benchmark.

Explainers

— In SearchDataManagement, Jack Vaughan explains Spark 2.0.

— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.

— MapR’s Jim Scott explains Spark accumulators.   Jim also explains Spark Streaming.

— DataArtisans’ Fabian Hueske introduces Flink.

— In SlideShare, Julian Hyde explains streaming SQL.

— Wes McKinney explains why pandas users should be excited about Apache Arrow.

— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.

— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).

Perspectives

— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark.

— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.

— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop.  She means Hadoop in a general way, e.g. the Hadoop ecosystem.

— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group and asks him questions about data science.  Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”

— Alpine Data Labs’ Steven Hillion ruminates on success.  He’d be better off ruminating on “how to raise your next round of venture capital.”

— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.

— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder.  He’s just trolling.

— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense.   He further opines that Flink will displace Spark for streaming, which doesn’t make sense.   Shore interviews IBM’s Nagui Halim about streaming here.

Open Source Announcements

— Alluxio (née Tachyon) announces Release 1.0.0.  Alluxio is open source software distributed through Git under an Apache license, but is not an Apache project.  Yet.  Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin.  Daniel Gutierrez reports.

— Yahoo releases CaffeOnSpark, a distributed deep learning package.  Caffe is one of the better-known deep learning packages, with a track record in image recognition.  Software is available on Git.  For more information, see the Wiki.  Alex Handy reports; Charlie Osborne reports.

— RapidMiner China announces availability of an extension for deep learning engine DL4J.  The extension is open source, and works with the open source version of RapidMiner.  DL4J sponsor Skymind collaborated.

Commercial Announcements

— Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.

— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.

Funding Announcements

— Health analytics vendor Health Catalyst lands a $70M Series E round.

AtScale Benchmarks SQL-on-Hadoop Engines

On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (1.2/0.7), Cloudera Apache Impala (2.3) and Spark SQL (1.6).  The AtScale team tested Impala and Spark with Parquet and Hive on Tez with ORC.  For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.
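
The method described, running each query several times per engine and averaging, can be sketched in a few lines of Python. This is not AtScale’s harness: `run_query` is a hypothetical stand-in for whatever client submits SQL to Hive, Impala or Spark SQL, the stub below does no real work, and the default of three repeats is an assumption, since the post only says “multiple times.”

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(
    run_query: Callable[[str, str], None],
    engines: List[str],
    queries: List[str],
    repeats: int = 3,  # assumed; the source does not give the number of runs
) -> Dict[str, Dict[str, float]]:
    """Return the mean wall-clock seconds for each (engine, query) pair."""
    results: Dict[str, Dict[str, float]] = {}
    for engine in engines:
        results[engine] = {}
        for query in queries:
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                run_query(engine, query)  # hypothetical: submit the SQL and wait
                timings.append(time.perf_counter() - start)
            results[engine][query] = mean(timings)
    return results


def fake_run_query(engine: str, query: str) -> None:
    """Stub that does nothing; replace with a real client call."""


averages = benchmark(
    fake_run_query,
    engines=["hive_on_tez", "impala", "spark_sql"],
    queries=[f"q{i}" for i in range(1, 14)],  # 13 queries, as in the test
)
```

Averaging smooths out run-to-run noise, but it also hides cold-cache effects on the first run, which is worth keeping in mind when reading any benchmark of this kind.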

While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet.  Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well.  AtScale did not respond to queries on this point.

Key findings:

  • All three engines performed about the same on single-table queries, and on queries joining three small tables.
  • Spark and Impala ran faster than Hive on queries joining three large tables.
  • Spark ran faster than Impala on queries joining four or more tables.

The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.

In concurrency testing, Impala performed much better than Hive or Spark.

Details of the test available in a white paper here (registration required).

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLlib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLlib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the sinking of the Titanic]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software is nothing special; there are plenty of open source alternatives, like Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

2015 in Big Analytics

Looking back at 2015, a few stories stand out:

  • Steady progress for Spark, punctuated by two big announcements.
  • Solid growth in cloud-based machine learning, led by Microsoft.
  • Expanding options for SQL and OLAP on Hadoop.

In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April.  I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.

IBM Embraces Spark

IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit.  IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base.  It also brings a clutter of aging and partially integrated products, an army of suits and no fewer than 164 Vice Presidents whose titles include the words “Big Data.”

When IBM announced its Spark initiative, I joked that somewhere in the bowels of IBM, someone would want to put Spark on a mainframe.  Color me prophetic.

It’s too early to tell what substantive contributions IBM will make to Spark.  Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September.  This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.

All that said, IBM brings respectability, and the assurance that Spark is ready for prime time.  This is priceless.  Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.

Cloudera Announces “One Platform” Initiative

In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated.  As with the IBM announcement, the symbolism matters.  Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true.  It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.

The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop.  According to Databricks’ recent Spark user survey, only 40% use Spark under YARN, with the rest running in a freestanding cluster or on Mesos.  It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish.  But if Cloudera thinks “One Platform” will stem that tide, it is mistaken.  It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.

Microsoft Builds Credibility in Analytics

In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics.  The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets.  Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.

Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML).   First released in beta in June 2014, AML is both easy to use and powerful.  The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts.  Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.

Azure Machine Learning contrasts markedly with Amazon Machine Learning.  Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love.  Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon.  If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.

SQL Engines Proliferate

At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL.  Spark SQL was still in Alpha; Drill was a curiosity; and Presto was something used at Facebook.

Several things happened during the year:

  • Hive on Tez established rough performance parity with the fast SQL engines.
  • Spark SQL went to general release, stabilized, and rolled out the DataFrames API.
  • MapR promoted Drill, and invested in improvements to the software.  Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
  • Cloudera donated Impala to open source, and Pivotal donated Hawq.
  • Teradata placed its chips on Presto.

While it’s great to see so many options emerge, Hive continues to win actual evaluations.  Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance.  Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.

The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.

OLAP on Hadoop Gets Real

For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options.  The open source option, Apache Kylin, originally an eBay project, just recently graduated to Apache top level status.  Adoption is limited at present, but any project used by eBay and Baidu is worth a look.

The commercial option is AtScale, a company that emerged from stealth in April.  Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools.  It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.

Funding for Machine Learning

H2O.ai’s recently announced B round is significant for a couple of reasons.  First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.

Like Sherlock Holmes’ dog that did not bark, two companies are significant because they did not procure funding in 2015:

  • Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times.  It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
  • Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem.  Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.

Palantir continued to suck up capital like a whale feeding on krill.

Google TensorFlow

Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.

Big Analytics Roundup (October 26, 2015)

Fourteen stories this week, beginning with an announcement from IBM.  This week, IBM celebrates 14 straight quarters of declining revenue at its IBM Insight conference, appropriately enough at the Mandalay Bay in Vegas, where the restaurants are overhyped and overpriced.

Meanwhile, the first Spark Summit Europe meets in Amsterdam, in the far more interesting setting of the Beurs van Berlage.  There will be a live stream on Wednesday and Thursday — details here.  Sadly, I can’t make this one — the first Spark Summit I’ve missed — but am looking forward to the live stream.

(1) IBM Announces Spark on Bluemix

At its IBM Insight beauty show, IBM announces availability of its Apache Spark cloud service.  Actually, IBM announced it back in July, but that was a public beta.   On ZDNet, Andrew Brust gushes, noting that IBM has DB2, Watson, Netezza, Cognos, TM1, SPSS, Informix and Cloudant in its portfolio.  He fails to note that of those products, exactly one — Cloudant — actually interfaces with Spark.

There were rumors that IBM would have an exciting announcement about Spark at this show, but if this is it — yawn.  Looking at IBM’s “Spark in the cloud” offering, I don’t see anything that sets it apart from other available offerings unless you have a Blue fetish.

Update: Rod Reicks of IBM writes to note that IBM’s new release of SPSS Analytics Server runs processes in Spark.  For the uninitiated, Analytics Server is a product you license from IBM that enables SPSS Modeler users to run selected operations in Hadoop.  Previous versions ran through MapReduce only.  Reicks claims that the latest version runs through Spark when available.

I say “claims” because there is no reference to this feature in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting.  So the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

You’d think that with IBM’s armies of people they could at least find someone to write documentation.

(2) Mahout Book FAIL

Packt announces a book on Clustering with Mahout with an entire chapter devoted to Canopy Clustering, which the Mahout team just deprecated.

(3) Concurrent Adds Spark Support

Concurrent announces Release 2.0 of Driven, its oddly-named performance management software, which now includes support for Apache Spark.

(4) Flink Founder Touts Streaming Analytics

At Big Data Spain, Data Artisans co-founder Kostas Tzoumas argues that streaming is the basis for all analytics, which is a bit over the top: as they say, if all you have is a hammer, the world looks like a nail.  Still, his deck is a nice intro to Flink, which has made some progress this year.

(5) AtScale Announces Release 3.0

AtScale, one of the more interesting startups in the BI space, delivers Release 3.0 of its OLAP-on-Hadoop platform.  Rather than introducing a new user interface into the mix, AtScale makes it possible for BI users to work with Hadoop tables without jumping back and forth to programming tools.  The product currently supports Tableau, Excel, Qlik, Spotfire, MicroStrategy and JasperSoft, and runs on CDH, HDP or MapR with Impala, Spark SQL or Hive on Tez.  The new release includes enhanced role-based security, including Kerberos, Username/Password or LDAP.

(6) Neo: Graphs are Eating the World

Graph database leader Neo announces immediate availability of Neo4j 2.3, which includes what it calls “intelligent applications at scale” and Docker support.  Exactly what Neo means by “intelligent applications at scale” is unclear, but if Neo is claiming that you no longer have to dump a graph into Spark to run a PageRank, I’ll believe it when I see it.

(7) New Notebook Sharing for Databricks 

Databricks announces new notebook sharing capabilities for its eponymous product.  On the Databricks blog, Denise Li and Dave Wang explain.

(8) Teradata: Blah, Blah, Blah, IoT, Blah, Blah Blah

At its annual user conference, Teradata announces that it’s heard about IoT.    Teradata also announces that it will make Aster available on Hadoop, which would have been interesting in 2012.  Aster, for the uninitiated, includes a SQL on MapReduce engine, which is rendered obsolete by fast SQL engines like Presto, which Teradata has just embraced.

(9) Flink Forward Redux

As I noted last week, the first Flink Forward conference met in Berlin two weeks ago.  William Benton records his impressions.

Presentations are here.  Some highlights:

  • Dongwon Kim benchmarks Flink against MR, MR on Tez and Spark.  Flink wins.
  • Kostas Tzoumas outlines the Flink development roadmap through Release 1.0.
  • Martin Junghanns explains graph analytics with Flink.
  • Anwar Rizal demonstrates streaming decision trees with Flink.

Henning Kropp offers resources for diving deeply into Flink.

(10) Pyramid Analytics Lands New Funding

Amsterdam-based BI startup Pyramid Analytics announces a $30 million “B” round to help it try to explain why we need more BI software.

(11) Harte Hanks Switches from CDH to MapR

John Leonard explains why Harte Hanks switched from Cloudera to MapR.  Most likely explanation: they were able to cut a cheaper deal with MapR.

(12) Audience Modeling with Spark

Guest posting on the Databricks blog, Eugene Zhulenev explains audience modeling with Spark ML pipelines.

(13) New Functions in Drill

On the MapR blog, Neeraja Rentachintala describes new capabilities in Drill Release 1.2, including SQL window functions.

(14) Integrating Spark and Redshift

“Redshift is where data goes to die.”  — Rob Ferguson, Spark Summit East

On the Databricks blog, Sameer Wadkar of Axiomine explains how to use the spark-redshift package, first introduced in March of this year and now in version 0.5.2.  So you can yank your data out of Redshift and do something with it. (h/t Hadoop Weekly)