2015 in Big Analytics
Looking back at 2015, a few stories stand out:
- Steady progress for Spark, punctuated by two big announcements.
- Solid growth in cloud-based machine learning, led by Microsoft.
- Expanding options for SQL and OLAP on Hadoop.
In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April. I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.
IBM Embraces Spark
IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit. IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base. It also brings a clutter of aging and partially integrated products, an army of suits and no fewer than 164 Vice Presidents whose titles include the words “Big Data.”
When IBM announced its Spark initiative I joked that somewhere in the bowels of IBM, someone would want to put Spark on a mainframe. Color me prophetic.
It’s too early to tell what substantive contributions IBM will make to Spark. Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September. This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.
All that said, IBM brings respectability, and the assurance that Spark is ready for prime time. This is priceless. Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.
Cloudera Announces “One Platform” Initiative
In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated. As with the IBM announcement, the symbolism matters. Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true. It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.
The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop. According to Databricks’ recent Spark user survey, only 40% of respondents use Spark under YARN, with the rest running in a freestanding cluster or on Mesos. It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish. But if Cloudera thinks “One Platform” will stem that tide, it is mistaken. It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.
Microsoft Builds Credibility in Analytics
In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics. The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets. Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.
Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML). First released in beta in June 2014, AML is both easy to use and powerful. The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts. Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.
Azure Machine Learning contrasts markedly with Amazon Machine Learning. Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love. Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon. If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.
SQL Engines Proliferate
At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL. Spark SQL was still in alpha; Drill was a curiosity; and Presto was something used at Facebook.
Several things happened during the year:
- Hive on Tez established rough performance parity with the fast SQL engines.
- Spark SQL went to general release, stabilized, and rolled out the DataFrames API.
- MapR promoted Drill, and invested in improvements to the software. Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
- Cloudera donated Impala to Apache, and Pivotal donated Hawq.
- Teradata placed its chips on Presto.
While it’s great to see so many options emerge, Hive continues to win actual evaluations. Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance. Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.
The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.
OLAP on Hadoop Gets Real
For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options. The open source option, Apache Kylin, originally an eBay project, recently graduated to Apache top-level status. Adoption is limited at present, but any project used by eBay and Baidu is worth a look.
The commercial option is AtScale, a company that emerged from stealth in April. Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools. It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.
Funding for Machine Learning
H2O.ai’s recently announced B round is significant for a couple of reasons. First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.
Like Sherlock Holmes’ dog that did not bark, two companies are significant because they did not procure funding in 2015:
- Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times. It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
- Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem. Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.
Palantir continued to suck up capital like a whale feeding on krill.
Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.