Big Analytics Roundup (August 24, 2015)
Lots of Mesos news this week (thanks to MesosCon in Seattle), including reports that Microsoft wants to acquire Mesosphere.
Rashid Jamal surveys the battle space for the next-generation big data analysis framework. Good overview of how some of the top projects and vendors are positioning themselves.
On LinkedIn, Bernard Marr reports on the “top ten” Hadoop distributions, including Cloudera, Amazon Web Services, Hortonworks, MapR, IBM BigInsights, Microsoft HDInsight, Intel, DataStax, Teradata and Pivotal HD. He should have stopped after MapR: IBM, Pivotal, Microsoft and Teradata are all OEM versions of Hortonworks HDP; Intel OEMs Cloudera (and so does Oracle, which he does not mention); and while Cassandra runs in Hadoop, DataStax does not support a Hadoop distribution.
On ZDNet, Andrew Brust touts five open source projects to watch: Flink, Samza, Twill, Ibis and Mahout-Samsara. He’s right about Flink; Samza is DOA; Ibis is simply a Python interface to Impala; Twill is a long shot. You can stick a fork in Mahout; it’s done.
The AllAnalytics blog is an astroturf “community-building” effort funded by SAS. Robert Allison, SAS’ “Graph Guy,” just discovered that there are a lot of Apache Hadoop releases, so he uses SAS to graph them, FWIW. The frequent release cadence is one reason most enterprises prefer to use one of the commercial Hadoop distributions and not the Apache distribution.
Ted Dunning complains that some open source projects aren’t really open, noting that some projects are dominated by a single vendor. This sounds to me like wishful thinking. Like socialism, “community” software development does not work in practice — in the end, you get a bazaar-like mashup (e.g. the R project’s 6,000 packages) or a mess (e.g. Mahout). Spark is successful largely due to the clear leadership provided by Databricks; the same is true for Kafka and Confluent; and if Drill takes off it will be due largely to the support and backing of Dunning’s own MapR. Indeed, Hadoop would be little more than a science project today without the efforts of Cloudera, MapR and Hortonworks. The future of open source software is vendor-driven.
On KDnuggets, Techalpine’s Kaushik Pal summarizes Drill for newbies.
Apache Flink/Data Artisans
Slim Baltagi and Srini Palthepu of Capital One deliver a crash course on Flink via SlideShare.
In Business Insider, Matt Weinberger reports on Apple’s use of Mesos to make Siri run faster and cheaper.
At the MesosCon conference in Seattle, Basho and Cisco demonstrate the Riak key-value datastore on Mesos. Alex Woodie reports.
Matt Asay claims that Spark is doomed, asks what will supplant it, fails to answer. This is like pointing out that eventually the sun will flame out and life on Earth will end; it may be true, but in the short term we still have to do the dishes.
Huawei debuts three new models of its FusionServer data center servers with built-in Spark accelerators, claims 100X SQL performance boost versus commodity servers.
IBM continues to bang the drum for Spark on a mainframe. Yawn. Sorry, IBM, but one of the key selling points of Spark and the Hadoop ecosystem is the ability to avoid having to deal with the likes of IBM.
On the Dato blog, Krishna Sridhar delivers an introduction to distributed machine learning, touts Dato Distributed.
- Erin LeDell explains scalable machine learning with R and H2O.
- Chen Huang and Erin LeDell offer an intro to data science.