Big Analytics Roundup (August 24, 2015)

Lots of Mesos news this week (thanks to MesosCon in Seattle), including reports that Microsoft wants to acquire Mesosphere.

Rashid Jamal surveys the battle space for the next generation big data analysis framework.  Good overview of how some of the top projects and vendors are positioning themselves.

On LinkedIn, Bernard Marr reports on the “top ten” Hadoop distributions, including Cloudera, Amazon Web Services, Hortonworks, MapR, IBM BigInsights, Microsoft HDInsight, Intel, Datastax, Teradata and Pivotal HD. He should have stopped after MapR: IBM, Pivotal, Microsoft and Teradata are all OEM versions of Hortonworks HDP; Intel OEMs Cloudera (and so does Oracle, which he does not mention); and while Cassandra runs in Hadoop, Datastax does not support a Hadoop distribution.

On ZDNet, Andrew Brust touts five open source projects to watch: Flink, Samza, Twill, Ibis and Mahout-Samsara.  He’s right about Flink; Samza is DOA; Ibis is simply a Python interface to Impala; Twill is a long shot.  You can stick a fork in Mahout, it’s done.

The AllAnalytics blog is an astroturf “community-building” effort funded by SAS. Robert Allison, SAS’ “Graph Guy,” just discovered that there are a lot of Apache Hadoop releases, so he uses SAS to graph them, FWIW. The frequent release cadence is one reason most enterprises prefer to use one of the commercial Hadoop distributions and not the Apache distribution.

Ted Dunning complains that some open source projects aren’t really open, noting that some projects are dominated by a single vendor.   This sounds to me like wishful thinking.  Like socialism, “community” software development does not work in practice — in the end, you get a bazaar-like mashup (e.g. the R project’s 6,000 packages) or a mess (e.g. Mahout).   Spark is successful largely due to the clear leadership provided by Databricks; the same is true for Kafka and Confluent; and if Drill takes off it will be due largely to the support and backing of Dunning’s own MapR.  Indeed, Hadoop would be little more than a science project today without the efforts of Cloudera, MapR and Hortonworks.  The future of open source software is vendor-driven.

Via the morning paper, a series of papers from KDD 2015. First up: a team of seven scientists uses machine learning to identify “at-risk” students.

Apache Drill

On KDnuggets, Techalpine’s Kaushik Pal summarizes Drill for newbies.

Apache Flink/Data Artisans

Slim Baltagi and Srini Palthepu of CapitalOne deliver a crash course on Flink via Slideshare.

Apache Mesos/Mesosphere

Verizon selects Mesosphere DCOS, the datacenter operating system built on the open source Apache Mesos cluster manager. Additional story here.

In Business Insider, Matt Weinberger reports on Apple’s use of Mesos to make Siri run faster and cheaper.

Microsoft announces partnership with Mesosphere to port Mesos to Windows Server.  TechCrunch reports rumors that Microsoft wants to buy Mesosphere.

At the MesosCon conference in Seattle, Basho and Cisco demonstrate the Riak key-value datastore on Mesos.  Alex Woodie reports.

Mesosphere announces plans for Mesosphere Infinity, an open source stack for streaming analytics, which will include Mesos, Akka, Spark, Kafka and Cassandra.  On ZDNet, Toby Wolpe explains why.

Apache Spark/Databricks

Spark 1.5 will be released soon; on Brighttalk, Spark release manager Patrick Wendell summarizes the new features.  A Spark 1.5 preview is now available in Databricks.

Oleksii Sliusarenko and Kevin McIntire of Grammarly Labs explain how they use Spark on EC2 with S3. This is a good example of Spark running standalone, without Hadoop. (h/t Hadoop Weekly).

On the MapR blog, Ted Dunning recaps his session on streaming analytics from the 2015 Spark Summit.

Matt Asay claims that Spark is doomed, asks what will supplant it, fails to answer.   This is like pointing out that eventually the sun will flame out and life on Earth will end; it may be true, but in the short term we still have to do the dishes.

Huawei debuts three new models of its FusionServer data center servers with built-in Spark accelerators, claims 100X SQL performance boost versus commodity servers.

IBM continues to bang the drum for Spark on a mainframe (links here, here and here.)  Yawn.  Sorry, IBM, but one of the key selling points of Spark and the Hadoop ecosystem is the ability to avoid having to deal with the likes of IBM.


On the Dato blog, Krishna Sridhar delivers an introduction to distributed machine learning, touts Dato Distributed.


On Slideshare:

  • Erin LeDell explains scalable machine learning with R and H2O.
  • Chen Huang and Erin LeDell offer an intro to data science.

