Big Analytics Roundup (June 29, 2015)

The Sparkalanche continues; plus we have new releases from Flink and H2O.  And, in case you thought Spark was the last word in Big Analytics, well, think again: here comes Splash, from AMPLab.

In the Wall Street Journal’s Saturday Essay, Sean Parker calls for philanthropists to focus on “hackable problems,” a message that should resonate with data scientists.  (Link may require registration.)

On her blog, Paige Roberts argues convincingly that Tez is toast.  Hortonworks, call your office.

Spark and the Ecosystem

Paul Barsch of Think Big — a Teradata company — speaks to one Spark user at Spark Summit who cannot say how Spark fits into an overall architecture.  Quoting Lewis Carroll, he leverages this sample of one to lecture us about doing stuff without clear goals.

How difficult is it to say how Spark fits into an architecture?  Either Barsch’s interlocutor is unusually obtuse, or Mr. Clueless was headed for the buffet and didn’t want to suffer a sales pitch.  In any case, this paper, on architecture in an agile world, hits the mark better than Alice in Wonderland; sometimes it makes sense to do something even if you don’t have a comprehensive plan.

You need strategy and architecture when you spend big bucks; for example, if you buy Teradata.  Spark’s sunk costs are low; it pays to try something, see if it works, and scale it later. You learn more from doing that than you will hiring consultants to create “architecture.”

Analytic Startups

Cloud warehouse vendor Snowflake lands $45M “C” round, exits beta so their sales reps can pester you even more.

OLAP on Hadoop vendor AtScale raises $7 million in an “A” round.

Knowledge@Wharton discusses tech unicorns and decacorns.

Fintech startup Credit Karma pulls a $175M “D” round on a valuation of $3.5 billion.

Apache Drill

On YouTube, an intro to Apache Drill with SAP Lumira on MapR, presented by Nitin Bandugula of MapR, Angela Harvey of SAP and Kyle Porter of Simba Technologies.

Apache Flink

The Flink team releases 0.9.0, which includes a number of new bits:

  • Exactly-once fault tolerance for streaming data
  • Table API, a high-level abstraction for structured data
  • Graph processing API, with support for iterative graph processing and a library of algorithms (beta)
  • Machine learning API, featuring a pipeline approach, with linear regression and alternating least squares (beta)
  • Improved YARN support, plus YARN with Tez
  • Many bug fixes

For a backgrounder on Flink, read this.

Apache Spark

Tiziano Fagni has contributed a distributed implementation of the AdaBoost.MH and MP-Boost algorithms to Spark Packages.

On the MapR blog, Carol McDonald reveals how to use DataFrames to process tabular data.  On the same blog, Nitin Bandugula explains the significance of Spark in five minutes.

On Datanami, Alex Woodie describes Blazent’s use of Spark in its IT asset management app.

Search vendor Elastic releases Version 2.1 of Elasticsearch for Apache Hadoop with Spark support, blogs about it here.

On KDNuggets, Saurabh Agrawal and Prasad Pande introduce you to Spark.

On SearchAWS, Beth Pariseau writes up Amazon Web Services' Spark support in EMR.  Alex Woodie notes that AWS does not charge for use of the Spark software, which is big of them.

“The data warehouse…is history,” writes Charles Babcock in Information Week.  Reporting on Spark Summit, he notes that decisions are increasingly made in real time, and predicts a key role for Spark.

On TechCrunch, Timothy Howes of ClearStory Data argues that cloud giants like Spark because they think it will make data stickier.  That seems like a stretch.

Bernard Marr explores the Spark vs. Hadoop question, and gets lost.  Repeat after me: Spark does not compete with Hadoop, it competes with MapReduce.

Danny Bradbury adds his voice to the chorus of people who notice that IBM is investing in Spark.

In SiliconAngle, Betsy Amy-Vogt promotes IBM’s Spark hashtag.

In an exclusive, Andrew Or explains the new visualization tools in Spark 1.4.

Cloud Computing

Here are several good papers on cloud computing (registration required):

  • Ahmad and Khan systematically review cloud computing.
  • Bahman Rashidi et al. compare Amazon Elastic MapReduce with Azure MapReduce and note significant differences.
  • The same authors survey cloud interoperability, conclude that this is hard.
  • Naik and Sarma propose a framework for mobile cloud.

H2O

H2O tweets the latest releases of H2O and Sparkling Water.

On the Domino blog, Sean Lorenz dives into Deep Learning with H2O.

In a triumph of acronyms, Software AG adds ADAPA to APAMA, gets PMML.
