Big Analytics Roundup (June 29, 2015)
The Sparkalanche continues; plus we have new releases from Flink and H2O. And, in case you thought Spark was the last word in Big Analytics, well, think again: here comes Splash, from AMPLab.
In the Wall Street Journal’s Saturday Essay, Sean Parker calls for philanthropists to focus on “hackable problems,” a message that should resonate with data scientists. (Link may require registration.)
On her blog, Paige Roberts argues convincingly that Tez is toast. Hortonworks, call your office.
Spark and the Ecosystem
Paul Barsch of Think Big — a Teradata company — speaks to one Spark user at Spark Summit who cannot say how Spark fits into an overall architecture. Quoting Lewis Carroll, he leverages this sample of one to lecture us about doing stuff without clear goals.
How difficult is it to say how Spark fits into an architecture? Either Barsch’s interlocutor is unusually obtuse, or Mr. Clueless was headed for the buffet and didn’t want to suffer a sales pitch. In any case, this paper, on architecture in an agile world, hits the mark better than Alice in Wonderland; sometimes it makes sense to do something even if you don’t have a comprehensive plan.
You need strategy and architecture when you spend big bucks; for example, if you buy Teradata. Spark’s sunk costs are low; it pays to try something, see if it works, and scale it later. You learn more from doing that than you will from hiring consultants to create “architecture.”
Cloud warehouse vendor Snowflake lands $45M “C” round, exits beta so their sales reps can pester you even more.
OLAP-on-Hadoop vendor AtScale raises $7 million in an “A” round.
Knowledge@Wharton discusses tech unicorns and decacorns.
Fintech startup Credit Karma pulls a $175M “D” round on a valuation of $3.5 billion.
On YouTube, an intro to Apache Drill with SAP Lumira on MapR with Nitin Bandugula of MapR, Angela Harvey of SAP and Kyle Porter of Simba Technologies.
The Flink team releases 0.9.0, which includes a number of new bits:
- Exactly-once fault tolerance for streaming data
- Table API, a high-level abstraction for structured data
- Graph processing API, with support for iterative graph processing and a library of algorithms (beta)
- Machine learning API, featuring a pipeline approach, with linear regression and alternating least squares (beta)
- Improved YARN support, plus YARN with Tez
- Many bug fixes
For a backgrounder on Flink, read this.
Tiziano Fagni has contributed a distributed implementation of the AdaBoost.MH and MP-Boost algorithms to Spark Packages.
On Datanami, Alex Woodie describes Blazent’s use of Spark in its IT asset management app.
On KDNuggets, Saurabh Agrawal and Prasad Pande introduce you to Spark.
“The data warehouse…is history,” writes Charles Babcock in Information Week. Reporting on Spark Summit, he notes that decisions are increasingly made in real time, and predicts a key role for Spark.
On TechCrunch, Timothy Howes of ClearStory Data argues that cloud giants like Spark because they think it will make data stickier. That seems like a stretch.
Bernard Marr explores the Spark vs. Hadoop question, and gets lost. Repeat after me: Spark does not compete with Hadoop, it competes with MapReduce.
Danny Bradbury adds his voice to the chorus of people who notice that IBM is investing in Spark.
In SiliconAngle, Betsy Amy-Vogt promotes IBM’s Spark hashtag.
In an exclusive, Andrew Or explains the new visualization tools in Spark 1.4.
Here are several good papers on cloud computing, available through academia.edu (registration required):
- Ahmad and Khan systematically review cloud computing.
- Bahman Rashidi et al. compare Amazon Elastic MapReduce with Azure MapReduce, note significant differences.
- The same authors survey cloud interoperability, conclude that this is hard.
- Naik and Sarma propose a framework for mobile cloud.
H2O.ai tweets the latest releases of H2O and Sparkling Water.
On the Domino blog, Sean Lorenz dives into Deep Learning with H2O.
In a triumph of acronyms, Software AG adds ADAPA to APAMA, gets PMML.