Big Analytics Roundup (June 22, 2015)

Last week’s Spark Summit is the big news driver for this roundup:

  • On the Databricks blog, Scott Walent recaps the summit here
  • Anmol Rajpurohit writes KDnuggets’ play-by-play for Day One and Day Two
  • My preliminary report is here; a full report will follow when slides from the sessions are available.

Spark will be one of several technologies featured at the inaugural In-Memory Computing Summit to be held in SFO June 29-30.

On KDnuggets, an interesting story from Gregory Piatetsky-Shapiro and Shashank Iyer.  The authors measure association among analytics tools using responses to their recent poll.  The strongest associations among the top 10 tools:

  • Spark and Hadoop
  • Python and Spark
  • Excel and SQL

Among the top 20 tools, the top associations are unsurprising:

  • SAS Enterprise Miner and Base SAS (cannot use the former without the latter)
  • IBM SPSS Modeler and IBM SPSS Statistics

The low associations are also interesting, if unsurprising:

  • Alteryx with everything else except Tableau
  • IBM SPSS with awk/gawk, scikit-learn and Spark
  • KNIME and Base SAS/Enterprise Miner
  • RapidMiner and Base SAS/Enterprise Miner

KNIME and RapidMiner are clearly positioned as low-cost SAS alternatives among relatively sophisticated analysts, while the Alteryx/Tableau combo is an entry-level offering for business users.
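For readers who want a feel for how this kind of association can be measured, here is a minimal sketch using lift: the share of respondents who use both tools, divided by the product of each tool's individual share. This is an assumption about a reasonable metric, not necessarily the exact formula the KDnuggets authors use. The responses are toy data invented for illustration, and `share` and `lift` are hypothetical helper names.

```python
from itertools import combinations

# Toy poll responses, invented for illustration (NOT the KDnuggets data).
# Each set holds the tools one respondent reported using.
responses = [
    {"Spark", "Hadoop", "Python"},
    {"Spark", "Hadoop"},
    {"Excel", "SQL"},
    {"Excel", "SQL", "Python"},
    {"Python", "Spark"},
    {"Excel"},
]


def share(tools, responses):
    """Fraction of respondents who use every tool in `tools`."""
    return sum(1 for r in responses if tools <= r) / len(responses)


def lift(a, b, responses):
    """Observed joint usage relative to what independence would predict.

    Above 1: the tools appear together more often than chance.
    Below 1: they appear together less often than chance.
    """
    pa, pb = share({a}, responses), share({b}, responses)
    if pa == 0 or pb == 0:
        raise ValueError("lift is undefined when a tool has no users")
    return share({a, b}, responses) / (pa * pb)


# Rank every pair of tools from strongest to weakest association.
all_tools = sorted(set().union(*responses))
ranked = sorted(
    combinations(all_tools, 2),
    key=lambda pair: lift(pair[0], pair[1], responses),
    reverse=True,
)
for a, b in ranked:
    print(f"{a} + {b}: lift = {lift(a, b, responses):.2f}")
```

With real poll data, pairs like SAS Enterprise Miner and Base SAS would be expected near the top of such a ranking, and pairs like KNIME and Base SAS near the bottom.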

Analyst Reactions: Spark Summit

Doug Henschen wonders if Databricks will be eclipsed by IBM’s entry, citing IBM’s intent to offer Spark on its cloud platform BlueMix.  He fails to note that (1) Databricks Cloud is more than a vanilla Spark service; (2) Databricks already competes with a Spark Service from AWS; and (3) BlueMix is an ankle-biter.


Tony Baer uses Andrew Brust’s blog to flatten a straw man, arguing that Spark isn’t going to replace Hadoop — a position that no serious person has suggested or implied.  Even Spark diehards believe that there are use cases where MapReduce/HDFS makes sense.

Joe Panettieri trolls readers, asking if Spark can live up to “Big Data, Real Time Analytics Hype.”  From the evidence he presents, the answer is “yes”.

Amazon Web Services

Amazon Web Services announces support for Apache Spark on its Amazon EMR service.  Stories here and here.  Note that AWS has offered Spark on EC2 for some time, so headlines like “AWS jumps on Spark bandwagon” are misleading.

On the AWS blog, Jeff Smith of Intent Media relates his company’s success with Spark.

Apache Kafka

On the Cloudera Vision blog, Jay Kreps of Confluent contributes the second part of his two-parter on using Kafka for real-time data streams.  Part One is here.

Apache Spark

On Slideshare, some presentations from the Spark Summit:

On the Databricks blog, Russell Spitzer and Wayne Chan of DataStax explain how to integrate Spark and Cassandra (h/t Hadoop Weekly)

Intent Media offers Yet Another Pipeline (aka Mario).


BlueData announces support for Hadoop and Spark on Docker.

Bright Computing

Bright Computing announces Bright Cluster Manager 7.1, which includes support for Spark 1.3.  At HPCS in Montreal, Ian Lumb of Bright presents an interesting HPC application for Spark.


Databricks announces general availability for Databricks Cloud.  Story here.

Matt Asay interviews Databricks VP Arsalan Tavakoli-Shiraji.

On Slideshare, Alex Tellez and Michal Malohlava demonstrate an integrated machine learning workflow that combines Spark MLlib’s Word2Vec with H2O’s Gradient Boosted Machine to classify text.  They leverage the Sparkling Water interface.


IBM announces “major” commitment to Apache Spark.  Stories here, here, here, here, here, here, here and here.

Joel Horwitz of IBM Analytics argues that Spark’s emergence reflects Moore’s Law and the declining cost of memory.  Well, yes — if memory is cheaper, we can use more of it.


MapR announces it will offer three Spark-based “quickstart” solutions for Real-time Security Log Analytics, Time Series Analytics and Genome Sequencing.

MapR also announces that Razorsight will offer a cloud-based predictive analytics solution on MapR with Apache Spark.

Also, Drill.


Paxata announces something.  I’ve read the press release several times, and can’t figure out what they are announcing.


SAS announces Factory Miner, an add-on to SAS Enterprise Miner that enables you to “turbocharge analytic model building” and add another line item to your SAS invoice.  The software appears to be a rebranding of SAS Rapid Predictive Modeler.


Typesafe and Mesosphere partner to support Spark on Mesosphere’s Data Center Operating System (DCOS).  Stories here, here and here.


Zoomdata announces its availability as an app on Databricks Cloud.
