Big Analytics Roundup (June 22, 2015)
Last week’s Spark Summit is the big news driver for this roundup:
- On the Databricks blog, Scott Walent recaps the summit here
- Anmol Rajpurohit writes KDnuggets’ play-by-play for Day One and Day Two
- My preliminary report is here; a full report will follow when slides from the sessions are available.
On KDnuggets, Gregory Piatetsky-Shapiro and Shashank Iyer publish an interesting story that measures association among analytics tools, using responses to their recent poll. The strongest associations among the top 10 tools:
- Spark and Hadoop
- Python and Spark
- Excel and SQL
Among the top 20 tools, the top associations are unsurprising:
- SAS Enterprise Miner and Base SAS (cannot use the former without the latter)
- IBM SPSS Modeler and IBM SPSS Statistics
The low associations are also interesting, if unsurprising:
- Alteryx with everything else except Tableau
- IBM SPSS with awk/gawk, scikit-learn and Spark
- KNIME and Base SAS/Enterprise Miner
- RapidMiner and Base SAS/Enterprise Miner
KNIME and RapidMiner are clearly positioned as low-cost SAS alternatives among relatively sophisticated analysts, while the Alteryx/Tableau combo is an entry-level offering for business users.
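For readers who want to try this kind of analysis on their own survey data, here is a minimal sketch of one common way to score association between two tools: lift, the observed rate of co-use divided by the rate expected if the two tools were used independently. This is an assumption for illustration, not necessarily the measure the KDnuggets authors used, and the responses below are invented, not poll data.

```python
from itertools import combinations

# Hypothetical responses: each set lists the tools one respondent reported using.
# These are made-up values for illustration, not the KDnuggets poll results.
responses = [
    {"Spark", "Hadoop", "Python"},
    {"Spark", "Hadoop"},
    {"Excel", "SQL"},
    {"Excel", "SQL", "Python"},
    {"Python", "Spark"},
    {"Excel"},
]


def lift(responses, a, b):
    """Lift of tools a and b: P(a and b) / (P(a) * P(b)).

    Values above 1 mean the tools are used together more often than
    independence would predict; values below 1 mean less often.
    """
    n = len(responses)
    n_a = sum(a in r for r in responses)
    n_b = sum(b in r for r in responses)
    n_ab = sum(a in r and b in r for r in responses)
    if n_a == 0 or n_b == 0:
        return 0.0
    # Algebraically equal to (n_ab / n) / ((n_a / n) * (n_b / n)).
    return n_ab * n / (n_a * n_b)


def rank_pairs(responses):
    """Return (lift, tool_a, tool_b) for every pair of tools, highest lift first."""
    tools = sorted(set().union(*responses))
    scored = [(lift(responses, a, b), a, b) for a, b in combinations(tools, 2)]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, a, b in rank_pairs(responses):
        print(f"{a} + {b}: {score:.2f}")
```

With real poll data, pairs with very few users produce unstable lift values, so it is worth setting a minimum respondent count before ranking.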
Analyst Reactions: Spark Summit
Doug Henschen wonders if Databricks will be eclipsed by IBM’s entry, citing IBM’s intent to offer Spark on its cloud platform BlueMix. He fails to note that (1) Databricks Cloud is more than a vanilla Spark service; (2) Databricks already competes with a Spark Service from AWS; and (3) BlueMix is an ankle-biter.
Tony Baer uses Andrew Brust’s blog to flatten a straw man, arguing that Spark isn’t going to replace Hadoop — a position that no serious person has suggested or implied. Even Spark diehards believe that there are use cases where MapReduce/HDFS makes sense.
Joe Panettieri trolls readers, asking if Spark can live up to “Big Data, Real Time Analytics Hype.” From the evidence he presents, the answer is “yes”.
Amazon Web Services
Amazon Web Services announces Apache Spark on its Amazon EMR service. Stories here and here. Note that AWS has offered Spark on EC2 for some time, so headlines like “AWS jumps on Spark bandwagon” are misleading.
On the AWS blog, Jeff Smith of Intent Media relates his company’s success with Spark.
On Slideshare, some presentations from the Spark Summit:
Intent Media offers Yet Another Pipeline (aka Mario).
BlueData announces support for Hadoop and Spark on Docker.
Matt Asay interviews Databricks VP Arsalan Tavakoli-Shiraji.
On Slideshare, Alex Tellez and Michal Malohlava demonstrate an integrated machine learning workflow that combines Spark MLlib’s Word2Vec with H2O’s Gradient Boosted Machine to classify text. They leverage the Sparkling Water interface.
Joel Horwitz of IBM Analytics argues that Spark’s emergence reflects Moore’s Law and the declining cost of memory. Well, yes — if memory is cheaper, we can use more of it.
MapR announces it will offer three Spark-based “quickstart” solutions for Real-time Security Log Analytics, Time Series Analytics and Genome Sequencing.
Paxata announces something. I’ve read the press release several times, and can’t figure out what they are announcing.
SAS announces Factory Miner, an add-on to SAS Enterprise Miner that enables you to “turbocharge analytic model building” and add another line item to your SAS invoice. The software appears to be a rebranding of SAS Rapid Predictive Modeler.
ZoomData announces its availability as an app on Databricks Cloud.