Big Analytics Roundup (June 13, 2016)
Spark Summit 2016 met last week in San Francisco. There were many cool things; I will publish a separate report when presentations and videos are available.
KDnuggets releases results of its annual poll on data science software. Key findings:
- Python use is up 51% and has almost caught up to R, the #1 choice.
- Excel and Tableau usage are up 47% and 49%, respectively.
- Spark usage is up 91%, overtakes Hadoop.
- SAS is down big time, drops from the top ten.
Tim Spann recaps the week in Hadoop.
Spark Summit: Roundup of Roundups
— On the Databricks blog, Wayne Chan, Dave Wang, Jules Damji and Denny Lee recap highlights from the Summit.
— Jessica Davis rounds up the highlights.
— Jack Vaughan surrounds the story, quotes some old guy.
— Sam Dean summarizes what you need to know.
— Alex Handy collects the key bits.
CFPs and Competitions
— Flink Forward 2016, Berlin, September 12-14 (due June 30)
— Parkinson’s Progression Markers Initiative (PPMI) 2016 Challenge (due September 7)
— Spark Summit Europe, Brussels, October 25-27 (closing date TBA)
Adrian Colyer summarizes a paper on identifying architectural debt in software.
— Deenar Toraskar explains the new capabilities of HDFS.
— On his eponymous blog, Jesse Steinweg-Woods explains Gradient Boosted Trees with XGBoost in Python.
— Adam Warski explains how Kafka Streams fits into the stream processing landscape.
— H2O.ai’s Vinod Iyengar objects to what he calls the fragmentation of Spark support, correctly noting that Cloudera and Hortonworks support different versions of Spark in their distributions. Of course, nobody is obligated to use Spark with Cloudera and Hortonworks.
— From the Spark Summit on YouTube: Ben Lorica leads a panel discussion of incredibly smart and distinguished people, plus some old guy.
— Altiscale’s Barbara Lewis presents ten use cases for Big Data.
— Tim Wallis believes that AI will relieve boredom.
— Sam Dean touts Grappa, Drill and Kafka as successors to Spark. Grappa is going nowhere. Drill is great if all you want to do is SQL, and Kafka is great if all you want to do is streaming. Pro tip: there are no real-world analytic applications where all you want to do is streaming.
— Allen Downey opines that statistical tests are inflexible and opaque. Funny, my college roommate said the same thing when he flunked his Stat 101 mid-term.
Open Source Announcements
— Twitter donates DistributedLog to Apache.
— Microsoft announces general availability for its managed Spark service in HDInsight, and summer availability for the Spark pushdown capability in R Server. The company also announced PowerBI support for Spark Streaming, which is confusing for those who thought PowerBI already supported Spark Streaming.
— IBM announces a limited preview of a managed service branded as the Data Science Experience. IBM is coy about the details; the service definitely includes Spark, Jupyter, RStudio, H2O and “curated data sets”, and may include other bits. The service itself looks promising, but IBM’s claim to offer the “first development environment for Apache Spark” is BS.
— In an oddly opaque press release, H2O.ai announces that it is “working with” IBM. H2O is open source software, and IBM requires no permission from H2O.ai for use or distribution; presumably, H2O.ai will offer support contracts to users. H2O.ai did not respond to a request for comment.