Big Analytics Roundup (March 16, 2015)
Big Analytics news and analysis from around the web. Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.
A reminder to readers that Spark Summit East is coming up March 18-19.
- On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
- And again, the same combo for spatial analytics.
- Adam Riley blogs on testing Alteryx macros.
For an overview, see the Apache Spark Page.
- The Spark team announces availability of Spark 1.3.0. Release notes here. Highlights of the new release include the DataFrames API, the graduation of Spark SQL from alpha, new algorithms in MLlib and Spark Streaming, a direct Kafka API for Spark Streaming, and additional enhancements and bug fixes. More on this release separately.
- On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
- Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
- In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
- Sandy Ryza, co-author of Advanced Analytics With Spark, writes on tuning Spark jobs on the Cloudera Engineering blog.
- Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process terabyte-scale clickstream data. Case study published here.
- Holden Karau publishes a Spark testing procedure on GitHub.
- On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.
High Performance Computing
- Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes. More coverage here. Ryft’s Christian Shrauder blogs about FPGA.
- Ching and Daniel propose using Random Matrix Theory to analyze high-dimensional social media data.
- Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
- AMPLab’s Jiannan Wang blogs on human-in-the-loop machine learning. Someone should write a book about that.
- Shaun McGirr posts on integrating RapidMiner and R.
- Tobias Malbrecht performs heroics with RapidMiner.
SQL on Hadoop
- On the Pivotal blog, a podcast about Hawq.
- The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
- TechWorld reports that Airbnb has open-sourced Airpal, an application that runs on Facebook’s PrestoDB. According to the story, Airpal “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
- Splice Machine has updated FAQs for its RDBMS-on-Hadoop.