Big Analytics Roundup (July 6, 2015)
If you’re wondering about the picture, it’s a 1958 Edsel Roundup.
In an O’Reilly video, mad scientist Paco Nathan introduces advanced math for business people.
In an excellent roundup on LinkedIn Pulse, PayPal’s Anil Madan captures 100 Big Data architecture papers.
If Lomb-Scargle Periodograms are your thing, Jake Vanderplas explains how to do them fast with Python.
On KDnuggets, Louis Dorard compares Azure Machine Learning and PredictionIO, which is like comparing apples to oranges.
The Drill team announces Release 1.1, which addresses 162 JIRA issues since the 1.0 release in May. Key bits:
- Automatic partitioning for Parquet files
- Window functions
- Hive storage plugin enhancements
- SQL UNION improvements
- New features for complex data
- Improved JDBC compatibility
- MongoDB 3.0 support
James Stanier of Brandwatch summarizes a recent discovery project using Drill.
The Flink team posts a design draft for Time and Order in stream processing.
On the Inovex blog, Hans-Peter Zorn and Jasir El-Sobhy compare Spark and Flink.
A nameless blogger at MoData compares Flink and Spark.
My two cents: Flink devotees need to find something other than pure streaming versus micro-batching if they are gunning for Spark. That argument hasn’t worked for Storm, and it won’t work for Flink, either.
Typesafe will host a webinar on Spark Streaming this Wednesday, July 8, featuring Tathagata Das and Dean Wampler. Register here.
On the Databricks blog, Vincenzo Selvaggio introduces PMML and explains PMML functionality in Spark 1.4.
Also on the Databricks blog, Kavitha Mariappan and Dave Wang describe MyFitnessPal’s production pipeline on Databricks Cloud.
In Database Trends and Applications, Adam Shepherd summarizes the features and benefits of Spark.
In Datanami, Alex Woodie describes WebTrends’ Big Analytics pipeline, which includes HDFS, Kafka and Spark.
Loraine Lawson, a “veteran technology reporter,” promises “six facts” about Spark but delivers four facts, a prediction (“Spark will displace MapReduce”) and some nonsense from Nick Heudegger. Quoting Heudegger, Lawson notes that Spark “does not ship with a resource manager” although “you do tend to get that through Hadoop.” In other words: “never mind what I just said, I forgot about YARN.”
On SearchBusinessAnalytics, Ed Burns summarizes capabilities of the Spark libraries.
Alex Tellez and Michal Malohlava publish the second part of their two-part series modeling Craigslist job categories with H2O, Spark and Sparkling Water. Part one is here. The entire presentation is on Slideshare.
Kaggle’s No Free Hunch blog profiles DataRobot’s Owen Zhang, currently the #1 Kaggler.
Elsewhere, David Smith rounds up the conference, calling it “the best ever.”