Big Analytics Roundup (July 6, 2015)

If you’re wondering about the picture, it’s a 1958 Edsel Roundup.

In an O’Reilly video, mad scientist Paco Nathan introduces advanced math for business people.

The team at launches a newsfeed called Deep Learning News, which aggregates blog posts, articles and other related content.  VentureBeat writes it up.

In an excellent roundup on LinkedIn Pulse, PayPal’s Anil Madan captures 100 Big Data architecture papers.

If Lomb-Scargle Periodograms are your thing, Jake Vanderplas explains how to do them fast with Python.

On KDnuggets, Louis Dorard compares Azure Machine Learning and PredictionIO, which is like comparing apples to oranges.

Apache Drill

The Drill team announces Release 1.1, which resolves 162 JIRA issues since the 1.0 release in May.  Key bits:

  • Automatic partitioning for Parquet files
  • Window functions
  • Hive storage plugin enhancements
  • SQL UNION improvements
  • New features for complex data
  • Improved JDBC compatibility
  • MongoDB 3.0 support

James Stanier of Brandwatch summarizes a recent discovery project using Drill.

Apache Flink

The Flink team posts a design draft for Time and Order in streams processing.

On the Inovex blog, Hans-Peter Zorn and Jasir El-Sobhy compare Spark and Flink.

A nameless blogger at MoData compares Flink and Spark.

My two cents: Flink devotees need to find something other than pure streaming versus micro-batching if they are gunning for Spark.  That argument hasn’t worked for Storm, and it won’t work for Flink, either.

Apache Spark

Typesafe will host a webinar on Spark Streaming this Wednesday, July 8, featuring Tathagata Das and Dean Wampler.  Register here.

On the Databricks blog, Vincenzo Selvaggio introduces PMML and explains the PMML functionality in Spark 1.4.

Also on the Databricks blog, Kavitha Mariappan and Dave Wang describe MyFitnessPal’s production pipeline on Databricks Cloud.

In Database Trends and Applications, Adam Shepherd summarizes the features and benefits of Spark.

In Datanami, Alex Woodie describes WebTrends’ Big Analytics pipeline, which includes HDFS, Kafka and Spark.

Loraine Lawson, a “veteran technology reporter,” promises “six facts” about Spark, but delivers four facts, a prediction (“Spark will displace MapReduce”) and some nonsense from Nick Heudegger.  Quoting Heudegger, Lawson notes that Spark “does not ship with a resource manager” although “you do tend to get that through Hadoop.”  In other words: “never mind what I just said, I forgot about YARN.”

On SearchBusinessAnalytics, Ed Burns summarizes capabilities of the Spark libraries.

Alex Tellez and Michal Malohlava publish the second part of their two-part series modeling Craigslist job categories with H2O, Spark and Sparkling Water.  Part one is here.  The entire presentation is on Slideshare.


Kaggle announces two new competitions, for Caterpillar and the WAY Project.

On Kaggle’s No Free Hunch blog, a profile of DataRobot’s Owen Zhang, currently the #1 Kaggler.


The R User Group of Milano chronicles events at the UseR! conference in Denmark.  Day Zero; Day One; Day Two; Day Three.

Elsewhere, David Smith rounds up the conference, calling it “the best ever.”
