Big Analytics Roundup (August 17, 2015)
Catching up from vacation last week. Top stories: results of a SQL-on-Hadoop evaluation at Pearson; Google launches Dataflow (giving Flink a boost); while IBM shoehorns Spark onto a mainframe, Vertica gets the jump on IBM PureData with native Spark integration.
Kaggle announces two new competitions:
- Springleaf Financial, an Indiana credit union founded in 1920, has rebranded to target millennials. They want you to help them target their direct marketing. Contest ends October 19.
- Dato, the oddly-named analytic software company, co-sponsors a contest with StumbleUpon to classify web pages as “sponsored” or “organic”. You don’t have to use Dato software, but you’re eligible for a bonus if you do. Contest ends October 6.
Alex Woodie asks if Scala will take over the Big Data world, fails to answer the question. The correct answer is “no”. Data scientists prefer to work in higher-level languages like R and Python, and Python works well for application development.
On the Qubole blog, Sumit Arora summarizes results from a recent evaluation of SQL-on-Hadoop by Pearson, the global learning company. Arora’s team evaluated Spark SQL and Presto with Text, Avro, Parquet and ORC formats. They excluded Impala from testing due to lack of support for complex query types and Amazon S3; they did not consider Drill or Hive on Tez.
Pearson selected Spark SQL and Parquet file format, for reasons detailed in the article.
Marketing services provider Harte-Hanks announces selection of MapR and Drill for its open CRM platform.
Kostas Tzoumas of Data Artisans describes Flink’s low-latency and exactly-once stream processing architecture.
On KDnuggets, Tzoumas and Stephan Ewen argue the case for Flink for stream processing.
On Medium.com, Nezih Yigitbasi explains how to crunch Parquet files with Flink.
IBM continues to pour new wine into old bottles.
In case you haven’t heard about Spark, Andy Patrizio explains its appeal.
Two interesting items on the MapR blog:
- Joseph Blue asks if Harper Lee wrote To Kill a Mockingbird. He describes using Spark and Lucene to compare the first chapter of Go Set a Watchman with To Kill a Mockingbird; the analysis suggests two different authors.
- Nitin Bandugula details real-time use cases for Spark on Hadoop.
On the Cloudera blog, guest authors Sam Savage and Harry Powell describe the use case for Spark at Barclays. Note to SAS: when you’ve lost the banks….
Databricks guest blogger Olivier Girardot compares Pandas and Spark DataFrames.
…announces an upgrade of the Databricks platform, loosely called “Databricks 2.0”. (As a SaaS offering, Databricks maintains a two-week release cadence.) Key bits:
- Support for Spark 1.4
- Improved security and access control
- Developer notebook versioning
- Support for multi-tenancy
…posts videos from the 2015 Data Science Summit and Dato Conference.
On the Dato blog, Susan Romero explains how to use Dato Distributed to run pipelines developed in GraphLab Create in a distributed environment. Interesting to note that GraphLab Create does not run in a distributed environment itself, so you need to license Dato Distributed to put your models to work in Hadoop.
On Slideshare, Hank Roark publishes an intro to data science with H2O.
Google releases its Dataflow hosted cloud service, announces partnerships with ClearStory Data, Salesforce, SpringML and Tamr, plus SDKs from Data Artisans and Cloudera. Dataflow enables users to build data pipelines that integrate batch and streaming sources with a unified programming model.
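To make "unified programming model" concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK or any Google API; the names `count_words` and `endless_source` are hypothetical, chosen only for illustration. The point is that one transformation, written once, can run over a bounded (batch) input or an unbounded (streaming) input.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Tuple


def count_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a running (word, count) pair for each word seen.

    The same function accepts a finite list (batch) or a generator
    that never ends (streaming); it only assumes it can iterate.
    """
    counts: Counter = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


def endless_source() -> Iterator[str]:
    """Hypothetical unbounded source: repeats two lines forever."""
    while True:
        yield "spark flink"
        yield "flink dataflow"


# Batch: consume the whole bounded input and keep the final counts.
batch_result = dict(count_words(["spark flink", "flink dataflow"]))

# Streaming: take only the first five updates from the unbounded input.
stream_updates = list(islice(count_words(endless_source()), 5))
```

A real pipeline service would also handle windowing, late data and distribution across machines, none of which this sketch attempts.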
Andrie de Vries compares the network structure of CRAN and BioConductor, concludes they are different. Nice graphics, though I’m missing the practical significance.