Category Archives: Architecture

Benchmark: Spark Beats MapReduce

A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads.  In this benchmark, Spark outperformed MapReduce on Word Count, k-Means and Page Rank, while MapReduce outperformed Spark on Sort. On the ADT Dev Watch blog Dave Ramel summarizes the paper, arguing that it “brings into question..Databricks Daytona GraySort claim”.  This point refers to Databricks’ record-setting

Read more

O’Reilly Data Science Survey 2015

O’Reilly releases its 2015 Data Science Salary Survey.  The report, authored by John King and Roger Magoulas summarizes results from an ongoing web survey.  The 2015 survey includes responses from “over 600” participants, down from the “over 800” tabulated in 2014. The authors note that the survey includes self-selected respondents from the O’Reilly audience and may not generalize to the

Read more

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed. First some definition.  The scope of this analysis includes software with the following properties: Support for supervised and unsupervised machine learning Support for distributed processing Open platform or multi-vendor

Read more

SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures: (1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization. (2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop

Read more

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic. The question is important for four reasons: Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop; In many cases, the

Read more

Python for Analytics

A reader complains that I did not include Python in a survey of Machine Learning in Hadoop.  It’s a fair point.  There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match.  Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered.  In this post I review the

Read more
« Older Entries