A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads. In this benchmark, Spark outperformed MapReduce on Word Count, k-means and PageRank, while MapReduce outperformed Spark on Sort.
On the ADT Dev Watch blog, Dave Ramel summarizes the paper, arguing that it “brings into question ... Databricks Daytona GraySort claim”. This point refers to Databricks’ record-setting entry in the 2014 Sort Benchmark run by Chris Nyberg, Mehul Shah and Naga Govindaraju.
However, Ramel appears to have overlooked section 3.3.1 of the paper, where the researchers explicitly address this question:
“This difference is mainly because our cluster is connected using 1 Gbps Ethernet, as compared to a 10 Gbps Ethernet in [...], i.e., in our cluster configuration network can become a bottleneck for Sort in Spark.”
In other words, had they deployed Spark on a cluster with high-speed network connections, it likely would have run Sort faster than MapReduce did.
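To get a rough feel for why link speed matters here, the following back-of-the-envelope sketch computes the ideal time to move a fixed amount of shuffle data over a 1 Gbps link versus a 10 Gbps link. The 100 GB per-node shuffle volume and the function name are hypothetical, chosen only for illustration; they do not come from the paper. The model also ignores protocol overhead, disk I/O and any overlap between transfer and computation.

```python
def transfer_seconds(data_gb: float, link_gbps: float) -> float:
    """Ideal time to move data_gb gigabytes over a link of link_gbps gigabits per second."""
    # 1 gigabyte = 8 gigabits (decimal units), so multiply by 8 before dividing.
    return data_gb * 8 / link_gbps


# Hypothetical shuffle volume per node; NOT a figure from the paper.
SHUFFLE_GB_PER_NODE = 100.0

for gbps in (1.0, 10.0):
    seconds = transfer_seconds(SHUFFLE_GB_PER_NODE, gbps)
    print(f"{gbps:.0f} Gbps link: {seconds:.0f} s to move {SHUFFLE_GB_PER_NODE:.0f} GB")
```

Under these assumptions the 1 Gbps link needs ten times as long as the 10 Gbps link (800 s versus 80 s), which is the kind of gap that can dominate a shuffle-heavy job such as Sort.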
I guess we’ll know when Nyberg et al. release the 2015 GraySort results.
The IBM benchmark team found that k-means ran about 5X faster in Spark than in MapReduce. Ramel highlights the difference between this and the Spark team’s claim that machine learning algorithms run “up to” 100X faster.
The performance comparison shown on the Spark website is for logistic regression, which the IBM researchers did not test. One possible explanation: the Spark team may have tested against Mahout’s logistic regression algorithm, which runs on a single machine. It’s hard to say, since the Spark team provides no supporting documentation for its performance claims. That needs to change.