Big Analytics Roundup (August 10, 2015)
Am on vacation. In lieu of the latest news, this scheduled post is a mashup of interesting stories that have crossed my browser.
History of Machine Learning
The original link is dead, but this graph was included in a related KDnuggets story. I don’t completely agree with the author about the popularity of SVM, which was never as popular as Decision Trees.
Benchmarking Machine Learning Tools
Szilard Pafka benchmarks Spark versus R, Python, H2O, xgboost and Vowpal Wabbit across a number of popular algorithms. There are some issues with his approach — he uses a single use case and runs everything on a single node — but he deserves credit for this work.
Moreover, Pafka’s single-node architecture does not invalidate the finding that Spark MLlib produces lower-quality models than the alternative tools. Spark team, call your office.
More on Spark
Luc Bartkowski explains how he tried to use Spark GraphX for his graph database, but ended up using plain old HBase.
Here’s a simple rule: if you build a model and it isn’t deployed, you failed. It’s your job as a data scientist to understand the deployment environment and build something that is deployable without a science project. Yes, sometimes the manager you work for is an idiot, but if you can’t see a clear path to value don’t do the project.
Remember the Netflix Challenge? Here’s why it was a white elephant.
Big Data Architecture
Anil Madan’s roundup of Big Data architecture papers offers an excellent summary of the field.
Marketing and Customer Analytics
Virginia Postrel summarizes a recently published analysis that appears to show that some early adopters reliably pick losers.
Hadoop is not a thing — it’s an ecosystem. Merv Adrian demonstrates this point by tabulating the components supported by each of the major commercial distributions. Spoiler: everyone supports MapReduce, HDFS, YARN, HBase, Hive, Pig, Spark and Zookeeper.
In the early 1990s I messed around with neural networks for problems in credit card marketing and risk management; while impressed with the theory, I was never able to get results that were significantly better than other methods.
So I admit to reacting with skepticism to the recent Deep Learning buzz. Nevertheless, we’re seeing growing evidence that the method works for some problems, and it should be a standard part of the data science tool kit. Here are some stories:
In this interview, H2O’s Arno Candel explains the basics of Deep Learning.
On the Toptal blog, Ivan Vasilev offers a Deep Learning tutorial.
Derrick Harris writes a story on how startups leverage Deep Learning.
In Forbes, Anthony Wing Kosner indulges in metaphor, arguing that Deep Learning will eat the world.
Unsurprisingly, the website deeplearning.net is all about Deep Learning.
On KDnuggets, Ran Bi asks whether Deep Learning will make other machine learning algorithms obsolete. The answer is “no.” My answer, not his.