Big Analytics Roundup (February 1, 2016)
There are three hard news stories this week: MSFT releases its toolkit for deep learning, MSFT releases its hybrid cloud platform, and the Zeppelin team releases its latest version. So MSFT wins the internet.
In Computerworld, Doug Cutting celebrates Hadoop’s tenth birthday.
InfoWorld distributes its 2016 Technology of the Year awards. Among the 32 awards: Docker, Kubernetes, Apache Mesos, Apache Ambari, Apache Kafka, Amazon Aurora, Apache Spark, Cloudera Apache Impala, Splunk, Tableau and IBM Watson Analytics. That last one is a gift: Watson Analytics is a service, not a technology, and as cloud-based analytic services go it’s middle of the pack at best.
Elsewhere, InfoWorld’s Serdar Yegulalp profiles 13 machine learning “frameworks”, a fancy word for “software.”
On the MapR blog, Jim Scott recaps his “Streaming in the Extreme” presentation from Strata+Hadoop World. It’s a good summary, free of hype promoting MapR Streams, which the company introduced a month after Scott’s presentation. Scott explains the importance of exactly-once processing, but then says you may not need it. Check your MapR invoice, and make sure they charge you exactly once.
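One reason you may not need exactly-once delivery: if the consumer is idempotent, at-least-once delivery produces the same result. The sketch below illustrates that idea only; it is not taken from Scott's talk or from MapR Streams, and the message format (an ID plus an amount) and the class name `IdempotentConsumer` are assumptions made up for the example.

```python
# Hypothetical sketch: at-least-once delivery plus an idempotent consumer.
# The (message_id, amount) message format is an assumption for illustration.

class IdempotentConsumer:
    """Applies each message at most once, keyed by its unique ID."""

    def __init__(self):
        self.seen_ids = set()   # IDs of messages already applied
        self.total = 0          # running state that the messages update

    def handle(self, message_id, amount):
        """Apply the message unless it is a redelivery; return True if applied."""
        if message_id in self.seen_ids:
            return False        # duplicate from a retry: ignore it
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# A delivery log in which message "b" arrives twice, as can happen after a retry.
deliveries = [("a", 10), ("b", 5), ("b", 5), ("c", 1)]

consumer = IdempotentConsumer()
applied = [consumer.handle(mid, amt) for mid, amt in deliveries]
print(consumer.total)
```

In a real system the set of seen IDs and the state it protects would have to be persisted together, which is where most of the difficulty lives.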
In SD Times, Alex Handy wonders if Spark is replacing Hadoop, which is like asking if diesel is replacing the automobile. After a rough start, Handy gets the story back on track by refocusing the question on which parts of Hadoop compete with Spark; his answer: MapReduce. Who knew? He also notes use cases where Spark is coupled with storage architectures other than HDFS; but these use cases were never served by Hadoop, so Spark isn’t replacing Hadoop.
In FT, Leslie Hook profiles Amazon Web Services and quotes Deutsche Bank’s Ross Sandler, who says that AWS is the “fastest-growing enterprise technology company in history.”
On DBMS2, Curt Monash covers Kafka and Confluent.
Adrian Colyer at the morning paper is a great source of interesting, influential and important papers in machine learning and high performance computing. Among his papers this week:
- Arabesque, a system for distributed graph mining.
- Petuum for distributed machine learning.
- Chimera for large scale classification.
Alex Woodie touts Akuda Labs’ streaming analytics platform, which bears the name of a popular fruit. Akuda benchmarked their commercial software against Spark Streaming, and they want you to know it’s faster. Yawn. If you don’t see the words “fault-tolerant” and “exactly once” in a piece on streaming, stop reading.
On the Dato Blog, a “fireside chat” with CEO Carlos Guestrin, who covers what’s coming in 2016 and a quodlibet of other topics.
- On the Cloudera Engineering blog, Brad Barker explains how to invoke the Apache Spark MLlib and H2O machine learning libraries from R or Python. (h/t Hadoop Weekly)
- On the Databricks blog, Tim Hunter explains how to implement Deep Learning with Spark and TensorFlow. While TensorFlow itself runs on single machines only, Hunter demonstrates how Spark can distribute experiments across a cluster of machines running TensorFlow instances.
- Oracle’s Michael Schulman explains how to integrate Spark and Oracle NoSQL Database.
- On SlideShare, Hank Roark explains Kalman Filters with H2O.
- Also on SlideShare, Nag Arvind Gudiseva explains how to query Oracle, Hive and HBase with Apache Drill.
- On the MapR blog, Joseph Blue explains how to “bet Super Bowl 50 like a boss” with Apache Spark. With or without Spark, betting against Vegas is a sucker bet.
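The pattern in the Spark and TensorFlow item above, where each training run stays on one machine and the cluster only farms out independent experiments, can be sketched without either library. What follows is a stand-in under stated assumptions, not Hunter's code: a thread pool plays the role of the cluster, and `run_experiment`, the parameter grid and the scoring formula are all made up for illustration. With Spark, the `pool.map` step would correspond roughly to `sc.parallelize(grid).map(run_experiment).collect()`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(params):
    """Stand-in for training one model on one machine; returns (params, score)."""
    learning_rate, hidden_units = params
    # Toy score that peaks at learning_rate=0.1, hidden_units=64 (made-up formula).
    score = -abs(learning_rate - 0.1) - abs(hidden_units - 64) / 1000
    return params, score


# Every combination of the hypothetical hyperparameters is one independent experiment.
grid = list(product([0.01, 0.1, 1.0], [32, 64, 128]))

# The pool stands in for the cluster: each worker runs whole experiments.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, grid))

# Collect the results centrally and keep the best-scoring configuration.
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params)
```

Because the experiments share nothing, this kind of search parallelizes trivially, which is why it works even when the training library itself is single-machine.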
Microsoft Releases CNTK for Deep Learning
On the Microsoft blog, Allison Linn announces that Microsoft has released its Computational Network Toolkit (CNTK) on GitHub. CNTK is a unified deep-learning toolkit that describes neural networks as a series of steps with a directed graph. The package uses stochastic gradient descent learning to train feed-forward, convolutional, and recurrent networks.
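For readers who have not met stochastic gradient descent, here is a minimal sketch on a single linear unit in plain Python. It does not use CNTK or reflect its API; the toy data, learning rate and epoch count are assumptions chosen for illustration.

```python
import random

# Toy data drawn from y = 2x + 1; the values are made up for illustration.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0          # parameters of a single linear unit
learning_rate = 0.1
rng = random.Random(0)   # fixed seed so runs are repeatable

for epoch in range(500):
    rng.shuffle(data)                # visit the samples in random order
    for x, y in data:
        error = (w * x + b) - y      # prediction minus target
        # Gradient of 0.5 * error**2 with respect to w and b,
        # applied after every single sample (the "stochastic" part).
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 3), round(b, 3))
```

A deep learning toolkit applies the same update rule, but computes the gradients automatically by walking the network's graph backwards.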
CNTK distributes workload across multiple GPUs and servers. The software benchmarks well versus Theano, TensorFlow, Torch 7 and Caffe.
Microsoft has used CNTK successfully for speech recognition. On the Inside Microsoft Research blog, Xuedong Huang reports.
A slew of stories in the media cover the release.
Ruchi Gupta thinks MSFT is trolling Google.
Microsoft Previews Azure Stack for Hybrid Cloud
Microsoft announces a technical preview of Azure Stack, MSFT’s new hybrid cloud platform. (For the uninitiated, a hybrid cloud platform enables enterprises to implement elastic provisioning on premises, in the cloud or a mix of the two.) More details about the release here.
New Zeppelin Release
Apache Zeppelin announces Release 0.5.6, with enhanced backend support for Spark 1.6, Elasticsearch, HiveInterpreter and Scalding in Local mode. There are also a number of new features and enhancements. More than 38 developers contributed to the release.