Big Analytics Roundup (May 2, 2016)

Movidius ups the ante for trade show trinkets by releasing what journos describe as supercomputing, neural computing power, vision processing, deep learning, and artificial intelligence on a USB drive.  Roundup here.

Last November, IBM’s Paul Zikopoulos snarked at Cloudera for not supporting SparkR. Cloudera’s Sean Owen, responding to a query in the Cloudera Community, notes that SparkR “does not work with other resource managers,” and does not work unless R is installed on the data nodes. Sean also notes that Cloudera cannot redistribute R because it is under GPL license. Data scientist Iraklis Tsatsoulis explains how to make SparkR work in Cloudera. Cloudera’s response isn’t completely satisfactory — the GPL license does not prohibit Cloudera from redistributing R, for example — but it is based on actual working experience with the product, which IBM clearly does not have.

Turning to important matters, a group at the Technical University of Munich has a machine learning engine that predicts who will die in Game of Thrones. Not very well, it seems; they blew it on Roose Bolton. Oops, spoiler.

Explainers

— Adrian Colyer explains GeePS, a Deep Learning framework for clusters of GPUs. Put that on a thumb drive and we can talk.

— On the Altiscale blog, Professor Jimmy Lin compares local installations, virtual machines, IaaS providers and Altiscale’s Hadoop-as-a-Service offering for teaching students about Big Data. Spoiler: he likes Altiscale.

— Two benchmarks from the Cloudera Engineering Blog:

  • Devadutta Ghat et al. explain results from benchmarking Impala 2.5 with TPC queries. They claim an average speedup of 4.35X over Impala 2.3 for TPC-DS.
  • Allstate’s Don Drake explains results of a test comparing Spark 1.6 performance with Avro and Parquet, with CSV as a baseline. Drake ran a multi-step benchmark with a narrow table and a wide table. Results: the Spark job ran faster with Parquet than with Avro, markedly so for the wide data set, which makes sense since Parquet is a columnar format. Also, performance with CSV sucked.
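
Why would a columnar format pull ahead on a wide table? Here is a toy sketch in plain Python. It is not Drake's benchmark, and it uses neither Spark, Parquet nor Avro; the function names and table sizes are made up for illustration. It only counts how many cells a reader has to touch to sum one column when records are stored whole versus column by column. Real formats add compression, encoding and I/O effects that this ignores.

```python
# Toy model only: counts cells touched when summing one column under a
# row-oriented layout versus a column-oriented layout. All names here are
# hypothetical; nothing below calls Spark, Parquet or Avro.

def make_table(n_rows, n_cols):
    """Build a table as a list of rows; cell (r, c) holds r * n_cols + c."""
    return [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]


def to_columns(rows):
    """Pivot a list of rows into a list of columns."""
    return [list(col) for col in zip(*rows)]


def sum_column_row_layout(rows, col):
    """Row layout: records are stored whole, so every cell gets scanned."""
    total, touched = 0, 0
    for row in rows:
        for c, value in enumerate(row):
            touched += 1
            if c == col:
                total += value
    return total, touched


def sum_column_columnar_layout(columns, col):
    """Columnar layout: only the requested column gets scanned."""
    total, touched = 0, 0
    for value in columns[col]:
        touched += 1
        total += value
    return total, touched


if __name__ == "__main__":
    for label, n_cols in (("narrow", 4), ("wide", 200)):
        rows = make_table(100, n_cols)
        columns = to_columns(rows)
        _, row_touched = sum_column_row_layout(rows, 1)
        _, col_touched = sum_column_columnar_layout(columns, 1)
        print(label, "row layout:", row_touched, "columnar:", col_touched)
```

In this toy model the columnar reader touches the same number of cells regardless of table width, while the row reader's work grows with every added column, which is at least consistent with the gap being larger on the wide data set.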

— Three items from MapR’s Converge blog:

  • Nick Amato explains how to predict Airbnb listing prices with scikit-learn and Spark.
  • Mathieu Dumoulin explains Deep Learning with the CaffeOnSpark package.
  • Nicolas A Perez explains how to do Twitter sentiment analysis with Spark Streaming.

— Corentin Kerisit explains RDD partitioning in Spark.

Perspectives

— An anonymous blogger at CBInsights notes that big tech companies are paying big bucks for AI companies, so if you’re running a startup make sure you put AI in the name.

— Alexander Wissner-Gross weighs in on the “datasets versus algorithms” debate. My take: data trumps algorithms.

— Google streams engineer Tyler Akidau discusses streaming systems versus batch processing, which is like asking Mr. Fox for his perspective on chickens.

— David Weldon continues his series of interviews with people at Strata + Hadoop: Ravi Dharnikota of SnapLogic, who heard a lot of talk about streaming, Spark and data lakes.

— Alan Earls touts Amazon Machine Learning without understanding it.

— Jack Vaughan interviews eBay’s Debashis Saha, who discusses Kylin and other stuff.

Open Source Announcements

— The Apache Software Foundation announces that Apache Apex has graduated to top-level status. Apex, for streaming analytics, is the open source version of DataTorrent. Jessica Davis reports.

— North Bridge and Black Duck release their tenth annual survey of people who like open source.

— Apache Flink 1.0.2 ships with bug fixes and a new capability to integrate with RocksDB. So now, you can have Flink on Rocks.

Commercial Announcements

— Google’s DeepMind AI unit announces that they will use TensorFlow instead of Torch for all future work.

— Three guys exit Pivotal, start a company named SnappyData, land a tiny “A” round from Pivotal and GE Digital and propose to build something like GemFire, but on Spark. More here.

— Levyx announces a small “A” round. Levyx offers a version of Spark optimized to run on solid state/Flash memory.

— Tiny consulting firm Xentaurs announces a partnership with Mesosphere. And not just any partnership; it’s a strategic partnership. Actually, they just joined the DC/OS community.

3 comments

  • Thomas, a clarification on: “Sean also notes that Cloudera cannot redistribute R because it is under GPL license. … Cloudera’s response isn’t completely satisfactory — the GPL license does not prohibit Cloudera from redistributing R, for example …”

    Yes that’s right. We do distribute GPL code already, but under a separate parcel called GPLEXTRAS. That separation isn’t something forced by the GPL, but is done so that consumers of the core CDH parcel, who implicitly or explicitly expect Apache-licensed code only, don’t accidentally get GPL. That’s why I say it can’t be shipped “directly” with CDH. This much isn’t a blocker, just one of a few minor obstacles.
