Big Analytics Roundup (November 23, 2015)

Eleven stories this week, including a new Flink release, new developments for Splice Machine, and a very big Spark HPC cluster in Warsaw.

InfoWorld publishes a well-written practical guide to Deep Learning.

Here are a couple of interesting articles on Spark:

  • MapR’s Jim Scott offers a nice overview of Spark RDDs.
  • Ian Pointer summarizes five things he hates about Spark.  (Spoiler: memory issues; small files problem; Spark Streaming; Python API; random crazy errors.)

On his personal blog, Harris Brakmic delivers an excellent introduction to stream processing with Flink.

(1) SystemML Makes Apache Incubator.  World: Wut?

IBM knows how to milk the last nickel out of a software asset; just ask the poor souls stuck using Lotus Notes or Lotus Symphony.  So when IBM donates something to open source, you’re justified in thinking it’s a dog they couldn’t figure out how to monetize.

Back in June, IBM announced that it would donate SystemML to open source as part of its Spark initiative.  SystemML is a declarative “framework” for machine learning that aims toward “flexible specification of ML algorithms…expressed in an R-like syntax.”  The software has been kicking around IBM Research since at least 2010, possibly earlier.  It was originally conceived as a library of algorithms written in MapReduce; at some point, IBM figured out that nobody wants another library of algorithms written in MapReduce.

Even the folks at BigInsights didn’t want it, possibly because the BigML brand is already taken.

So now it’s a framework platform.  I was sitting with clients last week, discussing strategy, when suddenly the CEO sat up and pounded the table.  “Dammit, everyone!”, she exclaimed, “we need more platforms around here!”

For all of you who use R, this is great news.  Instead of using the SparkR API, you can learn a new R-like framework.

Heh.  Presumably, IBM and Apache did a trademark search.

Larry Dignan reports, but ZDNet pulls the story for some reason.  Cached view here.

(2) AWS Updates EMR With Spark 1.5.2

That was quick.  Ten days after the Spark team releases 1.5.2, it’s available on AWS.  Amazon Web Services announces EMR 4.2.0.  The updated release includes Spark 1.5.2, Oozie 4.2.0, Presto 0.125, Zeppelin 0.5.5 and Ganglia 3.6 (to monitor resource utilization).

(3) Cloudera Unloads Impala, Kudu

On the Cloudera Engineering Blog, Marcel Kornacker and Justin Erickson announce Cloudera’s proposal to donate Impala and Kudu to Apache.  Both projects are already open source under Apache license, so Cloudera must figure the projects will draw more contributors under Apache governance.

(4) Microsoft Releases DMTK, FWIW

As in Distributed Machine Learning Toolkit (why not DMLT?).  On the Inside Microsoft Research blog, George Thomas explains.  Initial functionality includes LDA and two algorithms for word embedding.  Until MSFT adds more capabilities, it’s YAF (Yet Another Framework).  Available now on GitHub.  Serdar Yegulalp reports.

(5) Flink Flaunts Features, Fixes

The Flink team announces availability of its latest release, which includes bug fixes and enhancements from about 80 contributors.  Key new bits:

  • Event-time stream processing
  • Stateful stream processing
  • High availability for standalone and YARN clusters, using ZooKeeper
  • DataStream API graduates from beta
  • DataStream connectors for Elasticsearch and Apache NiFi
  • Web dashboard for real-time monitoring
  • Off-heap managed memory
  • Improvements to Flink Gelly for graph processing

Plus more than 400 bug fixes.

(6) Huawei Sells a Whale

At SC15, Huawei announces plans to deliver a high-performance analytics platform to the University of Warsaw’s Interdisciplinary Centre for Mathematical and Computational Modeling (ICM).  Once implemented, the Spark cluster will be one of the largest in Europe, with more than 8,000 CPU cores, 43TB of RAM and 8PB disk storage.  HPC Wire reports.

(7) Splice Machine Embraces Spark

Splice Machine announces Version 2.0 of its RDBMS on Hadoop.  Splice Machine offers functionality equivalent to conventional relational databases, including ANSI SQL and ACID compliance.  Splice Machine’s query parser, planner, optimizer and executor use Apache Derby running on top of HBase and Spark.  In InfoWorld, Serdar Yegulalp notes that Splice Machine faces stiff competition in the SQL-on-Hadoop space.  Katherine Noyes reports that Splice Machine claims 20X performance improvement over “traditional database management systems” (e.g. Oracle) at a quarter of the cost.

(8) More TensorFlow Commentary

Alex Woodie explains why Google open sourced TensorFlow.  On KDnuggets, TensorFlow disappoints.  At a site called Clapway (“your source for quirky discoveries, tips and news off the beaten path”), Haley Paskalides explains AI and the impact of TensorFlow.

(9) Mesosphere Money

In TechCrunch, Matthew Lynley reports that Mesosphere is working on a new funding round after declining a buyout offer from Microsoft.  Mesosphere raised $36 million in “B” round funding in October 2014.

(10) DataRobot Partners With AI Leader

Tokyo-based Recruit Holdings invests in DataRobot, a Boston-based machine learning software vendor.  DataRobot forms a business partnership with RIT, Recruit’s AI research laboratory.

(11) Bottom Story of the Week

Information Builders certifies WebFOCUS with Apache Drill.
