Big Analytics Roundup (November 30, 2015)

Three main stories this week: an upcoming Spark release, a new Drill release, and more SystemML PR.

The Flink team announces Release 0.10.1, a maintenance release.

On the Edureka! blog, an anonymous blogger describes four ways to use R and Hadoop together: RHadoop, ORCH, RHIPE and Hadoop Streaming.  That’s like saying there are four ways to fly from New York to Los Angeles: American, Delta, United and by flapping your arms vigorously.

(1) Databricks Previews Spark 1.6

Databricks announces availability of an Apache Spark 1.6 preview package.  The preview is early release software; the general release is still planned for mid-December.  Key new Spark bits:

SQL: 13 enhancements, including the new Dataset API, queries on files (without prior table registration) and per-operator metrics.

Streaming: 4 API enhancements, including improved state management and Kinesis record disaggregation, plus UI improvements.

MLlib: 5 new algorithms, including survival analysis, plus improvements to ML Pipelines, the R API and the Python API.

GraphX: no enhancements, which makes GraphX oh-for-2015.  Is it time to stick a fork in GraphX?

On Tuesday, December 1, Databricks’ Patrick Wendell plans to deliver a webcast on the new release.  (Register to attend).

Serdar Yegulalp notes that Spark 1.6 includes enhanced memory management, resolving one of the five things Ian Pointer hates about Spark.

Databricks also announces a new release of its platform.  Dave Wang elaborates.

(2) New Release for Apache Drill

The Drill team announces Drill 1.3, which includes enhanced Amazon S3 support, mixed-type columns, text file headers, sequence files and ~50 bug fixes.  (h/t Hadoop Weekly)

(3) IBM Touts SystemML

IBM revs the PR machine for SystemML; stories here, here, here, here, here, here, here and here.  Jessica Davis appears to be confused by the publicity, describing IBM’s donation of SystemML as a “milestone” for Spark, which is a stretch.

In case you missed last week’s story, SystemML is a high-level declarative machine learning language; the user can choose between an “R-like” syntax or a “Python-like” syntax.  Users specify the model in a general way; SystemML converts the user request into an execution plan and runs the request either in MapReduce or Spark.

It’s hard to imagine why one would ever run a machine learning algorithm in MapReduce, so you can write an “optimizer” with one rule: If Spark is installed, run it there, otherwise…
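
For the literal-minded, that one-rule "optimizer" might look something like the sketch below.  To be clear, this is not SystemML's actual planner, and the function names are made up for illustration.  It assumes that a spark-submit launcher on the PATH means Spark is installed, and that the fallback is MapReduce, the only other engine mentioned above.

```python
import shutil


def choose_backend(spark_available: bool) -> str:
    """Hypothetical one-rule 'optimizer': prefer Spark whenever it is available."""
    # Fallback is MapReduce, the only other engine named in this post.
    return "spark" if spark_available else "mapreduce"


def spark_on_path() -> bool:
    """Assumption: a spark-submit launcher on PATH means Spark is installed."""
    return shutil.which("spark-submit") is not None


if __name__ == "__main__":
    # Prints the engine this toy rule would pick on the current machine.
    print(choose_backend(spark_on_path()))
```

A real optimizer would presumably weigh data size, memory and cluster state; the point here is only that the choice of engine rarely needs that much thought.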

SystemML’s library of MapReduce algorithms dates back a couple of years; IBM was unable to commercialize it.  While the Spark algorithms align roughly to existing capabilities of Spark MLlib, it appears that IBM rewrote them and added a few, including stepwise regression and survival analysis.

IBM donated the software to open source last June.  All active contributors are IBM employees.
