Big Analytics Roundup (November 30, 2015)
Three main stories this week: an upcoming Spark release, a new Drill release, and more SystemML PR.
The Flink team announces Release 0.10.1, a maintenance release.
On the Edureka! blog, an anonymous blogger describes four ways to use R and Hadoop together: RHadoop, ORCH, RHIPE and Hadoop Streaming. That’s like saying there are four ways to fly from New York to Los Angeles: American, Delta, United and by flapping your arms vigorously.
(1) Databricks Previews Spark 1.6
— SQL: 13 enhancements, including the new Dataset API, queries on files (without prior table registration) and personalized metrics.
— Streaming: 4 API enhancements, including improved state management and Kinesis record disaggregation, plus UI improvements.
— MLlib: 5 new algorithms, including survival analysis, plus improvements to ML Pipelines, the R API and the Python API.
— GraphX: no enhancements, which makes GraphX oh-for-2015. Is it time to stick a fork in GraphX?
On Tuesday, December 1, Databricks’ Patrick Wendell plans to deliver a webcast on the new release. (Register to attend).
Databricks also announces a new release of its platform. Dave Wang elaborates.
(2) New Release for Apache Drill
(3) IBM Touts SystemML
IBM revs the PR machine for SystemML; stories here, here, here, here, here, here, here and here. Jessica Davis appears to be confused by the publicity, describing IBM’s donation of SystemML as a “milestone” for Spark, which is a stretch.
In case you missed last week’s story, SystemML is a high-level declarative machine learning language; the user can choose between an “R-like” syntax and a “Python-like” syntax. Users specify the model in a general way; SystemML converts the user request into an execution plan and runs the request in either MapReduce or Spark.
It’s hard to imagine why one would ever run a machine learning algorithm in MapReduce, so you can write an “optimizer” with one rule: If Spark is installed, run it there, otherwise…
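For the literal-minded, that one-rule “optimizer” might look something like the sketch below. This is not SystemML code; the function name `choose_backend` is made up for illustration, and treating “`spark-submit` is on the PATH” as a proxy for “Spark is installed” is an assumption.

```python
import shutil
from typing import Callable, Optional


def choose_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Pick an execution backend using a single rule.

    Hypothetical helper, not part of SystemML. Assumption: finding a
    `spark-submit` launcher on the PATH stands in for "Spark is installed".
    """
    # The one rule: prefer Spark whenever it is available,
    # otherwise fall back to MapReduce.
    return "spark" if which("spark-submit") else "mapreduce"


if __name__ == "__main__":
    print(choose_backend())
```

The lookup function is passed in as a parameter only so the rule can be exercised without a real Spark installation.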
SystemML’s library of MapReduce algorithms dates back a couple of years; IBM was unable to commercialize it. While the Spark algorithms align roughly to existing capabilities of Spark MLlib, it appears that IBM rewrote them and added a few, including stepwise regression and survival analysis.
IBM donated the software to open source last June. All active contributors are IBM employees.