Big Analytics Roundup (December 7, 2015)
Cloudera’s expanded Spark support leads the news this week, together with a Data Science Virtual Machine from Microsoft. Neural network devotees will be pleased to see that Keras now runs on TensorFlow.
On the Databricks blog, H2O.ai’s Michal Malohlava describes Sparkling Water, a Spark package that enables data scientists to build machine learning pipelines that integrate Spark and H2O functions. Malohlava walks the reader through the steps, using a spam prediction application as an example.
Daniel Gutierrez delivers an overview of Spark SQL, the most widely used component of Spark.
(1) Cloudera Adds Spark Library Support to CDH
Cloudera announces that it has added support for Spark SQL and MLlib to Cloudera Enterprise and CDH 5.5. According to the press release, Cloudera has more than 170 customers running Spark, including some of the largest multi-tenant clusters running today; reference customers cited include Cox Automotive, Allstate and Barclays.
Additional detail is available on the Cloudera Engineering blog.
Cloudera also reports that Hive-on-Spark is almost ready for production release. Hive-on-Spark started out as a Cloudera Labs project, but is now incorporated into the Hive project. This is Cloudera’s polite way of telling Hortonworks to take Tez and shove it.
IBM’s Paul Zikopoulos tweets that Cloudera does not yet support a number of other Spark features, including Spark SQL in PySpark, the ML pipeline API, SparkR and GraphX. It’s a fair point. But here’s something that Cloudera has that IBM seems to lack: production customers on Spark.
(2) Microsoft Introduces Data Science Virtual Machine
Microsoft’s Brad Severtson announces Microsoft Data Science Virtual Machine, an Azure VM image that is pre-installed and configured with six components:
- Revolution R Open (RRO)
- Anaconda Python distribution
- Visual Studio Community Edition
- Power BI Desktop
- SQL Server Express
- Azure SDK
RRO comes with an IDE called RRO RGui, or it can be used together with RStudio or other IDEs. The Anaconda distribution comes with the IPython notebook.
(3) Keras Runs Deep Learning on TensorFlow
(4) Fabian Flaunts Flink Features
On the Apache Flink blog, Fabian Hueske introduces stream windows, a key concept in streaming analytics, and Flink’s support for them. The Flink team continues to flog the distinction between “true” streaming and microbatching, a great distinction for doctoral dissertations, not so much in the commercial world.
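For readers new to the idea, a tumbling window slices a stream into fixed, non-overlapping time intervals and aggregates the events that fall inside each one. The sketch below shows that idea in plain Python. It is not Flink's API: the function name, the ten-second window and the sample events are all made up for illustration, and it runs over a finite list rather than an unbounded stream.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical data)
    window_size: window length in seconds
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every event falls into exactly one window, identified by its start time.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: (timestamp in seconds, key)
events = [(1, "a"), (4, "b"), (7, "a"), (12, "a"), (14, "b"), (21, "a")]
result = tumbling_window_counts(events, window_size=10)

for (window_start, key), count in sorted(result.items()):
    print(f"window starting at {window_start}s, key={key}: {count}")
```

A streaming engine does the same bookkeeping incrementally and emits each window's result once it decides the window is complete; sliding and session windows change only how events are assigned to windows.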
(5) Open Source Denier Continues to Get It Wrong
Dan Woods, author of Wikis for Dummies, writes about H2O.ai and somehow manages to get it mostly wrong. Mr. Woods, who embarrassed himself last January when Microsoft acquired Revolution Analytics, now claims to have suggested what Microsoft announced months ago: R will run inside SQL Server. Read his January post; Woods suggests that Microsoft do something entirely different, along the lines of TIBCO, for whom Mr. Woods shamelessly shills. Microsoft has not announced that it is rebuilding R, nor is it necessary to do so to run R inside a database. Just ask Oracle.
Mr. Woods, who wrote a book about open source software ten years ago, has a problem: he likes to argue that open source software is useless unless a commercial vendor forks it and redistributes a commercial version. He cites MapR and Red Hat as examples, and seems to think that open source software is only suitable for idealists and dreamers with rose-colored glasses, and not for serious hardcore technologists who write books for dummies.
For the record, TIBCO did not “re-engineer” open source R. It acquired Insightful Corporation, which owned the rights to S-Plus, a commercial implementation of the S programming language (a predecessor to R), and it now markets that software as the TIBCO Enterprise Runtime for R (TERR). This is equivalent to acquiring an Edsel, rebranding it as the “Shmedsel”, then claiming that the result is “just like a Ford”. Since R is not the same as S, TIBCO does not claim that TERR can run all R programs; there is an extensive list of differences between TERR and open source R, and those are just the ones that TIBCO has discovered to date.
In other words, caveat emptor.
Woods’ theory about open source is problematic in this case because H2O, the subject of his piece, is an open source project. H2O’s sponsor, H2O.ai, operates on a true open source model, similar to Hortonworks. Thus, he spends 700 words of a 1,200-word piece defending his theory and touting TIBCO.
H2O is an excellent machine learning engine with a growing user community. It has an R API (as well as Python, Java and Scala APIs), so that analysts working in those languages can easily invoke H2O’s machine learning algorithms. It is not, as Mr. Woods contends, a magic bean that makes “all of R’s machine learning functions” scalable.