Big Analytics Roundup (June 20, 2016)
Light news this week — everyone is catching up from Spark Summit, it seems. We have a nice crop of explainers, and some thoughts on IBM’s “Data Science Experience” announcement.
On his personal blog, Michael Malak recaps the Spark Summit.
Teradata releases a Spark connector for Aster, so Teradata is ready for 2014.
On KDnuggets, Gregory Piatetsky publishes a follow-up to results of his software poll, this time analyzing which tools tend to be used together.
In Datanami, Alex Woodie asks if Spark is overhyped, quoting extensively from some old guy. Woodie notes that it’s difficult to track the number of commercial vendors who have incorporated Spark into their products. Actually, it isn’t:
And yes, there are a few holdouts in the lower left quadrants.
CFPs and Competitions
— Flink Forward 2016, Berlin, September 12-14 (due June 30)
— Spark Summit Europe, Brussels, October 25-27 (closing date July 1)
— Parkinson’s Progression Markers Initiative (PPMI) 2016 Challenge (due September 7)
IBM Data Science Experience
Unless you attended the recent Spark Summit with a bag over your head, you’re aware that IBM announced something. An IBM executive wants to know if I heard the announcement. The answer is yes, I saw the press release and the planted stories, but IBM’s announcements are — shall we say — aspirational: IBM is announcing a concept. The service isn’t in limited release, and IBM has not revealed a date when the service will be available.
It’s hard to evaluate a service that IBM hasn’t defined. Media reports and the press release are inconsistent: all stories mention Spark, Jupyter, RStudio and R; some stories mention H2O, others mention CPLEX and other products. Insiders at IBM are in the dark about what components will be included in the first release.
Evaluating the release conceptually:
- IBM already offers a managed service for Spark; it’s less flexible than Databricks or Qubole, and not as rich as Altiscale or Domino Data.
- Unlike Qubole and Databricks, IBM plans to use Jupyter notebooks and RStudio rather than creating an integrated development environment of its own.
- R and RStudio in the cloud are already available in AWS, Azure and Domino. If IBM plans to use a vanilla R distribution, it will be less capable than Microsoft’s enhanced R distribution available in Azure.
- A managed service for H2O is a good thing, if it happens. There is no formal partnership between IBM and H2O.ai, and insiders at H2O seem surprised by IBM’s announcement. Of course, it’s already possible to implement H2O in any IaaS cloud environment, and H2O has users on AWS, Azure and Google Cloud platforms already.
Bottom line: IBM’s “Data Science Experience” is a marketing wrapper around an existing service, with the possibility of adding new services that may or may not be as good as offerings already in the marketplace. We’ll take another look when IBM actually releases something.
Explainers
— Davies Liu and Herman van Hovell explain SQL subqueries in Spark 2.0.
— On the MapR blog, Ellen Friedman explains SQL queries on mixed schema data with Apache Drill.
— Bill Chambers publishes the first of three parts on writing Spark applications in Databricks.
— Carla Schroder explains how Verizon Labs built a 600-node bare metal Mesos cluster in two weeks.
— On YouTube, H2O.ai’s Arno Candel demonstrates TensorFlow deep learning on an H2O cluster.
Perspectives
— Jessica Davis compiles a listicle of tech giants who embrace open source.
— Microsoft’s Dmitry Pechyony reports results from an analysis of 600 million taxi rides using Microsoft R Server on a single instance of the Data Science Virtual Machine in Azure.
— InformationWeek’s Jessica Davis wonders if Microsoft will keep LinkedIn’s commitment to open source. LinkedIn’s donations to open source have less to do with its “commitment”, and more to do with its understanding that software is not its core business.
— Arthur Cole wonders if open source software will come to rule the enterprise data center as a matter of course. The answer is: it’s already happening.
Open Source Announcements
— Apache Beam (incubating) announces version 0.1.0. Key bits: SDK for Java and runners for Apache Flink, Apache Spark and Google Cloud Dataflow.
— Apache Mahout announces version 0.12.2, a maintenance release.
— Apache SystemML (incubating) announces release 0.10.0.
Commercial Announcements
— Altiscale announces the Real-Time Edition of Altiscale Insight Cloud, which includes Apache HBase and Spark Streaming.
— Databricks announces availability of its managed Spark service on AWS GovCloud (US).
— Qubole announces QDS HBase-as-a-Service on AWS.