Big Analytics Roundup (June 20, 2016)

Light news this week — everyone is catching up from Spark Summit, it seems. We have a nice crop of explainers, and some thoughts on IBM’s “Data Science Experience” announcement.

On his personal blog, Michael Malak recaps the Spark Summit.

Teradata releases a Spark connector for Aster, so Teradata is ready for 2014.

On KDnuggets, Gregory Piatetsky publishes a follow-up to results of his software poll, this time analyzing which tools tend to be used together.

In Datanami, Alex Woodie asks if Spark is overhyped, quoting extensively from some old guy. Woodie notes that it’s difficult to track the number of commercial vendors who have incorporated Spark into their products. Actually, it isn’t:

[Image: screenshot]

And yes, there are a few holdouts in the lower left quadrants.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

IBM Data Science Experience

Unless you attended the recent Spark Summit with a bag over your head, you’re aware that IBM announced something. An IBM executive wants to know if I heard the announcement. The answer is yes, I saw the press release and the planted stories, but IBM’s announcements are, shall we say, aspirational: IBM is announcing a concept. The service isn’t in limited release, and IBM has not revealed a date when the service will be available.

It’s hard to evaluate a service that IBM hasn’t defined. Media reports and the press release are inconsistent: all stories mention Spark, Jupyter, RStudio and R; some stories mention H2O, others mention CPLEX and other products. Insiders at IBM are in the dark about what components will be included in the first release.

Evaluating the release conceptually:

  • IBM already offers a managed service for Spark; it’s less flexible than Databricks or Qubole, and not as rich as Altiscale or Domino Data.
  • Unlike Qubole and Databricks, IBM plans to use Jupyter notebooks and RStudio rather than creating an integrated development environment of its own.
  • R and RStudio in the cloud are already available in AWS, Azure and Domino. If IBM plans to use a vanilla R distribution, it will be less capable than Microsoft’s enhanced R distribution available in Azure.
  • A managed service for H2O is a good thing, if it happens. There is no formal partnership between IBM and H2O.ai, and insiders at H2O seem surprised by IBM’s announcement. Of course, it’s already possible to implement H2O in any IaaS cloud environment, and H2O has users on AWS, Azure and Google Cloud platforms already.

Bottom line: IBM’s “Data Science Experience” is a marketing wrapper around an existing service, with the possibility of adding new services that may or may not be as good as offerings already in the marketplace. We’ll take another look when IBM actually releases something.

Explainers

— Davies Liu and Herman van Hovell explain SQL subqueries in Spark 2.0.

— On the MapR blog, Ellen Friedman explains SQL queries on mixed schema data with Apache Drill.

— Bill Chambers publishes the first of three parts on writing Spark applications in Databricks.

— In TechRepublic, Hope Reese explains machine learning to smart people. For everyone else, there’s this.

— Carla Schroder explains how Verizon Labs built a 600-node bare metal Mesos cluster in two weeks.

— On YouTube, H2O.ai’s Arno Candel demonstrates TensorFlow deep learning on an H2O cluster.

— Jessica Davis compiles a listicle of Tech Giants who embrace open source.

— Microsoft’s Dmitry Pechyony reports results from an analysis of 600 million taxi rides using Microsoft R Server on a single instance of the Data Science Virtual Machine in Azure.

Perspectives

— InformationWeek’s Jessica Davis wonders if Microsoft will keep LinkedIn’s commitment to open source. LinkedIn’s donations to open source have less to do with its “commitment”, and more to do with its understanding that software is not its core business.

— Arthur Cole wonders if open source software will come to rule the enterprise data center as a matter of course. The answer is: it’s already happening.

Open Source Announcements

— Apache Beam (incubating) announces version 0.1.0. Key bits: SDK for Java and runners for Apache Flink, Apache Spark and Google Cloud Dataflow.

— Apache Mahout announces version 0.12.2, a maintenance release.

— Apache SystemML (incubating) announces release 0.10.0.

Commercial Announcements

— Altiscale announces the Real-Time Edition of Altiscale Insight Cloud, which includes Apache HBase and Spark Streaming.

— Databricks announces availability of its managed Spark service on AWS GovCloud (US).

— Qubole announces QDS HBase-as-a-Service on AWS.

2 comments

  • Hi Thomas,

    What are your thoughts on the competition between IBM’s new Data Science Experience and the existing SPSS product? It seems to me that it will either cannibalize or kill SPSS.

    Best,
    Martin

    • Martin,

      IBM’s “vision” for DSE is a managed service for open source tools targeted to an expert user. It offers a programming interface through Jupyter and RStudio.

      DSE is unlikely to cannibalize SPSS, as it targets a different user persona. IBM’s other cloud platform (IBM Watson Analytics) could do so.

      Regards,

      Thomas
