Big Analytics Roundup (September 12, 2016)

On a Google blog, Kaz Sato describes how a Japanese farmer uses TensorFlow to classify cucumbers. Very good. Perhaps now Google can set TensorFlow to work figuring out how to comply with EU regulations.

Good Reads

Adrian Colyer returns for the Fall semester with daily papers.

Superlative Reads

Dez Blanchfield heaps praise on obscure blogger’s new book.

HPE Holds a Fire Sale

HPE announces the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also grants equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets.

So, a company that never had a clue about analytics sells its software business to a company that doesn’t even pretend to have a clue about analytics.

The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for firing people (officially, "cutting costs"), so if you’re working for an HPE software business, this may be a good time to dust off your resume.

Meanwhile, Robin Bloor joins the parade of distinguished industry analysts touting Haven, HPE’s cloud-based machine learning library. Here’s a snippet from what I wrote about Haven back in March:

Hewlett Packard Enterprise (HPE) announces Haven on Demand on Microsoft Azure; PR firestorm ensues.  Haven is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, originally branded as HAVEn and announced by HP in June 2013.  Since then, the software hasn’t exactly gone viral; Haven failed to make KDnuggets’ list of the top 50 machine learning APIs last December, a list that includes the likes of Ersatz, Hutoma, and Skyttle.

In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So this latest announcement is a re-re-release of the software. 

How many data scientists use Haven three years into its product life cycle?

  • KDnuggets 2016 Data Science Software Usage poll: 2 out of 2,805 respondents
  • O’Reilly 2016 Data Science Salary Survey: 0 out of 983 respondents

I’ll believe that Haven is a serious entrant in the data science space when real data scientists use it.

The Tools Data Scientists Use

Speaking of the tools data scientists use, O’Reilly’s 2016 Data Science Salary Survey is available now for free (registration required).  The report illustrates what we all know: the term “data scientist” is an ill-defined descriptor for a very wide range of activities, from developing applications with Elasticsearch and MongoDB to attending meetings and using Microsoft Excel.

The survey asks respondents to provide detailed information about the tools they use and the tasks they perform. There are some interesting findings:

  • SQL, R, and Python are respondents’ top choices in programming languages.
  • MySQL edges out Microsoft SQL Server for top ranking in relational databases.
  • Spark passes Hive to take the lead as most widely used Big Data platform.
  • Microsoft Excel retains its position as the most popular BI tool, by far.

The authors use k-means to identify clusters among the respondents. Read the report for details. Here’s my summary description of the four clusters:

Fake Data Scientists: Respondents in this cluster have a low propensity to spend time performing any of 21 common data science tasks. They use Windows, Excel, and SQL. In other words, these are people who call themselves data scientists but aren’t.

Communicators:  These respondents spend a lot of time creating visualizations, developing dashboards, conducting analysis to answer research questions and communicating findings to business decision makers. They use Windows, Excel, and SQL, together with R, Tableau, and other BI tools.

Analytic Developers: This cluster includes respondents with a relatively high propensity to perform data cleaning, feature extraction and other “early in the pipeline” activities. They also develop prototype models and put them into production. Respondents in this cluster tend to use Python, scikit-learn, and R on Linux or MacOS.

Analytic Leaders: These respondents stand out from the other three clusters because they spend time identifying business problems to be solved with analytics, planning large scale projects and communicating findings. They are also hands-on with most other data science tasks.
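For readers who haven’t met k-means, here is a minimal sketch of the idea in plain Python. It is not the survey authors’ code: the six “respondents” and their two features (weekly hours spent on modeling and on meetings) are made up for illustration, and the deterministic farthest-point initialization is my own simplification so the toy example is reproducible. Library implementations typically use randomized starts instead.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    # Simplified deterministic start: take the first point, then repeatedly
    # add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels


# Hypothetical data: (hours/week modeling, hours/week in meetings).
respondents = [(1, 9), (2, 8), (1, 8), (9, 1), (8, 2), (9, 2)]
centroids, labels = kmeans(respondents, k=2)
print(labels)
```

The report clusters on many more features than two, and the interesting work is in interpreting the resulting groups, which is what the summaries above attempt.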

Breaking — Data Engineers Are Scarce

Stitch Data is a Philadelphia-based startup that describes itself as an “ETL service”, not to be confused with Stitch App, Team Stitch, Get Stitching, Stitcher, Stitch Fix, Stitcher Ads, or Stitch Wood. Anyway, Stitch Data publishes The State of Data Engineering, a detailed survey of the data engineering field. The report credibly draws a distinction between data engineers and data scientists and concludes that we need more of the former.

Intel Buys Movidius

Intel grabs another AI/machine learning startup. This time, it’s Movidius, the folks who put a deep learning chip on a memory stick.




— Adam Kelleher offers a technical primer on causality. Trigger warning: contains math.

— Tony Baer touts machine learning startup DataRobot.

— George Leopold profiles In-Q-Tel, the CIA’s venture capital firm, and its recent partnerships with Zoomdata and Databricks.  Anyone who still thinks that cloud platforms are not secure should read his notes on C2S and AWS GovCloud.

— Madding King scrapes VentureBeat headlines to discover that Pokémon is hot this year.

— In TechRepublic, Matt Asay interviews Lei Xu, a senior engineer at Qunar, China’s top travel site, who raves about the benefits of the Alluxio in-memory file system.

