Big Analytics Roundup (September 12, 2016)
On a Google blog, Kaz Sato describes how a Japanese farmer uses TensorFlow to classify cucumbers. Very good. Perhaps now Google can set TensorFlow to work figuring out how to comply with EU regulations.
Adrian Colyer returns for the Fall semester with daily papers.
HPE Holds a Fire Sale
HPE announces the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also grants equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets.
So, a company that never had a clue about analytics sells its software business to a company that doesn’t even pretend to have a clue about analytics.
The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for cutting costs (read: firing people), so if you’re working for an HPE software business, this may be a good time to dust off your resume.
Hewlett Packard Enterprise (HPE) announces Haven on Demand on Microsoft Azure; PR firestorm ensues. Haven is a
loose bundle of software assets salvaged from the train wreck of Autonomy, together with Vertica, ArcSight, and the HP Operations Management machine learning suite, originally branded as HAVEn and announced by HP in June 2013. Since then, the software hasn’t exactly gone viral; Haven failed to make KDnuggets’ list of the top 50 machine learning APIs last December, a list that includes the likes of Ersatz, Hutoma, and Skyttle.
In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So this latest announcement is a re-re-release of the software.
How many data scientists use Haven three years into its product life cycle?
- KDnuggets 2016 Data Science Software Usage poll: 2 out of 2,805 respondents
- O’Reilly 2016 Data Science Salary Survey: 0 out of 983 respondents
I’ll believe that Haven is a serious entrant in the data science space when real data scientists use it.
The Tools Data Scientists Use
Speaking of the tools data scientists use, O’Reilly’s 2016 Data Science Salary Survey is now available for free (registration required). The report illustrates what we all know: the term “data scientist” is an ill-defined descriptor for a very wide range of activities, from developing applications with Elasticsearch and MongoDB to attending meetings and using Microsoft Excel.
The survey asks respondents to provide detailed information about the tools they use and the tasks they perform. There are some interesting findings:
- SQL, R, and Python are respondents’ top choices among programming languages.
- MySQL edges out Microsoft SQL Server for top ranking in relational databases.
- Spark passes Hive to take the lead as the most widely used Big Data platform.
- Microsoft Excel retains its position as the most popular BI tool, by far.
The authors use k-means to identify clusters among the respondents. Read the report for details. Here’s my summary description of the four clusters:
Fake Data Scientists: Respondents in this cluster have a low propensity to spend time performing any of 21 common data science tasks. They use Windows, Excel, and SQL. In other words, these are people who call themselves data scientists but aren’t.
Communicators: These respondents spend a lot of time creating visualizations, developing dashboards, conducting analysis to answer research questions and communicating findings to business decision makers. They use Windows, Excel, and SQL, together with R, Tableau, and other BI tools.
Analytic Developers: This cluster includes respondents with a relatively high propensity to perform data cleaning, feature extraction and other “early in the pipeline” activities. They also develop prototype models and put them into production. Respondents in this cluster tend to use Python, scikit-learn, and R on Linux or MacOS.
Analytic Leaders: These respondents stand out from the other three clusters because they spend time identifying business problems to be solved with analytics, planning large scale projects and communicating findings. They are also hands-on with most other data science tasks.
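The survey's clustering approach can be sketched in a few lines. This is a minimal illustration, not O'Reilly's actual analysis: the respondent data, feature encoding, and preprocessing below are hypothetical stand-ins (rows are respondents, columns are how often each of 21 common data science tasks is performed), and the sklearn k-means call simply shows the general technique.

```python
# Hedged sketch: k-means clustering of survey respondents by task frequency.
# The data here is synthetic; the real survey's features are not reproduced.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 100 hypothetical respondents x 21 task-frequency scores (0 = never, 4 = daily)
X = rng.integers(0, 5, size=(100, 21)).astype(float)

# Four clusters, matching the four respondent profiles described above
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_            # cluster assignment for each respondent
centers = kmeans.cluster_centers_  # average task profile of each cluster
print(labels.shape, centers.shape)
```

Each cluster center is the average task-frequency profile of its members, which is what lets the authors characterize clusters as, say, heavy on dashboards and communication versus heavy on data cleaning and model deployment.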
Breaking — Data Engineers Are Scarce
Stitch Data is a Philadelphia-based startup that describes itself as an “ETL service”, not to be confused with Stitch App, Stitch.net, Team Stitch, Get Stitching, Stitcher, Stitch Fix, Stitcher Ads, or Stitch Wood. Anyway, Stitch Data publishes The State of Data Engineering, a detailed survey of the data engineering field. The report credibly draws a distinction between data engineers and data scientists and concludes that we need more of the former.
Intel Buys Movidius
— Adam Kelleher offers a technical primer on causality. Trigger warning: contains math.
— Tony Baer touts machine learning startup DataRobot.
— George Leopold profiles In-Q-Tel, the CIA’s venture capital firm, and its recent partnerships with Zoomdata and Databricks. Anyone who still thinks that cloud platforms are not secure should read his notes on C2S and AWS GovCloud.
— Madding King scrapes Venture Beat headlines to discover that Pokemon is hot this year.
— In TechRepublic, Matt Asay interviews Lei Xu, a senior engineer at Qunar, China’s top travel site, who raves about the benefits of the Alluxio in-memory file system.