O’Reilly Data Science Survey 2015
O’Reilly has released its 2015 Data Science Salary Survey. The report, authored by John King and Roger Magoulas, summarizes results from an ongoing web survey. The 2015 survey includes responses from “over 600” participants, down from the “over 800” tabulated in 2014.
The authors note that the survey includes self-selected respondents from the O’Reilly audience and may not generalize to the population of data scientists. This does not invalidate the results of the survey; all surveys of data scientists, including Rexer and KDnuggets, use unscientific samples. It does mean one should keep the survey audience in mind when interpreting results.
Moreover, since O’Reilly’s data collection methods are consistent from year to year, changes from 2014 may be meaningful.
The primary purpose of the survey is to collect data about data scientist salaries. While some find that fascinating, I am more interested in what data scientists say about the tasks they perform and tools they use, and will focus this post on those topics.
Concerning data scientist tasks, the survey confirms what we already know: data scientists spend a lot of time in exploratory data analysis and data cleaning. However, those who spend more time in meetings and those who spend more time presenting analysis earn more. In other words, the real value drivers in data science are understanding the client’s business problem and explaining the results. (This is also where many data science projects fail.)
The authors’ analysis of tool usage has improved significantly over the three iterations of the survey. In the 2015 survey, for example, they analyze operating systems and analytic tools separately; knowing that someone says they use “Windows” for analysis tells us exactly nothing.
SQL, Excel and Python remain the most popular tools, while reported R usage declined from 2014. The authors say that the change in R usage is “only marginally significant”, which tells me they need to brush up on statistics. (In statistics, a finding either is or is not significant at the preselected significance level; this prevents fudging.) The reported decline in R usage isn’t reflected in other surveys so it’s likely either (a) noise, or (b) an artifact of the sampling and data collection methods used.
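To make the point about preselected significance levels concrete, here is a minimal sketch of a two-proportion z-test. The counts are hypothetical; the report gives only “over 800” and “over 600” respondents, and I am making up the number of R users purely for illustration. The test also assumes simple random samples, which a self-selected web survey is not.

```python
import math

ALPHA = 0.05  # significance level, chosen before looking at the data


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, NOT taken from the report:
# 456 of 800 respondents reporting R in 2014, 312 of 600 in 2015.
z, p = two_proportion_z_test(x1=456, n1=800, x2=312, n2=600)
significant = p < ALPHA  # a yes/no decision, with no "marginally" option

print(f"z = {z:.2f}, p = {p:.3f}, significant at {ALPHA}: {significant}")
```

With these made-up counts, a five point drop in reported usage yields a p-value a little above 0.05, so at the preselected level the finding is simply not significant.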
The 2015 survey shows a marked increase in reported use of Spark and Scala. Within the Spark user community, the recent Databricks survey shows Python rapidly gaining on Scala as the preferred Spark interface. Scala offers little in the way of native machine learning capability, so I doubt that the language has legs among data scientists. On the other hand, respondents were much less likely to use Java, a finding mirrored in the Databricks survey. Data scientists use Scala and Java to “roll their own” algorithms; but given the rapid growth of open source and commercial algorithms (and rapidly growing Python use), I expect that we will see less of that in the future.
Reported use of Mahout has collapsed since the last survey. As I’ve written elsewhere, you can stick a fork in Mahout; it’s done. Respondents also said they were less likely to use Apache Hadoop; I guess folks have figured out that doing logistic regression in MapReduce is a loser.
Respondents also reported increased use of Tableau, which is not surprising. It’s everywhere.
The authors report discovering nine clusters of respondents based on tool usage, shown below. (In the 2014 survey, they found five clusters.)
The clustering is interesting. The top three clusters correspond roughly to a “Power Analyst” user persona, a business user who is able to use tools for analysis but is not a hardcore developer. The lower right quadrant corresponds to a developer persona, an individual with an engineering background who is able to work actively in hardcore programming languages. Hive and BusinessObjects fall into a middle category; neither tool is accessible to most business users without significant commitment and training.
Some of the findings will satisfy Captain Obvious. These tools cluster together:
- R and ggplot
- SAP HANA and BusinessObjects
- C and C++
- PostgreSQL and Amazon Redshift
- Hive, Pig, Hortonworks and Cloudera
- Python, Scala and Java
Others are surprising:
- Tableau and SAS
- SPSS and C#
- Hive and Weka
It’s also interesting to note that Amazon EMR and Amazon Redshift usage fall into different clusters, and that EMR clusters separately from Cloudera and Hortonworks.
Since the authors changed clustering methods from 2014 to 2015, it’s difficult to identify movement in the respondent population. One clear change is reflected in the separate cluster for R, which aligns more closely with the business user profile in the 2015 clustering. In the 2014 clustering, R clustered together with Python and Weka. This could easily be an artifact of the different clustering methods used, a possibility the authors could rule out by clustering respondents to the 2014 survey using the 2015 methods.
Instead, the authors engage in silly speculation about R usage, citing tiny changes in tiny correlation coefficients. (They don’t show the p-values for the correlations, but I suspect we can’t reject the hypothesis that they are all zero; so the change from year to year is also zero.) Revolution Analytics’ acquisition by Microsoft has exactly zero impact on R users’ choice of operating system; and Teradata’s support for R in 2014 (which is limited to its Aster boxes) can’t have had a material impact on data scientists’ choice of tools.
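For readers who want to check a correlation themselves, here is a rough sketch of how to get a p-value from a correlation coefficient and a sample size, using the Fisher z-transformation and a normal approximation. The values of `r` and `n` below are hypothetical, not taken from the report, and tool usage is a yes/no variable, so treat the approximation as a sanity check rather than an exact test.

```python
import math


def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: true correlation is zero.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is
    approximately standard normal under H0 when n is reasonably large.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical inputs: a "tiny" coefficient and a larger one,
# both with roughly the number of respondents in the 2015 survey.
for r in (0.05, 0.20):
    print(f"r = {r:.2f}, n = 600, p = {correlation_p_value(r, 600):.4f}")
```

With about 600 respondents, a coefficient of 0.05 gives a p-value well above 0.05, so we cannot reject the hypothesis that the true correlation is zero; a coefficient of 0.20 is a different story.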
It’s also telling that the most commonly used tools fall into a single cluster with the least commonly used tools. Folks who dabble with survey segmentation are often surprised to find that there is one big segment that is kind of a catchall for features that do not differentiate respondents. The way to deal with that is to remove the most and least cited responses from the list of active variables, since these do not differentiate respondents; spinning an interpretation of this “catchall” cluster is rubbish.
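The filtering step I describe can be sketched in a few lines. This is an illustration only: the function name, the thresholds and the toy responses are all my own assumptions, and the right cutoffs depend on the data.

```python
def select_active_tools(responses, low=0.05, high=0.60):
    """Return tools whose usage share lies within [low, high].

    responses: one set of tool names per respondent.
    Tools cited by nearly everyone or by almost no one do little to
    differentiate respondents, so they are dropped from the list of
    active clustering variables.
    """
    n = len(responses)
    counts = {}
    for tools in responses:
        for tool in tools:
            counts[tool] = counts.get(tool, 0) + 1
    return sorted(t for t, c in counts.items() if low <= c / n <= high)


# Toy data, made up for illustration: 10 respondents.
responses = (
    [{"SQL", "R"}] * 4
    + [{"SQL", "Python"}] * 5
    + [{"SQL", "Mahout"}]
)

# SQL (cited by everyone) and Mahout (cited once) are excluded.
active = select_active_tools(responses, low=0.15, high=0.60)
print(active)
```

The excluded tools can still be used to profile the clusters afterwards; they just should not drive the segmentation.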