Python for Analytics
A reader complains that I did not include Python in a survey of Machine Learning in Hadoop. It’s a fair point. There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match. Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered. In this post I review the available evidence about the incidence of Python use for analytics; in a separate post, I will survey Python’s capabilities.
Python is a general-purpose programming language whose syntax enables programmers to write efficient and concise code. The Python Software Foundation manages an open source reference implementation written in C and nicknamed CPython. Alternative implementations include Jython, written in Java; IronPython, for .NET; and PyPy, which includes a just-in-time compiler.
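As a small illustration of the point about multiple implementations, the standard library's `platform` module reports which one is running a given script. This is only a sketch; the helper name `describe_interpreter` is made up for this example.

```python
import platform


def describe_interpreter():
    """Return the implementation name and version of the running interpreter."""
    # platform.python_implementation() returns a string such as
    # "CPython", "PyPy", "Jython" or "IronPython".
    name = platform.python_implementation()
    version = platform.python_version()
    return f"{name} {version}"


print(describe_interpreter())
```

The output depends on the interpreter used to run the script, so no particular result is shown here.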
There is no dispute that Python is a popular language for general-purpose programming; according to the Transparent Language Popularity Index (TLPI), Python currently ranks seventh in popularity behind C, Java, Objective-C, C++, Basic and PHP. By the same measure, languages more closely associated with analytics rank lower:
- #14. R
- #19. MATLAB
- #26. Scala
- #31. SAS
Measures like TLPI or the Tiobe Community Programming Index tell us something about the overall popularity of a language, but relatively little about its popularity for analytics. Many Python users aren’t at all engaged in analytics, and many analysts don’t use Python.
Python performs very well in Bob Muenchen’s analysis of analytic job postings (which he has perfected into a science). Muenchen’s analysis shows that Python ranks third in analytic job postings, behind Java and SAS. Python and R were at rough parity in job postings until early January 2013; since then, Python has outpaced R.
Surveys of analytic users show a mixed picture, reflecting differences in sampling and question construction. In the 2013 Rexer survey, 64% of all respondents reported writing their own code; the top reported choice was SQL (43%), followed by Java (26%) and Python (24%). (These results are difficult to square with the overall finding that 70% of the respondents use R, which requires the user to write code.) Rexer’s sample includes a mix of Power Analysts and Business Analysts, but relatively few Data Scientists. (See this post for a definition of Analytic User Personas.)
KDnuggets conducted its annual software poll in 2013; Python ranked fifth behind RapidMiner, R, Excel and Weka/Pentaho. In a separate KDnuggets poll explicitly focused on programming languages for analytics, data mining and data science, Python ranked second behind R. The KDnuggets online poll is a convenience sample (and therefore vulnerable to self-selection bias), but there is no reason to believe that either R or Python users are over-represented relative to one another. The KDnuggets community consists largely of Data Scientists and Power Analysts.
A follow-up poll by KDnuggets expressly about switching between Python and R found that more people use R than Python, and users switching from other tools are more likely to choose R over Python; however, more users are switching from R to Python than from Python to R. The graphic below illustrates these relationships.
O’Reilly Media’s survey of data scientists at the 2012 and 2013 Strata conferences shows Python ranked third, behind SQL and R. (The survey does not break out responses from 2012 and 2013.) More interesting is O’Reilly’s analysis of how reported usage of each tool correlates with all of the others; the graph shown below depicts all of the positive correlations significant at the p = .05 level.
The most striking thing in this graph is the separation between open source tools at the top of the graph and commercial tools at the bottom; respondents tend to use one or the other, but not both. The dense network among open source tools indicates that those who use any open source tool tend to use many others. (Weka’s isolation from other tools in the graph indicates either that (a) Weka is a really awesome tool or (b) Weka users have a unique perspective on life. Or both.)
Among respondents to O’Reilly’s survey, Python and R use are correlated, and so are Java and R use; but Python and Java use are not correlated. Python and R use both correlate with Apache Hadoop and graph engines; Python also correlates with other components of the Hadoop ecosystem, such as Hive, Mahout and HBase.
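For readers curious how a correlation graph of this kind might be built, here is a minimal sketch in Python. It is not O’Reilly’s method, which the survey summary does not spell out; it assumes each respondent simply reports the set of tools used, measures association between two tools with the phi coefficient for a 2x2 table, and keeps positive pairs whose chi-square statistic (n times phi squared, one degree of freedom) exceeds 3.841, the .05 critical value. The respondent data and the function names are invented purely for illustration.

```python
from itertools import combinations
from math import sqrt

# Chi-square critical value for 1 degree of freedom at alpha = .05.
CHI2_CRIT_1DF_05 = 3.841


def phi(responses, a, b):
    """Phi coefficient between use of tool a and use of tool b."""
    n11 = sum(1 for r in responses if a in r and b in r)
    n10 = sum(1 for r in responses if a in r and b not in r)
    n01 = sum(1 for r in responses if a not in r and b in r)
    n00 = sum(1 for r in responses if a not in r and b not in r)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one tool is used by everyone or by no one
    return (n11 * n00 - n10 * n01) / denom


def positive_edges(responses, tools):
    """Pairs of tools with a positive, significant association."""
    n = len(responses)
    edges = []
    for a, b in combinations(tools, 2):
        p = phi(responses, a, b)
        # For a 2x2 table, the chi-square statistic equals n * phi^2.
        if p > 0 and n * p * p > CHI2_CRIT_1DF_05:
            edges.append((a, b, round(p, 2)))
    return edges


# Hypothetical toy data: five respondents use Python and R, five use SAS only.
responses = [{"Python", "R"}] * 5 + [{"SAS"}] * 5
tools = ["Python", "R", "SAS"]

print(positive_edges(responses, tools))  # [('Python', 'R', 1.0)]
```

With a sample this small the chi-square approximation is unreliable; the toy data only shows the mechanics. Each returned pair would become an edge in a graph like the one described above, while negative associations (here, SAS against the open source tools) are dropped.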
To summarize: Python usage is firmly embedded in the open source analytics ecosystem; however, usage is largely concentrated among Data Scientists, with lower penetration among Power Analysts (for whom R and SAS remain the preferred languages). The KDnuggets data suggests that new entrants to analytic programming are more likely to choose R over Python, but the rate of switching from R to Python suggests that Python addresses needs not currently met with R.
Arguments by Python advocates that Python will outpace R because it is easier to use strike me as silly. R is not difficult to learn for motivated users. Unmotivated users aren’t going to choose Python over R; they will choose a business analytics tool like Alpine, Alteryx or RapidMiner and skip coding entirely. Analysts who want to code will choose a language for its functionality and not the elegance of its syntax.