Python for Analytics

A reader complains that I did not include Python in a survey of Machine Learning in Hadoop.  It’s a fair point.  There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match.  Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered.  In this post I review the available evidence about the incidence of Python use for analytics; in a separate post, I will survey Python’s capabilities.

Python is a general-purpose programming language whose syntax enables programmers to write efficient and concise code.  The Python Software Foundation manages an open source reference implementation written in C and nicknamed CPython.  Alternative implementations include Jython, written in Java; IronPython, for .NET; and PyPy, which uses a just-in-time compiler.

There is no dispute that Python is a popular language for general-purpose programming; according to the Transparent Language Popularity Index (TLPI), Python currently ranks seventh in popularity behind C, Java, Objective-C, C++, Basic and PHP.  By the same measure, exclusively analytic languages rank lower:

  • #14. R
  • #19. MATLAB
  • #26. Scala
  • #31. SAS

Measures like TLPI or the TIOBE Programming Community Index tell us something about the overall popularity of a language, but relatively little about its popularity for analytics. Many Python users aren’t at all engaged in analytics, and many analysts don’t use Python.

Python performs very well in Bob Muenchen’s analysis of analytic job postings (which he has perfected into a science).  Muenchen’s analysis shows that Python ranks third in analytic job postings, behind Java and SAS.  Python and R were at rough parity in job postings until early January 2013; since then, Python has outpaced R.

Surveys of analytic users show a mixed picture, reflecting differences in sampling and question construction.  In the 2013 Rexer survey, 64% of all respondents report writing their own code; the top reported choice is SQL (43%), followed by Java (26%) and Python (24%).  (These results are difficult to square with the overall finding that 70% of the respondents use R, which requires the user to write code.)   Rexer’s sample includes a mix of Power Analysts and Business Analysts, but relatively few Data Scientists.  (See this post for a definition of Analytic User Personas).

KDnuggets conducted its annual software poll in 2013; Python ranked fifth behind RapidMiner, R, Excel and Weka/Pentaho.   In a separate KDnuggets poll explicitly focused on programming languages for analytics, data mining and data science, Python ranked second behind R.  The KDnuggets online poll is a convenience sample (which is vulnerable to response bias), but there is no reason to believe that either R or Python users are over-represented relative to one another.  The KDnuggets community consists largely of Data Scientists and Power Analysts.

A follow-up poll by KDnuggets expressly about switching between Python and R found that more people use R than Python, and users switching from other tools are more likely to choose R over Python; however, more users are switching from R to Python than from Python to R.  The graphic below illustrates these relationships.

[Figure: Switching Between Python and R]

O’Reilly Media’s survey of data scientists at the 2012 and 2013 Strata conferences shows Python ranked third, behind SQL and R.  (The survey does not break out responses from 2012 and 2013).  More interesting is O’Reilly’s analysis of how reported usage of each tool correlates with all of the others; the graph shown below depicts all of the positive correlations significant at p=.05.

[Figure: Strata Tool Correlation]

The most striking thing in this graph is the separation between open source tools at the top of the graph and commercial tools at the bottom; respondents tend to use one or the other, but not both.  The dense network among open source tools indicates that those who use any open source tool tend to use many others.  (Weka’s isolation from other tools in the graph indicates either that (a) Weka is a really awesome tool or (b) Weka users have a unique perspective on life. Or both.)

Among respondents to O’Reilly’s survey, Python and R use are correlated, and so are Java and R use; but Python and Java use are not correlated.  Python and R use both correlate with Apache Hadoop and graph engines; Python also correlates with other components of the Hadoop ecosystem, such as Hive, Mahout and HBase.
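The post does not say how O’Reilly computed its correlations, so the following is only a minimal sketch of one way such a graph could be built.  It assumes each tool is recorded as a 0/1 indicator per respondent, measures association with the phi coefficient, and tests significance with a one-degree-of-freedom chi-square approximation.  The function names and the tiny `usage` table are made up for illustration and are not survey data.

```python
import math
from itertools import combinations


def phi_coefficient(x, y):
    """Pearson correlation between two equal-length sequences of 0/1 values."""
    n = len(x)
    both = sum(1 for a, b in zip(x, y) if a and b)
    x_yes, y_yes = sum(x), sum(y)
    denom = math.sqrt(x_yes * (n - x_yes) * y_yes * (n - y_yes))
    if denom == 0:
        return 0.0  # one tool is used by everyone or by no one
    return (n * both - x_yes * y_yes) / denom


def p_value(phi, n):
    """Chi-square (1 df) p-value, using chi2 = n * phi**2."""
    return math.erfc(math.sqrt(n * phi * phi / 2))


def positive_edges(usage, alpha=0.05):
    """Return (tool_a, tool_b, phi) for positively correlated pairs with p < alpha."""
    edges = []
    for a, b in combinations(sorted(usage), 2):
        phi = phi_coefficient(usage[a], usage[b])
        if phi > 0 and p_value(phi, len(usage[a])) < alpha:
            edges.append((a, b, round(phi, 3)))
    return edges


# Hypothetical responses from 12 respondents (1 = uses the tool).
usage = {
    "Python": [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
    "R":      [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    "SAS":    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0],
}

edges = positive_edges(usage)
print(edges)
```

With this toy table only the Python and R pair survives the filter, because SAS use is negatively associated with both.  The chi-square approximation is rough for a sample this small; a real analysis would use many more respondents and might prefer an exact test.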

To summarize: Python usage is firmly embedded in the open source analytics ecosystem; however, usage is largely concentrated among Data Scientists, with lower penetration among Power Analysts (for whom R and SAS remain the preferred languages).  The KDnuggets data suggests that new entrants to analytic programming are more likely to choose R over Python, but the rate of switching from R to Python suggests that Python addresses needs not currently met with R.

Arguments by Python advocates that Python will outpace R because it is easier to use strike me as silly.  R is not difficult to learn for motivated users.  Unmotivated users aren’t going to choose Python over R; they will choose a business analytics tool like Alpine, Alteryx or RapidMiner and skip coding entirely.  Analysts who want to code will choose a language for its functionality and not the elegance of its syntax.

Analytic User Personas

Analytic users are not all the same; in most organizations, there are a number of different user “personalities”, or personas, with distinct needs.  If you develop an analytics architecture for your organization or develop analytic software to sell to others, it is important to understand these personas.  In this essay, I profile four personas:

  • Power Analyst
  • Data Scientist
  • Business Analyst
  • Analytic Consumer

Your organization may or may not include all four personas; for example, if your organization consistently outsources predictive model building, you may have no Power Analysts or Data Scientists.   Moreover, if your organization is large enough, it may be valuable for you to recognize distinct subclasses of users within each persona.   In any event, your success depends on how well you understand the diverse needs of prospective users.

The Power Analyst

The Power Analyst sees advanced analytics as a full-time job, and either holds a position such as Statistician or Actuary in an organization with significant investments in analytics, or works as a consultant in an organization that provides analytic services.  The Power Analyst understands conventional statistics and machine learning, and has considerable working experience in applied analytics.

Power Analysts prefer to work in an analytic programming language such as Legacy SAS or R.  They have enough training and working experience with the language to be productive, and consider analytic programming languages to be more flexible and powerful than analytic software packages with graphical interfaces.  They do not need analytics to be easy, and may look down on those who do.

The right analytic method is extremely important to Power Analysts; they tend to be more concerned with using the correct methodology than with actual differences in business results achieved with different methods.  This means, for example, if a particular analytic problem calls for a specific method or class of methods, such as Survival Analysis, the Power Analyst will go to great lengths to use this method even if the improvement to predictive accuracy is very small.

In practice, since working Power Analysts tend to work with highly diverse problems and cannot always predict the nature of the problems they will need to address, they place a premium on being able to use a wide variety of analytic methods and techniques.  The need for a particular method or technique may be rare, but Power Analysts want to be able to use it if the need arises.

Since data preparation is critical to successful predictive analytics, Power Analysts need to be able to understand and control the data they work with.  This does not mean that Power Analysts want to manage the data or perform ETL tasks; it means that they need the data management processes to be transparent and responsive.  In organizations where IT does not place a premium on supporting predictive analytics, Power Analysts will take over data management and ETL to meet their own needs, but this is not necessarily the working model they prefer.

The work product of Power Analysts may be a management report of some kind showing the results of an analysis, a predictive model specification to be recoded for production, a predictive model object (such as a PMML document) or an actual executable scoring function written in a programming language such as Java or C.  Power Analysts do not want to be heavily involved in production deployment or routine model scoring, though they may be forced into this role if the organization has not invested in tooling for model score deployment.
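For readers who have not seen one, an executable scoring function is usually a small piece of code that applies a fitted model to a single record.  The sketch below shows the idea for a logistic regression; it is written in Python to match the subject of this blog rather than Java or C, and the coefficients and field names are hypothetical, not taken from any real model.

```python
import math

# Hypothetical fitted coefficients; a real model would export its own.
INTERCEPT = -2.0
WEIGHTS = {"utilization": 1.5, "late_payments": 0.8}


def score(record):
    """Return the predicted probability for one record (a dict of inputs)."""
    z = INTERCEPT + sum(w * record[name] for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


example = {"utilization": 0.0, "late_payments": 2.5}
print(score(example))
```

Because the function depends only on the coefficients and the input record, it can be handed to a production team and run without the analytic software that produced the model.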

Power Analysts are highly engaged in the specific brand, release and version of analytic software.  In organizations where the analytics team has significant influence, they play a decisive role in selecting analytic software.   They also want control over the technical infrastructure supporting the analytic software, though they tend to be indifferent about specific brands of hardware, databases, storage and so forth.

In many organizations, the Power Analyst provides an “attest” function to validate that analytics are correctly performed; hence, they tend to have disproportionate authority in analytic matters based on their reputation and expertise.

The Data Scientist

As the Google Trends graph below illustrates, the term “Data Scientist” is of recent origin, hardly used at all prior to 2011 but rapidly increasing since then.

[Figure: Google Trends, “Data Scientist”]

The Data Scientist is similar in many respects to the Power Analyst.  Both share a lack of interest in easy-to-use tooling, and a desire to engage at a granular level with the data.

The principal differences between Data Scientists and Power Analysts relate to background, training and approach.   Power Analysts tend to understand statistical methods, bring a statistical orientation to analytics, and tend to prefer working with higher-level languages with built-in analytic syntax; Data Scientists, on the other hand, tend to come from a machine learning, engineering or computer science background.  Consequently, they tend to prefer working with programming languages such as C, Java or Python and tend to be much better equipped to work with SQL and MapReduce. 

It is no accident that the growing usage of the Data Scientist label correlates with expanded deployment and use of Hadoop.  Data Scientists tend to have working experience with Hadoop, and this may be their preferred working environment.  They are comfortable working with MapReduce or Apache Spark, and will develop their own code on these platforms if there is no available “off-the-shelf” software that meets their needs.

Data Scientists’ machine learning roots influence their methods, techniques and approach, which affect their requirements for analytic tooling.  The machine learning discipline tends to focus less on choosing the right analytic method, and places the focus on results of the predictive analytics process, including the predictive power of the model produced by the process.  Hence, they are much more open to various forms of brute force learning, and choose methods that may be difficult to defend within the statistical paradigm but demonstrate good results.

Data Scientists tend to have low regard for existing analytic software vendors, especially those like SAS and IBM who cater to business customers by soft-pedaling technical details; instead, they tend to prefer open source tooling.  They seek the best technical solution, one with sufficient flexibility to support innovation.  Data Scientists tend to engage directly in the process of productionizing their analytic findings; Power Analysts, in contrast, tend to prefer an entirely hands-off role in the process.

Since the Data Scientist role has recently emerged, it may lack the sapiential authority enjoyed by the Power Analyst in conservative organizations.  In some organizations, “Data Science” is perceived negatively, and

The Business Analyst

The Business Analyst uses analytics within the context of a role in the organization where analytics is important but not the exclusive responsibility.  Business Analysts hold a range of titles, such as Loan Officer, Marketing Analyst or Merchandising Specialist.

Business Analysts are familiar with analytics and may have some training and experience.  Nevertheless, they prefer an easy-to-use interface and software such as SAS Enterprise Guide, SAS Enterprise Miner, SPSS Statistics or similar products.  

While Power Analysts are very concerned with choosing the right method for the problem, Business Analysts tend to prefer a simpler approach.  For example, they may be familiar with regression analysis, but they are unlikely to be interested in all of the various kinds of regression and the details of how regression models are calculated.  They value wizard tooling that guides the selection of methods and techniques within a problem-solving framework.

The Business Analyst may be aware that data is important to the success of analytics, but does not want to deal with it directly.  Instead, the Business Analyst prefers to work with data that is certified correct by others in the organization.  Face validity matters to the Business Analyst; data should be internally consistent and align with the analyst’s understanding of the business.

In most cases, the work product of a Business Analyst is a report summarizing the results of an analysis.  The work product may also be a decision of some kind, such as the volume of merchandise to order or a complex loan decision.  Business Analysts rarely produce predictive models for production deployment, because their working methods tend to lack the rigor and exhaustiveness of Power Analysts.

Business Analysts value good customer-friendly Technical Support, and tend to prefer to use software from vendors with demonstrated credibility in analytics.  

The Analytic Consumer

Analytic Consumers are fully focused on business questions and issues and do not engage directly in the production of analytics; instead they use the results of analytics in the form of automated decisions, forecasts and other forms of intelligence that are embedded into the business processes in which they engage.

Analytic Consumers are not necessarily top management or any other specific level in the organization; they are simply not professionally engaged in the sausage-making of forecasts, automated decisions, and so forth.

While the Analytic Consumer may not engage with mathematical computations, they are concerned with the overall utility, performance and reliability of the systems they use.  For example, a customer service rep in a credit card call center may not be concerned with the analytic method used to determine a decision, but will be very concerned if the system takes a long time to reach a decision.  The rep may also object if the system does not provide reasonable explanations when it declines a credit request, or appears to decline too many customers who seem to be good risks.

In most organizations, Analytic Consumers are the largest group of prospective users.  Since the range of possible ways that analytics can positively affect business processes is large and growing rapidly, and since embedded analytics have few barriers to use, this group of users also has the greatest growth potential.

In most organizations, there are many more prospective Analytic Consumers and Business Analysts than Power Analysts and Data Scientists; on the surface, this means that a strategy of appealing to Analytic Consumers and Business Analysts offers the greatest potential for business value.  However, few organizations are willing to entrust “hard money” analytic applications (such as fraud, credit risk or trading) to analytic novices; since the best and brightest analysts tend to be Power Analysts or Data Scientists, they tend to carry the most weight in decision-making about analytics.