The Year in SQL Engines

As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run SQL; it runs HiveQL, an SQL-like language that closely approximates SQL. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, Spark, Drill, HAWQ, and Presto also work with a wide variety of other data storage systems.

Unlike relational databases, SQL engines operate independently of the data storage system. In contrast, relational databases bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling them, on the other hand, provides greater flexibility, though at the potential loss of performance.

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries; mentions in online discussions; job offers; mentions in professional profiles, and tweets.

Figure 1

screen-shot-2017-01-31-at-1-04-43-pm
Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

Although Impala, Spark SQL, Drill, Hawq, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap to the other engines. Most of the Hive alternatives launched in 2012 when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, Drill, et.al. ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2

screen-shot-2017-01-31-at-2-27-15-pm
Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3

screen-shot-2017-01-31-at-2-52-27-pm
Source: Open Hub https://www.openhub.net/

In 2016, ClouderaHortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks Tony Baer summarizes. I’m sure that you will be shocked to learn that the vendor’s preferred SQL engine outperformed the others in each of these studies, which begs the question: are benchmarks bullshit?

AtScale‘s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL backends. The company’s software is engine-neutral — it seeks to run on as many as possible — and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes improvements to several improvements to Hive-on-Spark, plus performance, usability, supportability and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012, as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries. Mike Olson, Cloudera’s Chief Strategy Officer,

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014 Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala;  Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure and delivered Release 2.7.0, its first Apache release in October. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, in an Alpha release as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ 2.0.0.0 in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.

Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases or proprietary file systems (such as Amazon Web Services’ S3.)  Presto queries can federate data from multiple sources.  Users can submit queries from C, Java, Node.js, PHP, Python, R and Ruby.

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project.  Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and ZoomData, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to production. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine — I can see the facepalms — so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”

IBM and Spark (Updated)

Updated March 8, 2016.  After publishing this post, I met with several IBM executives at Spark Summit East, who confirmed the accuracy of the original post and provided additional detail, which I’ve included in this version.  Updates are in bold red italics.

IBM also provided the low-resolution image.

IBM has a good story to tell — one out of ten contributors to Spark 1.6 were IBM employees.  But IBM does not tell its story effectively.  Nobody cares that IBM invented the punch card and the floppy disk.  Nobody cares that IBM is so big it can’t tell a straight product story.  Bigness is IBM’s problem.

On June 15, 2015, IBM announced a major commitment to Spark.  As we approach Spark Summit East, I thought it would be fun to check back and see how IBM’s accomplishments compare with the goals stated back in June.

Before we start, I’d like to note that any contribution to Spark moves the project forward, and is a good thing.  Also, simply by endorsing Spark, IBM has changed the conversation.  In early 2015, some analysts and journalists claimed that Spark was overhyped and “not enterprise ready.”  We haven’t heard a peep from this crowd since IBM’s announcement.  For that alone, IBM should get some kind of prize.  🙂

In its announcement, IBM detailed six initiatives:

  • IBM will build Spark into the core of the company’s analytics and commerce platforms.
  • IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
  • IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
  • IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
  • IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Let’s see where things stand.

Spark in IBM Analytics and Commerce Platforms

IBM has an expansive definition of “analytics”, reporting $17.9 billion in business analytics revenue in 2015.  IDC, which tracks the market, credits IBM with $4.5 billion in business analytics software revenue in 2014.  The remaining $13.4 billion, it seems is services and fluff, neither of which count when the discussion is “platforms.”

Of that $4.5 billion, the big dogs are DataStage InfoSphere, DB2, Netezza PureData System for Analytics, Cognos IBM Business Analytics and SPSS IBM Predictive Analytics — so this is where we should look when IBM says it is building Spark into its products.

Currently, Cloudant is the only IBM data source with a published Spark connector.  Want to access DB2 with Spark?  It’s a science project.  Of course, you can always use the JDBC connector if you’re patient, but the standard is a parallel high-speed connector, like SAS has offered for years.  An IBM insider tells me that there is a project underway to build a Spark connector for Netezza, which will be a good thing when it’s available.

Update: IBM has subsequently added the one-way single-threaded Netezza connector to Spark Packages.  It’s also available on Git and the IBM Developer site.

I would emphasize that a one-way single-threaded connector is useful once, when you decommission your Netezza box and move the data elsewhere.  Netezza developed a native multi-threaded connector for SAS in a matter of months, so it’s not clear why it takes IBM so long to deliver something comparable for Spark.  

 

Last October in Budapest, an IBM VP — one of 137 IBM Veeps of “Big Data”  — claimed that DataStage InfoSphere supports Spark now.  Searching documentation for InfoSphere 11.3, the most current version, produces this:

Screen Shot 2016-02-13 at 2.48.21 PM

IBM appears to have discovered a new kind of product management where you build features into a product, then omit them from the documentation.

That was the approach taken for IBM Analytics Server Release 2.1, which was packaged up and shoved out the door so fast the documentation folks forgot to mention the Spark pushdown.  That said, Release 2.1.0.1 is an improvement; all functions that Analytic Server can push down to MapReduce now push down to Spark, and IBM supports the product on Cloudera and MapR as well as Hortonworks and BigInsights.

It’s not clear, though, why IBM thinks that licensing Analytics Server as a separate product is a smart move.  Most of the value in analytics is at the top of the stack (e.g. SPSS or Cognos,) where users can see results.  Spark pushdown is rapidly becoming table stakes for analytics software; the smarter move for IBM is to bundle Analytics Server for free into SPSS, Cognos and BigInsights, to build value in those products.

It’s also curious that IBM simultaneously continues to peddle Analytics Server while donating SystemML to open source.  Why not push down to Spark through SystemML?

So far as Spark is concerned, IBM is leaving SPSS Statistics users out in the cold unless they want to add Modeler and Analytics Server to the stack.  For these customers, Alteryx and RapidMiner look attractive.

A search for Spark in Cognos documentation yields a big fat zero, which explains why Gartner just tossed IBM from the Leaders quadrant in BI.Screen Shot 2016-02-13 at 3.11.35 PM

Update: at my request, an IBM executive shared a list of products that IBM says it has rebuilt around Spark to date.  I’m publishing it verbatim for reference, but note that the list includes:

  1. Double and triple counting (“Watson Content Analytics integrates with Dataworks which integrates with Spark as a Service”).
  2. Products that do not seem to exist (Spark connectors to DB2 and Informix appear in IBM documentation as generic JDBC connections).
  3. Aspirational products (Spark on Z/OS).
  4. Projects that are not products (Watson Discovery Advisor ran a POC with Spark).
  5. Capabilities that require little or no contribution from IBM (Spark runs under Platform Symphony EGO YARN Service). 

Other than that, it’s a good list.

  • IBM BigInsights  ( Version 4.0 included Spark 1.3.1, version 4.1 includes Spark 1.4.1 – GA’ed August 25th
  • EHAAS (BigInsights on Cloud)  – Includes Spark version 1.3.1 – GA’ed June  2015
  • Analytic for Hadoop – Includes Spark version 1.3.1 – Beta . Will be replaced by Pay-go and include Spark 1.4.1
  • Spark-as-a Service  – Beta in July . GA – Oct. Currently uses Spark version 1.3.1. Will move to Spark 1.4.1
  • Dataworks (Only Cloud) – Beta  – Integrated with Spark Service. Uses Spark Version 1.3.1 
  • SPSS Analytic Server and SPSS Modeler – SPSS Modeler will support Spark version 1.4.1. GA planned for end of Q3, 2015
  • Cloudant  – Cloudant includes Spark Connector. This product is already GA.
  • Omni Channel Pricing – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Dynamic Pricing –  Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Mark Down Optimization –   Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Nimbus ETL – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA early 2016
  • Journey Analytics – Part of the IBM Commerce Brand. Integrating with Spark 1.4.1. Expected to GA end of 2015
  • IBM Twitter CDE  On Cloud – Internal only
  • IBM Insights for Twitter Service – On Cloud , Externally Available
  • Internet of Things Real time Analytics on Cloud.  Integrating with Spark 1.3.1  – Open Beta
  • Platform Symphony EGO Service – Integrated with Spark Service for Resource Scheduling and Management. Can also be used with Spark bundled with IBM Open Platform.
  • DB2 Spark Connector
  • Netezza Spark Connector
  • Watson Content Analytics  -> Integrated with Dataworks which is integrating with Spark-as-a-Service
  • Watson Content Services – > Planning to use Spark for Data Ingestion and Enrichment. 
  • Spark on Zlinux  -> Spark enabled on zLinux
  • Spark on ZOS -> Will be available by end of year
  • No response
  • GPFS -GPFS – Spark, as part of BI, runs out of the box on GPFS. In our GPFS Ambari RPM package we changed the Spark service dependency from HDFS to GPFS and added the GPFS Hadoop connector jar to the classpath
  • Informix Connector
  • Watson Discovery Advisor  –  This is a product within Watson Health.  In a small POC that this team has done. they observed that using Scala and Spark , they can reduce their lines of code from 1000s to few hundred lines.” Looking to integrate Spark in early 2016.
  • Cognos – Team indicated that they will be able to submit SPARK SQL queries for getting the results for the data. They would connect to Spark using the JDBC/ODBC driver and then be able to execute Spark SQL queries to generate results for the report . This is planned for 2016.
  • Streams –   Spark MLLib Toolkit for InfoSphere Streams 
  • Watson Research team – Developed a Geospatial RDD on Spark.

Spark in IBM Watson Health Cloud

IBM insiders tell me that Watson relies heavily on Spark.  Okay.  Watson is a completely opaque product, so it’s impossible to verify whether IBM powers Watson with Spark or an army of trained crickets.

SystemML to Open Source

Though initially skeptical about SystemML, as I learn more about this software I’m more excited about its potential.  Rather than simply building an interface to the native Spark machine learning library (MLlib), I understand that IBM has completely rebuilt the algorithms.  That’s a good thing — some of the folks I know who have tried to use MLlib aren’t impressed.  Without getting into the details of the issues, suffice to say that it’s good for Spark to have multiple initiatives building functional libraries on top of the Spark core.

IBM’s Fred Reiss, chief architect at IBM’s Spark Technology Center, is scheduled to present about SystemML at Spark Summit East next week.

Spark in Bluemix

IBM introduced Spark-as-as-Service in Bluemix as beta in July, 2015 and general availability in October.

The service includes Spark, Jupyter Notebooks and SWIFT object storage.  It’s a bit lost in the jumble of services available in Bluemix, but the Catalogue has a handy search tool.

As of this writing, Bluemix offers Spark 1.4.  Although two dot releases behind, that is competitive with Qubole Data Service.  Databricks is still the best bet if you want the most up to date release.

Update: An IBM executive tells me that Bluemix now uses Spark 1.6.  In the meantime, however, IBM has removed the Spark release version from its Bluemix documentation.  

IBM People to Spark Projects

3,500 researchers and developers.  Wow!  That’s a lot of butts in seats.  Let’s break that down into four categories:

(1) IBM people who actively contribute to Spark.

(2) IBM developers building interfaces from IBM products to Spark.

(3) IBM developers building IBM products on top of Spark.

(4) IBM consultants building custom applications on top of Spark.

Note that of the four categories, only (1) actually moves the Spark project forward.  Of course, anyone who uses Spark has the potential to contribute feedback, but ultimately someone has to cut code.  While IBM tracks the Spark JIRAs to which it contributes, IBM executives could not answer a simple question: of the 248 people who contributed to Spark Release 1.6, how many work for IBM?

I suspect that most of those 3,500 researchers and developers are in categories (3) and (4).

Satheesh Bandaram from the IBM Spark Technology Center replies: 26 people from STC contributed to Spark 1.6, with about 80 code commits.

Additional IBM response: Since June 2015. whenIBM announced Spark Technology Center (STC), engineers in STC have actively contributed to Spark releases:  v1.4.x, v1.5.x, v1.6.0, as well as releases v1.6.1 and v.2.0 (in progress.)  

As of today March 2, IBM STC has contributed to over 237 JIRAs and counting.  About 50% are answers to major JIRAs reported in Apache Spark.  

What’s in those 237 contributions…..
·        103 out of 237 (43%) are deliverables in Spark SQL area
·        56 of them (23%) are in MLlib module
·        37 (16%) are in PySpark module  

These top 3 areas of focus from IBM STC made up 82% of the total contributions as of today.   The rest are in the documentation, Spark Core and Streaming etc. modules.

You can track progress onthis live dashboard on github http://jiras.spark.tc/

Specific to Spark 1.6, IBM team members have over 80 commits – Majority of them from STC.  A total of 28 team members contributed to the release (25 of them from STC). Each contributing engineer is a credited contributor in the release notes of Spark 1.6.

For SparkSQL, we contributed:
·        enhancements and fixes in the new DataSet API
·        DataFrame API, Data type
·        UDF and SQL standard compliance, such as adding EXPLAIN and PrintSchema capability, and support coalesce and re-partition etc.
·         We have added support for column datatype of CHAR
·         Fixed the type extractor failures for complex data types
·         Fixed DataFrames bugs in saving long column partitioned parquet file, and handling of various nullability bugs and optimization issues
·        Fixed the limitation in Order by clause to comply with standard.  
·        Contributed to a number of UDF code fixes in completion of Stddev support.  

For Machine Learning, the STC team met with key influencers and stakeholders in the Spark community to jointly work on items on the roadmap Most of the roadmap items discussed went into 1.6.  The implementation of LU Decomposition algorithm is slated for the upcoming release.

In addition to helping implement the roadmap, here are some notable contributions:
·        We greatly improved the Pyspark distributed matrix algebra by enriching the matrix operations and fixing bugs.
·        Enhanced the Word2Vec algorithm.
·        We added optimized first through fourth order summary statistics for DataFrames (technically in SparkSQL, but related to machine learning).
·        We greatly enhanced Pyspark API by adding interfaces to Scala Machine learning tools.
·        We made a performance enhancement to the Linear Data Generator which is critical for unit testing in Spark ML.

The team also addressed major regressions on DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added JDBC Dialect for Apache Derby.

In addition to the JIRA activities, IBM STC also added the JDBC dialect support for DB2 and made Spark Connector for Netezza v0.1.1 available to public through Spark Packages and a developer blog on IBM external site. 

Spark Training

Like the Million Man March, “training a million people” sounds like one of those PR-driven claims that nobody expects to take seriously, especially since it’s not time-boxed.

Anyway, the details:

  • AMPLab offers occasional training in the complete BDAS stack under the AMPCamp format.  IBM funds AMPLab, but it does not appear that AMPLab is doing anything now that it wasn’t already doing last June.
  • DataCamp does not offer Spark training.
  • MetiStream offers public and private Spark training with a defined curriculum and service offering.  The training program is certified by Databricks.
  • Galvanize does not offer Spark training.
  • Big Data University offers a two-part MOOC in Spark fundamentals.

The Big Data University courses are free, and four hours apiece, so a million enrollees is plausible, eventually at least.  Interestingly, MetiStream developed the second of the two BDU courses.  So the press release should read “MetiStream and IBM, but mostly MetiStream, will train a million….”

Benchmark: Spark Beats MapReduce

A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads.  In this benchmark, Spark outperformed MapReduce on Word Count, k-Means and Page Rank, while MapReduce outperformed Spark on Sort.

On the ADT Dev Watch blog Dave Ramel summarizes the paper, arguing that it “brings into question..Databricks Daytona GraySort claim”.  This point refers to Databricks’ record-setting entry in the 2014 Sort Benchmark run by Chris Nyberg, Mehul Shah and Naga Govindaraju.

However, Ramel appears to have overlooked section 3.3.1 of the paper, where the researchers explicitly address this question:

This difference is mainly because our cluster is connected using 1 Gbps Ethernet, as compared to a 10 Gbps Ethernet in, i.e., in our cluster configuration network can become a bottleneck for Sort in Spark.

In other words, had they deployed Spark on a cluster with high-speed network connections, it likely would run the Sort faster than MapReduce did.

I guess we’ll know when Nyberg et. al. release the 2015 GraySort results.

The IBM benchmark team found that k-means ran about 5X faster in Spark than in MapReduce.  Ramel highlights the difference between this and the Spark team’s claim that machine learning algorithms run “up to” 100X faster.

The actual performance comparison shown on the Spark website compares logistic regression, which the IBM researchers did not test.  One possible explanation — the Spark team may have tested against Mahout’s logistic regression algorithm, which runs on a single machine.  It’s hard to say, since the Spark team provides no backup documentation for its performance claims.  That needs to change.

O’Reilly Data Science Survey 2015

O’Reilly releases its 2015 Data Science Salary Survey.  The report, authored by John King and Roger Magoulas summarizes results from an ongoing web survey.  The 2015 survey includes responses from “over 600” participants, down from the “over 800” tabulated in 2014.

The authors note that the survey includes self-selected respondents from the O’Reilly audience and may not generalize to the population of data scientists.  This does not invalidate results of the survey — all surveys of data scientists, including Rexer and KDnuggets — use unscientific samples.  It does mean one should keep the survey audience in mind when interpreting results.

Moreover, since O’Reilly’s data collection methods are consistent from year to year, changes from 2014 may be significant.

The primary purpose of the survey is to collect data about data scientist salaries.  While some find that fascinating, I am more interested in what data scientists say about the tasks they perform and tools they use, and will focus this post on those topics.

Concerning data scientist tasks, the survey confirms what we already know: data scientists spend a lot of time in exploratory data analysis and data cleaning.  However, those who spend more time in meetings and those who spend more time presenting analysis earn more.  In other words, the real value drivers in data science are understanding the client’s business problem and explaining the results.  (This is also where many data science projects fail.)

The authors’ analysis of tool usage has improved significantly over the three iterations of the survey.  In the 2015 survey, for example, they analyze operating systems and analytic tools separately; knowing that someone says they use “Windows” for analysis tells us exactly nothing.

SQL, Excel and Python remain the most popular tools, while reported R usage declined from 2014.  The authors say that the change in R usage is “only marginally significant”, which tells me they need to brush up on statistics.  (In statistics, a finding either is or is not significant at the preselected significance level; this prevents fudging.)  The reported decline in R usage isn’t reflected in other surveys so it’s likely either (a) noise, or (b) an artifact of the sampling and data collection methods used.

The 2015 survey shows a marked increase in reported use of Spark and Scala.  Within the Spark user community, the recent Databricks survey shows Python rapidly gaining on Scala as the preferred Spark interface.  Scala offers little in the way of native machine learning capability, so I doubt that the language has legs among data scientists.  On the other hand, respondents were much less likely to use Java, a finding mirrored in the Databricks survey.  Data scientists use Scala and Java to “roll their own” algorithms; but given the rapid growth of open source and commercial algorithms (and rapidly growing Python use), I expect that we will see less of that in the future.

Reported use of Mahout collapsed since the last survey.  As I’ve written elsewhere, you can stick a fork in Mahout — it’s done.  Respondents also said they were less likely to use Apache Hadoop; I guess folks have figured out that doing logistic regression in MapReduce is a loser.

Respondents also reported increased use of Tableau, which is not surprising.  It’s everywhere.

The authors report discovering nine clusters of respondents based on tool usage, shown below.  (In the 2014 survey, they found five clusters.)

Screen Shot 2015-10-05 at 11.19.29 AM

The clustering is interesting.  The top three clusters correspond roughly to a “Power Analyst” user persona, a business user who is able to use tools for analysis but is not a hardcore developer.  The lower right quadrant corresponds to a developer persona, an individual with an Engineering background able to work actively in hardcore programming languages.  Hive and BusinessObjects fall into a middle category; neither tool is accessible to most business users without some significant commitment and training.

Some of the findings will satisfy Captain Obvious:

  • R and ggplot
  • SAP HANA and BusinessObjects
  • C and C++
  • JavaScript and D3
  • PostgreSQL and Amazon Redshift
  • Hive, Pig, Hortonworks and Cloudera
  • Python, Scala and Java

Others are surprising:

  • Tableau and SAS
  • SPSS and C#
  • Hive and Weka

It’s also interesting to note that Amazon EMR and Amazon Redshift usage fall into different clusters, and that EMR clusters separately from Cloudera and Hortonworks.

Since the authors changed clustering methods from 2014 to 2015, it’s difficult to identify movement in the respondent population.  One clear change is reflected in the separate cluster for R, which aligns more closely with the business user profile in the 2015 clustering.  In the 2014 clustering, R clustered together with Python and Weka.  This could easily be an artifact of the different clustering methods used — which the authors can rule out by clustering respondents to the 2014 survey using the 2015 methods.

Instead, the authors engage in silly speculation about R usage, citing tiny changes in tiny correlation coefficients.  (They don’t show the p-values for the correlations, but I suspect we can’t reject the hypothesis that they are all zero; so the change from year to year is also zero.)  Revolution Analytics’ acquisition by Microsoft has exactly zero impact on R users’ choice of operating system; and Teradata’s support for R in 2014 (which is limited to its Aster boxes) can’t have had a material impact on data scientists’ choice of tools.

It’s also telling that the most commonly used tools fall into a single cluster with the least commonly used tools.  Folks who dabble with survey segmentation are often surprised to find that there is one big segment that is kind of a catchall for features that do not differentiate respondents.   The way to deal with that is to remove the most and least cited responses from the list of active variables, since these do not differentiate respondents; spinning an interpretation of this “catchall” cluster is rubbish.

Software for High Performance Advanced Analytics

Strata+Hadoop World week is a good opportunity to update the list of platforms for high-performance advanced analytics.  Vendors are hustling this week to announce their latest enhancements; I’ll post updates as needed.

First some definition.  The scope of this analysis includes software with the following properties:

  • Support for supervised and unsupervised machine learning
  • Support for distributed processing
  • Open platform or multi-vendor platform support
  • Availability of commercial support

There are three main “architectures” for high-performance advanced analytics available today:

  • Integration with an MPP database through table functions
  • Push-down integration with Hadoop
  • Native distributed computing, freestanding or co-located with Hadoop

I’ve written previously about the importance of distributed computing for high-performance predictive analytics, why it’s difficult to deliver and potentially disruptive to the analytics ecosystem.

This analysis excludes software that runs exclusively in a single vendor’s data platform (such as Netezza Analytics, Oracle Advanced Analytics or Teradata Aster‘s built-in analytic functions.)  While each of these vendors seeks to use advanced analytics to differentiate its data warehousing products, most enterprises are unwilling to invest in an analytics architecture that promotes vendor lock-in.  In my opinion, IBM, Oracle and Teradata should consider open sourcing their machine learning libraries, since they’re effectively giving them away anyway.

This analysis also excludes open source libraries “in the wild” (such as Vowpal Wabbit) that lack significant commercial support.

Open Source Software

H2O 

Distributor: H2O.ai (formerly 0xdata)

H20 is an open source distributed in-memory computing platform designed for deployment in Hadoop or free-standing clusters. Current functionality (Release 2.8.4.4) includes Cox Proportional Hazards modeling, Deep Learning, generalized linear models, gradient boosted classification and regression, k-Means clustering, Naive Bayes classifier, principal components analysis, and Random Forests. The software also includes tooling for data transformation, model assessment and scoring.   Users interact with the software through a web interface, a REST API or the h2o package in R.  H2O runs on Spark through the Sparkling Water interface, which includes a new Python API.

H2O.ai provides commercial support for the open source software.  There is a rapidly growing user community for H2O, and H2O.ai cites public reference customers such as Cisco, eBay, Paypal and Nielsen.

MADLib 

Distributor: Pivotal Software

MADLib is an open source machine learning library with a SQL interface that runs in Pivotal Greenplum Database 4.2 or PostgreSQL 9.2+ (as of Release 1.7).  While primarily a captive project of Pivotal Software — most of the top contributors are Pivotal or EMC employees — the support for PostgreSQL qualifies it for this list.    MADLib includes rich analytic functionality, including ten different regression methods, linear systems, matrix factorization, tree-based methods, association rules, clustering, topic modeling, text analysis, time series analysis and dimensionality reduction techniques.

Mahout

Distributor: Apache Software Foundation

Mahout is an eclectic machine learning project incepted in 2011 and currently included in major Hadoop distributions, though it seems to be something of an embarrassment to the community.  The development cadence on Mahout is very slow, as key contributors appear to have abandoned the project three years ago.   Currently (Release 0.9), the project includes twenty algorithms; five of these (including logistic regression and multilayer perceptron) run on a single node only, while the rest run through MapReduce.  To its credit, the Mahout team has cleaned up the software, deprecating unsupported functionality and mandating that all future development will run in Spark.  For Release 1.0, the team has announced plans to deliver several existing algorithms in Spark and H2O, and also to deliver something for Flink (for what that’s worth).  Several commercial vendors, including Predixion Software and RapidMiner leverage Mahout tooling in the back end for their analytic packages, though most are scrambling to rebuild on Spark.

Spark

Distributor: Apache Software Foundation

Spark is currently the platform of choice for open source high-performance advanced analytics.  Spark is a distributed in-memory computing framework with libraries for SQL, machine learning, graph analytics and streaming analytics; currently (Release 1.2) it supports Scala, Python and Java APIs, and the project plans to add an R interface in Release 1.3.  Spark runs either as a free-standing cluster, in AWS EC2, on Apache Mesos or in Hadoop under YARN.

The machine learning library (MLLib) currently (1.2) includes basic statistics, techniques for classification and regression (linear models, Naive Bayes, decision trees, ensembles of trees), alternating least squares for collaborative filtering, k-means clustering, singular value decomposition and principal components analysis for dimension reduction, tools for feature extraction and transformation, plus two optimization primitives for developers.  Thanks to large and growing contributor community, Spark MLLib’s functionality is expanding faster than any other open source or commercial software listed in this article.

For more detail about Spark, see my Apache Spark Page.

Commercial Software

Alpine Chorus

Vendor: Alpine Data Labs

Alpine targets a business user persona with a visual workflow-oriented interface and push-down integration with analytics that run in Hadoop or relational databases.  Although Alpine claims support for all major Hadoop distributions and several MPP databases, in practice most customers seem to use Alpine with Pivotal Greenplum database.  (Alpine and Greenplum have common roots in the EMC ecosystem).   Usability is the product’s key selling point, and the analytic feature set is relatively modest; however, Chorus’ collaboration and data cataloguing capabilities are unique.  Alpine’s customer list is growing; the list does not include a recent win (together with Pivotal) at a large global retailer.

dbLytix

Vendor: Fuzzy Logix

dbLytix is a library of more than eight hundred functions for advanced analytics; analytics run as database table functions and are currently supported in Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database.  Embedded in SQL, analytics may be invoked from a range of application, including custom web interfaces, Microsoft Excel, popular BI tools, SAS or SPSS.   The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

For those seeking the absolute cutting edge in advanced analytics, Fuzzy’s Tanay Zx Series offers more than five hundred analytic functions designed to run on GPU chips.  Tanay is available either as a software library or as an analytic appliance.

IBM SPSS Analytic Server

Vendor: IBM

Analytic Server serves as a Hadoop back end for IBM SPSS Modeler, a mature analytic workbench targeted to business users (licensed separately).  The product, which runs on Apache Hadoop, Cloudera CDH, Hortonworks HDP and IBM BigInsights, enables push-down MapReduce for a limited number of Modeler nodes.  Analytic Server supports most SPSS Modeler data preparation nodes, scoring for twenty-four different modeling methods, and model-building operations for linear models, neural networks and decision trees.  The cadence of enhancements for this product is very slow; first released in May 2013, IBM has released a single maintenance release since then.

RapidMiner Radoop

Vendor: RapidMiner

(Updated for Release 2.2)

RapidMiner targets a business user persona with a “code-free” user interface and deep selection of analytic features.  Last June, the company acquired Radoop, a three-year-old business partner based in Budapest.  Radoop brings to RapidMiner the ability to push down analytic processing into Hadoop using a mix of MapReduce, Mahout, Hive, Pig and Spark operations.

RapidMiner Radoop 2.2 supports more than fifty operators for data transformation, plus the ability to implement custom HiveQL and Pig scripts.  For machine learning, RapidMiner supports k-means, fuzzy k-means and canopy clustering, PCA, correlation and covariance matrices, Naive Bayes classifier and two Spark MLLib algorithms (logistic regression and decision trees); Radoop also supports Hadoop scoring capabilities for any model created in RapidMiner.

Support for Hadoop distributions is excellent, including Cloudera CDH, Hortonworks HDP, Apache Hadoop, MapR, Amazon EMR and Datastax Enterprise.  As of Release 2.2, Radoop supports Kerberos authentication.

Revolution R Enterprise

Vendor: Revolution Analytics

Revolution R Enterprise bundles a number of components, including Revolution R, an enhanced and commercially supported R distribution, a Windows IDE, integration tools and ScaleR, a suite of distributed algorithms for predictive analytics with an R interface.  A little over a year ago, Revolution released its version 7.0, which enables ScaleR to integrate with Hadoop using push-down MapReduce.   The mix of techniques currently supported in Hadoop includes tools for data transformation, descriptive statistics, linear and logistic regression, generalized linear models, decision trees, ensemble models and k-means clustering.   Revolution Analytics supports ScaleR in Cloudera, Hortonworks and MapR; Teradata Database; and in free-standing clusters running on IBM Platform LSF or Windows Server HPC.  Microsoft recently announced that it will acquire Revolution Analytics; this will provide the company with additional resources to develop and enhance the platform.

SAS High Performance Analytics

Vendor: SAS

SAS High Performance Analytics (HPA) is a distributed in-memory analytics engine that runs in Teradata, Greenplum or Oracle appliances, on commodity hardware or co-located in Hadoop (Apache, Cloudera or Hortonworks).  In Hadoop, HPA can be deployed either in a symmetric configuration (SAS instance on each DataNode) or in an asymmetric configuration (SAS deployed on dedicated “Analysis” nodes within the Hadoop cluster.)  While an asymmetric architecture seems less than ideal (due to the need for data movement and shuffling), it reduces the need to upgrade the hardware on every node and reduces SAS software licensing costs.

Functionally, there are five different bundles, for statistics, data mining, text mining, econometrics and optimization; each of these is separately licensed.  End users leverage the algorithms from SAS Enterprise Miner, which is also separately licensed.  Analytic functionality is rich compared to available high-performance alternatives, but existing SAS users will be surprised to see that many techniques available in SAS/STAT are unavailable in HPA.

SAS first introduced HPA in December, 2011 with great fanfare.  To date the product lacks a single public reference customer; this could mean that SAS’ Marketing organization is asleep at the switch, or it could mean that customer success stories with the product are few and far between.  As always with SAS, cost is an issue with prospective customers; other issues cited by customers who have evaluated the product include HPA’s inability to run existing programs developed in Legacy SAS, and concerns about the proprietary architecture. Interestingly, SAS no longer talks up this product in venues like Strata, pointing prospective customers to SAS In-Memory Statistics for Hadoop (see below) instead.

SAS In-Memory Statistics for Hadoop

Vendor: SAS

SAS In-Memory Statistics for Hadoop (IMSH) is an analytics application that runs on SAS’ “other” distributed in-memory architecture (SAS LASR Server).  Why does SAS have two in-memory architectures?  Good luck getting SAS to explain that in a coherent manner.  The best explanation, so far as I can tell, is a “mud-on-the-wall” approach to new product development.

Functionally, IMSH Release 2.5 supports data prep with SAS DS2 (an object-oriented language), descriptive statistics, classification and regression trees (C4.5), forecasting, general and generalized linear models, logistic regression, a Random Forests lookalike, clustering, association rule mining, text mining and a recommendation system.   Users interact with the product through SAS Studio, a web-based IDE introduced in SAS 9.4.

Overall, IMSH is a better value than HPA.  SAS prices this software based on the number of cores in the servers upon which it is deployed; while I can’t disclose the list price per core, it’s fair to say that any configuration beyond a sandbox will rapidly approach seven figures for the first year fee.

Skytree

Product: Skytree Infinity

Skytree began life as an academic machine learning project (FastLab, at Georgia Tech); the developers shopped the distributed machine learning core to a number of vendors and, finding no buyers, launched as a commercial software vendor in January 2013.  Recently rebranded from Skytree Server to Skytree Infinity, the product now includes modules for data marshaling and preparation that run on Spark.  Distributed algorithms can run as a free-standing cluster or co-located in Hadoop under YARN.  The product has a programming interface; the vendor claims ability to run from R, Weka, C++ and Python.   Neither Skytree’s modest list of algorithms nor its short list of public reference customers has changed in the past two years.

SAS Versus R Part Two

In a previous post, I summarized some myths about SAS and R — arguments offered by proponents of one or the other that deserve to be dismissed.

In this post, I will review some arguments that do make sense — things to consider if you are an aspiring analyst or if you are an executive making decisions about software for your organization.

(1) Every analysis technique available in SAS is available in R — plus many more

It’s fair to say that any analysis you can do in SAS you can also do in R.  The reverse, however, is not true — there are many techniques available in R that are not available in SAS.

As an open source platform, R is open to innovation, and offers few barriers to entry for new techniques.  An analyst who develops a new technique can quickly publish it in R, even if the technique has only niche appeal; it’s a great example of the long tail effect in action.

Commercial software providers like SAS, on the other hand, use product management calculus to balance the benefits of introducing a new technique against the cost to develop and support it.  The marginal revenue from adding a feature is hard to measure, while the costs are known, so conservative companies like SAS tend to lag well behind the cutting edge.  SAS also tends to bundle popular new capabilities into new products rather than enhancing the existing product, forcing customers to add more SAS software licenses to the stack if they want the capability.

Random Forests is a case in point.  Breiman and Cutler published their seminal article describing the technique in October, 2001; the following year, they published the randomForest package in R.  In December, 2012, SAS released an “experimental” version of what it calls “HP Forests” in SAS High Performance Analytics, and in 2013 included the PROC in SAS Enterprise Miner 13.1.

Ten years is a long time to wait.

(2) SAS is easier to learn and use than R

R mavens dispute this point, but they are wrong.  R is significantly harder to learn and use than SAS, at several levels, and for a number of reasons.

Bob Muenchen recently published an excellent catalogue of Things That Make R Hard to Learn.  Bob should know; he makes a living helping users cross the chasm from SAS to R.  Here is a brief except, but you should definitely read the whole thing:

R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.

There are two main reasons SAS is easier to user than R.  First, as a commercial product every element of SAS is governed by a common design that unifies the SAS programming language, user interfaces and documentation.  As a result, SAS programming syntax and documentation is generally consistent across procedures; statements generally mean the same thing whether you are working in PROC ACCESS or PROC XML.

Developers who contribute R packages, on the other hand, operate independently and without a comparable design.  While each individual package may be well or poorly written, there is no governing principle that ensures packages are consistent with one another.  While R aficionados celebrate its diversity, to the outsider it just seems messy.

SAS’ strong development tools add significant value for the user.  SAS Enterprise Guide, for example, included with Analytics Pro at no extra charge, offers a workflow interface and the ability to generate SAS or SQL code behind the scenes.  There is no equivalent code-generating tool available for R today.

(3) SAS offers an “enterprise-grade” solution

Individual analysts surveyed by Rexer last year said that cost and ease of use are the most important factors they consider when choosing analytic tools.  For enterprises, however, the selection criteria are more complex.

Technical support is a key concern for most organizations; some go so far as to adopt blanket policies banning the use of unsupported software.  SAS invests heavily in its Support organization; unlike many large software vendors, Technical Support is a career track at SAS, with low employee turnover.  With locations located in multiple countries, SAS is able to support customers globally and at enterprise scale.

When SAS licenses its software, it warrants that the software is materially free of defects.  This warranty is backed up by a contractual commitment to fix defects that surface.  Hence, SAS offers the customer a “single throat to choke” — customers know when they license SAS that a single organization is responsible for development, distribution, implementation and support of the software, and accountability is clear.

Open source R, of course, has no organic technical support.  Organizations such as Revolution Analytics offer technical support either for open source R or Revolution’s own commercial R distribution.  Third party service providers like Revolution can be highly knowledgeable and effective; however, if there is a software defect in an R package, the support provider can only notify the developer and request resolution.

(4) SAS costs more than R

“Duh!” you say; “R is free!”  True enough.  R is open source software, distributed with a free license to use; for a single analyst, the incremental TCO to download, install and use R on an existing machine is zero.   This is also true for other key components of the R ecosystem, such as RStudio, the popular development environment.   Low cost of entry is a key driver behind R’s growing popularity.

SAS, on the other hand, charges a subscription fee which consists of a term license to use the software plus technical support and maintenance into a subscription fee.  Entry costs to license the most basic package (SAS Analytics Pro) costs $8,700 (first year fee) at the SAS online store; this package includes Base SAS, SAS/STAT and SAS/Graph.  SAS renewal fees generally run 25-30% of the first year fee.  SAS bundles its analytic features into a number of separate packages, such as SAS/ETS for time series, SAS/OR for optimization and SAS/IML for matrix manipulation; if you require these capabilities, you must pay extra.  SAS also offers access engines for an assortment of data sources, each of which can be licensed individually for $3,000 each.

The version of SAS sold through the online store is for single Windows machines only.  SAS sells its software for servers through its sales force, and pricing is negotiated; “list” price depends on the computing power of the server, measured by cores or sockets.  Server pricing for Analytics Pro starts in the low six figures.

SAS offers a virtualized “University Edition” which is free but not open source.  See here for a review.

Bottom line — for the analyst

Aspiring analysts ask “should I learn SAS or R?”   I’m tempted to answer “why not both?” but that begs the question of which to learn first.

If SAS is the primary tool at your organization or university, learn it and use it.  There are still more jobs available for SAS users than R users (though the gap is narrowing); and even prospective employers who do not currently use SAS treat it as a proxy for analytics know-how.

If your organization or university supports both SAS and R, look for trends in usage.  Is the R community growing rapidly?  Are the “best and brightest” people using SAS or R?  Is your management putting out subtle (or not-so-subtle) messages promoting use of one or the other?   Take the pulse of your organization and make your choice.

If your organization or university does not already license SAS, if you aspire to free-lance consulting or you are simply unemployed, learn R.  Doing so costs you nothing, and there are plenty of low-cost options for training and self-directed learning.

Bottom line — for the enterprise

If you are making decisions about software for an analytics team or an entire organization, the calculus is more complex.

R has more analytic techniques than SAS, but what techniques to you actually need?  Take note of your team’s actual current and future analytic needs, and act accordingly.  If you are using SAS today, the chances are very good that a handful of PROCs account for 95% of current usage; the same is true for R.

SAS is easier to learn than R, but if all or most of your analysts already know R, what difference does it make?  Many younger analysts entering the workforce already know how to use R, and it is a waste of time and money to force them to learn SAS.  On the other hand, if your analysts rely on SAS, you can expect to invest considerable time and money for retraining.

Do you need an enterprise solution?  If you organization spans multiple countries, if you support more than twenty users, the chances are that the answer is “yes”.  For a larger organization, it’s hard to beat SAS’ ability to mobilize support, training and consulting resources around the world.  This is likely to change in the future, as organizations like Revolution Analytics build scale and credibility.

SAS costs more than R, but R is not free.  If you are concerned about SAS costs, carefully evaluate your spending and take note of the value offered by SAS.  Keep in mind that software licensing costs are only one component of Total Cost of Ownership (TCO); third-party support for R is not free, and neither is training and conversion.  Do the math.

In general, SAS works well for organizations that are in the middle of Tom Davenport’s maturity cycle pictured above.   These organizations have the basic data infrastructure and business cases for analytics, combined with a need for rapid scale and consistency across locations.  As organizations mature, they become less dependent on a single vendor for analytics and more willing to develop a “best-in-breed” approach; they are more interested in innovation and “cutting-edge” techniques, and the analysts they hire have the will and skill to learn R.  These organizations are adopting R at an increasing rate.

Adoption of R is most pervasive among analytic service providers, such as consultants, system integrators and marketing service providers.  These organizations are sensitive to software costs and tend to hire most highly skilled analysts, for whom R’s learning curve is not a serious issue.  Costs aside, SAS restrictions on use — designed to prevent cannibalization– are highly problematic for service providers.

SAS Versus R (Part 1)

Which is better for analytics, SAS or R?  One frequently sees discussions on this topic in social media; for examples, see here, here, here, here, here and here.   Like many debates in social media, the degree of conviction is often inverse to the quantity of information, and these discussions often produce more heat than light.

The question is serious.  Many organizations with a large investment in SAS are actively considering whether to adopt R, either to supplement SAS or to replace it altogether.  The trend is especially marked in the analytic services industry, which is particularly sensitive to SAS licensing costs and restrictive conditions.

In this post, I will recap some common myths about SAS and R.  In a follow-up post,  I will summarize the pros and cons of each as an analytics platform.

Myths About SAS and R

Advocates for SAS and R often support their positions with beliefs that are little more than urban legends; as such, they are not good reasons to choose SAS over R or vice-versa.   Let’s review six of these myths.

(1) Regulatory agencies require applicants to use SAS.

This claim is often cited in the context of submissions to the Food and Drug Administration (FDA), apparently by those who have never read the FDA’s regulations governing submissions.  The FDA accepts submissions in a range of formats including SAS Transport Files (which an R user can create using the StatTransfer utility.)   Nowhere in its regulations does the FDA mandate what software should be used to produce the analysis; like most government agencies, the FDA is legally required to support standards that do not favor single vendors.

Pharmaceutical firms tend to rely heavily on SAS because they trust the software, and not due to any FDA mandate.  Among its users, SAS has a deservedly strong reputation for quality; it is a mature product and its statistical techniques are mature, well-tested and completely documented.  In short, the software works, which means there is very little incentive for an established user to experiment with something else, just to save on licensing fees.

That trust in SAS isn’t a permanent state of affairs.  R is gradually making inroads in the life sciences community; it has already largely displaced SAS in the academic world.  Like many other regulatory bodies, the FDA itself uses open source R together with SAS.

(2) R is better than SAS because it is object oriented.

This belief is wrong on two counts: (1) it assumes that object-oriented languages are best for all use cases; and (2) it further assumes that SAS offers no object-oriented capability.

Object-oriented languages are more efficient and easier to use for many analysis tasks.  In real-world analytics, however, we often work with messy and complex data; a cursor-based language like the SAS DATA Step offers the user a great deal of flexibility, which is why it is so widely used.  Anyone who has ever attempted to translate SAS “first and last” processing into an object-oriented language understands this point.  (Yes, it can be done; but it requires a high-level of expertise in the OOL to do it).

In Release 9.3, SAS introduced DS2, an object-oriented language with a defined migration path from SAS DATA Step programming. Hence, for those tasks where object-oriented programming is desirable, DS2 meets this need for the SAS user.  (DS2 is included with Base SAS).

(3) You never know what’s inside open source software like R.

Since R is an open programming environment, anyone can develop a package and contribute it to the project.  Commercial software vendors like to plant FUD about open source software by suggesting that contributors may be amateurs or worse — in contrast to the “professional” engineering of commercial software.

One of the key virtues of open source software is that you do know what’s inside it because — unlike commercial software — you can inspect the source code.  With commercial software, you must have faith in the vendor’s integrity, technical support and willingness to stand by its warranty.  For open source software, there is no warranty nor is one required; the code speaks for itself.

When a contributor publishes an enhancement to R, a large community of users evaluates and tests the new feature.  This “crowdsourced” testing quickly flags and logs issues with software syntax and semantics, and logged issues are available for anyone to see.

Commercial software vendors like SAS have professional testing and QA departments, but since testing is expensive there is considerable pressure to minimize the expense.   Under the pressure of Marketing and Sales deadlines, systematic testing is often the first task to be cut.  Bismarck once said that nobody should witness how laws or sausages are made; the same is true for commercial software.

SAS does not disclose the headcount it commits to software testing and QA, but given the size of the R user base, it’s fair to say that the number of people who test and evaluate each R release is far greater than the number of people who evaluate each SAS release.

(4) R is better than SAS because it has thousands of packages.

This is like arguing that Wal-Mart is a better store than Brooks Brothers because it carries more items.  Wal-Mart’s breadth of product makes it a great shopping destination for many shoppers, but a Brooks Brothers shopper appreciates the store’s focus on a certain look and personalized service.

By analogy, R’s cornucopia of functionality is both a feature and a bug.  Yes, there is a package in R to support every conceivable analytic need; in many cases, there is more than one package.  As of this writing, there are 486 packages that support linear regression, which is great unless you only need one and don’t want to sift through 486.

Of course, actual R users don’t check every package to find what they need; they settle on a few trusted packages based on actual experience, word-of-mouth, books, periodicals or other sources of information.  In practice, relatively few R packages are actually used; the graph below shows package downloads from RStudio’s popular CRAN mirror in September 2014.

CRAN Downloads

(For the record, the ten most downloaded packages from RStudio’s CRAN mirror in September 2014 were Rcpp, plyr, ggplot2, stringr, digest, reshape2, RColorBrewer, labeling, colorspace and scales.)

For actual users, the relevant measure isn’t the total number of features supported in SAS and R; it’s how those features align with user needs.

N.B. — Some readers may quibble with my use of statistics from a single CRAN mirror as representative of the R community at large.  It’s a fair point — there are at least 105 public CRAN mirror sites worldwide — but given RStudio’s strong market presence it’s a reasonable proxy.

(5) Switching from SAS to R is expensive because you have to rewrite all of your code.

It’s true that when switching from SAS to R you have to rewrite programs that you want to keep; there is no engine that will translate SAS code to R code. However, SAS users tend to overestimate the effort and cost to accomplish this task.

Analytic teams that have used SAS for some years typically accumulate a large stock of programs and data; much of this accumulation, however, is junk that will never be re-used.    Keep in mind that analytic users don’t work the same way as software developers in IT or a software engineering organization.  Production developers tend to work in a collaborative environment that ensures consistent, reliable and stable results.  Analytic users, on the other hand, tend to work individually on ad hoc analysis projects; they are often inconsistently trained in software best practices.

When SAS users are pressed to evaluate a library of existing programs and identify the “keepers”, they rarely identify more than 10-20% of the existing library.  Hence, the actual effort and expense of program conversion should not be a barrier for most organizations if there is a compelling business case to switch.

It’s also worth noting that sticking with SAS does not free the organization from the cost of code migration, as SAS customers discovered when SAS 9 was released.

The real cost of switching from SAS to R is measured in human capital — in the costs of retraining skilled professionals.  For many organizations, this is a deal-breaker at present; but as more R-savvy analysts enter the workforce, the costs of switching will decline.

(6) R is a good choice when working with Big Data.

When working with Big Data, neither “legacy” SAS nor open source R is a good choice, for different reasons.

Open source R runs in memory on a single machine; it can work with data up to available memory, then fails.  It is possible to run R in a Hadoop cluster or as table functions inside MPP databases.  However, since R runs independently on each node, this is useful only for embarrassingly parallel tasks; for most advanced analytics tasks, you will need to invoke a distributed analytics engine.   There are a number of distributed engines you can invoke from R, including H2O, ScaleR and Skytree, but at this point R is simply a client and the actual work is done by the distributed engine.

“Legacy” SAS uses file-swapping to handle out-of-memory problems, but at great cost to performance; when a data set is too large to load into memory, “legacy” SAS slows to a crawl.  Through SAS/ACCESS, SAS supports the ability to pass through SQL operations to MPP databases and HiveQL, MapReduce and Pig to Hadoop; however, as is the case with R, “legacy” SAS simply functions as a client and the work is done in the database or Hadoop.  The user can accomplish the same tasks using any SQL or Hadoop interface.

To its credit, SAS also offers distributed in-memory software that runs inside Hadoop (the SAS High-Performance Analytics suite and SAS In-Memory Statistics for Hadoop).  Of course, these products do not replicate “legacy” SAS; they are entirely new products that support a subset of “legacy” SAS functionality at extra cost.  Some migration may be required, since they run DS2 but not the traditional SAS DATA Step.  (I cite these points not to denigrate the new SAS software, which appears to be well designed and implemented,  but to highlight the discontinuity for SAS users between the “legacy” product and the scalable High Performance products.)

If your organization works with Big Data, your primary focus should be on choosing the right scalable analytics platform, with secondary emphasis on the client or API used to invoke it.

SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures:

(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.

(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.

SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post.  At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server.   Paul was most generous with his time.  We discussed three areas:

(1) Can SAS LASR Server work directly with data in Hadoop?

According to SAS documentation, LASR Server can read data from traditional SAS datasets, relational databases (using SAS/Access Software) or data stored in SAS’ proprietary SASHDAT format.   That suggests SAS users must preprocess Hadoop data before loading it into LASR Server.

Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database. (Of course, this applies to structured data only). Reading from SASHDAT is much faster, however, so users should consider the tradeoff between the time needed to pre-process data into SASHDAT versus the runtime with SAS/ACCESS.

SAS/ACCESS Interface to Hadoop can read all widely used Hadoop data formats, including ORC, Parquet and Tab-Delimited; it can also read user-defined formats.  This builds on SAS’ long-standing ability to work with enterprise data everywhere.

Base SAS supports basic data cleansing and data transformation capability through DATA Step and DS2 processing, and can write SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, this transformation could require extract and movement to an external server.   Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to perform data transformation in place.   Users can also license SAS ETL Server and build a process to convert raw data and store it in SASHDAT.

SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.

(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?

At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.

Paul notes that given a fixed number of task tracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workload.  This issue is not unique to SAS, but potentially applies to any software co-located with Hadoop prior to the introduction of YARN.

Under Hadoop 1.0, Hadoop workload management was tightly married to MapReduce.  Applications operating independently from MapReduce (like SAS) were essentially ungoverned.  The introduction of YARN late last year eliminates this issue because it supports unified workload management for MapReduce and non-MapReduce applications.

(3) Can SAS LASR Server run on standard commodity hardware?

SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.

While the definition of a “standard” Hadoop DataNode node server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128G RAM; sources at Cloudera confirm this is a typical configuration.   A recently published paper from HP and Hortonworks positions the reference spec for RAM at 96 GB RAM for memory-intensive applications.

In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256G RAM.

It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations seeking to use Spark or Impala should specify more memory.   While some prospective customers may balk at the task of upgrading memory in every DataNode of a large cluster, the cost of memory is coming down, so this should not be an issue in the long run.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefit of parallel computing is speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADLib should run in any SQL environment, but Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics tasks are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Python for Analytics

A reader complains that I did not include Python in a survey of Machine Learning in Hadoop.  It’s a fair point.  There was a lively debate last year between R and Python advocates, variously described as a war or a boxing match.  Matt Asay argued that Python is displacing R; Sharon Machlis and David Smith countered.  In this post I review the available evidence about the incidence of Python use for analytics; in a separate post, I will survey Python’s capabilities.

Python is a general purpose programming language whose syntax enables programmers to write efficient and concise code.  The Python Software Foundation manages an open source reference implementation written in C and nicknamed CPython.  Alternative implementations include Jython, written in Java; IronPython, for .net; and PyPy, a just-in-time compiler.

There is no dispute that Python is a popular language for general-purpose programming; according to the Transparent Language Popularity Index (TLPI), Python currently ranks seventh in popularity behind  C, Java, Objective C, C++, Basic and PHP.  By the same measure, exclusively analytic languages rank lower:

  • #14. R
  • #19. MATLAB
  • #26. Scala
  • #31. SAS

Measures like TLPI or the Tiobe Community Programming Index tell us something about the overall popularity of a language, but relatively little about its popularity for analytics. Many Python users aren’t at all engaged in analytics, and many analysts don’t use Python.

Python performs very well in Bob Muenchen’s analysis of analytic job postings (which he has perfected into a science).  Muenchen’s analysis shows that Python ranks third in analytic job postings, behind Java and SAS.  Python and R were at rough parity in job postings until early January 2013; since then, Python has outpaced R.

Surveys of analytic users show a mixed picture, reflecting differences in sampling and question construction.  In the 2013 Rexer survey, 64% of all respondents report writing their own code; the top reported choice is SQL (43%), followed by Java (26%) and Python (24%).  (These results are difficult to square with the overall finding that 70% of the respondents use R, which requires the user to write code.)   Rexer’s sample includes a mix of Power Analysts and Business Analysts, but relatively few Data Scientists.  (See this post for a definition of Analytic User Personas).

KDnuggets conducted its annual software poll in 2013; Python ranked fifth behind RapidMiner, R, Excel and Weka/Pentaho.   In a separate KDnuggets poll explicitly focused on programming languages for analytics, data mining and data science, Python ranked second behind R.  The KDnuggets online poll is a convenience sample (which is vulnerable to response bias), but there is no reason to believe that either R or Python users are over-represented relative to one another.  The KDnuggets community consists largely of Data Scientists and Power Analysts.

A follow-up poll by KDnuggets expressly about switching between Python and R found that more people use R than Python, and users switching from other tools are more likely to choose R over Python; however, more users are switching from R to Python than from Python to R.  The graphic below illustrates these relationships.

Switching Between Python and R

O’Reilly Media’s survey of data scientists at the 2012 and 2013 Strata conferences shows Python ranked third, behind SQL and R.  (The survey does not break out responses from 2012 and 2013).  More interesting is O’Reilly’s analysis of how reported usage of each tool correlates with all of the others; the graph shown below depicts all of the positive correlations significant at p=.05.

Strata Tool Correlation

The most striking thing in this graph is the separation between open source tools at the top of the graph and commercial tools at the bottom; respondents tend to use one or the other, but not both.  The dense network among open source tools indicates that those who use any open source tool tend to use many others.  (Weka’s isolation from other tools in the graph indicates either that (a) Weka is a really awesome tool or (b) Weka users have a unique perspective on life. Or both.)

Among respondents to O’Reilly’s survey, Python and R use are correlated, and so are Java and R use; but Python and Java use are not correlated.  Python and R use both correlate with Apache Hadoop and graph engines; Python also correlates with other components of the Hadoop ecosystem, such as Hive, Mahout and Hbase.

To summarize: Python usage is firmly embedded in the open source analytics ecosystem; however, usage is largely concentrated among Data Scientists, with lower penetration among Power Analysts (for whom R and SAS remain the preferred languages).  The KDnuggets data suggests that new entrants to analytic programming are more likely to choose R over Python, but the rate of switching from R to Python suggests that Python addresses needs not currently met with R.

Arguments by Python advocates that Python will outpace R because it is easier to use strike me as silly.  R is not difficult to learn for motivated users.  Unmotivated users aren’t going to choose Python over R; they will choose a business analytics tool like Alpine, Alteryx or Rapid Miner and skip coding entirely.  Analysts who want to code will choose a language for its functionality and not the elegance of its syntax.