Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not scale (such as Dell Statistica.)  Nor does Big Data capability appear to impact the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS alone does not scale without Netezza or BigInsights; SAS only scales if you add one of its distributed in-memory back ends.  These products aren’t listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

2015: Predictions for Big Analytics

First, a review of last year’s predictions:

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

At the New York Strata/Hadoop World conference in October, if you took a drink each time a speaker said “Spark”, you would struggle to make it past noon.  At my lunch table, every single person said his company is currently evaluating Spark.  There are few alternatives to Spark for advanced analytics in Hadoop, and the platform has arrived.

(2) “Co-location” will be the latest buzzword.

Few people use the word “co-location”, but thanks to YARN, vendors like SAS and Skytree are now able to honestly position their products as running “inside” Hadoop.  YARN has changed the landscape for analytics in Hadoop, so that products that interface through MapReduce are obsolete.

(3) Graph engines will be hot.

Graph engines did not take off in 2014.  Development on Apache Giraph has flatlined, and open source GraphLab is quiet as well. Apache Spark’s GraphX is the only graph engine for Hadoop under active development; the Spark team recently promoted GraphX from Alpha to production.  However, with just 10 out of 132 contributors working on GraphX in Release 1.2, the graph engine is relatively quiet compared to the SQL, Machine Learning and Streaming modules.

(4) R approaches parity with SAS in the commercial job market.

As of early 2014, when Bob Muenchin last updated his job market statistics, SAS led R in job postings, but R was closing the gap rapidly.

Linda Burtch of Burtch Works is the nation’s leading executive recruiter for quants and data scientists.  I asked Linda what analytic languages hiring managers seek when they hire quants.  “My clients are still more frequently asking for SAS, although many more are now asking for either SAS or R,” she says.   “I also recommend to my clients who ask specifically for SAS skills to be open to those using R, and many will agree after the suggestion. ”

 (5) SAP emerges as the company most likely to buy SAS.

After much hype about the partnership in late 2013, SAS and SAP issued not a single press release in 2014.  The dollar’s strength against the Euro makes it less likely that SAP will buy SAS.

(6) Competition heats up for “easy to use” predictive analytics.

Software companies target the “easy to use” analytics market because it’s larger than the expert market and because expert analysts rarely switch.  Alpine, Alteryx, and Rapid Miner all gained market presence in 2014; Dell’s acquisition of Statsoft gives that company the deep pockets they need for a makeover.  In easy to use cloud analytics, StatWing has added functionality, and IBM Watson Analytics emerged from beta.

Four out of six ain’t bad.  Now looking ahead:

(1) Apache Spark usage will explode.

While interest in Spark took off in 2014, relatively few people actually use the platform, which appeals primarily to hard-core data scientists.  That will change in 2015, for several reasons:

  • The R interface planned for release in Q1 opens the platform to a large and engaged community of users
  • Alteryx, Alpine and other easy to use analytics tools currently support or plan to support Spark RDDs as a data source
  • Databricks Cloud offers an easy way to spin up a Spark cluster

As a result of these and other innovations, there will be many more Spark users in twelve months than there are today.

(2) Analytics in the cloud will take off.

Yes, I know — some companies are reluctant to put their “sensitive” data in the cloud.  And yet, all of the top ten data breaches in 2014 defeated an on-premises security system.  Organizations are waking up to the fact that management practices are the critical factor in data security — not the physical location of the data.

Cloud is eating the analytics world for three big reasons:

  • Analytic workloads tend to be lumpy and difficult to predict
  • Analytic projects often need to get up and running quickly
  • Analytic service providers operate in a variable cost world, with limited capital for infrastructure

Analytic software options available in the Amazon Marketplace are increasing rapidly; current options include Revolution R, BigML and YHat, among others.  For the business user, StatWing and IBM Watson Analytics provide compelling independent cloud-based platforms.

Even SAS seeks to jump on the Cloud bandwagon, touting its support for Amazon Web Services.  Cloud devotees may be disappointed, however, to discover that SAS does not offer elastic pricing for AWS,  lacks a native access engine for RedShift, and does not support its Hadoop interface with EMR.

(3) Python will continue to gain on R as the preferred open source analytics platform.

The Python versus R debate is at least as contentious as the SAS versus R debate, and equally tiresome.  As a general-purpose scripting language, Python’s total user base is likely larger than R’s user base.  For analytics, however, the evidence suggests that R still leads Python, but that Python is catching up.  According to a recent poll by KDNuggets, more people switch from R to Python than the other way ’round.

Both languages have their virtues. The sheer volume of analytic features in R is much greater than Python, though in certain areas of data science (such as Deep Learning) Python appears to have the edge.  Devotees of each language claim that it is easier to use than the other, but the two languages are at rough parity by objective measures.

Python has two key advantages over R.  As a general-purpose language, it is a better tool for application development; hence, for embedded analytic applications (such as recommendation engines, decision engines and online scoring), Python gets the nod over R.  Second, Python’s open source license is less restrictive than the R license, which makes it a better choice for commercial use.  There are provisions in the R license that scare the pants off some company lawyers, rightly or wrongly.

(4) H2O will continue to win respect and customers in the Big Analytics market.

If you’re interested in scalable analytics but haven’t checked out H2O, you should.  H2O is a rapidly growing true open source project for distributed analytics; it runs in clusters, in Hadoop and in Amazon Cloud; offers an excellent R interface together with Java and Scala APIs; and is accessible from Tableau.  H2O supports a rich and growing machine learning library that includes Deep Learning and the only available distributed Gradient Boosting algorithm on the market today.

While the software is freely available, H2O offers support and services for an attractive price.  The company currently claims more than two thousand users, including reference customers Cisco, eBay, Nielsen and Paypal.

(5) SAS customers will continue to seek alternatives.

SAS once had an almost religious loyalty from its customers.  This is no longer the case; in a recent report published by Gartner, surveyed executives reported they are more likely to discontinue use of SAS than any other business intelligence software.  While respondents rated SAS above average on sales experience and average on product quality, SAS fared poorly in measures of usability and ease of integration.  While the Gartner survey does not address pricing, it’s fair to say that no vendor can command premium prices without an outstanding product.

While few enterprises plan to pull the plug on SAS entirely, many are limiting growth of the SAS footprint and actively developing alternatives.  This is especially marked in the analytic services industry, which tends to attract people with the skills to use Python or R, and where cost control is important.  Even among big banks and pharma companies, though, SAS user headcount is declining.

Distributed Analytics: A Primer

Can we leverage distributed computing for machine learning and predictive analytics? The question keeps surfacing in different contexts, so I thought I’d take a few minutes to write an overview of the topic.

The question is important for four reasons:

  • Source data for analytics frequently resides in distributed data platforms, such as MPP appliances or Hadoop;
  • In many cases, the volume of data needed for analysis is too large to fit into memory on a single machine;
  • Growing computational volume and complexity requires more throughput than we can achieve with single-threaded processing;
  • Vendors make misleading claims about distributed analytics in the platforms they promote.

First, a quick definition of terms.  We use the term parallel computing to mean the general practice of dividing a task into smaller units and performing them in parallel; multi-threaded processing means the ability of a software program to run multiple threads (where resources are available); and distributed computing means the ability to spread processing across multiple physical or virtual machines.

The principal benefit of parallel computing is speed and scalability; if it takes a worker one hour to make one hundred widgets, one hundred workers can make ten thousand widgets in an hour (ceteris paribus, as economists like to say).  Multi-threaded processing is better than single-threaded processing, but shared memory and machine architecture impose a constraint on potential speedup and scalability.  In principle, distributed computing can scale out without limit.

The ability to parallelize a task is inherent in the definition of the task itself.  Some tasks are easy to parallelize, because computations performed by each worker are independent of all other workers, and the desired result set is a simple combination of the results from each worker; we call these tasks embarrassingly parallel.   A SQL Select query is embarrassingly parallel; so is model scoring; so are many of the tasks in a text mining process, such as word filtering and stemming.

A second class of tasks requires a little more effort to parallelize.  For these tasks, computations performed by each worker are independent of all other workers, and the desired result set is a linear combination of the results from each worker.  For example, we can parallelize computation of the mean of a distributed database by computing the mean and row count independently for each worker, then compute the grand mean as the weighted mean of the worker means.  We call these tasks linear parallel.

There is a third class of tasks, which is harder to parallelize because the data must be organized in a meaningful way.  We call a task data parallel if computations performed by each worker are independent of all other workers so long as each worker has a “meaningful” chunk of the data.  For example, suppose that we want to build independent time series forecasts for each of three hundred retail stores, and our model includes no cross-effects among stores; if we can organize the data so that each worker has all of the data for one and only one store, the problem will be embarrassingly parallel and we can distribute computing to as many as three hundred workers.

While data parallel problems may seem to be a natural application for processing inside an MPP database or Hadoop, there are two constraints to consider.  For a task to be data parallel, the data must be organized in chunks that align with the business problem.  Data stored in distributed databases rarely meets this requirement, so the data must be shuffled and reorganized prior to analytic processing, a process that adds latency.  The second constraint is that the optimal number of workers depends on the problem; in the retail forecasting problem cited above, the optimal number of workers is three hundred.  This rarely aligns with the number of nodes in a distributed database or Hadoop cluster.

There is no generally agreed label for tasks that are the opposite of embarrassingly parallel; for convenience, I use the term orthogonal to describe a task that cannot be parallelized at all.  In analytics, case-based reasoning is the best example of this, as the method works by examining individual cases in a sequence.  Most machine learning and predictive analytics algorithms fall into a middle ground of complex parallelism; it is possible to divide the data into “chunks” for processing by distributed workers, but workers must communicate with one another, multiple iterations may be required and the desired result is a complex combination of results from individual workers.

Software for complex machine learning tasks must be expressly designed and coded to support distributed processing.  While it is physically possible to install open source R or Python in a distributed environment (such as Hadoop), machine learning packages for these languages run locally on each node in the cluster.  For example, if you install open source R on each node in a twenty-four node Hadoop cluster and try to run logistic regression you will end up with twenty-four logistic regression models developed separately for each node.  You may be able to use those results in some way, but you will have to program the combination yourself.

Legacy commercial tools for advanced analytics provide only limited support for parallel and distributed processing.  SAS has more than 300 procedures in its legacy Base and STAT software packages; only a handful of these support multi-threaded (SMP) operations on a single machine;  nine PROCs can support distributed processing (but only if the customer licenses an additional product, SAS High-Performance Statistics).  IBM SPSS Modeler Server supports multi-threaded processing but not distributed processing; the same is true for Statistica.

The table below shows currently available distributed platforms for predictive analytics; the table is complete as of this writing (to the best of my knowledge).

Distributed Analytics Software, May 2014

Several observations about the contents of this table:

(1) There is currently no software for distributed analytics that runs on all distributed platforms.

(2) SAS can deploy its proprietary framework on a number of different platforms, but it is co-located and does not run inside MPP databases.  Although SAS claims to support HPA in Hadoop, it seems to have some difficulty executing on this claim, and is unable to describe even generic customer success stories.

(3) Some products, such as Netezza and Oracle, aren’t portable at all.

(4) In theory, MADLib should run in any SQL environment, but Pivotal database appears to be the primary platform.

To summarize key points:

— The ability to parallelize a task is inherent in the definition of the task itself.

— Most “learning” tasks in advanced analytics tasks are not embarrassingly parallel.

— Running a piece of software on a distributed platform is not the same as running it in distributed mode.  Unless the software is expressly written to support distributed processing, it will run locally, and the user will have to figure out how to combine the results from distributed workers.

Vendors who claim that their distributed data platform can perform advanced analytics with open source R or Python packages without extra programming are confusing predictive model “learning” with simpler tasks, such as scoring or SQL queries.

Analytic User Personas

Analytic users are not all the same; in most organizations, there are a number of different user “personalities”, or personas, with distinct needs.  If you develop an analytics architecture for your organization or develop analytic software to sell to others, it is important to understand these personas.  In this essay, I profile four personas:

  • Power Analyst
  • Data Scientist
  • Business Analyst
  • Analytic Consumer

Your organization may or may not include all four personas; for example, if your organization consistently outsources predictive model building, you may have no Power Analysts or Data Scientists.   Moreover, if your organization is large enough, it may be valuable for you to recognize distinct subclasses of users within each persona.   In any event, your success depends on how well you understand the diverse needs of prospective users.

The Power Analyst

The Power Analyst sees advanced analytics as a full-time job, and holds positions such as Statistician or Actuary in organizations with significant investments in analytics, or as consultants in organizations that provide analytic services.  The Power Analyst understands conventional statistics and machine learning, and has considerable working experience in applied analytics.

Power Analysts prefer to work in an analytic programming language such as Legacy SAS or R.  They have enough training and working experience with the language to be productive, and consider analytic programming languages to be more flexible and powerful than analytic software packages with GUI interfaces.  They do not need analytics to be easy, and may look down on those who do.

The rightanalytic method is extremely important to Power Analysts; they tend to be more concerned with using the correctmethodology than with actual differences in business results achieved with different methods.  This means, for example, if a particular analytic problem calls for a specific method or class of methods, such as Survival Analysis, the Power Analyst will go to great lengths to use this method even if the improvement to predictive accuracy is very small.

In practice, since working Power Analysts tend to work with highly diverse problems and cannot always predict the nature of the problems they will need to address, they place a premium on being able to use a wide variety of analytic methods and techniques.  The need for a particular method or technique may be rare, but Power Analysts want to be able to use it if the need arises.

Since data preparation is critical to successful predictive analytics, Power Analytics need to be able to understand and control the data they work with.  This does not mean that Power Analytics want to manage the data or perform ETL tasks; it means that they need the data management processes to be transparent and responsive.  In organizations where IT does not place a premium on supporting predictive analytics, Power Analysts will take over data management and ETL to meet their own needs, but this is not necessarily the working model they prefer.

The work product of Power Analysts may be a management report of some kind showing the results of an analysis, a predictive model specification to be recoded for production, a predictive model object (such as a PMML document) or an actual executable scoring function written in a programming language such as Java or C.  Power Analysts do not want to be heavily involved in production deployment or routing model scoring, though they may be forced into this role if the organization has not invested in tooling for model score deployment. 

Power Analysts are highly engaged in the specific brand, release and version of analytic software.  In organizations where the analytics team has significant influence, they play a decisive role in selecting analytic software.   They also want control over the technical infrastructure supporting the analytic software, though they tend to be indifferent about specific brands of hardware, databases, storage and so forth.

In many organizations, the Power Analyst provides an “attest” function to validate that analytics are correctly performed; hence, they tend to have disproportionate authority in analytic matters based on their reputation and expertise.

The Data Scientist

As the Google Trends graph below illustrates, the term “Data Scientist” is of recent origin, hardly used at all prior to 2011 but rapidly increasing since then.

Google Trends Data Scientist

The Data Scientist is similar in many respects to the Power Analyst.  Both share a lack of interest in easy to usetooling, and a desire to engage at a granular level with the data.

The principal differences between Data Scientists and Power Analysts relate to background, training and approach.   Power Analysts tend to understand statistical methods, bring a statistical orientation to analytics, and tend to prefer working with higher-level languages with built-in analytic syntax; Data Scientists, on the other hand, tend to come from a machine learning, engineering or computer science background.  Consequently, they tend to prefer working with programming languages such as C, Java or Python and tend to be much better equipped to work with SQL and MapReduce. 

It is no accident that the growing usage of the Data Scientist label correlates with expanded deployment and use of Hadoop.  Data Scientists tend to have working experience with Hadoop, and this may be their preferred working environment.  They are comfortable working with MapReduce or Apache Spark, and will develop their own code on these platforms if there is no available “off-the-shelf” software that meets their needs.

Data Scientistsmachine learning roots influence their methods, techniques and approach, which affect their requirements for analytic tooling.  The machine learning discipline tends to focus less on choosing the rightanalytic method, and places the focus on results of the predictive analytics process, including the predictive power of the model produced by the process.  Hence, they are much more open to various forms of brute forcelearning, and choose methods that may be difficult to defend within the statistical paradigm but demonstrate good results.

Data Scientists tend to have low regard for existing analytic software vendors, especially those like SAS and IBM who cater to business customers by soft-peddling technical details; instead, they tend to prefer open source tooling.  They seek the best technicalsolution, one with sufficient flexibility to support innovation.  Data Scientists tend to engage directly in the process of productionizingtheir analytic findings; Power Analysts, in contrast, tend to prefer an entirely hands-offrole in the process.

Since the Data Scientist role has recently emerged, it may lack the sapiential authority enjoyed by the Power Analyst in conservative organizations.  In some organizations, “Data Science” is perceived negatively, and

The Business Analyst

The Business Analyst uses analytics within the context of a role in the organization where analytics is important but not the exclusive responsibility.  Business Analysts hold a range of titles, such as Loan Officer, Marketing Analyst or Merchandising Specialist.

Business Analysts are familiar with analytics and may have some training and experience.  Nevertheless, they prefer an easy-to-use interface and software such as SAS Enterprise Guide, SAS Enterprise Miner, SPSS Statistics or similar products.  

While Power Analysts are very concerned with choosing the rightmethod for the problem, Business Analysts tend to prefer a simpler approach.  For example, they may be familiar with regression analysis, but they are unlikely to be interested in all of the various kinds of regression and the details of how regression models are calculated.  They value wizardtooling that guides the selection of methods and techniques within a problem-solving framework.

The Business Analyst may be aware that data is important to the success of analytics, but does not want to deal with it directly.  Instead, the Business Analyst prefers to work with data that certified correct by others in the organization.  Face validity matters to the Business Analyst; data should be internally consistent and align with the analysts understanding of the business.

In most cases, the work product of a Business Analyst is a report summarizing the results of an analysis.   The work product may also be a decision of some kind, such as the volume of merchandise to a complex loan decision.  Business Analysts rarely produce predictive models for production deployment, because their working methods tend to lack the rigor and exhaustiveness of Power Analysts.

Business Analysts value good customer-friendly Technical Support, and tend to prefer to use software from vendors with demonstrated credibility in analytics.  

The Analytic Consumer

Analytic Consumers are fully focused on business questions and issues and do not engage directly in the productionof analytics; instead they use the results of analytics in the form of automated decisions, forecasts and other forms of intelligence that are embedded into the business processes in which they engage.

Analytic Consumers are not necessarily top managementor any other specific level in the organization; they are simply not professionally engaged in the sausage-makingof forecasts, automated decisions, and so forth.

While the Analytic Consumer may not engage with mathematical computations, they are concerned with the overall utility, performance and reliability of the systems they use.  For example, a customer service rep in a credit card call center may not be concerned with the analytic method used to determine a decision, but will be very concerned if the system takes a long time to reach a decision.  The rep may also object if the system does not provide reasonable explanations when it declines credit request, or appears to decline too many customers that seem to be good risks.

In most organizations, Analytic Consumers are the largest group of prospective users.  Since the range of possible ways that analytics can positively affect business processes is large and growing rapidly, and since embedded analytics have few barriers to use, this group of users also has the greatest growth potential.

In most organizations, there are many more prospective Analytic Consumers and Business Analysts than Power Analysts and Data Scientists; on the surface, this means that a strategy of appealing to Analytic Consumers and Business Analysts offers the greatest potential for business value.  However, few organizations are willing to entrust “hard money” analytic applications (such as fraud, credit risk or trading) to analytic novices; since the best and brightest analysts tend to be Power Analysts or Data Scientists, they tend to carry the most weight in decision-making about analytics.

Dell Buys Statsoft

Dell announced this morning that it has acquired Statsoft, a privately held company that distributes Statistica, a suite of software for statistics and data mining.   Terms of sale were not announced.

Founded by academics in 1984, Statsoft has developed a loyal following at the low end of the analytics market, where it offers a reasonably priced alternative to SAS and SPSS.  The Statistica software suite includes a number of modules that support statistics, multivariate analysis, data mining, ETL, real-time scoring, quality control, process control and vertical solutions.  Relative to other statistical software packages on the market, Statistica’s support for analytic features is comprehensive.

Statistica 12.0: Plot Window

Statistica appeals to a core group of loyal and satisfied users.  In the most recent Rexer data mining survey, Statistica ranked eleventh overall in reported use, but ranked second in reported primary use; the product scored at the top of the list in user satisfaction.  According to Rexer’s segmentation, Statistica has the highest penetration among users who are new to data mining, rarely work with Big Data, place a high value on ease of use, and do not want to write their own code.

StatSoft supports desktop and server editions of Statistica on Windows only; that should fit well with Dell’s hardware business.  What does not make sense is Dell’s claim that this acquisition “bolsters its portfolio of Big Data Solutions”; Statistica lacks support for distributed computing, and does not run in databases or Hadoop.

Machine Learning in Hadoop: Part One

Much has changed since I last blogged on this subject a year ago (here and here).  This is the first of a three-part blog covering the current state of play for machine learning in Hadoop.  I use the term “machine learning” deliberately, to refer to tools that can learn from data in an automated or semi-automated manner; this includes traditional statistical modeling plus supervised and unsupervised machine learning.  For convenience, I will not cover fast query tools, BI applications, graph engines or streaming analytics; all of those are important, and deserve separate treatment.

Every analytics vendor claims the ability to work with Hadoop.  In Part One, we cover five things to consider when evaluating how well a particular machine learning tool integrates with Hadoop: deployment topology, hardware requirements, workload integration, data integration, and the user interface.  Of course, these are not the only things an organization should consider when evaluating software; other features, such as support for specific analytic methods, required authentication protocols and other needs specific to the organization may be decisive.

Deployment Topology

Where does the machine learning software reside relative to the Hadoop TaskTracker and Data Nodes (“worker nodes”)?  Is it (a) distributed among the Hadoop worker nodes; (b) deployed on special purpose “analytic” nodes or (c) deployed outside the Hadoop cluster?

Distribution among the worker nodes offers the best performance; under any other topology, data movement will impair performance.  If end users tend to work with relatively small snippets of data sampled from the data store, “beside” architectures may be acceptable, but fully distributed deployment is essential for very large datasets.

Deployment on special purpose “analytic” nodes is a compromise architecture, usually motivated either by a desire to reduce software licensing fees or avoid hardware upgrades for worker node servers.  There is nothing wrong with saving money, but clients should not be surprised if performance suffers under anything other than a fully distributed architecture.

Hardware Requirements

If the machine learning software supports distributed deployment on the Hadoop worker nodes, can it run effectively on standard Hadoop node servers?  The definition of a “standard” node server is a moving target; Cloudera, for example, recognizes that the appropriate hardware spec depends on planned workload.  Machine learning, as a rule, benefits from a high memory spec, but some machine learning software tools are more efficient than others in the way they use memory.

Clients are sometimes reluctant to implement a fully distributed machine learning architecture in Hadoop because they do not want to replace or upgrade a large number of node servers.  This reluctance is natural, but the problem is attributable in part to a gap in planning and rapidly changing technology.  Trading off performance for cost reduction may be the right thing to do, but it should be a deliberate decision.

Workload Integration

If the machine learning software can be distributed among the worker nodes, how well does it co-exist with other MapReduce and non-MapReduce applications?  The gold standard is the ability to run under Apache YARN, which supports resource management across MapReduce and non-MapReduce applications.   Machine learning software that pushes commands down to MapReduce is also acceptable, since the generated MapReduce jobs run under existing Hadoop workload management.

Software that effectively takes over the Hadoop cluster and prevents other jobs from running is only acceptable if the cluster will be dedicated to the machine learning application.   This is not completely unreasonable if the Hadoop cluster replaces a conventional standalone analytic server and file system; the TCO for a Hadoop cluster is very favorable relative to a dedicated high-end analytic server.  Obviously, clients should know how they plan to use the cluster when considering this.

Data Integration

Ideally, machine learning software should be able to work with every data format supported in Hadoop; most machine learning tools are more limited in what they can read and write. The ability to work with uncompressed text in HDFS is table stakes; more sophisticated tools can work with sequence files as well, and support popular compression formats such as Snappy and Bzip/Gzip.  There is also growing interest in use of Apache Avro.   Users may also want to work with data in HBase, Hive or Impala.

There is wide variation in the data formats supported by machine learning software; clients are well advised to tailor assessments to the actual formats they plan to use.

User Interface

There are many aspects of the user interface that matter to clients when evaluating software, but here we consider just one aspect:  Does the machine learning software require the user to specify native MapReduce commands, or does it effectively translate user requests to run in Hadoop behind the scenes?

If the user must specify MapReduce, Hive or Pig it begs the question: why not just perform that task directly in MapReduce, Hive or Pig?

In Part Two, we will examine current open source alternatives for machine learning in Hadoop. 

2014 Predictions: Advanced Analytics

A few predictions for the coming year.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation.  Organizations will increasingly question the value of “point solutions” for Hadoop analytics versus Spark’s integrated platform for machine learning, streaming, graph engines and fast queries.

At least one commercial software vendor will release software using Spark as a foundation.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

(2) “Co-location” will be the latest buzzword.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.

YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  In practice, that means it will be possible to stand up co-located server-based analytics (e.g. SAS) on a few nodes with expanded memory inside Hadoop.  This asymmetric architecture adds some latency (since data moves from the HDFS data nodes to the analytic nodes), but not as much as when data moves outside of Hadoop entirely.  For most analytic use cases, the cost of data movement will be more than offset by the improved performance of in-memory iterative processing.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.


(3) Graph engines will be hot.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

(4) R approaches parity with SAS in the commercial job market.

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

(5) SAP emerges as the company most likely to buy SAS.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

(6) Competition heats up for “easy to use” predictive analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

Analytic Startups: Skytree

Skytree started out as an academic machine learning project developed at Georgia Tech’s Fastlab.  Leadership shopped the software to a number of software vendors prior to 2011 and, finding no buyers, launched as a standalone venture in 2012.

In April 2013, Skytree announced Series A funding of $18 million, with backing from U.S. Venture Partners, UPS, Javelin Venture Partners and Osage University Partners.   The company has 18 U.S. employees in LinkedIn.

Skytree’s public reference customers include Adconian, Brookfield Residential Property Services, CANFAR, eHarmony, SETI Institute and United States Golf Association.  This customer list did not change in 2013 despite significant investment in marketing and sales.

Skytree has formally partnered with Cloudera, Hortonworks and MapR.

Compared to its peers, Skytree reveals very little about its technology, which is generally a yellow flag.

urlSkytree’s principal product is Skytree Server, a server-based library of distributed algorithms.   Skytree claims to support the following techniques:

  • Support Vector Machines (SVM)
  • Nearest Neighbor
  • K-Means
  • Principal Component Analysis (PCA)
  • Linear Regression
  • Two-Point Correlation
  • Kernal Density Estimation (KDE)
  • Gradient Boosted Trees
  • Random Forests

Skytree does not show images or videos of its user interface anywhere on its website.  The implication is that it lacks a visual interface, and programming is required.  Skytree claims a web services interface as well as interfaces to R, Weka, C++ and Python.

For data sources, Skytree claims the ability to connect to relational databases (presumably through ODBC); Hadoop (presumably HDFS); and to consume data from flat files and “common statistical packages”.

Skytree claims the ability to deploy on commodity Linux servers in local, cluster, cloud or Hadoop configurations.  (Absent YARN support, though, the latter will be a “beside” architecture, with data movement).

A second product, Skytree Advisor, launched in Beta in September.  Skytree Advisor is mostly interesting for what it reveals about Skytree Server.  The product includes some unique capabilities, including the ability to produce an actual report, but the user interface evokes a blue screen of death.   The status of this offering seems to be in doubt, as Skytree no longer promotes it.

Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)

Updated and bumped July 10, 2014.

For a powerpoint version on Slideshare, go here.


Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop.  Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014.  According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.

Organizations seeking to implement advanced analytics in Hadoop face two key challenges.  First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.

A second key challenge is the plethora of analytic point solutions in Hadoop.  These include, among others, Mahout for machine learning; Giraph, and GraphLab for graph analytics; Storm and S4 for streaming; or HiveImpala and Stinger for interactive queries.  Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.

Spark directly addresses these challenges.  It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data.  This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.

Second, Spark offers an integrated framework for analytics, including:

A closely related project, Shark, supports fast queries in Hadoop.  Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project.  The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.

Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs.  RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs.  RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence.  They are designed to be fault tolerant, so that if an operation fails it can be reconstructed.

For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase).  Hadoop supports text files, SequenceFiles and any other Hadoop InputFormat.  Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.

Analytic Features

Spark’s machine learning library, MLLib, is rapidly growing.   In Release 1.0.0 (the latest release) it includes:

  • Linear regression
  • Logistic regression
  • k-means clustering
  • Support vector machines
  • Alternating least squares (for collaborative filtering)
  • Decision trees for classification and regression
  • Naive Bayes classifier
  • Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
  • Model evaluation functions
  • L-BFGS optimization primitive

Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization.  MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).

In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark.  Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark.  The Mahout team will maintain existing MapReduce projects.  There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.

Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework.  It enables users to interactively load, transform, and compute on massive graphs.  Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.

Spark Streaming offers an additional abstraction called discretized streams, or DStreams.  DStreams are a continuous sequence of RDDs representing a stream of data.  The user creates DStreams from live incoming data or by transforming other DStreams.  Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.

Currently, Spark supports programming interfaces for Scala, Java and Python;  MLLib algorithms support sparse feature vectors in all three languages.  For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014

There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0.  In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined.   In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.  So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.

Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release for GraphX, enhancements to MLLib and many other enhancements).  Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.


Spark is now available in every major Hadoop distribution.  Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks.  (For more on Cloudera’s support for Spark, go here).  In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.

Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.

IBM’s commitment to Spark is unclear.  While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.

In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine.  Datastax will partner with Databricks on this project; availability expected summer 2014.

At the 2014 Spark Summit, SAP announced its support for Spark.  SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.

On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem.   The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.

At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.


In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.

Databricks offers a certification program for Spark; participants currently include:

In May, Databricks and Concurrent Inc announced a strategic partnership.  Concurrent plans to add Spark support to its Cascading development environment for Hadoop.


In December, the first Spark Summit attracted more than 450 participants from more than 180 companies.  Presentations covered a range of applications such as neuroscienceaudience expansionreal-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).

The 2014 Spark Summit was be held June 30 through July 2 in San Francisco.  The event sold out at more than a thousand participants.  For a summary, see this post.

There is a rapidly growing list of Spark Meetups, including:

Now available for pre-order on Amazon:

Finally, this series of videos provides some good basic knowledge about Spark.

2013 Rexer Data Miner Survey

Rexer Analytics published its 2013 Data Miner Survey just before the Holidays, and it’s an excellent read.

As always when working with survey research, one should use some caution in interpreting the results; it’s very difficult to build a representative sample of analysts and data miners.  While it is easy to find fault with Rexer’s sample — which vendors who are unhappy with some of the findings will likely try to do — there is no better survey of working analysts available today.

Key findings:

  • Customer Analytics is the most frequently cited application for analytics:
    • Understanding customers
    • Improving customer experience
    • Customer acquisition, upsell and cross-sell
  • Respondents recognize growing data volumes, but the size of their analytic data sets is stable
    • In other words, one should not confuse managing Big Data with analyzing Big Data
  • R is the most widely used analytic software
    • 70% of respondents say they use R
    • 24% say R is their primary tool, more than any other software
  • Text mining is mainstream; 70% of respondents say they mine text now or plan to start
  • Time to deployment remains an issue; respondents report deployment cycles ranging from weeks to a year or more

One of the most interesting pieces of analysis in the survey is a clustering based on the importance ratings of tool selection criteria.  Rexer’s analysis reveals two principal dimensions in the data, one labeled as “Cost” and the other labeled as “Ease of Use and Interface Quality”.  The largest cluster, which includes respondents who rated everything important, should be discounted as an artifact of questionnaire design; it reflects a phenomenon known as the “wrist effect”, where respondents simply check all of the boxes on one end of the scale.   Of the remaining respondents:

  • Respondents who value the ability to write one’s own code generally do not value ease of use, and vice versa.  These respondents are most likely to cite SAS or R as their primary software
    • Among these users, those who cite the importance of cost are much more likely to cite R as their primary tool
    • Those who place a lower value on cost tend to value the quality of the user interface
  • Respondents who value ease of use and the quality of the user interface are more likely to be new to analytics
    • These respondents are most likely to cite Statistica, Rapid Miner and IBM SPSS Modeler as their primary tool

For more information about the survey and to get a copy, go here.