Analytic Startups: 0xdata (Updated May 2014)

Updated May 22, 2014

0xdata (“Hexa-data”) is a small group of smart people from Stanford and Silicon Valley with VC backing and an open source software project for advanced analytics (H2O).  Founded in 2011, 0xdata first appeared on analyst dashboards in 2012 and has steadily built a presence in the data science community since then.

0xdata operates on a services business model, and does not offer commercially licensed software.  The firm has four public reference customers and claims more than 2,000 users.  0xdata has formal partnerships with Cloudera, Hortonworks, Intel and MapR.

0xdata’s H2O project is a library of distributed algorithms designed for deployment in Hadoop or free-standing clusters. 0xdata licenses H2O under the Apache 2.0 open source license. The development team is very active; in the thirty days ended May 22, 19 contributors pushed 783 commits to the project on Git.

The roadmap is aggressive; as of May 2014 the library includes:

For Generalized Linear Models, k-Means and Gradient Boosting, H2O supports a Grid Search feature enabling users to specify multiple models for simultaneous development and comparison.   This feature is a significant timesaver when the optimal model parameters are unknown (which is ordinarily the case).
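To illustrate the idea behind a grid search, here is a minimal sketch in plain Python. It does not use H2O or its API; the `grid_search` and `fit` functions, the parameter names and the toy data are all assumptions made up for illustration. The sketch fits one small ridge-style regression per parameter combination and ranks the results by holdout error.

```python
from itertools import product

def grid_search(fit, score, grid):
    """Fit one model per parameter combination; return (error, params) sorted by error."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit(**params)
        results.append((score(model), params))
    results.sort(key=lambda pair: pair[0])
    return results

# Hypothetical toy data: a training set and a holdout set.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
hold_x, hold_y = [5.0, 6.0], [10.1, 11.9]

def fit(penalty, fit_intercept):
    """Closed-form one-variable regression with an L2 penalty on the slope."""
    n = len(train_x)
    mx = sum(train_x) / n if fit_intercept else 0.0
    my = sum(train_y) / n if fit_intercept else 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sxy / (sxx + penalty)
    return slope, my - slope * mx

def score(model):
    """Mean squared error on the holdout set."""
    slope, intercept = model
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(hold_x, hold_y)]
    return sum(errors) / len(errors)

grid = {"penalty": [0.0, 1.0, 10.0], "fit_intercept": [True, False]}
results = grid_search(fit, score, grid)
for error, params in results:
    print(round(error, 4), params)
```

The point is the loop over `product(...)`: the analyst declares the candidate values once, and every combination is fitted and compared on the same holdout data.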

Users interact directly with the software through a web browser or REST API.  Alternatively, R users can use the H2O.R package to invoke algorithms from RStudio or an alternative R development environment.  (Video demo here).  Scala users can work with H2O through the Scalala library.

For Hadoop deployment, H2O supports CDH4.x, MapR 2.x and AWS EC2. H2O integrates with HDFS, and is co-located within Hadoop. At present, H2O supports CSV, Gzip-compressed CSV, MS Excel (XLS), ARFF, Hive file format, “and others”.

Each H2O algorithm supports scoring and prediction capability.   There is currently no facility for PMML export; this is unnecessary if H2O is deployed in Hadoop (since one can simply use the native prediction capability).

In March, the Apache Mahout project announced that it will support H2O.

SAP Buys KXEN

SAP announced today that it plans to acquire KXEN in a deal that will close in the fourth quarter.  No purchase price was announced.  Since one recently laid-off employee characterized the company’s prospects as “circling the toilet”, this seems like a case of bottom-feeding by SAP.

KXEN has struggled to position and sell its InfiniteInsight analytic software.  The vendor’s black-boxy approach has little appeal for hard-core analysts, who prefer tooling that offers greater control over the analytics process.  At the other end of the value chain, business executives are not interested in analytics, but in business solutions.

Hence, KXEN is neither fish nor fowl as a standalone company, but its technology is worth something to an enterprise vendor such as SAP, which says it will embed KXEN in applications for managing operations, customer relationships, supply chains, risk and fraud.

KXEN has never been terribly forthcoming about details of its technology. The software is server-based, with database integration primarily through ODBC and PMML. KXEN has an established partnership with SAP Sybase, but for model scoring only in a “run-beside” architecture. SAP says it will integrate KXEN with HANA, but I suspect that will also be in a run-beside architecture, since KXEN adds little to SAP’s in-database Predictive Analytics Library.

Update: Several analysts have commented on SAP’s move, including Curt Monash. Monash correctly distinguishes between analytic programming languages (such as SAS or R) and analytic applications such as KXEN’s InfiniteInsight. (There is a third category, which I call the analytic workbench, that is designed for users who have some understanding of analytics but would rather not program. SPSS Modeler is an example.)

Monash also rightly throws cold water on SAP’s ability to embed KXEN in business solutions, pointing out InfiniteInsight’s lack of tooling needed for risk applications. I’d go further and say that KXEN has no credibility outside of Marketing Campaign Management, where SAP CRM is sadly stuck behind IBM/Unica, SAS, Neolane, Teradata Aprimo, Oracle and Pitney Bowes.

Notes from Strata 2013

Last week I attended the O’Reilly Strata 2013 Conference.    Here are some notes on presentations pertinent to analytics, in four categories:

  • Vendors
  • Users
  • Technical
  • Thought Provokers

Vendor Presentations

We wouldn’t have trade shows without sponsors, and the big ones get ten minutes of fame.  Some used their time well, others not so much.  I’ll refrain from shaming the bloviaters, but will single out three for applause:

  • John Schroeder of MapR did a nice preso on the business case for Hadoop, with a refreshing focus on measurable revenue impact and cost reduction;
  • Girish Juneja from Intel delivered a thoughtful summary of Intel’s participation in open source projects.  Not a lot of sizzle, but refreshingly free of hype;
  • Charles Zedlewski of Cloudera provided a terrific explanation of the history and direction of Hadoop and made a compelling case for the platform.

Someone should tell O’Reilly that Skytree is a vendor.  Skytree managed to get a 45-minute slot in a non-vendor track, and Alexander Gray of Skytree used the time to say stuff that data miners learned years ago.

User Presentations

Several presenters spoke about how their organization uses analytics.   In any conference, presentations like this are often the most compelling.

  • Rajat Taneja from Electronic Arts spoke about the depth of information captured by gaming companies, and how they use this information to improve the gaming experience. Good presentation, with great visuals.
  • Eric Colson of Stitch Fix (and formerly with Netflix) spoke about recommendation engines.  Stitch Fix sends bundles of new clothing to buyers on spec, and they have finely tuned the bundling process using a mix of machine learning and human decisions.  Eric spoke about the respective strengths of machine and human decisioning, and how to use them together effectively.
  • Michael Bailey of Facebook gets credit for truth in packaging for “Introduction to Forecasting”.  His presentation covered very basic content, the sort of thing covered in Stat 101, and he did a fine job presenting that.  Michael hinted at Facebook’s complex forecasting problem — they have to simultaneously forecast eyeballs and ad placements — and it would be great to hear more about that in a future presentation.

Technical Presentations

It’s tough to deliver detailed content in a short session; most of the presenters I saw struck the right balance.

  • Sharmila Shahani-Mulligan and others from ClearStory Data presented to an overflow audience interested in learning more about Spark and Shark. Spark is an open source in-memory distributed computational engine that runs on top of Hadoop. It is designed to support iterative algorithms, and supports Java, Scala and Python. Shark is part of Hive, integrates with Spark, and offers a SQL interface.
  • Dr. Vijay Srinivas Agneeswaran of Impetus Technologies delivered what I thought was the best presentation in the show.   He summarized the limits of legacy analytics, discussed analytics in Hadoop (such as Mahout),  and spoke about a third wave of distributed analytics based on technologies like Spark, HaLoop, Twister, Apache Hama and GraphLab.
  • Jayant Shekhar of Cloudera delivered a very detailed presentation on how to build a recommendation engine.

Thought Provokers

Several presenters spoke on broad conceptual topics, with mixed results.

  • James Hendler of RPI spoke on the subject of “broad data”.  His presentation seemed thoughtful, but to be honest he lost me.
  • Nathan Marz of Twitter has co-authored a book on Big Data coming out soon.   After listening to his short preso on data modeling (“Human Fault Tolerance”) , I added the book to my wish list.
  • Kate Crawford of Microsoft presented on the subject of hidden biases in big data. Although her presentation covered material well known to seasoned analysts (“hey, did you know that your data may be biased?”), it was excellent, and full of good examples.

Overall, an excellent show.

Analytic Applications, Part Four: Enabling Customers

This post is the last in a four-part series covering analytic applications organized according to how enterprises consume analytics.

Part One (here) covered Strategic Analytics, or analytics that address C-suite questions and issues.

Part Two (here) covered Managerial Analytics, which serve to measure the performance of products and programs at a departmental level, and to optimize the allocation of resources across programs.

Part Three (here) covered Operational Analytics, analytics that improve the efficiency or effectiveness of business processes.

All of these applications have one thing in common: they exist to serve internal needs of the enterprise, which retains the value produced by analytics.  This is not a bad thing; credit card customers benefit indirectly when an issuer  uses analytics to avoid giving credit to customers who subsequently default, but the firm itself is the direct and primary beneficiary of the credit risk analysis.

Customer-enabling analytics turn this logic on its head: the analytics are designed to provide a benefit to customers, while the enterprise benefits indirectly through product differentiation, goodwill or some combination of the two.

There are four distinct categories of Customer-Enabling Analytics:

  • Analytic Services
  • Prediction Services
  • Analytic Applications
  • Product-Embedded Analytics

On the surface, Analytic Services provided by consulting firms, marketing service providers and so forth are simply a sourcing alternative for the previously defined Strategic, Managerial or Operational Analytics, but not fundamentally different.   In practice, however, analytics delivered by service providers tend to be very different than analytics developed “in-house”.  With few barriers to entry, the market for Analytic Services is highly competitive; as a result, successful providers tend to be highly innovative and specialized, offering services that cannot be easily reproduced.  Moreover, the relationship between service provider and enterprise consumer (and the visible costs associated with a project) tend to ensure that project goals are well-defined, a step that is often omitted from internally-delivered analytics (to the detriment of all engaged).

For Analytic Services, the “product” sold and delivered is an analysis project, which is typically priced based on the effort required to complete the project and the time-value of resources consumed. For Prediction Services, the product sold and delivered to the customer is a prediction, not a project, and is typically priced on a per-use basis. Credit scores are the best-known example of Prediction Services, but there are many other examples of prediction services for sales, marketing, human resources and insurance underwriting. As with Analytic Services, the end uses to which Prediction Services are put appear to be the same as in-house delivered Strategic, Managerial and Operational Analytics, but in practice externally developed Prediction Services work in a very different way. Since the development and deployment costs for a predictive model are amortized over a large volume of transactions, Prediction Services enable a broad market of smaller enterprises that could not otherwise do so to benefit from predictive analytics. Prediction Service providers are also able to achieve economies of scale, and often have access to data sources that would not necessarily be available to the enterprise.

Analytic Applications are a natural extension of Analytic Services and Prediction Services.  Analytic Applications are business applications that consume data-driven predictions and support all or part of a business process.  Examples include:

  • Mortgage application decision systems (which consume predictions about the applicant’s propensity to repay the loan)
  • Insurance underwriting systems (consume predictions about expected losses from an insurance policy)
  • Fraud case management systems (consume predictions about the likelihood that a particular claim or group of claims is fraudulent)

These applications are often sold and delivered by providers under a “razor-and-blade” strategy, where the application itself is delivered under a fixed price and combined with a long-term contract to provide Analytic Services or Prediction Services.

Each of the first three categories of Customer-Enabling Analytics is similar to and competes with “in-house” delivered Strategic, Managerial and Operational Analytics.  The fourth category, Product-Embedded Analytics, is potentially the most disruptive and offers enterprises the greatest potential return.  Product-Embedded Analytics differentiate the firm’s products in meaningful ways by solving a consumer problem.

If this sounds esoteric, it is because the best examples are often not thought of in the same way we think about other kinds of analytics:

  • Consumers have a problem finding information.  Google’s search engine solves this problem.
  • Consumers have a problem finding a movie they want to watch.  Netflix’ recommendation engine solves this problem.

These examples — and many others, including Facebook’s newsfeed engine and Match.com’s matching algorithm — use machine learning technology in ways that directly benefit customers. But the firms that offer these services benefit indirectly, by building site traffic, selling more product or satisfying customers in a manner that cannot be readily reproduced by competitors.

Analytic Applications (Part Three): Operational Analytics

This is the third post in a series on analytic applications organized by how analytic work product is used within the enterprise.

  • The first post, linked here, covers Strategic Analytics (defined as analytics for the C-Suite)
  • The second post, linked here, covers Managerial Analytics (defined as analytics to measure and optimize the performance of value-delivering units such as programs, products, stores or factories).

This post covers Operational Analytics, defined as analytics that improve the efficiency or effectiveness of a business process.  The distinction between Managerial and Operational analytics can be subtle, and generally boils down to the level of aggregation and frequency of the analysis.  For example, the CMO is interested in understanding the performance and ROI of all Marketing programs, but is unlikely to be interested in the operational details of any one program.  The manager of that program, however, may be intensely interested in its operational details, but have no interest in the performance of other programs.

Differences in level of aggregation and frequency lead to qualitative differences in the types of analytics that are pertinent. A CMO’s interest in Marketing programs is typically at a level of “keep or kill”: continue funding the program if it’s effective, kill it if it is not. This kind of problem is well-suited to dashboard-style BI combined with solid revenue attribution, activity based costing and ROI metrics. The Program Manager, on the other hand, is intensely interested in a range of metrics that shed insight not simply on how well the program is performing, but why it is performing as it is and how to improve it. Moreover, the Program Manager in this example will be deeply involved in operational decisions such as selecting the target audience, determining which offers to assign, handling response exceptions and managing delivery to schedule and budget. This is the realm of Operational Analytics.

While any BI package can handle different levels of aggregation and cadence, the problem is made more challenging due to the very diverse nature of operational detail across business processes.   A social media marketing program relies on data sources and operational systems that are entirely different from web media or email marketing programs; preapproved and non-pre-approved credit card acquisition programs do not use the same systems to assign credit lines; some or all of these processes may be outsourced.  Few enterprises have successfully rationalized all of their operational data into a single enterprise data store (nor is it likely they will ever do so).  As a result, it is very rare that a common BI system comprehensively supports both Managerial and Operational analytic needs.  More typically, one system supports Managerial Analytics (for one or more disciplines), while diverse systems and ad hoc analysis support Operational Analytics.

At this level, questions tend to be domain-specific and analysts highly specialized in that domain. Hence, an analyst who is an expert in search engine optimization will typically not be seen as qualified to perform credit risk analysis. This has little to do with the analytic methods used, which tend to be similar across business disciplines, and more to do with the language and lingo used in the discipline as well as domain-specific technology and regulatory issues. A biostatistician must understand common health care data formats and HIPAA regulations; a consumer credit risk analyst must understand FICO scores, FISERV formats and FCRA. In both cases, the analyst must have or develop a deep understanding of the organization’s business processes, because this is essential to recognizing opportunities for improvement and prioritizing analytic projects.

While there is a plethora of different ways that analytics improve business processes, most applications fall into one of three categories:

(1) Applied decision systems supporting business processes such as customer-requested line increases or credit card transaction authorizations.  These applications improve the business process by applying consistent data-driven rules designed to balance risks and rewards.  Analytics embedded in such systems help the organization optimize the tradeoff between “loose” and “tight” criteria, and ensure that decision criteria reflect actual experience.  An analytics-driven decisioning system performs in a faster and more consistent way than systems based on human decisions, and can take more information into account than a human decision-maker.

(2) Targeting and routing systems (such as a text-processing system that reads incoming email and routes it to a customer service specialist).  While applied decision systems in the first category tend to recommend categorical yes/no, approve/decline decisions in a stream of transactions, a targeting system selects from a larger pool of candidates, and may make qualitative decisions among a large number of alternate routings.   The business benefit from this kind of system is improved productivity and reduced processing time as, for example, the organization no longer requires a team to read every email and route it to the appropriate specialist.  Applied analytics make these systems possible.

(3) Operational forecasting (such as a system that uses projected store traffic to determine staffing levels). These systems enable the organization to operate more efficiently through better alignment of operations to customer demand. Again, applied analytics make such systems possible; while it is theoretically possible to build such a system without an analytic forecasting component, it is inconceivable that any management would risk the serious customer service issues that would be created without one. Unlike the first two applications, forecasting systems often work with aggregate data rather than atomic data.
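To make the “loose” versus “tight” tradeoff in the first category concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the scored history, the gain and loss figures and the candidate cutoffs are invented for illustration. It replays past applicants against several approval cutoffs and reports which cutoff would have produced the most profit.

```python
def realized_profit(threshold, history, gain_good, loss_bad):
    """Profit from approving every applicant whose predicted default
    probability is below the threshold, judged on actual outcomes."""
    profit = 0.0
    for p_default, defaulted in history:
        if p_default < threshold:
            profit += -loss_bad if defaulted else gain_good
    return profit

def best_threshold(candidates, history, gain_good, loss_bad):
    """Pick the candidate cutoff with the highest realized profit."""
    return max(
        candidates,
        key=lambda t: realized_profit(t, history, gain_good, loss_bad),
    )

# Hypothetical history: (predicted default probability, actually defaulted?)
history = [
    (0.02, False), (0.05, False), (0.10, False), (0.15, True),
    (0.20, False), (0.30, True), (0.45, True), (0.60, True),
]
candidates = [0.05, 0.12, 0.25, 0.50, 1.00]

# Assumed economics: a good account earns 100, a default costs 500.
cutoff = best_threshold(candidates, history, gain_good=100.0, loss_bad=500.0)
profit = realized_profit(cutoff, history, gain_good=100.0, loss_bad=500.0)
print(cutoff, profit)
```

A cutoff that is too tight turns away profitable accounts; one that is too loose lets the losses from defaults swamp the gains. Replaying actual experience is one simple way to keep the decision criteria tied to outcomes.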

For analytic reporting, the ability to flexibly ingest data from operational data sources (internal and external) is critical, as is the ability to publish reports into a broad-based reporting and BI presentation system.

Deployability is the key requirement for predictive analytics; the analyst must be able to publish a predictive model as a PMML (Predictive Model Markup Language) document or as executable code in a choice of programming languages.

In the next post, I will cover the most powerful and disruptive form of analytics, what I call Customer-Enabling Analytics: analytics that differentiate your products and services and deliver value to the customer.

Latest Forrester Analytics “Wave”

Forrester’s latest assessment of predictive analytics vendors is available here; news reports summarizing the report are here, here and here.

A few comments:

(1) While the “Wave” analysis purports to be an assessment of “Predictive Analytics Solutions for Big Data”, it is actually an assessment of vendor scale.  You can save yourself $2,495 by rank-ordering vendors by revenue and get similar results.

(2) The assessment narrowly focuses on in-memory tools, which arbitrarily excludes some of the most powerful analytic tools in the market today.    Forrester claims that in-database analytics “tend to be oriented toward technical users and require programming or SQL”.  This is simply not true.  Oracle Data Mining and Teradata Warehouse Miner have excellent user interfaces, and IBM SPSS Modeler provides an excellent front-end to in-database analytics across a number of databases (including IBM Netezza, DB2, Oracle and Teradata).  Alpine Miner is a relatively new entrant that also has an excellent UI.

(3) Forrester exaggerates SAS’ experience and market presence in Big Data.  Most SAS customers work primarily with foundation products that do not scale effectively to Big Data; this is precisely why SAS developed the high performance products demonstrated in the analyst beauty show.   SAS has exactly one public reference customer for its new in-memory high-performance analytics software.

(4) SAS Enterprise Miner may be “easy to learn”, but it is a stretch to say that it has the capability to “run analytics in-database or distributed clusters to handle Big Data”.

(5) Designation of SAP as a “leader” in predictive analytics is also a stretch.  SAP’s Predictive Analytics Library is a new product with some interesting capabilities; however, it is largely unproven and SAP lacks a track record in this area.

(6) The omission of Oracle Data Mining from this report makes no sense at all.

(7) Forrester’s scorecard gives every vendor the same score for pricing and licensing.  That’s like arguing that a Bentley and a Chevrolet are equally suitable as family cars, but the Bentley is preferable because it has leather seats.    TCO matters.  As I reported here, a firm that tested SAS High Performance Analytics and reported good results did not actually buy the product because, as a company executive notes, “this stuff is expensive.”

Advanced Analytics in Hadoop, Part Two

In a previous post, I summarized the current state of Mahout, the Apache project for advanced analytics in Hadoop.    But what if the analytic methods you need are not implemented in the current Mahout release?  The short answer is that you are either going to program the algorithm yourself in MapReduce or adapt an open source algorithm from an alternative library.

Writing the program yourself is less daunting than it sounds; this white paper from Cloudera cites a number of working applications for predictive analytics, none of which use Mahout.  Adapting algorithms from other libraries is also an excellent option; this article describes how a team used a decision tree algorithm from Weka to build a weather forecasting application.

Most of the enterprise Hadoop distributors (such as Cloudera, Hortonworks and MapR) support Mahout but without significant enhancement. The exception is IBM, whose InfoSphere BigInsights Hadoop distribution incorporates a suite of text mining features nicely demonstrated in this series of videos. IBM Research has also developed System ML, a suite of machine learning algorithms written in MapReduce, although as of this writing System ML is a research project and not generally available software.

To simplify program development in MapReduce for analysts, Revolution Analytics launched its RHadoop open source project earlier this year. RHadoop’s rmr package provides R users with a high-level interface to MapReduce that greatly simplifies implementation of advanced analytics. This example shows how an rmr user can implement k-means clustering with 28 lines of code; a comparable procedure, run in Hortonworks with a combination of Python, Pig and Java, requires 100 lines of code.
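For readers who have not seen k-means expressed as map and reduce steps, here is a minimal single-machine sketch in plain Python. It is not the rmr example referenced above and uses no Hadoop or rmr API; the function names and the toy points are assumptions for illustration only. The map step tags each point with its nearest center, and the reduce step averages the points that share a tag.

```python
from collections import defaultdict

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )

def map_step(points, centers):
    """Map: emit a (cluster index, point) pair for every point."""
    for point in points:
        yield nearest(point, centers), point

def reduce_step(pairs, centers):
    """Reduce: average the points that were assigned to each cluster index."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    new_centers = list(centers)  # an empty cluster keeps its old center
    for key, members in groups.items():
        new_centers[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centers

def kmeans(points, centers, iterations=10):
    """Repeat map and reduce until the centers stop moving."""
    centers = list(centers)
    for _ in range(iterations):
        updated = reduce_step(map_step(points, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers

# Hypothetical toy data: two obvious groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
final_centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centers)
```

In a real MapReduce job the pairs emitted by the map step would be shuffled across nodes by key, and each iteration would be a separate job; that overhead is exactly what a high-level interface hides from the analyst.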

For analytic use cases where the primary concern is to implement scoring in Hadoop, Zementis offers the Universal PMML Plug-In(TM) for Datameer. This product enables users to deploy PMML documents from external analytic tools as scoring procedures within Datameer. According to Michael Zeller, CEO of Zementis, the Plug-In can actually be deployed into any Hadoop distribution. There is an excellent video about this product from the Hadoop Summit at this link.

Datameer itself is a spreadsheet-like BI application that integrates with Hadoop data sources.  It has no built-in capabilities for advanced analytics, but supports a third-party app market for Customer Analytics, Social Analytics and so forth.  Datameer’s claim that its product is suitable for genomic analysis is credible if you believe that a spreadsheet is sufficient for genomic analysis.

Finally, a word on what SAS is doing with Hadoop.  Prior to January, 2012, the search terms “Hadoop” and “MapReduce” produced no hits on the SAS website.   In March of this year, SAS released SAS/ACCESS Interface to Hadoop, a product that enables SAS programmers to embed Hive and MapReduce expressions in a SAS program.  While SAS/ACCESS engines theoretically enable SAS users to push workload into the datastore, most users simply leverage the interface to extract data and move it into SAS.  There is little reason to think that SAS users will behave differently with Hadoop; SAS’ revenue model and proprietary architecture incents it to preach moving the data to the analytics and not the other way around.

Advanced Analytics in Hadoop, Part One

This is the first of a two-part post on the current state of advanced analytics in Hadoop.  In this post, I’ll cover some definitions, the business logic of advanced analytics in Hadoop, and summarize the current state of Mahout.  In a second post, I’ll cover some alternatives to Mahout, currently available and in the pipeline.

For starters, a few definitions.

I use the term advanced analytics to cover machine learning tools (including statistical methods) for the discovery and deployment of useful patterns in data.   Discovery means the articulation of patterns as rules or mathematical expressions;  deployment means the mobilization of discovered patterns to improve a business process.  Advanced analytics may include supervised learning or unsupervised learning, but not queries, reports or other analysis where the user specifies the pattern of interest in advance.  Examples of advanced analytic methods include decision trees, neural networks, clustering, association rules and similar methods.

By “In Hadoop” I mean the complete analytic cycle (from discovery to deployment) runs in the Hadoop environment with no data movement outside of Hadoop.

Analysts can and do code advanced analytics directly in MapReduce.  For some insight into the challenges this approach poses, review the slides from a recent presentation at Strata by Allstate and Revolution Analytics.

The business logic for advanced analytics in Hadoop is similar to the logic for in-database analytics.   External memory-based analytic software packages (such as SAS or SPSS) provide easy-to-use interfaces and rich functionality but they require the user to physically extract data from the datastore.  This physical data movement takes time and effort, and may force the analyst to work with a sample of the data or otherwise modify the analytic approach.  Moreover, once the analysis is complete, deployment back into the datastore may require a complete extract and reload of the data, custom programming or both.  The end result is an extended analytic discovery-to-deployment cycle.

Eliminating data movement radically reduces analytic cycle time.  This is true even when actual run time for model development in an external memory-based software package is faster, because the time needed for data movement and model deployment tends to be much greater than the time needed to develop and test models in the first place.  This means that advanced analytics running in Hadoop do not need to be faster than external memory-based analytics; in fact, they can run slower than external analytic software and still reduce cycle time since the front-end and back-end integration tasks are eliminated.
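The arithmetic behind this argument is simple enough to sketch. The hours below are entirely hypothetical, chosen only to show the shape of the comparison: the external tool builds the model four times faster, yet still loses on total cycle time once extraction and deployment are counted.

```python
def cycle_time(extract_hours, model_hours, deploy_hours):
    """Total discovery-to-deployment time for one analytic cycle."""
    return extract_hours + model_hours + deploy_hours

# Hypothetical figures: external tool models quickly but must move the data.
external = cycle_time(extract_hours=8.0, model_hours=1.0, deploy_hours=16.0)

# Hypothetical figures: in-Hadoop modeling is slower but moves nothing.
in_hadoop = cycle_time(extract_hours=0.0, model_hours=4.0, deploy_hours=0.0)

print(external, in_hadoop)
```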

Ideal use cases for advanced analytics in Hadoop have the following profile:

  • Source data is already in Hadoop
  • Applications that consume the analytics are also in Hadoop
  • Business need to use all available data (i.e. sampling is not acceptable)
  • Business need for minimal analytic cycle time; this is not the same as a need for minimal score latency, which can be accomplished without updating the model itself

The best use cases for advanced analytics running in Hadoop are dynamic applications where the solution itself must be refreshed constantly.  These include microclustering, where there is a business need to update the clustering scheme whenever a new entity is added to the datastore; and recommendation engines, where each new purchase by a customer can produce new recommendations.

Apache Mahout is an open source project to develop scalable machine learning libraries whose core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.   Mahout currently supports classification, clustering, association, dimension reduction, recommendation and lexical analysis use cases.   Consistent with the ideal use cases described above, the recommendation engines and clustering capabilities are the most widely used in commercial applications.

As of Release 0.7 (June 16, 2012), the following algorithms are implemented:

Classification: Logistic Regression, Bayesian, Random Forests, Online Passive Aggressive and Hidden Markov Models

Clustering: Canopy, K-Means, Fuzzy K-Means, Expectation Maximization, Mean Shift, Hierarchical, Dirichlet Process, Latent Dirichlet, Spectral, Minhash, and Top Down

Association: Parallel FP-Growth

Dimension Reduction: Singular Value Decomposition and Stochastic Singular Value Decomposition

Recommenders: Distributed Item-Based Collaborative Filtering and Collaborative Filtering with Parallel Matrix Factorization

Lexical Analysis: Collocations
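Since recommenders are among the most widely used capabilities, a small illustration of item-based collaborative filtering may help. The sketch below is plain single-machine Python, not Mahout’s implementation or API; the function names and the purchase data are assumptions invented for illustration. It computes cosine similarity between items from co-purchase counts, then scores items a user does not yet own by their similarity to items the user does own.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarities(baskets):
    """Cosine similarity between items, from binary user-to-items data."""
    counts = defaultdict(int)    # number of users holding each item
    together = defaultdict(int)  # number of users holding both items of a pair
    for items in baskets.values():
        for item in items:
            counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            together[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), n in together.items():
        s = n / sqrt(counts[a] * counts[b])
        sims[a][b] = s
        sims[b][a] = s
    return sims

def recommend(user, baskets, sims, top_n=3):
    """Rank items the user lacks by summed similarity to items the user has."""
    owned = baskets[user]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += s
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical purchase history.
baskets = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book", "desk"},
    "dan": {"lamp", "rug"},
}
sims = item_similarities(baskets)
suggestions = recommend("ann", baskets, sims)
print(suggestions)
```

Each new purchase changes the co-purchase counts and therefore the similarities, which is why recommenders fit the profile of applications that must be refreshed constantly.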

For a clever introduction to machine learning and Mahout, watch this video.

For more detail, review this presentation on Slideshare.

There are no recently released books on Mahout.  This book is two releases out of date, but provides a good introduction to the project.

Mahout is currently used for commercial applications by Amazon, Buzzlogic, Foursquare, Twitter and Yahoo, among others.   Check the Powered by Mahout page for an extended list.

Next post: Alternatives to Mahout, some partial solutions and enablers, and projects in the pipeline.

Book Review: Antifragile

There is a (possibly apocryphal) story about space scientist James Van Allen.  A reporter asked why the public should care about Van Allen belts, which are layers of particles held in place by Earth’s magnetic field.    Dr. Van Allen puffed on his pipe a few times, then responded:  “Van Allen belts?  I like them.  I make a living from them.”

One can imagine a similar conversation with Nassim Nicholas Taleb, author of The Black Swan and most recently Antifragile: Things That Gain From Disorder.

Reporter: Why should the public care about Black Swans?

Taleb: Black Swans?  I like them.  I make a living from them.

And indeed he does.   Born in Lebanon, educated at the University of Paris and the Wharton School, Mr. Taleb pursued a career in trading and arbitrage (UBS, CS First Boston, Banque Indosuez, CIBC Wood Gundy, Bankers Trust and BNP Paribas) where he developed the practice of tail risk hedging, a technique designed to insure a portfolio against rare but catastrophic events.  Later, he established his own hedge fund (Empirica Capital), then retired from active trading to pursue a writing and academic career.  Mr. Taleb now holds positions at NYU and Oxford, together with an assortment of adjunct appointments.

Antifragile is Mr. Taleb’s third book in a series on randomness.  The first, Fooled by Randomness, published in 2001, made Fortune’s 2005 list of “the 75 smartest books we know.”   The Black Swan, published in 2007, elaborated Mr. Taleb’s theory of Black Swan Events (rare and unforeseen events of enormous consequences) and how to cope with them; the book has sold three million copies to date in thirty-seven languages.   Mr. Taleb was elevated to near rock-star status on the speaker circuit in part due to his claim to have predicted the recent financial crisis, a claim that would be more credible had he published his book five years earlier.

I recommend this book; it is erudite, readable and full of interesting tidbits, such as an explanation of Homer’s frequent use of the phrase “the wine-dark sea”.   (Mr. Taleb attributes this to the absence of the word ‘blue’ in Ancient Greek.  I’m unable to verify this, but it sounds plausible.)  Erudition aside, Antifragile is an excellent sequel to The Black Swan because it enables Mr. Taleb to elaborate on how we should build institutions and businesses that benefit from unpredictable events.  Mr. Taleb contrasts the “too big to fail” model of New York banking with the “fail fast” mentality of Silicon Valley, which he cites as an example of antifragile business.

Some criticism is in order.  Mr. Taleb’s work sometimes seems to strive for a philosophical universalism that explains everything but provides few of the practical heuristics which he says are the foundation of an antifragile order.  In other words, if you really believe what Mr. Taleb says, don’t read this book.

Moreover, it’s not exactly news that there are limits to scientific rationalism; the problem, which thinkers have grappled with for centuries, is that it is difficult to build systematic knowledge outside of  a rationalist perspective.   One cannot build theology on the belief that the world is a dark and murky place where the gods can simply zap you at any time for no reason.  Mr. Taleb cites Nietzsche as an antifragile philosopher, and while Nietzsche may be widely read among adolescent lads and lassies, his work is pretty much a cul-de-sac.

One might wonder what the study of unpredictable events has to do with predictive analytics, where many of us make a living.  In Reckless Endangerment, Gretchen Morgenson documents how risk managers actually did a pretty good job identifying financial risks, but that bank leadership chose to ignore, obfuscate or shift risks to others.  Mr. Taleb’s work offers a more compelling explanation for this institutional failure than the customary “greedy robber baron” theory.  Moreover, everyone in the predictive analytics business (and every manager who relies on predictive analytics) should remember that predictive models have boundary conditions, which we ignore at our peril.

RevoScaleR Beats SAS, Hadoop for Regression on Large Dataset

Still catching up on news from the Strata conference.

This post from Revolution Analytics’ blog summarizes an excellent paper jointly presented at Strata by Allstate and Revolution Analytics.

The paper documents how a team at Allstate struggled to run predictive models with SAS on a data set of 150 million records.  The team then attempted to run the same analysis using three alternatives to SAS: a custom MapReduce program running in a Hadoop cluster, open source R, and RevoScaleR running on an LSF cluster.

Results:

— SAS PROC GENMOD on a Sun 16-core server (current state): five hours to run;

— Custom MapReduce on a 10 node/80-core Hadoop cluster: more than ten hours to run, and much more difficult to implement;

— Open source R: impossible, open source R cannot load the data set;

— RevoScaleR running on a 5-node/20-core LSF cluster: a little over five minutes to run.
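
To put the reported results on a common footing, here is a rough back-of-the-envelope sketch in Python that converts each runtime into core-hours. The core counts and runtimes come from the results listed above; rounding "more than ten hours" down to 10 hours and "a little over five minutes" down to 5 minutes is my own assumption, so treat the output as an approximation rather than a measurement.

```python
# Runtimes and core counts as reported in the results above.
# Assumption: "more than ten hours" is rounded to 10 hours and
# "a little over five minutes" to 5 minutes, so figures are approximate.
results = {
    "SAS PROC GENMOD": {"cores": 16, "hours": 5.0},
    "Custom MapReduce": {"cores": 80, "hours": 10.0},
    "RevoScaleR": {"cores": 20, "hours": 5.0 / 60.0},
}


def core_hours(entry):
    """Cores multiplied by wall-clock hours: a crude measure of compute consumed."""
    return entry["cores"] * entry["hours"]


def speedup_vs(baseline, other):
    """Ratio of the baseline's wall-clock time to the other run's wall-clock time."""
    return results[baseline]["hours"] / results[other]["hours"]


for name, entry in results.items():
    print(f"{name}: about {core_hours(entry):.1f} core-hours")

ratio = speedup_vs("SAS PROC GENMOD", "RevoScaleR")
print(f"RevoScaleR vs SAS wall-clock ratio: about {ratio:.0f}x")
```

Core-hours is a blunt instrument (it ignores memory, I/O and the differences between the machines), but it makes the size of the gap easier to see.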

In this round of testing, Allstate did not consider in-database analytics, such as dbLytix running in IBM Netezza; it would  be interesting to see results from such a test.

Some critics have pointed out that the environments aren’t equal.  It’s a fair point to raise, but expanding the SAS server to 20 cores (matching the RevoScaleR cluster) wouldn’t materially reduce SAS runtime, since PROC GENMOD is single-threaded.    SAS does have some multi-threaded PROCs and tools like HPA that can run models in parallel, so it’s possible that a slightly different use case would produce more favorable results for SAS.
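
The point about core counts can be illustrated with Amdahl's law, which bounds the speedup from adding cores by the fraction of the work that can run in parallel. The sketch below is illustrative only: treating a single-threaded procedure as having a parallel fraction of zero follows from the paragraph above, while the 90 percent figure is a hypothetical value chosen for contrast and is not drawn from the paper.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup over one core when only part of the work is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# A single-threaded procedure has no parallel portion, so extra cores add nothing.
for cores in (16, 20):
    print(f"single-threaded, {cores} cores: {amdahl_speedup(0.0, cores):.2f}x")

# Hypothetical contrast: a procedure in which 90 percent of the work parallelizes.
for cores in (16, 20):
    print(f"90% parallel, {cores} cores: {amdahl_speedup(0.9, cores):.2f}x")
```

With a parallel fraction of zero the function returns 1.0 for any number of cores, which is the arithmetic behind the claim that a bigger server would not materially help a single-threaded procedure.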

It’s theoretically possible that an even larger Hadoop environment would run the problem faster, but one must balance that consideration with the time, effort and cost to achieve the desired results.  One point that the paper does not address is the time needed to extract the data from Hadoop and move it to the server, a key consideration for a production architecture.  While predictive modeling in Hadoop is clearly in its infancy, this architecture will have some serious advantages for large data sets that are already resident in Hadoop.

One other key point not considered in this test is the question of scoring — once the predictive models are constructed, how will Allstate put them into production?

— Since SAS’ PROC GENMOD can only export a model to SAS, Allstate would either have to run all production scoring in SAS or manually write a custom scoring procedure;

— Hadoop would certainly require a custom MapReduce procedure;

— With RevoScaleR, Allstate can push the scoring into IBM Netezza.

This testing clearly shows that RevoScaleR is superior to open source R, and for this particular use case clearly outperforms SAS.  It also demonstrates that predictive analytics running in Hadoop is an idea whose time has not yet arrived.