The Year in Machine Learning (Part One)

This is the first installment in a four-part review of 2016 in machine learning and deep learning.

In the first post, we look back at ML/DL news organized in five high-level topic areas:

  • Concerns about bias
  • Interpretable models
  • Deep learning accelerates
  • Supercomputing goes mainstream
  • Cloud platforms build ML/DL stacks

In Part Two, we cover developments in each of the leading open source machine learning and deep learning projects.

Parts Three and Four will review the machine learning and deep learning moves of commercial software vendors.

Concerns About Bias

As organizations expand the use of machine learning for profiling and automated decisions, there is growing concern about the potential for bias. In 2016, reports in the media documented racial bias in predictive models used for criminal sentencing, discriminatory pricing in automated auto insurance quotes, an image classifier that learned “whiteness” as an attribute of beauty, and hidden stereotypes in Google’s word2vec algorithm.

Two bestsellers were published in 2016 that address the issue. The first, Cathy O’Neil’s Weapons of Math Destruction, is a candidate for the National Book Award. In a review for The Wall Street Journal, Jo Craven McGinty summarizes O’Neil’s arguments as “algorithms aren’t biased, but the people who build them may be.”

A second book, Virtual Competition, written by Ariel Ezrachi and Maurice Stucke, focuses on the ways that machine learning and algorithmic decisions can promote price discrimination and collusion. Burton Malkiel notes in his review that the work “displays a deep understanding of the internet world and is outstandingly researched. The polymath authors illustrate their arguments with relevant case law as well as references to studies in economics and behavioral psychology.”

Most working data scientists are deeply concerned about bias in the work they do. Bias, after all, is a form of error, and a biased algorithm is an inaccurate algorithm. The organizations that employ data scientists, however, may not commit the resources needed for testing and validation, which is how we detect and correct bias. Moreover, people in business suits often exaggerate the accuracy and precision of predictive models or promote their use for inappropriate applications.

In Europe, GDPR creates an incentive for organizations that use machine learning to take the potential for bias more seriously. We’ll be hearing more about GDPR in 2017.

Interpretable Models

Speaking of GDPR, beginning in 2018, organizations that use machine learning to drive automated decisions must be prepared to explain those decisions to the affected subjects and to regulators. As a result, in 2016 we saw considerable interest in efforts to develop interpretable machine learning algorithms.

— The MIT Computer Science and Artificial Intelligence Laboratory announced progress in developing neural networks that deliver explanations for their predictions.

— At the International Joint Conference on Artificial Intelligence, David Gunning summarized work to date on explainability.

— MIT selected machine learning startup Rulex as a finalist in its Innovation Showcase. Rulex implements a technique called Switching Neural Networks to learn interpretable rule sets for classification and regression.

— In O’Reilly Radar, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin explained Local Interpretable Model-Agnostic Explanations (LIME), a technique that explains the predictions of any machine learning classifier.

The Wall Street Journal reported on an effort by Capital One to develop machine learning techniques that account for the reasoning behind their decisions.

In Nautilus, Aaron M. Bornstein asked: Is artificial intelligence permanently inscrutable?  There are several issues, including a lack of clarity about what “interpretability” means.

It is important to draw a distinction between “interpretability by inspection” versus “functional” interpretability. We do not evaluate an automobile by disassembling its engine and examining the parts; we get behind the wheel and take it for a drive. At some point, we’re all going to have to get behind the idea that you evaluate machine learning models by how they behave and not by examining their parts.

Deep Learning Accelerates

In a September Fortune article, Roger Parloff explains why deep learning is suddenly changing your life. Neural networks and deep learning are not new techniques; we see practical applications emerge now for three reasons:

— Computing power is cheap and getting cheaper; see the discussion below on supercomputing.

— Deep learning works well in “cognitive” applications, such as image classification, speech recognition, and language translation.

— Researchers are finding new ways to design and train deep learning models.

In 2016, the field of DL-driven cognitive applications reached new milestones:

— A Microsoft team developed a system that recognizes conversational speech as well as humans do. The team used convolutional and long short-term memory (LSTM) neural networks built with Microsoft Cognitive Toolkit (CNTK).

— On the Google Research Blog, a Google Brain team announced the launch of the Google Neural Machine Translation System, a system based on deep learning that is currently used for 18 million translations per day.

— In TechCrunch, Ken Weiner reported on advances in DL-driven image recognition and how they will transform business.

Venture capitalists aggressively funded startups that leverage deep learning in applications, especially those that can position themselves in the market for cognitive solutions:

Affectiva, which uses deep learning to read facial expressions in digital video, closed on a $14 million “D” round led by Fenox Venture Capital.

Clarifai, a startup that offers a DL-driven image and video recognition service, landed a $30 million Series B round led by Menlo Ventures.

Zebra Medical Vision, an Israeli startup, uses DL to examine medical images and diagnose diseases of the bones, brain, cardiovascular system, liver, and lungs. Zebra disclosed a $12 million venture round led by Intermountain Health.

There is an emerging ecosystem of startups that are building businesses on deep learning. Here are six examples:

Deep Genomics, based in Toronto, uses deep learning to understand diseases, disease mutations and genetic therapies.

— Cybersecurity startup Deep Instinct uses deep learning to predict, prevent, and detect threats to enterprise computing systems.

Ditto Labs uses deep learning to identify brands and logos in images posted to social media.

Enlitic offers DL-based patient triage, disease screening, and clinical support to make medical professionals more productive.

— Gridspace provides conversational speech recognition systems based on deep learning.

Indico offers DL-driven tools for text and image analysis in social media.

And, in a sign that commercial development of deep learning isn’t all hype and bubbles, NLP startup Idibon ran out of money and shut down. We can expect further consolidation in the DL tools market as major vendors with deep pockets ramp up their programs. The greatest opportunity for new entrants will be in specialized applications, where the founders can deliver domain expertise and packaged solutions to well-defined problems.

Supercomputing Goes Mainstream

To make deep learning practical, you need a lot of computing horsepower. In 2016, hardware vendors introduced powerful new platforms that are purpose-built for machine learning and deep learning.

While GPUs are currently in the lead, there is a serious debate under way about the relative merits of GPUs and FPGAs for deep learning. Anand Joshi explains the FPGA challenge. In The Next Platform, Nicole Hemsoth describes the potential of a hybrid approach that leverages both types of accelerators. During the year, Microsoft announced plans to use Altera FPGAs, and Baidu said it intends to standardize on Xilinx FPGAs.

NVIDIA Launches the DGX-1

NVIDIA had a monster 2016, tripling its market value in the course of the year. The company released the DGX-1, a deep learning supercomputer. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

NVIDIA also revealed a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow, and Theano.

Tech media salivated:

MIT Technology Review interviewed NVIDIA CEO Jen-Hsun Huang, who is now Wall Street’s favorite tech celebrity.

Separately, Karl Freund reports on NVIDIA’s announcements at the SC16 supercomputing show.

Early users of the DGX-1 include BenevolentAI, PartnersHealthCare, Argonne and Oak Ridge Labs, New York University, Stanford University, the University of Toronto, SAP, Fidelity Labs, Baidu, and the Swiss National Supercomputing Centre. Nicole Hemsoth explains how NVIDIA supports cancer research with its deep learning supercomputers.

Cray Releases the Urika-GX

Cray launched the Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high-performance network interconnect. Cray ships 16, 32 or 48 nodes in a rack in the third quarter, larger configurations later in the year.

Intel Responds

The headline on the Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. Intel acquired Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reported a price tag of $408 million. The customary tech media unicorn story storm ensues.

Intel said it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Paul Alcorn offers additional detail on Intel’s new Xeon CPU and Deep Learning Inference Accelerator. In Fortune, Aaron Pressman argues that Intel’s strategy for machine learning and AI is smart, but lags NVIDIA. Nicole Hemsoth describes Intel’s approach as “war on GPUs.”

Separately, Intel acquired Movidius, the folks who put a deep learning chip on a memory stick.

Cloud Platforms Build ML/DL Stacks

Machine learning use cases are inherently well-suited to cloud platforms. Workloads are ad hoc and project oriented; model training requires huge bursts of computing power for a short period. Inference workloads are a different matter, which is one of many reasons one should always distinguish between training and inference when choosing platforms.

Amazon Web Services

After a head fake earlier in the year when it publishing DSSTNE, a deep learning project that nobody wants, AWS announces that it will standardize on MXNet for deep learning. Separately, AWS launched three new machine learning managed services:

Rekognition, for image recognition

Polly, for text to speech

Lex, a conversational chatbot development platform

In 2014, AWS was first to market among the cloud platforms with GPU-accelerated computing services. In 2016, AWS added P2 instances with up to 16 Tesla K8- GPUs.

Microsoft Azure

Released in 2015 as CNTK, Microsoft rebranded its deep learning framework as Microsoft Cognitive Toolkit and released Version 2.0, with a new Python API and many other enhancements. The company also launched 22 cognitive APIs in Azure for vision, speech, language, knowledge, and search. Separately, MSFT released its managed service for Spark in Azure HDInsight and continued to enhance Azure Machine Learning.

MSFT also announced the Azure N-Series compute instances powered by NVIDIA GPUs for general availability in December.

Azure is one part of MSFT’s overall strategy in advanced analytics, which I’ll cover in Part Three of this review.

Google Cloud

In February, Google released TensorFlow Serving, an open source inference engine that handles model deployment after training and manages their lifetime.  On the Google Research Blog, Noah Fiedel explained.

Later in the Spring, Google announced that it was building its own deep learning chips, or Tensor Processing Units (TPUs). In Forbes, HPC expert Karl Freund dissected Google’s announcement. Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs.

Google launched a dedicated team in October to drive Google Cloud Machine Learning, and announced a slew of enhancements to its services:

— Google Cloud Jobs API provides businesses with capabilities to find, match and recommend jobs to candidates. Currently available in a limited alpha.

Cloud Vision API now runs on Google’s custom Tensor Processing Units; prices reduced by 80%.

Cloud Translation API will be available in two editions, Standard and Premium.

Cloud Natural Language API graduates to general availability.

In 2017, GPU-accelerated instances will be available for the Google Compute Engine and Google Cloud Machine Learning. Details here.

IBM Cloud

In 2016, IBM contributed heavily to the growing volume of fake news.

At the Spark Summit in June, IBM announced a service called the IBM Data Science Experience to great fanfare. Experienced observers found the announcement puzzling; the press release described a managed service for Apache Spark with a Jupyter IDE, but IBM already had a managed service for Apache Spark with a Jupyter IDE.

In November, IBM quietly released the service without a press release, which is understandable since there was nothing to crow about. Sure enough, it’s a Spark service with a Jupyter IDE, but also includes an R service with RStudio, some astroturf “community” documents and “curated” data sources that are available for free from a hundred different places. Big Whoop.

In IBM’s other big machine learning move, the company rebranded an existing SPSS service as Watson Machine Learning. Analysts fell all over themselves raving about the new service, apparently without actually logging in and inspecting it.

screen-shot-2016-10-30-at-11-05-33-am

Of course, IBM says that it has big plans to enhance the service. It’s nice that IBM has plans. We should all aspire to bigger and better things, but keep in mind that while IBM is very good at rebranding stuff other people built, it has never in its history developed a commercially successful software product for advanced analytics.

IBM Cloud is part of a broader strategy for IBM, so I’ll have more to say about the company in Part Three of this review.

How to Steal a Predictive Model

In the Proceedings of the 25th USENIX Security Symposium, Florian Tramer et. al. describe how to “steal” machine learning models via Prediction APIs. This finding won’t surprise anyone in the business, but Andy Greenberg at Wired and Thomas Claburn at The Register express their amazement.

Here’s how you “steal” a model:

— The prediction API tells you what variables the model uses; the packaging for a prediction API will say something like “submit X1 and X2, we will return a prediction for Y”; so you know that X1 and X2 are the variables in the model. The developer can try to fool you by directing you to submit a hundred variables even though it only needs two, but that’s not likely; most developers make the prediction API as parsimonious as possible.

— Use an experimental design to create test records with a range of values for each variable in the model. You won’t need many records; the number depends on the number of variables in the model and the degree of granularity you want.

— Now, ping the API with each test record and collect the results.

— With the data you just collected, you can estimate a model that approximates the model behind the prediction API.

The authors of the USENIX paper tested this approach with BigML and Amazon Machine Learning, succeeding in both cases. BigML objects; Amazon sleeps.

Legally, it may not be stealing. Model coefficients are intellectual property. If someone hacks into your model repository and steals the model file, or bribes one of your data scientists into providing the coefficients, that is theft. But while IP owners can assert a right over their actual code, it is much harder to assert a right to an application’s observable behavior. Reverse-engineering is legal in the U.S. and the European Union so long as the party that performs the work has legal possession of the relevant artifacts. If someone lawfully purchases predictions from your prediction API, they can reverse-engineer your model.

Restrictive licenses offer limited protection. Intellectual property owners can assert a claim against reverse-engineering if the predictions are under an end-user license that prohibits the practice. The fine print will please your Legal department, but is virtually impossible to enforce. Predictions, unlike other forms of intellectual property, aren’t watermarked; they’re just numbers.

Pricing plays a role. While it may be technically feasible to reverse-engineer a predictive model, it may be prohibitively expensive to do so. Models that predict behavior with financial implications, such as consumer credit risk models, are expensive. Arguably, the best way to prevent reverse-engineering is to charge a non-cancellable annual subscription fee for access to the API rather than selling predictions by the record. In any event, the risk of reverse-engineering should be a consideration in pricing.

Encryption may be necessary. If you want to do business with trusted parties over an open API, a hashing algorithm can scramble the prediction in a way that makes reverse-engineering impossible. Of course, the customer must be able to decrypt the prediction at their end of the transaction, with a key transmitted separately or from a common random seed.

Access control is key. The key point of the USENIX authors is that if your prediction API is available “in the wild,” you might as well call it an open source model because reverse-engineering is easy to do. Of course, if you are in the business of selling predictions, you already have some form of access control so you can meter usage and charge an account. Bad actors, however, have credit cards; so, if you are concerned about your predictive model’s IP, you’re going to have to establish tighter control over access to the prediction API.

Databricks Releases Spark Survey

In a press release and blog post, Databricks announces results from its 2016 Spark Survey. Databricks surveyed 1,615 Spark users and prospective users in July, 2016 Respondents include data engineers, data scientists, architects, technical managers, and academics.

Key findings from the survey:

  • Spark SQL remains the most widely used component.
    • 88% use Spark SQL
    • 71% use Spark Streaming
    • 71% use MLlib (machine learning)
  • Respondents value Spark’s performance and advanced analytics.
    • 91% rate performance very important
    • 82% rate advanced analytics very important
    • 76% rate ease of programming very important
    • 69% rate ease of deployment very important
    • 51% rate real-time streaming very important
  • Production use has increased markedly since 2015.
    • 40% use SQL in production, up from 24%
    • 38% use DataFrames in production, up from 15%
    • 22% use streaming in production, up from 14%
    • 18% use machine learning, up from 13%
  • So has usage in the public cloud.
    • 61% said they use Spark in the public cloud, up from 51% in 2015.
  • Usage of Spark deployed on-premises has declined.
    • 42% use Spark in a standalone deployment, down from 48%
    • 36% use Spark under YARN, down from 40%
    • 7% use Spark on Apache Mesos, down from 11%
  • The Scala API remains the most popular, followed closely by the Python API.
    • 65% use Scala, down from 71% in 2015
    • 62% use Python, up from 58%
    • 44% use SQL, up from 36%
    • 29% use Java, down from 31%
    • 20% use R, up from 18%
  • While Linux remains the most popular OS, Mac and Windows usage is growing rapidly.
    • 74% use Linux/Unix, down from 75% in 2015
    • 32% use Windows, up from 23%
    • 22% use Mac OSX, up from 14%

The report also includes statistics about the Spark community at large.

— Databricks reports growth in the contributor base from 600 in 2015 to 1,000 in 2016, a figure that does not seem to square with the statistics reported in OpenHub.

— Spark Meetup membership grew from 66,000 in 2015 to 225,000 in 2016.

— Spark Summit attendance grew from 3,912 to 5,100.

For a copy of the report and an infographic, go here.

Disruption: It’s All About the Business Model

This post is an excerpt adapted from my book, Disruptive Analytics, available soon from Apress and Amazon. (Note: under my contract with Apress I am legally obligated to link to their site, but it’s not yet possible to order the book there. Use the Amazon link if you want the book.)

The analytics business is booming. Technology consultant IDC estimates total spending for analytic services, software and hardware exceeded $120 billion in 2015; through 2019, IDC forecasts that spending will increase to $187 billion, an 11% compound annual growth rate.

Powerful forces are at work in the economy today:

  • Digital transformation of the economy and rapidly declining storage costs combine to create a flood of data.
  • The number of data sources is exploding. Data sources are everywhere: on-premises, in the cloud, in consumers’ pockets, in vehicles, in RFID chips, and so forth.
  • The “long march” of Moore’s Law: cheap computing power makes machine learning and deep learning techniques practical.

So, if analytics is such a hot field, why are the industry leaders struggling?

  • Oracle’s cloud revenue growth fails to offset declining software and hardware sales.
  • SAP’s cloud revenue grows, but total software revenue is flat.
  • IBM reports seventeen straight quarters of declining revenue. Mass layoffs
  • Microsoft underperforms analysts’ expectations despite 120% growth in Azure cloud revenue.
  • Predictive analytics leader SAS reports five years of low single-digit revenue growth; Executive Vice President and Chief Marketing Officer departs.
  • Data warehousing leader Teradata shuffles its leadership team after four years of declining product revenue.

Product quality is not the problem. Each company offers products that industry analysts rate highly:

  • Forrester and Gartner recognize IBM, SAS, SAP and Oracle as leaders in data quality tools.
  • Gartner rates Oracle, SAP, IBM, Microsoft and Teradata as leaders in data warehousing.
  • Forrester rates Microsoft, SAP, SAS, and Oracle as leaders in agile business intelligence.
  • Gartner recognizes SAS and IBM as leaders in Advanced Analytics.

The answer, in a word, is disruption. Clayton Christensen of the Harvard Business School outlined the theory of disruptive innovation in 1997. Summarizing the argument briefly:

  • Industries consist of value networks, collections of suppliers, channels, and buyers linked by relationships.
  • Innovations disrupt industries when they create a new value network.
  • Not all innovations are disruptive. Many are introduced by market leaders to sustain a competitive position.
  • Disruptive innovations tend to be introduced by outsiders.
  • Purely technological innovation is not disruptive; what matters is the business model enabled by the new technology.

For a more detailed exposition of the theory, read Christensen’s book.

Christensen identified two forms of disruption. Low-end disruption occurs when industry leaders enhance products faster than customers can assimilate the enhancements; the disruptor enters the market with a “good enough” product and a better value proposition. The disruptor’s innovation makes it possible to serve customers at a lower cost than the industry leaders can deliver.

New market disruption takes place when the disruptor innovates in ways enabling it to serve customers that are not served by the industry leaders.

Technology alone does not disrupt industries; incumbents can and do innovate. New business models enabled by new technology are the cutting edge of disruption. Frequently, incumbents cannot respond effectively to new business models; this is partly due to “blinders” caused by changing value networks, and partly out of fear of cannibalizing existing business arrangements. Two business models, in particular, are disrupting the business analytics world today:

  • Open source software business models offer an increasingly attractive alternative to commercial software licensing. The Hadoop ecosystem displaces conventional data warehousing; R and Python displace commercial software for advanced analytics.
  • The elastic business model made possible by cloud computing undercuts conventional software licensing. When customers pay only for what they use, they pay a lot less.

Disruption does not mean that leading companies like Oracle, IBM and SAS will go out of business. Blockbuster may be the poster child for disrupted businesses, but most cases are less dire; for the business analytics leaders, disruption means they will struggle to grow. Slow growth is less benign than it sounds. As McKinsey notes, the rule today is “Grow or Go”: companies that cannot define a credible growth strategy will be acquired by other companies or by private equity.

The alternative to revenue growth is increasing profitability. But when revenue is flat or declining, that usually means job cuts.

job-cuts
Disruption looks like this.

Consider what happened to Teradata. Late in 2012, the company started missing sales targets; in early 2013, it stunned investors by reporting an absolute decline in sales. Management offered excuses; Wall Street punished the stock, driving it down by half in the face of a bull market for tech stocks.

Teradata’s leadership continued to miss sales and earnings targets; Wall Street drove the stock price down to a fraction of its 2012 peak. While it is tempting to blame the problem on poor leadership, Teradata’s persistent failure to accurately forecast its sales and earnings is a clear sign that its leadership no longer understood the value networks in which they operated. The world had changed; the value networks created in Teradata’s rise to leadership no longer existed; the mental models managers used to understand the market no longer worked.

There are two distinct types of disruption. The first is disruptive innovation within the analytics value chain. Here are two recent examples:

Hadoop. The Hadoop ecosystem disrupts the data warehousing industry from below. Hadoop does not do everything a relational database can do, but it does just enough to offer an attractive value proposition for the right use cases. When first introduced, Hadoop’s capabilities were very limited compared to data warehouse appliances. But Hadoop’s flexibility and low cost were highly attractive for applications that did not need the performance and features of a data warehouse appliance. While established vendors struggle to maintain flat and declining revenue, companies that offer solutions built on Hadoop grow at double-digit rates.

Tableau. Tableau virtually created the market for agile, self-service discovery. The charting and visualization features in Tableau are available in mainstream business intelligence tools. But while business intelligence vendors target the IT organization and continually add complexity to their product, Tableau targets the end user with a simple, easy to use and versatile tool. As a result, Tableau has increased its revenue tenfold in five years, leapfrogging over many other BI vendors.

Disruption within the analytics value chain is pertinent for readers who plan to invest in analytics technology for their organization. Technologies at risk of disruption are risky investments; they may have abbreviated useful lives, and their suppliers may suffer from business disruption. Taking a “wait-and-see” attitude towards disrupted technologies makes good sense, if only because prices will likely decline in the future.

The second type is disruption by innovations in analytics. Examples of disruption by analytics are harder to find, but they do exist:

Credit Scoring. General-purpose credit scoring introduced by Fair, Isaac and Co. in 1987 virtually created a national market in credit cards.  Previously, banks issued credit cards to their local customers, with whom they had an established relationship. Uniform credit scoring enabled a few large issuers to identify creditworthy clients in the general population, without a prior relationship.

Algorithmic Trading. When the U.S. Securities and Exchange Commission authorized electronic trading in regulated securities in 1998, market participants quickly moved to develop algorithms that could arbitrage between markets, arbitrage between indexes and the underlying stocks and exploit other short-term opportunities. Traders that most effectively deployed machine learning for electronic trading grew at the expense of other traders.

For startups and analytics practitioners, disruption by analytics is essential. Startups must disrupt their industries if they want to succeed. Using analytics to differentiate a product is a way to create a disruptive business model or to create new markets.

There is a common theme across the four examples: the business model enabled by the technology and not the technology itself drives the disruption. Hadoop and Tableau do less than the legacy products they compete against; what they do, however, is sufficient for a class of use cases, for which they provide a better value proposition. Credit scoring and algorithmic trading created fundamentally new ways to lend and invest; while these applications attracted technological innovations as they expanded, it was the new business models they created that disrupted the lending and investing industries.

To illustrate the importance of the business model, consider the case of columnar serialization, a significant innovation in data warehousing that did not disrupt the industry. In 2005, Vertica introduced a commercial columnar database, a technology that is well-suited to high-performance analytics (as we explain in Chapter Two of Disruptive Analytics). Vertica successfully built a customer base, but did not create a unique business model; by 2010 the leading data warehouse vendors had introduced columnar serialization into their products. HP acquired Vertica in 2011 for about $250 million, a price well below the $1.7 billion IBM paid for Netezza, a competing data warehouse appliance vendor.

Here are some takeaways for the reader to consider.

First, if you want to invest in new business analytics technology, ask yourself:

  • Are we paying for what we use, or for what we might use?
  • What particular value do commercial software options offer over open source alternatives?

Second, if you want to use analytics to create a disruptive innovation, ask yourself:

  • What new business model does this support?
  • Can we disrupt incumbents from below with a better value proposition?
  • Can we reach new markets and new customers who are underserved by existing value networks?

There is one additional takeaway: nobody ever disrupted anything by managing data. Keep that in mind the next time a data warehousing vendor tries to tell you that their Big Box is a “strategic” investment. We’ll explore that in another excerpt from the book.

Disruptive Analytics

This is an introduction to my book, Disruptive Analytics, available now from Amazon and  Apress.

DA Cover

Disruption: in business, a radical change in an industry or business strategy, especially involving the introduction of a new product or service that creates a new market.

From its birth in 1979, Teradata led the field in data warehousing. The company built a reputation for technical acumen, serving customers like Wal-Mart and Citibank; analysts and implementers alike rated the company’s massively parallel databases “best in class.”  After a 2007 spinoff from NCR, the company grew by double digits.

On August 6, 2012, Teradata released its earnings report for the second quarter.  Results excelled; revenue was up 18% and earnings per share (EPS) up 28%.  Teradata stock traded at $80, five times its value four years earlier.

“We are increasing our guidance for constant currency revenue growth and EPS for 2012,” wrote CEO Mike Koehler.

In retrospect, that moment was Teradata’s peak. Over the next three and a half years, the company lost 75% of its market value, as it repeatedly missed revenue and earnings targets. In 2015, Koehler announced a restructuring and sale of assets; several top executives departed. Finally, after a brutal first quarter earnings report, Koehler himself stepped down in May 2016.

Management blamed many factors for the sluggish sales: long sales cycles, a sluggish economy, and unfavorable currency movement.  But worldwide spending on business analytics increased during this period, and some vendors reported double-digit revenue growth.

One can blame Teradata’s struggles on poor leadership, but the truth isn’t that simple. The company’s growth problems in the last few years are not unique: in the same period, Oracle and IBM suffered declining revenue; Microsoft and SAP failed to grow consistently, disappointing investors; and SAS had to walk back embarrassing projections of double-digit growth, recording low single-digit gains.

In short, while businesses continue to invest in analytics, they aren’t buying what the industry leaders are selling.

Meanwhile, a steady stream of innovation creates new value networks in the business analytics marketplace:

Open Source Analytics. With substantial gains in the last several years, open source software makes deep inroads in the analytics community. Surveys show that working data scientists prefer to use open source R and Python more than any brand of commercial software. Technology leaders like Oracle, IBM, and Microsoft rush to get on the bandwagon.

Hadoop and its Ecosystem. As Hadoop matures, it competes successfully with data warehouse appliances, even displacing them. Technology consultant Gartner estimates that 42% of all enterprises now use Hadoop. A few years ago, data warehousing vendors laughed at Hadoop; they aren’t laughing today.

In-Memory Analytics. As the cost of memory declines, fast and scalable high-performance analytics are within reach for any organization. Adoption of open source Apache Spark increases exponentially. With more than a thousand contributors, Spark is the most active open source project in Big Data.

Streaming Analytics. Organizations face a growing volume of data in motion, driven in part by the Internet of Things (IoT). Today, there are no less than six open source projects for streaming analytics in the Apache ecosystem. In-memory databases position themselves as streaming engines for hybrid transactional/analytical processing (HTAP).

Analytics in the Cloud. When Amazon Web Services introduced its Redshift columnar database in 2012, it lacked many of the features available in competing data warehouses. For many businesses, however, Amazon offered a compelling value proposition: “good enough” functionality, at a fraction of the cost of a Teradata warehouse. The leading cloud services all report double-digit revenue growth; Gartner estimates that 44% of all businesses use the cloud.

Deep Learning. Cheap high-performance computing power makes deep learning practical. Nvidia releases its DGX-1 chip for deep learning, with the power of 250 servers; Cray announces its Urika-GX appliance with up to 1,728 cores and 35 terabytes of solid-state memory. Meanwhile, Google releases its TensorFlow framework to open source and declares that it uses deep learning in “hundreds” of applications.

Self-Service Analytics. With an easy-to-learn user interface and robust connectors to data sources, Tableau disrupts the business intelligence software industry and grows its revenues tenfold.

We do not hype Big Data in this book; petabytes of data are worthless unless they answer a business question. However, the tsunami of data produced by the digital economy is a fact of life that managers and analysts must address. Whether you manage a multinational or drive a truck, your business generates more data than ever; you will either use it or discard it, but one way or the other, you must decide what to do with it.

In a disrupted business analytics market, managers must focus ruthlessly on needs for insight, then build systems and processes that satisfy those needs. Understanding the innovations described in these chapters is a step towards that end, but the focus must remain on the demand for insight and the value chain that delivers it.

Innovations do not spring fully formed from the mind of an inventor; they are the result of a long process of tinkering. Many of the most significant innovations we describe in this book are more than fifty years old; they emerge today for various reasons, such as the long-run decline of computing costs. We present a historical perspective at several points in this book so the reader can distinguish between that which is new and that which is merely repackaged and rebranded.

In the middle chapters of this book, we present a survey of key innovations in business analytics. These chapters include detailed information about available software products and open source projects. In general, we do not cover offerings from industry leaders, under the premise that these companies have ample marketing budgets to build awareness of their products.

We close the book with a handbook for managers: specific strategies to profit from disruptive innovation. Some of these strategies may seem radical; if this disturbs you, put this book down – it’s not for you. But if you are ready to embrace disruptive innovation, and profit by it, read on.

Databricks’ 2016 Spark Survey

Databricks is running a short survey to understand the needs of Apache Spark users. If you haven’t taken the survey yet, do so today.

For results of the 2015 survey, look here. Last year’s survey produced a number of interesting findings; here’s what I wrote back in September when Databricks released its report:

=====

Databricks released results of its 2015 Spark Survey, available here (registration required); an infographic is here.  The report is an informative mashup of survey findings, plus other information, such as the headcount from Spark Summits.  (Spoiler: it’s increasing.)  On the Databricks blog, Matei Zaharia, Patrick Wendell and Denny Lee summarize key points.  See additional analysis herehereherehereherehere, here and here.

Analysts, loving controversy, note that Spark users slightly prefer standalone configurations over Spark-on-YARN (e.g. co-located in Hadoop).  Andrew Oliver, for example, commenting on Cloudera’s One Platform  announcement earlier this month, argues that Databricks is actively marketing against Spark-on-YARN, citing results of this survey.  But if you compare these results to the Typesafe/Databricks Spark survey published in January, you will note that respondents to the 2015 survey are slightly less likely to run Spark in a standalone cluster this year compared to last year.

Other analysts, like Tony Baer, note that 11% of respondents run Spark on Mesos, hinting darkly that since the AMPLab team developed both Spark and Mesos, there must be some sort of conspiracy against Hadoop.  But in the earlier survey, 26% of respondents said they run on Mesos, so if someone is organizing a secret cabal to compete against Spark-on-YARN, it’s not working out too well.

The biggest news in the survey is the rapid growth of users who use the Python API, from 22% to 58%, and the corresponding decline among those who use Scala or Java.  The SQL and R interfaces are too new to compare to the previous survey, but it’s worth noting that in 2015 more respondents use the SQL interface than the Java interface.

Teradata Reports Loss, Fires CEO

After three years of strategic floundering, Teradata now understands that it has a leadership problem, and announces a CEO change. Victor Lund, who heads the audit committee on the board, takes the helm.

Speaking to investors, Lund noted in his opening remarks that he is too old to be the permanent CEO, denied that he is merely a caretaker, and said he plans to find a new CEO in 90 days.

Asked about strategic missteps, Lund pointed to the Aprimo purchase; a safe comment given previous announcements. Beyond that he said that Teradata must change its culture and move faster.

In other words, he has no idea what to do. But he’s gung-ho.

Teradata also reports a net loss of $46 million, and a 20% decline in product revenue. Product revenue drives consulting and maintenance revenue, and a decline that steep implies a failing business model. Consulting revenue was up 4%: maintenance revenue up 2%. Selling, general and administrative expenses, down 5%; research and development down by 10%.

Screen Shot 2016-05-05 at 7.58.03 AM

CFO Steve Scheppmann announced a definitive agreement to sell the Marketing software business for $90 million, “below what we expected.” Teradata paid $525 million for Aprimo in 2011.

Steve, you’re supposed to buy low and sell high.

Demonstrating his keen insight into the data warehousing business, Scheppmann noted that buyers “are moving away from capex.” He noted that Q1 sales in the Americas were down because Teradata shuffled its sellers. (Asked about this in the previous investor call, Mike Keough denied that shuffling the sellers would impair sales.) Scheppmann also noted that some deals slipped to Q2, and expects some Q2 deals to slip into later quarters.

Lots of slippage going on.

Oliver Ratzesberger, president of Teradata Labs, painted a picture of Teradata everywhere: on-premises, in private cloud, and in public cloud. The fly in the ointment is that Teradata in AWS Marketplace is a single node version; Teradata without an MPP architecture is like a muscle car with a tiny engine. He noted that Teradata is accelerating plans to put an MPP version into the cloud, and now expects to do so by the end of this year, only five years after Oracle.

Ratzesberger also mentioned rebranding the Teradata architecture as IntelliFlex, and consulting-led solutions. He did not mention Aster. In fact, nobody mentioned Aster. Presumably, that old dog won’t hunt much longer.

Asked a slightly technical question, Ratzesberger rambled incoherently.

Lund, Scheppmann and Ratzesberger all spoke of the central role of consulting in leading Teradata out of the woods. If Teradata is serious about that, they’re going to have to go full open source, like Pivotal did last year. You can’t easily mix a strategic consulting business with a software business. Just ask IBM.

Big Analytics Roundup (March 14, 2016)

HPE wins the internet this week by announcing the re-re-release of Haven, this time on Azure.  The other big story this week: Flink announces Release 1.0.

Third Time’s a Charm

Hewlett Packard Enterprise (HPE) announces Haven on Demand on Microsoft Azure; PR firestorm ensues.  Haven  is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, originally branded as HAVEn and announced by HP in June, 2013.  Since then, the software hasn’t exactly gone viral; Haven failed to make KDnuggets’ list of the top 50 machine learning APIs last December, a list that includes the likes of Ersatz, Hutoma and Skyttle.

One possible reason for the lack of virality: although several analysts described Haven as “open source”, HP did not release the Haven source code, and did not offer the software under an open source license.

Other than those two things, it’s open source.

In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform.

So this latest announcement is a re-re-release of the software. On paper, the library looks like it has some valuable capabilities in text, images video and audio analytics.  The interface and documentation look a bit rough, but, after all, this is a first third release.

Jim’s Latest Musings

Angus Loten of the WSJ’s CIO Journal interviews SAS CEO Jim Goodnight, who increasingly sounds like your great-uncle at Thanksgiving dinner, the one who complains about “these kids today.”  Goodnight compares cloud computing to mainframe time sharing.  That’s ironic, because although SAS runs in AWS, it does not offer elastic pricing, the one thing that modern cloud computing shares with timesharing.

Goodnight also pooh-poohs IoT, noting that “we don’t have any major IoT customers, and I haven’t seen a good example of IoT yet.”  SAS’ Product Manager for IoT could not be reached for comment.

Meanwhile, SAS held its annual analyst conference at a posh resort in Steamboat Springs, Colorado; in his report for Ventana Research, David Menninger gushes.

Herbalife Messes Up, Blames Data Scientists

Herbalife discloses errors reporting non-financial information, blames “database scripting errors.” The LA Times reports; Kaiser Fung comments.

Explainers

— Several items from the morning paper this week:

  • Adrian Colyer explains CryptoNets, a combination of Deep Learning and homohorphic encryption.  By encrypting your data before you load it into the cloud, you make it useless to a hacker.
  • Adrian explains Neural Turing Machines.
  • Adrian explains Memory Networks.
  • Citing a paper published by Google last year, Adrian explains why using personal knowledge questions for account recovery is a really bad thing.

— Data Artisans’ Robert Metzger explains Apache Flink.

— In a video, Eric Kramer explains how to leverage patient data with Dataiku Data Science Studio.

Perspectives

— In InfoWorld, Serdar Yegulalp examines Flink 1.0 and swallows whole the argument that Flink’s “pure” streaming is inherently superior to Spark’s microbatching.

— On the MapR blog, Jim Scott offers a more balanced view of Flink, noting that streaming benchmarks are irrelevant unless you control for processing semantics and fault tolerance.  Scott is excited about Flink ease of use and CEP API.

— John Leonard interviews Vincent de Lagabbe, CTO of bitcoin tracker Kaiko, who argues that Hadoop is unnecessary if you have less than a petabyte of data.  Lagabbe prefers Datastax Enterprise.

— Also in InfoWorld, Martin Heller reviews Azure Machine Learning, finds it too hard for novices.  I disagree.  I used AML in a classroom lab, and students were up and running in minutes.

Open Source Announcements

— Flink announces Release 1.0.  DataArtisans celebrates.

Teradata Watch

CEO Mike Koehler demonstrates confidence in TDC’s future by selling 11,331 shares.

Commercial Announcements

— Objectivity announces that Databricks has certified ThingSpan, a graph analytics platform, to work with Spark and HDFS.

— Databricks announces that adtech company Sellpoints has selected the Databricks platform to deliver a predictive analytics product.

Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week beating the drum for its partnership with Hewlett-Packard Enterprise.  The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow.  We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.

Explainers

In the morning paper, Adrian Colyer explains trajectory data mining,

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates all on state of Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with DataTorrent Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender with the Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.

Perspectives

We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from CapitalOne, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and give it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL, offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for H2O.ai.

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et. al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.

Big Analytics Roundup (February 29, 2016)

Happy Leap Day.  Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week.  The Alluxio announcement has inspired big thinkers to share big thoughts.  And, we have a nice crop of explainers.  Scroll down to the bottom for another SQL on Hadoop benchmark.

Explainers

— In SearchDataManagement, Jack Vaughn explains Spark 2.0.

— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.

— MapR’s Jim Scott explains Spark accumulators.   Jim also explains Spark Streaming.

— DataArtisans’ Fabian Hueske introduces Flink.

— In SlideShare, Julian Hyde explains streaming SQL.

— Wes McKinney explains why pandas users should be excited about Apache Arrow.

— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.

— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).

Perspectives

— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark

— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.

— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop.  She means Hadoop in a general way, e.g. the Hadoop ecosystem.

— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group and asks him questions about data science.  Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”

— Alpine Data Labs’ Steven Hillion ruminates on success.  He’d be better off ruminating on “how to raise your next round of venture capital.”

— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.

— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder.  He’s just trolling.

— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense.   He further opines that Flink will displace Spark for streaming, which doesn’t make sense.   Shore interviews IBM’s Nagui Halim about streaming here.

Open Source Announcements

— Alluxio (née Tachyon) announces Release 1.0.0.  Alluxio is open source software distributed through Git under an Apache license, but is not an Apache project.  Yet.  Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin.  Daniel Gutierrez reports.

— Yahoo releases CaffeOnSpark, a distributed deep learning package.  Caffe is one of the better-known deep learning packages, with a track record in image recognition.  Software is available on Git.  For more information, see the Wiki.  Alex Handy reports; Charlie Osborne reports.

— RapidMiner China announces availability of an extension for deep learning engine DL4J.  The extension is open source, and works with the open source version of RapidMiner.  DL4J sponsor Skymind collaborated.

Commercial Announcements

–Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.

— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.

Funding Announcements

Health analytics vendor Health Catalyst lands a $70M Series E round.

AtScale Benchmarks SQL-on-Hadoop Engines

On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (1.2/0.7), Cloudera Apache Impala (2.3) and Spark SQL (1.6).  The AtScale team tested Impala and Spark with Parquet and Hive on Tez with ORC.  For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.

While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet.  Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well.  AtScale did not respond to queries on this point.

Key findings:

  • All three engines performed about the same on single-table queries, and on queries joining three small tables.
  • Spark and Impala ran faster than Hive on queries joining three large tables.
  • Spark ran faster than Impala on queries joining four or more tables.

The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.

In concurrency testing, Impala performed much better than Hive or Spark.

Details of the test available in a white paper here (registration required).