The Year in Machine Learning (Part Three)

This is the third installment in a four-part review of 2016 in machine learning and deep learning. In Part One, I covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms. In Part Two, I surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

In this installment, we will review the machine learning and deep learning initiatives of Big Tech Brands — industry leaders with big budgets for software development and marketing. Big Tech Brands fall into three groups:

— SAS is the software revenue leader in predictive analytics. It has a unique business model and falls into its own category.

— Companies such as IBM, Microsoft, Oracle, SAP, and Teradata all have strong franchises in the data warehousing market, and all except Teradata offer widely used business intelligence software. These companies have the financial strength to develop, market and cross-sell machine learning software to their existing customer base, and can impact the market if they choose to do so.

— Dell and HPE dabbled in advanced analytics and exited the market in 2016.

I covered Google and Amazon Web Services in Part One. Although neither company has a strong position in business analytics at present, they are making moves in that direction. Google set up Google Cloud Machine Learning as a distinct product group this year to service that market, and Amazon introduced QuickSight, a business analytics service.

Regular readers know that I favor open source software — as do most data scientists. Among the companies covered in this installment, IBM and Microsoft are making substantial commitments to the open source model, including direct contributions to open source software projects. They deserve kudos for that. Teradata is investing in Presto SQL, for which they get polite applause. Oracle and SAP leverage open source software in their solutions but make no significant contributions. SAS embraces open source the way a cat embraces a porcupine.

In Part Four, I will survey machine learning startups, and deliver results from the Bottom Story of the Year poll.

SAS

SAS leads the market in licensing revenue for advanced and predictive analytics software, according to IDC. The company has a loyal following among statisticians, actuaries, life scientists and others whose work depends on statistical analysis.

Partnering with IBM, SAS built its business in the 1970s on the strength of its software for the IBM System/360 mainframe. IBM promoted the software to its enterprise customers to increase adoption and use of its hardware. SAS software still runs on the mainframe, and the company continues to earn a significant share of its revenue on that platform. IBM has mainframe customers who use the big box exclusively for SAS.

In the 1990s, SAS successfully transitioned to a multi-vendor architecture and rebuilt its software to run on many different hardware platforms and operating systems. During this period, SAS established a reputation for industrial-strength and enterprise-grade software — in contrast to vendors like SPSS, who focused on building easy-to-use software for the desktop.

On the face of it, SAS has struggled to transition from server-based computing to the contemporary world of distributed architecture and cloud platforms. In the past ten years, the company has announced multiple initiatives to improve the performance and scalability of its products, with mixed success. In April, SAS announced Viya, its third attempt to deliver advanced analytics in a distributed MPP architecture.

What is SAS Viya? How does it differ from SAS’ previous attempts at high-performance design? Let’s peruse the brochure:

Cloud-ready, elastic and scalable

 

SAS Viya is built to be elastic and scalable for both private and public clouds. Analytical, in-memory computations are optimized for unconstrained environments, but they can also adjust for constrained environments. The elastic processing automatically adapts to needs and available resources – spinning up or winding down computing capacity as needed. Elastic scalability lets you quickly experiment with different scenarios and apply more complex approaches to larger amounts of streaming data.

Ahem. Any software is “cloud-ready,” in the sense that a Linux instance is a Linux instance whether it runs on-premises or in the cloud. And any software is elastic when you deploy it in a virtual appliance, such as an Amazon Machine Image. That includes SAS 9.4, which SAS touted as “cloud-ready” in 2014, and previous versions of SAS, which you could deploy in AWS even though SAS did not formally support the platform.

If you want to spin up software instances, however, you need software licenses. With open source software, such as Python, R, or Spark, that’s not an issue — you can spin up as many instances as you like without violating license agreements. Commercial software is more complicated since you need to pay for the licenses you want to spin up. Some vendors, like HPE and Teradata, tried to address this problem by marketing their own cloud platforms to compete with Amazon Web Services; they failed miserably. Others, like Oracle, partner with AWS to deliver their software in the cloud — either as a bundled managed service or on a “Bring Your Own License” (BYOL) model.
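To make the elasticity point concrete, here is a minimal sketch of what “spinning up or winding down computing capacity” looks like at the infrastructure level, using boto3, the AWS SDK for Python. The AMI ID, instance type, and instance counts are placeholders; the point is that elasticity is a property of the platform, and every instance you launch still needs a license.

```python
import boto3  # AWS SDK for Python

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Scale out: launch worker instances from a pre-built machine image.
# The AMI ID is a placeholder; substitute an image with your analytics stack installed.
workers = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m4.2xlarge",
    MinCount=4,
    MaxCount=4,
)

# ... run the workload ...

# Scale in: terminate the instances when the job finishes.
# The cloud bill stops here; a term software license does not.
for instance in workers:
    instance.terminate()
```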

You can’t have elastic computing with commercial software without a flexible licensing model. Pay-for-what-you-use licensing poses a problem for vendors like SAS, because if customers only pay for what they use, they invariably pay a lot less than they do under term licensing. Most commercial software customers are over-licensed — they’re paying for a lot of software they don’t use. That is why revenue from on-premises software licensing is declining much faster than revenue from cloud-based subscriptions is rising. In the cloud, you can do more with less.

The bottom line is this: unless Viya is available under an elastic pricing model, nobody cares that it is “cloud-ready, elastic and scalable.”

If you want to have a little fun, the next time your SAS rep touts Viya’s elasticity, ask him what it will cost per hour to license the software. Watch him squirm.

Open analytics coding environment

 

Empower your data scientists with SAS Analytics that are easily available from a variety of programming languages. Whether it’s a Python notebook, Java client, Lua scripting interface or SAS, your modelers and data scientists can easily access the power of SAS for data manipulation, advanced analytics and analytical reporting.

We’ve all been waiting for the ability to run SAS from Lua.

Resilient architecture with guaranteed failover

 

For answers you depend on, you need analytical processing power you can count on. You need all your analytical computations to finish processing without interruption. The fault-tolerant design of SAS Viya automatically detects server failure, even in multiplatform processing environments, and redistributes processing as needed. It also manages several copies of data on the processing cluster. If a machine in the cluster becomes unavailable or fails, the required data is retrieved from another block to quickly continue processing. These self-healing mechanisms ensure high availability for uninterrupted processing and automated recovery.

“It runs on Hadoop.”

Interviewed in Forbes, SAS CEO Jim Goodnight speaks at length about Viya:

We are ready for big data…(we) just released our first version of our new Viya architecture, which is massively parallel computing where we spread the data out over dozens of servers and then use all the cores inside those servers to process the data in parallel. So we might have 500 cores working on the data all at once in parallel, and that allows it to handle some really, really big problems that we’ve never even thought of before. Things like logistic regression.

Someone should feed Dr. G. better talking points. Just for the record, commercially available software for logistic regression running in a massively parallel processing (MPP) environment first hit the market in 1989. Distributed logistic regression is currently available in multiple software packages, including one introduced by SAS five years ago.

Logistic regression (a non-linear model) is an iterative process. Essentially, you’re trying to estimate the parameters in the model, and so you take a guess, you’ve got to run through the data using that guess, then to refine it and do another guess and run through the data again, and you keep doing this over and over and over until the parameters converged or they don’t change much at all anymore. That can take 25 to 30 passes of the data. Now, in the old days, we used to have to read the data that many times. Now, it’s in memory. We put it in memory and it stays in memory. It’s spread out over 500 cores and then each one just does a little piece of the work, and so we can do those 25 iterations in just a few minutes, whereas it used to take hours.

It’s just like Spark, but with a license key.
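For comparison, here is roughly what the workflow Goodnight describes looks like in open source Spark: cache the training data in memory across the cluster, then let the logistic regression optimizer make repeated passes over it. This is a minimal PySpark sketch, not a benchmark; the file path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("distributed-logit").getOrCreate()

# Hypothetical input: one row per account, with a binary label column.
df = spark.read.parquet("hdfs:///data/accounts.parquet")

# Assemble the predictor columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["income", "utilization", "tenure"],
    outputCol="features",
)
train = assembler.transform(df).select("label", "features").cache()  # pin in memory

# Iterative optimization: each of up to 25 iterations is a pass over the
# cached, partitioned data -- no rereading from disk.
lr = LogisticRegression(maxIter=25, labelCol="label", featuresCol="features")
model = lr.fit(train)

print(model.coefficients, model.intercept)
spark.stop()
```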

(Viya’s) really our third generation of massively parallel computing. We’ve been working on this problem for seven years, and this is our third major crack at doing it, and this time we’ve got everything figured out.

In 2018 he’ll be talking about a fourth crack in nine years.

It’s possible that Viya works better than SAS’ previous cracks at high-performance analytics. That is a weak hurdle, however; SAS needs to demonstrate that its high-cost proprietary distributed framework is better than Apache Spark, which is rapidly emerging as the standard enterprise platform for Big Data.

While SAS supports machine learning techniques in several different products, it lags in deep learning. The SAS Marketing team created some helpful content about deep learning, but look carefully at that page — you won’t find an actual product for deep learning. Yes, I know that SAS Enterprise Miner supports multilayer perceptrons; but SAS does not support GPUs, Xeon Phi, Intel Nervana or any other high-performance architecture that will make it possible for you to train a deep neural net while you’re young.

If you think that an eighteen-year-old product running on one server is sufficient for your deep learning project, you should definitely talk to SAS. Keep in mind, though, that there is a reason that NVIDIA’s DGX-1 GPU-accelerated deep learning box has the power of 250 conventional servers: you actually need that kind of horsepower.

The rest of SAS’ business seems to be chugging along well enough. A combination of renewals, upgrades and upsells in existing accounts should produce low single-digit revenue growth for 2016, which is not a bad track record when you consider the declines reported by IBM, Oracle, and Teradata.

Business Analytics Leaders

The five companies in this group sell at least a billion dollars a year in business analytics software, according to IDC’s most recent worldwide software market share report. However, most of their revenue comes from data warehousing and business intelligence software; they all trail SAS in predictive analytics revenue.

Software licensing revenue is a misleading measure, however, due to the growing presence of open source software. IBM, Microsoft, and Oracle, for example, actively use open source machine learning software to extend the reach of their data warehousing and business intelligence platforms, where they have strong entries. IBM uses Spark as a foundation for many of its products; Microsoft has integrated R with SQL Server and PowerBI, and actively promotes the use of R for its enterprise customers. Oracle has taken a similar approach.

IBM

Unlike SAS, declining tech giant IBM never invested in a proprietary distributed framework for SPSS, its flagship software for advanced analytics. Instead, the company chose to leverage in-database engines (DB2, Netezza, and Oracle) and open source frameworks (MapReduce and Spark.)

IBM contributes to Apache Spark, which it uses in several products, and also to Apache SystemML. IBM Research developed the core of SystemML, which IBM donated to Apache in 2015. IBM has also visibly contributed to the Spark community through its efforts in education and training.

In 2016, IBM continued to market SPSS Statistics and SPSS Modeler, software brands it acquired in 2009. Release 18 of SPSS Modeler, announced in March, includes such things as support for machine learning in DB2 and support for IBM’s General Parallel File System (GPFS) in BigInsights. There aren’t too many data scientists who care about such things, but they appeal to the 150 or so enterprises with CIOs who still believe that nobody ever got fired for buying IBM.

In Part One of this review, I covered IBM’s machine learning moves in IBM Cloud, which I would characterize as Shakespearean, as in Much Ado About Nothing.

Microsoft

Microsoft had quite a year in machine learning and deep learning. As I noted in Parts One and Two, in 2016 MSFT launched cognitive APIs in Azure for vision, speech, language, knowledge, and search; a managed service for Spark in Azure HDInsight; enhancements to Azure Machine Learning; and Version 2.0 of its deep learning framework, rebranded as Microsoft Cognitive Toolkit.

That’s just for starters.

In January, Microsoft announced Microsoft R Server, a rebranding of the product it acquired with Revolution Analytics in 2015. Microsoft R Server includes an enhanced R distribution, a scalable back-end, and integration tools. During the year, Microsoft delivered two major releases of R Server. In Release 8, the company added push-down integration with Spark. Release 9 updated the Spark integration for Spark 2.0, and added MicrosoftML, a new R package for machine learning.

Microsoft announced SQL Server 2016 in March with embedded SQL Server R Services. On the Revolutions blog, David Smith reports on the launch. Tomaž Kaštrun explains what you can do with R services in SQL Server.

In November, after an extended preview, Microsoft announced the general availability of R Server for Azure HDInsight, a scale-out implementation of R integrated with Spark clusters created from HDInsight.

Also in Azure, Microsoft added a Linux version of the Data Science Virtual Machine (DSVM). Previously available as a Windows instance, DSVM includes Revolution R Open, Anaconda, Visual Studio Community Edition, PowerBI Desktop, SQL Server Express and the Azure SDK.

PowerBI, Microsoft’s powerful visualization tool, added R support in August. In ComputerWorld, Sharon Machlis, an R user, enthused. More here, on the Revolutions blog.

R Tools for Visual Studio launched to public preview in March, and to general availability in September. Also in September, Microsoft released the Microsoft R Client, a free data science tool that works with Microsoft R Open and the ScaleR distributed back end.

Microsoft data scientists Gopi Krishna Kumar, Hang Zhang and Jacob Spoelstra developed a methodology for data science, which they presented at the Microsoft Machine Learning and Data Science Summit 2016 in September. David Smith reports. The method, which the authors call Team Data Science Process, includes a standard directory structure for managing project artifacts using a system such as Git. It also includes open source utilities to support the process.

Other than that, it was a quiet year in Redmond.

Oracle

Oracle has a surprisingly robust set of machine learning tools that appeal to Oracle-centric organizations. They include:

Oracle Data Mining (ODM), a suite of machine learning algorithms that run as native SQL functions in Oracle Database.

Oracle Data Miner, a client application for ODM with a business user interface.

Oracle R Distribution (ORD), an enhanced free R distribution.

Oracle R Enterprise (ORE), Oracle R Distribution packaged with tools to integrate R with Oracle Database.

Oracle R Advanced Analytics for Hadoop (ORAAH), a set of R bindings with native algorithms and an interface to Spark.

Oracle claims that ORAAH’s native algorithms are faster than Spark’s, but ORAAH has only two algorithms, so nobody cares. Oracle OEMs Cloudera, so the Spark release is at least one major release behind the rest of the world.

Other than some dot releases for the components cited above, I don’t see a lot of movement for Oracle in 2016.

SAP

SAP introduced an update to its predictive analytics capabilities, now branded as SAP Business Objects Predictive Analytics 3.0. This product includes two separate automation capabilities, one branded as Predictive Factory, the second as HANA Automated Predictive Library. Predictive Factory, like SAS Factory Miner, is a scripting tool that enables a data scientist to create a modeling pipeline and schedule it for execution; it does not automate the data science process itself.

HANA Automated Predictive Library is a set of functional calls that users can include in SQL scripts. It’s a product that might appeal to SAP HANA bigots and nobody else.

SAP acquired KXEN and its InfiniteInsight software in 2013. Customer satisfaction promptly dropped through the floor, and SAP trails all other advanced analytics vendors rated in a Gartner survey. Legacy InfiniteInsight customers fall into two camps: (a) those whose IT organizations are heavily invested in SAP, and (b) everyone else. The former seem to be sticking with the software as SAP integrates it into its product line; the latter are heading for the exits.

Teradata

Declining data warehouse vendor Teradata thinks of itself as an analytics powerhouse. In reality, most of its revenue comes from data warehousing, where the company gets high marks from analysts like Gartner.

You could say that Teradata has a commanding position at the bottom of the analytics stack.

Teradata’s executive leadership — if you can call it that — completely missed the implications of Hadoop and cloud computing. Instead, they bet that the Teradata brand was beloved by IT executives, who would keep on buying boxes in bulk. As a result of that blinkered view of the world, the company today is worth a third of what it was worth five years ago. Its product sales have declined for ten straight quarters, seven in a row at double digits.

After a dismal first quarter, Teradata’s board “accepted the resignation” of CEO Mike Koehler; longtime board member Victor Lund stepped into the breach. In September, at the Teradata Partners conference, Lund announced that Teradata would reposition itself as an “analytics solutions” firm.

That may not sit well with SAS, Teradata’s primary partner for advanced analytics software, which also views itself as an “analytic solutions” firm. The difference, of course, is that SAS has been delivering solutions for a long time and has street cred with executives: it offers sophisticated business solutions backed by actual software and intellectual property, while Teradata appears to have little more than big ideas and PowerPoint.

Pro tip for Teradata management: just because you want to move up the value chain does not mean that you have the ability to do so.

In other developments, the company announced that Aster finally supports Spark, two years after anyone might have cared. Teradata also announced that Aster’s analytics are now available for deployment in Hadoop. Aster on Hadoop is a bladeless knife without a handle — a commercial machine learning library that competes with umpteen open source libraries. Aster also competes with another Teradata partner, Fuzzy Logix, whose dbLytix library is six times richer and more mature.

If someone proposes to bet that “solutions” and unbundled Aster will reverse Teradata’s decline, take the under.

Other Tech Giants

We mention two remaining giants, Dell and HPE, only to note their passing from the scene.

HPE

HPE announced the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also granted equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets. The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for “cutting costs” (read: firing people), so if you’re working for Haven or Vertica, this may be a good time to dust off your resume.

In March, HPE announced Haven OnDemand, available on Microsoft Azure. Haven is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, initially branded as HAVEn and announced by HP in June 2013.  In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So the March announcement is a re-re-release of the software.

Three years into its product life cycle, Haven hasn’t exactly caught on with data scientists. Just 2 out of 2,895 respondents to the KDnuggets 2016 Data Science Software Usage poll and none in the O’Reilly 2016 Data Science Salary Survey said they use the software. Adding insult to injury, Haven failed to make KDnuggets’ list of the top 50 machine learning APIs, a list that includes the likes of Ersatz, Hutoma, and Skyttle.

Vertica still has some traction with data lovers whose analysis needs are simple enough to satisfy with SQL. Currently, it’s the 28th most popular relational database, according to DB-Engines, which is about on par with Netezza and Greenplum and a lot better than Aster. Expect this ranking to drop like a stone in the hands of Micro Focus.

Dell/EMC

Dell entered the advanced analytics business by acquiring StatSoft in 2014, a move that impressed nobody. In 2016, Dell exited by selling its software division to private equity investors.

Goodbye, Dell. We hardly knew ye.

Disruption: It’s All About the Business Model

This post is an excerpt adapted from my book, Disruptive Analytics, available soon from Apress and Amazon. (Note: under my contract with Apress I am legally obligated to link to their site, but it’s not yet possible to order the book there. Use the Amazon link if you want the book.)

The analytics business is booming. Technology consultant IDC estimates total spending for analytic services, software and hardware exceeded $120 billion in 2015; through 2019, IDC forecasts that spending will increase to $187 billion, an 11% compound annual growth rate.

Powerful forces are at work in the economy today:

  • Digital transformation of the economy and rapidly declining storage costs combine to create a flood of data.
  • The number of data sources is exploding. Data sources are everywhere: on-premises, in the cloud, in consumers’ pockets, in vehicles, in RFID chips, and so forth.
  • The “long march” of Moore’s Law: cheap computing power makes machine learning and deep learning techniques practical.

So, if analytics is such a hot field, why are the industry leaders struggling?

  • Oracle’s cloud revenue growth fails to offset declining software and hardware sales.
  • SAP’s cloud revenue grows, but total software revenue is flat.
  • IBM reports seventeen straight quarters of declining revenue, accompanied by mass layoffs.
  • Microsoft underperforms analysts’ expectations despite 120% growth in Azure cloud revenue.
  • Predictive analytics leader SAS reports five years of low single-digit revenue growth; its Executive Vice President and Chief Marketing Officer departs.
  • Data warehousing leader Teradata shuffles its leadership team after four years of declining product revenue.

Product quality is not the problem. Each company offers products that industry analysts rate highly:

  • Forrester and Gartner recognize IBM, SAS, SAP and Oracle as leaders in data quality tools.
  • Gartner rates Oracle, SAP, IBM, Microsoft and Teradata as leaders in data warehousing.
  • Forrester rates Microsoft, SAP, SAS, and Oracle as leaders in agile business intelligence.
  • Gartner recognizes SAS and IBM as leaders in Advanced Analytics.

The answer, in a word, is disruption. Clayton Christensen of the Harvard Business School outlined the theory of disruptive innovation in 1997. Summarizing the argument briefly:

  • Industries consist of value networks, collections of suppliers, channels, and buyers linked by relationships.
  • Innovations disrupt industries when they create a new value network.
  • Not all innovations are disruptive. Many are introduced by market leaders to sustain a competitive position.
  • Disruptive innovations tend to be introduced by outsiders.
  • Purely technological innovation is not disruptive; what matters is the business model enabled by the new technology.

For a more detailed exposition of the theory, read Christensen’s book.

Christensen identified two forms of disruption. Low-end disruption occurs when industry leaders enhance products faster than customers can assimilate the enhancements; the disruptor enters the market with a “good enough” product and a better value proposition. The disruptor’s innovation makes it possible to serve customers at a lower cost than the industry leaders can deliver.

New market disruption takes place when the disruptor innovates in ways enabling it to serve customers that are not served by the industry leaders.

Technology alone does not disrupt industries; incumbents can and do innovate. New business models enabled by new technology are the cutting edge of disruption. Frequently, incumbents cannot respond effectively to new business models; this is partly due to “blinders” caused by changing value networks, and partly out of fear of cannibalizing existing business arrangements. Two business models, in particular, are disrupting the business analytics world today:

  • Open source software business models offer an increasingly attractive alternative to commercial software licensing. The Hadoop ecosystem displaces conventional data warehousing; R and Python displace commercial software for advanced analytics.
  • The elastic business model made possible by cloud computing undercuts conventional software licensing. When customers pay only for what they use, they pay a lot less.

Disruption does not mean that leading companies like Oracle, IBM and SAS will go out of business. Blockbuster may be the poster child for disrupted businesses, but most cases are less dire; for the business analytics leaders, disruption means they will struggle to grow. Slow growth is less benign than it sounds. As McKinsey notes, the rule today is “Grow or Go”: companies that cannot define a credible growth strategy will be acquired by other companies or by private equity.

The alternative to revenue growth is increasing profitability. But when revenue is flat or declining, that usually means job cuts.

Disruption looks like this.

Consider what happened to Teradata. Late in 2012, the company started missing sales targets; in early 2013, it stunned investors by reporting an absolute decline in sales. Management offered excuses; Wall Street punished the stock, driving it down by half in the face of a bull market for tech stocks.

Teradata’s leadership continued to miss sales and earnings targets; Wall Street drove the stock price down to a fraction of its 2012 peak. While it is tempting to blame the problem on poor leadership, Teradata’s persistent failure to accurately forecast its sales and earnings is a clear sign that its leadership no longer understood the value networks in which they operated. The world had changed; the value networks created in Teradata’s rise to leadership no longer existed; the mental models managers used to understand the market no longer worked.

There are two distinct types of disruption. The first is disruptive innovation within the analytics value chain. Here are two recent examples:

Hadoop. The Hadoop ecosystem disrupts the data warehousing industry from below. Hadoop does not do everything a relational database can do, but it does just enough to offer an attractive value proposition for the right use cases. When first introduced, Hadoop’s capabilities were very limited compared to data warehouse appliances. But Hadoop’s flexibility and low cost were highly attractive for applications that did not need the performance and features of a data warehouse appliance. While established vendors struggle to maintain flat and declining revenue, companies that offer solutions built on Hadoop grow at double-digit rates.

Tableau. Tableau virtually created the market for agile, self-service discovery. The charting and visualization features in Tableau are available in mainstream business intelligence tools. But while business intelligence vendors target the IT organization and continually add complexity to their product, Tableau targets the end user with a simple, easy to use and versatile tool. As a result, Tableau has increased its revenue tenfold in five years, leapfrogging over many other BI vendors.

Disruption within the analytics value chain is pertinent for readers who plan to invest in analytics technology for their organization. Technologies at risk of disruption are risky investments; they may have abbreviated useful lives, and their suppliers may suffer from business disruption. Taking a “wait-and-see” attitude towards disrupted technologies makes good sense, if only because prices will likely decline in the future.

The second type is disruption by innovations in analytics. Examples of disruption by analytics are harder to find, but they do exist:

Credit Scoring. General-purpose credit scoring introduced by Fair, Isaac and Co. in 1987 virtually created a national market in credit cards.  Previously, banks issued credit cards to their local customers, with whom they had an established relationship. Uniform credit scoring enabled a few large issuers to identify creditworthy clients in the general population, without a prior relationship.

Algorithmic Trading. When the U.S. Securities and Exchange Commission authorized electronic trading in regulated securities in 1998, market participants quickly moved to develop algorithms that could arbitrage between markets, arbitrage between indexes and the underlying stocks, and exploit other short-term opportunities. Traders that most effectively deployed machine learning for electronic trading grew at the expense of other traders.

For startups and analytics practitioners, disruption by analytics is essential. Startups must disrupt their industries if they want to succeed. Using analytics to differentiate a product is a way to create a disruptive business model or to create new markets.

There is a common theme across the four examples: the business model enabled by the technology and not the technology itself drives the disruption. Hadoop and Tableau do less than the legacy products they compete against; what they do, however, is sufficient for a class of use cases, for which they provide a better value proposition. Credit scoring and algorithmic trading created fundamentally new ways to lend and invest; while these applications attracted technological innovations as they expanded, it was the new business models they created that disrupted the lending and investing industries.

To illustrate the importance of the business model, consider the case of columnar serialization, a significant innovation in data warehousing that did not disrupt the industry. In 2005, Vertica introduced a commercial columnar database, a technology that is well-suited to high-performance analytics (as we explain in Chapter Two of Disruptive Analytics). Vertica successfully built a customer base, but did not create a unique business model; by 2010 the leading data warehouse vendors had introduced columnar serialization into their products. HP acquired Vertica in 2011 for about $250 million, a price well below the $1.7 billion IBM paid for Netezza, a competing data warehouse appliance vendor.

Here are some takeaways for the reader to consider.

First, if you want to invest in new business analytics technology, ask yourself:

  • Are we paying for what we use, or for what we might use?
  • What particular value do commercial software options offer over open source alternatives?

Second, if you want to use analytics to create a disruptive innovation, ask yourself:

  • What new business model does this support?
  • Can we disrupt incumbents from below with a better value proposition?
  • Can we reach new markets and new customers who are underserved by existing value networks?

There is one additional takeaway: nobody ever disrupted anything by managing data. Keep that in mind the next time a data warehousing vendor tries to tell you that their Big Box is a “strategic” investment. We’ll explore that in another excerpt from the book.

Big Analytics Roundup (July 25, 2016)

We have some more summer reading this week; plus, Splice Machine announces availability of its open source Community Edition, and Google launches two new machine learning APIs. There are so many Spark stories I’ve created a special section for them. Plus we have the usual explainers, perspectives, and news.

Quant headhunter Linda Burtch repeats her survey of working analysts in her network. Preference for using SAS has steadily declined over the three years she has conducted the poll; this year a clear majority chose R or Python over SAS. Preference for open source correlates with education; the more you know, the less likely you are to use SAS.

Oracle, IBM, SAP, and Microsoft have all reported Q2 revenue and earnings, but Teradata is still crunching the numbers. I’ll do a general earnings roundup when TDC gets around to reporting its numbers. TDC’s stock price has outperformed the others since June 30, which suggests the market expects a good second quarter. Meanwhile, TDC acquires another consultancy and reveals who bought Aprimo.

Summer Reading

Adrian Colyer lists his five favorite papers from the past several months and outlines his philosophy, which you must read. And here is another link to last week’s top paper on data bazaars versus data cathedrals.

Splice Machine Shifts to Open Core

Hadoop-based RDBMS vendor Splice Machine announces general availability for its open source community edition and offers a sandbox hosted on AWS.  Sam Dean approves; Andrew Brust reports; Dave Ramel explains. Jack Germain describes Splice Machine’s changing business model.

Spark Stories

— Databricks’ Spark survey is still accepting responses. Go and fill it out if you have not done so already.

— The Spark PMC has voted favorably on a release candidate for Spark 2.0, which is now in packaging for general availability.

— On the Databricks blog, Jules Damji corrals Spark news from the past two weeks.

— Alex Woodie touts LevyxSpark, an enhanced Spark distribution based on open source Apache Spark. LevyxSpark includes some open source enhancements, plus Levyx Helium, an SSD-based key-value store.

— In a webcast, Alexander Ulanov summarizes options for deep learning on Spark.

— Sam Weaver explains how to use the new MongoDB connector for Spark.

Explainers

— Nita Dembla and Gopal Vijayaraghavan explain improvements in Hive 2.1.

— Siddharth Anand introduces Apache Airflow (Incubating), a platform to author, schedule, and monitor DAGs. Sounds like Apache Beam.

— Data Artisans’ Stephan Ewen explains savepoints in Apache Flink.

Perspectives

— Jack Clark profiles Google’s land grab in deep learning. Short version: TensorFlow is blowing away Caffe, Torch, Theano, dl4j, CNTK, and DSSTNE.

— Greg Satell theorizes about Google’s open source strategy as if a “razor and blades” strategy is something new and brilliant.

— In Fortune, Barb Darrow profiles cloud computing’s disruptive impact.

— Sam Dean confuses machine learning with artificial intelligence.

— Syncsort’s Paige Roberts interviews Dr. Ellen Friedman.

— Drew Breunig poses a theory about the business implications of machine learning.

— BuzzFeed’s Adam Kelleher attempts to explain bias, fails.

— IBM exec Rob Thomas co-authors a blog about machine learning. It’s about what you would expect from an IBM exec.

Open Source News

— Open source columnar storage engine Apache Kudu graduates to top-level status.

— Apache Chukwa announces Release 0.8, with security bug fixes, FWIW. Chukwa captures logs from distributed systems for monitoring and analysis. No, I never heard of it either.

Commercial Announcements

— Google announces open beta for its Cloud Natural Language and Cloud Speech APIs.

Hardware News

— Inspur, which claims to be China’s largest server manufacturer, announces availability of the Memory1 line of servers for big analytics. Inspur uses high-capacity flash DIMMs and memory expansion software to deliver up to 2TB of memory per server and up to 80TB per rack.

— Startup Wave Computing announces plans for a family of deep learning computers. Good luck to them. The history of computing isn’t kind to special purpose machines, which tend to eventually get buried by general purpose machines.

Funding News

— Redis Labs lands a $14 million “C” round led by Bain Capital and Carmel Ventures. Redis claims 6,200 enterprise customers and 55,000 accounts for its cloud service.

— Sift Security emerges from stealth, announces $3.25 million in angel funding. Sift uses graph analytics running on Spark and TitanDB to identify linked threats and incidents.

Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from Turi (formerly Dato, formerly GraphLab).

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. The conference runs November 14-16; the CFP closes September 9.

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

(Screenshot: OpsClarity survey chart on streaming latency requirements.)

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.

Performance is cool. But the contest among the current spate of streaming engines will not be decided by performance tests. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies, with quarterly earnings reports. (SAS is privately held). Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to do real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

— Turi (formerly Dato, formerly GraphLab) announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.   You can get a free copy here from RapidMiner (registration required.)  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, H2O.ai, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model, like H2O.ai.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

— Accenture makes the analysis this year, according to Gartner, because it acquired i4C Analytics, a tiny privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing; hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

— Alpine Data Labs declines a lot in “Ability to Execute,” which makes sense since they appear to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum (now Pivotal) and EMC (now Dell), and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users,” which is spot on for anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

— Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’s scores are up for customer satisfaction and delivering business value, which suggests that whoever it is at Gartner that decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

— Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

— Dell: Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

— FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

— IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

— KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

— Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

— Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

— Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analysis Services are less than thrilled with that product.

— Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

— Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required.

— RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

— SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

— SAS declines slightly in Ability to Execute.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.

Big Analytics Roundup (September 7, 2015)

Top news for this week: SAP and Syncsort release Spark connectors, and HP says it wants to develop one; Pivotal abandons Hawq; Flink, Spark and H2O publish agendas for the upcoming events.

Postdoc Opportunity

Peter Rose, Site Head of the RCSB Protein Data Bank West at SDSC, just landed a $1.4 million grant from the National Institutes of Health under its Big Data to Knowledge (BD2K) initiative.  Peter writes to say that the team plans to use Apache Spark, and seeks postdocs to join the group.  See the flyer here: Postdoc Big Data UCSD flyer

SQL on Hadoop

Pivotal gives up trying to sell Hawq, donates it to Apache where it is now an incubator project.  Hawq is a SQL tool that federates queries across Hadoop and Greenplum database, introduced with much hoopla two years ago.

Apache Drill

On the MapR blog, Carol McDonald offers the ultimate guide to Drill’s architecture.  (h/t Hadoop Weekly)

Apache Flink/ Data Artisans

…announces Flink 0.9.1, a maintenance release.

Flink Forward’s organizers publish the program for the conference, which will be held October 12-13 at Berlin’s KulturBrauerei in the Prenzlauer Berg district.  Conference organizers have arranged discounted rooms at the nearby Hotel4Youth, where presumably old folks like me are unwelcome.

On Slideplayer, Data Artisans co-founder Stephan Ewen presents an overview of Flink.

On the Data Artisans blog, Robert Metzger and Kostas Tzoumas present a practical guide to integrating Flink and Kafka.  Go here for video.

Dataconomy interviews Romeo Kienzler of IBM, who plans to present at the Flink Forward conference in October.

Apache Spark/ Databricks

Databricks publishes agenda for the Spark Summit Europe.  I’ve attended every Spark Summit to date, but will skip this one.

HP announces its future “commitment” to integrate Vertica with Spark, featuring accelerated data transfer and a model scoring capability.  Translation:  HP wants some Spark buzz to counter IBM, but they don’t have a working release or even a definite plan.

MapR’s Carol McDonald publishes a guide to integrating Spark Streaming with HBase.  (h/t Hadoop Weekly)

SAP announces SAP HANA Vora, an in-memory query engine that runs on Spark and supports OLAP and drill-down analysis.  SAP’s PR engine kicks into overdrive.  Timothy Prickett Morgan touts it; additional coverage here, here, and here.

Syncsort contributes an IBM z-system mainframe connector to Spark Packages.   Analysis ensues: here, here, here, and here.  Some analysts confuse this announcement with IBM’s plan to support Spark on its mainframes.  The two things are completely different: the point of the connector is to extract data from the mainframe and move it to Spark for processing.

Bernard Marr asks whether the rise of Spark spells the end of Hadoop, explains why the answer is no.

H2O/ H2O.ai

…publishes schedule and speaker lineup for H2O World 2015 (November 9-11).  Featured speakers include Hilary Mason, Monica Rogati, Stephen Boyd and Rob Tibshirani.

R

Hadley Wickham announces dplyr 0.4.3, with 30 minor improvements and bug fixes.  Release notes here.

Revolution R/ Microsoft

…announces Revolution R Open 3.2.2, an enhanced R distribution.

Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not scale (such as Dell Statistica).  Nor does Big Data capability appear to impact the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS does not scale unless you pair it with Netezza or BigInsights; SAS only scales if you add one of its distributed in-memory back ends.  These products aren’t listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014.  Thought it would be fun to validate how well the crystal ball works.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with products certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite.  But the Mahout team has announced that all new projects must use a standard DSL that runs the job in Spark.

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Co-locating analytics in the Hadoop cluster has proven less attractive than integrating analytics with Hadoop; with Spark fully integrated with Hadoop storage APIs, co-located solutions have little appeal.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security, rely on graph engines for graph-parallel analytics.

Graph analysis is really useful in the right hands, but organizations are still trying to figure out what to do with it.  That is why we still see posts like this; when something is hot, nobody writes articles about what to do with it; everyone knows what to do with it.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and they tend to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops.  Tez isn’t really comparable to Giraph and GraphLab.  And right after I wrote this, the GraphLab open source project pretty much died.   GraphLab Inc., the commercial venture incepted to commercialize the open source project, is fiddling around with other stuff.   Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchen updates his analysis for 2014.  But the trend is your friend:

[Figure: R vs. SAS trend chart]

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  However, job postings for R programmers are growing rapidly, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

Speaking with enterprise customers, I like to ask why they switched from SAS to R.  The #1 response: the people we hire know R already, not SAS.  SAS’ free “University Edition” is an attempt to stem the bleeding that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs.  SAS has delivered an ACCESS engine to HANA but not much else considering the talk about joint solutions.  SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, which is an improvement over 2013 when they didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.

Automated Predictive Modeling

A colleague asks: can we automate predictive modeling?

How we answer the question depends on the context.   Consider the two variations on the question below, with more precise wording:

  1. Can we completely eliminate the need for expertise in predictive modeling — so that an “ordinary business user” can do it?
  2. Can we make expert analysts more productive by automating certain repetitive tasks?

The first form of the question — the search for “business user” analytics — is a common vision among software marketing folk and industry analysts; it is based on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics.   That premise is largely false, for reasons that warrant a separate blog post; for now, let’s just stipulate that the answer is no, it is not possible to eliminate human expertise from predictive modeling, for the same reason that robotic surgery does not eliminate the need for cardiologists.

However, if we focus on the second form of the question and concentrate on how to make expert analysts more productive, the situation is much more promising.  Many data preparation tasks are easy to automate; these include such tasks as detecting and eliminating zero-variance columns, treating missing values and handling outliers.  The most promising area for automation, however, is in model testing and assessment.
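To make this concrete, here is a minimal sketch in R of how that first group of tasks might be scripted with the caret package; the data frame df and the thresholds are illustrative, not a prescription.

    # Illustrative data preparation steps with caret; df is a placeholder
    # data frame of numeric predictors.
    library(caret)

    # Detect and drop zero- and near-zero-variance columns
    nzv <- nearZeroVar(df)
    if (length(nzv) > 0) df <- df[, -nzv]

    # Impute missing values (median imputation; bagged-tree imputation is another option)
    pp <- preProcess(df, method = "medianImpute")
    df <- predict(pp, df)

    # Cap outliers at the 1st and 99th percentiles (a simple winsorizing heuristic)
    df[] <- lapply(df, function(x) {
      q <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
      pmin(pmax(x, q[1]), q[2])
    })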

Optimizing a predictive model requires experimentation and tuning.  For any given problem, there are many available modeling techniques, and for each technique there are many ways to specify and parameterize a model.  For the most part, trial and error is the only way to identify the best model for a given problem and data set.  (The No Free Lunch theorem formalizes this concept.)

Since the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model.  In applied predictive analytics, however, the analyst’s time is strictly limited; a client in the marketing services industry reports an SLA of thirty minutes or less to build a predictive model.  Strict time constraints do not permit much time for experimentation.

Analysts tend to deal with this problem by settling for sub-optimal models, arguing that models need only be “good enough,” or defending use of one technique above all others.  As clients grow more sophisticated, however, these tactics become ineffective.  In high-stakes hard-money analytics — such as trading algorithms, catastrophic risk analysis and fraud detection — small improvements in model accuracy have a bottom line impact, and clients demand the best possible predictions.

Automated modeling techniques are not new.  Before Unica launched its successful suite of marketing automation software, the company’s primary business was advanced analytics, with a particular focus on neural networks.  In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model.   Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection over four different types of predictive models.  Rebranded several times, the original PRW product remains as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.

Two other commercial attempts at automated predictive modeling date from the late 1990s.  The first, MarketSwitch, was less than successful.  MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers”.  Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine and replaced the “automated modeling” capability with outsourced analytic services.

KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization.   The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for Marketing analytics, which it attempted to sell directly to C-level executives.  This effort was modestly successful, leading to sale of the company in 2013 to SAP for an estimated $40 million.

In the last several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products.  In 2010, SAS introduced SAS Rapid Modeler, an add-in to SAS Enterprise Miner.  Rapid Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection and model selection.  The user specifies a data set and response measure; Rapid Modeler determines whether the response is continuous or categorical, and uses this information together with other diagnostics to test a range of modeling techniques.  The user can control the scope of techniques to test by selecting basic, intermediate or advanced methods.

IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster and Auto Numeric nodes.  The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning and variable recasting.   The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules and set limits on model training.

All of the software products discussed so far are commercially licensed.  There are two open source projects worth noting: the caret package in open source R and the MLBase project.  The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques.   The package includes pre-processing tools for tasks such as dummy coding, detecting zero-variance predictors and identifying correlated predictors, as well as tools to support model training and tuning.  The training function in caret currently supports 149 different modeling techniques; it supports parameter optimization within a selected technique, but does not optimize across techniques.  To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results.
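Such a script might look something like the sketch below; train_df, the response y and the list of techniques are placeholders, and each technique requires its underlying package (glmnet, rpart, randomForest, gbm) to be installed.

    # Sketch of a multi-technique test plan with caret; train_df and the
    # response y are placeholders. Each train() call tunes one technique;
    # resamples() compares cross-validated performance across techniques.
    library(caret)

    set.seed(42)
    folds <- createFolds(train_df$y, k = 5, returnTrain = TRUE)
    ctrl  <- trainControl(method = "cv", index = folds)

    methods <- c("glmnet", "rpart", "rf", "gbm")
    fits <- lapply(methods, function(m)
      train(y ~ ., data = train_df, method = m,
            trControl = ctrl, tuneLength = 5)
    )
    names(fits) <- methods

    # Summarize accuracy (or RMSE) across techniques on the same folds
    summary(resamples(fits))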

MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group, is an ambitious effort to develop a scalable machine learning platform on Apache Spark.  The ML Optimizer seeks to simplify machine learning problems for end users by automating the model selection task so that the user need only specify a response variable and a set of predictors.   The Optimizer is still in active development, with an alpha release expected in 2014.

What have we learned from various attempts to implement automated predictive modeling?  Commercial startups like KXEN and MarketSwitch were only marginally successful because they oversold the concept as a means to replace the analyst altogether.  Most organizations understand that human judgement plays a key role in analytics, and they aren’t willing to entrust hard money analytics entirely to a black box.

What will the next generation of automated modeling platforms look like?  There are seven key features that are critical for an automated modeling platform:

  • Automated model-dependent data transformations
  • Optimization across and within techniques
  • Intelligent heuristics to limit the scope of the search
  • Iterative bootstrapping to expedite search
  • Massively parallel design
  • Platform agnostic design
  • Custom algorithms

Some methods require data to be transformed in specific ways; neural nets, for example, typically work with standardized predictors, while Naive Bayes and CHAID require all predictors to be categorical.  The analyst should not have to perform these operations manually; instead, the transformation operations should be built into the test plan script and run automatically, which ensures that the widest possible range of techniques can be evaluated for any data set.
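A sketch of how those model-dependent transformations might be scripted, again using caret and base R; the data frame df and the quintile binning rule are purely illustrative.

    # Model-dependent transformations; df is a placeholder data frame.
    library(caret)

    # For neural networks: center and scale the numeric predictors
    pp_nn <- preProcess(df, method = c("center", "scale"))
    df_nn <- predict(pp_nn, df)

    # For Naive Bayes / CHAID-style methods: bin numeric predictors into quintiles
    df_cat <- as.data.frame(lapply(df, function(x) {
      if (!is.numeric(x)) return(x)
      brks <- unique(quantile(x, probs = seq(0, 1, 0.2), na.rm = TRUE))
      cut(x, breaks = brks, include.lowest = TRUE)
    }))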

To find the best predictive model, we need to be able to search across techniques and to tune parameters within techniques.  Potentially, this can mean a massive number of model train-and-test cycles to run; we can use heuristics to limit the scope of techniques to be evaluated based on characteristics of the response measure and the predictors.   (For example, a categorical response measure rules out a number of techniques, and a continuous response measure rules out a different set of techniques).  Instead of a brute force search for the best technique and parameterization, a “bootstrapping” approach can use information from early iterations to specify subsequent tests.
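As a toy example of such a heuristic, a wrapper might inspect the response and prune the candidate list before any models are trained; the technique lists below are illustrative only, not a recommendation.

    # Toy heuristic: prune the candidate techniques based on the response type.
    candidate_methods <- function(y) {
      if (is.factor(y) && nlevels(y) == 2) {
        c("glmnet", "rf", "gbm", "nnet")      # binary classification
      } else if (is.factor(y)) {
        c("rpart", "rf", "multinom")          # multiclass classification
      } else {
        c("lm", "glmnet", "rf", "gbm")        # continuous response
      }
    }

    # A coarse first pass (small tuneLength) can then inform a finer second
    # pass on the one or two techniques that perform best.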

Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing.  Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural application for cloud (a point that deserves yet another blog post of its own).  The next generation of automated predictive modeling will be in the cloud from its inception.
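As a rough sketch of the parallel piece: caret will already farm its resampling iterations out to a registered foreach backend, so even a single large machine or cloud instance can run many train-and-test cycles concurrently; train_df and y are placeholders as before.

    # Parallel train-and-test cycles via a foreach backend; caret runs the
    # resampling iterations in parallel once a backend is registered.
    library(caret)
    library(doParallel)

    cl <- makeCluster(parallel::detectCores() - 1)
    registerDoParallel(cl)

    ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
    fit  <- train(y ~ ., data = train_df, method = "rf",
                  trControl = ctrl, tuneLength = 10)

    stopCluster(cl)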

Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions.  Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer that they do not own; hence, as a practical matter we should assume that the next generation predictive modeling platform will work with open source machine learning libraries, such as R or Python.

We can’t eliminate the need for human expertise from predictive modeling.   But we can build tools that enable analysts to build better models.

2014 Predictions: Advanced Analytics

A few predictions for the coming year.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation.  Organizations will increasingly question the value of “point solutions” for Hadoop analytics versus Spark’s integrated platform for machine learning, streaming, graph engines and fast queries.

At least one commercial software vendor will release software using Spark as a foundation.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

(2) “Co-location” will be the latest buzzword.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.

YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  In practice, that means it will be possible to stand up co-located server-based analytics (e.g. SAS) on a few nodes with expanded memory inside Hadoop.  This asymmetric architecture adds some latency (since data moves from the HDFS data nodes to the analytic nodes), but not as much as when data moves outside of Hadoop entirely.  For most analytic use cases, the cost of data movement will be more than offset by the improved performance of in-memory iterative processing.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

(Figure: SAS and HDP)

(3) Graph engines will be hot.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security, rely on graph engines for graph-parallel analytics.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

(4) R approaches parity with SAS in the commercial job market.

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  However, job postings for R programmers are growing rapidly, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

(5) SAP emerges as the company most likely to buy SAS.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

(6) Competition heats up for “easy to use” predictive analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.