Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

Spark is overhyped! He declared. His evidence? This:

screen-shot-2017-02-09-at-2-58-05-pm

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which begs the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that can fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, however, there are two ways you need Spark: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLLib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine for Spark designed for reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms.)

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGboost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

screen-shot-2017-02-13-at-12-30-38-pm
Magic Quadrant for Advanced Analytics Platforms, 2016

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in AzureHD, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, supports Vora, a connector to SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

The Year in Machine Learning (Part Three)

This is the third installment in a four-part review of 2016 in machine learning and deep learning. In Part One, I covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms. In Part Two, I surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

In this installment, we will review the machine learning and deep learning initiatives of Big Tech Brands — industry leaders with big budgets for software development and marketing. Big Tech Brands fall into three groups:

— SAS is the software revenue leader in predictive analytics. It has a unique business model and falls into its own category.

— Companies such as IBM, Microsoft, Oracle, SAP, and Teradata have all have strong franchises in the data warehousing market, and all except Teradata offer widely used business intelligence software. These companies have the financial strength to develop, market and cross-sell machine learning software to their existing customer base, and can impact the market if they choose to do so.

Dell and HPE dabbled in advanced analytics and exited the market in 2016.

I covered Google and Amazon Web Services in Part One. Although neither company has a strong position in business analytics at present, they are making moves in that direction. Google set up Google Cloud Machine Learning as a distinct product group this year to service that market, and Amazon introduced QuickSight, a business analytics service.

Regular readers know that I favor open source software — as do most data scientists. Among the companies covered in this installment, IBM and Microsoft are making substantial commitments to the open source model, including direct contributions to open source software projects. They deserve kudos for that. Teradata is investing in Presto SQL, for which they get polite applause. Oracle and SAP leverage open source software in their solutions but make no significant contributions. SAS embraces open source the way a cat embraces a porcupine.

In Part Four, I will survey machine learning startups, and deliver results from the Bottom Story of the Year poll.

SAS

SAS leads the market in licensing revenue for advanced and predictive analytics software, according to IDC. The company has a loyal following among statisticians, actuaries, life scientists and others whose work depends on statistical analysis.

Partnering with IBM, SAS built its business in the 1970s on the strength of its software for the IBM System/360 mainframe. IBM promoted the software to its enterprise customers to increase adoption and use of its hardware. SAS software still runs on the mainframe, and the company continues to earn a significant share of its revenue on that platform. IBM has mainframe customers who use the big box exclusively for SAS.

In the 1990s, SAS successfully transitioned to a multi-vendor architecture and rebuilt its software to run on many different hardware platforms and operating systems. During this period, SAS established a reputation for industrial-strength and enterprise-grade software — in contrast to vendors like SPSS, who focused on building easy-to-use software for the desktop.

On the face of it, SAS has struggled to transition from server-based computing to the contemporary world of distributed architecture and cloud platforms. In the past ten years, the company has announced multiple initiatives to improve the performance and scalability of its products, with mixed success. In April, SAS announced Viya, its third attempt to deliver advanced analytics in a distributed MPP architecture.

What is SAS Viya? How does it differ from SAS’ previous attempts at high-performance design? Let’s peruse the brochure:

Cloud-ready, elastic and scalable

 

SAS Viya is built to be elastic and scalable for both private and public clouds. Analytical, in-memory computations are optimized for unconstrained environments, but they can also adjust for constrained environments. The elastic processing automatically adapts to needs and available resources – spinning up or winding down computing capacity as needed. Elastic scalability lets you quickly experiment with different scenarios and apply more complex approaches to larger amounts of streaming data.

Ahem. Any software is “cloud-ready,” in the sense that a Linux instance is a Linux instance whether it runs on-premises or in the cloud. And any software is elastic when you deploy it in a virtual appliance, such as an Amazon Machine Image. That includes SAS 9.4, which SAS touted as “cloud-ready” in 2014, and previous versions of SAS, which you could deploy in AWS even though SAS did not formally support the platform.

If you want to spin up software instances, however, you need software licenses. With open source software, such as Python, R, or Spark, that’s not an issue — you can spin up as many instances as you like without violating license agreements. Commercial software is more complicated since you need to pay for the licenses you want to spin up. Some vendors, like HPE and Teradata, tried to address this problem by marketing their own cloud platforms to compete with Amazon Web Services; they failed miserably. Others, like Oracle, partner with AWS to deliver their software in the cloud — either as a bundled managed service or on a “Bring Your Own License” (BYOL) model.

You can’t have elastic computing with commercial software without a flexible licensing model. Pay-for-what-you-use licensing poses a problem for vendors like SAS, because if customers only pay for what they use, they invariably pay a lot less than they do under term licensing. Most commercial software customers are over-licensed — they’re paying for a lot of software they don’t use. That is why revenue from on-premises software licensing is declining much faster than revenue from cloud-based subscriptions is rising. In the cloud, you can do more with less.

The bottom line is this: unless Viya is available under an elastic pricing model, nobody cares that it is “cloud-ready, elastic and scalable.”

If you want to have a little fun, the next time your SAS rep touts Viya’s elasticity, ask him what it will cost per hour to license the software. Watch him squirm.

Open analytics coding environment

 

Empower your data scientists with SAS Analytics that are easily available from a variety of programming languages. Whether it’s a Python notebook, Java client, Lua scripting interface or SAS, your modelers and data scientists can easily access the power of SAS for data manipulation, advanced analytics and analytical reporting.

We’ve all been waiting for the ability to run SAS from Lua.

Resilient architecture with guaranteed failover

 

For answers you depend on, you need analytical processing power you can count on. You need all your analytical computations to finish processing without interruption. The fault-tolerant design of SAS Viya automatically detects server failure, even in multiplatform processing environments, and redistributes processing as needed. It also manages several copies of data on the processing cluster. If a machine in the cluster becomes unavailable or fails, the required data is retrieved from another block to quickly continue processing. These self-healing mechanisms ensure high availability for uninterrupted processing and automated recovery.

“It runs on Hadoop.”

Interviewed in Forbes, SAS CEO Jim Goodnight speaks at length about Viya:

We are ready for big data…(we) just released our first version of our new Viya architecture, which is massively parallel computing where we spread the data out over dozens of servers and then use all the cores inside those servers to process the data in parallel. So we might have 500 cores working on the data all at once in parallel, and that allows it to handle some really, really big problems that we’ve never even thought of before. Things like logistic regression.

Someone should feed Dr. G. better talking points. Just for the record, commercially available software for logistic regression running in a massively parallel (MPP) environment first hit the market in 1989. Distributed logistic regression is currently available in multiple software packages, including one introduced by SAS five years ago.

Logistic regression (a non-linear model) is an iterative process. Essentially, you’re trying to estimate the parameters in the model, and so you take a guess, you’ve got to run through the data using that guess, then to refine it and do another guess and run through the data again, and you keep doing this over and over and over until the parameters converged or they don’t change much at all anymore. That can take 25 to 30 passes of the data. Now, in the old days, we used to have to read the data that many times. Now, it’s in memory. We put it in memory and it stays in memory. It’s spread out over 500 cores and then each one just does a little piece of the work, and so we can do those 25 iterations in just a few minutes, whereas it used to take hours.

It’s just like Spark, but with a license key.

(Viya’s) really our third generation of massively parallel computing. We’ve been working on this problem for seven years, and this is our third major crack at doing it, and this time we’ve got everything figured out.

In 2018 he’ll be talking about a fourth crack in nine years.

It’s possible that Viya works better than SAS’ previous cracks at high-performance analytics. That is a weak hurdle, however; SAS needs to demonstrate that its high-cost proprietary distributed framework is better than Apache Spark, which is rapidly emerging as the standard enterprise platform for Big Data.

While SAS supports machine learning techniques in several different products, it lags in deep learning. The SAS Marketing team created some helpful content about deep learning, but look carefully at that page — you won’t find an actual product for deep learning. Yes, I know that SAS Enterprise Miner supports multilayer perceptrons; but SAS does not support GPUs, Xeon Phi, Intel Nervana or any other high-performance architecture that will make it possible for you to train a deep neural net while you’re young.

If you think that an eighteen-year-old product running on one server is sufficient for your deep learning project, you should definitely talk to SAS. Keep in mind, though, that there is a reason that NVIDIA’s DGX-1 GPU-accelerated deep learning box has the power of 250 conventional servers: you actually need that kind of horsepower.

The rest of SAS’ business seems to be chugging along well enough. A combination of renewals, upgrades and upsells in existing accounts should produce low single-digit revenue growth for 2016, which is not a bad track record when you consider the declines reported by IBM, Oracle, and Teradata.

Business Analytics Leaders

The five companies in this group sell at least a billion dollars a year in business analytics software, according to IDC’s most recent worldwide software market share report. However, most of their revenue comes from data warehousing and business intelligence software; they all trail SAS in predictive analytics revenue.

Software licensing revenue is a misleading measure, however, due to the growing presence of open source software. IBM, Microsoft, and Oracle for example, actively use open source machine learning software to extend the reach of their data warehousing and business intelligence platforms, where they both have strong entries. IBM uses Spark as a foundation for many of its products; Microsoft has integrated R with SQL Server and PowerBI, and actively promotes the use of R for its enterprise customers. Oracle has taken a similar approach.

IBM

Unlike SAS, declining tech giant IBM never invested in a proprietary distributed framework for SPSS, its flagship software for advanced analytics. Instead, the company chose to leverage in-database engines (DB2, Netezza, and Oracle) and open source frameworks (MapReduce and Spark.)

IBM contributes to Apache Spark, which it uses in several products, and also to Apache SystemML. IBM Research developed the core of SystemML, which IBM donated to Apache in 2015. IBM has also visibly contributed to the Spark community through its efforts in education and training.

In 2016, IBM continued to market SPSS Statistics and SPSS Modeler, software brands it acquired in 2007. Release 18 of SPSS Modeler, announced in March, includes such things as support for machine learning in DB2 and support for IBM’s General Parallel File System (GPFS) in BigInsights. There aren’t too many data scientists who care about such things, but they appeal to the 150 or so enterprises with CIOs who still believe that nobody ever got fired for buying IBM.

In Part One of this review, I covered IBM’s machine learning moves in IBM Cloud, which I would characterize as Shakespearean, as in Much Ado About Nothing.

Microsoft

Microsoft had quite a year in machine learning and deep learning. As I noted in Parts One and Two, in 2016 MSFT launched cognitive APIs in Azure for vision, speech, language, knowledge, and search; a managed service for Spark in Azure HDInsight; enhancements to Azure Machine Learning and Version 2.0 of its deep learning framework, rebranded as Microsoft Cognitive Toolkit.

That’s just for starters.

In January, Microsoft announced Microsoft R Server, a rebranding of the product it acquired with Revolution Analytics in 2015. Microsoft R Server includes an enhanced R distribution, a scalable back-end, and integration tools. During the year, Microsoft two major releases for R Server. In Release 8, the company added push-down integration with Spark. Release 9 updated the Spark integration for Spark 2.0, and added MicrosoftML, a new R package for machine learning.

Microsoft announced SQL Server 2016 in March with embedded SQL Server R Services. On the Revolutions blog, David Smith reports on the launch. Tomaž Kaštrun explains what you can do with R services in SQL Server.

In November, after an extended preview, Microsoft announced the general availability of R Server for Azure HDInsight, a scale-out implementation of R integrated with Spark clusters created from HDInsight.

Also in Azure, Microsoft added a Linux version of the Data Science Virtual Machine (DSVM). Previously available as a Windows instance, DSVM includes Revolution R Open, Anaconda, Visual Studio Community Edition, PowerBI Desktop, SQL Server Express and the Azure SDK.

PowerBI, Microsoft’s powerful visualization tool, added R support in August. In ComputerWorld, Sharon Machlis, an R user, enthused. More here, on the Revolutions blog.

R Tools for Visual Studio launched to public preview in March, and to general availability in September. Also in September, Microsoft released the Microsoft R Client, a free data science tool that works with Microsoft R Open and the ScaleR distributed back end.

Microsoft data scientists Gopi Krishna Kumar, Hang Zhang and Jacob Spoelstra developed a methodology for data science, which they presented at the Microsoft Machine Learning and Data Science Summit 2016 in September. David Smith reports. The method, which the authors call Team Data Science Process, includes a standard directory structure for managing project artifacts using a system such as Git. It also includes open source utilities to support the process.

Other than that, it was a quiet year in Redmond.

Oracle

Oracle has a surprisingly robust set of machine learning tools that appeal to Oracle-centric organizations. They include:

Oracle Data Mining (ODM), a suite of machine learning algorithms that run as native SQL functions in Oracle Database.

Oracle Data Miner, a client application for ODM with a business user interface.

Oracle R Distribution (ORD), an enhanced free R distribution.

Oracle R Enterprise (ORE), Oracle R Distribution packaged with tools to integrate R with Oracle Database.

Oracle R Advanced Analytics for Hadoop (ORAAH), a set of R bindings with native algorithms and an interface to Spark.

Oracle claims that ORAAH’s native algorithms are faster than Spark, but ORAAH has only two algorithms, so nobody cares. Oracle OEMs Cloudera, so the Spark release is at least one major release behind the rest of the world.

Other than some dot releases for the components cited above, I don’t see a lot of movement for Oracle in 2016.

SAP

SAP introduced an update to its predictive analytics capabilities, now branded as SAP Business Objects Predictive Analytics 3.0. This product includes two separate automation capabilities, one branded as Predictive Factory, the second as HANA Automated Predictive Library. Predictive Factory, like SAS Factory Miner, is a scripting tool that enables a data scientist to create a modeling pipeline and schedules it for execution; it does not automate the data science process itself.  HANA Automated Predictive Library is a set of functional calls that users can include in SQL scripts.

HANA Automated Predictive Library is a set of functional calls that users can include in SQL scripts. It’s a product that might appeal to SAP HANA bigots and nobody else.

SAP acquired KXEN and its InfiniteInsight software in 2014. Customer satisfaction promptly dropped through the floor, and SAP trails all other advanced analytics vendors rated in a Gartner survey. Legacy InfiniteInsight customers fall into two camps: (a) those whose IT organizations are heavily invested in SAP, and (b) everyone else. The former seem to be sticking with the software as SAP integrates it into its product line; the latter are heading for the exits.

Teradata

Declining data warehouse vendor Teradata thinks of itself as an analytics powerhouse. In reality, most of its revenue comes from data warehousing, where the company gets high marks from analysts like Gartner.

You could say that Teradata has a commanding position at the bottom of the analytics stack.

Teradata’s executive leadership — if you can call it that — completely missed the implications of Hadoop and cloud computing. Instead, they bet that the Teradata brand was beloved by IT executives, who would keep on buying boxes in bulk. As a result of that blinkered view of the world, the company today is worth a third of what it was worth five years ago. Its product sales have declined for ten straight quarters, seven in a row at double digits.

After a dismal first quarter, Teradata’s board fired accepted the resignation of CEO Mike Koehler; longtime board member Victor Lund stepped into the breach. In September, at the Teradata Partners conference, Lund announced that Teradata would reposition itself as an “analytics solutions” firm.

That may not sit well with SAS, Teradata’s primary partner for advanced analytics software, which also views itself as an “analytic solutions” firm. The difference, of course, is that SAS has been delivering solutions for a long time and has street cred with executives because it actually has sophisticated business solutions, with actual software and intellectual property, while Teradata appears to have little more than big ideas and PowerPoint.

Pro tip for Teradata management: just because you want to move up the value chain does not mean that you have the ability to do so.

In other developments, the company announced that Aster finally supports Spark, two years after anyone might have cared. Teradata also announced that Aster’s analytics are now available for deployment in Hadoop. Aster on Hadoop is a bladeless knife without a handle — a commercial machine learning library that competes with umpteen open source libraries. Aster also competes with another Teradata partner, Fuzzy Logix, whose dbLytix library is six times richer and more mature.

If someone proposes to bet that “solutions” and unbundled Aster will reverse Teradata’s decline, take the under.

Other Tech Giants

We mention two remaining giants, Dell and HPE, only to note their passing from the scene.

HPE

HPE announced the sale of its software assets (including Vertica and Haven) to U.K.-based Micro Focus for $2.5 billion in cash. Under terms of the deal, Micro Focus also granted equity with a soft valuation of $6.3 billion directly to HPE shareholders. HPE paid almost $20 billion over ten years for these assets. The valuation works out to about 2.4 times revenue, which means that both parties agree the business has little or no growth potential. Micro Focus has a reputation for firing people cutting costs, so if you’re working for Haven or Vertica, this may be a good time to dust off your resume.

In March, HPE announced Haven OnDemand, available on Microsoft Azure. Haven is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, initially branded as HAVEn and announced by HP in June 2013.  In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform. So the March announcement is a re-re-release of the software.

Three years into its product life cycle, Haven hasn’t exactly caught on with data scientists. Just 2 out of 2,895 respondents to the KDnuggets 2016 Data Science Software Usage poll and none in the O’Reilly 2016 Data Science Salary Survey said they use the software. Adding insult to injury, Haven failed to make KDnuggets’ list of the top 50 machine learning APIs, a list that includes the likes of Ersatz, Hutoma, and Skyttle.

Vertica still has some traction with data lovers whose analysis needs are simple enough to satisfy with SQL. Currently, it’s the 28th most popular relational database, according to DB-Engines, which is about on par with Netezza and Greenplum and a lot better than Aster. Expect this ranking to drop like a stone in the hands of Micro Focus.

Dell/EMC

Dell entered the advanced analytics business by acquiring Statsoft in 2014, a move that impressed nobody. In 2016, Dell exited by selling its software division to private equity investors.

Goodbye, Dell. We hardly knew ye.

Disruption: It’s All About the Business Model

This post is an excerpt adapted from my book, Disruptive Analytics, available soon from Apress and Amazon. (Note: under my contract with Apress I am legally obligated to link to their site, but it’s not yet possible to order the book there. Use the Amazon link if you want the book.)

The analytics business is booming. Technology consultant IDC estimates total spending for analytic services, software and hardware exceeded $120 billion in 2015; through 2019, IDC forecasts that spending will increase to $187 billion, an 11% compound annual growth rate.

Powerful forces are at work in the economy today:

  • Digital transformation of the economy and rapidly declining storage costs combine to create a flood of data.
  • The number of data sources is exploding. Data sources are everywhere: on-premises, in the cloud, in consumers’ pockets, in vehicles, in RFID chips, and so forth.
  • The “long march” of Moore’s Law: cheap computing power makes machine learning and deep learning techniques practical.

So, if analytics is such a hot field, why are the industry leaders struggling?

  • Oracle’s cloud revenue growth fails to offset declining software and hardware sales.
  • SAP’s cloud revenue grows, but total software revenue is flat.
  • IBM reports seventeen straight quarters of declining revenue. Mass layoffs
  • Microsoft underperforms analysts’ expectations despite 120% growth in Azure cloud revenue.
  • Predictive analytics leader SAS reports five years of low single-digit revenue growth; Executive Vice President and Chief Marketing Officer departs.
  • Data warehousing leader Teradata shuffles its leadership team after four years of declining product revenue.

Product quality is not the problem. Each company offers products that industry analysts rate highly:

  • Forrester and Gartner recognize IBM, SAS, SAP and Oracle as leaders in data quality tools.
  • Gartner rates Oracle, SAP, IBM, Microsoft and Teradata as leaders in data warehousing.
  • Forrester rates Microsoft, SAP, SAS, and Oracle as leaders in agile business intelligence.
  • Gartner recognizes SAS and IBM as leaders in Advanced Analytics.

The answer, in a word, is disruption. Clayton Christensen of the Harvard Business School outlined the theory of disruptive innovation in 1997. Summarizing the argument briefly:

  • Industries consist of value networks, collections of suppliers, channels, and buyers linked by relationships.
  • Innovations disrupt industries when they create a new value network.
  • Not all innovations are disruptive. Many are introduced by market leaders to sustain a competitive position.
  • Disruptive innovations tend to be introduced by outsiders.
  • Purely technological innovation is not disruptive; what matters is the business model enabled by the new technology.

For a more detailed exposition of the theory, read Christensen’s book.

Christensen identified two forms of disruption. Low-end disruption occurs when industry leaders enhance products faster than customers can assimilate the enhancements; the disruptor enters the market with a “good enough” product and a better value proposition. The disruptor’s innovation makes it possible to serve customers at a lower cost than the industry leaders can deliver.

New market disruption takes place when the disruptor innovates in ways enabling it to serve customers that are not served by the industry leaders.

Technology alone does not disrupt industries; incumbents can and do innovate. New business models enabled by new technology are the cutting edge of disruption. Frequently, incumbents cannot respond effectively to new business models; this is partly due to “blinders” caused by changing value networks, and partly out of fear of cannibalizing existing business arrangements. Two business models, in particular, are disrupting the business analytics world today:

  • Open source software business models offer an increasingly attractive alternative to commercial software licensing. The Hadoop ecosystem displaces conventional data warehousing; R and Python displace commercial software for advanced analytics.
  • The elastic business model made possible by cloud computing undercuts conventional software licensing. When customers pay only for what they use, they pay a lot less.

Disruption does not mean that leading companies like Oracle, IBM and SAS will go out of business. Blockbuster may be the poster child for disrupted businesses, but most cases are less dire; for the business analytics leaders, disruption means they will struggle to grow. Slow growth is less benign than it sounds. As McKinsey notes, the rule today is “Grow or Go”: companies that cannot define a credible growth strategy will be acquired by other companies or by private equity.

The alternative to revenue growth is increasing profitability. But when revenue is flat or declining, that usually means job cuts.

job-cuts
Disruption looks like this.

Consider what happened to Teradata. Late in 2012, the company started missing sales targets; in early 2013, it stunned investors by reporting an absolute decline in sales. Management offered excuses; Wall Street punished the stock, driving it down by half in the face of a bull market for tech stocks.

Teradata’s leadership continued to miss sales and earnings targets; Wall Street drove the stock price down to a fraction of its 2012 peak. While it is tempting to blame the problem on poor leadership, Teradata’s persistent failure to accurately forecast its sales and earnings is a clear sign that its leadership no longer understood the value networks in which they operated. The world had changed; the value networks created in Teradata’s rise to leadership no longer existed; the mental models managers used to understand the market no longer worked.

There are two distinct types of disruption. The first is disruptive innovation within the analytics value chain. Here are two recent examples:

Hadoop. The Hadoop ecosystem disrupts the data warehousing industry from below. Hadoop does not do everything a relational database can do, but it does just enough to offer an attractive value proposition for the right use cases. When first introduced, Hadoop’s capabilities were very limited compared to data warehouse appliances. But Hadoop’s flexibility and low cost were highly attractive for applications that did not need the performance and features of a data warehouse appliance. While established vendors struggle to maintain flat and declining revenue, companies that offer solutions built on Hadoop grow at double-digit rates.

Tableau. Tableau virtually created the market for agile, self-service discovery. The charting and visualization features in Tableau are available in mainstream business intelligence tools. But while business intelligence vendors target the IT organization and continually add complexity to their product, Tableau targets the end user with a simple, easy to use and versatile tool. As a result, Tableau has increased its revenue tenfold in five years, leapfrogging over many other BI vendors.

Disruption within the analytics value chain is pertinent for readers who plan to invest in analytics technology for their organization. Technologies at risk of disruption are risky investments; they may have abbreviated useful lives, and their suppliers may suffer from business disruption. Taking a “wait-and-see” attitude towards disrupted technologies makes good sense, if only because prices will likely decline in the future.

The second type is disruption by innovations in analytics. Examples of disruption by analytics are harder to find, but they do exist:

Credit Scoring. General-purpose credit scoring introduced by Fair, Isaac and Co. in 1987 virtually created a national market in credit cards.  Previously, banks issued credit cards to their local customers, with whom they had an established relationship. Uniform credit scoring enabled a few large issuers to identify creditworthy clients in the general population, without a prior relationship.

Algorithmic Trading. When the U.S. Securities and Exchange Commission authorized electronic trading in regulated securities in 1998, market participants quickly moved to develop algorithms that could arbitrage between markets, arbitrage between indexes and the underlying stocks and exploit other short-term opportunities. Traders that most effectively deployed machine learning for electronic trading grew at the expense of other traders.

For startups and analytics practitioners, disruption by analytics is essential. Startups must disrupt their industries if they want to succeed. Using analytics to differentiate a product is a way to create a disruptive business model or to create new markets.

There is a common theme across the four examples: the business model enabled by the technology and not the technology itself drives the disruption. Hadoop and Tableau do less than the legacy products they compete against; what they do, however, is sufficient for a class of use cases, for which they provide a better value proposition. Credit scoring and algorithmic trading created fundamentally new ways to lend and invest; while these applications attracted technological innovations as they expanded, it was the new business models they created that disrupted the lending and investing industries.

To illustrate the importance of the business model, consider the case of columnar serialization, a significant innovation in data warehousing that did not disrupt the industry. In 2005, Vertica introduced a commercial columnar database, a technology that is well-suited to high-performance analytics (as we explain in Chapter Two of Disruptive Analytics). Vertica successfully built a customer base, but did not create a unique business model; by 2010 the leading data warehouse vendors had introduced columnar serialization into their products. HP acquired Vertica in 2011 for about $250 million, a price well below the $1.7 billion IBM paid for Netezza, a competing data warehouse appliance vendor.

Here are some takeaways for the reader to consider.

First, if you want to invest in new business analytics technology, ask yourself:

  • Are we paying for what we use, or for what we might use?
  • What particular value do commercial software options offer over open source alternatives?

Second, if you want to use analytics to create a disruptive innovation, ask yourself:

  • What new business model does this support?
  • Can we disrupt incumbents from below with a better value proposition?
  • Can we reach new markets and new customers who are underserved by existing value networks?

There is one additional takeaway: nobody ever disrupted anything by managing data. Keep that in mind the next time a data warehousing vendor tries to tell you that their Big Box is a “strategic” investment. We’ll explore that in another excerpt from the book.

Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R ranks fifth; Matlab, fourteenth; Scala, fifteenth; Julia thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release.)

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

More stories herehereherehereherehereherehere, and here.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegualp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Big Analytics Roundup (July 25, 2016)

We have some more summer reading this week; plus, Splice Machine announces availability of its open source Community Edition, and Google launches two new machine learning APIs. There are so many Spark stories I’ve created a special section for them. Plus we have the usual explainers, perspectives, and news.

Quant headhunter Linda Burtch repeats her survey of working analysts in her network. Preference for using SAS has steadily declined over the three years she has conducted the poll; this year a clear majority chose R or Python over SAS. Preference for open source correlates with education; the more you know, the less likely you are to use SAS.

Oracle, IBM, SAP, and Microsoft have all reported Q2 revenue and earnings, but Teradata is still crunching the numbers. I’ll do a general earnings roundup when TDC gets around to reporting its numbers. TDC’s stock price has outperformed the others since June 30, which suggests the market expects a good second quarter. Meanwhile, TDC acquires another consultancy and reveals who bought Aprimo.

Summer Reading

Adrian Colyer lists his five favorite papers from the past several months and outlines his philosophy, which you must read. And here is another link to last week’s top paper on data bazaars versus data cathedrals.

Splice Machine Shifts to Open Core

Hadoop-based RDBMS vendor Splice Machine announces general availability for its open source community edition and offers a sandbox hosted on AWS.  Sam Dean approves; Andrew Brust reports; Dave Ramel explains. Jack Germain describes Splice Machine’s changing business model.

Spark Stories

— Databricks’ Spark survey is still accepting responses. Go and fill it out if you have not done so already.

— The Spark PMC has voted favorably on a release candidate for Spark 2.0, which is now in packaging for general availability.

— On the Databricks blog, Jules Damji corrals Spark news from the past two weeks.

— Alex Woodie touts LevyxSpark, an enhanced Spark distribution based on open source Apache Spark. LevyxSpark includes some open source enhancements, plus Levyx Helium, an SSD-based key-value store.

— In a webcast, Alexander Ulanov summarizes options for deep learning on Spark.

— Sam Weaver explains how to use the new MongoDB connector for Spark.

Explainers

— Nita Dembla and Gopal Vijayaraghavan explain improvements in Hive 2.1.

— Siddharth Anand introduces Apache Airflow (Incubating), a platform to author, schedule, and monitor DAGs. Sounds like Apache Beam.

— Data Artisans’ Stephan Ewan explains savepoints in Apache Flink.

Perspectives

— Jack Clark profiles Google’s land grab in deep learning. Short version: TensorFlow is blowing away Caffe, Torch, Theano, dl4j, CNTK, and DSSTNE.

— Greg Satell theorizes about Google’s open source strategy as if a “razor and blades” strategy is something new and brilliant.

— In Fortune, Barb Darrow profiles cloud computing’s disruptive impact.

— Sam Dean confuses machine learning with artificial intelligence.

— Syncsort’s Paige Roberts interviews Dr. Ellen Friedman.

— Drew Breunig poses a theory about the business implications of machine learning.

— BuzzFeed’s Adam Kelleher attempts to explain bias, fails.

— IBM exec Rob Thomas co-authors a blog about machine learning. It’s about what you would expect from an IBM exec.

Open Source News

— Open source columnar storage engine Apache Kudu graduates to top-level status.

— Apache Chukwa announces Release 0.8, with security bug fixes, FWIW. Chukwa captures logs from distributed systems for monitoring and analysis. No, I never heard of it either.

Commercial Announcements

— Google announces open beta for its Cloud Natural Language and Cloud Speech APIs.

Hardware News

— Inspur, which claims to be China’s largest server manufacturer, announces availability of the Memory1 line of servers for big analytics. Inspur uses high-capacity flash DIMMs and memory expansion software to deliver up to 2TB of memory per server and up to 80TB per rack.

— Startup Wave Computing announces plans for a family of deep learning computers. Good luck to them. The history of computing isn’t kind to special purpose machines, which tend to eventually get buried by general purpose machines.

Funding News

— Redis Labs lands a $14 million “C” round led by Bain Capital and Carmel Ventures. Redis claims 6,200 enterprise customers and 55,000 accounts for its cloud service.

— Sift Security emerges from stealth, announces $3.25 million in angel funding. Sift uses graph analytics running on Spark and TitanDB to identify linked threats and incidents.

Big Analytics Roundup (March 21, 2016)

Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.

— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.

— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:

  • Most popular technologies for math and data: Python and SQL.
  • Top paying technologies: Spark and Scala.
  • Top paying tech for data scientists: Scala, Spark and Hadoop.
  • Top tech stack for data scientists: Python + R + SQL.
  • Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
  • Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
  • Biggest challenge at work (all respondents): Unrealistic expectations.
  • Purchasing power of developers in South Africa: 25,713 Big Macs per year.

— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.

— Daniel Chalef of Domino Data analyzes data from Google Trends and StackOverflow, discovers that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchin’s blog on the popularity of analytics software. Search interest is one data point, Bob’s work with job postings offers a better picture of the actual state of the market.

— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.

Explainers

— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.

— This week from the morning paper:

  • Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
  • … explains social engineering attacks and potential defenses.
  • explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scaleability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.

— MapR’s Tugduall Grall explains what Spark is, what it does, and what sets it apart.

— In SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.

— On the Databricks blog, Denny Lee et. al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.

— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.

— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.

Perspectives

— Jack Vaughan interviews some old guy who thinks Spark is a thing.

— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.

— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.

— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design.  Pure Hadoop is simple: MapReduce and HDFS.  Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.

— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python.  He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.

— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.

— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.

— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”

— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.

— In Martech Advisor, Ankush Gupta Mar interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.

— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.

— Srini Penchikala interviews the authors of Spark in Action, an upcoming book from Manning.

Teradata Watch

— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.

— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”

Open Source Announcements

— Three announcements from Apache projects:

  • Apex announces release 3.3.1 of the Malhar library, a maintenance release.
  • Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
  • Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.

Commercial Announcements

— SAP announces general availability for SAP HANA Vora, a tool that enables HANA users to query data in Hadoop and other distributed storage platforms through Spark. In CIO, Thor Olavsrud reports.

— Dataiku announces that it has hired two new Veeps to drive expansion in North America.

— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.

— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.

Big Analytics Roundup (March 14, 2016)

HPE wins the internet this week by announcing the re-re-release of Haven, this time on Azure.  The other big story this week: Flink announces Release 1.0.

Third Time’s a Charm

Hewlett Packard Enterprise (HPE) announces Haven on Demand on Microsoft Azure; PR firestorm ensues.  Haven  is a loose bundle of software assets salvaged from the train wreck of Autonomy, Vertica, ArcSight and HP Operations Management machine learning suite, originally branded as HAVEn and announced by HP in June, 2013.  Since then, the software hasn’t exactly gone viral; Haven failed to make KDnuggets’ list of the top 50 machine learning APIs last December, a list that includes the likes of Ersatz, Hutoma and Skyttle.

One possible reason for the lack of virality: although several analysts described Haven as “open source”, HP did not release the Haven source code, and did not offer the software under an open source license.

Other than those two things, it’s open source.

In 2015, HP released Haven on Helion Public Cloud, HP’s failed cloud platform.

So this latest announcement is a re-re-release of the software. On paper, the library looks like it has some valuable capabilities in text, images video and audio analytics.  The interface and documentation look a bit rough, but, after all, this is a first third release.

Jim’s Latest Musings

Angus Loten of the WSJ’s CIO Journal interviews SAS CEO Jim Goodnight, who increasingly sounds like your great-uncle at Thanksgiving dinner, the one who complains about “these kids today.”  Goodnight compares cloud computing to mainframe time sharing.  That’s ironic, because although SAS runs in AWS, it does not offer elastic pricing, the one thing that modern cloud computing shares with timesharing.

Goodnight also pooh-poohs IoT, noting that “we don’t have any major IoT customers, and I haven’t seen a good example of IoT yet.”  SAS’ Product Manager for IoT could not be reached for comment.

Meanwhile, SAS held its annual analyst conference at a posh resort in Steamboat Springs, Colorado; in his report for Ventana Research, David Menninger gushes.

Herbalife Messes Up, Blames Data Scientists

Herbalife discloses errors reporting non-financial information, blames “database scripting errors.” The LA Times reports; Kaiser Fung comments.

Explainers

— Several items from the morning paper this week:

  • Adrian Colyer explains CryptoNets, a combination of Deep Learning and homohorphic encryption.  By encrypting your data before you load it into the cloud, you make it useless to a hacker.
  • Adrian explains Neural Turing Machines.
  • Adrian explains Memory Networks.
  • Citing a paper published by Google last year, Adrian explains why using personal knowledge questions for account recovery is a really bad thing.

— Data Artisans’ Robert Metzger explains Apache Flink.

— In a video, Eric Kramer explains how to leverage patient data with Dataiku Data Science Studio.

Perspectives

— In InfoWorld, Serdar Yegulalp examines Flink 1.0 and swallows whole the argument that Flink’s “pure” streaming is inherently superior to Spark’s microbatching.

— On the MapR blog, Jim Scott offers a more balanced view of Flink, noting that streaming benchmarks are irrelevant unless you control for processing semantics and fault tolerance.  Scott is excited about Flink ease of use and CEP API.

— John Leonard interviews Vincent de Lagabbe, CTO of bitcoin tracker Kaiko, who argues that Hadoop is unnecessary if you have less than a petabyte of data.  Lagabbe prefers Datastax Enterprise.

— Also in InfoWorld, Martin Heller reviews Azure Machine Learning, finds it too hard for novices.  I disagree.  I used AML in a classroom lab, and students were up and running in minutes.

Open Source Announcements

— Flink announces Release 1.0.  DataArtisans celebrates.

Teradata Watch

CEO Mike Koehler demonstrates confidence in TDC’s future by selling 11,331 shares.

Commercial Announcements

— Objectivity announces that Databricks has certified ThingSpan, a graph analytics platform, to work with Spark and HDFS.

— Databricks announces that adtech company Sellpoints has selected the Databricks platform to deliver a predictive analytics product.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.   You can get a free copy here from RapidMiner (registration required.)  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, H2O.ai, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model, like H2O.ai.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

— Accenture makes the analysis this year, according to Gartner, because it acquired Milan-based i4C Analytics, a tiny little privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing;  hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

Alpine Data Labs declines a lot in “Ability to Deliver,” which makes sense since they appear to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum Pivotal and EMC Dell, and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users,” which is spot on — anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’ scores are up for customer satisfaction and delivering business value, which suggests that whoever it is at Gartner that decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

— Dell: Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

— KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analytic Services are less than thrilled with that product.

Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required. 

RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

SAS declines slightly in Ability to Deliver.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.

End of the Jim and Jim Show

On Monday, December 7, SAS EVP and CMO Jim Davis resigned to take “a leadership role” at Informatica.  Davis was effectively the second in command at SAS since 2001, and widely seen as next in line of succession when owner and CEO Jim Goodnight decides to sell, retire or exit in some other fashion.

“Highly marketable people get job opportunities presented to them all the time,” said SAS spokeswoman Shannon Heath in a statement for The News & Observer.  “This was obviously an opportunity that he felt strongly about and we wish him the best in that and appreciate all of his contributions.

Uh-huh.

On the WRAL TechWire blog, Rick Smith opines that the loss of Davis is a “crushing blow” for SAS.  SAS pushed back against that post, while Smith wondered what sort of role Davis would take.

A “crushing blow?”  There’s only one person at SAS whose departure will blow the place up.  I remember hearing Jim Goodnight speak about ten years ago; he spoke for fifteen minutes, without notes, about the business.  The audience loved it.  Jim Davis was next on the agenda; he delivered about a hundred professionally produced Powerpoint slides, complete with animated pyramids and such.  For more than an hour he went on and on, talking nonsense, while the back half of the auditorium headed for the exits.

In an interview with Smith, Davis disclosed that he will be the EVP and CMO of Informatica.  According to Smith, Davis says that even though he will have the same title at Informatica, which is a third the size of SAS, he “does not see it as a lateral move.”  He also said that he would not have gone to work for a direct competitor of SAS, and he “did not see Informatica and SAS as direct competitors” even though SAS earns a quarter of its revenue from data quality and ETL software.

Perhaps we should call Davis “Baghdad Jim.”

Personally, I suspect that Davis was toast from the day about a year ago when Goodnight had to walk back a prediction of double-digit sales growth in 2014.  (Revenue actually grew 2%).

As a rule, CMOs do not walk or get axed when the topline looks good.  It’s possible that Davis’ departure is just what SAS says it is, a personal decision.  It’s also possible that SAS will post ugly numbers for 2015.  We should know by the end of the month.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is in on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software itself is nothing special; there are plenty of open source alternatives, like Apache Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.