Teradata’s Dim Prospects

On August 6, 2012, Teradata released its earnings report for the second quarter of 2012.  The results were excellent: revenue grew 18% and earnings per share (EPS) increased 28% over the prior-year quarter.

In a press release, CEO Mike Koehler wrote: “Our technology leadership and expertise in data warehousing, big data analytics and integrated marketing management uniquely position Teradata to help customers realize the greatest value from their information assets, while enabling them to reduce infrastructure costs.”

Teradata seemed poised to profit from the tsunami of Big Data hitting enterprises everywhere.  Soon thereafter, in September 2012, Teradata’s stock price hit $80, up more than 500% from its November 2008 low.

That was the high-water mark for Teradata.  In the third and fourth quarters of 2012, sales grew at single digits rather than double digits.  While company insiders whispered about missed sales targets, Koehler remained optimistic: “Teradata’s competitive position has never been stronger, and we are well positioned with our market-leading technology.”

Results belied Koehler’s optimism.  In the first quarter of 2013, Teradata stunned investors, reporting a double-digit drop in product sales.  “Teradata got off to a slow start in the first quarter of 2013,” wrote Koehler in an epitome of understatement.   Wall Street punished the stock, driving it down by half.

After leveling off in 2014, Teradata’s product sales fell again by double digits in 2015.  In August 2015, company employees reported several rounds of layoffs.

Since August 6, 2012, Teradata has lost 75% of its market value, a little less than $10 billion in value destroyed.  Adding insult to injury, Fortune dropped the company from its list of Most Admired Companies.

Heckuva job, Mike.

About two weeks ago, on February 4, Teradata held an earnings conference call for investors and analysts.  A questioner asked Koehler under what conditions Teradata would grow again.

He couldn’t answer the question.

Teradata: Three Years of Pain

The chart below illustrates how Teradata hit a wall in 2013, with flat revenue after years of rapid growth.

[Chart: Teradata annual revenue, flat from 2013 after years of rapid growth]

Focusing exclusively on the top line masks the depth of Teradata’s growth problem.  Teradata’s revenue breaks into three major categories:

Product revenue: sales of licenses to use Teradata’s databases on boxes.  Sales of perpetual licenses, Teradata’s preferred licensing model, are booked as revenue on delivery of the license.

Consulting revenue: the earned value of professional services in an accounting period.  Unlike product revenue, consulting revenue for large projects can span multiple accounting periods, since Teradata only recognizes revenue as the work is performed.

Maintenance revenue:  while Teradata customers own a perpetual license to use their data warehouses, they pay annual fees for technical support and software upgrades.  Vendors peg maintenance fees in the range of 15-20% of the cost of a perpetual license.  Most customers who continue to use a Teradata data warehouse continue to pay maintenance, so maintenance revenue is a good proxy for the size of the active installed base.


The chart below highlights the decline in Teradata’s product revenue.

[Chart: Teradata product revenue by year]

Product revenue is the critical measure of Teradata’s business, because it drives the other two categories.  Captive consulting operations in companies like Teradata tend to be product-centric, relying on product sales to drive business in installation, configuration, training, warehouse builds and so forth.  The value proposition differs markedly from independent systems integrators, who position themselves as vendor-neutral.

Consequently, Teradata’s consulting revenue is highly correlated with its product revenue, as shown in the chart below, which plots Teradata’s consulting revenue in each quarter for the past six years against the product revenue in the same quarter.

[Chart: quarterly consulting revenue plotted against product revenue]

Consulting revenue isn’t perfectly correlated with product sales, due to differences in revenue recognition.  Suppose that Teradata sells a large data warehouse project, with a total value of $X in product licenses and $Y in consulting services.  It will take several quarters to complete the project.  If the deal closes in the fourth quarter, Teradata recognizes the product revenue immediately, but recognizes the consulting revenue over subsequent quarters as the work is performed.
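
As a minimal sketch (with made-up numbers), here is how a single Q4 deal would flow into reported revenue under those rules:

```python
# Hypothetical deal: $10M of product licenses and $4M of consulting,
# closing in Q4 and delivered evenly over the following four quarters.
product_value = 10_000_000
consulting_value = 4_000_000
consulting_quarters = 4

# Product revenue is recognized entirely in the quarter the deal closes.
recognized = {"Q4-2012": {"product": product_value, "consulting": 0}}

# Consulting revenue is recognized as the work is performed.
for i in range(1, consulting_quarters + 1):
    recognized[f"Q{i}-2013"] = {
        "product": 0,
        "consulting": consulting_value / consulting_quarters,
    }

for quarter, revenue in recognized.items():
    print(quarter, revenue)
```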

That is why Teradata’s consulting revenue continued to increase in 2013 while product revenue declined, as consulting teams worked off the backlog of projects sold in 2012.  Unlike other vendors like IBM with significant consulting businesses, Teradata does not report the size of its consulting backlog.

Maintenance revenue can only grow through product sales that add to the active installed base.  If a customer buys a new Teradata box and uses it to decommission another box, maintenance revenue will remain roughly the same (depending on details of the negotiation).  Teradata’s maintenance revenue continued to increase through 2014, but was flat in 2015.

[Chart: Teradata maintenance revenue by year]

Bear in mind, though, that Teradata sold more than a billion dollars of product in 2014 and 2015, so maintenance should be increasing by $150-200 million a year.  Since maintenance revenue did not increase, the implication is that all or most of those sales were replacement business that did not expand the Teradata footprint.
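
The arithmetic behind that estimate, using the 15-20% maintenance rate cited above:

```python
# Back-of-envelope check: annual product sales times the typical maintenance
# rate gives the expected increment to maintenance revenue, assuming the
# sales expanded the installed base rather than replacing existing boxes.
annual_product_sales = 1_000_000_000   # "more than a billion dollars" per year
maintenance_rate_low, maintenance_rate_high = 0.15, 0.20

expected_increment = (annual_product_sales * maintenance_rate_low,
                      annual_product_sales * maintenance_rate_high)
print(expected_increment)  # (150000000.0, 200000000.0)
```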

Why Teradata Hit a Wall

Why did Teradata stop growing?

Management blames external factors, including a strong dollar, a soft economy, soft capital spending, long sales cycles and tight IT budgets.  These factors are real, but they do not explain Teradata’s sales weakness.

Currency movements affect commoditized products more than those with a strong customer franchise, since the vendor cannot sustain volume in the face of higher prices in the local currency.  All firms must deal with the same currency environment, but firms with a compelling value proposition grow anyway.  Apple sells a lot of product in non-dollar currencies, and its revenue is affected by a strong dollar; but Apple’s management does not whine about the strong dollar.

Soft capital spending affects big-ticket items like perpetual licenses for big-box data warehouses.  One solution, of course, is subscription pricing.  Many software companies, including leaders like Oracle and IBM, figured this out a long time ago, but Teradata has resisted, except in its own cloud offering.

Tightening IT budgets mean that vendors must work harder to demonstrate value and stay on the organization’s “must buy” list.  If Teradata is losing sales when IT budgets are tight, it is because Teradata has failed to define a compelling value proposition and to persuade customers that it can deliver value.  Tight IT budgets are a reality, and will continue to be a reality; Teradata must offer solutions that solve that problem for the customer.

It’s also important to note that while worldwide IT spending declined in 2015 (according to Gartner), the biggest decline (by far) was in communication services.  Meanwhile, IDC reports that worldwide dollar-denominated spending on Business Analytics software has increased every year since 2012.  IT organizations may be cutting back in some areas, but spending in Business Analytics remains strong.

In other words, organizations are buying.  They’re just not buying Teradata.

Why not?

The first reason is market saturation.  Virtually every enterprise that ever will invest in a conventional data warehouse already has one; those that don’t likely never will.  Koehler says that one pillar of Teradata’s growth strategy will be selling to the “thousands of companies that do not use Teradata.”   There’s an obvious problem with that approach: those companies aren’t using Teradata because they are using Oracle, DB2, SQL Server or something else, and they’re not going to toss what they have and buy Teradata just so Koehler will get a performance bonus this year.

The second reason is the maturation of Hadoop.  In Hadoop’s early years, most data architects imagined Hadoop as a kind of dumping ground for data, with batch processes to structure the data and load it into high-performance relational databases.  End users would work primarily with the relational databases, where they could have sub-second query responses, while Hadoop would serve as a batch ETL platform.

As Hadoop matures, however, that model is obsolete.  Tools like Impala, Hive-on-Tez, Spark and Drill deliver query response times that approach those of relational databases.  OLAP-on-Hadoop platforms like Kylin and AtScale make it possible for end users to point familiar tools like Excel and Tableau directly at Hadoop.
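
As a minimal illustration, here is the kind of interactive query that once required extracting data into a relational warehouse, written against the Spark 1.x Python API; the HDFS path and column names are hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Interactive SQL directly against files in Hadoop, with no
# extract-and-load step into a separate warehouse.
sc = SparkContext(appName="adhoc-query")
sqlContext = SQLContext(sc)

clicks = sqlContext.read.parquet("hdfs:///data/clickstream/2015/")  # hypothetical path
clicks.registerTempTable("clicks")

sqlContext.sql("""
    SELECT channel, COUNT(*) AS events, COUNT(DISTINCT user_id) AS users
    FROM clicks
    GROUP BY channel
    ORDER BY events DESC
""").show()
```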

Given the disruptively low costs of Hadoop compared to Teradata, anything that makes Hadoop more “enterprise-ready” cuts into Teradata’s franchise.

Structured data in a high-performance database remains the gold standard for high-value data.  However, most of the data that makes up the Big Data tsunami is data whose value is either unknown or speculative.  In the past, it would have been discarded, but low-cost storage makes it possible to retain it and mine it for value.  Low cost platforms are inherent in the DNA of Big Data, and Teradata, like Downton Abbey and its army of servants, symbolizes a different era.

Going forward, most of the growth in data warehousing will be on top of Hadoop and NoSQL datastores.  High value data will move to in-memory databases; conventional relational databases will not disappear, but will decline in importance.

Grow or Go

Business schools used to teach two models for public companies: the growth company that retains its earnings and rewards shareholders through capital gains, and the stable profitable company that rewards investors through dividends and share buybacks.

Today, there is only one model for public companies: grow or go.  Companies that do not articulate a growth strategy do not survive.  Tax and other incentives drive the public equities markets to demand capital gains through growth.  Stable cash-generating businesses either finance themselves through private equity, or they become cash cows within larger and stronger public companies.

Teradata has the potential to be a stable and profitable company. Its gross profit margin has declined a bit in recent years, but the company generates cash like Kim Kardashian generates tweets.  Its operating loss in 2015 is attributable to a one-time accounting charge related to the proposed sale of Aprimo, the Marketing Resource Management company it acquired in 2011.   If Teradata continues to serve its existing customers with product upgrades, extensions and consulting services, the $2.5 billion in total revenue produced in 2015 should be sustainable for some time.

But stable companies can’t structure themselves like growth companies.  Companies with a clear growth vision can invest heavily in sales, marketing and engineering; stable companies must be lean.  Teradata now spends more “below the line” — engineering, sales, marketing, general and administrative functions — than it did in 2012, when it seemed poised for growth.  Management talks about “restructuring” and “transition”, but it does not appear to be actually restructuring anything.

Meanwhile, while the company invested a little over $600 million in research and development over the past three years, it spent $1.6 billion repurchasing its own stock.   Many companies repurchase their own stock to avoid dilution from stock-option grants, and because it is a more tax-efficient way to reward investors.  However, while companies like Apple spend a fraction of their operating cash flow on share repurchases, in the first three quarters of 2015 Teradata spent more on share repurchases than it produced in operating cash flow, borrowing to cover the difference.  Effectively, Teradata is performing a stealth leveraged buy-out.

A company that spends three times as much buying its own stock as it spends on R&D is a company that has no confidence in the growth potential of its own business, and no ideas for building a better product.


End of the Jim and Jim Show

On Monday, December 7, SAS EVP and CMO Jim Davis resigned to take “a leadership role” at Informatica.  Davis was effectively the second in command at SAS since 2001, and widely seen as next in line of succession when owner and CEO Jim Goodnight decides to sell, retire or exit in some other fashion.

“Highly marketable people get job opportunities presented to them all the time,” said SAS spokeswoman Shannon Heath in a statement for The News & Observer.  “This was obviously an opportunity that he felt strongly about and we wish him the best in that and appreciate all of his contributions.”

Uh-huh.

On the WRAL TechWire blog, Rick Smith opines that the loss of Davis is a “crushing blow” for SAS.  SAS pushed back against that post, while Smith wondered what sort of role Davis would take.

A “crushing blow?”  There’s only one person at SAS whose departure will blow the place up.  I remember hearing Jim Goodnight speak about ten years ago; he spoke for fifteen minutes, without notes, about the business.  The audience loved it.  Jim Davis was next on the agenda; he delivered about a hundred professionally produced PowerPoint slides, complete with animated pyramids and such.  For more than an hour he went on and on, talking nonsense, while the back half of the auditorium headed for the exits.

In an interview with Smith, Davis disclosed that he will be the EVP and CMO of Informatica.  According to Smith, Davis says that even though he will have the same title at Informatica, which is a third the size of SAS, he “does not see it as a lateral move.”  He also said that he would not have gone to work for a direct competitor of SAS, and he “did not see Informatica and SAS as direct competitors” even though SAS earns a quarter of its revenue from data quality and ETL software.

Perhaps we should call Davis “Baghdad Jim.”

Personally, I suspect that Davis was toast from the day about a year ago when Goodnight had to walk back a prediction of double-digit sales growth in 2014.  (Revenue actually grew 2%).

As a rule, CMOs do not walk or get axed when the topline looks good.  It’s possible that Davis’ departure is just what SAS says it is, a personal decision.  It’s also possible that SAS will post ugly numbers for 2015.  We should know by the end of the month.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but SAS will be forced to support it due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming over micro-batching are largely theoretical for most applications, the difference does show up in benchmarks like this one.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad hoc; the most important questions are answered only once.  That makes advanced analytics workloads volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.
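
To make “self-service elastic provisioning” concrete, here is a simplified sketch that uses boto3 to spin up a transient Amazon EMR cluster, run one Spark job, and terminate; the roles, instance types and S3 paths are placeholders:

```python
import boto3

# Provision a transient Spark cluster for a single job; the cluster
# shuts itself down when the step finishes, so you pay only for the
# hours the analysis actually needs.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="adhoc-model-build",
    ReleaseLabel="emr-4.2.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 8},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the step completes
    },
    Steps=[{
        "Name": "train-model",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/train_model.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```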

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, enterprises will be able to buy software that delivers expert-level predictive models, the kind that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.
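
A toy sketch of the idea, searching over a few model families and hyperparameters with scikit-learn on synthetic data; real automated machine learning products go far beyond this (feature engineering, ensembling, leakage checks):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real business problem.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A crude "automated" search over model families and hyperparameters.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    (GradientBoostingClassifier(), {"learning_rate": [0.05, 0.1]}),
]

best_score, best_model = -1.0, None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, round(best_score, 3))
```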

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the Titanic sinking]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata’s software is nothing special; there are plenty of open source alternatives, like the recently open-sourced Greenplum database.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

2015 in Big Analytics

Looking back at 2015, a few stories stand out:

  • Steady progress for Spark, punctuated by two big announcements.
  • Solid growth in cloud-based machine learning, led by Microsoft.
  • Expanding options for SQL and OLAP on Hadoop.

In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April.  I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.

IBM Embraces Spark

IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit.  IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base.  It also brings a clutter of aging and partially integrated products, an army of suits and no less than 164 Vice Presidents whose titles include the words “Big Data.”

When IBM announced its Spark initiative I joked that somewhere in the bowels of IBM, someone will want to put Spark on a mainframe.  Color me prophetic.

It’s too early to tell what substantive contributions IBM will make to Spark.  Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September.  This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.

All that said, IBM brings respectability, and the assurance that Spark is ready for prime time.  This is priceless.  Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.

Cloudera Announces “One Platform” Initiative

In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated.  As with the IBM announcement, the symbolism matters.  Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true.  It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.

The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop.  According to Databricks’ recent Spark user survey, only 40% use Spark under YARN, with the rest running in a freestanding cluster or on Mesos.  It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish.  But if Cloudera thinks “One Platform” will stem that tide, it is mistaken.  It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.
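
The split is visible in a single line of configuration; the same PySpark application can target YARN, Mesos or a freestanding cluster just by changing the master URL (the addresses below are placeholders):

```python
from pyspark import SparkConf, SparkContext

# The same application code runs under any of these cluster managers;
# only the master URL changes.
conf = SparkConf().setAppName("portable-job")

# conf.setMaster("yarn-client")                  # Hadoop/YARN (Spark 1.x syntax)
# conf.setMaster("mesos://zk://zk1:2181/mesos")  # Mesos
conf.setMaster("spark://master-host:7077")       # freestanding Spark cluster

sc = SparkContext(conf=conf)
```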

Microsoft Builds Credibility in Analytics

In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics.  The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets.  Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.

Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML).   First released in beta in June 2014, AML is both easy to use and powerful.  The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts.  Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.

Azure Machine Learning contrasts markedly with Amazon Machine Learning.  Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love.  Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon.  If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.
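
For a sense of the extensibility, here is a minimal sketch of a custom script for Azure ML’s “Execute Python Script” module; the input column name is hypothetical:

```python
import numpy as np

# The "Execute Python Script" module passes up to two pandas DataFrames
# from the experiment canvas into azureml_main() and expects a sequence
# of DataFrames back.  "income" is a hypothetical column in the input.
def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    df["log_income"] = np.log(df["income"] + 1)   # simple derived feature
    return df,
```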

SQL Engines Proliferate

At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL.  Spark SQL was still in Alpha; Drill was a curiosity; and Presto was something used at Facebook.

Several things happened during the year:

  • Hive on Tez established rough performance parity with the fast SQL engines.
  • Spark SQL went to general release, stabilized, and rolled out the DataFrames API (illustrated in the sketch after this list).
  • MapR promoted Drill, and invested in improvements to the software.  Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
  • Cloudera donated Impala to the Apache Software Foundation, and Pivotal donated HAWQ.
  • Teradata placed its chips on Presto.
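
As a quick illustration of the DataFrames API mentioned above, here is a short sketch in the Spark 1.x Python API; the data source and columns are hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="dataframes-demo")
sqlContext = SQLContext(sc)

# The DataFrame API expresses the same logic as a SQL query, but
# composably, with the same optimizer underneath.
orders = sqlContext.read.json("hdfs:///data/orders/")   # hypothetical dataset
daily = (orders
         .filter(orders.status == "shipped")
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue"),
              F.countDistinct("customer_id").alias("customers")))
daily.orderBy("order_date").show()
```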

While it’s great to see so many options emerge, Hive continues to win actual evaluations.  Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance.  Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.

The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.

OLAP on Hadoop Gets Real

For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options.  The open source option, Apache Kylin, originally an eBay project, just recently graduated to Apache top level status.  Adoption is limited at present, but any project used by eBay and Baidu is worth a look.

The commercial option is AtScale, a company that emerged from stealth in April.  Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools.  It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.

Funding for Machine Learning

H2O.ai’s recently announced B round is significant for a couple of reasons.  First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.

Like Sherlock Holmes’ dog that did not bark, two companies are significant because they did not procure funding in 2015:

  • Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times.  It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
  • Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem.  Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.

Palantir continued to suck up capital like a whale feeding on krill.

Google TensorFlow

Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.

Teradata Lays Another Egg

Teradata reports Q3 revenue of $606 million, down 3% in “constant” dollars, down 9% in actual dollars, the kind you can spend.  Product revenue, from selling software and boxes, declined 14%.

In a brutal call with analysts, CEO Mike Koehler noted: “revenue was not what we expected.”  It could have been a recorded message.

Teradata executives tried to blame the weak revenue on the strong dollar.  When pressed, however, they admitted that deferred North American sales drove the shortfall, as companies put off investments in Teradata’s big box solutions.

In other words, the dogs don’t like the dog food.

From the press release:

Teradata is in the process of making transformational changes to improve the long-term performance of the company, including offering more flexibility and options in the way customers buy Teradata products such as a software-only version of Teradata as well as making Teradata accessible in the public cloud. The initial cloud version of Teradata will be available on Amazon’s Web Services in the first quarter of 2016.

An analyst asked about expected margins in the software-only business; Teradata executives clammed up.  The answer is zero.  Teradata without a box is a bladeless knife without a handle, competing directly with open source databases, such as Greenplum.

Another analyst asked about Teradata on AWS, noting that Teradata executives previously declared that their customers would never use AWS.  Response from the executives was more mush.  HP just shuttered its cloud business; Teradata’s move to AWS implies that Teradata Cloud is toast.

Koehler also touted Teradata’s plans to offer Aster on Hadoop, citing “100 pre-built applications”.  Good luck with that.  Aster on Hadoop is a SQL engine that still runs through MapReduce; in other words it’s obsolete, a point reinforced by Teradata’s plans to move forward with Presto.  Buying an analytic database with pre-built applications is like buying a car with pre-built rides.

More from the press release:

“We remain confident in Teradata’s technology, our roadmaps and competitive leadership position in the market and we are taking actions to increase shareholder value.  We are making transformative changes to the company for longer term success, and are also aligning our cost structure for near term improvement,” said Mike Koehler, chief executive officer, Teradata Corporation. 

In other words, expect more layoffs.

“Our Marketing Applications team has made great progress this year, and has market leading solutions. As part of our business transformation, we determined it best to exclusively focus our investments and attention on our core Data and Analytics business.  We are therefore selling our Marketing Applications business. As we go through this process, we will work closely with our customers and employees for continued success.

“We overpaid for Aprimo five years ago, so now we’re looking for some greater fool to buy this dog.”

“In parallel, we are launching key transformation initiatives to better align our Data and Analytics solutions and services with the evolving marketplace and to meet the needs of the new Teradata going forward.”

Update your resumes.

During the quarter, Teradata purchased approximately 8.5 million shares of its stock worth approximately $250 million.  Year to date through September 30, Teradata purchased 15.5 million shares, worth approximately $548 million.

“We have no vision for how to invest in our business, so we’re buying back the stock.”

In early trading, Teradata’s stock plunges.

In 2012, five companies led the data warehousing platform market: Oracle, IBM, Microsoft, Teradata and SAP.  Here’s how their stocks have fared since then:

  • Oracle: Up 24%
  • IBM: Down 29%
  • Microsoft: Up 77%
  • Teradata: Down 61%
  • SAP: Up 22%

Nice work, Teradata!  Making IBM look good…

Mets Use SAS, Royals Win Series

In a bit of premature chest-thumping, SAS touts its alliance with the New York Mets.  As a SAS blogger notes, “when the Mets take the field…SAS will be there with them.”

The Mets committed five errors in the Series.

Last year, the Mets signed an agreement with SAS for analytics, joining the Orlando Magic (thirteenth in the NBA Eastern Conference) and the Toronto Maple Leafs (eighth in the NHL Atlantic Division).

There’s a metaphor in there.  Spending big money on software won’t help if you don’t execute on the fundamentals.

Here are profiles of KC’s top analysts:

Notice something missing from those profiles?

IBM Adds Spark Support to Analytics Server

With its customary PR blitz, IBM announces that it has added Spark integration to several products, including SPSS.   IBM gets a small pat on the head for adding Spark support to its Analytics Server software, under the premise that something is better than nothing.

There is a very narrow pool of SPSS users who will benefit from this enhancement.  Spark integration is only available to the subset of SPSS users who license SPSS Modeler; most SPSS users work with SPSS Statistics.  Users must also license SPSS Analytics Server, a product that only runs on Hortonworks HDP or IBM BigInsights.

So, if you’re using the high-end version of the second most popular commercial analytic server, and you’re willing to pay extra to integrate with the third and fourth ranked Hadoop distributions, you’re in luck today.

Analytics Server is a software middle layer installed on Hortonworks or BigInsights; it selectively supports SPSS Modeler operations in Hadoop.  Previous versions ran through MapReduce only; IBM claims that the latest version runs through Spark when available, although the product documentation is surprisingly quiet on the subject.  There is no reference to Spark in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting; so the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

Analytics Server 2.1 partially supports most Modeler record and field operations.  Out of Modeler’s 37 data mining nodes, Analytic Server fully supports 8, partially supports 5 and does not support 24.  Among the missing:

  • Logistic Regression
  • k-Means
  • Support Vector Machines
  • PCA
  • Feature Selection
  • Anomaly Detection

Everyone understands that software engineering takes time, but IBM’s priorities are muddled. Logistic regression, k-means, SVM and PCA are all available today in Spark’s open source library; I suspect that IBM figures they can’t justify additional license fees if they point to algorithms that anyone can use for free  (*).  Clustering, PCA, feature selection and anomaly detection are precisely the kind of analyses users want to run on all of the data, not a sample extracted back to a server.
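
For reference, a minimal sketch of two of those freely available algorithms in Spark’s own machine learning library (spark.ml); the source table and column names are hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

sc = SparkContext(appName="free-algorithms")
sqlContext = SQLContext(sc)

# Hypothetical customer table already sitting in Hadoop.
df = sqlContext.read.parquet("hdfs:///data/customers/")

# Assemble feature columns into the vector format spark.ml expects.
assembler = VectorAssembler(inputCols=["tenure", "spend", "visits"],
                            outputCol="features")
data = assembler.transform(df)

# Two of the algorithms missing from Analytics Server, available for free.
churn_model = LogisticRegression(labelCol="churned", featuresCol="features").fit(data)
segments = KMeans(k=5, featuresCol="features").fit(data)
```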

(*) IBM is mistaken on that point, of course.  There are a lot of business users who want the power of Spark but don’t want to mess with a programming API.  These users would happily pay for a nice business user front end like SPSS Modeler, and they won’t care what happens in the back end.

Assuming that this product actually works — not guaranteed, given the sloppy and incomplete documentation — it is better than the previous version of Analytics Server, but that is a low bar.  Spark or no, IBM is way behind SAS in this space; I’m not a great believer in SAS’ proprietary approach to distributed in-memory analytics, but compared to IBM’s offering SAS wins on depth of features and breadth of platform support.  There are no published benchmarks, but I suspect that SAS wins on performance as well.

Also, SAS knows how to write documentation, which seems to be a problem for IBM.

To its credit, IBM’s Analytic Server offers more Spark capability than current offerings by Alpine, Alteryx and RapidMiner; but H2O and Skytree offer richer and better engines for serious machine learning.

As for the majority of SPSS users, wouldn’t it be great if SPSS could just connect to a Spark DataFrame?  Or if Spark could ingest SPSS datasets?
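
There is no official bridge, but here is a rough sketch of what one could look like, assuming the open source pyreadstat package for reading .sav files; the file name is hypothetical and this is not an IBM or Spark feature:

```python
import pyreadstat
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Workaround sketch: read an SPSS .sav file into pandas with pyreadstat,
# then promote it to a Spark DataFrame.
sc = SparkContext(appName="spss-bridge")
sqlContext = SQLContext(sc)

pdf, meta = pyreadstat.read_sav("survey_responses.sav")   # hypothetical file
sdf = sqlContext.createDataFrame(pdf)
sdf.printSchema()
```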

Benchmark: Spark Beats MapReduce

A group of scientists affiliated with IBM and several universities report on a detailed analysis of MapReduce and Spark performance across four different workloads.  In this benchmark, Spark outperformed MapReduce on Word Count, k-Means and Page Rank, while MapReduce outperformed Spark on Sort.

On the ADT Dev Watch blog, Dave Ramel summarizes the paper, arguing that it “brings into question … Databricks Daytona GraySort claim”.  This point refers to Databricks’ record-setting entry in the 2014 Sort Benchmark run by Chris Nyberg, Mehul Shah and Naga Govindaraju.

However, Ramel appears to have overlooked section 3.3.1 of the paper, where the researchers explicitly address this question:

This difference is mainly because our cluster is connected using 1 Gbps Ethernet, as compared to a 10 Gbps Ethernet in […]; i.e., in our cluster configuration, network can become a bottleneck for Sort in Spark.

In other words, had they deployed Spark on a cluster with high-speed network connections, it likely would run the Sort faster than MapReduce did.

I guess we’ll know when Nyberg et al. release the 2015 GraySort results.

The IBM benchmark team found that k-means ran about 5X faster in Spark than in MapReduce.  Ramel highlights the difference between this and the Spark team’s claim that machine learning algorithms run “up to” 100X faster.

The actual performance comparison shown on the Spark website compares logistic regression, which the IBM researchers did not test.  One possible explanation — the Spark team may have tested against Mahout’s logistic regression algorithm, which runs on a single machine.  It’s hard to say, since the Spark team provides no backup documentation for its performance claims.  That needs to change.

O’Reilly Data Science Survey 2015

O’Reilly releases its 2015 Data Science Salary Survey.  The report, authored by John King and Roger Magoulas, summarizes results from an ongoing web survey.  The 2015 survey includes responses from “over 600” participants, down from the “over 800” tabulated in 2014.

The authors note that the survey includes self-selected respondents from the O’Reilly audience and may not generalize to the population of data scientists.  This does not invalidate the results of the survey — all surveys of data scientists, including Rexer and KDnuggets, use unscientific samples.  It does mean one should keep the survey audience in mind when interpreting results.

Moreover, since O’Reilly’s data collection methods are consistent from year to year, changes from 2014 may be significant.

The primary purpose of the survey is to collect data about data scientist salaries.  While some find that fascinating, I am more interested in what data scientists say about the tasks they perform and tools they use, and will focus this post on those topics.

Concerning data scientist tasks, the survey confirms what we already know: data scientists spend a lot of time in exploratory data analysis and data cleaning.  However, those who spend more time in meetings and those who spend more time presenting analysis earn more.  In other words, the real value drivers in data science are understanding the client’s business problem and explaining the results.  (This is also where many data science projects fail.)

The authors’ analysis of tool usage has improved significantly over the three iterations of the survey.  In the 2015 survey, for example, they analyze operating systems and analytic tools separately; knowing that someone says they use “Windows” for analysis tells us exactly nothing.

SQL, Excel and Python remain the most popular tools, while reported R usage declined from 2014.  The authors say that the change in R usage is “only marginally significant”, which tells me they need to brush up on statistics.  (In statistics, a finding either is or is not significant at the preselected significance level; this prevents fudging.)  The reported decline in R usage isn’t reflected in other surveys so it’s likely either (a) noise, or (b) an artifact of the sampling and data collection methods used.

The 2015 survey shows a marked increase in reported use of Spark and Scala.  Within the Spark user community, the recent Databricks survey shows Python rapidly gaining on Scala as the preferred Spark interface.  Scala offers little in the way of native machine learning capability, so I doubt that the language has legs among data scientists.  On the other hand, respondents were much less likely to use Java, a finding mirrored in the Databricks survey.  Data scientists use Scala and Java to “roll their own” algorithms; but given the rapid growth of open source and commercial algorithms (and rapidly growing Python use), I expect that we will see less of that in the future.

Reported use of Mahout collapsed since the last survey.  As I’ve written elsewhere, you can stick a fork in Mahout — it’s done.  Respondents also said they were less likely to use Apache Hadoop; I guess folks have figured out that doing logistic regression in MapReduce is a loser.

Respondents also reported increased use of Tableau, which is not surprising.  It’s everywhere.

The authors report discovering nine clusters of respondents based on tool usage, shown below.  (In the 2014 survey, they found five clusters.)

[Chart: nine clusters of survey respondents, based on tool usage]

The clustering is interesting.  The top three clusters correspond roughly to a “Power Analyst” user persona, a business user who is able to use tools for analysis but is not a hardcore developer.  The lower right quadrant corresponds to a developer persona, an individual with an Engineering background able to work actively in hardcore programming languages.  Hive and BusinessObjects fall into a middle category; neither tool is accessible to most business users without some significant commitment and training.

Some of the findings will satisfy Captain Obvious:

  • R and ggplot
  • SAP HANA and BusinessObjects
  • C and C++
  • JavaScript and D3
  • PostgreSQL and Amazon Redshift
  • Hive, Pig, Hortonworks and Cloudera
  • Python, Scala and Java

Others are surprising:

  • Tableau and SAS
  • SPSS and C#
  • Hive and Weka

It’s also interesting to note that Amazon EMR and Amazon Redshift usage fall into different clusters, and that EMR clusters separately from Cloudera and Hortonworks.

Since the authors changed clustering methods from 2014 to 2015, it’s difficult to identify movement in the respondent population.  One clear change is reflected in the separate cluster for R, which aligns more closely with the business user profile in the 2015 clustering.  In the 2014 clustering, R clustered together with Python and Weka.  This could easily be an artifact of the different clustering methods used — which the authors can rule out by clustering respondents to the 2014 survey using the 2015 methods.

Instead, the authors engage in silly speculation about R usage, citing tiny changes in tiny correlation coefficients.  (They don’t show the p-values for the correlations, but I suspect we can’t reject the hypothesis that they are all zero; so the change from year to year is also zero.)  Revolution Analytics’ acquisition by Microsoft has exactly zero impact on R users’ choice of operating system; and Teradata’s support for R in 2014 (which is limited to its Aster boxes) can’t have had a material impact on data scientists’ choice of tools.
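
A quick sketch (on synthetic data) of why tiny correlations across roughly 600 respondents are statistically indistinguishable from zero:

```python
import numpy as np
from scipy import stats

# With ~600 respondents, a sample correlation near 0.05 cannot be
# distinguished from zero.  The two variables here are independent
# by construction.
rng = np.random.RandomState(0)
x = rng.normal(size=600)
y = rng.normal(size=600)

r, p_value = stats.pearsonr(x, y)
print(round(r, 3), round(p_value, 3))   # r near 0, p-value well above 0.05
```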

It’s also telling that the most commonly used tools fall into a single cluster with the least commonly used tools.  Folks who dabble with survey segmentation are often surprised to find that there is one big segment that is kind of a catchall for features that do not differentiate respondents.   The way to deal with that is to remove the most and least cited responses from the list of active variables, since these do not differentiate respondents; spinning an interpretation of this “catchall” cluster is rubbish.

Spark 1.5 Released

On September 9, the Spark team announced availability of Release 1.5.  (Release notes here.)  230 developers contributed more than 1,400 commits, the largest release to date.  Spark continues to expand its contributor base, the best measure of health for an open source project.

[Chart: growth in Spark contributors by release]

On the Databricks blog, Reynold Xin and Patrick Wendell summarize the key new bits.  Some highlights:

  • Project Tungsten, a set of major changes to Spark’s internal architecture, will be on by default.  Spark 1.5 includes binary processing and a new code generation framework, with more than 100 built-in functions for common tasks.
  • Other performance enhancements include improved Parquet support (with predicate push-down and a faster metadata lookup path), and improved joins.
  • Usability enhancements include visualization of the SQL and DataFrame query plans in the web UI; the ability to connect to multiple versions of Hive metastores and the ability to read several Parquet variants.
  • Spark Streaming adds stability features, backpressure support, load balancing and several Python APIs.
  • The R interface is expanded to include Generalized Linear Models.
  • New machine learning features include eight new transformers, three new estimators (naive Bayes, k-means and isotonic regression) plus three new algorithms (multilayer perceptron classifier, PrefixSpan for sequential pattern mining and FP-Growth for association rule learning).
  • Enhancements to existing algorithms include improvements to LDA, decision tree and ensemble features, an improved Pregel API for GraphX plus an ability to distribute matrix inversions for Gaussian Mixture Models (GMM).
  • Other new machine learning features include model summaries for linear and logistic regression, a splitting tool to define train and validation samples and a multiclass classification evaluator (a brief sketch using two of these additions follows this list).
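
A brief sketch using two of these additions, the naive Bayes estimator and the multiclass evaluator; the training table is hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

sc = SparkContext(appName="spark15-ml")
sqlContext = SQLContext(sc)

# Hypothetical training table with "label" and "features" columns already
# prepared (e.g., by VectorAssembler).
data = sqlContext.read.parquet("hdfs:///data/training/")
train, validation = data.randomSplit([0.8, 0.2], seed=42)

# Naive Bayes estimator and multiclass evaluator, both added in 1.5.
model = NaiveBayes(smoothing=1.0).fit(train)
predictions = model.transform(validation)

f1 = MulticlassClassificationEvaluator(metricName="f1").evaluate(predictions)
print(f1)
```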

GraphX development has flatlined since the component graduated from Alpha in Spark 1.2.

Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks all participated in release testing.  Note that IBM, for all its marketing hoopla, contributes little or nothing to the project.