Big Analytics Roundup (March 28, 2016)

Microsoft’s chatbot fail wins the internet this week, but the most important story is Google’s new managed service for machine learning. Also leading the week: Mesosphere’s new funding round led by Microsoft and HPE, and more funding for Domo.

— Google Cloud Platform (GCP) adds the Google Cloud Machine Learning Platform to its suite of managed machine learning services, which already includes Google Cloud Vision API (Beta); Google Cloud Speech API (Limited Preview); and Google Cloud Translate API. GCP still offers the Prediction API, but it’s no longer a top-level service. The Machine Learning platform, currently in Limited Preview, works with TensorFlow models that you train offline and uses Dataflow for preprocessing, so you can work with data in Google Cloud Storage, BigQuery and other sources. It’s an impressive stack. A cloud of speculation and navel-gazing ensues.

— Mesosphere announces that it has closed a $73.5 million Series C round, with Microsoft and Hewlett Packard Enterprise taking lead roles. Mesosphere also announces version 1.0 of Marathon, a container orchestration service for DCOS, and a new product for source code management called Velocity.

— Domo announces that it has reached $100 million in “billings” and raised another $131 million on its Series D round at a sustained valuation of $2 billion. (Billings typically exceed GAAP revenue due to the effect of prepayments on multi-year contracts.)
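To see why billings can run ahead of revenue, here is a minimal sketch with made-up numbers (nothing below comes from Domo's financials). It assumes a customer prepays a three-year contract in full, so the whole invoice counts as billings in year one, and that revenue is recognized in equal installments over the term. Real revenue recognition is more involved. The function name and figures are hypothetical.

```python
# Hypothetical illustration: billings vs. recognized revenue for a prepaid
# multi-year contract. All figures are invented, and straight-line
# recognition is a simplifying assumption.

def yearly_billings_and_revenue(contract_value, term_years):
    """Return (billings, revenue) lists with one entry per contract year.

    Assumes the customer prepays the full contract in year 1 and that
    revenue is recognized in equal installments over the term.
    """
    billings = [contract_value] + [0.0] * (term_years - 1)
    revenue = [contract_value / term_years] * term_years
    return billings, revenue

billings, revenue = yearly_billings_and_revenue(300_000.0, 3)
for year, (b, r) in enumerate(zip(billings, revenue), start=1):
    print(f"Year {year}: billings ${b:,.0f}, revenue ${r:,.0f}")
```

Over the full term the two totals match. The gap is purely one of timing, which is why a fast-growing company that sells prepaid contracts can report billings well above its revenue.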

Explainers

— In the MIT Technology Review, Rachel Metz explains the Microsoft chatbot fail.

— Facebook’s Arun Sharma explains Dragon, a distributed graph query engine.

— Frances Perry and Tyler Akidau explain runners in Apache Beam.

— On the Netflix Tech Blog, Ben Schmaus et al. explain Mantis, a streaming analytics platform that drives alerts and dashboards.

— At a Flink Meetup in Sao Paulo, Slim Baltagi presents real-world use cases for streaming analytics.

— Two interesting posts on PySpark:

  • On the AWS Big Data Blog, Veronika Megler explains anomaly detection using PySpark, Hive and Hue.
  • On the MapR Blog, Ben Sadeghi explains churn prediction using PySpark, MLlib and ML.

Perspectives

— Eric Kavanagh delivers a nice overview of the history of open source analytics.

— On the Qubole Blog, MediaMath’s Rory Sawyer describes the benefits of cloud-based data science infrastructure.

— In a somewhat turgid essay, Stitch Fix’s Jeff Magnusson argues that data scientists are thinkers and engineers are doers, then argues that engineers (the “doers”) should not do ETL, an argument that rebuts itself.

— Ian Allison profiles Seldon, an open source machine learning platform that specializes in content and product recommenders.

— In Datanami, Alex Woodie writes a confused piece on ‘overcoming Spark performance challenges’ that appears to be mostly about touting some new products.

— Ted Dunning previews his Strata presentation on streaming. Spoiler: he likes it.

— James Haight of Blue Hill Research offers an article teasing five things to watch for at Strata, but only details four. I feel cheated.

— Sam Charrington summarizes insights from Cloudera’s third annual analyst day. If you follow him on Twitter, you’ve already read this.

Open Source Announcements

— Airbnb donates Airflow, a workflow automation system, to Apache.

— KeystoneML, a machine learning pipeline framework that runs on Spark, releases version 0.3, with new solvers, new operators and a number of performance improvements. I continue to wonder why this AMPLab project isn’t part of the Spark ML library.

— Several Apache projects have new releases:

  • Apache Mahout 0.11.2 updates Spark support, includes performance enhancers and bug fixes.
  • BSP framework Apache Hama releases version 0.7.1 with bug fixes and a new scheduler.
  • OLAP-on-Hadoop project Apache Kylin delivers Release 1.3 and Release 1.5 in quick succession, skipping Release 1.4.  On the Apache Kylin technical blog, Hongbin Ma details the new bits in Release 1.3, and Li Yang explains Release 1.5.
  • SQL engine MRQL releases version 0.6, with new features for incremental query processing.

Commercial Announcements

— Altiscale announces the Altiscale Insight Cloud, an analytics-as-a-service platform that runs on top of the Altiscale Data Cloud. The service combines a number of popular tools, including Spark, Hive, Pig, Python, R, Mahout, Matlab and H2O. Altiscale also claims to include Revolution R, which is curious since Microsoft acquired and rebranded the product.

— Alteryx and Microsoft announce a partnership, which makes sense for both parties. Alteryx, a Windows-based product, fills a gap in Microsoft’s product line, and Azure greatly expands Alteryx’s market reach.

— DataRobot announces that it is certified on Cloudera and claims to be the only Cloudera partner that is certified on all of Cloudera’s bits, including Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels. George Leopold reports.

— Sense announces that it has been acquired by Cloudera. I’m struggling to understand why I should care.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.  You can get a free copy here from RapidMiner (registration required).  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, H2O.ai, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model, like H2O.ai.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

— Accenture makes the analysis this year, according to Gartner, because it acquired i4C Analytics, a tiny privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing;  hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

Alpine Data Labs declines a lot in “Ability to Deliver,” which makes sense since they appear to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum Pivotal and EMC Dell, and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users,” which is spot on for anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’s scores are up for customer satisfaction and delivering business value, which suggests that whoever at Gartner decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

— Dell: Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

— KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analysis Services are less than thrilled with that product.

Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required. 

RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

SAS declines slightly in Ability to Deliver.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLlib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.
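For readers who want a feel for the “pure” streaming versus micro-batching distinction, here is a toy latency model. It is a sketch under stated assumptions, not how Spark Streaming, Storm or Flink actually schedule work: every event takes a fixed time to process, there is no queueing, and a micro-batch engine picks up an event at the first batch boundary at or after its arrival. The function names, arrival times and the 500 ms batch interval are all invented for illustration.

```python
# Toy latency model: per-event streaming vs. micro-batching.
# All numbers are invented; times are integer milliseconds.

def per_event_latencies(arrivals_ms, processing_ms):
    """Each event is processed the moment it arrives."""
    return [processing_ms for _ in arrivals_ms]

def micro_batch_latencies(arrivals_ms, batch_interval_ms, processing_ms):
    """Each event waits for the next batch boundary before processing."""
    latencies = []
    for t in arrivals_ms:
        # First multiple of the batch interval at or after the arrival time.
        boundary = -(-t // batch_interval_ms) * batch_interval_ms
        latencies.append(boundary - t + processing_ms)
    return latencies

arrivals_ms = [0, 120, 450, 999, 1000, 1730]
streaming = per_event_latencies(arrivals_ms, processing_ms=10)
micro = micro_batch_latencies(arrivals_ms, batch_interval_ms=500, processing_ms=10)

print("per-event  :", streaming)
print("micro-batch:", micro)
```

In this model the micro-batch engine adds a wait of up to one batch interval per event. Whether that matters depends on the use case: it is irrelevant for a dashboard that refreshes every minute, but shows up quickly in a latency benchmark.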

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLlib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys: the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the sinking of the Titanic]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software is nothing special; there are plenty of open source alternatives, like Apache Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

Big Analytics Roundup (November 2, 2015)

Spark Summit Europe, Oracle Open World and IBM Insights all met last week, as did Cloudera’s Wrangle conference for data scientists.

But in the really important news, KC beats the Mets to take the Series.

Top news from the Spark Summit is Typesafe’s announcement of Spark support, plus some insight into what’s coming in Spark 1.6.  I will publish a separate roundup for the Spark Summit next week  when presentations are available.

Nine stories this week:

(1) Typesafe Announces Spark Support

Typesafe, the commercial venture behind Scala and Akka, announces commercial support for Apache Spark.  Planned service offerings include one-business-day response to questions for projects in development.  For production, SLAs range from four-hour turnaround during business hours up to 24/7 coverage with one-hour turnaround.

(2) More Funding for Alteryx

The New York Times reports that Alteryx has landed an $85 million “C” round, led by Iconiq Capital.  That makes a total of $163 million in four rounds for the company.

(3) Oracle Adds Spark to Cloud

At Oracle Open World, Oracle announces Oracle Cloud Platform for Big Data, a PaaS offering;  Dave Ramel covers the story.   Key new bits include automated ingestion, preparation, repair, enrichment and governance, all built in Spark; and a DBaaS offering with Hadoop, Spark and NoSQL data services.

(4) IBM Adds Spark Support to Analytics Server

Full story here.  Great news for those who want to use the high-end version of the second most popular data mining workbench with the third and fourth most popular Hadoop distributions.

(5) Ned Explains Zeppelin

Ned’s Blog provides a nice Zeppelin walk-through, noting the UI’s rich list of language interpreters, which currently includes HiveQL, Spark, Flink, Postgres, HAWQ, Tajo, AngularJS, Cassandra, Ignite, Phoenix, Geode, Kylin and Lens.

(6) IIT and ANL Deliver BSP with ZHT

Researchers from the Illinois Institute of Technology, Argonne Labs and Hortonworks report that they have implemented a graph processing system based on the Bulk Synchronous Parallel (BSP) model on ZHT, a distributed key-value store.  Nicole Hemsoth reports.  The new engine, called Pregelix, when benchmarked against Giraph, GraphLab, GraphX and Hama, outshines them all.
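For readers unfamiliar with the bulk synchronous model, here is a minimal single-machine sketch of the idea: vertices compute independently, exchange messages, and wait at a barrier before the next superstep. It is emphatically not the researchers' implementation and does not use ZHT's API. A plain Python dict stands in for the key-value store, and the function name, record layout and sample graph are all hypothetical.

```python
import math

def bsp_shortest_paths(store, source):
    """Run single-source shortest paths as a series of BSP supersteps.

    `store` maps vertex id -> {"dist": float, "edges": {neighbor: weight}}
    and plays the role of the key-value store (an assumption made for this
    sketch). Returns the number of supersteps executed.
    """
    inbox = {source: [0.0]}           # messages delivered at the last barrier
    supersteps = 0
    while inbox:                      # halt when no messages are in flight
        outbox = {}
        # Compute phase: every vertex with mail runs independently.
        for vertex, messages in inbox.items():
            record = store[vertex]
            best = min(messages)
            if best < record["dist"]:
                record["dist"] = best
                # Communication phase: queue messages for the next superstep.
                for neighbor, weight in record["edges"].items():
                    outbox.setdefault(neighbor, []).append(best + weight)
        # Barrier: new messages become visible only in the next superstep.
        inbox = outbox
        supersteps += 1
    return supersteps

graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2}, "C": {"D": 1}, "D": {}}
store = {v: {"dist": math.inf, "edges": edges} for v, edges in graph.items()}

supersteps = bsp_shortest_paths(store, "A")
distances = {v: record["dist"] for v, record in store.items()}
print(supersteps, distances)
```

Because each vertex only reads and writes its own record plus an outgoing message queue, the per-vertex work maps naturally onto get and put operations against a key-value store, which is presumably part of the appeal of building on one.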

(7) Wrangle 2015 Meets in SFO

Cloudera’s Justin Kestelyn summarizes the event, which hosted data science teams from the likes of Uber, Facebook and Airbnb.  Tony Baer offers the trite perspective that data science is about people.

(8) MapR Offers Free Spark Training

MapR announces availability of its first free Apache Spark course as part of its Hadoop On-Demand Training program.  No word on quality, but it’s hard to beat the price.

(9) Cloudera Pushes HUE for Spark

On the Cloudera Engineering blog, Justin Kestelyn explains how to use HUE’s notebook app with SQL and Spark.

IBM Adds Spark Support to Analytics Server

With its customary PR blitz, IBM announces that it has added Spark integration to several products, including SPSS.   IBM gets a small pat on the head for adding Spark support to its Analytics Server software, under the premise that something is better than nothing.

There is a very narrow pool of SPSS users who will benefit from this enhancement.  Spark integration is only available to the subset of SPSS users who license SPSS Modeler; most SPSS users work with SPSS Statistics.  Users must also license SPSS Analytics Server, a product that only runs on Hortonworks HDP or IBM BigInsights.

So, if you’re using the high-end version of the second most popular commercial analytic server, and you’re willing to pay extra to integrate with the third and fourth ranked Hadoop distributions, you’re in luck today.

Analytics Server is a software middle layer installed on Hortonworks or BigInsights; it selectively supports SPSS Modeler operations in Hadoop.  Previous versions ran through MapReduce only;  IBM claims that the latest version runs through Spark when available, although the product documentation is surprisingly quiet on the subject.  There is no reference to Spark in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting; so the good news is that if the product fails, IBM has some tips, one of which should be “Install Spark.”

Analytics Server 2.1 partially supports most Modeler record and field operations.  Out of Modeler’s 37 data mining nodes, Analytic Server fully supports 8, partially supports 5 and does not support 24.  Among the missing:

  • Logistic Regression
  • k-Means
  • Support Vector Machines
  • PCA
  • Feature Selection
  • Anomaly Detection

Everyone understands that software engineering takes time, but IBM’s priorities are muddled. Logistic regression, k-means, SVM and PCA are all available today in Spark’s open source library; I suspect that IBM figures they can’t justify additional license fees if they point to algorithms that anyone can use for free  (*).  Clustering, PCA, feature selection and anomaly detection are precisely the kind of analyses users want to run on all of the data, not a sample extracted back to a server.

(*) IBM is mistaken on that point, of course.  There are a lot of business users who want the power of Spark but don’t want to mess with a programming API.  These users would happily pay for a nice business user front end like SPSS Modeler, and they won’t care what happens in the back end.

Assuming that this product actually works — not guaranteed, given the sloppy and incomplete documentation — it is better than the previous version of Analytics Server, but that is a low bar.  Spark or no, IBM is way behind SAS in this space; I’m not a great believer in SAS’ proprietary approach to distributed in-memory analytics, but compared to IBM’s offering SAS wins on depth of features and breadth of platform support.  There are no published benchmarks, but I suspect that SAS wins on performance as well.

Also, SAS knows how to write documentation, which seems to be a problem for IBM.

To its credit, IBM’s Analytic Server offers more Spark capability than current offerings by Alpine, Alteryx and RapidMiner; but H2O and Skytree offer richer and better engines for serious machine learning.

As for the majority of SPSS users, wouldn’t it be great if SPSS could just connect to a Spark DataFrame?  Or if Spark could ingest SPSS datasets?

Big Analytics Roundup (October 12, 2015)

Dell and Silver Lake Partners announce plans to buy EMC for $67 billion, a transaction that is a big deal in the tech world, and mildly interesting for analytics.  Dell acquired StatSoft in 2014, but nothing before or since suggests that Dell knows how to position and sell analytics.  StatSoft is lost inside Dell, and will be even more lost inside Dell/EMC.

EMC acquired Greenplum in 2010; at the time, GP was a credible competitor to Netezza, Aster and Vertica.  It turns out, however, that EMC’s superstar sales reps, accustomed to pushing storage boxes, struggled to sell analytic appliances.  Moreover, with the leading data warehouse appliances vertically integrated with hardware vendors, Greenplum was out there in the middle of nowhere peddling an appliance that isn’t really an appliance.

EMC shifted the Greenplum assets to its Pivotal Software unit, which subsequently open sourced the software it could not sell and exited the Hadoop distribution business under the ODP fig leaf.  Alpine Data Labs, which used to be tied to Greenplum like bears to honey, figured out a year ago that it could not depend on Greenplum for growth, and has diversified its platform support.

What’s left of Pivotal Software is a consulting business, which is fine, since all of the big tech companies have consulting arms.  But I doubt that the software assets (Greenplum, Hawq and MADlib) have legs.

In other news, the Apache Software Foundation announces three interesting software releases:

  • Apache Accumulo: Release 1.6.4, a maintenance release.
  • Apache Ignite: Release 1.4.0, a feature release with SSL and log4j2 support, faster JDBC driver implementation and more.
  • Apache Kafka: Release 0.8.2.2, a maintenance release.

Spark

On the MapR blog, Jim Scott takes a “Spark is a fighter jet” metaphor and flies it until the wings fall off.

Spark Performance

Dave Ramel summarizes a paper he thinks is too long for you to read.  That paper, here, written by scientists affiliated with IBM and several universities, reports on detailed performance tests for MapReduce and Spark across four different workloads.  As I noted in a separate blog post, Ramel’s comment that the paper “calls into question” Spark’s record-setting performance on GraySort is wrong.

Spark Appliances

Ordinarily I don’t link sponsored content, but this article from Numascale is interesting.  Numascale, a Norwegian company, offers analytic appliances with lots of memory; there’s an R appliance, a Spark appliance and a database appliance with MonetDB.

Spark on Amazon EMR

On Slideshare, Amazon’s Jonathan Fritz and Manjeet Chayel summarize best practices for data science with Spark on EMR.  The presentation includes an overview of Spark DataFrames, a guide to running Spark on Amazon EMR, customer use cases, tips for optimizing performance and a plug for Zeppelin notebooks.

Use Cases

In Datanami, Alex Woodie describes how Uber uses Spark and Hadoop.

Stitch Fix offers personalized style recommendations to its customers.  Jas Khela describes how the startup uses Spark.   (h/t Hadoop Weekly)

SQL/OLAP/BI

Apache Drill

MapR’s Neeraja Rentachintala, Director of Product Management, rethinks SQL for Big Data.  Without a trace of irony, the post explains how to bring SQL to NoSQL datastores.

Apache Hawq

On the Pivotal Big Data blog, Gavin Sherry touts Apache Hawq and Apache MADLib.  Hawq is a SQL engine that federates queries across Greenplum Database and Hadoop; MADLib is a machine learning library.   MADLib was always open source; Hawq, on the other hand, is a product Pivotal tried and failed to sell.  In Datanami, George Leopold reports.

In CIO Today, Jennifer LeClaire speculates that Pivotal is “taking on” Oracle’s traditional database business with this move, which is a colossal pile of horse manure.

At Apache Big Data Europe, Caleb Welton explains Hawq’s architecture in a deep dive.  The endorsement from GE’s Jeffrey Immelt is a bit rich considering GE’s ownership stake in Pivotal, but the rest of the deck is solid.

Apache Phoenix

At Apache Big Data Europe, Nick Dimiduk delivers an overview of Phoenix, a relational database layer for HBase.  Phoenix includes a query engine that transforms SQL into native HBase API calls, a metadata repository and a JDBC driver.  SQL support is broad enough to run TPC benchmark queries.  Dimiduk also introduces Apache Calcite, a query parser, compiler and planner framework currently in incubation.

Data Blending

On Forbes, Adrian Bridgewater touts the data blending capabilities of ClearStory Data and Alteryx without explaining why data blending is a thing.

Presto

On the AWS Big Data Blog, Songzhi Liu explains how to use Presto and Airpal on EMR.  Airpal is a web-based query tool developed by Airbnb that runs on top of Presto.

Machine Learning

Apache MADLib

MADLib is an open source project for machine learning in SQL.  Developed by people affiliated with Greenplum, MADLib has always been an open source project, but is now part of the Apache community.  Machine learning functionality is quite rich.  Currently, MADLib supports PostgreSQL, Greenplum Database and Apache Hawq.  In theory, the software should be able to run in any SQL engine that supports UDFs; since Oracle, IBM and Teradata all have their own machine learning stories, I doubt that we will see MADLib running on those platforms. (h/t Hadoop Weekly)

Apache Spark (SparkR)

On the Databricks blog, Eric Liang and Xiangrui Meng review additions to the R interface in Spark 1.5, including support for Generalized Linear Models.

Apache Spark (MLLib)

On the Cloudera blog, Jose Cambronero explains what he did this summer, which included running K-S tests in Spark.
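
For readers who have not met the K-S (Kolmogorov-Smirnov) test: the one-sample statistic is simply the largest gap between the empirical distribution of a sample and a reference distribution.  Here is a minimal sketch in plain Python.  It is not Spark code and not Cambronero's implementation; the names ks_statistic and uniform_cdf are made up for illustration, and a distributed version would have to sort and rank the data across partitions rather than in local memory.

```python
def ks_statistic(sample, cdf):
    """One-sample K-S statistic: the largest absolute gap between the
    empirical CDF of `sample` and the reference `cdf` (a callable)."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check
        # the gap on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def uniform_cdf(x):
    """Reference CDF for the uniform distribution on [0, 1]
    (hypothetical helper for the example below)."""
    return min(max(x, 0.0), 1.0)


# Three evenly spaced points compared against Uniform(0, 1).
print(ks_statistic([0.25, 0.5, 0.75], uniform_cdf))  # 0.25
```

A small statistic means the sample tracks the reference distribution closely; turning it into a p-value requires the K-S distribution, which this sketch leaves out.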

Apache Zeppelin

At Apache Big Data Europe, DataStax’ Duy Hai Doan explains why you should care about Zeppelin’s web-based notebook for interactive analytics.

H2O and Spark (Sparkling Water)

In a guest post on the Cloudera blog, Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai explain how to create an integrated machine learning pipeline using Spark MLLib, H2O and Sparkling Water, H2O’s interface with Spark.

How Yahoo Does Deep Learning on Spark

Cyprien Noel, Jun Shi, Andy Feng and the Yahoo Big ML Team explain how Yahoo does Deep Learning with Caffe on Spark.  Yahoo adds GPU nodes to its Hadoop clusters; each GPU node has 10X the processing power of a commodity Hadoop node.  The GPU nodes connect to the rest of the cluster through Ethernet, while Infiniband provides high-speed connectivity among the GPUs.

Caffe is an open source Deep Learning framework developed by the Berkeley Vision and Learning Center (BVLC).  In Yahoo’s implementation, Spark assembles the data from HDFS, launches multiple Caffe learners running on the GPU nodes, then saves the resulting model back to HDFS. (h/t Hadoop Weekly)

Streaming Analytics

Apache Flink

On the MapR blog, Henry Saputra recaps an overview of Flink’s stream and graph processing from a recent Meetup.

Apache Kafka

Cloudera’s Gwen Shapira presents Kafka “worst practices”: true stories that happened to Kafka clusters. (h/t Hadoop Weekly)

Apache Spark Streaming

On the MapR blog, Jim Scott offers a guide to Spark Streaming.

Spark Summit 2015: Preliminary Report

So I guess Spark really is enterprise ready.  Nick Heudecker, call your office.

There are several key themes coming from the Summit:

Spark Continues to Mature

Spark and its contributors deserve a round of applause.  Some key measures of growth since the 2014 Summit:

  • Contributor headcount increased from 255 to 730
  • Committed lines of code increased from 175K to 400K

There is increasing evidence of Spark’s scalability:

  • Largest cluster: 8,000 nodes
  • Largest job: 1 petabyte
  • Top streaming intake: 1TB/hour

Project Tungsten aims to make Spark faster and prepare for the next five years; the project has already accomplished significant performance improvements through better use of memory and CPU.

IBM and Spark

IBM drops the big one with its announcement.  Key bits from the announcement:

  • IBM will build Spark into the core of its analytic and commerce products, including IBM Watson Health Cloud.
  • IBM will open source its machine learning library (SystemML) and work with Databricks to port it to Spark.
  • IBM will offer Spark as a Cloud service on Bluemix.
  • IBM will commit 3,500 developers to Spark-related projects.
  • IBM (and its partners) will train more than a million people on Spark.

I will post separately on this next week.

Spark is Enterprise-Ready

If IBM’s announcement is not sufficient to persuade skeptics, presentations from Adobe, Airbnb, Baidu, Capital One, CIA, NASA/JPL, NBC Universal, Netflix, Thomson Reuters, Toyota and many others demonstrate that Spark already supports enterprise-level workloads.

In one of the breakouts, Arsalan Tavakoli-Shiraji of Databricks presented results from his analysis of more than 150 production deployments of Spark.  As expected, organizations use Spark for BI and advanced analytics; the big surprise is that 60% use non-HDFS data sources.  These organizations use Spark for data consolidation on the fly, decoupling compute from storage, with unification taking place on the processing layer.

Databricks Cloud is GA

Enough said.

SparkR

Spark 1.4 includes R bindings, opening Spark to the large community of R users.  Out of the gate, the R interface enables the R user to leverage Spark DataFrames; the Spark team plans to extend the capability to include machine learning APIs in Spark 1.5.

Spark’s Expanding Ecosystem

Every major Hadoop distributor showed up this year, but there were no major announcements from the distributors (other than IBM’s bombshell).

In other developments:

  • Amazon Web Services announced availability of a new Spark on EMR service
  • Intel announced a new Streaming SQL project for Spark
  • Lucidworks showcased its Fusion product, with Spark embedded
  • Alteryx announced its plans to integrate with Spark in its Release 10

One interesting footnote — while there were a number of presentations about Tachyon last year, there were none this year.

These are just the key themes.  I’ll publish a more detailed story next week.

Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not scale (such as Dell Statistica).  Nor does Big Data capability appear to impact the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS alone does not scale without Netezza or BigInsights; SAS only scales if you add one of its distributed in-memory back ends.  These products aren’t listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

Big Analytics Roundup (April 6, 2015)

Late posting today due to holiday travel.

In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.

The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barrons, a must-read.

Analytic Software

Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot.  I’ll write a more detailed summary later this week.  Quick takes:  Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.

Apache Drill

Apache Drill announces Release 0.8.

Apache Spark

Analysis

On opensource.com, Jen Wike Huger interviews key Spark contributor Reynold Xin.

Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud.  (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)

Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit.  Key takeaways: no, Matei is not a musician, and yes, he likes Nutella. 

Spark has clearly reached a point of inflection when skeptical analysis emerges.  Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and the challenge of making those applications work well in MapReduce.  In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.

  • Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting.  Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East.  Who knew that Hadoop devotees are so sensitive?
  • In DataMill, Nicole Leskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
  • In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11.  His point will be lost on most readers.
  • Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.”   Note to Andrew: you can download the software here.

Spark Core

Matei Zaharia celebrates Spark’s fifth birthday with a brief history.

On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.

Spark Streaming

On the Databricks blog, Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kafka API available in Spark 1.3.

Databricks

Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud.  Case study available here.

Hadoop Ecosystem

In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity.  Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions).  Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS.  Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?

IBM

IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0.  BigInsights includes the usual Hadoop bits, plus:

  • Big SQL, a federation engine for SQL across relational databases and Hadoop
  • BigSheets, a Datameer-like spreadsheet-on-Hadoop tool
  • SystemML, a home-grown machine learning library that runs in MapReduce
  • Text analytics capability
  • Big R, an interface that can push embarrassingly parallel R processing into Hadoop

Streaming and Real-Time Processing

On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.

Big Analytics Roundup (March 30, 2015)

Lots of Spark news this week, following last week’s Sparkalanche, plus some other non-Spark news just to show that Big Analytics isn’t entirely about Spark.

Alteryx

  • In IntelligentHQ, Maria Fonseca interviews Alteryx COO George Mathew, who argues that analytics is for people.  Left unanswered: who else it could be for.

Analytic Startups

  • Analytics vendor Ayasdi lands a $55 million “C” round.
  • Localytics, which specializes in analytics for mobile and web apps, secures a $35 million “D” round.

Apache Drill

  • MicroStrategy announces certification of Apache Drill with MicroStrategy Analytics Enterprise Platform.

Apache Spark

Analysis

  • IBM Big Data “evangelist” James Kobielus confirms that IBM has no idea what to do with Spark.
  • In TechRepublic, Matt Asay argues that Hadoop won’t disappear just because it’s slow, knocking over several straw men in the process.   On readwrite, he makes similar points; and on InfoWorld, he goes for the hat trick.
  • In InfoWorld, Platfora’s Peter Schlampp offers five reasons why Spark is the next big thing.

Applications

  • On the Cloudera blog, Sam Shuster of Edmunds.com describes a dashboard built with Spark Streaming, SparkOnHbase and Morphlines.
  • In InfoQ, Srini Penchikala of Pinterest explains why he’s using Spark Streaming, Kafka and MemSQL for a real-time application.

Data Science

  • On the Databricks blog, Joseph Bradley writes an excellent article on Topic Modeling with Spark’s new Latent Dirichlet Allocation capability.

Developer

  • On the Databricks blog, Michael Armbrust describes new Spark SQL features in Spark 1.3.
  • On Slideshare, Vida Ha and Holden Karau share tips for writing better Spark programs; video here.

Deep Learning

  • Tomasz Malisiewicz of Vision.ai blogs on Deep Learning versus Machine Learning versus Pattern Recognition.

RapidMiner

  • RapidMiner publishes a white paper on code-free analytics in Hadoop, and another on Hadoop security.