Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R, fifth; Matlab, fourteenth; Scala, fifteenth; Julia, thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release).

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.
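To make the fraud detection example concrete, here is a minimal sketch in plain Python of joining a stream of transactions against static reference data. It is not Spark's API, and everything in it is an assumption for illustration: the `Transaction` fields, the `CUSTOMERS` and `MERCHANTS` lookup tables, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass
class Transaction:
    txn_id: str
    customer_id: str
    merchant_id: str
    amount: float


# Hypothetical static reference data, standing in for tables that change slowly.
CUSTOMERS: Dict[str, dict] = {
    "c1": {"home_country": "US", "avg_amount": 40.0},
    "c2": {"home_country": "DE", "avg_amount": 500.0},
}
MERCHANTS: Dict[str, dict] = {
    "m1": {"country": "US"},
    "m2": {"country": "BR"},
}


def flag_suspicious(stream: Iterable[Transaction]) -> Iterator[str]:
    """Yield the ids of transactions that trip a made-up rule."""
    for txn in stream:
        # Enrich the streaming record with static context.
        customer = CUSTOMERS.get(txn.customer_id)
        merchant = MERCHANTS.get(txn.merchant_id)
        if customer is None or merchant is None:
            continue  # no static context; a real system would handle this case
        foreign = merchant["country"] != customer["home_country"]
        unusual = txn.amount > 10 * customer["avg_amount"]
        if foreign and unusual:
            yield txn.txn_id


# A short list stands in for the live stream.
incoming = [
    Transaction("t1", "c1", "m1", 35.0),
    Transaction("t2", "c1", "m2", 900.0),
    Transaction("t3", "c2", "m2", 600.0),
]
suspicious = list(flag_suspicious(incoming))
print(suspicious)
```

The point of the sketch is only that the decision about each transaction depends on data that did not arrive with it; in a real deployment the lookups would be joins against tables, not in-memory dictionaries.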

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

More stories here, here, here, here, here, here, here, here, and here.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegulalp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Big Analytics Roundup (May 23, 2016)

Google announces that it has designed an application-specific integrated circuit (ASIC) expressly for deep neural nets. Tech press goes bananas. The chips, branded Tensor Processing Units (TPUs), require fewer transistors per operation, so Google can fit more operations per second into the chip. In about a year of operation, Google has achieved an order of magnitude improvement in performance per watt for machine learning.

Google’s Felipe Hoffa summarizes Mark Litwintschik’s work benchmarking different platforms with the New York City Taxi and Limousine Commission’s public dataset of 1.1 billion trips. So far, Mark has tested PostgreSQL on AWS, ElasticSearch on AWS, Spark on AWS EMR, Redshift, Google BigQuery, Presto on AWS and Presto on Cloud Dataproc. Results make Google look good, but you should read Mark’s original posts.

Meanwhile, IBM fires more people. More here and here.

Open Data Science Conference

The second annual Open Data Science Conference (ODSC) East met in Boston over the weekend. Attendance doubled from last year, to 2,400.

Registration was a snafu, because the conference organizers did not accurately predict walk-in traffic or staffing needs. The jokes write themselves.

Content was excellent. Keynoters included Stefan Karpinski (Julia co-creator), Kirk Borne of Booz Allen Hamilton, Ingo Mierswa, CTO of RapidMiner, and Lukas Biewald, CEO of CrowdFlower. Track leaders included JJ Allaire and Joe Cheng of RStudio, Usama Fayyad of Barclays and John Thompson of the US Census Bureau. Sponsors included Basis Technology, CartoDB, CrowdFlower, Dataiku, DataRobot, Dato, Exaptive, Facebook, H2O.ai, MassMutual, McKinsey, Metis, Microsoft, RapidMiner, SFL Scientific and Wayfair.

Prompted by a tweet, I stopped at the Dataiku table. The conversation went like this:

  • Me: What does Dataiku do, in 25 words or less?
  • Dataiku: DataRobot.
  • Me: What?
  • Dataiku: We do what DataRobot does.

At this point, it was clear to me that Mr. Dataiku either did not know what DataRobot does, or thought I didn’t know what DataRobot does. So I changed the subject.

The next ODSC event is in October, in London.

Explainers

— Michael Armbrust and Tathagata Das explain Structured Streaming in Spark 2.0

— Adrian Colyer goes 5 for 5 for the week:

— Tim Hunter, Hossein Falaki and Joseph Bradley explain HyperLogLog and Quantiles in Spark.

— Microsoft’s Raymond Laghaeian explains how to use Azure ML predictions in Google Spreadsheet.

Perspectives

— Serdar Yegulalp cites PayScale data in noting that if you know Scala, Go, Python and Spark you can expect to make more money.

— Tim Spann weighs the advantages of Java and Scala, and explains DL4J.

— Sam Dean celebrates Drill’s first anniversary.

— Taylor Goetz delivers a brief history of Apache Storm.

Open Source Announcements

— MongoDB releases a new Spark Connector.

— Apache Tajo announces Release 0.11.3, with five bug fixes.

— Apache Mahout announces Release 0.12.1, a maintenance release that resolves an issue with Flink integration.

Commercial Announcements

— RedPoint Global snags a $12 million “C” round.

— TIBCO announces something called Accelerator for Apache Spark, a bundle of tools that connect TIBCO products with open source packages. While TIBCO refers to this component as open source, the software is available only to TIBCO customers, which means it isn’t Free and Open Source.

— MapR applauds itself.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.  You can get a free copy here from RapidMiner (registration required).  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, H2O.ai, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model, like H2O.ai.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

Accenture makes the analysis this year, according to Gartner, because it acquired i4C Analytics, a tiny privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing;  hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

Alpine Data Labs declines a lot in “Ability to Deliver,” which makes sense since they appear to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum Pivotal and EMC Dell, and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users”; that would be anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’ scores are up for customer satisfaction and delivering business value, which suggests that whoever it is at Gartner that decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analysis Services are less than thrilled with that product.

Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required. 

RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

SAS declines slightly in Ability to Deliver.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.

Big Analytics Roundup (October 5, 2015)

Announcements timed to coincide with Strata NYC 2015 drive the news this week.  The single most interesting item for Big Analytics is the O’Reilly 2015 Data Science Survey, which warrants a post of its own.  Two key points:

  • Data scientists still use SQL, Excel, Python and R. (Doh!)
  • Data scientists who spend time in meetings and presenting results of analysis earn more than the grunts who muck around with data.

The lesson is clear: if you want to earn the big bucks, stop messing around with Zeppelin and learn PowerPoint.

There are a few interesting presentations from Strata embedded in this post, plus two that stand out:

  • Ron Kasabian of Intel and Michael Draugelis of Penn Medicine explain how they improve medical decision-making with predictive analytics.
  • Iulia Pasov and Calin-Andrei Burloiu show how they use data science to measure and prevent churn at Avira.

Paul Kent has the toughest job at SAS — promoting an initiative his boss thinks is hype.  In this sponsored presentation, Paul does a professional job presenting SAS’ Big Data story, which seems compelling.  However, the challenge for SAS in Big Data remains: name a reference customer.

Spark

Commentary

MapR’s Jim Scott takes another shot at the “will Spark replace Hadoop?” meme.  All together now:

  • Spark, like MapReduce, is a compute engine
  • Hadoop = MapReduce + HDFS + YARN + (an ecosystem of other bits)
  • Spark lacks a native file system, so it can never replace Hadoop
  • Coming in 2016:  Hadoop = Spark + HDFS + YARN + (an ecosystem…)
  • Also possible: Spark + Cassandra, Spark + MongoDB, Spark + Druid, Spark + (your database here)

On the Intersog blog, Jenny Richards gets it right by focusing on the differences between Spark and MapReduce.

Spark Maintenance Release

The Spark team announces Spark 1.5.1, a maintenance release with about 80 bug fixes.  On the Databricks blog, Reynold Xin explains how Spark version numbers work.  Short version: top level numbers correspond to API compatibility, dot releases include features and enhancements, double-dot releases have bug fixes.
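The short version can be written down as a small classifier. This is only a sketch of the rule as summarized here, assuming plain three-part version strings such as "1.5.1"; the function names `parse_version` and `release_kind` are made up for illustration and are not part of Spark.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a three-part version string like '1.5.1' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def release_kind(old: str, new: str) -> str:
    """Classify an upgrade from `old` to `new` using the rule above."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "major"        # top-level number: API compatibility may change
    if new_v[1] != old_v[1]:
        return "feature"      # dot release: features and enhancements
    return "maintenance"      # double-dot release: bug fixes


print(release_kind("1.5.0", "1.5.1"))
print(release_kind("1.4.1", "1.5.0"))
```

By this rule, moving from 1.5.0 to 1.5.1 is a maintenance upgrade, which matches how the Spark team describes this week's release.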

Spark Use Cases and Success Stories

At Strata NYC 2015, Databricks’ Reynold Xin describes “sketching” with Spark (aka exploratory analysis and feature engineering).

Also at Strata, Edd Dumbill presents the business case for Spark, Kafka “and friends.”

On the MongoDB blog, Mat Keep interviews Thiago Cardoso, co-founder and CTO of Hekima, a social media analytics startup.  Hekima uses Spark, Hadoop and (you guessed it) MongoDB.

On SmartDataCollective, the ubiquitous Jim Scott describes what he calls use cases for Spark: exploratory analytics; machine learning; real-time dashboards and ETL.

At Big Data Day LA 2015, ESRI’s Adam Mollenkopf explains how to apply GeoSpatial Analytics with Spark. Video here.

Spark Integration

A slew of software vendors announce integration with Spark.

–Talend announces Release 6, which leverages Spark and Spark Streaming.  Stories here, here, here, here, here, here, and here.

–Syncsort announces integration with Spark and Kafka.

–TIBCO announces Spotfire integration with Spark through the Spark SQL and SparkR APIs.  More here, here and here.

–Dataiku announces integration of its Data Science Studio (DSS) software with Spark.  DSS offers a commercially licensed visual workbench enabling the user to build pipelines integrating a number of data sources and formats.  Analytic functionality is modest.

–SnapLogic announces pending release of what it calls SnapLogic Elastic Integration Platform, which includes components branded as “Sparkplex” and Spark “Snaps”.  The former includes a code generator that translates user requests from the SnapLogic visual pipeline designer into Spark code.  The latter is SnapLogic branding for “prebuilt connectors”, of which there are many.  SnapLogic claims that Snaps are plug and play, so it’s a “snap” to convert your pipeline from MapReduce to Spark.  Stories here, here, here, here, and here.

Spark as a Service

In case you’re not happy with offerings from Databricks, Qubole, Amazon Web Services, Google, BlueData and MemSQL, Altiscale announces you-know-what.

SQL/OLAP

Apache Drill

Dremio, a startup led by MapR’s Drill gurus, lands a $10 million A round.

At the Hadoop Meeting NYC, the folks from Dremio present Drill use cases and a roadmap.

Polymath Abhishek Tiwari reflects on Drill.

On the MapR blog, Joseph Blue explains how to identify a data breach with Drill.

AtScale

AtScale adds support for Microsoft PowerBI to its BI on Hadoop story.

Presto

On ZDNet, Natalie Gagliordi touts Teradata’s embrace of Presto.  (Note to ZDNet’s headline editor: Presto is not an Apache project).  She correctly notes that Presto is faster than Hive-on-MapReduce, but that’s a low bar; just about everything is faster than Hive releases prior to 0.13, but Hive-on-Tez competes well with Drill, Impala, Presto and Spark SQL.  That’s a problem for Presto, because a challenger has to be outstanding at something.  Once again, it appears that Teradata is betting on the wrong horse.

Machine Learning

SparkR

Databricks’ Hossein Falaki delivers a presentation at Strata on supercharging R with Spark; slides here.  Spark’s R API is incomplete; as of Spark 1.5.1 it supports DataFrames operations (including SQL queries) and generalized linear models.  That’s better than nothing, but R users who need to do serious machine learning now need to look elsewhere.

Streaming Analytics

Apache Flink

On the MapR blog, Ellen Friedman introduces you to Flink.

Data Artisans’ Robert Metzger delivers a presentation about the architecture of Flink’s streaming runtime at ApacheCon Europe.

Spark Streaming

At Strata NY 2015, Databricks’ Tathagata Das describes the new bits in Spark Streaming.

Big Analytics Roundup (September 28, 2015)

Strata+Hadoop World NYC is upon us.  Andrew Brust opines that there will be three themes at Strata this year: (1) Spark “versus” Hadoop; (2) streaming goes mainstream; (3) data governance matters.  My take:

  1. “Spark versus Hadoop” is controversy for the sake of people who like controversy.  Spark works with Hadoop, and Spark works with other platforms, or by itself.  Use cases will determine the best platform.
  2. We’ve been hearing that streaming is mainstream for something like ten years now.  There are a half-dozen commercial products in the space, plus multiple open source frameworks.
  3. Data governance is a soporific.

Due to the spate of Spark stories this week, this week’s roundup has four sections: Spark, SQL, Machine Learning and Streaming.  The top story is Databricks’ Spark survey, which provoked a flurry of analysis.

Spark

2015 Spark Survey

Databricks released results of its 2015 Spark Survey, available here (registration required); an infographic is here.  The “report” is a somewhat informative mashup of survey findings, plus other information, such as the headcount from Spark Summits.  (Spoiler: it’s increasing.)  On the Databricks blog, Matei Zaharia, Patrick Wendell and Denny Lee summarize key points.  Additional analysis here, here, here, here, here, here, here and here.

Analysts, loving controversy, note that Spark users slightly prefer standalone configurations over Spark-on-YARN (i.e. co-located in Hadoop).  Andrew Oliver, for example, commenting on Cloudera’s One Platform announcement earlier this month, argues that Databricks is actively marketing against Spark-on-YARN, citing results of this survey.  But if you compare these results to the Typesafe/Databricks Spark survey published in January, you will note that respondents to the 2015 survey are slightly less likely to run Spark in a standalone cluster than respondents to the earlier survey.

Other analysts, like Tony Baer, note that 11% of respondents run Spark on Mesos, hinting darkly that since the AMPLab team developed both Spark and Mesos, there must be some sort of conspiracy against Hadoop.  But in the earlier survey, 26% of respondents said they run on Mesos, so if someone is organizing a secret cabal to compete against Spark-on-YARN, it’s not working out too well.

The biggest news in the survey is the rapid growth of users who use the Python API, from 22% to 58%, and the corresponding decline among those who use Scala or Java.  The SQL and R interfaces are too new to compare to the previous survey, but it’s worth noting that in 2015 more respondents use the SQL interface than the Java interface.

Spark as a Service

Google announces Cloud Dataproc, a managed Spark and Hadoop service, currently available in beta.  Key benefits claimed: cheap, fast, integrated with the other Google Cloud platform services, easy to manage, simple and familiar.  Google claims that they can set up or knock down a cluster in ninety seconds or less.  Billing is by the minute, which is cool.  Stories here, here, here, here, here, here, here, here, here, here, here, here, here, here, and here.

BlueData offers Yet Another Spark Service.

In case you’re not happy with available offerings for Spark-as-a-service from Databricks, Qubole, Amazon Web Services, Google and BlueData, MemSQL offers Streamliner.  Stories here, here, here, here and here.

Miscellaneous Spark Bits

Jim Scott enters the Spark vs. Hadoop fray and gets it wrong.  No, Spark does not need HDFS; it works perfectly well with other datastores.

Jim Scott (again) lists five use cases for Spark Streaming: credit card fraud detection, network security, genomic sequencing, real-time ad targeting and hospital readmission.

On the MapR blog, the ubiquitous Jim Scott explains why Spark is a great companion to Hadoop.

In IT Jungle, Alex Woodie wonders what IBM’s embrace of Spark means for the product line IBM now brands as “i-series” and everyone else calls “AS-400”.  His answer: nothing, IBM has no plans to put Spark on these tired old boxes.

Writing for American Banker, Tom Groenfeldt interviews Tom Davenport, several vendors (Rob Thomas of IBM, David Wallace of SAS and Abhi Mehta of Tresata) and one banker.  Tom Davenport says that bankers use different things, touts Teradata; Rob Thomas talks about IBM’s Spark initiative; David Wallace says that banks use SAS, and the one banker talks about using Accenture.  From this muddle, Mr. Groenfeldt concludes that banks are turning to Spark.

In an article titled Retail Gains with Distributed Systems, Daniel Gutierrez talks about Hadoop and Spark, but provides no actual examples of retailers using these platforms.

SQL

Drill

MapR’s Drill team walks to start Dremio.

Jim Scott, who was quite busy last week, profiles Apache Drill.

On YouTube, a disembodied voice representing Syntelli Solutions offers you a Test Drive using Drill and Spotfire on AWS.

Impala

Cloudera benchmarks Impala with TPC-DS queries, concludes that maximum concurrency with good performance increases with the size of the cluster.  This does not seem surprising at all; more nodes in the cluster means more horsepower.

Spark

Harish Butani of Sparkline Data benchmarks TPC-H queries using Spark SQL on Druid, summarizes results on LinkedIn.  Conclusion: Spark on Druid runs a lot faster than Spark on Parquet.  Full report here. Sparkline publishes a Spark Druid interface in Spark Packages.

On the MapR blog, Michele Nemschoff touts the Hadoop and Spark platform for retail analytics it sold to Quantium, an Australian analytic services provider.

Platfora announces Release 5.0, which leverages Spark behind the scenes for data preparation.  Alex Woodie explains.  More stories here, here, here and here.

ClearStory Data announces a triumph of branding (“Intelligent Data Harmonization”) and a few new features in a muddled press release.

Machine Learning

Graphlab/Dato

Carlos Guestrin announces that Dato is a big believer in open source software, which will make you feel good when you pay the subscription fees on Dato’s commercial software.   Dato has released its SFrame columnar data frame to open source under a BSD license.  SFrames are like Pandas or R Frames, with some additional features useful to data scientists, like out-of-memory operations and support for wide datasets.

No doubt SFrames are cool, but the key challenge for companies in this space is to figure out how to make analytics work with mainstream data formats.  Any advantages of a new format are offset by the time and cost needed to ingest and export the data.

H2O/H2O.ai

At the Moscow Data Fest, H2O argues that machine learning is the new SQL.

Sam Dean interviews H2O.ai VP Marketing Oleg Rogynskyy.

Spark

Two items from the Databricks blog cover improvements to Spark’s machine learning capabilities in Spark 1.5:

Cloudera’s Sandy Ryza et al. contribute Spark-Timeseries, a Python and Scala library for analyzing large-scale time series datasets. (h/t Hadoop Weekly)

Streaming Analytics

Flink/Data Artisans

Concurrent and Data Artisans announce “strategic partnership” to support Cascading on Flink.  Cascading touts.

On the MapR blog, Ellen Friedman introduces you to Flink.

TIBCO Streambase

TIBCO’s Kai Waehner presents a nice overview of stream processing frameworks and products.  Not surprisingly, he likes TIBCO Streambase, but the deck nicely summarizes differences between the commercial and open source options.

Forrester “Wave” for Predictive Analytics

Last week, Forrester published its 2015 “Wave” report for Big Data Predictive Analytics Solutions.  You can pay $2,495 and buy it directly from Forrester (here), or you can get the same report for free from SAS (here).

The report is inaptly named, as it commingles software that scales to Big Data (such as Alpine Chorus) with software that does not scale (such as Dell Statistica).  Nor does Big Data capability appear to impact the ratings; otherwise Alpine and Oracle would have scored higher than they did, and SAP would have scored lower.  IBM SPSS alone does not scale without Netezza or BigInsights; SAS only scales if you add one of its distributed in-memory back ends.  These products aren’t listed among the evaluated software components.

Also, Forrester seriously needs to hire an editor.  Alteryx does not currently offer software branded as “Alteryx Analytics”, nor does SAS currently offer a bundle called the “SAS Analytics Suite.”

Forrester previously published this wave in 2013; key changes since then:

  • Among the Leaders, IBM edged past SAS for the top rating.
  • SAP’s rating did not change but its brand presence improved considerably, which demonstrates the uselessness of brand presence as a measure of value.
  • Oracle showed up at the beauty show this time, and improved its position slightly.
  • Statistica’s rating did not change, but its brand presence improved due to the acquisition by Dell.  (See SAP, above).  Shockingly, the addition of “Toad Data Point” to the Dell/Statistica solution did not move the needle.
  • Angoss improved its ratings and brand strength slightly.
  • TIBCO and Salford switched their analyst relations budgets from Forrester to Gartner and are gone from this report.
  • KXEN and Revolution Analytics are also gone due to acquisitions.  Interestingly, the addition of KXEN to SAP had no impact on SAP’s ratings, thus demonstrating that two plus zero is still two.
  • RapidMiner, Alteryx, FICO, Alpine, KNIME and Predixion are all new to the report.

Gartner issued its “Magic Quadrant” back in February; the comparisons are interesting:

  • KNIME is a “leader” in Gartner’s view, while Forrester considers the product to be decidedly mediocre.  Seems to me that Forrester has it about right.
  • Oracle did not participate in the Gartner MQ.
  • RapidMiner, a “leader” in the Gartner MQ, scores very well on Forrester’s “Current Offering” axis, but less well on “Strategy.”   This strikes me as a good way for Forrester to sell strategy consulting.
  • Microsoft and Alpine landed in Gartner’s Visionary quadrant but scored relatively low in Forrester’s assessment.  Both vendors have appealing strategies, and need to roll up their sleeves to deliver.
  • Predixion trails the pack in both reports.  Reminds me of high school gym class.

Forrester’s methodology places more weight on the currently available software, while Gartner places more emphasis on the vendor’s “vision.”  Vision is certainly important to consider when selecting a software vendor, but leadership tends to be self-sustaining; today’s category leaders are likely to be tomorrow’s category leaders, except when markets are disrupted — in which case analysts are rarely able to pick winners.

More Comments on Microsoft + Revolution Analytics

My inbox continues to fill with Google Alerts about Microsoft’s announced purchase of Revolution Analytics — too numerous to link.

Most of these stories simply repackage the Microsoft announcement.

Clint Boulton of the WSJ’s CIO Journal writes one of the best analyses:

Microsoft is betting on the timeliness of its acquisition as more businesses adopt analytics. Revolution’s software helps companies use R, an open source programming language that more than two million programmers use daily to build predictive models. R is popular among university computer science students, many of whom continue to use it in their careers as data scientists.

Data scientists extract data from a data warehouse or Hadoop processing system, use R to slice and dice it for insights, and visualize the results. But businesses analyzing financial, social media and other data often need to scale the analytics across clusters of computers.

Several analysts pass along the factoid that two million people use R.   The truth is that nobody has any idea how many people use R; we don’t even know how many have downloaded the software.  The New York Times pointed out the difficulty in its piece five years ago:

While it is difficult to calculate exactly how many people use R, those most familiar with the software estimate that close to 250,000 people work with it regularly.

It’s possible that R has gained 1,750,000 users in the intervening five years.  It’s also possible that R has gained 10,000,000 users.  “Those most familiar with the software” are simply guessing.

While most analysts are neutral to positive on Microsoft’s move, Mr. Dan Woods takes a contrary view.  In an article published in Forbes and cross-posted on multiple platforms, Mr. Woods argues that Microsoft was wrong to buy Revolution Analytics, and instead should buy Tibco.   (That is the implication of his argument that Microsoft should “emulate” Tibco, since the only way to “emulate” Tibco is to own the clump of software Tibco packages up as TERR.)

Mr. Woods is a “content specialist”, as freelance writers call themselves today, and his expertise in analytics is exemplified by his most recent book, Wikis for Dummies, published in 2007.  One suspects that the private equity firm that acquired Tibco in September is peddling the pieces, and has engaged “content specialists” to bang the drum.

Mr. Woods gets two things right.  It’s true that R is a mess, and it is also true that the GPL license makes R difficult to commercialize.  R’s messiness is a byproduct of crowdsourced development; it is a feature to its devotees and a bug for everyone else.  (For those who simply cannot tolerate R’s messiness there is a simple solution: use Python.)  Under the GPL license, any enhancements become part of the free distribution, so if you distribute a product built with R you must share the source code of your product as well.

At the crux of his argument, though, Mr. Woods gets it wrong:

Revolution Analytics has made a business, like many open source-based companies, of supporting Open Source R.

This is factually incorrect.  Revolution only recently started to offer a consulting service for open source R users; for most of its history, its business was built around Revolution R Enterprise, a commercially supported enhanced R distribution.  This is not a trivial distinction.  Cloudera Hadoop, for example, is based on Apache Hadoop, but it is not the same thing; while many enterprises use commercially supported Hadoop distributions (from vendors like Cloudera, Hortonworks or MapR), hardly anyone uses open source Apache Hadoop in production.

The same is true for R; while many enterprises are reluctant to use open source R, they are willing to deploy commercially supported R distributions (such as Oracle R or Revolution R).  This is the business Microsoft enters by acquiring Revolution Analytics.

Regarding Mr. Woods’ point about the need to rebuild R from the ground up, that is neither possible nor necessary.  The GPL license prevents anyone from “rebuilding” R as a commercial venture; if anyone “rebuilds” the language it will be the open source development team itself.

In any case, one need not “make R scale”; one need only provide an R API to other platforms (such as Apache Spark or dbLytix) that can scale, so that R users can interface with them.   This is the approach taken by Revolution Analytics’ ScaleR software, which is actually written in C, but includes an interface from the R programming language.  By building this component into Azure, Microsoft can offer those who use R locally a scalable back end.

Update: Mr. Woods doubles down here.