Big Analytics Roundup (August 8, 2016)

So, Apple acquires Turi for $200 million. Hopefully, Apple did not pay for brand equity.

Bridget Botelho argues that businesses must either disrupt or be disrupted, and outlines the role of machine learning. Someone should write a book about that.

Conference Announcements

— Flink Forward announces the schedule for its second annual event, to be held September 12-14 in Berlin.

— Databricks announces the agenda for Spark Summit Europe 2016 in Brussels (October 25-27).

Apple Buys GraphLab Dato Turi

Geekwire breaks the story, reporting a purchase price of $200 million. According to TechCrunch, Turi notified customers that its products would no longer be available. Apple adds Turi to the portfolio of machine learning startups it has acquired in the past year, including Emotient, Perceptio, and VocalIQ. More reporting here.

GraphLab started in 2009 as an open source project led by Carlos Guestrin of Carnegie Mellon. (According to OpenHub, Guestrin never contributed any code.) In May 2013, Guestrin raised $6.75M to start an eponymous venture to provide commercial support for GraphLab. In October 2014, GraphLab announced the availability of GraphLab Create, a commercially licensed software product. Contributions to the open source project actually ended in 2013; while the code remains on GitHub, the project is dead.

GraphLab changed its name to Dato in January 2015. They should have googled the name; at the time, the top links in a search included Dato Foland, a gay porn star, and Datto Inc, a data backup and recovery company in Connecticut. The latter proved problematic; Datto sued, forcing Dato to rebrand as Turi last month.

Turi’s open source SFrame project remains for those who think introducing another file system into the mix is a smart thing to do.

Teradata: 9 Straight Quarters of Declining Product Revenue

For the second quarter of 2016, declining data warehouse giant Teradata reports an 11% decline in product revenue compared to Q2 2015. (Product revenue includes revenue from licensing software and hardware — boxes with the Teradata brand.) Maintenance revenue increased slightly, which means that customers aren’t pulling the plug on Teradata databases as fast as they did last year. Consulting revenue declined by 1%, which casts doubt on TDC’s stated strategy to become a services powerhouse.


Count me as skeptical about the merits of that plan. Teradata’s consulting revenue remains highly correlated with product revenue; in other words, if Teradata can’t sell its boxes, it’s not going to sell billable hours for consultants to implement those boxes. Teradata is not a credible competitor in the market for consulting-led solutions; companies like Oracle, IBM and SAS have a twenty-year head start.

Since Teradata performed better than “expectations”, Wall Street rewarded the stock with a bounce above $30.  It’s a dead-cat bounce. As the Wall Street Journal notes, companies routinely game analyst expectations. TDC currently trades at 32 times trailing earnings, well above its peers; moreover, its peers are growing rather than declining.


— Kaarthik Sivashanmugam explains how to develop Apache Spark applications in .NET with Mobius.

— On the Cloudera Engineering blog, Devadutta Ghat et al. explain the latest performance improvements in Impala 2.6.

— Parsey McParseface now has 40 cousins. On the Google Research Blog, Chris Alberti et al. explain.

— Ujjwal Ratan explains how to use Amazon Machine Learning to predict patient readmission.


— Curt Monash offers his assessment of Spark. Highlights:

  • Spark replaces MapReduce, in particular for data transformation.
  • Spark is becoming the default platform for machine learning.
  • Spark SQL is OK as an adjunct for other analysis.
  • Spark Streaming is doing well, but there are challengers. (See below.)
  • Databricks’ managed service for Spark has more than 200 subscribers.

— Serdar Yegulalp deploys the tired old “pure streaming versus microbatch” argument to claim that Apache Apex, Heron, Apache Flink and Onyx are “contenders” versus Spark. Someone should show him this graph:

Screen Shot 2016-07-18 at 8.26.11 AM

— In Datanami, Alex Woodie profiles Flink.

— Vance McCarthy touts MapR’s Spyglass Initiative for analytics on the MapR Converged Data Platform.

— Trevor Jones describes Microsoft Azure’s big data tools.

— Sam Dean champions Sparkling Water, H2O’s interface to Spark.

Commercial Announcements

— Dataiku announces the release of Data Science Studio 3.1, with five machine learning back ends and a visual coding interface (which it labels “code-free”).  Dave Ramel reports.

— John Snow Labs announces it will deliver curated data in Parquet format.

— Lexalytics announces the availability of its Semantria text analytics software on Azure.

Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.


Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents conducted among a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report.

Top Read of the Week

Guoqiang Jerry Chen et al. explain real-time data processing at Facebook. Adrian Colyer summarizes.


— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.


— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost in 2014. At that rate, Talend may break even in 2020, if nothing else happens in the interim.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.   You can get a free copy here from RapidMiner (registration required).  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

— Accenture makes the analysis this year, according to Gartner, because it acquired i4C Analytics, a tiny privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing;  hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

Alpine Data Labs declines a lot in “Ability to Deliver,” which makes sense since they appear to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum (now Pivotal) and EMC (now Dell), and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users,” which is spot on for anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’ scores are up for customer satisfaction and delivering business value, which suggests that whoever it is at Gartner that decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

— Dell: Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

— KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analysis Services are less than thrilled with that product.

Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required. 

RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

SAS declines slightly in Ability to Deliver.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.

Big Analytics Roundup (October 19, 2015)

Ten stories this week.  Don’t miss story #10, which recaps an analysis of collaboration and influence in the U.S. Congress using open source graph engines and a rich database of legislation.

(1) Rexer: R Continues to Lead

Rexer Analytics has released preliminary results from its 2015 survey of working analysts; Bob Muenchen reports.  One interesting snippet: reported tool use.


Several interesting changes from the previous survey:

  • Reported primary and total use of R continues to increase
  • SPSS/Statistics declined slightly in reported usage, remains #2
  • RapidMiner is way down, from third to ninth.  Also interesting to note that ~95% of RapidMiner users say they use the free version.
  • SAS usage remained constant, but moved up in rank to third as RapidMiner fell
  • Reported usage of Excel Data Mining and Tableau is way up from previous rounds of the survey

Like most surveys on this topic, there are issues with Rexer’s sampling methodology that mandate careful interpretation.  Rexer’s methods are largely consistent from year to year, however, so changes between iterations of the survey are interesting and may reflect real-world trends.

(2) CfP for Spark Summit East Opens

Spark Summit East will meet at the New York Hilton February 16-18; I will be there, with bells on.  The Call for Presentations is now open, link here.

(3) DataTorrent Explains DAGs

On the DataTorrent blog, Thomas Weise explains directed acyclic graphs, or DAGs, a fancy name for a way to describe logical dependencies with dots and arrows.  It sounds prosaic, but DAGs are fundamental to Storm, Spark, Tez and Apex, all of which play a role in bringing high-performance computing to the Hadoop ecosystem.
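To make the "dots and arrows" idea concrete, here is a minimal sketch in Python using the standard library's graphlib module (Python 3.9 or later). The step names are hypothetical and are not taken from Weise's post; the point is only that once dependencies are declared as a DAG, a valid execution order falls out mechanically.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step (a dot), and each value is the
# set of steps it depends on (the arrows pointing into it).
dag = {
    "load": set(),
    "clean": {"load"},
    "join": {"clean"},
    "aggregate": {"join"},
    "report": {"aggregate", "clean"},
}

# A topological sort yields an order in which every step runs only after
# all of its dependencies have finished.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

If the dependencies contained a cycle, TopologicalSorter would raise a CycleError, which is the "acyclic" part of the name.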

(4) New Apache Drill Release

SQL platform Apache Drill announces Release 1.2.  Key new bits:

  • Relational database support (through JDBC)
  • Additional window functions
  • Parquet metadata caching
  • Performance improvements on HBase and Hive tables
  • Drop table capability for files and directories
  • Enhanced MongoDB integration

Plus many bug fixes.  Nice work, Drill team, but it feels like rearranging the deck chairs.  Drill lags the other SQL engines in Kerberos support, YARN integration and query fault tolerance; while Teradata is stepping in to do something with Presto, Drill is an orphan.  There is no UI, and no sign that the BI vendors are looking to build on Drill, so it’s not clear where Drill goes from here.

(5) Fans Flock to Flink Forward ’15

The first Flink Forward conference met for two days in Berlin last week.  Data Artisans organized the program and delivered a number of the presentations.  Capital One’s Slim Baltagi has kindly shared the deck from his keynoter on Flink versus Spark.

(6) Big Data Spain Meets in Madrid

The 4th Edition of Big Data Spain met last week in Madrid.  On Slideshare, evil mad scientist Paco Nathan offers two decks:

Data Science in 2016, his keynote address, covers architectural design patterns; observations on trends; example applications and use cases; and offers a glimpse ahead.

Crash Introduction to Apache Spark, slides from a workshop, is exactly what it sounds like it is.

(7) MIT Researchers Build Data Science Machine

James Max Kanter and Kalyan Veeramachaneni of MIT develop an automated Data Science Machine (DSM), enroll it in three data science competitions, and beat 615 out of 906 teams.  DSM performed “nearly” as well as the human teams; but while humans spent months developing their models, the DSM spent 2-12 hours.

In their paper, Kanter and Veeramachaneni describe an approach to feature engineering they call Deep Feature Synthesis, which generates features based on automated analysis of a relational data model.  The authors note that a naive grid search for the optimal model specification would require trillions of experiments; they use Bayesian optimization to find the best model.
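As a rough illustration of generating features from a relational model, the sketch below applies a handful of aggregation functions across a parent-child link. It is not the authors' implementation: the table, the column names, the list of aggregations and the search-space sizes are all hypothetical, chosen only to show the flavor of the idea and why exhaustive search gets expensive.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: one row per order, linked to a customer.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# Aggregations applied across the customer-to-orders relationship.
PRIMITIVES = {"sum": sum, "mean": mean, "max": max, "count": len}

def synthesize(rows, key, column):
    """Derive one feature per aggregation for each parent record."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    return {
        parent: {f"{name}({column})": fn(values) for name, fn in PRIMITIVES.items()}
        for parent, values in grouped.items()
    }

features = synthesize(orders, "customer_id", "amount")

# Hypothetical search space: 10 tuning parameters with 20 candidate values
# each. An exhaustive grid would need 20**10 model fits.
grid_size = 20 ** 10
print(features[1], grid_size)
```

Even this toy search space runs to roughly ten trillion combinations, which is the kind of count that makes a guided search more attractive than an exhaustive grid.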

(8) Spark-Based Security Platform Lands Funding

DataVisor, founded in 2013, announces a $14.5 million “A” round from GSR and NEA to develop its eponymous security analysis engine, which runs on Spark.  The company, based in Mountain View, claims that its software can process billions of events per hour, and boasts Yelp and Momo as customers.

(9) Dato Releases Spark-GraphLab Interface

On the Dato Blog, Emad Soroush introduces the spark-sframe package, which enables a GraphLab user to ingest Spark RDDs as GraphLab SFrames.  Dato introduced SFrames a couple of weeks ago.  As I noted at the time, it doesn’t really matter how cool the SFrame is, it’s YADF — Yet Another Data Format.

Rather than forcing data scientists to convert data to a new format, machine learning vendors need to figure out how to work with existing Hadoop formats.  Dato isn’t going to build a complete Business Analytics stack; it’s going to have to integrate with SQL engines and other tools, and YADF makes that harder, not easier.

I also have to wonder why Dato hasn’t registered this package on Spark Packages, like everyone else who integrates with Spark.

(10) Spark Plus GraphX Equals Mazerunner

On his personal blog, William Lyon demonstrates an analysis of influence in the U.S. Congress using the Neo4j graph database, Apache Spark GraphX and Mazerunner, an open source project that merges the capabilities of Neo4j and Spark.  In a previous post, Lyon showed how he loaded data into Neo4j to build a rich graph of collaboration among different members of Congress.


Next, he uses Mazerunner’s PageRank tooling to calculate the influence for each Senator and Member of Congress.  Mazerunner selects and extracts the relevant subgraph from Neo4j, runs a Spark GraphX job and writes the results back to Neo4j.

Mazerunner is free and open source under an Apache 2.0 license, and is distributed on GitHub.  Currently, it supports algorithms for PageRank, Closeness Centrality, Betweenness Centrality, Triangle Counting, Connected Components and Strongly Connected Components.
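For readers who have not seen PageRank outside of web search, here is a minimal sketch of the calculation in plain Python. It does not use the Mazerunner, Neo4j or GraphX APIs, and the edges are hypothetical rather than data from Lyon's analysis; it only shows how influence scores emerge from repeatedly passing rank along the arrows of a graph.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical edges: an arrow points from a legislator to a colleague
# whose bill they cosponsored.
edges = [("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(edges)
most_influential = max(ranks, key=ranks.get)
print(most_influential, ranks)
```

In this toy graph, legislator C collects support from both A and B and ends up with the highest score, while B, whom nobody points to, is left with only the baseline share.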

Big Analytics Roundup (September 28, 2015)

Strata+Hadoop World NYC is upon us.  Andrew Brust opines that there will be three themes at Strata this year: (1) Spark “versus” Hadoop; (2) streaming goes mainstream; (3) data governance matters.  My take:

  1. “Spark versus Hadoop” is controversy for the sake of people who like controversy.  Spark works with Hadoop, and Spark works with other platforms, or by itself.  Use cases will determine the best platform.
  2. We’ve been hearing that streaming is mainstream for something like ten years now.  There are a half-dozen commercial products in the space, plus multiple open source frameworks.
  3. Data governance is a soporific.

Due to the spate of Spark stories this week, this week’s roundup has four sections: Spark, SQL, Machine Learning and Streaming.  The top story is Databricks’ Spark survey, which provoked a flurry of analysis.


2015 Spark Survey

Databricks released results of its 2015 Spark Survey, available here (registration required); an infographic is here.  The “report” is a somewhat informative mashup of survey findings, plus other information, such as the headcount from Spark Summits.  (Spoiler: it’s increasing.)  On the Databricks blog, Matei Zaharia, Patrick Wendell and Denny Lee summarize key points.  Additional analysis here, here, here, here, here, here, here and here.

Analysts, loving controversy, note that Spark users slightly prefer standalone configurations over Spark-on-YARN (i.e. co-located in Hadoop).  Andrew Oliver, for example, commenting on Cloudera’s One Platform announcement earlier this month, argues that Databricks is actively marketing against Spark-on-YARN, citing results of this survey.  But if you compare these results to the Typesafe/Databricks Spark survey published in January, you will note that respondents to the 2015 survey are slightly less likely to run Spark in a standalone cluster this year compared to last year.

Other analysts, like Tony Baer, note that 11% of respondents run Spark on Mesos, hinting darkly that since the AMPLab team developed both Spark and Mesos, there must be some sort of conspiracy against Hadoop.  But in the earlier survey, 26% of respondents said they run on Mesos, so if someone is organizing a secret cabal to compete against Spark-on-YARN, it’s not working out too well.

The biggest news in the survey is the rapid growth of users who use the Python API, from 22% to 58%, and the corresponding decline among those who use Scala or Java.  The SQL and R interfaces are too new to compare to the previous survey, but it’s worth noting that in 2015 more respondents use the SQL interface than the Java interface.

Spark as a Service

Google announces Cloud Dataproc, a managed Spark and Hadoop service, currently available in beta.  Key benefits claimed: cheap, fast, integrated with the other Google Cloud platform services, easy to manage, simple and familiar.  Google claims that they can set up or knock down a cluster in ninety seconds or less.  Billing is by the minute, which is cool.  Stories here, here, here, here, here, here, here, here, here, here, here, here, here, here and here.

BlueData offers Yet Another Spark Service.

In case you’re not happy with available offerings for Spark-as-a-service from Databricks, Qubole, Amazon Web Services, Google and BlueData, MemSQL offers Streamliner.  Stories here, here, here, here and here.

Miscellaneous Spark Bits

Jim Scott enters the Spark vs. Hadoop fray and gets it wrong.  No, Spark does not need HDFS; it works perfectly well with other datastores.

Jim Scott (again) lists five use cases for Spark Streaming: credit card fraud detection, network security, genomic sequencing, real-time ad targeting and hospital readmission.

On the MapR blog, the ubiquitous Jim Scott explains why Spark is a great companion to Hadoop.

In IT Jungle, Alex Woodie wonders what IBM’s embrace of Spark means for the product line IBM now brands as “i-series” and everyone else calls “AS-400”.  His answer: nothing, IBM has no plans to put Spark on these tired old boxes.

Writing for American Banker, Tom Groenfeldt interviews Tom Davenport, several vendors (Rob Thomas of IBM, David Wallace of SAS and Abhi Mehta of Tresata) and one banker.  Tom Davenport says that bankers use different things, touts Teradata; Rob Thomas talks about IBM’s Spark initiative; David Wallace says that banks use SAS, and the one banker talks about using Accenture.  From this muddle, Mr. Groenfeldt concludes that banks are turning to Spark.

In an article titled Retail Gains with Distributed Systems, Daniel Gutierrez talks about Hadoop and Spark, but provides no actual examples of retailers using these platforms.



MapR’s Drill team walks to start Dremio.

Jim Scott, who was quite busy last week, profiles Apache Drill.

On YouTube, a disembodied voice representing Syntelli Solutions offers you a Test Drive using Drill and Spotfire on AWS.


Cloudera benchmarks Impala with TPC-DS queries, concludes that maximum concurrency with good performance increases with the size of the cluster.  This does not seem surprising at all; more nodes in the cluster means more horsepower.


Harish Butani of Sparkline Data benchmarks TPC-H queries using Spark SQL on Druid, summarizes results on LinkedIn.  Conclusion: Spark on Druid runs a lot faster than Spark on Parquet.  Full report here. Sparkline publishes a Spark Druid interface in Spark Packages.

On the MapR blog, Michele Nemschoff touts the Hadoop and Spark platform for retail analytics it sold to Quantium, an Australian analytic services provider.

Platfora announces Release 5.0, which leverages Spark behind the scenes for data preparation.  Alex Woodie explains.  More stories here, here, here and here.

ClearStory Data announces a triumph of branding (“Intelligent Data Harmonization”) and a few new features in a muddled press release.

Machine Learning


Carlos Guestrin announces that Dato is a big believer in open source software, which will make you feel good when you pay the subscription fees on Dato’s commercial software.   Dato has released its SFrame columnar data frame to open source under a BSD license.  SFrames are like Pandas or R Frames, with some additional features useful to data scientists, like out-of-memory operations and support for wide datasets.

No doubt SFrames are cool, but the key challenge for companies in this space is to figure out how to make analytics work with mainstream data formats.  Any advantages of a new format are offset by the time and cost needed to ingest and export the data.


At the Moscow Data Fest, H2O argues that machine learning is the new SQL.

Sam Dean interviews VP Marketing Oleg Rogynskyy.


Two items from the Databricks blog cover improvements to Spark’s machine learning capabilities in Spark 1.5.

Cloudera’s Sandy Ryza et al. contribute Spark-Timeseries, a Python and Scala library for analyzing large-scale time series datasets. (h/t Hadoop Weekly)

Streaming Analytics

Flink/Data Artisans

Concurrent and Data Artisans announce “strategic partnership” to support Cascading on Flink.  Cascading touts.

On the MapR blog, Ellen Friedman introduces you to Flink.

TIBCO Streambase

TIBCO’s Kai Waehner presents a nice overview of stream processing frameworks and products.  Not surprisingly, he likes TIBCO Streambase, but the deck nicely summarizes differences between the commercial and open source options.

Big Analytics Roundup (September 21, 2015)

Top story of the week: release of AtScale’s Hadoop Maturity Survey, which triggered a flurry of analysis.  Meanwhile, the Economist ventures into the world of open source software and venture capital, embarrassing itself in the process; and IBM announces plans to use Spark in its search for extraterrestrial intelligence, a project that would be more useful if pointed toward IBM headquarters.

AtScale Releases Hadoop Adoption Survey

OLAP-on-Hadoop vendor AtScale publishes results of a survey of 2,200 respondents who are either actively working with Hadoop today or planning to do so in the near future.  AtScale partnered with Cloudera, Hortonworks, MapR and Tableau to recruit respondents for the survey.

A copy of the survey report is here; the survey instrument is here.  AtScale will deliver a webinar summarizing results from the survey; you can register here.

There are multiple stories about this survey in the media: here, here, here, here, here, here, here, here, and here.  Some highlights:

  • Andrew Oliver compares this survey to Gartner’s Hadoop assessment back in May and concludes that Gartner blew it.  While I agree that Gartner’s outlook on Hadoop is too conservative (and said so at the time), the two surveys are apples and oranges: while AtScale surveyed people who are either already using Hadoop or plan to do so, Gartner surveyed a panel of CIOs.  Hence, it is not surprising that AtScale’s respondents are more positive about prospects for Hadoop.
  • Matt Asay notes that “Cost saving” is the third most frequently cited reason for adopting Hadoop, after “Scale-out needs” and “New applications.”  This is somewhat surprising, given Hadoop’s reputation as a cheap datastore.  Cost is still a factor driving Hadoop adoption; it’s just not the primary factor.

Here are a few insights from this survey not mentioned by other analysts.  First, look at the difference in BI tool usage between those currently using Hadoop and those planning to use Hadoop.  Compared to current users, planners are significantly more likely to say they want to use Excel and less likely to say they want to use Tableau or SAS.  (Current and planned use of SAP Business Objects and IBM Cognos are about the same.)

[Chart: BI tool usage, current vs. planned Hadoop users]

Also interesting to note differences in Hadoop maturity among the BI users.  SAS users are more likely than others to self-identify as “Low Maturity”:

[Chart: Hadoop maturity by BI tool]

Finally, a significant minority of current Hadoop users cite Management, Security, Performance, Governance and Accessibility as challenges using Hadoop.  However, most who plan to use Hadoop do not anticipate these challenges, which suggests these respondents are in for a rude awakening.

[Chart: Hadoop challenges cited, current vs. planned users]

SQL on Hadoop

For those who like things distilled to sound bites, eWeek offers a point of view on when to select Apache Spark, Hadoop or Hive.   Brevity is the soul of wit, but sometimes it’s just brevity.

Amazon Web Services

Redshift is an OEM version of Actian’s ParAccel columnar database with analytic capabilities removed, which is why data scientists say that Redshift is where data goes to die.  Amazon Web Services has taken baby steps to ameliorate this, adding Python UDFs.  Christopher Crosbie reports on the AWS Big Data Blog. (h/t Hadoop Weekly)

Apache Apex/DataTorrent

On the DataTorrent blog, Amol Kekre introduces you to Apache Apex, which was just accepted by Apache as an incubator project.  DataTorrent touts Apex as kind of like Spark, only better, thereby demonstrating the importance of timing in life.  (h/t Hadoop Weekly)

If you think that Apex does nothing, Munagala Ramanath shares the good news that Apex supports the Malhar library.  Honestly, though, it still seems to do nothing.

In an email to David Ramel, DataTorrent CEO Phu Hoang identifies flaws in Spark, points to his Apache Apex project as a solution.  Bad move on his part.

Apache Drill

Chloe Green discusses implications of the European Commission’s digital single market, and suggests that retailers will use Apache Drill to analyze the data that will be produced under this regulatory framework.  There are two problems with this article.  First, Green makes no effort to consider alternatives to Drill.  Second, the article itself accepts the premise that more regulation will produce business growth; in fact, the opposite is more likely (except for those in the compliance industry).

The Drill team explains how to implement Drill in ten minutes.

Jim Scott summarizes the benefits of Drill for the BI user.

On O’Reilly Radar, Ellen Friedman recaps the history of Drill as an open source project.

Zygimantas Jacikevicius offers an introduction to Drill and explains why it is useful.

Apache Flink

On the Data Artisans blog, Kostas Tzoumas seeks to position Flink against Spark by arguing that batch is a special case of streaming.  Of course, you can argue the opposite just as easily: that streaming is batch with very small batches.
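For readers who want to see both sides of that argument side by side, here is a minimal sketch in plain Python.  It is not Flink or Spark code; every function name below is hypothetical and exists only for illustration.  The first half treats a batch as a stream that happens to end; the second treats a stream as a sequence of small batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def running_sum(stream: Iterable[int]) -> Iterator[int]:
    """Streaming operator: emit an updated total after every record."""
    total = 0
    for record in stream:
        total += record
        yield total


def batch_sum_via_stream(batch: List[int]) -> int:
    """Batch as a special case of streaming: feed a bounded input
    through the streaming operator and keep only the last result."""
    result = 0
    for result in running_sum(batch):
        pass
    return result


def micro_batches(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Cut a stream into consecutive chunks of at most `size` records."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def stream_sum_via_batches(stream: Iterable[int], size: int) -> Iterator[int]:
    """Streaming as very small batches: run an ordinary batch
    computation (sum) on each chunk and carry the total forward."""
    total = 0
    for chunk in micro_batches(stream, size):
        total += sum(chunk)
        yield total
```

Both routes reach the same final total.  The difference is when intermediate results appear: after every record in the first case, after every chunk in the second, and that gap is the latency trade-off the two camps are arguing about.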

If you care about Off-heap Memory in Apache Flink, Stephan Ewen offers a summary.

At a DC Area Flink Meetup, Capital One’s Slim Baltagi explains unified batch and real-time stream processing with Flink.

Flink sponsor Data Artisans announces partnership with SciSpike, a training and consulting provider.

Apache NiFi

Yves de Montcheuil explains why you should care about Apache NiFi, a project that connects data-generating systems with data processing systems.  Spoiler: it’s all about security and reliability.

Apache Spark

In Fortune, Derrick Harris describes Microsoft’s “Spark-inspired” and “Spark-like” Prajna project, does not explain why MSFT is reinventing the wheel.

Cloudera announces a Spark training curriculum.  For those without prior Hadoop experience, two courses cover data ingestion with Sqoop and Flume, data modeling, data processing with Spark and Spark Streaming with Kafka.  There is also a single shorter course covering the same ground for those with prior Hadoop experience.  Finally, a data science course covers advanced analytics with MLlib.

Document analytics vendor Ephesoft introduces new software built on Spark.

Matt Asay uses the Spark/Fire metaphor once too often.

In a post about DataStax, Curt Monash notes synergies between Spark and Cassandra.

MongoDB offers a white paper which explains, not surprisingly, how to use Spark with Mongo.

On the Basho blog, Korrigan Clark discusses his work using Spark to develop an algorithmic stock trading program.

Here are two items from Cloudera’s Kostas Sakellis on SlideShare.  The first explains why your Spark job fails; the second reviews how to get Spark customers to production.


Dato, the University of Washington and Coursera announce a machine learning specialization consisting of five courses and a capstone project.  The curriculum is platform neutral, though I suspect that co-creator Carlos Guestrin manages to get in a good word for his project.


Two items on SlideShare:

  • From a meetup at 6Sense, Mark Landry explains H2O Gradient Boosted Machines for Ad Click Prediction.
  • Avni Wadhwa and Vinod Iyengar demonstrate how to build machine learning applications with Sparkling Water, H2O’s interface to Spark.

Big Analytics Roundup (March 2, 2015)

Here is a roundup of some recent Big Analytics news and analysis.


  • SiliconAngle covers the Big Data money trail.

Apache Spark

  • Curt Monash writes about Databricks and Spark on his DBMS2 blog.
  • On the Databricks blog, Dave Wang summarizes Spark highlights from Strata + Hadoop World.
  • In this post, Hammer Lab describes how to monitor Spark with Graphite and Grafana.
  • Cloudera announces Hive on Spark beta.
  • InfoWorld covers Spark’s planned support for R in Release 1.3.
  • Qubole announces Spark as a Service.


  • Dato announces new version of GraphLab Create.


  • From Strata + Hadoop World, Prithvi Prabhu talks about using H2O.
  • Also from Strata, here is Cliff Click’s presentation on H2O, Spark and Python.
  • On the H2O blog, Arno Candel publishes a performance tuning guide for H2O Deep Learning.