Spark is the Future of Analytics

At the 2016 Spark Summit, Gartner Research Director Nick Heudecker asked: Is Spark the Future of Data Analysis?  It’s an interesting question, and it requires a little parsing. Nobody believes that Spark alone is the future of data analysis, not even its most ardent proponents. A better way to frame the question: Does Spark have a role in the future of analytics? What is that role?

Unfortunately, Heudecker didn’t address the question but spent the hour throwing shade at Spark.

“Spark is overhyped!” he declared. His evidence? This:

[Screenshot: Gartner’s Hype Cycle]

One might question an analysis that equates real things like optimization with fake things like “Citizen Data Science.” Gartner’s Hype Cycle by itself proves nothing; it’s a conceptual salad, with neither empirical foundation nor predictive power.

If you want to argue that Spark is overhyped, produce some false or misleading claims by project principals, or documented cases where the software failed to work as claimed. It’s possible that such cases exist. Personally, I don’t know of any, and neither does Nick Heudecker, or he would have included them in his presentation.

Instead, he cited a Gartner survey showing that organizations don’t use Spark and Flink as much as they use other tools for data analysis. From my notes, here are the percentages:

  • EDW: 57%
  • Cloud: 44%
  • Hadoop: 42%
  • Stat Packages: 32%
  • Spark or Flink: 9%
  • Graph Databases: 8%

That 42% figure for Hadoop is interesting. In 2015, Gartner concern-trolled the tech community, trumpeting the finding that “only” 26% of respondents in a survey said they were “deploying, piloting or experimenting with Hadoop.” So — either Hadoop adoption grew from 26% to 42% in a year, or Gartner doesn’t know how to do surveys.

In any event, it’s irrelevant; statistical packages have been available for 40 years, EDWs for 25, Spark for 3. The current rate of adoption for a project in its youth tells you very little about its future. It’s like arguing that a toddler is cognitively challenged because she can’t do integral calculus without checking the Wolfram app on her iPad.

Heudecker closed his presentation with the pronouncement that he had no idea whether or not Spark is the future of data analysis, and bolted the venue faster than a jackrabbit on Ecstasy. Which raises the question: why pay big bucks for analysts who have no opinion about one of the most active projects in the Big Data ecosystem?

Here are eight reasons why Spark has a central role in the future of analytics.

(1) Nearly everyone who uses Hadoop will use Spark.

If you believe that 42% of enterprises use Hadoop, you must believe that 41.9% will use Spark. Every Hadoop distribution includes Spark. Hive and Pig run on Spark. Hadoop early adopters will gradually replace existing MapReduce applications and build most new applications in Spark. Late adopters may never use MapReduce.

The only holdouts for MapReduce will be those who want their analysis the way they want their barbecue: low and slow.

Of course, Hadoop adoption isn’t static. Forrester’s Mike Gualtieri argues that 100% of enterprises will use Hadoop within a few years.

(2) Lots of people who don’t use Hadoop will use Spark.

For Hadoop users, Spark is a fast replacement for MapReduce. But that’s not all it is. Spark is also a general-purpose data processing environment for advanced analytics. Hadoop has baggage that data science teams don’t need, so it’s no surprise to see that most Spark users aren’t using it with Hadoop. One of the key advantages of Spark is that users aren’t tied to a particular storage back end, but can choose from many different options. That’s essential in real-world data science.
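
To make that concrete, here is a minimal PySpark sketch of one analysis reading from three different back ends through the same DataFrame API. The paths, connection details, and column names are hypothetical placeholders.

```python
# Minimal sketch: one DataFrame API, many storage back ends.
# All paths, hosts, and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic").getOrCreate()

events = spark.read.parquet("s3a://my-bucket/events/")        # object store
raw = spark.read.csv("hdfs:///warehouse/raw/", header=True)   # HDFS files
orders = (spark.read.format("jdbc")                           # relational DB
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "public.orders")
          .load())

# Downstream logic is identical no matter where the data lives.
events.join(orders, "customer_id").groupBy("region").count().show()
```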

(3) For scalable open source data science, Spark is the only game in town.

If you want to argue that Spark has no future, you’re going to have to name an alternative. I’ll give you a minute to think of something.

Time’s up.

You could try to approximate Spark’s capabilities with a collection of other projects: for example, you could use Presto for SQL, H2O for machine learning, Storm for streaming, and Giraph for graph analysis. Good luck pulling those together. H2O.ai was one of the first vendors to build an interface to Spark because even if you want to use H2O for machine learning, you’re still going to use Spark for data wrangling.
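
Here is a hedged sketch of that division of labor (Spark for wrangling, H2O for learning) using the Sparkling Water bindings; exact signatures vary by version, and the data path and column names are hypothetical.

```python
# Sketch of the wrangle-in-Spark, learn-in-H2O pattern via Sparkling Water.
# API follows pysparkling/h2o conventions; details vary by version.
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from h2o.estimators.gbm import H2OGradientBoostingEstimator

spark = SparkSession.builder.appName("sparkling").getOrCreate()
hc = H2OContext.getOrCreate(spark)

# Wrangle with Spark: read, filter, and clean enterprise data...
claims = spark.read.parquet("hdfs:///warehouse/claims/")  # hypothetical path
train = claims.filter(claims.year == 2016).na.drop()

# ...then hand the result to H2O for model training.
frame = hc.as_h2o_frame(train)
gbm = H2OGradientBoostingEstimator(ntrees=100)
gbm.train(x=[c for c in frame.columns if c != "label"], y="label",
          training_frame=frame)
```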

“What about Flink?” you ask. Well, what about it? Flink may have a future, too, if anyone ever supports it other than ten guys in a loft on the Tempelhofer Ufer. Flink’s event-based runtime seems well-suited for “pure” streaming applications, but that’s low-value bottom-of-the-stack stuff. Flink’s ML library is still pretty limited, and improving it doesn’t appear to be a high priority for the Flink team.

(4) Data scientists who work exclusively with “small data” still need Spark.

Data scientists satisfy most business requests for insight with small datasets that fit into memory on a single machine. Even if you measure your largest dataset in gigabytes, you still need Spark for two things: to create your analysis dataset and to parallelize operations.

Your analysis dataset may be small, but it comes from a larger pool of enterprise data. Unless you have servants to pull data for you, at some point you’re going to have to get your hands dirty and deal with data at enterprise scale. If you are lucky, your organization has nice clean data in a well-organized data warehouse that has everything anyone will ever need in a single source of truth.

Ha ha! Just kidding. Single sources of truth don’t exist, except in the wildest fantasies of data warehouse vendors. In reality, you’re going to muck around with many different sources and integrate your analysis data on the fly. Spark excels at that.

For best results, machine learning projects require hundreds of experiments to identify the best algorithm and optimal parameters. If you run those tests serially, it will take forever; distribute them across a Spark cluster, and you can radically reduce the time needed to find that optimal model.
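
A sketch of that pattern: broadcast the small dataset to every worker, then let Spark farm out the experiments. It assumes scikit-learn is installed on the cluster; the grid and algorithm are illustrative.

```python
# Distribute hundreds of small-data experiments across a Spark cluster.
# Assumes scikit-learn on the workers; the grid is illustrative.
from itertools import product
from pyspark.sql import SparkSession
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.appName("experiments").getOrCreate()
sc = spark.sparkContext

X, y = load_breast_cancer(return_X_y=True)
data = sc.broadcast((X, y))  # small dataset: ship a copy to every worker

grid = [{"max_depth": d, "n_estimators": n, "learning_rate": lr}
        for d, n, lr in product([2, 3, 5], [100, 300, 500], [0.01, 0.1])]

def run_experiment(params):
    X, y = data.value
    model = GradientBoostingClassifier(**params)
    return params, cross_val_score(model, X, y, cv=5).mean()

results = sc.parallelize(grid, len(grid)).map(run_experiment).collect()
best_params, best_score = max(results, key=lambda r: r[1])
```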

(5) The Spark team isn’t resting on its laurels.

Over time, Spark has evolved from a research project for scalable machine learning to a general-purpose data processing framework. Driven by user feedback, Spark has added SQL and streaming capabilities, introduced Python and R APIs, re-engineered the machine learning libraries, and delivered many other enhancements.

Here are some projects under way to improve Spark:

— Project Tungsten, an ongoing effort to optimize CPU and memory utilization.

— A stable serialization format (possibly Apache Arrow) for external code integration.

— Integration with deep learning frameworks, including TensorFlow and Intel’s new BigDL library.

— A cost-based optimizer for Spark SQL.

— Improved interfaces to data sources.

— Continuing improvements to the Python and R APIs.

Performance improvement is an ongoing mission; for selected operations, Spark 2.0 runs 10X faster than Spark 1.6.
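
You can see one of those optimizations for yourself: in Spark 2.0, operators fused by Tungsten’s whole-stage code generation show up with a star prefix in the physical plan.

```python
# Whole-stage code generation (Tungsten) is visible in the query plan:
# in Spark 2.0, fused operators are marked with '*' in explain() output.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()

df = spark.range(100 * 1000 * 1000)  # 100 million rows, generated on the fly
agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()
agg.explain()  # physical plan shows WholeStageCodegen-fused stages
```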

(6) More cool stuff is on the way.

Berkeley’s AMPLab, the source of Spark, Mesos, and Tachyon/Alluxio, is now RISELab. There are four projects under way at RISELab that will extend Spark capabilities:

Clipper is a prediction serving system that brokers between machine learning frameworks and end-user applications. The first Alpha release, planned for mid-April 2017, will serve scikit-learn, Spark ML and Spark MLLib models, and arbitrary Python functions.

Drizzle, an execution engine for Apache Spark, uses group scheduling to reduce latency in streaming and iterative operations. Lead developer Shivaram Venkataraman has filed a design document to implement this approach in Spark.

Opaque is a package for Spark SQL that uses Intel SGX trusted hardware to deliver strong security for DataFrames. The project seeks to enable analytics on sensitive data in an untrusted cloud, with data encryption and access pattern hiding.

Ray is a distributed execution engine designed for new classes of workloads, such as reinforcement learning.

Three Apache projects in the Incubator build on Spark:

— Apache Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run on Hive, Pig or Spark SQL with MapReduce, Tez or Spark.

— Apache PredictionIO is a machine learning server built on top of an open source stack, including Spark, HBase, Spray, and Elasticsearch.

— Apache SystemML is a library of machine learning algorithms that run on Spark and MapReduce, originally developed by IBM Research.

MIT’s CSAIL lab is working on ModelDB, a system to manage machine learning models. ModelDB extracts and stores model artifacts and metadata, and makes this data available for easy querying and visualization. The current release supports Spark ML and scikit-learn.

(7) Commercial vendors are building on top of Spark.

The future of analytics is a hybrid stack, with open source at the bottom and commercial software for business users at the top. Here is a small sample of vendors who are building easy-to-use interfaces atop Spark.

Alpine Data provides a collaboration environment for data science and machine learning that runs on Spark (and other platforms).

AtScale, an OLAP on Big Data solution, leverages Spark SQL and other SQL engines, including Hive, Impala, and Presto.

Dataiku markets Data Science Studio, a drag-and-drop data science workflow tool with connectors for many different storage platforms, scikit-learn, Spark ML and XGboost.

StreamAnalytix, a drag-and-drop platform for real-time analytics, supports Spark SQL and Spark Streaming, Apache Storm, and many different data sources and sinks.

Zoomdata, an early adopter of Spark, offers an agile visualization tool that works with Spark Streaming and many other platforms.

All of the leading agile BI tools, including Tableau, Qlik, and PowerBI, support Spark. Even stodgy old Oracle’s Big Data Discovery tool runs on Spark in Oracle Cloud.

(8) All of the leading commercial advanced analytics platforms use Spark.

All of them, including SAS, a company that embraces open source the way Sylvester the Cat embraces a skunk. SAS supports Spark in SAS Data Loader for Hadoop, one of SAS’ five different Hadoop architectures. (If you don’t like SAS architecture, wait six months for another.)

[Figure: Magic Quadrant for Advanced Analytics Platforms, 2016]

— IBM embraces Spark like Romeo embraced Juliet, hopefully with a better ending. IBM contributes heavily to the Spark project and has rebuilt many of its software products and cloud services to use Spark.

— KNIME’s Spark Executor enables users of the KNIME Analytics Platform to create and execute Spark applications. Through a combination of visual programming and scripting, users can leverage Spark to access data sources, blend data, train predictive models, score new data, and embed Spark applications in a KNIME workflow.

— RapidMiner’s Radoop module supports visual programming across SparkR, PySpark, Pig, and HiveQL, and machine learning with SparkML and H2O.

— Statistica, which is no longer part of Dell, offers Spark integration in its Expert and Enterprise editions.

— Microsoft supports Spark in Azure HDInsight, and it has rebuilt Microsoft R Server’s Hadoop integration to leverage Spark as well as MapReduce. VentureBeat reports that Databricks will offer its managed service for Spark on Microsoft Azure later this year.

— SAP, another early adopter of Spark, offers Vora, a Spark-based query engine that integrates with SAP HANA.

You get the idea. Spark is deeply embedded in the ecosystem, and it’s foolish to argue that it doesn’t play a central role in the future of analytics.

The Year in Machine Learning (Part Four)

This is the fourth installment in a four-part review of 2016 in machine learning and deep learning.

— Part One covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms.

— Part Two surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

— Part Three reviewed the machine learning and deep learning initiatives of Big Tech Brands, industry leaders with significant budgets for software development and marketing.

In Part Four, I profile eleven startups in the machine learning and deep learning space. A search for “machine learning” in Crunchbase yields 2,264 companies. This includes companies, such as MemSQL, who offer absolutely no machine learning capability but hype it anyway because Marketing; it also includes application software and service providers, such as Zebra Medical Imaging, who build machine learning into the services they provide.

All of the companies profiled in this post provide machine learning tools as software or services for data scientists or for business users. Within that broad definition, the firms are highly diverse:

Continuum Analytics, Databricks, and H2O.ai drive open source projects (Anaconda, Apache Spark, and H2O, respectively) and deliver commercial support.

Alpine Data, Dataiku, and Domino Data Lab offer commercially licensed collaboration tools for data science teams. All three run on top of an open source platform.

KNIME and RapidMiner originated in Europe, where they have large user communities. Both combine a business user interface with the ability to work with Big Data platforms.

Fuzzy Logix and Skytree provide specialized capabilities primarily for data scientists.

DataRobot delivers a fully automated workflow for predictive analytics that appeals to data scientists and business users. It runs on an open source platform.

Four companies deserve an “honorable mention” but I haven’t profiled them in depth:

— Two startups, BigML and SkyMind, are still in seed funding stage. I don’t profile them below, but they are worth watching. BigML is a cloud-based machine learning service; SkyMind drives the DL4J open source project for deep learning.

— Two additional companies aren’t startups because they’ve been in business for more than thirty years. Salford Systems developed the original software for CART and Random Forests; the company has added more techniques to its suite over time and has a loyal following. Statistica, recently jettisoned by Dell, delivers a statistical package with broad capabilities; the company consistently performs well in user satisfaction surveys.

I’d like to take a moment to thank those who contributed tips and ideas for this series, including Sri Ambati, Betty Candel, Leslie Miller, Bob Muenchen, Thomas Ott, Peter Prettenhofer, Jesus Puente, Dan Putler, David Smith, and Oliver Vagner.

Alpine Data

In 2016, the company formerly known as Alpine Data Labs changed its name and CEO. Alpine dropped the “Labs” from its brand — I guess they didn’t want to be confused with companies that test stool samples — so now it’s just Alpine Data. And, ex-CEO Joe Otto is now an “Advisor,” replaced by Dan Udoutch, a “seasoned executive” with 30+ years of experience in business and zero years of experience in machine learning or advanced analytics. The company also dropped its CFO and head of Sales during the year, presumably because the investors were extremely happy with Alpine’s business results.

Alpine’s software was originally built to run in the Greenplum database; the company ported some of its algorithms to MapReduce in early 2013. Riding a wave of Hadoop buzz, Alpine closed on a venture round in November 2013, just in time for everyone to realize that MapReduce sucks for machine learning. The company quickly turned to Spark — Databricks certified Alpine on Spark in 2014 — and has gradually ported its analytics operators to the new framework.


It seems that rebuilding on Spark has been a bit of a slog because Alpine hasn’t raised a fresh round of capital since 2013. As a general rule, startups that make their numbers get fresh rounds every 12-24 months; companies that don’t get fresh funding likely aren’t making their numbers. Investors aren’t stupid and, like the dog that did not bark, a venture capital round that does not happen says a lot about a company’s prospects.

In product news, the company announced Chorus 6, a major release, in May, and Chorus 6.1 in September. Enhancements in the new releases include:

— Integration with Jupyter notebooks.

— Additional machine learning operators.

— Spark auto-tuning. Chorus pushes processing to Spark, and Alpine has developed an optimizer to tune the generated Spark code.

— PFA support for model export. This is an excellent, cutting-edge feature.

— Runtime performance improvements.

— Tweaks to the user experience.

Lawrence Spracklen, Alpine’s VP of Engineering, will speak about Spark auto-tuning at the Spark Summit East in Boston.

Prospective users and customers should look for evidence that Alpine is a viable company, such as a new funding round, or audited financials that show positive cash flow.

Continuum Analytics

Continuum Analytics develops and supports Anaconda, an open source Python distribution for data science. The core Anaconda bundle includes Navigator, a desktop GUI that manages applications, packages, environments and channels; 150 Python packages that are widely used in data science; and performance optimizations. Continuum also offers commercially licensed extensions to Anaconda for scalability, high performance and ease of use.


Anaconda 2.5, announced in February, introduced performance optimization with the Intel® Math Kernel Library. Beginning with this release, Continuum bundled Anaconda with Microsoft R Open, an enhanced free R distribution.

In 2016, Continuum introduced two major additions to the Anaconda platform:

— Anaconda Enterprise Notebooks, an enhanced version of Jupyter notebooks.

— Anaconda Mosaic, a tool for cataloging heterogeneous data.

The company also announced partnerships with Cloudera, Intel, and IBM. In September, Continuum disclosed $4 million in equity financing. The company was surprisingly quiet about the round — there was no press release — possibly because it was undersubscribed.

Continuum’s AnacondaCon 2017 conference meets in Austin February 7-9.

Databricks

Databricks leads the development of Apache Spark (profiled in Part Two of this review) and offers a cloud-based managed service built on Spark. The company also offers training and certification, and organizes the Spark Summits.

The team that originally developed Spark founded Databricks in 2013. Company employees continue to play a key role in Apache Spark, holding a plurality of the seats on the Project Management Committee and contributing more new code to the project than any other company.


In 2016, Databricks added a dashboarding tool and a RESTful interface for job and cluster management to its core managed service. The company made major enhancements to the Databricks security framework, completed SOC 2 Type 1 certification for enterprise security, announced HIPAA compliance and availability in Amazon Web Services’ GovCloud for sensitive data and regulated workloads.

Databricks also launched a free Community edition and a five-part series of free MOOCs, completed its annual survey of the Spark user community, and organized three Spark Summits.

In December, Databricks announced a $60 million “C” round of venture capital. New Enterprise Associates led the round; Andreessen Horowitz participated.

Dataiku

Dataiku develops and markets Data Science Studio (DSS), a workflow and collaboration environment for machine learning and advanced analytics. Users interact with the software through a drag-and-drop interface; DSS pushes processing down to Hadoop and Spark. The product includes connectors to a wide variety of file systems, SQL platforms, cloud data stores and NoSQL databases.


In 2016, Dataiku delivered Releases 3.0 and 3.1. Major new capabilities include H2O integration (through Sparkling Water); additional data sources (IBM Netezza, SAP HANA, Google BigQuery, and Microsoft Azure Data Warehouse); added support for Spark MLLib algorithms; performance improvements, and many other enhancements.

In October, Dataiku closed on a $14 million “A” round of venture capital. FirstMark Capital led the financing, with participation from Serena Capital.

DataRobot

DataRobot, a Boston-based startup founded by insurance industry veterans, offers an automated machine learning platform that combines built-in expertise with a test-and-learn approach.  Leveraging an open source back end, the company’s eponymous software searches through combinations of algorithms, pre-processing steps, features, transformations and tuning parameters to identify the best model for a particular problem.
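
As a toy illustration of the general technique (and only the general technique; this is not DataRobot’s code), a model search can enumerate preprocessing and algorithm combinations and rank them by cross-validated score:

```python
# Toy model search: enumerate pipeline candidates, rank by CV score.
# Illustrates the general approach only; DataRobot's search is far richer.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
candidates = {
    "scaled logistic regression": make_pipeline(StandardScaler(),
                                                LogisticRegression()),
    "random forest": RandomForestClassifier(n_estimators=200),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
print(max(scores, key=scores.get))  # report the winning candidate
```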


The company has a team of Kaggle-winning data scientists and leverages this expertise to identify new machine learning algorithms, feature engineering techniques, and optimization methods. In 2016, DataRobot added several new capabilities to its product, including support for Hadoop deployment, deep learning with TensorFlow, reason codes that explain predictions, feature impact analysis, and additional capabilities for model deployment.

DataRobot also announced major alliances with Alteryx and Cloudera. Cloudera awarded the company its top-level certification: the software integrates with Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels.

Earlier in the year, DataRobot closed on $33 million in Series B financing. New Enterprise Associates led the round; Accomplice, Intel Capital, IA Ventures, Recruit Strategic Partners, and New York Life also participated.

Domino Data Lab

Domino Data Lab offers the Domino Data Science Platform (DDSP), a scalable collaboration environment that runs on-premises, in virtual private clouds, or hosted on Domino’s AWS infrastructure.


DDSP provides data scientists with a shared environment for managing projects, scalable computing with a variety of open source and commercially licensed software, job scheduling and tracking, and publication through Shiny and Flask. Domino supports rollbacks, revision history, version control, and reproducibility.

In November, Domino announced that it closed a $10.5 million “A” round led by Sequoia Capital. Bloomberg Beta, In-Q-Tel, and Zetta Venture Partners also participated.

Fuzzy Logix

Fuzzy Logix markets DB Lytix, a library of more than eight hundred functions for machine learning and advanced analytics.  Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database) and in Hadoop through Hive.

Users invoke DB Lytix functions from SQL or R, through BI tools, or from custom web interfaces.  Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis.  All functions support native in-database scoring.  The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.

In April, the company announced the availability of DB Lytix on Teradata Aster Analytics, a development that excited all three of the people who think Aster has legs.

H2O.ai

H2O.ai develops and supports H2O, the open source machine learning project I profiled in Part Two of this review. As I noted in Part Two, H2O.ai updated Sparkling Water, its Spark integration for Spark 2.0; released Steam, a model deployment framework, to production, and previewed Deep Water, an interface to GPU-accelerated back ends for deep learning.


In 2016, H2O.ai added 3,200 enterprise organizations and over 43,000 users to its roster, bringing its open source community to over 8,000 enterprises and nearly 70,000 users worldwide. In the annual KDnuggets poll of data scientists, reported usage tripled. New customers include Kaiser Permanente, Progressive, Comcast, HCA, McKesson, Macy’s, and eBay.

KNIME

KNIME.com AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a GPL license with an exception permitting third parties to use the API for proprietary extensions. The KNIME Analytics Platform features a graphical user interface with a workflow metaphor.  Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch.


KNIME offers commercially licensed extensions for scalability, integration with data platforms, collaboration, and productivity. The company provides technical support for the extension software.

During the year, KNIME delivered two dot releases and three maintenance releases. The new features added to the open source edition in Releases 3.2 and 3.3 include Workflow Coach, a recommender based on community usage statistics; streaming execution; feature selection; ensembles of trees and gradient boosted trees; deep learning with DL4J, and many other enhancements. In June, KNIME launched the KNIME Cloud Analytics Platform on Microsoft Azure.

KNIME held its first Summit in the United States in September and announced an online training course offered through O’Reilly Media.

RapidMiner

RapidMiner, Inc. of Cambridge, Massachusetts, develops and supports RapidMiner, an easy-to-use package for business analysis, predictive analytics, and optimization. The company launched in 2006 (under the corporate name of Rapid-I) to drive development, support, and distribution for the RapidMiner software project. The company moved its headquarters to the United States in 2013.


The desktop version of the software, branded as RapidMiner Studio, is available in free and commercially licensed editions.  RapidMiner also offers a commercially licensed Server edition, and Radoop, an extension that pushes processing down to Hive, Pig, Spark, and H2O.

RapidMiner introduced Release 7.x in 2016 with an updated user interface. Other enhancements in Releases 7.0 through 7.3 include a new data import facility, Tableau integration, parallel cross-validation, and H2O integration (featuring deep learning, gradient boosted trees and generalized linear models).

The company also introduced a feature called Single Process Pushdown. This capability enables RapidMiner users to supplement native Spark and H2O algorithms with RapidMiner pipelines for execution in Hadoop. RapidMiner supports Spark 2.0 as of Release 7.3.

In January 2016, RapidMiner closed a $16 million equity round led by Nokia Growth Partners. Ascent Venture Partners, Earlybird Venture Capital, Longworth Venture Partners, and OpenOcean also participated.

Skytree

Skytree Inc. develops and markets an eponymous commercially licensed software package for machine learning. Its founders launched the venture in 2012 to monetize an academic machine learning project (Georgia Tech’s FastLab).


The company landed an $18 million venture capital round in 2013 and hasn’t secured any new funding since then. (Read my comments under Alpine Data to see what that indicates.) Moreover, the underlying set of algorithms does not seem to have changed much since then, though Skytree has added and dropped several different add-ons and wrappers.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs or a browser-based GUI. Output includes explanations of the model in plain English. Skytree has a grid search feature for parameterization, which it trademarks as AutoModel, labels as “ground-breaking” and is attempting to patent. Analysts who don’t know anything about grid search think this is amazing.
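
For the record, here’s how “ground-breaking” grid search is in open source: a few lines of scikit-learn.

```python
# Grid search over hyperparameters: a long-standing scikit-learn staple.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
params = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(), params, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```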

In 2016, Skytree introduced a freemium edition, branded as Skytree Express. Hold out another six months and they’ll pay you to try it.

As is the case with Alpine Data, if you like Skytree’s technology wait for another funding round, or ask the company to provide evidence of positive cash flow.

Gartner’s 2016 MQ for Advanced Analytics Platforms

This is a revised and expanded version of a story that first appeared in the weekly roundup for February 15.

Gartner publishes its 2016 Magic Quadrant for Advanced Analytics Platforms.  You can get a free copy here from RapidMiner (registration required).  The report is a muddle that mixes up products in different categories that don’t compete with one another, includes marginal players, excludes important startups and ignores open source analytics.

Other than that, it’s a fine report.

The advanced analytics category is much more complex than it used to be.  In the contemporary marketplace, there are at least six different categories of software for advanced analytics that are widely used in enterprises:

  • Analytic Programming Languages (e.g. R, SAS Programming Language)
  • Analytic Productivity Tools (e.g. RStudio, SAS Enterprise Guide)
  • Analytic Workbenches (e.g. Alteryx, IBM Watson Analytics, SAS JMP)
  • Expert Workbenches (e.g. IBM SPSS Modeler, SAS Enterprise Miner)
  • In-Database Machine Learning Engines (e.g. DBLytix, Oracle Data Mining)
  • Distributed Machine Learning Engines (e.g. Apache Spark MLlib, H2O)

Gartner appears to have a narrow notion of what an advanced analytics platform should be, and it ignores widely used software that does not fit that mold.  Among those evaluated by Gartner but excluded from the analysis: BigML, Business-Insight, Dataiku, Dato, H2O.ai, MathWorks, Oracle, Rapid Insight, Salford Systems, Skytree and TIBCO.

Gartner also ignores open source analytics, including only those vendors with at least $4 million in annual software license revenue.  That criterion excludes vendors with a commercial open source business model, like H2O.ai.  Gartner uses a similar criterion to exclude Hortonworks from its MQ for data warehousing, while including Cloudera and MapR.

Changes from last year’s report are relatively small.  Some detailed comments:

— Accenture makes the analysis this year, according to Gartner, because it acquired i4C Analytics, a tiny privately held company based in Milan, Italy.  Accenture rebranded the software assets as the Accenture Analytics Applications Platform, which Accenture positions as a platform for custom solutions.  This is not at all surprising, since Accenture is a consulting firm and not a software vendor, but it’s interesting to note that Accenture reports no revenue at all from software licensing; hence, it can’t possibly satisfy Gartner’s inclusion criteria for the MQ.  The distinction between software and services is increasingly muddy, but if Gartner includes one services provider on the analytics MQ it should include them all.

— Alpine Data Labs declines a lot in “Ability to Execute,” which makes sense since the company appears to be running out of money (*).  Gartner characterizes Alpine as “running analytic workflows natively within Hadoop”, which is only partly true.  Alpine was originally developed to run on MPP databases with table functions (such as Greenplum and Netezza), and has ported some of its functions to Hadoop.  The company has a history with Greenplum/Pivotal and EMC/Dell, and most existing customers use the product with Greenplum Database, Pivotal Hadoop, Hawq and MADlib, which is great if you use all of those but otherwise not.  Gartner rightly notes that “the depth of choice of algorithms may be limited for some users,” which is spot on; “some users” here means anyone not using Alpine with Hawq and MADlib.

(*) Of course, things aren’t always what they appear to be.  Joe Otto, Alpine CEO, contacted me to say that Alpine has a year’s worth of expenses in the bank, and hasn’t done any new venture rounds since 2013 “because they haven’t needed to do so.”  Joe had no explanation for Alpine’s significantly lower rating on both dimensions in Gartner’s MQ, attributing the change to “bias”.  He’s right in pointing out that Gartner’s analysis defies logic.

— Alteryx declines a little, which is surprising since its new release is strong and the company just scored a pile of venture cash.  Gartner notes that Alteryx’ scores are up for customer satisfaction and delivering business value, which suggests that whoever it is at Gartner that decides where to position the dots on the MQ does not read the survey results.  Gartner dings Alteryx for not having native visualization capabilities like Tableau, Qlik or PowerBI, a ridiculous observation when you consider that not one of the other vendors covered in this report offers visualization capabilities like Tableau, Qlik or PowerBI.

— Angoss improves a lot, moving from Niche to Challenger, largely on the basis of its WPL-based SAS integration and better customer satisfaction.  Data prep was a gap for Angoss, so the WPL partnership is a positive move.

— Dell: Arguing that Dell has “executed on an ambitious roadmap during the past year”, Gartner moves Dell into the Leaders quadrant.   That “execution” is largely invisible to everyone else, as the product seems to have changed little since Dell acquired Statistica, and I don’t think too many people are excited that the product interfaces with Boomi.  Customer satisfaction has declined and pricing is a mess, but Gartner is all giggly about Boomi, Kitenga and Toad.  Gartner rightly cautions that software isn’t one of Dell’s core strengths, and the recent EMC acquisition “raises questions” about the future of software at Dell.  Which raises questions about why Gartner thinks Dell qualifies as a Leader in the category.

— FICO fades for no apparent reason.  I’m guessing they didn’t renew their subscription.

— IBM stays at about the same position in the MQ.  Gartner rightly notes the “market confusion” about IBM’s analytics products, and dismisses yikyak about cognitive computing.  Recently, I spent 30 minutes with one of the 443 IBM vice presidents responsible for analytics — supposedly, he’s in charge of “all analytics” at IBM — and I’m still as confused as Gartner, and the market.

— KNIME was a Leader last year and remains a Leader, moving up a little.  Gartner notes that many customers choose KNIME for its cost-benefit ratio, which is unsurprising since the software is free.  Once again, Gartner complains that KNIME isn’t as good as Tableau and Qlik for visualization.

— Lavastorm makes it to the MQ this year, for some reason.  Lavastorm is an ETL and data blending tool that does not claim to offer the native predictive analytics that Gartner says are necessary for inclusion in the MQ.

— Megaputer, a text mining vendor, makes it to the MQ for the second year running despite being so marginal that they lack a record in Crunchbase.  Gartner notes that “Megaputer scores low on viability and visibility and there is a lack of awareness of the company outside of text analytics in the advanced analytics market.”  Just going out on a limb, here, Mr. Gartner, but maybe that’s your cue to drop them from the MQ, or cover them under text mining.

— Microsoft gets Gartner’s highest scores on Completeness of Vision on the strength of Azure Machine Learning (AML) and Cortana Analytics Suite.  Some customers aren’t thrilled that AML is only available in the cloud, presumably because they want hackers to steal their data from an on-premises system, where most data breaches happen.  Microsoft’s hybrid on-premises cloud should render those arguments moot.  Existing customers who use SQL Server Analytic Services are less than thrilled with that product.

— Predixion Software improves on “Completeness of Vision” because it can “deploy anywhere” according to Gartner.  Wut?  Anywhere you can run Windows.

— Prognoz returns to the MQ for another year and, like Megaputer, continues to inspire WTF? reactions from folks familiar with this category.  Primarily a BI tool with some time-series and analytics functionality included, Prognoz appears to lack the native predictive analytics capabilities that Gartner says are minimally required.

— RapidMiner moves up on both dimensions.  Gartner recognizes the company’s “Wisdom of Crowds” feature and the recent Series C funding, but neglects to note RapidMiner’s excellent Hadoop and Spark integration.

— SAP stays at pretty much the same place in the MQ.  Gartner notes that SAP has the lowest scores in customer satisfaction, analytic support and sales relationship, which is about what you would expect when an ankle-biter like KXEN gets swallowed by a behemoth like SAP, where analytics go to die.

— SAS declines slightly in Ability to Execute.  Gartner notes that SAS’ licensing model, high costs and lack of transparency are a concern.  Gartner also notes that while SAS has a loyal customer base whose members refer to it as the “gold standard” in advanced analytics, SAS also has the highest percentage of customers who have experienced challenges or issues with the software.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but SAS will be forced to support it due to partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; the benefits of “pure” streaming versus micro-batching may be theoretical for most applications, but the difference shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and Zoomdata, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad hoc; the most important questions are answered only once.  That makes workloads for advanced analytics volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, enterprises will be able to buy software that delivers expert-level predictive models of the kind that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the Titanic going down.]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software is nothing special; there are plenty of open source alternatives, like Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like rearranging deck chairs on the Titanic.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

2015 in Big Analytics

Looking back at 2015, a few stories stand out:

  • Steady progress for Spark, punctuated by two big announcements.
  • Solid growth in cloud-based machine learning, led by Microsoft.
  • Expanding options for SQL and OLAP on Hadoop.

In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April.  I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.

IBM Embraces Spark

IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit.  IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base.  It also brings a clutter of aging and partially integrated products, an army of suits and no fewer than 164 Vice Presidents whose titles include the words “Big Data.”

When IBM announced its Spark initiative I joked that somewhere in the bowels of IBM, someone will want to put Spark on a mainframe.  Color me prophetic.

It’s too early to tell what substantive contributions IBM will make to Spark.  Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September.  This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.

All that said, IBM brings respectability, and the assurance that Spark is ready for prime time.  This is priceless.  Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.

Cloudera Announces “One Platform” Initiative

In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated.  As with the IBM announcement, the symbolism matters.  Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true.  It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.

The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop.  According to Databricks’ recent Spark user survey, only 40% use Spark under YARN, with the rest running in a freestanding cluster or on Mesos.  It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish.  But if Cloudera thinks “One Platform” will stem that tide, it is mistaken.  It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.

Microsoft Builds Credibility in Analytics

In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics.  The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets.  Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.

Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML).   First released in beta in June 2014, AML is both easy to use and powerful.  The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts.  Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.

Azure Machine Learning contrasts markedly with Amazon Machine Learning.  Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love.  Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon.  If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.

SQL Engines Proliferate

At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL.  Spark SQL was still in Alpha; Drill was a curiosity; and Presto was something used at Facebook.

Several things happened during the year:

  • Hive on Tez established rough performance parity with the fast SQL engines.
  • Spark SQL went to general release, stabilized, and rolled out the DataFrames API.
  • MapR promoted Drill, and invested in improvements to the software.  Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
  • Cloudera donated Impala to open source, and Pivotal donated Hawq.
  • Teradata placed its chips on Presto.

While it’s great to see so many options emerge, Hive continues to win actual evaluations.  Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance.  Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.

The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.

OLAP on Hadoop Gets Real

For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options.  The open source option, Apache Kylin, originally an eBay project, just recently graduated to Apache top level status.  Adoption is limited at present, but any project used by eBay and Baidu is worth a look.

The commercial option is AtScale, a company that emerged from stealth in April.  Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools.  It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.

Funding for Machine Learning

H2O.ai’s recently announced B round is significant for a couple of reasons.  First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.

Like Sherlock Holmes’ dog that did not bark, two companies are significant because they did not procure funding in 2015:

  • Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times.  It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
  • Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem.  Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.

Palantir continued to suck up capital like a whale feeding on krill.

Google TensorFlow

Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.

Big Analytics Roundup (November 16, 2015)

Just three main stories this week: possible trouble for a pair of analytic startups; Google releases TensorFlow to open source; and H2O delivers new capabilities at its annual meeting.

In other news, the Spark team announces Release 1.5.2, a maintenance release; and the Mahout guy announces Release 0.11.1, with bug fixes and performance improvements. (h/t Hadoop Weekly)

Two items of note from the Databricks blog:

— Darin McBeath describes Elsevier’s Spark use case and introduces spark-xml-utils, a Spark package contributed by his team.  The package enables the Spark user to filter documents based on an XPath expression, return specific nodes for an XPath/XQuery expression, and transform documents using an XSLT stylesheet.

— Rachit Agarwal and Anurag Khandelwal of Berkeley’s AMPLab introduce Succinct, a distributed datastore for queries on compressed data.  They announce the release of Succinct Spark, a Spark package that enables search, count, range and random access queries on compressed RDDs.  The authors claim a 75X performance advantage over native Spark when using Succinct as a document store.

Three interesting stories on streaming data:

  • In a podcast, Data Artisans CTO Stephan Ewen discusses Flink, Spark and the Kappa architecture.
  • Techalpine’s Kaushik Pal compares Spark and Flink for streaming data.
  • Will McGinnis helps you get started with Python and Flink.

(1) Analytic Startups in Trouble

In The Information, Steve Nellis and Peter Schulz explain why startups return to the funding well frequently — and why those that don’t may be in trouble.  Venture funding isn’t a perfect indicator of success, but is often the only indicator available.  On the list: Skytree Software and Alpine Data Labs.

(2) Google Releases TensorFlow for Machine Learning

On the Google Research blog, Google announces open source availability of TensorFlow.  TensorFlow is Google’s second generation machine learning system; it supports Deep Learning as well as any computation that can be expressed as a flow graph.   Read this white paper for details of the system.  At present, there are Python and C++ APIs;  Google notes that the C++ API may offer some performance advantages.
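
For flavor, here is a minimal flow graph in the initial Python API: nodes are operations, edges are tensors, and a Session executes the graph. A sketch using the API as released; details have likely changed since.

```python
# Minimal TensorFlow flow graph, circa the initial open source release.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])  # input tensor (edge)
W = tf.Variable(tf.zeros([3, 1]))                # learnable weights
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b                          # affine op (graph node)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())      # original initializer API
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```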

Video intro here.

In Wired, Cade Metz reports; Erik T. Mueller dismisses; and Metz returns to note that Deep Learning can leverage GPUs, and that AI’s future is in data, as if we didn’t know these things already.

On Slate, Will Oremus feels the buzz.

On his eponymous blog, Sachin Joglekar explains how to do k-means clustering with TensorFlow.

Separately, in VentureBeat, Jordan Novet rounds up open source frameworks for Deep Learning.

(3) H2O.ai Releases Steam

It’s not a metaphor.  At its second annual H2O World event, H2O.ai releases Steam, an open source data science hub that bundles model selection, model management and model scoring into a single container for elastic deployment.

On the H2O Blog, Yotam Levy wraps Day One, Day Two and Day Three of the H2O World event.  Speaker videos are here, slides here.  (Registration required.)  Some notable presentations:

— H2O: Tomas Nykodym on GLM; Mark Landry on GBM and Random Forests; Arno Candel on Deep Learning; Erin LeDell on Ensemble Modeling.

— Michal Malohlava of H2O and Richard Garris of Databricks explain how to run H2O on Databricks Cloud.  Separately, Michal demonstrates Sparkling Water, a Spark package that enables a Spark user to call H2O algorithms; Nidhi Mehta leads a hands-on with PySparkling Water; and Xavier Tordoir of Data Fellas exhibits Interactive Genomes Clustering with Sparkling Water on the Spark Notebook.  (A sketch of the basic Sparkling Water pattern follows this list.)

— Szilard Pafka of Epoch summarizes his work to date benchmarking R, Python, Vowpal Wabbit, H2O, xgboost and Spark MLLib.  As reported previously, Pafka’s benchmarks show that H2O and xgboost are the best performers; they are faster and deliver more accurate models.
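As noted above, here is a hedged sketch of the basic Sparkling Water pattern: move a Spark DataFrame into H2O, then train an H2O algorithm on it.  Method names follow the pysparkling package but have varied across releases, and the data file and columns are hypothetical, so treat this as an outline rather than working documentation.

```python
# Hedged outline of calling H2O from Spark via Sparkling Water (pysparkling).
# Method names varied across releases; data file and columns are hypothetical.
from pysparkling import H2OContext
from h2o.estimators.gbm import H2OGradientBoostingEstimator

hc = H2OContext.getOrCreate(spark)  # assumes an existing SparkSession `spark`

df = spark.read.csv("data.csv", header=True, inferSchema=True)
frame = hc.asH2OFrame(df)  # hand the Spark DataFrame to H2O

# Train a gradient boosting model on all columns but the last, predicting the last.
gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(x=frame.columns[:-1], y=frame.columns[-1], training_frame=frame)
print(gbm.model_performance())
```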

As reported in last week’s roundup, H2O.ai also announces a $20 million “B” round.

Big Analytics Roundup (November 9, 2015)

My roundup of the Spark Summit Europe is here.

Two important events this week:

  • H2O World starts today and runs through Wednesday at the Computer History Museum in Mountain View, CA.  Yotam Levy summarizes here and here.
  • Open Data Science Conference meets November 14-15 at the Marriott Waterfront in San Francisco.

Five backgrounders and explainers:

  • At HUG London, Apache’s Ufuk Celebi delivers a nice intro to Flink.
  • On the Databricks blog, Yesware’s Justin Mills explains how his team migrates Spark applications from concept through prototype through production.
  • On Slideshare, Alpine’s Holden Karau delivers an overview of Spark with Python.
  • Chloe Green wakes from a three-year slumber and discovers Spark.
  • On the Cloudera Engineering blog, Madhu Ganta explains how to build a CEP app with Spark and Drools.

Third quarter financials drive the news:

(1) MapR: We Grew 160% in Q3

MapR posts its biggest quarter ever.

(2) HDP: We Grew 168% in Q3

HDP loses $1.33 on every dollar sold, tries to make it up on volume.  Stock craters.

(3) Teradata: We Got A Box of Steak Knives in Q3

Teradata reports more disappointing sales as customers continue to defer investments in big box solutions for data warehousing.  This is getting to be a habit with Teradata; the company missed revenue projections for 2014 as well as the first and second quarters of this year.  Any company can run into headwinds, but a management team that consistently misses targets clearly does not understand its own business and needs to go.

Full report here.

(4) “B” Round for H2O.ai

Machine learning software developer H2O.ai announces a $20 million Series B round led by Paxion Capital Partners.  H2O.ai leads development of H2O, an open source project for distributed in-memory machine learning.  The company reports 25 new support customers this year.

(5) Fuzzy Logix Lands Funds

In-database analytics vendor Fuzzy Logix announces a $5 million “A” round from New Science Ventures.  Fuzzy offers a library of analytic functions that run in a number of high-performance databases and in HiveQL.

(6) New Optimization Package for Spark

On the Databricks blog, Aaron Staple announces availability of Spark TFOCS, an optimization package based on the Matlab package of the same name.  (TFOCS = Templates for First-Order Conic Solvers.)

(7) WSO2 Delivers IoT App on Spark 

IoT middleware vendor WSO2 announces Release 3.0 of its open source Data Analytics Server (DAS) platform.  DAS collects data streams and applies batch, real-time or interactive analytics; predictive analytics are on the roadmap.  For streaming data sources, DAS supports Java agents, JavaScript clients and 100+ connectors.  The software runs on Spark and Lucene.

(8) Hortonworks: We Aren’t Irrelevant

On the Hortonworks blog, Vinay Shukla and Ram Sriharsha tout Hortonworks’ contributions to Spark, including ORC support, an Ambari stack definition for Spark, tighter integration between Hive and Spark, minor enhancements to ML and user-facing documentation.  Looking at the roadmap, they discuss Magellan for geospatial and Zeppelin notebooks. (h/t Hadoop Weekly).

(9) Apache Drill Delivers Fast SQL-on-Laptop

On the MapR blog, Mitsutoshi Kiuchi offers a case study in how to run a silly benchmark.

Comparing the functionality of Drill and Spark SQL, Kiuchi argues that Drill “supports” NoSQL databases but Spark does not, relegating Spark’s packages to a footnote.  “Support” is a loaded word in open source software; technically, nothing is supported unless you pay for it, in which case the scope of support is negotiated as part of the SLA.  It’s also worth noting that some of these interfaces come from the database vendors themselves: MongoDB, for example, developed Spark’s MongoDB interface, which should inspire a certain amount of confidence.  (A sketch of that connector in use follows.)
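For illustration, here is a hedged sketch of reading a MongoDB collection into Spark through that connector.  The format string and URI option follow the mongo-spark-connector’s documented pattern, and the database, collection and URI are hypothetical.

```python
# Hedged sketch: reading a MongoDB collection into a Spark DataFrame via the
# mongo-spark-connector. The connector JAR must be on the classpath (e.g.
# submitted with --packages org.mongodb.spark:mongo-spark-connector_2.12:...).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-read")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost/test.events")  # hypothetical URI
         .getOrCreate())

df = spark.read.format("mongo").load()  # reads the URI configured above
df.printSchema()
print(df.count())
```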

Kiuchi does not consider other functional areas, such as security, YARN support, query fault tolerance, the user interface, metastore management and view support, where Drill comes up short.

In a previously published performance test of five SQL engines, Spark successfully ran nine out of eleven queries, while Drill ran eight out of ten.  On the eight queries both engines ran, Drill was slightly faster on six.  For this benchmark, Kiuchi runs three queries on his laptop with a tiny dataset.

As a general rule, one should ignore SQL-on-Hadoop benchmarks unless they run industry-standard queries (e.g., TPC) with large datasets in a distributed configuration.

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API (a minimal sketch follows this list), Spark SQL’s graduation from alpha, new algorithms in MLLib and Spark Streaming, and a direct Kafka API for Spark Streaming, plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics with Spark, writes on tuning Spark jobs on the Cloudera Engineering blog.
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process terabyte-scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on GitHub.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.
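Since DataFrames are the headline feature of the release, here is a minimal sketch in the 1.3-era Python API; the data is made up.

```python
# Minimal sketch of the DataFrames API introduced in Spark 1.3 (Python).
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="dataframes-demo")
sqlContext = SQLContext(sc)

people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=23)])
df = sqlContext.createDataFrame(people)

# Declarative column expressions; Spark plans and optimizes the execution.
df.filter(df.age > 30).select("name").show()
```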

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

H2O

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.  Ryft’s Christian Shrauder blogs about FPGAs.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze high-dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab’s Jiannan Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

RapidMiner

SQL on Hadoop

  • On the Pivotal blog, a podcast about HAWQ.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that Airbnb has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

Zementis

Big Analytics Roundup (March 9, 2015)

Here’s a roundup of interesting Big Analytics news and analysis from the past week.  Featured this week: Hortonworks, Alpine, Spark and H2O.

Hortonworks

  • Matt Asay, writing in InfoWorld, deconstructs Hortonworks’ earnings fiasco, and with it the “100% open source” business model.

Alpine Data Labs

  • VentureBeat reports a story that Alpine Data Labs claims 10X growth in user count and billings year over year.
  • MarketWired reports the same story.
  • ITBusinessNet too.

There is no supporting press release from Alpine Data Labs.  The VentureBeat story includes the nugget that Alpine currently has “more than 60” customers; an insider tells me that the number is closer to 75, roughly twice as many as last year.  Alpine has changed its selling model, hiring its own sales force instead of selling through EMC and Pivotal.  This also means that Alpine has changed its messaging from “we run on Greenplum and PostgreSQL, but mostly on Greenplum” to “we run on anything.”  This is an aspiration, to be sure, but a good one.

Alpine has also changed its pricing model from a perpetual server-based model to a user-based subscription model.

Separately, Ventana Research publishes a positive review of Alpine Chorus 5.0.

Apache Spark

  • Jonathan Buckley of Qubole argues that the three open source projects that transformed Hadoop are Hive, Spark and Presto.  It’s an odd choice.  Hive is certainly a key project and Spark is red hot; Presto, not so much.
  • Data prep engine vendor Paxata announces a new release that runs on Spark and releases a benchmark report showing significant performance improvements.
  • Databricks announces that B2B vendor Radius Intelligence has selected Databricks Cloud as its preferred platform, and publishes a case study.
  • Forbes profiles Databricks CEO Ion Stoica.
  • Ian Lumb offers eight reasons why Spark is hot.
  • Databricks publishes a slideshare about Spark DataFrames, which will be available in Spark 1.3 later this month.
  • From the Cloudera blog, an excellent post showing how to build an application for financial markets risk calculations in Spark.  (A toy sketch of the idea follows this list.)
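As promised in the last item, here is a toy sketch of the Monte Carlo idea in PySpark.  The normal return model and all parameters are made up; this is not the Cloudera post’s actual code.

```python
# Toy Monte Carlo value-at-risk sketch in PySpark; the return model and
# parameters are illustrative, not the Cloudera post's code.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="mc-var")

NUM_TRIALS = 100000
PARTITIONS = 8

def simulate(seed):
    # Each partition simulates its share of portfolio returns.
    rng = np.random.RandomState(seed)
    return rng.normal(loc=0.0005, scale=0.02, size=NUM_TRIALS // PARTITIONS)

returns = sc.parallelize(range(PARTITIONS), PARTITIONS).flatMap(simulate)

# 95% VaR is the 5th percentile of the simulated return distribution.
var_95 = np.percentile(np.array(returns.collect()), 5)
print("95% VaR: {:.4f}".format(var_95))
```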

H2O

  • In an interview with KDNuggets, Ted Dunning touts Mahout and H2O over Spark.
  • H2O.ai announces Cloudera certification for its Sparkling Water interface to Spark.

General

CMSWire rehashes the Gartner Magic Quadrant without adding value.   The author notes breathlessly that “many KNIME enthusiasts are data miners”, and “on the downside, (RapidMiner’s) user base is mostly data scientists”; as if these points are news, and as if there is something extraordinary about data miners and data scientists using data mining and data science tools.

Big Analytics Roundup (March 2, 2015)

Here is a roundup of some recent Big Analytics news and analysis.

General

  • SiliconAngle covers the Big Data money trail.

Apache Spark

  • Curt Monash writes about Databricks and Spark on his DBMS2 blog.
  • On the Databricks blog, Dave Wang summarizes Spark highlights from Strata + Hadoop World.
  • In this post, Hammer Lab describes how to monitor Spark with Graphite and Grafana.
  • Cloudera announces Hive on Spark beta.
  • InfoWorld covers Spark’s planned support for R in Release 1.3.
  • Qubole announces Spark as a Service.

Dato/GraphLab

  • Dato announces a new version of GraphLab Create.

H2O

  • From Strata + Hadoop World, Prithvi Pravu talks about using H2O.
  • Also from Strata, here is Cliff Click’s presentation on H2O, Spark and Python.
  • On the H2O blog, Arno Candel publishes a performance tuning guide for H2O Deep Learning.