Big Analytics Roundup (April 18, 2016)

In hard news this week, Storm hits a milestone with Release 1.0, Google releases TensorFlow 0.8 with distributed computing support, and DataStax announces DataStax Enterprise Graph. And, following on NVIDIA’s DGX-1 announcement last week, there are a number of items on Deep Learning featured below.

Deep Learning

— Adrian Colyer summarizes a paper that summarizes 900 other papers on Deep Learning.

— Data Science Central compiles a slew of links on Deep Learning.

— Nicole Hemsoth interviews NVIDIA Veep Marc Hamilton, who ruminates on the convergence of supercomputing and Deep Learning.

Explainers

— On the Pivotal Big Data blog, Alexey Grishchenko explains what’s up with Apache HAWQ, the SQL-on-Hadoop-and-Greenplum engine that is now an Apache Incubator project. According to OpenHub, there’s a lot of activity on HAWQ, and contributions are up sharply since it went Apache.

— In KDnuggets, Microsoft’s Brandon Rohrer publishes a handy pocket guide to data science.

— Nicholas A. Perez explains custom streaming sources in Spark.

— Ian Pointer explains Apache Beam, and how it aspires to be the uber-API.

— Abie Reifer explains Microsoft Azure HDInsight.

— Yong Feng of IBM’s Spark Technology Center explains results of a test run with Spark on Mesos.

— Gopal Wunnava explains geospatial intelligence with SparkR on Amazon EMR.

— IBM’s Fred Reiss explains SystemML, for those who missed his presentation at Spark Summit East.

— For masochistic sabermetricians, Nick Amato explains baseball statistics with Hive and Pig.

Perspectives

— Serdar Yegulalp reviews Apache Storm 1.0. He likes it.

— DataArtisans’ Kostas Tzoumas explains counting in streams, then touts Flink.

— Timothy Prickett Morgan reports on HPE’s efforts to put Spark on a Superdome. Results are interesting. But as with IBM running Spark on a mainframe, such efforts overlook a key benefit of Hadoop and Spark: the ability to avoid dealing with the likes of HPE and IBM.

— Katharine Kearnan interviews Nick Pentreath, one of the two Spark Committers IBM has hired. He predicts that in Spark 2.0 the ML pipeline API will approach parity with the MLlib API. Interestingly, he doesn’t expect a lot from SparkR.

— In Forbes, Chris Wilder recaps his visit to Google Cloud Platform NEXT 2016.

— Andrew Brust summarizes Hortonworks’ recent announcements, sees an emerging duopoly of Cloudera and Hortonworks. I’m not inclined to dismiss MapR and AWS so easily.

— Craig Stedman comments on Pivotal’s exit from the Hadoop distribution market, quotes some old guy wondering how much longer IBM will keep BigInsights alive. My take on Pivotal: honestly, I thought they exited a year ago.

— Cloud platform Altiscale’s Raymie Stata surveys Hadoop’s history, sees movement to the cloud.

— James Nunns wonders if the top Hadoop distributors can steal the show from Spark at Hadoop Summit 2016. If you count the number of times the word “Spark” appears in Hortonworks’ announcement, the answer is no.

— Ajay Khanna opines that absent data quality and metadata management, your data lake will turn into a data swamp.

— Nick Bishop interviews MSFT’s research chief, who assures him that AI is too stupid to wipe us out. I worry more about the chemtrails.

Open Source Announcements

— Apache Storm announces Release 1.0.0, with many enhancements. According to OpenHub, Storm is picking up steam, with 127 active contributors in the past 12 months.

— Google announces TensorFlow 0.8, with distributed computing support and new libraries for user-defined distributed models.
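For readers who want a feel for the new distributed runtime, here is a minimal sketch of the between-graph pattern using the 0.8-era Python API. The hostnames, ports and toy variable update are invented for illustration; a real job would run one such process per cluster member.

```python
import tensorflow as tf

# Hypothetical two-worker, one-parameter-server cluster.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server identifying its own job and task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables land on the parameter server; other ops run on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([10]), name="weights")
    update = weights.assign_add(tf.ones([10]))

with tf.Session(server.target) as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(update)
```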

— Apache Mahout announces release of Mahout 0.12.0, with Flink bindings to the Samsara engine. Contributors from DataArtisans did most of the work, as most other contributors have long since exited this project.

Commercial Announcements

— DataStax announces DataStax Enterprise Graph (DSE Graph), built on Apache Cassandra and Apache TinkerPop (a graph computing framework). A year ago, DataStax acquired Aurelius, the commercial venture behind Titan, an open source distributed graph database; Titan uses Cassandra as a back end. DSE Graph includes extensions found in DataStax Enterprise, including security, search, analytics and monitoring tools. Alex Handy reports.

— Databricks announces new content for its Community Edition.

— Hortonworks previews HDP 2.4.2. Key bits:

  • Spark 1.6.1.
  • Spark SQL certified with ODBC.
  • Bug fixes for Spark/Oozie connection for Kerberos-enabled clusters.
  • Spark Streaming with Apache Kafka in a Kerberos-enabled cluster (see the sketch after this list).
  • Spark SQL with ORC performance improvements.
  • Final technical preview of Apache Zeppelin with Kerberos, LDAP and identity propagation.
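For the Kafka item above, here is a bare-bones PySpark 1.6 sketch of the direct-stream API. The broker address and topic are hypothetical, and the Kerberos pieces (keytab and JAAS configuration) live in cluster configuration rather than in code.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-wordcount")
ssc = StreamingContext(sc, 10)   # 10-second micro-batches

# Direct (receiver-less) stream; broker and topic names are made up.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1.example.com:6667"})

counts = (stream.map(lambda kv: kv[1])              # keep the message value
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```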

— Hortonworks also announces that Pivotal HDP is officially dead. Pivotal announces nothing.

— Teradata announces that its Think Big subsidiary is expanding its data lake and managed service offerings using Apache Spark. This is good news for the eight consultants at Think Big with Spark credentials, as it means less time spent on the bench. Meanwhile, Think Big contributes a distributed K-Modes implementation in PySpark to open source, the first such contribution since 2014. For some reason, they did not contribute it to Spark Packages.
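For a sense of what a distributed k-modes involves, here is a toy sketch of a single iteration over an RDD of categorical records. It is not Think Big’s implementation; the sample records and starting modes are invented, and a real job would loop until the modes stop changing.

```python
from collections import Counter
from pyspark import SparkContext

sc = SparkContext(appName="kmodes-sketch")

# Invented categorical records and starting modes for two clusters.
records = sc.parallelize([
    ("red", "small", "US"), ("red", "large", "US"),
    ("blue", "small", "EU"), ("blue", "large", "EU"),
])
modes = [("red", "small", "US"), ("blue", "large", "EU")]

def matching_distance(a, b):
    """Simple matching dissimilarity: count of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_mode(record):
    return min(range(len(modes)), key=lambda k: matching_distance(record, modes[k]))

def column_modes(rows):
    """New mode = most frequent value of each attribute within a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# One iteration: assign each record to its nearest mode, then recompute the modes.
assigned = records.map(lambda r: (nearest_mode(r), r))
new_modes = (assigned.groupByKey()
                     .mapValues(lambda rows: column_modes(list(rows)))
                     .collect())
print(new_modes)
```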

— Atigeo, a “compassionate technology company”, announces that it has added Spark 1.6 to its xPatterns platform.

— Lucidworks announces release of Lucidworks View, a component that simplifies development of applications on Solr and Spark.

— DataRPM, a “Cognitive Data Science” company with very little money, announces a partnership with Tamr, a data integration company with lots of money.

Big Analytics Roundup (March 28, 2016)

Microsoft’s chatbot fail wins the internet this week, but the most important story is Google’s new managed service for machine learning. Also leading the week: Mesosphere’s new funding round led by Microsoft and HPE, and more funding for Domo.

— Google Cloud Platform (GCP) adds the Google Cloud Machine Learning Platform to its suite of managed machine learning services, which already includes Google Cloud Vision API (Beta); Google Cloud Speech API (Limited Preview); and Google Cloud Translate API. GCP still offers the Prediction API, but it’s no longer a top-level service. The Machine Learning platform, currently in Limited Preview, works with TensorFlow models that you train offline and Dataflow for preprocessing, so you can work with data in Google Cloud Storage, BigQuery and other sources. It’s an impressive stack. A cloud of speculation and navel-gazing ensues.

— Mesosphere announces that it has closed a $73.5 million Series C round, with Microsoft and Hewlett Packard Enterprise taking lead roles. Mesosphere also announces version 1.0 of Marathon, a container orchestration service for DCOS, and a new product for source code management called Velocity.

— Domo announces that it has reached $100 million in “billings” and raised another $131 million on its Series D round at a sustained valuation of $2 billion. (Billings typically exceed GAAP revenue due to the effect of prepayments on multi-year contracts.)

Explainers

— In the MIT Technology Review, Rachel Metz explains the Microsoft chatbot fail.

— Facebook’s Arun Sharma explains Dragon, a distributed graph query engine.

— Frances Perry and Tyler Akidau explain runners in Apache Beam.

— On the Netflix Tech Blog, Ben Schmaus et al. explain Mantis, a streaming analytics platform that drives alerts and dashboards.

— At a Flink Meetup in São Paulo, Slim Baltagi presents real-world use cases for streaming analytics.

— Two interesting posts on PySpark:

  • On the AWS Big Data Blog, Veronika Megler explains anomaly detection using PySpark, Hive and Hue.
  • On the MapR Blog, Ben Sadeghi explains churn prediction using PySpark, MLlib and ML (a minimal sketch follows this list).
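For the churn item, a bare-bones PySpark MLlib sketch of the general approach appears below. The input path and feature layout are hypothetical, and the MapR post’s actual pipeline differs.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext(appName="churn-sketch")

# Hypothetical CSV layout: churn label (0/1) followed by numeric features.
def parse(line):
    vals = [float(x) for x in line.split(",")]
    return LabeledPoint(vals[0], vals[1:])

data = sc.textFile("hdfs:///data/churn.csv").map(parse)   # path is made up
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = RandomForest.trainClassifier(
    train, numClasses=2, categoricalFeaturesInfo={},
    numTrees=50, featureSubsetStrategy="auto", impurity="gini", maxDepth=5)

# Score the holdout set and compute a simple error rate.
predictions = model.predict(test.map(lambda p: p.features))
labels_and_preds = test.map(lambda p: p.label).zip(predictions)
error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("holdout error: %.3f" % error)
```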

Perspectives

— Eric Kavanagh delivers a nice overview of the history of open source analytics.

— On the Qubole Blog, MediaMath’s Rory Sawyer describes the benefits of cloud-based data science infrastructure.

— In a somewhat turgid essay, Stitch Fix’s Jeff Magnusson argues that data scientists are thinkers and engineers are doers, then argues that engineers (the “doers”) should not do ETL, an argument that rebuts itself.

— Ian Allison profiles Seldon, an open source machine learning platform that specializes in content and product recommenders.

— In Datanami, Alex Woodie writes a confused piece on ‘overcoming Spark performance challenges’ that appears to be mostly about touting some new products.

— Ted Dunning previews his Strata presentation on streaming. Spoiler: he likes it.

— James Haight of Blue Hill Research offers an article teasing five things to watch for at Strata, but only details four. I feel cheated.

— Sam Charrington summarizes insights from Cloudera’s third annual analyst day. If you follow him on Twitter, you’ve already read this.

Open Source Announcements

— Airbnb donates Airflow, a workflow automation system, to Apache.
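For readers unfamiliar with Airflow, the sketch below shows the basic shape of a DAG definition: Python code declares tasks and their dependencies, and the scheduler runs them. The DAG name, schedule and commands are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Placeholder two-step pipeline: "extract" must finish before "load" starts.
dag = DAG("nightly_etl",
          start_date=datetime(2016, 3, 1),
          schedule_interval="@daily")

extract = BashOperator(task_id="extract",
                       bash_command="echo extracting",
                       dag=dag)

load = BashOperator(task_id="load",
                    bash_command="echo loading",
                    dag=dag)

extract.set_downstream(load)   # declare the dependency
```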

— KeystoneML, a machine learning pipeline framework that runs on Spark, releases version 0.3, with new solvers, new operators and a number of performance improvements. I continue to wonder why this AMPLab project isn’t part of the Spark ML library.

— Several Apache projects have new releases:

  • Apache Mahout 0.11.2 updates Spark support, includes performance enhancements and bug fixes.
  • BSP framework Apache Hama releases version 0.7.1 with bug fixes and a new scheduler.
  • OLAP-on-Hadoop project Apache Kylin delivers Release 1.3 and Release 1.5 in quick succession, skipping Release 1.4. On the Apache Kylin technical blog, Hongbin Ma details the new bits in Release 1.3, and Li Yang explains Release 1.5.
  • SQL engine MRQL releases version 0.6, with new features for incremental query processing.

Commercial Announcements

— Altiscale announces the Altiscale Insight Cloud, an analytics-as-a-service platform that runs on top of the Altiscale Data Cloud. The service combines a number of popular tools, including Spark, Hive, Pig, Python, R, Mahout, MATLAB and H2O. Altiscale also claims to include Revolution R, which is curious since Microsoft acquired and rebranded the product.

— Alteryx and Microsoft announce a partnership, which makes sense for both parties. Alteryx, a Windows-based product, fills a gap in Microsoft’s product line, and Azure greatly expands Alteryx’s market reach.

— DataRobot announces that it is certified on Cloudera, claims to be the only Cloudera partner that is certified on all of Cloudera’s bits, including Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels. George Leopold reports.

— Sense announces that it has been acquired by Cloudera. I’m struggling to understand why I should care.

Big Analytics Roundup (February 29, 2016)

Happy Leap Day.  Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week.  The Alluxio announcement has inspired big thinkers to share big thoughts.  And, we have a nice crop of explainers.  Scroll down to the bottom for another SQL on Hadoop benchmark.

Explainers

— In SearchDataManagement, Jack Vaughan explains Spark 2.0.

— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.

— MapR’s Jim Scott explains Spark accumulators. Jim also explains Spark Streaming.
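As a quick reference, accumulators are driver-side counters that tasks can only add to; the driver reads the value after an action runs. A minimal PySpark sketch (the file path is invented) looks like this:

```python
from pyspark import SparkContext

sc = SparkContext(appName="accumulator-sketch")

# Driver-side counter; tasks may only add to it, the driver reads .value.
bad_records = sc.accumulator(0)

def parse(line):
    try:
        return [float(line)]
    except ValueError:
        bad_records.add(1)
        return []

values = sc.textFile("hdfs:///data/numbers.txt").flatMap(parse)  # path is made up
total = values.sum()   # an action must run before the accumulator is populated
print("sum=%s, bad records=%s" % (total, bad_records.value))
```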

— DataArtisans’ Fabian Hueske introduces Flink.

— In SlideShare, Julian Hyde explains streaming SQL.

— Wes McKinney explains why pandas users should be excited about Apache Arrow.

— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.

— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).

Perspectives

— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark.

— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.

— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop.  She means Hadoop in a general way, e.g. the Hadoop ecosystem.

— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group and asks him questions about data science.  Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”

— Alpine Data Labs’ Steven Hillion ruminates on success.  He’d be better off ruminating on “how to raise your next round of venture capital.”

— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.

— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder.  He’s just trolling.

— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense.   He further opines that Flink will displace Spark for streaming, which doesn’t make sense.   Shore interviews IBM’s Nagui Halim about streaming here.

Open Source Announcements

— Alluxio (née Tachyon) announces Release 1.0.0. Alluxio is open source software distributed through GitHub under an Apache license, but is not an Apache project. Yet. Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin. Daniel Gutierrez reports.

— Yahoo releases CaffeOnSpark, a distributed deep learning package. Caffe is one of the better-known deep learning packages, with a track record in image recognition. Software is available on GitHub. For more information, see the Wiki. Alex Handy reports; Charlie Osborne reports.

— RapidMiner China announces availability of an extension for deep learning engine DL4J.  The extension is open source, and works with the open source version of RapidMiner.  DL4J sponsor Skymind collaborated.

Commercial Announcements

— Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.

— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.

Funding Announcements

— Health analytics vendor Health Catalyst lands a $70M Series E round.

AtScale Benchmarks SQL-on-Hadoop Engines

On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (Hive 1.2/Tez 0.7), Cloudera Impala (2.3) and Spark SQL (1.6). The AtScale team tested Impala and Spark SQL with Parquet, and Hive on Tez with ORC. For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.

While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet.  Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well.  AtScale did not respond to queries on this point.

Key findings:

  • All three engines performed about the same on single-table queries, and on queries joining three small tables.
  • Spark and Impala ran faster than Hive on queries joining three large tables.
  • Spark ran faster than Impala on queries joining four or more tables.

The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.

In concurrency testing, Impala performed much better than Hive or Spark.

Details of the test are available in a white paper here (registration required).

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs), the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016. Spark competes with SAS’ proprietary back end, but SAS will be forced to support it due to partnerships with the Hadoop distributors. Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases. Spark MLlib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve. Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead. The project leadership team must either find someone interested in contributing, fold the library into MLlib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital. Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount. Palantir and Opera Solutions, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud. The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management. But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom. Sorry guys — the biggest data breaches in the past two years were from on-premises systems. Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing. Commercial and open source tools that automate modeling in various ways have been available since the 1980s. Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality. In 2016, enterprises will have access to software that delivers expert-level predictive models of the kind that win Kaggle competitions.
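To make the idea concrete, the toy scikit-learn sketch below automates one narrow slice of the problem: searching hyperparameters with cross-validation and keeping the best model. Commercial automated machine learning tools go much further (algorithm selection, feature engineering, ensembling); this is an illustration of the principle, not any vendor’s product.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Bundled toy dataset stands in for a real modeling problem.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

# Automate the search over candidate hyperparameters with 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("holdout accuracy: %.3f" % search.score(X_test, y_test))
```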

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

(Image: the Titanic sinking.)

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue. No product sales, no consulting revenue. Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance. By itself, Teradata’s software is nothing special; there are plenty of open source alternatives, like Greenplum. Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice. Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like rearranging deck chairs. The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.