Big Analytics Roundup (May 2, 2016)

Movidius ups the ante for trade show trinkets by releasing what journos describe as supercomputing, neural computing power, vision processing, deep learning, and artificial intelligence on a USB drive.  Roundup here.


Last November, IBM’s Paul Zikopoulos snarked at Cloudera for not supporting SparkR. Cloudera’s Sean Owen, responding to a query in the Cloudera Community, notes that SparkR “does not work with other resource managers,” and does not work unless R is installed on the data nodes. Sean also notes that Cloudera cannot redistribute R because it is under GPL license. Data scientist Iraklis Tsatsoulis explains how to make SparkR work in Cloudera. Cloudera’s response isn’t completely satisfactory — the GPL license does not prohibit Cloudera from redistributing R, for example — but it is based on actual working experience with the product, which IBM clearly does not have.

Turning to important matters, a group at the Technical University of Munich has a machine learning engine that predicts who will die in Game of Thrones. Not very well, it seems; they blew it on Roose Bolton. Oops, spoiler.


Explainers

— Adrian Colyer explains GeePS, a Deep Learning framework for clusters of GPUs. Put that on a thumb drive and we can talk.

— On the Altiscale blog, Professor Jimmy Lin compares local installations, virtual machines, IaaS providers and Altiscale’s Hadoop-as-a-Service offering for teaching students about Big Data. Spoiler: he likes Altiscale.

— Two benchmarks from the Cloudera Engineering Blog:

  • Devadutta Ghat et al. explain results from benchmarking Impala 2.5 with TPC queries. They claim an average speedup of 4.35X over Impala 2.3 for TPC-DS.
  • Allstate’s Don Drake explains results of a test comparing Spark 1.6 performance with Avro and Parquet, with CSV as a baseline. Drake ran a multi-step benchmark with a narrow table and a wide table. Results: the Spark job ran faster with Parquet than with Avro, markedly so for the wide data set, which makes sense since Parquet is columnar. Also, performance with CSV sucked. (A minimal sketch of this kind of comparison follows this list.)
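For the curious, here is a minimal PySpark sketch of that kind of format comparison. It is written against the current SparkSession API rather than the Spark 1.6 API Drake used; the input file name is a placeholder, and the Avro data source is assumed to be on the classpath.

```python
# Round-trip a DataFrame through Parquet, Avro and CSV and time a simple scan of each.
# "events.csv" is a placeholder; the Avro source requires the spark-avro package.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()
df = spark.read.option("header", "true").option("inferSchema", "true").csv("events.csv")

def timed_roundtrip(fmt, path):
    """Write df in the given format, read it back, and force a full scan."""
    start = time.time()
    df.write.mode("overwrite").format(fmt).save(path)
    reloaded = spark.read.format(fmt).load(path)
    reloaded.groupBy(reloaded.columns[0]).count().collect()
    return time.time() - start

for fmt, path in [("parquet", "/tmp/events_parquet"),
                  ("avro", "/tmp/events_avro"),
                  ("csv", "/tmp/events_csv")]:
    print(fmt, round(timed_roundtrip(fmt, path), 1), "seconds")
```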

— Three items from MapR’s Converge blog:

  • Nick Amato explains how to predict Airbnb listing prices with scikit-learn and Spark.
  • Mathieu Dumoulin explains Deep Learning with the CaffeOnSpark package.
  • Nicolas A Perez explains how to do Twitter sentiment analysis with Spark Streaming.

— Corentin Kerisit explains RDD partitioning in Spark.

Perspectives

— An anonymous blogger at CBInsights notes that big tech companies are paying big bucks for AI companies, so if you’re running a startup make sure you put AI in the name.

— Alexander Wissner-Gross weighs in on the “datasets versus algorithms” debate. My take: data trumps algorithms.

— Google streams engineer Tyler Akidau discusses streaming systems versus batch processing, which is like asking Mr. Fox for his perspective on chickens.

— David Weldon continues his series of interviews with people at Strata + Hadoop: Ravi Dharnikota of SnapLogic, who heard a lot of talk about streaming, Spark and data lakes.

— Alan Earls touts Amazon Machine Learning without understanding it.

— Jack Vaughan interviews eBay’s Debashis Saha, who discusses Kylin and other stuff.

Open Source Announcements

— The Apache Software Foundation announces that Apache Apex has graduated to top level status. Apex, for streaming analytics, is the open source version of DataTorrent. Jessica Davis reports.

— North Bridge and Black Duck release their tenth annual survey of people who like open source.

— Apache Flink 1.0.2 ships with bug fixes and a new capability to integrate with RocksDB. So now, you can have Flink on Rocks.

Commercial Announcements

— Google’s DeepMind AI unit announces that they will use TensorFlow instead of Torch for all future work.

— Three guys exit Pivotal, start a company named SnappyData, land a tiny “A” round from Pivotal and GE Digital and propose to build something like GemFire, but on Spark. More here.

— Levyx announces a small “A” round. Levyx offers a version of Spark optimized to run on solid state/Flash memory.

— Tiny consulting firm Xentaurs announces a partnership with Mesosphere. And not just any partnership; it’s a strategic partnership. Actually, they just joined the DC/OS community.

Big Analytics Roundup (April 11, 2016)

Top story of the week is NVIDIA’s new DGX-1 deep learning system; scroll down for more on that.

We have three roundups from Strata + Hadoop World, Rashomon style:

  • Alex Woodie reports six takeaways: Kafka, Spark, Hadoop, Cloud, machine learning, mainframes.
  • Jessica Davis recalls four things: comedian Paula Poundstone, MapR, public data sets, AI.
  • Nik Rouda recaps five things: Spark, machine learning, data warehousing, user interfaces, cloud.

— H2O.ai CTO and co-founder Cliff Click departs H2O, joins Neurensic, a firm that specializes in compliance analytics. Neurensic has a team of surname-eschewing executives that is surprisingly large considering it has no visible funding.

— Machine learning startup Context Relevant announces the appointment of Joseph Polverari as CEO, replacing board member Chris Kelley, who replaced founder Stephen Purpura in July, 2015, a month after the latter wrote a meditation on failure. Kelley’s major accomplishment: firing people. Appears that Context Relevant isn’t the next unicorn.

— One of the 76 IBM executives with the title of “CTO” touts cognitive computing. My take:

[Screenshot]

— Forrester publishes its 2016 “Wave” for Big Data Streaming Analytics. You can go here and buy it for $2,495, get a free copy here, or just look at the picture below.

[Graphic: Forrester’s 2016 Wave for Big Data Streaming Analytics]

— Spiderbook’s Aman Naimat examines data gleaned by trolling through billions of publicly available documents, identifies 2,680 companies that are using Hadoop at any level of maturity, and another 3,500 that are just learning. That’s out of a total universe of 500,000 companies worldwide. I’m thinking that trolling through billions of public documents may understate the actual incidence of Hadoop usage.

— Crowdflower, a data enrichment platform, surveys data scientists and publishes the results. The report does not disclose how data scientists were identified and sampled, which is key to interpreting surveys like this. Respondents report that they spend a lot of time mucking around with data, which won’t surprise anyone, since Crowdflower sells a service that helps data scientists spend less time mucking with data.

NVIDIA Unveils Deep Learning Chip

— NVIDIA announces June availability for the DGX-1, a deep learning supercomputer in a box. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

— NVIDIA also reveals a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow and Theano.

— Selected media reports:

— MIT Technology Review interviews NVIDIA CEO Jen-Hsun Huang.

Explainers

— Ian Pointer explains Structured Streaming, coming up in Spark 2.0.

— Till Rohrmann introduces Complex Event Processing (CEP) with Flink.

— Maxime Beauchemin explains Caravel, Airbnb’s data exploration platform.

— LinkedIn’s Akshay Rai explains Dr. Elephant, a newly open-sourced self-service performance tuning package for Hadoop and Spark.

— In a guest post on the Cloudera Engineering Blog, engineers from Wargaming.net explain how they built their real-time recommendation engine with Spark, Kafka, HBase and Drools.

— Katrin Leinweber et al. explain how to analyze an assay of bacteria-induced biofilm formation in the freshwater diatom Achnanthidium minutissimum with KNIME. In case you’re wondering, Achnanthidium minutissimum is a kind of algae.

Perspectives

— On LinkedIn, George Hill of The Cyclist nicely critiques the 2011 McKinsey Big Data report, offering a point by point assessment.

— Mauricio Prinzlau of Cloudwards.net opines, without data, that the five languages paving the future of machine learning are MATLAB/Octave, R, Python, “Java-family/C-family” and Extreme Learning Machines (ELM). What was that last one again? Personally, I’ve never seen anyone lump Java and C into a single category, but whatever.

— In InfoWorld, “internationally recognized industry expert and thought leader” David Linthicum ventures into the machine learning discussion by arguing that it’s mostly BS.

— John Dunn demonstrates his ignorance of fraud by asking if machine learning can help banks detect it. As if they haven’t been doing that for years. Also, the “hard decline” he describes at the beginning of the article is rare; most false positives produce “soft declines,” where the merchant is asked to request identification or speak with the call center.

— In IBT, Ian Allison wonders if financial analysts will lose their jobs to intelligent trading machines. If he watched Billions, he would know that financial analysts spend their time procuring inside information.

— Timo Elliott argues that BI is dead. I have to wonder if it was ever alive.

— Confluent CTO Neha Narkhede opines on stream processing. She’s in favor of it.

— Brandon Butler interviews AWS’ Matt Wood, who chats about competing with Google and Microsoft.

— On Forbes, Robert Hof interviews Cloudera CEO Tom Reilly.

Open Source Announcements

— Qubole releases SQL optimizer Quark to open source.

— Flink releases version 1.0.1, a maintenance release.

— Apache Lens, a “unified analytics interface,” releases version 2.5.0 to beta.

— Airbnb open sources Caravel, a data exploration package.

— Apache Tajo announces Release 0.11.2, which should please its user.

— LinkedIn releases Dr. Elephant to open source.

Commercial Announcements

— Databricks announces the agenda for Spark Summit 2016 in SFO.

— Cloudera announces Cloudera Enterprise 5.7. New analytic bits include Hive-on-Spark GA, support for the HBase-Spark module, support for Spark 1.6 and support for Impala 2.5.

— MapR announces availability of Apache Drill 1.6 as the unified SQL layer for the MapR Converged Data Platform.

Big Analytics Roundup (April 4, 2016)

Strata + Hadoop World sparks a number of commercial announcements: AtScale has a new release, Microsoft previews R Server on HDInsight, and IBM puts Spark on a mainframe, FWIW. We also have a nice harvest of explainers and perspectives.

Slides from Strata available here.

The folks at Domino Data ask: Is XGBoost 10X faster than H2O? We’ll never know the answer, since they took down the post. I’m guessing the answer is “no.”


Databricks offers a collection of popular blog posts on Apache Spark as an eBook.

Explainers

On the Google Cloud Big Data Blog, Eric Anderson and Marian Dvorsky compare autoscaling in Dataflow/Beam to Spark and Hadoop. (h/t William Vambenepe)

Miles Yucht and Reynold Xin explain DeepSpark, a convolutional neural network that automates software development processes, such as writing test cases, fixing bugs and so forth.

Databricks’ Jules Damji explains how to process JSON data with Spark Datasets and DataFrames.

On the Airbnb engineering blog, Ricardo Bion explains how to scale data science with R.

Eduardo Ariño De La Rubia explains how The Climate Corporation created a high-throughput data science machine.

DataArtisans’ Kostas Tzoumas explains Flink internals, and how Flink counts elements in streams.

On the Insight Data Engineering blog, Daniel Blazevski explains Flink quadtrees.

H2O.ai’s Erin LeDell explains scalable ensemble learning with H2O. Also at Strata, Arno Candel explains why Deep Learning is eating your lunch.

On the Dataiku blog, someone named Margot explains automated model deployment with Data Science Studio.

On the DataTorrent blog, David Yan explains latency calculations in Apache Apex.

Christopher Crosbie explains SparkR on EMR, on the AWS Big Data blog.

Perspectives

Jack Vaughan notes the prominence of streaming analytics at Strata, quotes some old guy who thinks streaming is a thing.

On the Cloudera Vision Blog, Dan Sturman describes Cloudera’s response to what he characterizes as a software quality challenge.

Cloud vendor Altiscale’s Raymie Stata asks which is best for Spark and Hadoop: cloud or on-premises. Spoiler: he thinks you should choose cloud.

On LinkedIn, consultant Rick van der Lans touts Apache Drill.

Wikibon releases forecasts of Spark adoption and the Big Data market. You can either pay Wikibon for a subscription, or read George Leopold’s summary here or Mike Wheatley’s summary here.

Alex Woodie recaps Doug Cutting’s keynote at Strata+Hadoop.

On the tech blog for Berlin-based online retailer Zalando, Javier Lopez and Mihail Vieru recap a recently completed Flink versus Spark bakeoff. They like Flink’s low latency which, as a fashion retailer, they totally think they need. The bottom line, though, seems to be that DataArtisans is just a few stops away on the U-Bahn, so they chose Flink.

Brandon Butler summarizes the Microsoft and Google challenges to Amazon in the cloud.

InfoWorld’s Martin Heller reviews Databricks’ Spark service, likes it.

In TechCrunch, Josh Klahr lists seven things to watch for at Strata + Hadoop World, which is still worth reading even though the show came and went.

Talend CMO Ashley Stirrup suggests you sharpen your customer reflexes with Apache Spark. If you want to improve your actual reflexes, read this.

Open Source Announcements

ASF announces Apache NiFi 0.6.0, with Kerberos authentication for its REST API and support for Amazon Kinesis, AWS Lambda, Splunk, and Apache Cassandra. (h/t Hadoop Weekly)

Commercial Announcements

OLAP-on-Hadoop vendor AtScale announces release 4.0. Key new bits: fine-grained security that links every query to an end user and an intelligent query optimizer that pushes down either SQL or MDX depending on the end user tool. AtScale has also expanded its platform integration: it now supports Business Objects, Cognos, Excel, Jaspersoft, Qlik, MicroStrategy, PowerBI, Spotfire, and Tableau on CDH, HDP, HDInsight and MapR with Hive/Tez, Impala and Spark SQL, plus an impressive list of data storage formats. Mike Wheatley reports.

Data integration startup Tamr announces “compatibility” with Spark. The press release does not specify whether that means connectivity, push-down integration or something else. Tamr is not certified by Databricks, and has not published anything on Spark Packages.

Pouring new wine into old bottles, IBM delivers Spark on a mainframe, as promised last July.  IBM touts this as a way to perform analysis of your data “in place”, which is great if all of your data is stuck on a mainframe.

IBM partners with Lightbend, the company formerly known as Typesafe, to deliver Scala training through the Big Data University.

Altiscale announces partnership with Tableau, will add visualization to its managed service for Big Data.

Databricks announces availability of APIs to automate Spark infrastructure. On the Databricks blog, Dave Wang explains.

Microsoft announces preview of R Server for HDInsight and an update to Apache Spark for Azure HDInsight. R Server for HDInsight is a rebranded version of Revolution Analytics’ ScaleR, acquired with the company last year. R Server is a distributed machine learning platform with push-down integration to MapReduce and Spark and an R API.

Flink promoter DataArtisans announces a 5.5 million Euro Series A financing round led by Intel Capital.

Dataiku announces a new release of Data Science Studio. The press release touts some new features, but I’ll refrain from commenting until the company posts release notes.

Big Analytics Roundup (March 28, 2016)

Microsoft’s chatbot fail wins the internet this week, but the most important story is Google’s new managed service for machine learning. Also leading the week: Mesosphere’s new funding round led by Microsoft and HPE, and more funding for Domo.

— Google Cloud Platform (GCP) adds the Google Cloud Machine Learning Platform to its suite of managed machine learning services, which already includes Google Cloud Vision API (Beta); Google Cloud Speech API (Limited Preview); and Google Cloud Translate API. GCP still offers the Prediction API, but it’s no longer a top-level service. The Machine Learning platform, currently in Limited Preview, works with TensorFlow models that you train offline and Dataflow for preprocessing, so you can work with data in Google Cloud Storage, BigQuery and other sources. It’s an impressive stack. A cloud of speculation and navel-gazing ensues.

— Mesosphere announces that it has closed a $73.5 million Series C round, with Microsoft and Hewlett Packard Enterprise taking lead roles. Mesosphere also announces version 1.0 of Marathon, a container orchestration service for DCOS, and a new product for source code management called Velocity.

— Domo announces that it has reached $100 million in “billings” and raised another $131 million on its Series D round at a sustained valuation of $2 billion. (Billings typically exceed GAAP revenue due to the effect of prepayments on multi-year contracts.)

Explainers

— In the MIT Technology Review, Rachel Metz explains the Microsoft chatbot fail.

— Facebook’s Arun Sharma explains Dragon, a distributed graph query engine.

— Frances Perry and Tyler Akidau explain runners in Apache Beam.

— On the Netflix Tech Blog, Ben Schmaus et. al. explain Mantis, a streaming analytics platform that drives alerts and dashboards.

— At a Flink Meetup in Sao Paulo, Slim Baltagi presents real-world use cases for streaming analytics.

— Two interesting posts on PySpark:

  • On the AWS Big Data Blog, Veronika Megler explains anomaly detection using PySpark, Hive and Hue.
  • On the MapR blog, Ben Sadeghi explains churn prediction using PySpark, MLlib and ML; a minimal pipeline sketch follows this list.
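Here is a minimal sketch of that kind of churn pipeline with the DataFrame-based Spark ML API. The input path and column names are placeholders, not Sadeghi’s actual data.

```python
# Churn classifier sketch with the Spark ML Pipeline API.
# "churn.csv" and the column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("churn").getOrCreate()
data = spark.read.option("header", "true").option("inferSchema", "true").csv("churn.csv")

label = StringIndexer(inputCol="churn", outputCol="label")
plan = StringIndexer(inputCol="intl_plan", outputCol="intl_plan_idx")
features = VectorAssembler(
    inputCols=["intl_plan_idx", "day_minutes", "eve_minutes", "custserv_calls"],
    outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[label, plan, features, rf]).fit(train)

auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(model.transform(test))
print("Test AUC:", auc)
```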

Perspectives

— Eric Kavanagh delivers a nice overview of the history of open source analytics.

— On the Qubole Blog, MediaMath’s Rory Sawyer describes the benefits of cloud-based data science infrastructure.

— In a somewhat turgid essay, Stitch Fix’s Jeff Magnusson argues that data scientists are thinkers and engineers are doers, then argues that engineers (the “doers”) should not do ETL, an argument that rebuts itself.

— Ian Allison profiles Seldon, an open source machine learning platform that specializes in content and product recommenders.

— In Datanami, Alex Woodie writes a confused piece on ‘overcoming Spark performance challenges’ that appears to be mostly about touting some new products.

— Ted Dunning previews his Strata presentation on streaming. Spoiler: he likes it.

— James Haight of Blue Hill Research offers an article teasing five things to watch for at Strata, but only details four. I feel cheated.

— Sam Charrington summarizes insights from Cloudera’s third annual analyst day. If you follow him on Twitter, you’ve already read this.

Open Source Announcements

— Airbnb donates Airflow, a workflow automation system, to Apache.

— KeystoneML, a machine learning pipeline framework that runs on Spark, releases version 0.3, with new solvers, new operators and a number of performance improvements. I continue to wonder why this AMPLab project isn’t part of the Spark ML library.

— Several Apache projects have new releases:

  • Apache Mahout 0.11.2 updates Spark support, includes performance enhancers and bug fixes.
  • BSP framework Apache Hama releases version 0.7.1 with bug fixes and a new scheduler.
  • OLAP-on-Hadoop project Apache Kylin delivers Release 1.3 and Release 1.5 in quick succession, skipping Release 1.4. On the Apache Kylin technical blog, Hongbin Ma details the new bits in Release 1.3, and Li Yang explains Release 1.5.
  • SQL engine MRQL releases version 0.6, with new features for incremental query processing.

Commercial Announcements

— Altiscale announces the Altiscale Insight Cloud, an analytics-as-a-service platform that runs on top of the Altiscale Data Cloud. The service combines a number of popular tools, including Spark, Hive, Pig, Python, R, Mahout, Matlab and H2O. Altiscale also claims to include Revolution R, which is curious since Microsoft acquired and rebranded the product.

— Alteryx and Microsoft announce a partnership, which makes sense for both parties. Alteryx, a Windows-based product, fills a gap in Microsoft’s product line, and Azure greatly expands Alteryx’s market reach.

— DataRobot announces that it is certified on Cloudera, claims to be the only Cloudera partner that is certified on all of Cloudera’s bits, including Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels. George Leopold reports.

— Sense announces that it has been acquired by Cloudera. I’m struggling to understand why I should care.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; the benefits of “pure” streaming versus micro-batching are largely theoretical for most applications, but the difference is real enough to show up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera Solutions, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

[Image: the Titanic sinking]

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  Teradata software by itself is nothing special; there are plenty of open source alternatives, like Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

2015 in Big Analytics

Looking back at 2015, a few stories stand out:

  • Steady progress for Spark, punctuated by two big announcements.
  • Solid growth in cloud-based machine learning, led by Microsoft.
  • Expanding options for SQL and OLAP on Hadoop.

In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April.  I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.

IBM Embraces Spark

IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit.  IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base.  It also brings a clutter of aging and partially integrated products, an army of suits and no fewer than 164 Vice Presidents whose titles include the words “Big Data.”

When IBM announced its Spark initiative I joked that somewhere in the bowels of IBM, someone will want to put Spark on a mainframe.  Color me prophetic.

It’s too early to tell what substantive contributions IBM will make to Spark.  Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September.  This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.

All that said, IBM brings respectability, and the assurance that Spark is ready for prime time.  This is priceless.  Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.

Cloudera Announces “One Platform” Initiative

In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated.  As with the IBM announcement, the symbolism matters.  Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true.  It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.

The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop.  According to Databricks’ recent Spark user survey, only 40% use Spark under YARN, with the rest running in a freestanding cluster or on Mesos.  It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish.  But if Cloudera thinks “One Platform” will stem that tide, it is mistaken.  It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.

Microsoft Builds Credibility in Analytics

In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics.  The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets.  Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.

Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML).   First released in beta in June 2014, AML is both easy to use and powerful.  The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts.  Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.

Azure Machine Learning contrasts markedly with Amazon Machine Learning.  Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love.  Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon.  If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.

SQL Engines Proliferate

At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL.  Spark SQL was still in Alpha; Drill was a curiosity; and Presto was something used at Facebook.

Several things happened during the year:

  • Hive on Tez established rough performance parity with the fast SQL engines.
  • Spark SQL went to general release, stabilized, and rolled out the DataFrames API.
  • MapR promoted Drill, and invested in improvements to the software.  Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
  • Cloudera donated Impala to the Apache Software Foundation, and Pivotal donated HAWQ.
  • Teradata placed its chips on Presto.

While it’s great to see so many options emerge, Hive continues to win actual evaluations.  Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance.  Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.

The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.

OLAP on Hadoop Gets Real

For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options.  The open source option, Apache Kylin, originally an eBay project, just recently graduated to Apache top level status.  Adoption is limited at present, but any project used by eBay and Baidu is worth a look.

The commercial option is AtScale, a company that emerged from stealth in April.  Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools.  It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.

Funding for Machine Learning

H2O.ai’s recently announced B round is significant for a couple of reasons.  First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.

Like Sherlock Holmes’ dog that did not bark, two companies are significant because they did not procure funding in 2015:

  • Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times.  It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
  • Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem.  Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.

Palantir continued to suck up capital like a whale feeding on krill.

Google TensorFlow

Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.

Spark Summit Europe Roundup

The 2015 Spark Summit Europe met in Amsterdam October 27-29.  Here is a roundup of the presentations, organized by subject areas.   I’ve omitted a few less interesting presentations, including some advertorials from sponsors.

State of Spark

— In his keynote, Matei Zaharia recaps findings from Databricks’ Spark user survey, notes growth in summit attendance, meetup membership and contributor headcount.  (Video here). Enhancements expected for Spark 1.6:

  • Dataset API
  • DataFrame integration for GraphX, Streaming
  • Project Tungsten: faster in-memory caching, SSD storage, improved code generation
  • Additional data sources for Streaming

— Databricks co-founder Reynold Xin recaps the last twelve months of Spark development.  New user-facing developments in the past twelve months include:

  • DataFrames
  • Data source API
  • R binding and machine learning pipelines

Back-end developments include:

  • Project Tungsten
  • Sort-based shuffle
  • Netty-based network

Of these, Xin covers DataFrames and Project Tungsten in some detail.  Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work.  Video here.

Getting Into Production

— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips to avoid them.  Key issues: moving beyond Python performance; using Spark with R; network and CPU-bound workloads.  Video here.

— Tuplejump’s Evan Chan summarizes Spark deployment options and explains how to productionize Spark, with special attention to the Spark Job Server.  Video here.

— Spark committer and Databricks engineer Andrew Or explains how to use the Spark UI to visualize and debug performance issues.  Video here.

— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization.  They tout Sentry, Cloudera’s preferred security platform.  Video here.

Spark for the Enterprise

— Revisiting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality.  He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit.  His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark.  Video here.

— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others).  Video here.  Key insights:

  • Prediction is the most popular use case
  • Hive is most frequently co-installed, followed by HBase, Impala and Solr.
  • Customers want security and performance comparable to leading relational databases combined with simplicity.

Data Sources and File Systems

— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API.  (Video unavailable).

— Tachyon Nexus’ Gene Pang offers an excellent overview of Tachyon’s memory-centric storage architecture and how to use Spark with Tachyon.  Video here.

Spark SQL and DataFrames

— Michael Armbrust, lead developer for Spark SQL, explains DataFrames.  Good intro for those unfamiliar with the feature.  Video here.
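For a taste of what the DataFrame API looks like, here is a minimal, self-contained example; it uses the current SparkSession entry point rather than the 1.5-era SQLContext from the talk, and the sample data is made up.

```python
# Minimal DataFrame example: declarative column expressions that Catalyst
# optimizes into a single physical plan before execution.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframes-demo").getOrCreate()

df = spark.createDataFrame(
    [("impala", "sql", 2013), ("spark", "general", 2014), ("flink", "streaming", 2015)],
    ["project", "category", "year"])

(df.filter(F.col("year") >= 2014)
   .groupBy("category")
   .agg(F.count("*").alias("n"))
   .show())
```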

— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark.  More detail about the use case and solution here.  The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata.  Video here.

— Informatica’s Kiran Lonikar summarizes a proposal to use GPUs to support columnar data frames.  Video here.

— Ema Orhian of Atigeo describes jaws, a RESTful data warehousing framework built on Spark SQL with Mesos and Tachyon support.  Video here.

Spark Streaming

— Helena Edelson, VP of Product Engineering at Tuplejump, offers a comprehensive overview of streaming analytics with Spark, Kafka, Cassandra and Akka.  Video here.

— Francois Garillot of Typesafe and Gerard Maas of virdata explain and demo Spark Streaming.    Video here.

— Iulian Dragos and Luc Bourlier explain how to leverage Mesos for Spark Streaming applications.  Video here.

Data Science and Machine Learning

— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project.  Video here.

— Alexander Ulanov, Senior Research Scientist at Hewlett-Packard Labs, describes his work with Deep Learning, building on MLLib’s multilayer perceptron capability.  Video here.

— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR.  He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages.  The SparkR roadmap, he notes, includes expanded MLLib functionality; UDF support; and a complete DataFrame API.  Finally, he demos SparkR and explains how to get started.  Video here.

— MLlib committer Joseph Bradley explains how to combine the strengths of R, scikit-learn and MLlib.  Noting the strengths of the R and scikit-learn libraries, he addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment?  Bradley demonstrates how to do this with Spark, using sentiment analysis as an example.  Video here.
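One common answer to that question is to fit a model with the single-machine library on the driver, broadcast it, and score partitions in parallel. The sketch below illustrates that general pattern with scikit-learn; it is my own illustration, not Bradley’s code, and the data is synthetic.

```python
# Train a scikit-learn model locally, broadcast it, and score a distributed
# dataset partition by partition with Spark. Illustrative only; synthetic data.
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.appName("sklearn-on-spark").getOrCreate()
sc = spark.sparkContext

# Fit a small model on the driver.
X = np.random.rand(1000, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
clf = LogisticRegression().fit(X, y)

# Ship the fitted model to the executors once.
model_bc = sc.broadcast(clf)

def score_partition(rows):
    feats = np.array(list(rows))
    if feats.size == 0:
        return iter([])
    return iter(model_bc.value.predict(feats).tolist())

rdd = sc.parallelize(np.random.rand(100000, 3).tolist(), numSlices=8)
print(rdd.mapPartitions(score_partition).take(5))
```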

— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLLib, Akka and Cassandra.  He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.

— Bitly’s Sarah Guido explains topic modeling, using Spark MLLib’s Latent Dirichlet Allocation.  Video here.
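For reference, here is a minimal topic-modeling sketch with Spark’s LDA. It uses the DataFrame-based ML API rather than the RDD-based MLlib API of the talk, and the sample documents are made up.

```python
# Tokenize, vectorize, and fit an LDA topic model on a toy corpus.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-demo").getOrCreate()

docs = spark.createDataFrame(
    [(0, "spark streaming kafka real time analytics"),
     (1, "hive sql hadoop batch query engine"),
     (2, "neural networks deep learning gpu training")],
    ["id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=1000).fit(tokens)
corpus = cv.transform(tokens)

model = LDA(k=2, maxIter=20).fit(corpus)
model.describeTopics(3).show(truncate=False)
```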

— Casey Stella describes using word2vec in MLLib to extract features from medical records for a Kaggle competition.  Video here.

— Piotr Dendek and Mateusz Fedoryszak of the University of Warsaw explain Random Ferns, a bagged form of Naive Bayes, for which they have developed a Spark package. Video here.

GeoSpatial Analytics

— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine.  Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL.  Video here.

Use Cases and Applications

— Ion Stoica summarizes Databricks’ experience working with hundreds of companies, distills to two generic Spark use cases:  (1) the “Just-in-Time Data Warehouse”, bypassing IT bottlenecks inherent in conventional DW; (2) the unified compute engine, combining multiple frameworks in a single platform.  Video here.

— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a use case for real-time analytics.  SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration.  Video here.

— Yahoo’s Ayman Farahat describes a collaborative filtering application built on Spark that generates 26 trillion recommendations.  Training time: 52 minutes; prediction time: 8 minutes.  Video here.

— Sujit Pal explains how Elsevier uses Spark together with Solr and OpenNLP to annotate documents at scale.  Elsevier has donated the application, called SoDA, back to open source.  Video here.

— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure.  Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including Pebble, Android, iOS, Play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS.  With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior.  Video here.

— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes Kafka, Secor, Swift, Parquet and Elasticsearch for data collection; Spark SQL and MLLib for pattern learning; and a complex event processing engine for application in real time.  Video here.

Big Analytics Roundup (November 2, 2015)

Spark Summit Europe, Oracle Open World and IBM Insights all met last week, as did Cloudera’s Wrangle conference for data scientists.

But in the really important news, KC beats the Mets to take the Series.

Top news from the Spark Summit is Typesafe’s announcement of Spark support, plus some insight into what’s coming in Spark 1.6.  I will publish a separate roundup for the Spark Summit next week  when presentations are available.

Nine stories this week:

(1) Typesafe Announces Spark Support

Typesafe, the commercial venture behind Scala and Akka, announces commercial support for Apache Spark.  Planned service offerings include one-business-day response to questions for projects in development.  For production, SLAs range from four-hour turnaround during business hours up to 24/7 coverage with one-hour turnaround.

(2) More Funding for Alteryx

The New York Times reports that Alteryx has landed an $85 million “C” round, led by Iconiq Capital.  That makes a total of $163 million in four rounds for the company.

(3) Oracle Adds Spark to Cloud

At Oracle Open World, Oracle announces Oracle Cloud Platform for Big Data, a PaaS offering;  Dave Ramel covers the story.   Key new bits include automated ingestion, preparation, repair, enrichment and governance, all built in Spark; and a DBaaS offering with Hadoop, Spark and NoSQL data services.

(4) IBM Adds Spark Support to Analytics Server

Full story here.  Great news for those who want to use the high-end version of the second most popular data mining workbench with the third and fourth most popular Hadoop distributions.

(5) Ned Explains Zeppelin

Ned’s Blog provides a nice Zeppelin walk-through, noting the UI’s rich list of language interpreters, which currently includes HiveQL, Spark, Flink, Postgres, HAWQ, Tajo, AngularJS, Cassandra, Ignite, Phoenix, Geode, Kylin and Lens.

(6) IIT and ANL Deliver BSP with ZHT

Researchers from the Illinois Institute of Technology, Argonne Labs and Hortonworks report that they have implemented a graph processing system based on Bulk Synchronous Parallel (BSP) processing on ZHT, a distributed key-value store.  Nicole Hemsoth reports.  The new engine, called Pregelix, when benchmarked against Giraph, GraphLab, GraphX and Hama, outshines them all.

(7) Wrangle 2015 Meets in SFO

Cloudera’s Justin Kestelyn summarizes the event, which hosted data science teams from the likes of Uber, Facebook and Airbnb.  Tony Baer offers the trite perspective that data science is about people.

(8) MapR Offers Free Spark Training

MapR announces availability of its first free Apache Spark course as part of its Hadoop On-Demand Training program.  No word on quality, but it’s hard to beat the price.

(9) Cloudera Pushes HUE for Spark

On the Cloudera Engineering blog, Justin Kestelyn explains how to use HUE’s notebook app with SQL and Spark.

Big Analytics Roundup (October 26, 2015)

Fourteen stories this week, beginning with an announcement from IBM.  This week, IBM celebrates 14 straight quarters of declining revenue at its IBM Insight conference, appropriately enough at the Mandalay Bay in Vegas, where the restaurants are overhyped and overpriced.

Meanwhile, the first Spark Summit Europe meets in Amsterdam, in the far more interesting setting of the Beurs van Berlage.  There will be a live stream on Wednesday and Thursday — details here.  Sadly, I can’t make this one — the first Spark Summit I’ve missed — but am looking forward to the live stream.

(1) IBM Announces Spark on Bluemix

At its IBM Insight beauty show, IBM announces availability of its Apache Spark cloud service.  Actually, IBM announced it back in July, but that was a public beta.   On ZDNet, Andrew Brust gushes, noting that IBM has DB2, Watson, Netezza, Cognos, TM1, SPSS, Informix and Cloudant in its portfolio.  He fails to note that of those products, exactly one — Cloudant — actually interfaces with Spark.

There were rumors that IBM would have an exciting announcement about Spark at this show, but if this is it — yawn.  Looking at IBM’s “Spark in the cloud” offering, I don’t see anything that sets it apart from other available offerings unless you have a Blue fetish.

Update: Rod Reicks of IBM writes to note that IBM’s new release of SPSS Analytics Server runs processes in Spark.  For the uninitiated, Analytics Server is a product you license from IBM that enables SPSS Modeler users to run selected operations in Hadoop.  Previous versions ran through MapReduce only.  Reicks claims that the latest version runs through Spark when available.

I say “claims” because there is no reference to this feature in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting.  So the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

You’d think that with IBM’s armies of people they could at least find someone to write documentation.

(2) Mahout Book FAIL

Packt announces a book on Clustering with Mahout with an entire chapter devoted to Canopy Clustering, which the Mahout team just deprecated.

(3) Concurrent Adds Spark Support

Concurrent announces Release 2.0 of Driven, its oddly-named performance management software, which now includes support for Apache Spark.

(4) Flink Founder Touts Streaming Analytics

At Big Data Spain, Data Artisans co-founder Kostas Tzoumas argues that streaming is the basis for all analytics, which is a bit over the top: as they say, if all you have is a hammer, the world looks like a nail.  Still, his deck is a nice intro to Flink, which has made some progress this year.

(5) AtScale Announces Release 3.0

AtScale, one of the more interesting startups in the BI space, delivers Release 3.0 of its OLAP-on-Hadoop platform.  Rather than introducing a new user interface into the mix, AtScale makes it possible for BI users to work with Hadoop tables without jumping back and forth to programming tools.  The product currently supports Tableau, Excel, Qlik, Spotfire, MicroStrategy and JasperSoft, and runs on CDH, HDP or MapR with Impala, Spark SQL or Hive on Tez.  The new release includes enhanced role-based security, including Kerberos, Username/Password or LDAP.

(6) Neo: Graphs are Eating the World

Graph database leader Neo announces immediate availability of Neo4j 2.3, which includes what it calls “intelligent applications at scale” and Docker support.  Exactly what Neo means by “intelligent applications at scale” is unclear, but if Neo is claiming that you no longer have to dump a graph into Spark to run a PageRank, I’ll believe it when I see it.

(7) New Notebook Sharing for Databricks 

Databricks announces new notebook sharing capabilities for its eponymous product.  On the Databricks blog, Denise Li and Dave Wang explain.

(8) Teradata: Blah, Blah, Blah, IoT, Blah, Blah Blah

At its annual user conference, Teradata announces that it’s heard about IoT.    Teradata also announces that it will make Aster available on Hadoop, which would have been interesting in 2012.  Aster, for the uninitiated, includes a SQL on MapReduce engine, which is rendered obsolete by fast SQL engines like Presto, which Teradata has just embraced.

(9) Flink Forward Redux

As I noted last week, the first Flink Forward conference met in Berlin two weeks ago.  William Benton records his impressions.

Presentations are here.  Some highlights:

  • Dongwon Kim benchmarks Flink against MR, MR on Tez and Spark.  Flink wins.
  • Kostas Tzoumas outlines the Flink development roadmap through Release 1.0.
  • Martin Junghanns explains graph analytics with Flink.
  • Anwar Rizal demonstrates streaming decision trees with Flink.

Henning Kropp offers resources for diving deeply into Flink.

(10) Pyramid Analytics Lands New Funding

Amsterdam-based BI startup Pyramid Analytics announces a $30 million “B” round to help it try to explain why we need more BI software.

(11) Harte Hanks Switches from CDH to MapR

John Leonard explains why Harte Hanks switched from Cloudera to MapR.  Most likely explanation: they were able to cut a cheaper deal with MapR.

(12) Audience Modeling with Spark

Guest posting on the Databricks blog, Eugene Zhulenev explains audience modeling with Spark ML pipelines.

(13) New Functions in Drill

On the MapR blog, Neeraja Rentachintala describes new capabilities in Drill Release 1.2, including SQL window functions.

(14) Integrating Spark and Redshift

“Redshift is where data goes to die.”  — Rob Ferguson, Spark Summit East

On the Databricks blog, Sameer Wadkar of Axiomine explains how to use the spark-redshift package, first introduced in March of this year and now in version 0.5.2.  So you can yank your data out of Redshift and do something with it. (h/t Hadoop Weekly)
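A minimal read with the spark-redshift data source looks something like the sketch below. The format name and options (url, dbtable, tempdir) follow the package’s documented pattern; the cluster endpoint, credentials, table, column and S3 staging path are placeholders, and the package plus S3 credentials must be configured on the cluster.

```python
# Pull a Redshift table into a Spark DataFrame via the spark-redshift package,
# which stages data through S3 using UNLOAD. All identifiers below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-extract").getOrCreate()

df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://example.us-east-1.redshift.amazonaws.com:5439/dev?user=USER&password=PASS")
      .option("dbtable", "public.events")
      .option("tempdir", "s3a://my-bucket/tmp/")   # staging area for UNLOAD output
      .load())

# ...and do something with it.
df.groupBy("event_type").count().show()
```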

Big Analytics Roundup (September 21, 2015)

Top story of the week: release of AtScale’s Hadoop Maturity Survey, which triggered a flurry of analysis.  Meanwhile, the Economist ventures into the world of open source software and venture capital, embarrassing itself in the process; and IBM announces plans to use Spark in its search for extraterrestrial intelligence, a project that would be more useful if pointed toward IBM headquarters.

AtScale Releases Hadoop Adoption Survey

OLAP-on-Hadoop vendor AtScale publishes results of a survey of 2,200 respondents who are either actively working with Hadoop today or planning to do so in the near future.  AtScale partnered with Cloudera, Hortonworks, MapR and Tableau to recruit respondents for the survey.

A copy of the survey report is here; the survey instrument is here.  AtScale will deliver a webinar summarizing results from the survey; you can register here.

There are multiple stories about this survey in the media: here, here, here, here, here, here, here, here, and here.  Some highlights:

  • Andrew Oliver compares this survey to Gartner’s Hadoop assessment back in May and concludes that Gartner blew it.  While I agree that Gartner’s outlook on Hadoop is too conservative (and said so at the time), the two surveys are apples and oranges: while AtScale surveyed people who are either already using Hadoop or plan to do so, Gartner surveyed a panel of CIOs.  Hence, it is not surprising that AtScale’s respondents are more positive about prospects for Hadoop.
  • Matt Asay notes that “Cost saving” is the third most frequently cited reason for adopting Hadoop, after “Scale-out needs” and “New applications.”  This is somewhat surprising, given Hadoop’s reputation as a cheap datastore.  Cost is still a factor driving Hadoop adoption, it’s just not the primary factor.

Here are a few insights from this survey not mentioned by other analysts.  First, look at the difference in BI tool usage between those currently using Hadoop and those planning to use Hadoop.  Compared to current users, planners are significantly more likely to say they want to use Excel and less likely to say they want to use Tableau or SAS.  (Current and planned use of SAP Business Objects and IBM Cognos are about the same.)

[Chart: BI tool usage, current vs. planned Hadoop users]

Also interesting to note differences in Hadoop maturity among the BI users.  SAS users are more likely than others to self-identify as “Low Maturity”:

[Chart: self-reported Hadoop maturity by BI tool]

Finally, a significant minority of current Hadoop users cite Management, Security, Performance, Governance and Accessibility as challenges using Hadoop.  However, most who plan to use Hadoop do not anticipate these challenges — which suggests these respondents are in for a rude awakening.

[Chart: Hadoop challenges cited by current vs. planned users]

SQL on Hadoop

For those who like things distilled to sound bites, eWeek offers a point of view on when to select Apache Spark, Hadoop or Hive.   Brevity is the soul of wit, but sometimes it’s just brevity.

Amazon Web Services

Redshift is an OEM version of Actian’s ParAccel columnar database with analytic capabilities removed, which is why data scientists say that Redshift is where data goes to die.  Amazon Web Services has taken baby steps to ameliorate this, adding Python UDFs.  Christopher Crosbie reports, on the AWS Big Data Blog. (h/t Hadoop Weekly)

Apache Apex/DataTorrent

On the DataTorrent blog, Amol Kekre introduces you to Apache Apex, which was just accepted by Apache as an incubator project.  DataTorrent touts Apex as kind of like Spark, only better, thereby demonstrating the importance of timing in life.  (h/t Hadoop Weekly)

If you think that Apex does nothing, Munagala Ramanath shares the good news that Apex supports the Malhar library.  Honestly, though, it still seems to do nothing.

In an email to David Ramel, DataTorrent CEO Phu Hoang identifies flaws in Spark, points to his Apache Apex project as a solution.  Bad move on his part.

Apache Drill

Chloe Green discusses implications of the European Commission’s digital single market, and suggests that retailers will use Apache Drill to analyze the data that will be produced under this regulatory framework.  There are two problems with this article.  First, Green makes no effort to consider alternatives to Drill.  Second, the article itself accepts the premise that more regulation will produce business growth; in fact, the opposite is more likely (except for those in the compliance industry.)

The Drill team explains how to implement Drill in ten minutes.

Jim Scott summarizes the benefits of Drill for the BI user.

On O’Reilly Radar, Ellen Friedman recaps the history of Drill as an open source project.

Zygimantas Jacikevicius offers an introduction to Drill and explains why it is useful.

Apache Flink

On the DataArtisans blog, Kostas Tzoumas seeks to position Flink against Spark by arguing that batch is a special case of streaming.  Of course, you can argue the opposite just as easily — that streaming is batch with very small batches.

If you care about Off-heap Memory in Apache Flink, Stephan Ewen offers a summary.

At a DC Area Flink Meetup, Capital One’s Slim Baltagi explains unified batch and real-time stream processing with Flink.

Flink sponsor DataArtisans announces partnership with SciSpike, a training and consulting provider.

Apache NiFi

Yves de Montcheuil explains why you should care about Apache NiFi, a project that connects data-generating systems with data processing systems.  Spoiler: it’s all about security and reliability.

Apache Spark

In Fortune, Derrick Harris describes Microsoft’s “Spark-inspired” and “Spark-like” Prajna project, does not explain why MSFT is reinventing the wheel.

Cloudera announces a Spark training curriculum.  For those without prior Hadoop experience, two courses cover data ingestion with Sqoop and Flume, data modeling, data processing with Spark and Spark Streaming with Kafka.  There is also a single shorter course covering the same ground for those with prior Hadoop experience.  Finally, a data science course covers advanced analytics with MLLib.

Document analytics vendor Ephesoft introduces new software built on Spark.

Matt Asay uses the Spark/Fire metaphor once too often.

In a post about DataStax, Curt Monash notes synergies between Spark and Cassandra.

MongoDB offers a white paper which explains, not surprisingly, how to use Spark with Mongo.

On the Basho blog, Korrigan Clark discusses his work using Spark to develop an algorithmic stock trading program.

Here are two items from Cloudera’s Kostas Sakellis on SlideShare.  The first explains why your Spark job fails; the second reviews how to get Spark customers to production.

GraphLab/Dato

Dato, the University of Washington and Coursera announce a machine learning specialization consisting of five courses and a capstone project.  The curriculum is platform neutral, though I suspect that co-creator Carlos Guestrin manages to get in a good word for his project.

H2O/H2O.ai

Two items on slideshare:

  • From a meetup at 6Sense, Mark Landry explains H2O Gradient Boosted Machines for Ad Click Prediction; a minimal GBM sketch follows this list.
  • Avni Wadhwa and Vinod Iyengar demonstrate how to build machine learning applications with Sparkling Water, H2O’s interface to Spark.
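For reference, here is a minimal H2O gradient boosting sketch in Python, in the spirit of Landry’s ad-click example; the file path and column names are placeholders, and it assumes the h2o Python package is installed.

```python
# Fit a gradient boosted model with H2O's Python API. Placeholders throughout.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # starts or connects to a local H2O cluster

clicks = h2o.import_file("clicks.csv")
clicks["click"] = clicks["click"].asfactor()   # binary target

train, valid = clicks.split_frame(ratios=[0.8], seed=42)
features = [c for c in clicks.columns if c != "click"]

gbm = H2OGradientBoostingEstimator(ntrees=200, max_depth=5, learn_rate=0.05)
gbm.train(x=features, y="click", training_frame=train, validation_frame=valid)

print("Validation AUC:", gbm.auc(valid=True))
```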