Big Analytics Roundup (August 15, 2016)

In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, who can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.

DA Cover

Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et. al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to OpenHUB, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what the Apache Beam has accomplished in the first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gillad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches.(h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces Release 0.10.0.1, with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (August 8, 2016)

So, Apple acquires Turi for $200 million. Hopefully, Apple did not pay for brand equity.

Bridget Botelho argues that businesses must either disrupt or be disrupted, and outlines the role of machine learning. Someone should write a book about that.

Conference Announcements

— Flink Forward announces the schedule for its second annual event, to be held September 12-14 in Berlin.

— Databricks announces the agenda for Spark Summit Europe 2016 in Brussels (October 25-27)

Apple Buys GraphLab Dato Turi

Geekwire breaks the story, reporting a purchase price of $200 million. According to TechCrunch, Turi notified customers that its products would no longer be available. Apple adds Turi to the portfolio of machine learning startups it has acquired in the past year, including Emotient, Perceptio, and VocalIQ. More reporting here.

GraphLab started in 2009 as an open source project led by Carlos Guestrin of Carnegie Mellon. (According to OpenHub Guestrin never contributed any code.) In May 2013, Guestrin raised $6.75M to start an eponymous venture to provide commercial support for GraphLab. In October 2014, GraphLab announced the availability of GraphLab Create, a commercially licensed software product. Contributions to the open source project actually ended in 2013; while the code remains on GitHub, the project is dead.

GraphLab changed its name to Dato in January 2015. They should have googled the name; at the time, the top links in a search included Dato Foland, a gay porn star, and Datto Inc, a data backup and recovery company in Connecticut. The latter proved problematic; Datto sued, forcing Dato to rebrand as Turi earlier this month.

Turi’s open source SFrame project remains for those who think introducing another file system into the mix is a smart thing to do.

Teradata: 9 Straight Quarters of Declining Product Revenue

For the second quarter of 2016, declining data warehouse giant Teradata reports an 11% decline in product revenue compared to Q2 2015. (Product revenue includes revenue from licensing software and hardware — boxes with the Teradata brand.) Maintenance revenue increased slightly, which means that customers aren’t pulling the plug on Teradata databases as fast as they did last year. Consulting revenue declined by 1%, which casts doubt on TDC’s stated strategy to become a services powerhouse.

Screen Shot 2016-08-08 at 10.38.16 AM

Count me as skeptical about the merits of that plan. Teradata’s consulting revenue remains highly correlated with product revenue; in other words, if Teradata can’t sell its boxes, it’s not going to sell billable hours for consultants to implement those boxes. Teradata is not a credible competitor in the market for consulting-led solutions; companies like Oracle, IBM and SAS have a twenty-year head start.

Since Teradata performed better than “expectations”, Wall Street rewarded the stock with a bounce above $30.  It’s a dead-cat bounce. As the Wall Street Journal notes, companies routinely game analyst expectations. TDC currently trades at 32 times trailing earnings, well above its peers; moreover, its peers are growing rather than declining.

Explainers

— Kaarthik Sivashanmugam explains how to develop Apache Spark applications in .NET with Mobius.

— On the Cloudera Engineering blog, Devadutta Ghat et. al. explain the latest performance improvements in Impala 2.6.

— Parsey McParseface now has 40 cousins. On the Google Research Blog, Chris Alberti et. al. explain.

— Ujjwal Ratan explains how to use Amazon Machine Learning to predict patient readmission.

Perspectives

— Curt Monash offers his assessment of Spark. Highlights:

  • Spark replaces MapReduce, in particular for data transformation.
  • Spark is becoming the default platform for machine learning.
  • Spark SQL is OK as an adjunct for other analysis.
  • Spark Streaming is doing well, but there are challengers. (See below).
  • Databricks’ managed service for Spark has more than 200 subscribers.

— Serdar Yegulalp deploys the tired old “pure streaming versus microbatch” argument to claim that Apache Apex, Heron, Apache Flink and Onyx are “contenders” versus Spark. Someone should show him this graph:

Screen Shot 2016-07-18 at 8.26.11 AM

— In Datanami, Alex Woodie profiles Flink.

— Vance McCarthy touts MapR’s Spyglass Initiative for analytics on the MapR Converged Data Platform.

— Trevor Jones describes Microsoft Azure’s big data tools.

— Sam Dean champions Sparkling Water, H2O’s interface to Spark.

Commercial Announcements

— Dataiku announces the release of Data Science Studio 3.1, with five machine learning back ends and a visual coding interface (which it labels “code-free”).  Dave Ramel reports.

— John Snow Labs announces it will deliver curated data in Parquet format.

— Lexalytics announces the availability of its Semantria text analytics software on Azure.

Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R ranks fifth; Matlab, fourteenth; Scala, fifteenth; Julia thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release.)

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

More stories herehereherehereherehereherehere, and here.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegualp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from GraphLab Dato Turi.

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. Conference is November 14-16; CFP closes September 9

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

Screen Shot 2016-07-18 at 8.26.11 AM

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.

Performance is cool. But the current spate of streaming engines will not be resolved by performance tests. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies, with quarterly earnings reports. (SAS is privately held). Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

GraphLab Dato Turi announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Big Analytics Roundup (May 23, 2016)

Google announces that it has designed an application-specific integrated circuit (ASIC) expressly for deep neural nets. Tech press goes bananas. The chips, branded Tensor Processing Units (TPUs) require fewer transistors per operation, so Google can fit more operations per second into the chip. In about a year of operation, Google has achieved an order of magnitude improvement in performance per watt for machine learning.

Google’s Felipe Hoffa summarizes Mark Litwintschik’s work benchmarking different platforms with the New York City Taxi and Limo Commission’s public dataset of 1.1 billion trips. So far, Mark has tested PostgreSQL on AWS, ElasticSearch on AWS, Spark on AWS EMR, Redshift, Google BigQuery, Presto on AWS and Presto on Cloud Dataproc. Results make Google look good, but you should read Mark’s original posts.

Meanwhile, IBM fires more people. More here and here.

Open Data Science Conference

The second annual Open Data Science Conference (ODSC) East met in Boston over the weekend. Attendance doubled from last year, to 2,400.

Registration was a snafu, because the conference organizers did not accurately predict walk-in traffic or staffing needs. The jokes write themselves.

Content was excellent. Keynoters included Stefan Karpinski (Julia co-creator), Kirk Borne of Booz Allen Hamilton, Ingo Mierswa, CTO of RapidMiner and Lukas Biewald, CEO of Crowdflower. Track leaders included JJ Allaire and Joe Cheng of RStudio, Usama Fayyad of Barclays and John Thompson of the US Census Bureau. Sponsors included Basis Technology, CartoDB, CrowdFlower, Dataiku, DataRobot, Dato, Exaptive, Facebook, H2O.ai, MassMutual, McKinsey, Metis, Microsoft, RapidMiner, SFL Scientific and Wayfair.

Prompted by a tweet, I stopped at the Dataiku table. The conversation went like this:

  • Me: What does Dataiku do, in 25 words or less?
  • Dataiku: DataRobot.
  • Me: What?
  • Dataiku: We do what DataRobot does.

At this point, it was clear to me that Mr. Dataiku either did not know what DataRobot does, or thought I don’t know what DataRobot does. So I changed the subject.

The next ODSC event is in October, in London.

Explainers

— Michael Armbrust and Tathagata Das explain Structured Streaming in Spark 2.0

— Adrian Colyer goes 5 for 5 for the week:

— Tim Hunter, Hossein Falaki and Joseph Bradley explain HyperLogLog and Quantiles in Spark.

— Microsoft’s Raymond Laghaeian explains how to use Azure ML predictions in Google Spreadsheet.

Perspectives

— Serdar Yegulalp cites PayScale data in noting that if you know Scala, Go, Python and Spark you can expect to make more money.

— Tim Spann weighs the advantages of Java and Scala, and explains DL4J.

— Sam Dean celebrates Drill’s first anniversary.

— Taylor Goetz delivers a brief history of Apache Storm.

Open Source Announcements

— MongoDB releases a new Spark Connector.

— Apache Tajo announces Release 0.11.3, with five bug fixes.

— Apache Mahout announces Release 0.12.1, a maintenance release that resolves an issue with Flink integration.

Commercial Announcements

— RedPoint Global snags a $12 million “C” round.

— TIBCO announces something called Accelerator for Apache Spark, a bundle of tools that connect TIBCO products with open source packages. While TIBCO refers to this component as open source, the software is available only to TIBCO customers, which means it isn’t Free and Open Source.

— MapR applauds itself.

Big Analytics Roundup (April 11, 2016)

Top story of the week is NVIDIA’s new DGX-1 deep learning chip; scroll down for more on that.

We have three roundups from Strata + Hadoop World, Rashomon style:

  • Alex Woodie reports six takeaways: Kafka, Spark, Hadoop, Cloud, machine learning, mainframes.
  • Jessica Davis recalls four things: comedian Paula Poundstone, MapR, public data sets, AI.
  • Nik Rouda recaps five things: Spark, machine learning, data warehousing, user interfaces, cloud.

— H2O.ai CTO and co-founder Cliff Click departs H2O, joins Neurensic, a firm that specializes in compliance analytics. Neurensic has a team of surname-eschewing executives that is surprisingly large considering it has no visible funding.

— Machine learning startup Context Relevant announces the appointment of Joseph Polverari as CEO, replacing board member Chris Kelley, who replaced founder Stephen Purpura in July, 2015, a month after the latter wrote a meditation on failure. Kelley’s major accomplishment: firing people. Appears that Context Relevant isn’t the next unicorn.

— One of the 76 IBM executives with the title of “CTO” touts cognitive computing. My take:

Screen Shot 2016-04-10 at 7.52.54 AM

— Forrester publishes its 2016 “Wave” for Big Data Streaming Analytics. You can go here and buy it for $2,495, get a free copy here, or just look at the picture below.

Screen Shot 2016-04-10 at 3.52.54 PM

— Spiderbook’s Aman Naimat examines data gleaned by trolling through billions of publicly available documents, identifies 2,680 companies that are using Hadoop at any level of maturity, and another 3,500 that are just learning. That’s out of a total universe of 500,000 companies worldwide. I’m thinking that trolling through billions of public documents may understate the actual incidence of Hadoop usage.

— Crowdflower, a data enrichment platform, surveys data scientists and publishes the results. The report does not disclose how data scientists were identified and sampled, which is key to interpreting surveys like this. Respondents report that they spend a lot of time mucking around with data, which won’t surprise anyone, since Crowdflower sells a service that helps data scientists spend less time mucking with data.

NVIDIA Unveils Deep Learning Chip

— NVIDIA announces June availability for the DGX-1, a deep learning supercomputer on a chip. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

— NVIDIA also reveals a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow and Theano.

— Selected media reports:

— MIT Technology Review interviews NVIDIA CEO Jen-Hsun Huang.

Explainers

— Ian Pointer explains Structured Streaming, coming up in Spark 2.0.

— Till Rohrmann introduces Complex Event Processing (CEP) with Flink.

— Maxime Beauchemin explains Caravel, Airbnb’s data exploration platform.

— LinkedIn’s Akshay Rai explains Dr. Elephant, a newly open-sourced self-service performance tuning package for Hadoop and Spark.

— In a guest post on the Cloudera Engineering Blog, engineers from Wargaming.net explain how they built their real-time recommendation engine with Spark, Kafka, HBase and Drools.

— Katrin Leinweber et. al. explain how to analyze an assay of bacteria-induced biofilm formation the freshwater diatom Achnanthidium minutissimum with KNIME. In case you’re wondering, Achnanthidium minutissimum is a kind of algae.

Perspectives

— On LinkedIn, George Hill of The Cyclist nicely critiques the 2011 McKinsey Big Data report, offering a point by point assessment.

— Mauricio Prinzlau of Cloudwards.net opines, without data, that the five languages paving the future of machine learning are MATLAB/Octave, R, Python, “Java-family/C-family” and Extreme Learning Machines (ELM). What was that last one again? Personally, I’ve never seen anyone lump Java and C into a single category, but whatever.

— In InfoWorld, “internationally recognized industry expert and thought leader” David Linthicum ventures into the machine learning discussion by arguing that it’s mostly BS.

— John Dunn demonstrates his ignorance of fraud by asking if machine learning can help banks detect it. As if they haven’t been doing that for years. Also, the “hard decline” he describes at the beginning of the article is rare; most false positives produce “soft declines,”, where the merchant is asked to request identification or speak with the call center.

— In IBT, Ian Allison wonders if financial analysts will lose their jobs to intelligent trading machines. If he watched Billions, he would know that financial analysts spend their time procuring inside information.

— Timo Elliott argues that BI is dead. I have to wonder if it was ever alive.

— Confluent CTO Neha Narkhede opines on stream processing. She’s in favor of it.

— Brandon Butler interviews AWS’ Matt Wood, who chats about competing with Google and Microsoft.

— On Forbes, Robert Hof interviews Cloudera CEO Tom Reilly.

Open Source Announcements

— Qubole releases SQL optimizer Quark to open source.

— Flink releases version 1.0.1, a maintenance release.

— Apache Lens, a “unified analytics interface,” releases version 2.5.0 to beta.

— Airbnb open sources Caravel, a data exploration package.

— Apache Tajo announces Release 0.11.2, which should please its user.

— LinkedIn releases Dr. Elephant to open source.

Commercial Announcements

— Databricks announces the agenda for Spark Summit 2016 in SFO.

— Cloudera announces Cloudera Enterprise 5.7. New analytic bits include Hive-on-Spark GA, support for the HBase-Spark module, support for Spark 1.6 and support for Impala 2.5.

— MapR announces availability of Apache Drill 1.6 as the unified SQL layer for the MapR Converged Data Platform.

Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week beating the drum for its partnership with Hewlett-Packard Enterprise.  The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow.  We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.

Explainers

In the morning paper, Adrian Colyer explains trajectory data mining,

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates all on state of Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with DataTorrent Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender with the Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.

Perspectives

We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from CapitalOne, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and give it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL, offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for H2O.ai.

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et. al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.

Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

With a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but it will be forced to support Spark due to its partnerships with the Hadoop distributors.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; while the benefits of “pure” streaming versus micro-batching are largely theoretical, it’s a serious difference that shows up in benchmarks like this.

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows how to spot a trend when it sees it.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad-hoc; the most important questions are answered only once.  This makes workloads for advanced analytics inherently volatile.  They are also time-sensitive and may require massive computing resources.

This combination  — immediate need for large-scale computing resources for a finite period — is inherently best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is in on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, software will be available to enterprises that delivers expert-level predictive models that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of this:

100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.   Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting  revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software itself is nothing special; there are plenty of open source alternatives, like Apache Greenplum.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like shuffling deck chairs.  The stock is worth about a third of its value in 2012 because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.   There are good people at Teradata; I would advise them all to polish their resumes.

Big Analytics Roundup (November 9, 2015)

My roundup of the Spark Summit Europe is here.

Two important events this week:

  • H2O World starts today and runs through Wednesday at the Computer History Museum in Mountain View CA.   Yotam Levy summarizes here and here.
  • Open Data Science Conference meets November 14-15 at the Marriott Waterfront in SFO

Five backgrounders and explainers:

  • At HUG London, Apache’s Ufuk Celebi delivers a nice intro to Flink.
  • On the Databricks blog, Yesware’s Justin Mills explains how his team migrates Spark applications from concept through prototype through production.
  • On Slideshare, Alpine’s Holden Karau delivers an overview of Spark with Python.
  • Chloe Green wakes from a three year slumber and discovers Spark.
  • On the Cloudera Engineering blog, Madhu Ganta explains how to build a CEP app with Spark and Drools.

Third quarter financials drive the news:

(1) MapR: We Grew 160% in Q3

MapR posts its biggest quarter ever.

(2) HDP: We Grew 168% in Q3

HDP loses $1.33 on every dollar sold, tries to make it up on volume.  Stock craters.

(3) Teradata: We Got A Box of Steak Knives in Q3

Teradata reports more disappointing sales as customers continue to defer investments in big box solutions for data warehousing.  This is getting to be a habit with Teradata; the company missed revenue projections for 2014 as well as the first and second quarters of this year.  Any company can run into headwinds, but a management team that consistently misses targets clearly does not understand its own business and needs to go.

Full report here.

(4) “B” Round for H2O.ai

Machine learning software developer H2O.ai announces a $20 million Series B round led by Paxion Capital Partners.  H2O.ai leads development of H2O, an open source project for distributed in-memory machine learning.  The company reports 25 new support customers this year.

(5) Fuzzy Logix Lands Funds

In-database analytics vendor Fuzzy Logix announces a $5 million “A” round from New Science Ventures.  Fuzzy offers a library of analytic functions that run in a number of high-performance databases and in HiveQL.

(6) New Optimization Package for Spark

On the Databricks blog, Aaron Staple announces availability of Spark TFOCS, an optimization package based on the eponymous Matlab package.  (TFOCS=Templates for First Order Conic Solvers.)

(7) WSO2 Delivers IoT App on Spark 

IoT middleware vendor WSO2 announces Release 3.0 of its open source Data Analytics Server (DAS) platform.   DAS collects data streams and applies batch, real-tim or interactive analytics; predictive analytics are in the roadmap.  For streaming data sources, DAS supports java agents, javascript clients and 100+ connectors.  The software runs on Spark and Lucene.

(8) Hortonworks: We Aren’t Irrelevant

On the Hortonworks blog, Vinay Shukla and Ram Sriharsha tout Hortonworks’ contributions to Spark, including ORC support, an Ambari stack definition for Spark, tighter integration between Hive and Spark, minor enhancements to ML and user-facing documentation.  Looking at the roadmap, they discuss Magellan for geospatial and Zeppelin notebooks. (h/t Hadoop Weekly).

(9) Apache Drill Delivers Fast SQL-on-Laptop

On the MapR blog, Mitsutoshi Kiuchi offers a case study in how to run a silly benchmark.

Comparing the functionality of Drill and Spark SQL, Kiuchi argues that Drill “supports” NoSQL databases but Spark does not, relegating Spark’s packages to a footnote.  “Support” is a loaded word with open source software; technically, nothing is supported unless you pay for it, in which case the scope of support is negotiated as part of the SLA.  It’s also worth noting that MongoDB developed Spark’s interface to MongoDB (for example), which provides a certain amount of confidence.

Kiuchi does not consider other functional areas, such as security, YARN support, query fault tolerance, the user interface, metastore management and view support, where Drill comes up short.

In a previously published performance test of five SQL engines, Spark successfully ran nine out of eleven queries, while Drill ran eight out of ten.  On the eight queries both engines ran, Drill was slightly faster on six.  For this benchmark, Kiuchi runs three queries on his laptop with a tiny dataset.

As a general rule, one should ignore SQL-on-Hadoop benchmarks unless they run industry standard queries (e.g. TPC) with large datasets in a distributed configuration.

Big Analytics Roundup (November 2, 2015)

Spark Summit Europe, Oracle Open World and IBM Insights all met last week, as did Cloudera’s Wrangle conference for data scientists.

But in the really important news, KC beats the Mets to take the Series.

Top news from the Spark Summit is Typesafe’s announcement of Spark support, plus some insight into what’s coming in Spark 1.6.  I will publish a separate roundup for the Spark Summit next week  when presentations are available.

Nine stories this week:

(1) Typesafe Announces Spark Support

Typesafe, the commercial venture behind Scala and Akka, announces commercial support for Apache Spark.   Planned service offerings include an offer of one day business hour response to questions for projects in development.  For production, SLAs range from 4 hour turnaround during business hours up to 24/7 with one hour turnaround.

(2) More Funding for Alteryx

The New York Times reports that Alteryx has landed an $85 million “C” round, led by Iconiq Capital.  That makes a total of $163 million in four rounds for the company.

(3) Oracle Adds Spark to Cloud

At Oracle Open World, Oracle announces Oracle Cloud Platform for Big Data, a PaaS offering;  Dave Ramel covers the story.   Key new bits include automated ingestion, preparation, repair, enrichment and governance, all built in Spark; and a DBaaS offering with Hadoop, Spark and NoSQL data services.

(4) IBM Adds Spark Support to Analytics Server

Full story here.  Great news for those who want to use the high-end version of the second most popular data mining workbench with the third and fourth most popular Hadoop distributions.

(5) Ned Explains Zeppelin

Ned’s Blog provides a nice Zeppelin walk-through, noting the UI’s rich list of language interpreters, which currently includesL HiveQL, Spark, Flink, Postgres, HAWQ, Tajo, AngularJS, Cassandra, Ignite, Phoenix, Geode, Kylin and Lens.

(6) IIT and ANL Deliver BSP with ZHT

Researchers from the Illinois Institute of Technology, Argonne Labs and Hortonworks report that they have implemented a graph processing system based on Bulk Synchronous Processing on ZHT, a distributed key-value store.   Nicole Hemsoth reports.   The new engine, called Pregelix, when benchmarked against Giraph, GraphLab, GraphX and Hama, outshines them all.

(7) Wrangle 2015 Meets in SFO

Cloudera’s Justin Kestelyn summarizes the event, which hosted data science teams from the likes of Uber, Facebook and Airbnb.  Tony Baer offers the trite perspective that data science is about people.

(8) MapR Offers Free Spark Training

MapR announces availability of its first free Apache Spark course as part of its Hadoop On-Demand Training program.  No word on quality, but it’s hard to beat the price.

(9) Cloudera Pushes HUE for Spark

On the Cloudera Engineering blog, Justin Kestelyn explains how to use HUE’s notebook app with SQL and Spark.