Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from GraphLab Dato Turi.

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. Conference is November 14-16; CFP closes September 9

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

Screen Shot 2016-07-18 at 8.26.11 AM

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.

Performance is cool. But the current spate of streaming engines will not be resolved by performance tests. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies, with quarterly earnings reports. (SAS is privately held). Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

GraphLab Dato Turi announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Advertisements

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s