Big Analytics Roundup (April 4, 2016)
Strata + Hadoop World sparks a number of commercial announcements: AtScale has a new release, Microsoft previews R Server on HDInsight, and IBM puts Spark on a mainframe, FWIW. We also have a nice harvest of explainers and perspectives.
Slides from Strata available here.
The folks at Domino Data ask: Is XGBoost 10X faster than H2O? We’ll never know the answer, since they took down the post. I’m guessing the answer is “no.”
Databricks offers a collection of popular blog posts on Apache Spark as an eBook.
On the Google Cloud Big Data Blog, Eric Anderson and Marian Dvorsky compare autoscaling in Dataflow/Beam to Spark and Hadoop. (h/t William Vambenepe)
Miles Yucht and Reynold Xin explain DeepSpark, a convolutional neural network that automates software development processes, such as writing test cases, fixing bugs and so forth.
Databricks’ Jules Damji explains how to process JSON data with Spark Datasets and DataFrames.
On the Airbnb engineering blog, Ricardo Bion explains how to scale data science with R.
Eduardo Ariño De La Rubia explains how The Climate Corporation created a high-throughput data science machine.
On the Insight Data Engineering blog, Daniel Blazevski explains Flink quadtrees.
On the Dataiku blog, someone named Margot explains automated model deployment with Data Science Studio.
On the DataTorrent blog, David Yan explains latency calculations in Apache Apex.
Christopher Crosbie explains SparkR on EMR, on the AWS Big Data blog.
Jack Vaughan notes the prominence of streaming analytics at Strata, quotes some old guy who thinks streaming is a thing.
On the Cloudera Vision Blog, Dan Sturman describes Cloudera’s response to what he characterizes as a software quality challenge.
Cloud vendor Altiscale’s Raymie Stata asks which is best for Spark and Hadoop: cloud or on-premises. Spoiler: he thinks you should choose cloud.
On LinkedIn, consultant Rick van der Lans touts Apache Drill.
Alex Woodie recaps Doug Cutting’s keynote at Strata + Hadoop World.
On the tech blog for Berlin-based online retailer Zalando, Javier Lopez and Mihail Vieru recap a recently completed Flink versus Spark bakeoff. They like Flink’s low latency which, as a fashion retailer, they totally think they need. The bottom line, though, seems to be that DataArtisans is just a few stops away on the U-Bahn, so they chose Flink.
Brandon Butler summarizes the Microsoft and Google challenges to Amazon in the cloud.
InfoWorld’s Martin Heller reviews Databricks’ Spark service, likes it.
In TechCrunch, Josh Klahr lists seven things to watch for at Strata + Hadoop World, which is still worth reading even though the show came and went.
Open Source Announcements
ASF announces Apache NiFi 0.6.0, with Kerberos authentication for its REST API and support for Amazon Kinesis, AWS Lambda, Splunk, and Apache Cassandra. (h/t Hadoop Weekly)
Commercial Announcements
OLAP-on-Hadoop vendor AtScale announces release 4.0. Key new bits: fine-grained security that links every query to an end user, and an intelligent query optimizer that pushes queries down as either SQL or MDX, depending on the end-user tool. AtScale has also expanded its platform integration, and now supports Business Objects, Cognos, Excel, Jaspersoft, Qlik, MicroStrategy, Power BI, Spotfire, and Tableau on CDH, HDP, HDInsight and MapR with Hive/Tez, Impala and Spark SQL, plus an impressive list of data storage formats. Mike Wheatley reports.
Data integration startup Tamr announces “compatibility” with Spark. The press release does not specify whether that means connectivity, push-down integration or something else. Tamr is not certified by Databricks, and has not published anything on Spark Packages.
Pouring new wine into old bottles, IBM delivers Spark on a mainframe, as promised last July. IBM touts this as a way to perform analysis of your data “in place”, which is great if all of your data is stuck on a mainframe.
IBM partners with Lightbend, the company formerly known as Typesafe, to deliver Scala training through the Big Data University.
Altiscale announces partnership with Tableau, will add visualization to its managed service for Big Data.
Microsoft announces a preview of R Server for HDInsight and an update to Apache Spark for Azure HDInsight. R Server for HDInsight is a rebranded version of ScaleR from Revolution Analytics, which Microsoft acquired last year. R Server is a distributed machine learning platform with an R API and push-down integration to MapReduce and Spark.
Flink promoter DataArtisans announces a 5.5 million Euro Series A financing round led by Intel Capital.
Dataiku announces a new release of Data Science Studio. The press release touts some new features, but I’ll refrain from commenting until the company posts release notes.