Big Analytics Roundup (March 7, 2016)
Hortonworks wins the internet this week, beating the drum for its partnership with Hewlett Packard Enterprise. The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.
Google releases a distributed version of TensorFlow, and HDP releases a new version of Dataflow. We are reaching peak flow.
IBM demonstrates its core values.
Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises. There are simple steps you can take to reduce or eliminate concerns about data security. Here’s a practical guide to anonymizing your data.
In The Morning Paper, Adrian Colyer explains trajectory data mining.
On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.
Nicholas Perez explains how to log in Spark.
Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.
Slim Baltagi updates everyone on the state of the Flink community.
Martin Junghanns explains scalable graph analytics with Neo4j and Flink.
On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.
DataTorrent’s Thomas Weise explains exactly-once processing with DataTorrent Apache Apex.
Nishant Singh explains how to get started with Apache Drill.
On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.
On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender with Google Compute Engine.
We continue to digest analysis from Spark Summit East:
— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.
— Jack Vaughan interviews attendees from CapitalOne, eBay, DataXu and some other guy who touts open source.
— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.
In other matters:
— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and gives it a good thrashing.
— Paige Roberts probes the true meaning of “real time.”
— MapR’s Jim Scott compares Drill and Spark for SQL and offers his opinion on the strengths of each.
— Sri Ambati describes the road ahead for H2O.ai.
Open Source Announcements
— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo. New bits include integrated security and support for Apache Kafka and Apache Storm.
— On the Databricks blog, Joseph Bradley et al. introduce GraphFrames, a graph processing library that works with the DataFrames API. GraphFrames is a Spark Package.
Commercial Announcements
— Hortonworks announces a partnership with Hewlett Packard Enterprise to enhance Apache Spark. HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark. That’s nice. Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.
— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations. The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank. Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.