Big Analytics Roundup (July 5, 2016)

Quite a few open source announcements this week. One of the most interesting is Apache Bahir, which includes a number of bits spun out from Apache Spark. It’s another indicator of the size and strength of Spark, in case anyone needs a reminder.

In other news, Altiscale and H2O.ai concurrently develop time travel: both vendors claim to support Spark 2.0, which isn’t generally available yet. The currently available Spark 2.0 preview release is not a stable release and the Spark team does not guarantee API stability. So at minimum anyone claiming to support Spark 2.0 will have to retest with the GA release.

Andrew Brust summarizes news from Hadoop Summit.

Microsoft’s Bill Jacobs explains Apache Spark integration through Microsoft R Server.  (Short version: Microsoft R previously pushed processing down to MapReduce, and now pushes down to Spark.) In a test, Microsoft found that shifting from MapReduce to Spark produced a 6X speedup, which is similar to what IBM achieved when it did the same thing with SPSS Analytics Server. Bill’s claim of 125X speedup is suspicious — he compares the performance of Microsoft R’s ScaleR distributed GLM algorithm running in a five-node Spark cluster with running GLM with an unspecified CRAN package on a single machine.

Owen O’Malley benchmarks file formats, concludes nothing. But it was fun!  Pro tip: if you’re going to spend time running benchmarks, use a standard TPC protocol.

Denny Lee introduces Databricks’ new Guide to getting started with Spark on Databricks.

Top Read/Watch

On YouTube and SlideShare: Slim Baltagi, Director of Enterprise Architecture at Capital One, presents his analysis of major trends in big analytics at Hadoop Summit.

Explainers

— In the second of a three-part series, Databricks’ Bill Chambers explains how to build data science applications on Databricks. Part one is here.

— William Lyon explains graph analysis with Neo4j and Game of Thrones, concludes that Lancel Lannister isn’t very important to the narrative.

graph-of-thrones

— On the AWS Big Data Blog, Sai Sriparasa explains how to transfer data from EMR to RDS with Sqoop.

— In part one of a series, LinkedIn’s Kartik Paramasivam disses Lambda, explains how to solve hard problems in stream processing with Apache Samza.

— Hortonworks’ Vinay Shukla and others explain the roadmap for Apache Zeppelin.

— Rajat Jaiswal explains Azure Machine Learning in the first of a multi-part series. It’s on DZone, which means the content was ripped from some other source, but I can’t find the original.

— A blogger named junkcharts explains the importance of simplicity in visualization.

Perspectives

— Roger Schank, who wrote the book on cognitive computing, parses IBM’s claims for Watson. He isn’t impressed.

— Werther Krause offers some pretty good recommendations for building a data science team.

Open Source Announcements

— The Apache Software Foundation announces Apache Bahir as a top-level project. Bahir aims to curate extensions for distributed analytic platforms. Initial bits include toolkits for streaming akka, streaming mqtt, streaming twitter and streamingmq. The team includes 16 committers from Databricks, 4 from UC Berkeley, 3 from Cloudera and 13 others. Sam dean reports.

— H2O.ai announces Sparkling Water 2.0. Sparkling Water is an H2O API for Spark, and a registered Spark package. Stories here, here, here, and here. Among the claimed enhancements:

  • Support for Apache Spark 2.0 and “backward compatibility with all previous versions.”
  • The ability to run Apache Spark and Scala through H2O’s web-based Flow UI.
  • Support for the Apache Zeppelin notebook.
  • H2O feature improvements and visualizations for MLlib algorithms, including the ability to score feature importance.
  • The ability to build Ensembles using H2O plus MLlib algorithms.
  • The power to export MLlib models as POJOs (Plain Old Java Objects).

— Alluxio (née Tachyon) announces Release 1.1. (Alluxio is an open source project for in-memory virtual distributed storage). Key bits include performance improvements, including master metadata scalability, worker scalability and better support for random I/O; improved access control features; usability improvements; and integration with Google Compute Engine.

— Apache Drill announces Release 1.7.0, with bug fixes and minor improvements.

— Qubole announces Quark, an open source project that optimizes SQL across storage platforms.

— MongoDB releases its own connector for Spark, supplementing the existing package developed by Stratio.

Commercial Announcements

— Altiscale claims support for Spark 2.0.

— AtScale announces a reseller agreement with Hortonworks.

— GridGain Systems announces Professional Edition 1.6, the commercially licensed enhanced version of Apache Ignite. Release 1.6 includes native support for Apache Cassandra.

— Hortonworks announces Microsoft Azure HDInsight as its premier cloud solution. They should have noted that Azure is Hortonworks only cloud solution.

— Zoomdata announces certification on the MapR Converged Data Platform.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s