Quite a few open source announcements this week. One of the most interesting is Apache Bahir, which includes a number of bits spun out from Apache Spark. It’s another indicator of the size and strength of Spark, in case anyone needs a reminder.
In other news, Altiscale and H2O.ai concurrently develop time travel: both vendors claim to support Spark 2.0, which isn’t generally available yet. The currently available Spark 2.0 preview release is not a stable release and the Spark team does not guarantee API stability. So at minimum anyone claiming to support Spark 2.0 will have to retest with the GA release.
Andrew Brust summarizes news from Hadoop Summit.
Microsoft’s Bill Jacobs explains Apache Spark integration through Microsoft R Server. (Short version: Microsoft R previously pushed processing down to MapReduce, and now pushes down to Spark.) In a test, Microsoft found that shifting from MapReduce to Spark produced a 6X speedup, which is similar to what IBM achieved when it did the same thing with SPSS Analytics Server. Bill’s claim of 125X speedup is suspicious — he compares the performance of Microsoft R’s ScaleR distributed GLM algorithm running in a five-node Spark cluster with running GLM with an unspecified CRAN package on a single machine.
Owen O’Malley benchmarks file formats, concludes nothing. But it was fun! Pro tip: if you’re going to spend time running benchmarks, use a standard TPC protocol.
— William Lyon explains graph analysis with Neo4j and Game of Thrones, concludes that Lancel Lannister isn’t very important to the narrative.
— On the AWS Big Data Blog, Sai Sriparasa explains how to transfer data from EMR to RDS with Sqoop.
— In part one of a series, LinkedIn’s Kartik Paramasivam disses Lambda, explains how to solve hard problems in stream processing with Apache Samza.
— Hortonworks’ Vinay Shukla and others explain the roadmap for Apache Zeppelin.
— Rajat Jaiswal explains Azure Machine Learning in the first of a multi-part series. It’s on DZone, which means the content was ripped from some other source, but I can’t find the original.
— A blogger named junkcharts explains the importance of simplicity in visualization.
— Werther Krause offers some pretty good recommendations for building a data science team.
Open Source Announcements
— The Apache Software Foundation announces Apache Bahir as a top-level project. Bahir aims to curate extensions for distributed analytic platforms. Initial bits include toolkits for streaming akka, streaming mqtt, streaming twitter and streaming zeromq. The team includes 16 committers from Databricks, 4 from UC Berkeley, 3 from Cloudera and 13 others. Sam Dean reports.
— H2O.ai claims the following in its announcement:
- Support for Apache Spark 2.0 and “backward compatibility with all previous versions.”
- The ability to run Apache Spark and Scala through H2O’s web-based Flow UI.
- Support for the Apache Zeppelin notebook.
- H2O feature improvements and visualizations for MLlib algorithms, including the ability to score feature importance.
- The ability to build Ensembles using H2O plus MLlib algorithms.
- The power to export MLlib models as POJOs (Plain Old Java Objects).
— Alluxio (née Tachyon) announces Release 1.1. (Alluxio is an open source project for in-memory virtual distributed storage.) Key bits include performance improvements (master metadata scalability, worker scalability and better support for random I/O); improved access control features; usability improvements; and integration with Google Compute Engine.
— Apache Drill announces Release 1.7.0, with bug fixes and minor improvements.
— Qubole announces Quark, an open source project that optimizes SQL across storage platforms.
— MongoDB releases its own connector for Spark, supplementing the existing package developed by Stratio.
— Altiscale claims support for Spark 2.0.
— AtScale announces a reseller agreement with Hortonworks.
— GridGain Systems announces Professional Edition 1.6, the commercially licensed enhanced version of Apache Ignite. Release 1.6 includes native support for Apache Cassandra.
— Hortonworks announces Microsoft Azure HDInsight as its premier cloud solution. They should have noted that Azure is Hortonworks’ only cloud solution.
— Zoomdata announces certification on the MapR Converged Data Platform.