Big Analytics Roundup (June 15, 2015)
This week: news from the Hadoop and Spark Summits, a big announcement from IBM, an interesting benchmark, and RapidMiner churns its management team.
In SiliconAngle, Paul Gillin contrasts Wikibon’s forecast for growth in the Big Data market with Gartner’s gloomy prognosis.
Skytree Software enters the Pantheon of Pointless Benchmarks.
Speaking of benchmarks, on the DataScience.LA blog, Szilard Pafka publishes results from his benchmarking project. (h/t Dan Putler) Bottom line: H2O ran faster than Spark and produced more accurate results. It’s just one use case, but still…
On SlideShare, a backgrounder on Drill.
Apache announces Parquet is a top-level project.
New features in the latest Spark release:
- SparkR, which enables the R user to create a DataFrame
- Enhancements to Spark Core for improved operations and performance
- Major extensions to the DataFrames API and Spark SQL
- For MLlib, pipelines go GA, plus new algorithms
- Streaming: visual information graphs, plus support for Kafka and Kinesis
Interestingly, there is just one GraphX enhancement, and it is reported under MLlib.
On the Databricks blog, Release Manager Patrick Wendell summarizes the new features.
Some additional Spark items (h/t Hadoop Weekly):
- This article details Spark-Kafka integration
- On SlideShare, Gwen Shapira summarizes streaming apps with Spark, Flink and Summingbird
- Also on SlideShare, Helena Edelson of DataStax details an implementation of Lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala
- On the Databricks blog, Bing Xiao describes Huawei’s embrace of Spark
Is IBM signaling an acquisition?
On the H2O blog, Amy Wang writes a nice intro to scaling R with H2O.
Speaking of H2O, did you know that Release 3.0 is out? Lots of new features, including the new Flow notebook interface for machine learning.
IBM's big announcement:
- IBM says it will build Spark into the core of its analytics offerings, including Watson Health Cloud.
- IBM will open source SystemML, an analytics suite developed by IBM Research and currently bundled with BigInsights. SystemML currently integrates through MapReduce; IBM will partner with Databricks to port the package to Spark.
IBM also announces:
- Commitment of 3,500 developers to work on Spark-related projects
- Opening a Spark Technology Center in San Francisco
- Plans to train “more than a million” data scientists and data engineers on Spark
Lucidworks, commercial sponsor of Lucene and Solr, announces release of Fusion 1.4, which includes Spark integration.
Analytic startup RapidMiner churns its management team for the third time in fifteen months. Not sure why the Board thinks bringing in a crew from TIBCO is a smart move. Gartner ranks TIBCO’s analytic solution dead last in “Ability to Execute” and third from last in “Completeness of Vision.”