Big Analytics Roundup (June 15, 2015)

This week: news from the Hadoop and Spark Summits, a big announcement from IBM, an interesting benchmark, and RapidMiner churns its management team.
Key analytics news from Hadoop Summit: new releases of MapR and HDP; Teradata announces support for Presto (more here).
Spark Summit 2015 meets in SFO June 15-17. If you can’t attend, register here for the live stream.
In SiliconAngle, Paul Gillin contrasts Wikibon’s forecast for growth in the Big Data market with Gartner’s gloomy prognosis.
Skytree Software enters the Pantheon of Pointless Benchmarks.
Speaking of benchmarks, on the DataScience.LA blog, Szilard Pafka publishes results from his benchmarking project (h/t Dan Putler). Bottom line: H2O ran faster than Spark and produced more accurate results. It’s just one use case, but still…
Apache Drill
On Slideshare, a backgrounder on Drill.
Apache Parquet
Apache announces Parquet is a top-level project.
Apache Spark
Spark announces Release 1.4. (See post here.) Key bits:
- SparkR, an R frontend that lets R users create and work with Spark DataFrames
- Enhancements to Spark Core for improved operations and performance
- Major extensions to the DataFrame API and Spark SQL
- For MLlib, pipelines go GA, plus new algorithms
- Streaming: visual information graphs, plus support for Kafka and Kinesis
Interesting to note that there is just one GraphX enhancement, reported under MLlib.
On the Databricks blog, Release Manager Patrick Wendell summarizes the new features.
Media coverage here, here and here.
Some additional Spark items (h/t Hadoop Weekly):
- This article details Spark-Kafka integration
- On Slideshare, Gwen Shapira summarizes streaming apps with Spark, Flink and Summingbird
- Also on Slideshare, Helena Edelson of DataStax details an implementation of Lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala
- On the Databricks blog, Bing Xiao describes Huawei’s embrace of Spark
ClearStory Data
Is IBM signaling an acquisition?
H2O
On the H2O blog, Amy Wang writes a nice intro to scaling R with H2O.
Speaking of H2O, did you know that Release 3.0 is out? Lots of new features, including the new Flow notebook interface for machine learning.
IBM
IBM finally puts its marketing muscle behind Spark, with a raft of announcements.
- IBM says it will build Spark into the core of its analytics offerings, including Watson Health Cloud.
- IBM will open source SystemML, an analytics suite developed by IBM Research and currently bundled with BigInsights. SystemML currently integrates through MapReduce; IBM will partner with Databricks to port the package to Spark.
IBM also announces:
- Commitment of 3,500 developers to work on Spark-related projects
- Opening a Spark Technology Center in San Francisco
- Plans to train “more than a million” data scientists and data engineers on Spark
Lucidworks
Lucidworks, commercial sponsor of Lucene and Solr, announces release of Fusion 1.4, which includes Spark integration.
Python
At Data Science Central, a list of the top ten Python projects for machine learning, drawn from a larger list here. Spoiler alert: it’s scikit-learn and nine others.
RapidMiner
Analytic startup RapidMiner churns its management team for the third time in fifteen months. Not sure why the Board thinks bringing in a crew from TIBCO is a smart move. Gartner ranks TIBCO’s analytic solution dead last in “Ability to Execute” and third from last in “Completeness of Vision.”