Big Analytics Roundup (June 15, 2015)

This week: news from the Hadoop and Spark Summits; a big announcement from IBM; an interesting benchmark; and RapidMiner churns its management team.

Key analytics news from Hadoop Summit: new releases of MapR and HDP; Teradata announces support for Presto (more here).

Spark Summit 2015 meets in SFO June 15-17.  If you can’t attend, register here for the live stream.

In SiliconAngle, Paul Gillin contrasts Wikibon’s forecast for growth in the Big Data market with Gartner’s gloomy prognosis.

Skytree Software enters the Pantheon of Pointless Benchmarks.

Speaking of benchmarks, on the DataScience.LA blog, Szilard Pafka publishes results from his benchmarking project.  (h/t Dan Putler)  Bottom line: H2O ran faster than Spark and produced more accurate results.  It’s just one use case, but still…

Apache Drill

On SlideShare, a backgrounder on Drill.

Apache Parquet

Apache announces Parquet is a top-level project.

Apache Spark

Spark announces Release 1.4.  (See post here.)  Key bits:

  • SparkR, which enables the R user to create a DataFrame
  • Enhancements to Spark Core for improved operations and performance
  • Major extensions to the DataFrames API and Spark SQL
  • For MLlib, pipelines go GA, plus new algorithms
  • Streaming: visual information graphs, plus support for Kafka and Kinesis

Interesting to note that there is just one GraphX enhancement, reported under MLlib.

On the Databricks blog, Release Manager Patrick Wendell summarizes the new features.

Media coverage here, here and here.

Some additional Spark items (h/t Hadoop Weekly)

  • This article details Spark-Kafka integration
  • On SlideShare, Gwen Shapira summarizes streaming apps with Spark, Flink and Summingbird
  • Also on SlideShare, Helena Edelson of DataStax details an implementation of Lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala
  • On the Databricks blog, Bing Xiao describes Huawei’s embrace of Spark

ClearStory Data

Is IBM signaling an acquisition?


On the H2O blog, Amy Wang writes a nice intro to scaling R with H2O.

Speaking of H2O, did you know that Release 3.0 is out?  Lots of new features, including the new Flow notebook interface for machine learning.


IBM finally puts its marketing muscle behind Spark, with a raft of announcements.

  • IBM says it will build Spark into the core of its analytics offerings, including Watson Health Cloud.
  • IBM will open source SystemML, an analytics suite developed by IBM Research and currently bundled with BigInsights.  SystemML currently integrates through MapReduce; IBM will partner with Databricks to port the package to Spark.

IBM also announces:

  • Commitment of 3,500 developers to work on Spark-related projects
  • Opening a Spark Technology Center in San Francisco
  • Plans to train “more than a million” data scientists and data engineers on Spark



Lucidworks, commercial sponsor of Lucene and Solr, announces release of Fusion 1.4, which includes Spark integration.



At Data Science Central, a list of the top ten Python projects for machine learning; from a larger list here.  Spoiler alert: it’s scikit-learn and nine others.


Analytic startup RapidMiner churns its management team for the third time in fifteen months.  Not sure why the Board thinks bringing in a crew from TIBCO is a smart move.  Gartner ranks TIBCO’s analytic solution dead last in “Ability to Execute” and third from last in “Completeness of Vision.”

