Big Analytics Roundup (April 6, 2015)

Late posting today due to holiday travel.

In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.

The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barrons, a must-read.

Analytic Software

Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot.  I’ll write a more detailed summary later this week.  Quick takes:  Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.

Apache Drill

Apache Drill announces Release 0.8.

Apache Spark

Analysis

In opensource.com, Jen Wike Hugar interviews key Spark contributor Reynold Xin.

Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud.  (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)

Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit.  Key takeaways: no, Matei is not a musician, and yes, he likes Nutella. 

Spark has clearly reached a point of inflection when skeptical analysis emerges.  Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and the challenge of making those applications work well in MapReduce.  In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.

  • Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting.  Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East.  Who knew that Hadoop devotees are so sensitive?
  • In DataMill, Nicole Leskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
  • In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11.  His point will be lost on most readers.
  • Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.”   Note to Andrew: you can download the software here.

Spark Core

Matei Zaharia celebrates Spark’s fifth birthday with a brief history.

On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.

Spark Streaming

On the Databricks blog. Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kakfa API available in Spark 1.3

Databricks

Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud.  Case study available here.

Hadoop Ecosystem

In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity.  Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions).  Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS.  Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?

IBM

IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0.  BigInsights includes the usual Hadoop bits, plus:

  • BigSQL, a federation engine for SQL across relational databases and Hadoop
  • Big Sheets, a Datameer-like spreadsheet-on-Hadoop tool
  • SystemML, a home-grown machine learning library that runs in MapReduce
  • Text analytics capability
  • Big R, an interface that can push embarrassingly parallel R processing into Hadoop

Streaming and Real-Time Processing

On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s