Late posting today due to holiday travel.
In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.
The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barron’s, a must-read.
Analytic Software
Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot. I’ll write a more detailed summary later this week. Quick takes: Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.
Apache Drill
Apache Drill announces Release 0.8.
Apache Spark
Analysis
On opensource.com, Jen Wike Huger interviews key Spark contributor Reynold Xin.
Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud. (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)
Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit. Key takeaways: no, Matei is not a musician, and yes, he likes Nutella.
The emergence of skeptical analysis shows that Spark has clearly reached an inflection point. Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and of the challenge of making those applications work well in MapReduce. In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.
- Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting. Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East. Who knew that Hadoop devotees are so sensitive?
- In DataMill, Nicole Leskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
- In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11. His point will be lost on most readers.
- Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.” Note to Andrew: you can download the software here.
Spark Core
Matei Zaharia celebrates Spark’s fifth birthday with a brief history.
On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.
Spark Streaming
On the Databricks blog, Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kafka API available in Spark 1.3.
Databricks
Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud. Case study available here.
Hadoop Ecosystem
In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity. Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions). Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS. Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?
IBM
IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0. BigInsights includes the usual Hadoop bits, plus:
- Big SQL, a federation engine for SQL across relational databases and Hadoop
- BigSheets, a Datameer-like spreadsheet-on-Hadoop tool
- SystemML, a home-grown machine learning library that runs in MapReduce
- Text analytics capability
- Big R, an interface that can push embarrassingly parallel R processing into Hadoop
Streaming and Real-Time Processing
On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.