Big Analytics Roundup (September 21, 2015)

Top story of the week: release of AtScale’s Hadoop Maturity Survey, which triggered a flurry of analysis.  Meanwhile, the Economist ventures into the world of open source software and venture capital, embarrassing itself in the process; and IBM announces plans to use Spark in its search for extraterrestrial intelligence, a project that would be more useful if pointed toward IBM headquarters.

AtScale Releases Hadoop Adoption Survey

OLAP-on-Hadoop vendor AtScale publishes results of a survey of 2,200 respondents who are either actively working with Hadoop today or planning to do so in the near future.  AtScale partnered with Cloudera, Hortonworks, MapR and Tableau to recruit respondents for the survey.

A copy of the survey report is here; the survey instrument is here.  AtScale will deliver a webinar summarizing results from the survey; you can register here.

There are multiple stories about this survey in the media.  Some highlights:

  • Andrew Oliver compares this survey to Gartner’s Hadoop assessment back in May and concludes that Gartner blew it.  While I agree that Gartner’s outlook on Hadoop is too conservative (and said so at the time), the two surveys are apples and oranges: AtScale surveyed people who are either already using Hadoop or plan to do so, while Gartner surveyed a panel of CIOs.  Hence, it is not surprising that AtScale’s respondents are more positive about prospects for Hadoop.
  • Matt Asay notes that “Cost saving” is the third most frequently cited reason for adopting Hadoop, after “Scale-out needs” and “New applications.”  This is somewhat surprising, given Hadoop’s reputation as a cheap datastore.  Cost is still a factor driving Hadoop adoption; it’s just not the primary factor.

Here are a few insights from this survey not mentioned by other analysts.  First, look at the difference in BI tool usage between those currently using Hadoop and those planning to use Hadoop.  Compared to current users, planners are significantly more likely to say they want to use Excel and less likely to say they want to use Tableau or SAS.  (Current and planned use of SAP Business Objects and IBM Cognos are about the same.)

[Chart: BI tool usage, current vs. planned Hadoop users]

It is also interesting to note the differences in Hadoop maturity among the BI users.  SAS users are more likely than others to self-identify as “Low Maturity”:

[Chart: Hadoop maturity by BI tool]

Finally, a significant minority of current Hadoop users cite Management, Security, Performance, Governance and Accessibility as challenges using Hadoop.  However, most who plan to use Hadoop do not anticipate these challenges, which suggests these respondents are in for a rude awakening.

[Chart: Hadoop challenges cited by current vs. planned users]

SQL on Hadoop

For those who like things distilled to sound bites, eWeek offers a point of view on when to select Apache Spark, Hadoop or Hive.   Brevity is the soul of wit, but sometimes it’s just brevity.

Amazon Web Services

Redshift is an OEM version of Actian’s ParAccel columnar database with analytic capabilities removed, which is why data scientists say that Redshift is where data goes to die.  Amazon Web Services has taken baby steps to ameliorate this, adding Python UDFs.  Christopher Crosbie reports on the AWS Big Data Blog. (h/t Hadoop Weekly)

Apache Apex/DataTorrent

On the DataTorrent blog, Amol Kekre introduces you to Apache Apex, which was just accepted by Apache as an incubator project.  DataTorrent touts Apex as kind of like Spark, only better, thereby demonstrating the importance of timing in life.  (h/t Hadoop Weekly)

If you think that Apex does nothing, Munagala Ramanath shares the good news that Apex supports the Malhar library.  Honestly, though, it still seems to do nothing.

In an email to David Ramel, DataTorrent CEO Phu Hoang identifies flaws in Spark and points to his Apache Apex project as a solution.  Bad move on his part.

Apache Drill

Chloe Green discusses implications of the European Commission’s digital single market, and suggests that retailers will use Apache Drill to analyze the data that will be produced under this regulatory framework.  There are two problems with this article.  First, Green makes no effort to consider alternatives to Drill.  Second, the article itself accepts the premise that more regulation will produce business growth; in fact, the opposite is more likely (except for those in the compliance industry).

The Drill team explains how to implement Drill in ten minutes.

Jim Scott summarizes the benefits of Drill for the BI user.

On O’Reilly Radar, Ellen Friedman recaps the history of Drill as an open source project.

Zygimantas Jacikevicius offers an introduction to Drill and explains why it is useful.

Apache Flink

On the DataArtisans blog, Kostas Tzoumas seeks to position Flink against Spark by arguing that batch is a special case of streaming.  Of course, you can argue the opposite just as easily — that streaming is batch with very small batches.
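
To make the two framings concrete, here is a minimal Python sketch.  It is not Flink or Spark code, and the function names are hypothetical; it only illustrates the argument by computing the same running total two ways, once per event and once per small batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def stream_sum(events: Iterable[int]) -> Iterator[int]:
    """Streaming view: update the running total one event at a time."""
    total = 0
    for value in events:
        total += value
        yield total


def micro_batch_sum(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Micro-batch view: cut the stream into small batches and process each whole."""
    it = iter(events)
    total = 0
    while True:
        batch: List[int] = list(islice(it, batch_size))
        if not batch:
            break
        total += sum(batch)
        yield total


data = [3, 1, 4, 1, 5, 9]

# Batch as a special case of streaming: a bounded dataset is a finite stream,
# and the batch answer is the last value the stream emits.
batch_as_stream = list(stream_sum(data))[-1]

# Streaming as very small batches: with batch_size=1 the micro-batch results
# match the per-event results exactly.
per_event = list(stream_sum(data))
tiny_batches = list(micro_batch_sum(data, batch_size=1))
```

In this toy case the two views agree on the result; what differs is latency and how state is handled between batches, which is where the real engines part ways.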

If you care about off-heap memory in Apache Flink, Stephan Ewen offers a summary.

At a DC Area Flink Meetup, Capital One’s Slim Baltagi explains unified batch and real-time stream processing with Flink.

Flink sponsor DataArtisans announces partnership with SciSpike, a training and consulting provider.

Apache NiFi

Yves de Montcheuil explains why you should care about Apache NiFi, a project that connects data-generating systems with data processing systems.  Spoiler: it’s all about security and reliability.

Apache Spark

In Fortune, Derrick Harris describes Microsoft’s “Spark-inspired” and “Spark-like” Prajna project, but does not explain why MSFT is reinventing the wheel.

Cloudera announces a Spark training curriculum.  For those without prior Hadoop experience, two courses cover data ingestion with Sqoop and Flume, data modeling, data processing with Spark and Spark Streaming with Kafka.  There is also a single shorter course covering the same ground for those with prior Hadoop experience.  Finally, a data science course covers advanced analytics with MLlib.

Document analytics vendor Ephesoft introduces new software built on Spark.

Matt Asay uses the Spark/Fire metaphor once too often.

In a post about DataStax, Curt Monash notes synergies between Spark and Cassandra.

MongoDB offers a white paper which explains, not surprisingly, how to use Spark with Mongo.

On the Basho blog, Korrigan Clark discusses his work using Spark to develop an algorithmic stock trading program.

Here are two items from Cloudera’s Kostas Sakellis on SlideShare.  The first explains why your Spark job fails; the second reviews how to get Spark customers to production.


Dato

Dato, the University of Washington and Coursera announce a machine learning specialization consisting of five courses and a capstone project.  The curriculum is platform neutral, though I suspect that co-creator Carlos Guestrin manages to get in a good word for his project.


H2O

Two items on SlideShare:

  • From a meetup at 6Sense, Mark Landry explains H2O Gradient Boosted Machines for Ad Click Prediction.
  • Avni Wadhwa and Vinod Iyengar demonstrate how to build machine learning applications with Sparkling Water, H2O’s interface to Spark.
