Big Analytics Roundup (March 23, 2015)

This week, Spark Summit East produced a deluge of news and analysis on Apache Spark and Databricks.  Also in the news: a couple of ventures landed funding, SAP released software and SAS soft-launched something new for SAS Visual Analytics.

Analytic Startups

Venture Capital Dispatch on WSJ.D reports that Andreeson Horowitz has invested $7.5 million in AMPLab spinout Tachyon Nexus.  Tachyon Nexus supports the eponymous Tachyon project, a memory-centric storage layer that runs underneath Apache Spark or independently.

Social media mining venture Dataminr pulls $130 million in “D” round financing, demonstrating that the real money in analytics is in applications, not algorithms.

Apache Flink

On the Flink project blog, Fabian Hueske posts an excellent article that describes how joins work in Flink.

Apache Spark

ADTMag rehashes the tired debate about whether Spark and Hadoop are “friends” or “foes”.  Sounds like teens whispering in the hallways of Silicon Valley High.  Spark works with HDFS, and it works with other datastores; it all depends on your use case.  If that means a little less buzz for Hadoop purists, get over it.

To that point, Matt Kalan explains how to use Spark with MongoDB on the Databricks blog.

A paper published by a team at Berkeley summarizes results from Spark benchmark testing, draws surprising conclusions.

In other commentary about Spark:

  • TechCrunch reports on the growth of Spark.
  • TechRepublic wonders if anything can dim Spark.
  • InfoWorld lists five reasons to use Spark for Big Data.

In VentureBeat, Sharmila Mulligan relates how ClearStory Data’s big bet on Spark paid off without explaining the nature of the payoff.  ClearStory has a nice product, but it seems a bit too early for a victory lap.

On the Spark blog, Justin Kestelyn describes exactly-once Spark Streaming with Apache Kafka, a new feature in Spark 1.3.

Databricks

Doug Henschen chides Ion Stoica for plugging Databricks Cloud at Spark Summit East, hinting darkly that some Big Data vendors are threatened by Spark and trying to plant FUD about it.  Vendors planting FUD about competitors that threaten them: who knew that people did such things?  It’s not clear what revenue model Henschen thinks Databricks should pursue; as Hortonworks’ numbers show, “contributing to open source” alone is not a viable business model.  If those Big Data vendors are unhappy that Databricks Cloud competes with what they offer, there is nothing to stop them from embracing Spark and standing up their own cloud service.

In other news:

  • On the Databricks blog, the folks from Uncharted Software describe PanTera, cool visualization software that runs in Databricks Cloud.
  • Rob Marvin of SD Times rounds up new product announcements from Spark Summit East.
  • In PCWorld, Joab Jackson touts the benefits of Databricks Cloud.
  • ConsumerElectronicsNet recaps Databricks’ announcement of the Jobs feature for Databricks Cloud, plus other news from Spark Summit East.
  • On ZDNet, Toby Wolpe reviews the new Jobs feature for production workloads in Databricks Cloud.
  • On the Databricks blog, Abi Mehta announces that Tresata’s TEAK application for AML will be implemented on Databricks Cloud.  Media coverage here, here and here.

Geospatial

MemSQL announced geospatial capabilities for its distributed in-memory NewSQL database.

J. Andrew Rogers asks why geospatial databases are hard to build, then answers his own question.

RapidMiner

Butler Analytics publishes a favorable review of RapidMiner.

SAP

SAP released a new on-premises version of Lumira Edge for visualization, adding to the list of software that is not as good as Tableau.  SAP also released Predictive Analytics 2.0, a product that marries the toylike SAP Predictive Analytics with KXEN InfiniteInsight, a product acquired in 2013.  According to SAP, Predictive Analytics 2.0 is a “single, unified analytics product” with two work environments, which sounds like SAP has bundled two different code bases into a marketing bundle with a common datastore.  Going for a “three-fer”, SAP also adds Lumira Edge to the bundle.

SAS

American Banker reports that SAS has “launched” SAS Transaction Monitoring Optimization for AML scenario testing; in this case, “launch”, means marketing collateral is available.  The product is said to run on top of SAS Visual Analytics, which itself runs on top of SAS LASR Server, SAS’ “other” distributed in-memory platform.

Strata + Hadoop World 2014

A sellout crowd of 5,500 met at the Javits Center in New York last week for the 2014 Strata + Hadoop World conference.  There were three major themes:

Big Data in Action.   In his keynote address, Mike Olson of Cloudera noted the shift from talking about “geeky projects like Pig, Sqoop and Oozie” to talking about applications, such as fraud detection, product design and agriculture.   An entire track in the conference featured success stories from companies such as Goldman Sachs, Transamerica, American Express, L.L. Bean, FICO and Kaiser Permanente.

Symbiosis of Analytics and Big Data.  Paul Zikopoulos of IBM observed that “Big Data without analytics is just a bunch of data.”   Zikopoulos drew an analogy to the mining industry, which uses advanced technology to extract trace amounts of valuable material from large quantities of low-grade ore; in Big Data, we use advanced analytics to extract useful insight from large quantities of low-value per byte data.  Conference sessions reflected the critical role analytic technology plays in the Big Data value chain.

Spark has arrived.  The 2013 conference included two sessions about Spark; this year, thirteen sessions featured Spark, including the sold-out full day Spark Camp.  Moreover, vendors such as ClearStory Data and Platfora openly touted Spark integration, in the belief that this capability resonates with buyers.  Other conference sponsors recently certified on Spark include Pentaho, Skytree, Tableau, Talend and Trifacta; and MapR announced a project to deliver Apache Drill on Spark.

Among the notable Spark sessions:

  • Sean Owen of Cloudera delivered an excellent demonstration of Spark’s MLLib machine learning library for anomaly detection
  • Michael Armbrust of Databricks presented on Spark SQL and its uses as both a query language and a general framework for working with structured data

Advancing a theme he introduced last year, Olson speculated in his keynote that Hadoop will “disappear” this year because enterprises increasingly view Hadoop in the context of an overall data management strategy.  He cited the recent Teradata-Cloudera partnership as evidence of this trend.  That announcement is certainly significant, but it demonstrates the opposite of Olson’s high-level point; Teradata abandoned its exclusive relationship with Hortonworks because many of its customers prefer Cloudera to HDP, and they aren’t willing to switch simply because TD sells a “Unified Data Architecture.”  Most enterprises still make decisions about Hadoop separately from decisions about other elements in the warehousing mix, and there are currently few good reasons to change that behavior.

Rana El Kaliouby of Affectiva presented an excellent example of analytics and Big Data working together.  Affectiva uses streaming facial recognition to capture millions of data points as consumers react to content, and uses machine learning algorithms to draw insight from the data.  By mapping the streaming data to emotional states, they can identify what content resonates with consumers.

Several of the sponsored topics in the plenary sessions were quite good, including presentations by MapR, Intel, ClearStory and IBM; others were about what one expects from sponsored presentations.

There were also a number of entertaining presentations that had little to do with Big Data.  Shankar Vedantum of NPR, for example, spent ten minutes sermonizing about the propensity of the human mind to select facts that confirm existing biases, and selectively used facts to illustrate his point.  He should have paid attention in “Research Methods 101”; at best, his point seemed trite, like telling a convention of nutritionists that “dieting is hard.”

Eli Collins of Cloudera delivered the obligatory “ethics and Big Data” piece, in which he argued that we should “use data for good”; his piece was immediately followed, ironically, by a presentation about using facial recognition to get people to buy more candy.  Everyone agrees that doing good is a good thing, but a technologist delivering a sermon is as silly as a Baptist minister lecturing on Oozie.