Looking Ahead: Big Analytics in 2016

Every year around this time I review last year’s forecast and publish some thoughts about the coming year.

2015 Assessment

First, a brief review of my predictions for 2015:

(1) Apache Spark usage will explode.

Nailed it.

(2) Analytics in the cloud will take off.

In 2015, all of the leading cloud platforms — AWS, Azure, IBM and Google — released new tools for advanced analytics and machine learning.  New cloud-based providers specializing in advanced analytics, such as Qubole and Domino Data Lab, emerged.

Cloud platform providers do not break out revenue by workload, so it’s difficult to measure analytics activity in the cloud; anecdotally, though, there are a growing number of analysts, vendors and service providers whose sole platform is the cloud.

(3) Python will continue to gain on R as the preferred open source analytics platform.

While Python continues to add functionality and gain users, so does R, so it’s hard to say that one is gaining on the other.

(4) H2O will continue to win respect and customers in the Big Analytics market.

In 2015, H2O doubled its user base, expanded its paid subscriber base fourfold and landed a $20 million “B” round.  Not bad for a company that operates on a true open source business model.

(5) SAS customers will continue to seek alternatives.

Among analytic service providers (ASPs) the exit from SAS is a stampede.

After a half dozen dot releases, SAS’ distributed in-memory products are stable enough that they are no longer the butt of jokes.  Customer adoption remains thin; customers are loyal to SAS’ legacy software, but skeptical about the new stuff.

2016 Themes

Looking ahead, here is what I see:

(1) Spark continues its long march into the enterprise.

With Cloudera 6, Spark will be the default processing option for Cloudera workloads.  This does not mean, as some suggest, that MapReduce is dead; it does mean that a larger share of new workloads will run on Spark.  Many existing jobs will continue to run in MapReduce, which works reasonably well for embarrassingly parallel workloads.

Hortonworks and MapR haven’t followed Cloudera with similar announcements yet, but will do so in 2016.  Hortonworks will continue to fiddle around with Hive on Tez, but will eventually give up and embrace Hive on Spark.

SAS will hold its nose and support Spark in 2016.  Spark competes with SAS’ proprietary back end, but SAS’ partnerships with the Hadoop distributors leave it little choice.  Analytic applications like Datameer and Microsoft/Revolution Analytics ScaleR that integrate with Hadoop through MapReduce will rebuild their software to interface with Spark.

Spark Core and Spark SQL will remain the most widely used Spark components, with general applicability across many use cases.  Spark MLLib suffers from comparison with alternatives like H2O and XGBoost; performance and accuracy need to improve.  Spark Streaming faces competition from Storm and Flink; the benefits of “pure” streaming versus micro-batching may seem theoretical, but the difference shows up in benchmarks.
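
To make the micro-batching distinction concrete, here is a minimal PySpark sketch (the socket source and five-second interval are arbitrary choices for illustration): Spark Streaming slices the stream into small batches and runs an ordinary Spark job on each, while Flink and Storm process events one at a time.

    # Minimal sketch: Spark Streaming processes data in micro-batches.
    # Assumes a text stream on localhost:9999 (e.g., started with `nc -lk 9999`).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "MicroBatchDemo")
    ssc = StreamingContext(sc, 5)  # one micro-batch (one RDD) every 5 seconds

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # prints the counts computed for each 5-second batch

    ssc.start()
    ssc.awaitTermination()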

With no enhancements in 2015, Spark GraphX is effectively dead.  The project leadership team must either find someone interested in contributing, fold the library into MLLib, or kill it.

(2) Open source continues to eat the analytics software world.

If all you read is Gartner and Forrester, you may be inclined to think that open source is just a blip in the market.  Gartner and Forrester ignore open source analytics for two reasons: (1) they get paid by commercial vendors, and (2) users don’t need “analysts” to tell them how to evaluate open source software.  You just download it and check it out.

Surveys of actual users paint a different picture.  Among new grads entering the analytics workforce, using open source is as natural as using mobile phones and Yik Yak; big SAS shops have to pay to send the kids to training.  The best and brightest analysts use open source tools, as shown by the 2015 O’Reilly Data Science Salary Survey;  while SAS users are among the lowest paid analysts, they take consolation from knowing that SPSS users get paid even less.

IBM’s decision in 2015 to get behind Spark exemplifies the movement towards open source.  IBM ranks #2 behind SAS in advanced analytics software revenue, but chose to disrupt itself by endorsing Spark and open-sourcing SystemML.  IBM figures to gain more in cloud and services revenue than it loses in cannibalized software sales.  It remains to be seen how well that will work, but IBM knows a trend when it sees one.

Microsoft’s acquisition of Revolution Analytics in 2015 gives R the stamp of approval from a company that markets the most widely implemented database (SQL Server) and the most widely used BI tool (Excel).  As Microsoft rolls out its R server and SQL-embedded R, look for a big jump in enterprise adoption.  It’s no longer possible for folks to dismiss R as some quirky tool used by academics and hobos.

The open source business model is also attracting capital.  Two analytics vendors with open source models (H2O and RapidMiner) recently landed funding rounds, while commercial vendors Skytree and Alpine languish in the funding doldrums and cut headcount.  Palantir and Opera Solutions, the biggest dogs in the analytics startup world, also leverage open source.

Increasingly, the scale-out distributed back end for Big Analytics is an open source platform, where proprietary architecture sticks out like a pimple.  Commercial software vendors can and will thrive when they focus on the end user.  This approach works well for AtScale, Alteryx, RapidMiner and ZoomData, among others.

(3) Cloud emerges as the primary platform for advanced analytics.

By “cloud” I mean all types of cloud: public, private, virtual private and hybrid, as well as data center virtualization tools, such as Apache Mesos.  In other words, self-service elastic provisioning.

High-value advanced analytics is inherently project-oriented and ad hoc; the most important questions are answered only once.  That makes advanced analytics workloads volatile.  They are also time-sensitive and may require massive computing resources.

This combination — an immediate need for large-scale computing resources for a finite period — is best served by some form of cloud.  The form of cloud an organization chooses will depend on a number of factors, such as where the source data resides, security concerns and the organization’s skills in virtualization and data center management.  But make no mistake: organizations that do not leverage cloud computing for advanced analytics will fall behind.

Concerns about cloud security for advanced analytics are largely bogus: rent-seeking apologetics from IT personnel who (rightly) view the cloud as a threat to their fiefdom.  Sorry guys — the biggest data breaches in the past two years were from on-premises systems.  Arguably, data is more secure in one of the leading clouds than it is on premises.

For more on this, read my book later this year. 🙂

(4) Automated machine learning tools become mainstream.

As I’ve written elsewhere, automated machine learning is not a new thing.  Commercial and open source tools that automate modeling in various ways have been available since the 1980s.  Most, however, automated machine learning by simplifying the problem in ways that adversely impact model quality.  In 2016, enterprises will be able to buy software that delivers expert-level predictive models of the kind that win Kaggle competitions.

Since analysts spend 80% of their time data wrangling, automated machine learning tools will not eliminate the hiring crunch in advanced analytics; one should be skeptical of vendor claims that “it’s so easy that even a caveman can do it.”  The primary benefit of automation will be better predictive models built consistently to best practices.  Automation will also expand the potential pool of users from hardcore data scientists to “near-experts”, people with business experience or statistical training who are not skilled in programming languages.
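
For readers who want a feel for the mechanics, here is a minimal sketch of the core idea, systematic search over candidate models scored by cross-validation, using scikit-learn.  Commercial automated machine learning tools layer feature engineering, algorithm selection and ensembling on top of this; the dataset and parameter grid below are arbitrary.

    # Minimal sketch of automated model selection: a cross-validated
    # search over candidate hyperparameters, scored out of sample.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    # Candidate hyperparameters; a real tool searches a far larger space.
    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.05, 0.1],
    }

    search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)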

(5) Teradata continues to struggle.

Listening to Teradata’s Q3 earnings call back in November, I thought of the Titanic.

CEO Mike Koehler, wiping pie from his face after another quarterly earnings fail, struggled to explain a coherent growth strategy.  It included (a) consulting services; (b) Teradata software on AWS; (c) Aster on commodity hardware.

Well, that dog won’t hunt.

— Teradata’s product sales drive its consulting revenue.  No product sales, no consulting revenue.  Nobody will ever hire Teradata for platform-neutral enterprise Big Data consulting projects, so without a strategy to build product sales, consulting revenue won’t grow either.

— Teradata’s principal value added is its ability to converge software and hardware into an integrated appliance.  By itself, Teradata software is nothing special; there are plenty of open source alternatives, like Greenplum, which Pivotal recently open-sourced.  Customers who choose to build a data warehouse on AWS have many options, and Teradata won’t be the first choice.  Meanwhile, IBM, Microsoft and Oracle are light years ahead of Teradata in delivering true hybrid cloud databases.

— Aster on commodity hardware is a SQL engine with some prebuilt apps.  It runs through MapReduce, which was kind of cool in 2012 but DOA in today’s market: customers who want a SQL engine that runs on commodity hardware have multiple open source options, including Presto, which Teradata also embraces.

Meanwhile, Teradata’s leadership team actually spent time with analysts talking about the R&D tax credit, which seemed like rearranging deck chairs on the Titanic.  The stock is worth about a third of its 2012 value because the company has repeatedly missed earnings forecasts, and investors have no confidence in current leadership.

At current market value, Teradata is acquisition bait, but it’s not clear who would buy it.  My money’s on private equity, who will cut headcount by half and milk the existing customer base.  There are good people at Teradata; I would advise them all to polish their resumes.

2015 in Big Analytics

Looking back at 2015, a few stories stand out:

  • Steady progress for Spark, punctuated by two big announcements.
  • Solid growth in cloud-based machine learning, led by Microsoft.
  • Expanding options for SQL and OLAP on Hadoop.

In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April.  I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.

IBM Embraces Spark

IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit.  IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base.  It also brings a clutter of aging and partially integrated products, an army of suits and no fewer than 164 Vice Presidents whose titles include the words “Big Data.”

When IBM announced its Spark initiative I joked that somewhere in the bowels of IBM, someone will want to put Spark on a mainframe.  Color me prophetic.

It’s too early to tell what substantive contributions IBM will make to Spark.  Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September.  This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.

All that said, IBM brings respectability, and the assurance that Spark is ready for prime time.  This is priceless.  Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.

Cloudera Announces “One Platform” Initiative

In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated.  As with the IBM announcement, the symbolism matters.  Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true.  It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.

The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop.  According to Databricks’ recent Spark user survey, only 40% use Spark under YARN, with the rest running in a freestanding cluster or on Mesos.  It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish.  But if Cloudera thinks “One Platform” will stem that tide, it is mistaken.  It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.
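
The survey result reflects a basic fact about Spark’s design: the cluster manager is a configuration detail, not an architectural commitment.  A minimal sketch (host names are hypothetical):

    # The same Spark application can run under a freestanding Spark cluster,
    # Mesos, or Hadoop via YARN just by changing the master URL.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("ClusterManagerDemo")
    conf.setMaster("spark://master-host:7077")   # freestanding Spark cluster
    # conf.setMaster("mesos://mesos-host:5050")  # ...or Mesos
    # conf.setMaster("yarn-client")              # ...or Hadoop via YARN (Spark 1.x)

    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(1000)).sum())     # same job, any cluster manager
    sc.stop()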

Microsoft Builds Credibility in Analytics

In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics.  The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets.  Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.

Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML).   First released in beta in June 2014, AML is both easy to use and powerful.  The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts.  Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.

Azure Machine Learning contrasts markedly with Amazon Machine Learning.  Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love.  Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon.  If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.

SQL Engines Proliferate

At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL.  Spark SQL was still in Alpha; Drill was a curiosity; and Presto was something used at Facebook.

Several things happened during the year:

  • Hive on Tez established rough performance parity with the fast SQL engines.
  • Spark SQL went to general release, stabilized, and rolled out the DataFrames API (a minimal sketch follows this list).
  • MapR promoted Drill, and invested in improvements to the software.  Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
  • Cloudera donated Impala to open source, and Pivotal donated Hawq.
  • Teradata placed its chips on Presto.
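
As promised above, a minimal sketch of the DataFrames API (Spark 1.x syntax; the file and column names are hypothetical), showing the same query expressed through the API and as SQL:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "DataFramesDemo")
    sqlContext = SQLContext(sc)

    # Load JSON with schema inference, then query through the DataFrame API
    df = sqlContext.read.json("events.json")
    df.filter(df.age > 21).groupBy("country").count().show()

    # The same query, expressed as SQL against a registered temp table
    df.registerTempTable("events")
    sqlContext.sql("SELECT country, COUNT(*) AS n FROM events "
                   "WHERE age > 21 GROUP BY country").show()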

While it’s great to see so many options emerge, Hive continues to win actual evaluations.  Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance.  Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.

The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.

OLAP on Hadoop Gets Real

For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options.  The open source option, Apache Kylin, originally an eBay project, just recently graduated to Apache top level status.  Adoption is limited at present, but any project used by eBay and Baidu is worth a look.

The commercial option is AtScale, a company that emerged from stealth in April.  Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools.  It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.

Funding for Machine Learning

H2O.ai’s recently announced B round is significant for a couple of reasons.  First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.

Like the dog that did not bark in the Sherlock Holmes story, two companies are significant because they did not secure funding in 2015:

  • Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times.  It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
  • Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem.  Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.

Palantir continued to suck up capital like a whale feeding on krill.

Google TensorFlow

Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.

Spark Summit Europe Roundup

The 2015 Spark Summit Europe met in Amsterdam October 27-29.  Here is a roundup of the presentations, organized by subject areas.   I’ve omitted a few less interesting presentations, including some advertorials from sponsors.

State of Spark

— In his keynote, Matei Zaharia recaps findings from Databricks’ Spark user survey and notes growth in summit attendance, meetup membership and contributor headcount.  (Video here.)  Enhancements expected for Spark 1.6:

  • Dataset API
  • DataFrame integration for GraphX, Streaming
  • Project Tungsten: faster in-memory caching, SSD storage, improved code generation
  • Additional data sources for Streaming

— Databricks co-founder Reynold Xin recaps the last twelve months of Spark development.  New user-facing developments in the past twelve months include:

  • DataFrames
  • Data source API
  • R binding and machine learning pipelines

Back-end developments include:

  • Project Tungsten
  • Sort-based shuffle
  • Netty-based network

Of these, Xin covers DataFrames and Project Tungsten in some detail.  Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work.  Video here.

Getting Into Production

— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips to avoid them.  Key issues: moving beyond Python performance; using Spark with R; network and CPU-bound workloads.  Video here.

— Tuplejump’s Evan Chan summarizes Spark deployment options and explains how to productionize Spark, with special attention to the Spark Job Server.  Video here.

— Spark committer and Databricks engineer Andrew Or explains how to use the Spark UI to visualize and debug performance issues.  Video here.

— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization.  They tout Sentry, Cloudera’s preferred security platform.  Video here.

Spark for the Enterprise

— Revisiting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality.  He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit.  His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark.  Video here.

— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others).  Video here.  Key insights:

  • Prediction is the most popular use case
  • Hive is most frequently co-installed, followed by HBase, Impala and Solr.
  • Customers want security and performance comparable to leading relational databases combined with simplicity.

Data Sources and File Systems

— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API.  (Video unavailable).

— Tachyon Nexus’ Gene Pang offers an excellent overview of Tachyon’s memory-centric storage architecture and how to use Spark with Tachyon.  Video here.

Spark SQL and DataFrames

— Michael Armbrust, lead developer for Spark SQL, explains DataFrames.  Good intro for those unfamiliar with the feature.  Video here.

— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark.  More detail about the use case and solution here.  The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata.  Video here.

— Informatica’s Kiran Lonikar summarizes a proposal to use GPUs to support columnar data frames.  Video here.

— Ema Orhian of Atigeo describes Jaws, a RESTful data warehousing framework built on Spark SQL with Mesos and Tachyon support.  Video here.

Spark Streaming

— Helena Edelson, VP of Product Engineering at Tuplejump, offers a comprehensive overview of streaming analytics with Spark, Kafka, Cassandra and Akka.  Video here.

— Francois Garillot of Typesafe and Gerard Maas of virdata explain and demo Spark Streaming.  Video here.

— Iulian Dragos and Luc Bourlier explain how to leverage Mesos for Spark Streaming applications.  Video here.

Data Science and Machine Learning

— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project.  Video here.

— Alexander Ulanov, Senior Research Scientist at Hewlett-Packard Labs, describes his work with Deep Learning, building on MLLib’s multilayer perceptron capability.  Video here.

— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR.  He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages.  The SparkR roadmap, he notes, includes expanded MLLib functionality; UDF support; and a complete DataFrame API.  Finally, he demos SparkR and explains how to get started.  Video here.

— MLlib committer Joseph Bradley explains how to combine the strengths of R, scikit-learn and MLlib.  He addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment?  Bradley demonstrates how to do this with Spark, using sentiment analysis as an example.  Video here.

— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLLib, Akka and Cassandra.  He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.
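
For the curious, here is a minimal sketch of the distance-based approach using Spark MLLib’s k-means: cluster the data, then flag points far from their nearest cluster center.  This is not Busa’s actual pipeline, which adds Akka and Cassandra; the data here is random.

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local", "AnomalyDemo")
    points = sc.parallelize([np.random.randn(2) for _ in range(1000)]).cache()

    model = KMeans.train(points, k=5, maxIterations=20)

    def distance_to_center(point):
        # Distance from a point to its nearest cluster center
        center = model.clusterCenters[model.predict(point)]
        return float(np.linalg.norm(point - center))

    # Treat the most distant points as anomaly candidates
    cutoff = points.map(distance_to_center).top(10)[-1]
    anomalies = points.filter(lambda p: distance_to_center(p) >= cutoff)
    print(anomalies.count())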

— Bitly’s Sarah Guido explains topic modeling, using Spark MLLib’s Latent Dirichlet Allocation.  Video here.
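
A minimal sketch of LDA in MLLib (Python API, Spark 1.5 or later); the documents are toy term-count vectors over a four-word vocabulary:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import LDA
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext("local", "LDADemo")

    # Each document is [id, vector of term counts over a fixed vocabulary]
    corpus = sc.parallelize([
        [0, Vectors.dense([4.0, 0.0, 1.0, 3.0])],
        [1, Vectors.dense([0.0, 5.0, 2.0, 0.0])],
        [2, Vectors.dense([3.0, 1.0, 0.0, 4.0])],
    ])

    model = LDA.train(corpus, k=2)
    print(model.topicsMatrix())  # vocabSize x k matrix: term weights per topic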

— Casey Stella describes using word2vec in MLLib to extract features from medical records for a Kaggle competition.  Video here.

— Piotr Dendek and Mateusz Fedoryszak of the University of Warsaw explain Random Ferns, a bagged form of Naive Bayes, for which they have developed a Spark package. Video here.

GeoSpatial Analytics

— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine.  Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL.  Video here.

Use Cases and Applications

— Ion Stoica summarizes Databricks’ experience working with hundreds of companies and distills it into two generic Spark use cases: (1) the “Just-in-Time Data Warehouse”, bypassing IT bottlenecks inherent in conventional DW; (2) the unified compute engine, combining multiple frameworks in a single platform.  Video here.

— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a use case for real-time analytics.  SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration.  Video here.

— Yahoo’s Ayman Farahat describes a collaborative filtering application built on Spark that generates 26 trillion recommendations.  Training time: 52 minutes; prediction time: 8 minutes.  Video here.
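
Collaborative filtering in Spark typically means MLLib’s ALS (alternating least squares).  A minimal sketch with toy ratings, not Farahat’s actual pipeline:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext("local", "ALSDemo")

    # Toy (user, product, rating) triples
    ratings = sc.parallelize([
        Rating(1, 10, 5.0), Rating(1, 20, 1.0),
        Rating(2, 10, 4.0), Rating(2, 30, 5.0),
    ])

    model = ALS.train(ratings, rank=10, iterations=10)
    print(model.predict(1, 30))           # predicted rating for one (user, product)
    print(model.recommendProducts(1, 2))  # top-2 products for user 1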

— Sujit Pal explains how Elsevier uses Spark together with Solr and OpenNLP to annotate documents at scale.  Elsevier has donated the application, called SoDA, back to open source.  Video here.

— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure.  Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including Pebble, Android, iOS, Play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS.  With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior.  Video here.

— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes Kafka, Secor, Swift, Parquet and Elasticsearch for data collection; Spark SQL and MLLib for pattern learning; and a complex event processing engine for application in real time.  Video here.

Big Analytics Roundup (November 2, 2015)

Spark Summit Europe, Oracle Open World and IBM Insights all met last week, as did Cloudera’s Wrangle conference for data scientists.

But in the really important news, KC beats the Mets to take the Series.

Top news from the Spark Summit is Typesafe’s announcement of Spark support, plus some insight into what’s coming in Spark 1.6.  I will publish a separate roundup for the Spark Summit next week when presentations are available.

Nine stories this week:

(1) Typesafe Announces Spark Support

Typesafe, the commercial venture behind Scala and Akka, announces commercial support for Apache Spark.  Planned service offerings include one-business-day response to questions for projects in development.  For production, SLAs range from 4-hour turnaround during business hours up to 24/7 with 1-hour turnaround.

(2) More Funding for Alteryx

The New York Times reports that Alteryx has landed an $85 million “C” round, led by Iconiq Capital.  That makes a total of $163 million in four rounds for the company.

(3) Oracle Adds Spark to Cloud

At Oracle Open World, Oracle announces Oracle Cloud Platform for Big Data, a PaaS offering; Dave Ramel covers the story.  Key new bits include automated ingestion, preparation, repair, enrichment and governance, all built on Spark; and a DBaaS offering with Hadoop, Spark and NoSQL data services.

(4) IBM Adds Spark Support to Analytics Server

Full story here.  Great news for those who want to use the high-end version of the second most popular data mining workbench with the third and fourth most popular Hadoop distributions.

(5) Ned Explains Zeppelin

Ned’s Blog provides a nice Zeppelin walk-through, noting the UI’s rich list of language interpreters, which currently includes HiveQL, Spark, Flink, Postgres, HAWQ, Tajo, AngularJS, Cassandra, Ignite, Phoenix, Geode, Kylin and Lens.

(6) IIT and ANL Deliver BSP with ZHT

Researchers from the Illinois Institute of Technology, Argonne Labs and Hortonworks report that they have implemented a graph processing system based on Bulk Synchronous Processing on ZHT, a distributed key-value store.  Nicole Hemsoth reports.  Benchmarked against Giraph, GraphLab, GraphX and Hama, the new engine, called Pregelix, outshines them all.

(7) Wrangle 2015 Meets in SFO

Cloudera’s Justin Kestelyn summarizes the event, which hosted data science teams from the likes of Uber, Facebook and Airbnb.  Tony Baer offers the trite perspective that data science is about people.

(8) MapR Offers Free Spark Training

MapR announces availability of its first free Apache Spark course as part of its Hadoop On-Demand Training program.  No word on quality, but it’s hard to beat the price.

(9) Cloudera Pushes HUE for Spark

On the Cloudera Engineering blog, Justin Kestelyn explains how to use HUE’s notebook app with SQL and Spark.

Big Analytics Roundup (October 26, 2015)

Fourteen stories this week, beginning with an announcement from IBM.  This week, IBM celebrates 14 straight quarters of declining revenue at its IBM Insight conference, appropriately enough at the Mandalay Bay in Vegas, where the restaurants are overhyped and overpriced.

Meanwhile, the first Spark Summit Europe meets in Amsterdam, in the far more interesting setting of the Beurs van Berlage.  There will be a live stream on Wednesday and Thursday — details here.  Sadly, I can’t make this one — the first Spark Summit I’ve missed — but am looking forward to the live stream.

(1) IBM Announces Spark on Bluemix

At its IBM Insight beauty show, IBM announces availability of its Apache Spark cloud service.  Actually, IBM announced it back in July, but that was a public beta.   On ZDNet, Andrew Brust gushes, noting that IBM has DB2, Watson, Netezza, Cognos, TM1, SPSS, Informix and Cloudant in its portfolio.  He fails to note that of those products, exactly one — Cloudant — actually interfaces with Spark.

There were rumors that IBM would have an exciting announcement about Spark at this show, but if this is it — yawn.  Looking at IBM’s “Spark in the cloud” offering, I don’t see anything that sets it apart from other available offerings unless you have a Blue fetish.

Update: Rod Reicks of IBM writes to note that IBM’s new release of SPSS Analytics Server runs processes in Spark.  For the uninitiated, Analytics Server is a product you license from IBM that enables SPSS Modeler users to run selected operations in Hadoop.  Previous versions ran through MapReduce only.  Reicks claims that the latest version runs through Spark when available.

I say “claims” because there is no reference to this feature in IBM’s Release Notes, Installation Guide or User’s Guide.  Spark is mentioned deep in the Administrator Guide, under Troubleshooting.  So the good news is that if the product fails, IBM has some tips — one of which should be “Install Spark.”

You’d think that with IBM’s armies of people they could at least find someone to write documentation.

(2) Mahout Book FAIL

Packt announces a book on Clustering with Mahout with an entire chapter devoted to Canopy Clustering, which the Mahout team just deprecated.

(3) Concurrent Adds Spark Support

Concurrent announces Release 2.0 of Driven, its oddly-named performance management software, which now includes support for Apache Spark.

(4) Flink Founder Touts Streaming Analytics

At Big Data Spain, Data Artisans co-founder Kostas Tzoumas argues that streaming is the basis for all analytics, which is a bit over the top: as they say, when all you have is a hammer, everything looks like a nail.  Still, his deck is a nice intro to Flink, which has made some progress this year.

(5) AtScale Announces Release 3.0

AtScale, one of the more interesting startups in the BI space, delivers Release 3.0 of its OLAP-on-Hadoop platform.  Rather than introducing a new user interface into the mix, AtScale makes it possible for BI users to work with Hadoop tables without jumping back and forth to programming tools.  The product currently supports Tableau, Excel, Qlik, Spotfire, MicroStrategy and JasperSoft, and runs on CDH, HDP or MapR with Impala, Spark SQL or Hive on Tez.  The new release includes enhanced role-based security, with Kerberos, username/password or LDAP authentication.

(6) Neo: Graphs are Eating the World

Graph database leader Neo announces immediate availability of Neo4j 2.3, which includes what it calls “intelligent applications at scale” and Docker support.  Exactly what Neo means by “intelligent applications at scale” is unclear, but if Neo is claiming that you no longer have to dump a graph into Spark to run a PageRank, I’ll believe it when I see it.

(7) New Notebook Sharing for Databricks 

Databricks announces new notebook sharing capabilities for its eponymous product.  On the Databricks blog, Denise Li and Dave Wang explain.

(8) Teradata: Blah, Blah, Blah, IoT, Blah, Blah Blah

At its annual user conference, Teradata announces that it’s heard about IoT.  Teradata also announces that it will make Aster available on Hadoop, which would have been interesting in 2012.  Aster, for the uninitiated, includes a SQL on MapReduce engine, which is rendered obsolete by fast SQL engines like Presto, which Teradata has just embraced.

(9) Flink Forward Redux

As I noted last week, the first Flink Forward conference met in Berlin two weeks ago.  William Benton records his impressions.

Presentations are here.  Some highlights:

  • Dongwon Kim benchmarks Flink against MR, MR on Tez and Spark.  Flink wins.
  • Kostas Tzoumas outlines the Flink development roadmap through Release 1.0.
  • Martin Junghanns explains graph analytics with Flink.
  • Anwar Rizal demonstrates streaming decision trees with Flink.

Henning Kropp offers resources for diving deeply into Flink.

(10) Pyramid Analytics Lands New Funding

Amsterdam-based BI startup Pyramid Analytics announces a $30 million “B” round to help it try to explain why we need more BI software.

(11) Harte Hanks Switches from CDH to MapR

John Leonard explains why Harte Hanks switched from Cloudera to MapR.  Most likely explanation: they were able to cut a cheaper deal with MapR.

(12) Audience Modeling with Spark

Guest posting on the Databricks blog, Eugene Zhulenev explains audience modeling with Spark ML pipelines.

(13) New Functions in Drill

On the MapR blog, Neeraja Rentachintala describes new capabilities in Drill Release 1.2, including SQL window functions.

(14) Integrating Spark and Redshift

“Redshift is where data goes to die.”  — Rob Ferguson, Spark Summit East

On the Databricks blog, Sameer Wadkar of Axiomine explains how to use the spark-redshift package, first introduced in March of this year and now in version 0.5.2.  So you can yank your data out of Redshift and do something with it. (h/t Hadoop Weekly)
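
A minimal sketch of the read path (connection details are hypothetical, and the package must be on the classpath, e.g. via --packages): spark-redshift stages the data in S3 with a Redshift UNLOAD, then loads it as a DataFrame.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "RedshiftDemo")
    sqlContext = SQLContext(sc)

    df = (sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
          .option("dbtable", "events")            # or .option("query", "SELECT ...")
          .option("tempdir", "s3n://bucket/tmp")  # staging area for Redshift UNLOAD
          .load())

    df.groupBy("event_type").count().show()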

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, Spark SQL’s graduation from Alpha, new algorithms in MLLib and Spark Streaming, and a direct Kafka API for Spark Streaming (a minimal sketch follows this list), plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics With Spark, writes on tuning Spark jobs on the Cloudera Engineering blog.
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process terabyte-scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on Git.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.
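
As noted above, a minimal sketch of the direct Kafka approach (the Python binding for createDirectStream landed shortly after the Scala API; broker and topic names are hypothetical):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext("local[2]", "DirectKafkaDemo")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # Direct approach: no receiver; Spark reads Kafka partitions itself
    # and tracks offsets, enabling exactly-once processing of the stream.
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker-host:9092"})

    stream.map(lambda kv: kv[1]).count().pprint()  # messages per batch
    ssc.start()
    ssc.awaitTermination()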

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.  Ryft’s Christian Shrauder blogs about FPGAs.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze high-dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab’s Jiannen Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that Airbnb has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

SAS in Hadoop: An Update

SAS supports several different products that run “inside” Hadoop based on two different in-memory architectures:

(1) The SAS High Performance Analytics suite, originally designed to run in dedicated Teradata and Greenplum appliances, includes five modules: Statistics, Data Mining, Text Mining, Econometrics and Optimization.

(2) A second set of products — SAS Visual Analytics, SAS Visual Statistics and SAS In-Memory Statistics for Hadoop — run on the SAS LASR Server architecture, which is designed for high concurrency.

SAS’ recent marketing efforts appear to favor the LASR-based software, so that is the focus of this post.  At the recent Strata + Hadoop World conference in New York, I was able to sit down with Paul Kent, Vice President of Big Data at SAS, to discuss some technical aspects of SAS LASR Server.   Paul was most generous with his time.  We discussed three areas:

(1) Can SAS LASR Server work directly with data in Hadoop?

According to SAS documentation, LASR Server can read data from traditional SAS datasets, relational databases (using SAS/Access Software) or data stored in SAS’ proprietary SASHDAT format.   That suggests SAS users must preprocess Hadoop data before loading it into LASR Server.

Paul explained that LASR Server can read Hadoop data through SAS/ACCESS Interface to Hadoop, which makes HDFS data appear to SAS as a virtual relational database. (Of course, this applies to structured data only). Reading from SASHDAT is much faster, however, so users should consider the tradeoff between the time needed to pre-process data into SASHDAT versus the runtime with SAS/ACCESS.

SAS/ACCESS Interface to Hadoop can read all widely used Hadoop data formats, including ORC, Parquet and Tab-Delimited; it can also read user-defined formats.  This builds on SAS’ long-standing ability to work with enterprise data everywhere.

Base SAS supports basic data cleansing and data transformation capability through DATA Step and DS2 processing, and can write SASHDAT format; however, since LASR Server runs DS2 but not DATA Step code, this transformation could require extraction and movement to an external server.  Alternatively, users can pass Hive, Pig or MapReduce commands to Hadoop to perform data transformation in place.  Users can also license SAS ETL Server and build a process to convert raw data and store it in SASHDAT.

SAS Visual Analytics, which runs on LASR Server, includes the Data Builder component for modest data preparation tasks.

(2) Can SAS LASR Server and MapReduce run concurrently in Hadoop?

At last year’s Strata + Hadoop World, Paul mentioned some issues running SAS and MapReduce at the same time; workarounds included running SAS during the daytime and MapReduce at night. Clients who have evaluated LASR-based software say this is a concern.

Paul notes that given a fixed number of task tracker slots on a node, any use of slots by SAS necessarily reduces the number of slots available for MapReduce; this can create conflicts for customers who are unwilling or unable to make a static allocation between MapReduce and SAS workload.  This issue is not unique to SAS, but potentially applies to any software co-located with Hadoop prior to the introduction of YARN.

Under Hadoop 1.0, Hadoop workload management was tightly married to MapReduce.  Applications operating independently from MapReduce (like SAS) were essentially ungoverned.  The introduction of YARN late last year eliminates this issue because it supports unified workload management for MapReduce and non-MapReduce applications.

(3) Can SAS LASR Server run on standard commodity hardware?

SAS supports LASR Server on “spec” hardware from a number of vendors, but does not recommend specific boxes; instead, it works with customers to define expected workload, then relies on its hardware partners to recommend infrastructure. Hence, prospective customers should consult with hardware suppliers or independent experts when sizing hardware for SAS, and not rely solely on verbal representations by SAS sales and marketing personnel.

While the definition of a “standard” Hadoop DataNode server changes rapidly, industry experts such as Doug Henschen say the current standard is a 12-core machine with 64-128G RAM; sources at Cloudera confirm this is a typical configuration.  A recently published paper from HP and Hortonworks positions the reference spec at 96G RAM for memory-intensive applications.

In contrast, the minimum hardware recommended by HP for SAS LASR Server is a 16-core machine with 256G RAM.

It should not surprise anyone that in-memory software needs more memory; Henschen, for example, points out that organizations seeking to use Spark or Impala should specify more memory.   While some prospective customers may balk at the task of upgrading memory in every DataNode of a large cluster, the cost of memory is coming down, so this should not be an issue in the long run.

Strata + Hadoop World 2014

A sellout crowd of 5,500 met at the Javits Center in New York last week for the 2014 Strata + Hadoop World conference.  There were three major themes:

Big Data in Action.   In his keynote address, Mike Olson of Cloudera noted the shift from talking about “geeky projects like Pig, Sqoop and Oozie” to talking about applications, such as fraud detection, product design and agriculture.   An entire track in the conference featured success stories from companies such as Goldman Sachs, Transamerica, American Express, L.L. Bean, FICO and Kaiser Permanente.

Symbiosis of Analytics and Big Data.  Paul Zikopoulos of IBM observed that “Big Data without analytics is just a bunch of data.”   Zikopoulos drew an analogy to the mining industry, which uses advanced technology to extract trace amounts of valuable material from large quantities of low-grade ore; in Big Data, we use advanced analytics to extract useful insight from large quantities of low-value per byte data.  Conference sessions reflected the critical role analytic technology plays in the Big Data value chain.

Spark has arrived.  The 2013 conference included two sessions about Spark; this year, thirteen sessions featured Spark, including the sold-out full day Spark Camp.  Moreover, vendors such as ClearStory Data and Platfora openly touted Spark integration, in the belief that this capability resonates with buyers.  Other conference sponsors recently certified on Spark include Pentaho, Skytree, Tableau, Talend and Trifacta; and MapR announced a project to deliver Apache Drill on Spark.

Among the notable Spark sessions:

  • Sean Owen of Cloudera delivered an excellent demonstration of Spark’s MLLib machine learning library for anomaly detection
  • Michael Armbrust of Databricks presented on Spark SQL and its uses as both a query language and a general framework for working with structured data

Advancing a theme he introduced last year, Olson speculated in his keynote that Hadoop will “disappear” this year because enterprises increasingly view Hadoop in the context of an overall data management strategy.  He cited the recent Teradata-Cloudera partnership as evidence of this trend.  That announcement is certainly significant, but it demonstrates the opposite of Olson’s high-level point; Teradata abandoned its exclusive relationship with Hortonworks because many of its customers prefer Cloudera to HDP, and they aren’t willing to switch simply because TD sells a “Unified Data Architecture.”  Most enterprises still make decisions about Hadoop separately from decisions about other elements in the warehousing mix, and there are currently few good reasons to change that behavior.

Rana El Kaliouby of Affectiva presented an excellent example of analytics and Big Data working together.  Affectiva uses streaming facial recognition to capture millions of data points as consumers react to content, and uses machine learning algorithms to draw insight from the data.  By mapping the streaming data to emotional states, they can identify what content resonates with consumers.

Several of the sponsored topics in the plenary sessions were quite good, including presentations by MapR, Intel, ClearStory and IBM; others were about what one expects from sponsored presentations.

There were also a number of entertaining presentations that had little to do with Big Data.  Shankar Vedantam of NPR, for example, spent ten minutes sermonizing about the propensity of the human mind to select facts that confirm existing biases, and selectively used facts to illustrate his point.  He should have paid attention in “Research Methods 101”; at best, his point seemed trite, like telling a convention of nutritionists that “dieting is hard.”

Eli Collins of Cloudera delivered the obligatory “ethics and Big Data” piece, in which he argued that we should “use data for good”; his piece was immediately followed, ironically, by a presentation about using facial recognition to get people to buy more candy.  Everyone agrees that doing good is a good thing, but a technologist delivering a sermon is as silly as a Baptist minister lecturing on Oozie.

2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014.  Thought it would be fun to validate how well the crystal ball works.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with product certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite.  But the Mahout team has announced that all new projects must use a standard DSL that runs the job in Spark.

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Co-locating your analytics in the Hadoop cluster is less attractive than integrating your analytics with Hadoop.  With Spark fully integrated with Hadoop storage APIs, co-located solutions seem much less attractive.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

Graph analysis is really useful in the right hands, but organizations are still trying to figure out what to do with it.  That is why we still see posts like this; when something is hot, nobody writes articles about what to do with it; everyone knows what to do with it.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and the method tends to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops.  Tez isn’t really comparable to Giraph and GraphLab.  And right after I wrote this, the GraphLab open source project pretty much died.  GraphLab Inc., the commercial venture founded to commercialize the open source project, is fiddling around with other stuff.  Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchen updates his analysis for 2014.  But the trend is your friend:

(Figure 1b: R versus SAS trends in job postings, February 2014.)

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

Speaking with enterprise customers, I like to ask why they switched from SAS to R.  The #1 response: the people we hire know R already, not SAS.  SAS’ free “University Edition” is an attempt to stem the bleeding that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs.  SAS has delivered an ACCESS engine to HANA but not much else considering the talk about joint solutions.  SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, which is an improvement over 2013 when they didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.

Spark Summit 2014 Roundup

Key highlights from the 2014 Spark Summit:

  • Spark is the single most active project in the Hadoop ecosystem
  • Among Hadoop distributors, Cloudera and MapR are clear leaders with Spark
  • SAP now offers a certified Spark distribution and integration with HANA
  • Datastax has delivered a Cassandra connector for Spark
  • Databricks plans to offer a cloud service for Spark
  • Spark SQL will absorb the Shark project for fast SQL
  • Cloudera, MapR, IBM and Intel plan to port Hive to Spark
  • Spark MLLib will double its supported algorithms in the next release

Last December, the 2013 Spark Summit pulled 450 attendees for a two-day event.  Six months later, the Spark Summit 2014 sold out at more than a thousand seats for a three-day affair.

It’s always ironic when manual registration at a tech conference produces long lines, as it did this year.

Databricks CTO Matei Zaharia kicked off the keynotes with his recap of Spark progress since the last summit.   Zaharia enumerated Spark’s two big goals: a unified platform for Big Data applications combined with a standard library for analytics.  CEO Ion Stoica followed with a Databricks update, including an announcement of the SAP alliance and an impressive demo of Databricks Cloud, currently in private beta.  Separately, Databricks announced $33 million in Series B funding.

Spark Release Manager Patrick Wendell delivered an overview of planned development over the next several releases.   Wendell confirmed Spark’s commitment to stable APIs; patches that break the API fail the build.   The project will deliver dot releases every three months beginning in August 2014, and maintenance releases as needed.   Development focus in the near future will be in the libraries:

  • Spark SQL: optimization, extensions (toward SQL 92), integration (NoSQL, RDBMS), incorporation of Shark
  • MLLib: rapid expansion of algorithms (including descriptive statistics, NMF, sparse SVM, LDA), tighter integration with R
  • Streaming: new data sources, tighter Flume integration
  • GraphX: optimizations and API stability

Mike Franklin of Berkeley’s AMPLab summarized new developments in the Berkeley Data Analytics Stack (BDAS, pronounced “badass”), including significant new work in genomics and energy, as well as improvements to Tachyon and MLBase.  Dave Patterson elaborated on AMPLab’s work in genomics, providing examples showing how Spark has markedly reduced both cost and runtime for genomic analysis.

Cloudera, Datastax, MapR and SAP demonstrated that the first rule of success is to show up:

  • Mike Olson of Cloudera responded to Hortonworks’ snark by confirming Cloudera’s commitment to Impala as well as Hive on Spark.  Olson drew a round of applause when he invited Horton to join the Hive on Spark consortium.
  • Martin van Ryswyk of Datastax announced immediate availability of a Cassandra driver for Spark, a component that exposes Cassandra tables as Spark RDDs.  Datastax continues to work on tighter integration with Spark, including support for Spark SQL, Streaming and GraphX libraries.  In the breakouts, Datastax delivered a deeper briefing on integration with Spark Streaming.
  • M.C. Srivas of MapR highlighted Spark benefits realized by four MapR customers, including Cisco, a health insurer, an ad platform and a pharma company.  MapR continues to claim support for Shark as a differentiator, a point mooted by the announcement that Spark SQL will soon absorb Shark.
  • Aiaz Kazi of SAP seemed pleased that most of the audience has heard of SAP HANA, and delivered an overview of SAP’s integration with Spark.

IBM wasted a Platinum sponsorship by sending some engineers to talk about “System T”, IBM’s text mining application, with passing references to Spark.  Although IBM Infosphere BigInsights is a certified Spark distribution, IBM appears uncommitted to Spark; the lack of executive presence at the Summit stood out in sharp contrast to Cloudera and MapR.

Silver sponsors Hortonworks and Pivotal hosted tables in the vendor area, but did not present anything.

Neuroscientist Jeremy Freeman, back by popular demand from the 2013 Spark Summit, presented latest developments in his team’s research into animal brains using Spark as an analytics platform.  Freeman’s presentations are among the best demonstrations of applied analytics that I’ve seen in any forum.

A number of vendors in the Spark ecosystem delivered presentations showing how their applications leverage Spark.

The most significant change from the 2013 Spark Summit is the number of reported production users for Spark.  While the December conference focused on Spark’s potential, I counted several dozen production users among the presentations I attended.

Also among the sellout crowd: a SAS executive checking to see if there is anything to this open source and vendor-neutral stuff.  Apparently, he did not get Jim Goodnight’s message that “Big Data is hype manufactured by media.”