Big Analytics Roundup (November 2, 2015)

Spark Summit Europe, Oracle Open World and IBM Insights all met last week, as did Cloudera’s Wrangle conference for data scientists.

But in the really important news, KC beats the Mets to take the Series.

Top news from the Spark Summit is Typesafe’s announcement of Spark support, plus some insight into what’s coming in Spark 1.6.  I will publish a separate roundup for the Spark Summit next week  when presentations are available.

Nine stories this week:

(1) Typesafe Announces Spark Support

Typesafe, the commercial venture behind Scala and Akka, announces commercial support for Apache Spark.   Planned service offerings include an offer of one day business hour response to questions for projects in development.  For production, SLAs range from 4 hour turnaround during business hours up to 24/7 with one hour turnaround.

(2) More Funding for Alteryx

The New York Times reports that Alteryx has landed an $85 million “C” round, led by Iconiq Capital.  That makes a total of $163 million in four rounds for the company.

(3) Oracle Adds Spark to Cloud

At Oracle Open World, Oracle announces Oracle Cloud Platform for Big Data, a PaaS offering;  Dave Ramel covers the story.   Key new bits include automated ingestion, preparation, repair, enrichment and governance, all built in Spark; and a DBaaS offering with Hadoop, Spark and NoSQL data services.

(4) IBM Adds Spark Support to Analytics Server

Full story here.  Great news for those who want to use the high-end version of the second most popular data mining workbench with the third and fourth most popular Hadoop distributions.

(5) Ned Explains Zeppelin

Ned’s Blog provides a nice Zeppelin walk-through, noting the UI’s rich list of language interpreters, which currently includesL HiveQL, Spark, Flink, Postgres, HAWQ, Tajo, AngularJS, Cassandra, Ignite, Phoenix, Geode, Kylin and Lens.

(6) IIT and ANL Deliver BSP with ZHT

Researchers from the Illinois Institute of Technology, Argonne Labs and Hortonworks report that they have implemented a graph processing system based on Bulk Synchronous Processing on ZHT, a distributed key-value store.   Nicole Hemsoth reports.   The new engine, called Pregelix, when benchmarked against Giraph, GraphLab, GraphX and Hama, outshines them all.

(7) Wrangle 2015 Meets in SFO

Cloudera’s Justin Kestelyn summarizes the event, which hosted data science teams from the likes of Uber, Facebook and Airbnb.  Tony Baer offers the trite perspective that data science is about people.

(8) MapR Offers Free Spark Training

MapR announces availability of its first free Apache Spark course as part of its Hadoop On-Demand Training program.  No word on quality, but it’s hard to beat the price.

(9) Cloudera Pushes HUE for Spark

On the Cloudera Engineering blog, Justin Kestelyn explains how to use HUE’s notebook app with SQL and Spark.

Big Analytics Roundup (March 2, 2015)

Here is a roundup of some recent Big Analytics news and analysis.

General

  • SiliconAngle covers the Big Data money trail.

Apache Spark

  • Curt Monash writes about Databricks and Spark on his DBMS2 blog.
  • On the Databricks blog, Dave Wang summarizes Spark highlights from Strata + Hadoop World.
  • In this post, Hammer Lab describes how to monitor Spark with Graphite and Grafana.
  • Cloudera announces Hive on Spark beta.
  • InfoWorld covers Spark’s planned support for R in Release 1.3.
  • Qubole announces Spark as a Service.

 Dato/GraphLab

  • Dato announces new version of GraphLab Create.

 H2O

  • From Strata + Hadoop World, Prithvi Pravu talks about using H2O.
  • Also from Strata, here is Cliff Click’s presentation on H2O, Spark and Python.
  • On the H2O blog, Arno Candel publishes a performance tuning guide for H2O Deep Learning.

 

 

2014 Predictions: Mid-Year Check

Back in January, I published this post with predictions for 2014.  Thought it would be fun to validate how well the crystal ball works.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

I wrote this just after attending the 2013 Spark Summit in December; it was clear then that Spark would own 2014.  But I had no idea just how fast Spark would catch fire.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation. 

The Apache Foundation announced top-level status for Spark in February; Cloudera announced immediate support for Spark in February, before it released CDH5; and every other Hadoop distributor followed suit.

At least one commercial software vendor will release software using Spark as a foundation.

There are now thirteen vendors with product certified on Spark.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

Not quite.  But the Mahout team has announced that all new projects must use a standard DSL that runs the job in Spark.

(2) “Co-location” will be the latest buzzword.

Well, not so much.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.  YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  

Co-locating your analytics in the Hadoop cluster is less attractive than integrating your analytics with Hadoop.  With Spark fully integrated with Hadoop storage APIs, co-located solutions seem much less attractive.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS has such deep pockets, one would think it unwise to bet against it.   And yet, seven months into HDP 2.0 and umpteen months into production for SAS HPA, SAS still can’t seem to produce a public success story for advanced analytics in Hadoop.

(3) Graph engines will be hot.

Meh.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

Graph analysis is really useful in the right hands, but organizations are still trying to figure out what to do with it.  That is why we still see posts like this; when something is hot, nobody writes articles about what to do with it; everyone knows what to do with it.

The other issue with graph analysis is that it’s not easy to learn.  Graph techniques are quite different from the predictive analytics algorithms most analysts learn, and the method tends to require specialized knowledge.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

Oops.  Tez isn’t really comparable to Giraph and GraphLab.  And right after I wrote this, the GraphLab open source project pretty much died.   GraphLab Inc., the commercial venture incepted to commercialize the open source project, is fiddling around with other stuff.   Meanwhile, top contributors to open source GraphLab are now working on Spark.

Since Apache Giraph has flatlined, Spark’s GraphX project appears to be the only game in town, at least in open source scalable graph analytics.

(4) R approaches parity with SAS in the commercial job market.

Hard to evaluate this one until Bob Muenchin updates his analysis for 2014.  But the trend is your friend:

fig_1b_rvsas_2014-2-23

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

Speaking with enterprise customers, I like to ask why they switched from SAS to R.  The #1 response: the people we hire know R already, not SAS.  SAS’ free “University Edition” is an attempt to stem the bleeding that might make a difference in ten years or so.

(5) SAP emerges as the company most likely to buy SAS.

Hmm.  Not really.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

After a flurry of announcements last fall (combined with optimistic predictions from SAS executives), all is quiet on the SAS+SAP front; my Google Alert grows cobwebs.  SAS has delivered an ACCESS engine to HANA but not much else considering the talk about joint solutions.  SAP bought a Platinum sponsorship at the 2014 SAS Global Forum, which is an improvement over 2013 when they didn’t show up at all.

Meanwhile, though, SAP continues to invest in HANA PAL and KXEN for predictive analytics, and recently announced support for Spark.   That makes the SAS/SAP alliance look more like a handshake than an embrace.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

Almost certainly not.  Goodnight brags that he’s “having too much fun to step down”, which is nice to know but misses the point; succession plans are only useful when they are transparent.  Anyone investing in SAS’ proprietary platform should wonder what happens next.

(6) Competition heats up for “easy to use” predictive analytics.

It’s a crowded market for “code-free” analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with AlteryxAlpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

According to Crunchbase, entrepreneurs have started 142 analytic startups in the past 18 months, and all of them want you to know that they make analytics easy.  The likely result is that analytics will be easy and cheap; tools for the casual user should cost no more than $500 per user.

Software firms like to target the easy analytics space because the fastest way to build a customer base is to attract new users who never used analytics in the past.  Experienced analysts tend to have established “sticky” preferences for analytic software, and switching is rare.

The obvious users to target already use BI tools, so the major BI players are all trying to embed analytics in their tooling; some have already done so.  For most of these startups, the best exit will be a tender offer from IBM.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.

This seems to be the trend.  Of the 142 startups mentioned above, 11 have completed two or more funding rounds.  Most of these, like MarketMuse, QuantifiedSkin and ThetaRay, offer highly specialized applications with embedded analytics.

Machine Learning in Hadoop: Part Two

This is the second of a three-part series on the current state of play for machine learning in Hadoop.  Part One is here.  In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines.   This post will focus on machine learning, but it’s worth nothing that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first category.  The RHadoop project, developed and supported by Revolution Analytics, enables the R user to specify and run MapReduce jobs from R and work directly with data in HDFS and HBase.  RHIPE, a project led by Mozilla’s Suptarshi Guha, offers similar functionality, but without the HBase integration.

Both projects enable R users to implement explicit parallelization in MapReduce.  R users write R scripts specifically intended to be run as mappers and reducers in Hadoop.  Users must have MapReduce skills, and must refactor program logic for distributed execution.  There are some differences between the two projects:

  • RHadoop uses standard R functions for Mapping and Reducing; RHIPE uses unevaluated R expressions
  • RHIPE users work with data in key,value pairs; RHadoop loads data into familar R data frames
  • As noted above, RHIPE lacks an interface to HBase
  • Commercial support is available for RHadoop users who license Revolution R Enterprise; there is no commercial support available for RHIPE

Two open source projects for distributed machine learning in Hadoop stand out from the others: 0xdata’s H2O and Apache Spark’s MLLib.  Both projects have commercial backing, and show robust development activity.  Statistics from GitHub for the thirty days ended February 12 show the following:

  • 0xdata H2O: 18 contributors, 938 commits
  • Apache Spark: 77 contributors, 794 commits

H2O is a project of startup 0xdata, which operates on a services and support business model.  Recent coverage by this blog here;  additional coverage here, here and here.

MLLib is one of several projects included in Apache Spark.  Databricks and Cloudera offer commercial support.  Recent coverage by this blog here and here; additional coverage here, here, here and here.

As of this writing, H2O has more built-in analytic features than MLLib, and its R interface is more mature.  Databricks is sitting on a pile of cash to fund development, but its efforts must be allocated among several Spark projects, while 0xdata is solely focused on machine learning.

Cloudera’s decision to distribute Spark is a big plus for the project, but Cloudera is also investing heavily in its partnership with other machine learning vendors, such as SAS.  There is also a clear conflict between Spark’s fast query project (Shark) and Cloudera’s own Impala project.  Like most platform vendors, Cloudera will be customer-driven in its approach to applications like machine learning.

Two other open source projects deserve honorable mention, Apache Mahout and Vowpal Wabbit.  Development activity on these projects is much less robust than for H2O and Spark.  GitHub statistics for the thirty days ended February 12 speak volumes:

  • Apache Mahout: contributors, 54 commits
  • Vowpal Wabbit: 8 contributors, 57 commits

Neither project has significant commercial backing.  Mahout is included in most Hadoop distributions, but distributors have done little to promote or contribute to the project.  (In 2013, Cloudera acquired Myrrix, one of the few companies attempting to commercialize Mahout).  John Langford of Microsoft Research leads the Vowpal Wabbit project, but it is a personal project not supported by Microsoft.

Mahout is relatively strong in unsupervised learning, offering a number of clustering algorithms; it also offers regular and stochastic singular value decomposition.  Mahout’s supervised learning algorithms, however, are weak.  Criticisms of Mahout tend to fall into two categories:

  • The project itself is a mess
  • Mahout’s integration into MapReduce is suitable only for high latency analytics

On the first point, Mahout certainly does seem eclectic, to say the least.  Some of the algorithms are distributed, others are single-threaded; others are simply imported from other projects.  Many algorithms are underdeveloped, unsupported or both.  The project is currently in a cleanup phase as it approaches 1.0 status; a number of underused and unsupported algorithms will be deprecated and removed.

“High latency” is code for slow.  Slow food is a thing; “slow analytics” is not a thing.  The issue here is that machine learning performance suffers from MapReduce’s need to persist intermediate results after each pass through the data; for competitive performance, iterative algorithms require an in-memory approach.

Vowpal Wabbit has its advocates among data scientists; it is fast, feature rich and runs in Hadoop.  Release 7.0 offers LDA clustering, singular value decomposition for sparse matrices, regularized linear and logistic regression, neural networks, support vector machines and sequence analysis.  Nevertheless, without commercial backing or a more active community, the project seems to live in a permanent state of software limbo.

In Part Three, we will cover commercial software for machine learning in Hadoop.

2014 Predictions: Advanced Analytics

A few predictions for the coming year.

(1) Apache Spark matures as the preferred platform for advanced analytics in Hadoop.

Spark will achieve top-level project status in Apache by July; that milestone, together with inclusion in Cloudera CDH5, will validate the project’s rapid maturation.  Organizations will increasingly question the value of “point solutions” for Hadoop analytics versus Spark’s integrated platform for machine learning, streaming, graph engines and fast queries.

At least one commercial software vendor will release software using Spark as a foundation.

Apache Mahout is so done that speakers at the recent Spark Summit didn’t feel the need to stick a fork in it.

(2) “Co-location” will be the latest buzzword.

Most analytic tools can connect with Hadoop, extract data and drag it across the corporate network to a server for processing; that capability is table stakes.  Few, however, can integrate directly with MapReduce for advanced analytics with little or no data movement.

YARN changes the picture, however, as it enables integration of MapReduce and non-MapReduce applications.  In practice, that means it will be possible to stand up co-located server-based analytics (e.g. SAS) on a few nodes with expanded memory inside Hadoop.  This asymmetric architecture adds some latency (since data moves from the HDFS data nodes to the analytic nodes), but not as much as when data moves outside of Hadoop entirely.  For most analytic use cases, the cost of data movement will be more than offset by the improved performance of in-memory iterative processing.

It’s no coincidence that Hortonworks’ partnership with SAS is timed to coincide with the release of HDP 2.0 and production YARN support.

SAS and HDP

(3) Graph engines will be hot.

Not that long ago, graph engines were exotic.  No longer: a wide range of maturing applications, from fraud detection and social media analytics to national security rely on graph engines for graph-parallel analytics.

GraphLab leads in the space, with Giraph and Tez well behind; Spark’s GraphX is still in beta.  GraphX has already achieved performance parity with Giraph and it has the advantage of integration with the other pieces of Spark.  As the category matures, analysts will increasingly see graph analysis as one more arrow in the quiver.

(4) R approaches parity with SAS in the commercial job market.

R already dominates SAS in broad-based analyst surveys, but SAS still beats R in commercial job postings.  But job postings for R programmers are rapidly growing, while SAS postings are declining.  New graduates decisively prefer R over SAS, and organizations increasingly recognize the value of R for “hard money” analytics.

(5) SAP emerges as the company most likely to buy SAS.

“Most likely” as in “only logical” suitor.  IBM no longer needs SAS, Oracle doesn’t think it needs SAS, and HP has too many other issues to address before taking on another acquisition.   A weak dollar favors foreign buyers, and SAS does substantial business outside the US.  SAP lacks street cred in analytics (and knows it), and is more likely to agree to Jim Goodnight’s inflated price and terms.

Will a transaction take place this year?   Hard to say; valuations are peaking, but there are obstacles to sale, as I’ve noted previously.

(6) Competition heats up for “easy to use” predictive analytics.

For hard money analytics, programming tools such as SAS and R continue to dominate.  But organizations increasingly seek alternatives to SAS and SPSS for advanced analytic tools that are (a) easy to use, and (b) relatively inexpensive to deploy on a broad scale.  SAS’ JMP and Statistica are existing players, with Alteryx, Alpine and RapidMiner entering the fray.  Expect more entrants as BI vendors expand offerings to support more predictive analytics.

Vertical and horizontal solutions will be key to success in this category.  It’s not enough to have a visual interface; “ease of use” means “ease of use in context”.   It is easier to develop a killer app for one use case than for many.  Competitive forces require smaller vendors to target use cases they can dominate and pursue a niche strategy.