Here is a quick review of the advanced analytics capabilities in Hadoop offered by five vendors at the recent Strata NYC conference:
0xData
- H2O (open source project)
- h2o (R package)
Smart people from Stanford with VC backing and a social media program. Services business model with open source software. H2O is an open source library of algorithms designed for deployment in Hadoop or free-standing clusters; aggressive vision, but currently available functionality is limited to GLM, k-Means and Random Forests. Update: 0xData just announced H2O 2.0, which includes distributed trees and regression, such as Gradient Boosting Machine (GBM), Random Forest (RF), Generalized Linear Modeling (GLM), k-Means and Principal Component Analysis (PCA). They also claim to run “100X faster than other predictive analytics providers”, although they offer no evidence to support this claim. R users can interface through the h2o package. Limited customer base. Partners with Cloudera and MapR.
- True open source model
- Comprehensive roadmap
- Limited functionality
- Limited user base
- Performance claims undocumented
Alpine Data Labs
Alpine targets a business user persona with a visual workflow-oriented interface (comparable to SAS Enterprise Miner or SPSS Modeler). Supports a reasonably broad range of analytic features. Claims to run “in” a number of databases and Hadoop distributions, but the company is opaque about how this works (it appears to be SQL/HiveQL push-down). In practice, most customers seem to use Alpine with Greenplum. Thin sales and a small customer base relative to the claimed feature mix suggest uncertainty about product performance and stability. Partners with Pivotal, Cloudera and MapR.
- Reasonable option for users already committed to Greenplum Database
- Limited partner and user ecosystem
- Performance and stability should be vetted thoroughly in POC
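To make the push-down idea mentioned for Alpine concrete, here is a minimal Python sketch of how a visual workflow could be compiled into a single SQL statement that runs inside the database, so that only the small result set leaves it. This illustrates the general technique only; it is not Alpine's implementation, which the company does not document. The `Filter` and `Aggregate` operator classes, the `compile_workflow` helper, and the table and column names are all hypothetical, and SQLite stands in for Greenplum or Hive.

```python
import sqlite3
from dataclasses import dataclass


# Hypothetical workflow operators, as a visual tool might represent them.
@dataclass
class Filter:
    condition: str      # e.g. "amount > 100"


@dataclass
class Aggregate:
    group_by: str       # column to group on
    expression: str     # e.g. "AVG(amount) AS avg_amount"


def compile_workflow(table, steps):
    """Fold a linear workflow into one SQL statement using nested subqueries."""
    sql = f"SELECT * FROM {table}"
    for step in steps:
        if isinstance(step, Filter):
            sql = f"SELECT * FROM ({sql}) AS t WHERE {step.condition}"
        elif isinstance(step, Aggregate):
            sql = (f"SELECT {step.group_by}, {step.expression} "
                   f"FROM ({sql}) AS t GROUP BY {step.group_by}")
        else:
            raise TypeError(f"unsupported step: {step!r}")
    return sql


# SQLite stands in for the target database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 50.0), ("east", 150.0), ("west", 200.0), ("west", 400.0)],
)

workflow = [
    Filter("amount > 100"),
    Aggregate("region", "AVG(amount) AS avg_amount"),
]
query = compile_workflow("sales", workflow)

# The whole workflow executes inside the database; only two rows come back.
rows = conn.execute(query + " ORDER BY region").fetchall()
print(query)
print(rows)
```

Whether a real product generates one nested statement or a series of temporary tables is a design choice that affects performance, which is one reason to vet this in a POC.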
Oracle
Oracle R Distribution (ORD) is a free distribution of R with bug fixes and performance enhancements; Oracle R Enterprise is a supported version of ORD with additional enhancements (detailed below).
Oracle Advanced Analytics (an option of Oracle Database Enterprise Edition) bundles Oracle Data Mining, a distributed data mining engine that runs in Oracle Database, and Oracle R Enterprise. Oracle Advanced Analytics provides an R to SQL transparency layer that maps R functions and algorithms to native in-database SQL equivalents. When in-database equivalents are not available, Oracle Advanced Analytics can run R commands under embedded R mode.
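The transparency-layer idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Oracle's implementation: the `RemoteColumn` class, the `SQL_EQUIVALENTS` and `LOCAL_FUNCTIONS` tables and the sample data are all hypothetical, and SQLite stands in for Oracle Database. A call with a known SQL equivalent is translated and run in the database; anything else falls back to running the original function over the rows, which here stands in very loosely for the embedded execution path.

```python
import sqlite3
import statistics

# Hypothetical mapping from client-side function names to SQL aggregates.
SQL_EQUIVALENTS = {"mean": "AVG", "sum": "SUM", "min": "MIN", "max": "MAX"}

# Hypothetical fallbacks used when no SQL equivalent exists.
LOCAL_FUNCTIONS = {"median": statistics.median}


class RemoteColumn:
    """Proxy for a database column; the data itself stays in the database."""

    def __init__(self, conn, table, column):
        self.conn = conn
        self.table = table
        self.column = column

    def apply(self, func_name):
        """Return (result, where_it_ran) for the named function."""
        if func_name in SQL_EQUIVALENTS:
            # Translate the call into its SQL equivalent and run it in place.
            sql = (f"SELECT {SQL_EQUIVALENTS[func_name]}({self.column}) "
                   f"FROM {self.table}")
            return self.conn.execute(sql).fetchone()[0], "in-database"
        if func_name in LOCAL_FUNCTIONS:
            # No SQL equivalent: run the original function over the rows.
            cursor = self.conn.execute(
                f"SELECT {self.column} FROM {self.table}")
            values = [row[0] for row in cursor]
            return LOCAL_FUNCTIONS[func_name](values), "fallback"
        raise ValueError(f"unknown function: {func_name}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?)",
                 [(1.0,), (2.0,), (3.0,), (10.0,)])

col = RemoteColumn(conn, "measurements", "value")
mean_result = col.apply("mean")      # translated to AVG(value)
median_result = col.apply("median")  # no SQL equivalent in this sketch
print(mean_result, median_result)
```

The point of the design is that the caller writes the same call either way; only the second element of the returned tuple reveals where the work happened.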
Oracle R Connector for Hadoop (ORCH) is an R interface to Hadoop; it enables the user to write MapReduce tasks in R and interface with Hive. As of ORCH 2.1.0, there is also a fairly rich collection of machine learning algorithms for supervised and unsupervised learning that can be pushed down into Hadoop.
- Good choice for Oracle-centric organizations
- Oracle Data Mining is a mature product with an excellent user interface
- Must move data from Hadoop to Oracle Database to leverage OAA
- Hadoop push-down from R requires expertise in MapReduce
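For readers unsure what the MapReduce expertise in the last point amounts to, here is a minimal sketch of the programming model in plain Python. The analyst has to recast the computation as a map function that emits key-value pairs and a reduce function that folds all values for one key. The `run_mapreduce` driver is a hypothetical local stand-in for the Hadoop framework, and the flight-delay records are made up; none of this is the ORCH API.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, value) pairs from one input record."""
    carrier, delay = record
    yield carrier, delay


def reduce_delays(carrier, delays):
    """Reduce step: fold all values for one key into a single result."""
    return carrier, sum(delays) / len(delays)


def run_mapreduce(records, mapper, reducer):
    """Local stand-in for the framework: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # the "shuffle": group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Made-up sample data: (carrier, delay in minutes).
flights = [("AA", 10), ("UA", 30), ("AA", 20), ("UA", 50)]
result = run_mapreduce(flights, map_record, reduce_delays)
print(result)
```

Even a per-group mean has to be decomposed this way, which is why push-down from R is a bigger step for most analysts than calling a packaged algorithm.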
SAS
- SAS/ACCESS Interface to Hadoop
- SAS Scoring Accelerator for Cloudera
- SAS Visual Analytics/SAS LASR Server
- SAS High Performance Analytics Server
SAS/ACCESS Interface to Hadoop enables SAS users to pass Hive, Pig or MapReduce commands to Hadoop through a connection and move the results back to the SAS server. With SAS/ACCESS you can haul your data out of Hadoop, plug it into SAS and use a bunch of other SAS products, but that architecture is pretty much a non-starter for most Strata attendees. Update: SAS has announced SAS/ACCESS for Impala.
Visual Analytics is a Tableau-like visualization tool with limited predictive analytic capabilities; LASR Server is the in-memory back end for Visual Analytics. High Performance Analytics is a suite of distributed in-memory analytics. LASR Server and HPA Server can be co-located in a Hadoop cluster, but require special hardware. Partners with Cloudera and Hortonworks.
- Legacy SAS connects to Hadoop, does not run in Hadoop
- SAS/ACCESS users must know exact Hive, Pig or MapReduce syntax
- Visual Analytics cannot work with “raw” data in Hadoop
- Minimum hardware requirements for LASR and HPA significantly exceed standard Hadoop worker node specs
- High TCO, proprietary architecture for all SAS products
Skytree
Academic machine learning project (FastLab, at Georgia Tech); with VC backing, launched as a commercial software vendor in January 2013. Server-based technology that can connect to a range of data sources, including Hadoop. Programming interface; claims ability to run from R, Weka, C++ and Python. Good library of algorithms. Partners with Cloudera, Hortonworks and MapR. Skytree is opaque about technology and performance claims.
- Limited customer base, no announced sales since company launch
- Hadoop integration is a connection, not an “inside” architecture
- Performance claims should be carefully vetted