Analytic Startups: Skytree

Skytree started out as an academic machine learning project developed at Georgia Tech’s Fastlab.  Leadership shopped the software to a number of software vendors prior to 2011 and, finding no buyers, launched as a standalone venture in 2012.

In April 2013, Skytree announced Series A funding of $18 million, with backing from U.S. Venture Partners, UPS, Javelin Venture Partners and Osage University Partners.  The company shows 18 U.S. employees on LinkedIn.

Skytree’s public reference customers include Adconion, Brookfield Residential Property Services, CANFAR, eHarmony, SETI Institute and United States Golf Association.  This customer list did not change in 2013 despite significant investment in marketing and sales.

Skytree has formally partnered with Cloudera, Hortonworks and MapR.

Compared to its peers, Skytree reveals very little about its technology, which is generally a yellow flag.

Skytree’s principal product is Skytree Server, a server-based library of distributed algorithms.  Skytree claims to support the following techniques:

  • Support Vector Machines (SVM)
  • Nearest Neighbor
  • K-Means
  • Principal Component Analysis (PCA)
  • Linear Regression
  • Two-Point Correlation
  • Kernel Density Estimation (KDE)
  • Gradient Boosted Trees
  • Random Forests

Skytree does not show images or videos of its user interface anywhere on its website.  The implication is that it lacks a visual interface, and programming is required.  Skytree claims a web services interface as well as interfaces to R, Weka, C++ and Python.

For data sources, Skytree claims the ability to connect to relational databases (presumably through ODBC); Hadoop (presumably HDFS); and to consume data from flat files and “common statistical packages”.

Skytree claims the ability to deploy on commodity Linux servers in local, cluster, cloud or Hadoop configurations.  (Absent YARN support, though, the latter will be a “beside” architecture, with data movement.)

A second product, Skytree Advisor, launched in Beta in September.  Skytree Advisor is mostly interesting for what it reveals about Skytree Server.  The product includes some unique capabilities, including the ability to produce an actual report, but the user interface evokes a blue screen of death.   The status of this offering seems to be in doubt, as Skytree no longer promotes it.

Strata Report: Advanced Analytics in Hadoop

Here is a quick review of the capabilities for advanced analytics in Hadoop for five vendors at the recent Strata NYC conference:

0xData



  • H2O (open source project)
  • h2o (R package)


Smart people from Stanford with VC backing and a social media program.  Services business model with open source software.  H2O is an open source library of algorithms designed for deployment in Hadoop or free-standing clusters; aggressive vision, but currently available functionality is limited to GLM, k-Means and Random Forests.  Update: 0xData just announced H2O 2.0, which includes distributed tree and regression methods such as Gradient Boosting Machine (GBM), Random Forest (RF) and Generalized Linear Modeling (GLM), along with k-Means and Principal Component Analysis (PCA).  They also claim to run “100X faster than other predictive analytics providers”, although this claim is not supported by evidence.  R users can interface through the h2o package.  Limited customer base.  Partners with Cloudera and MapR.

Key Points

  • True open source model
  • Comprehensive roadmap
  • Limited functionality
  • Limited user base
  • Performance claims undocumented

Alpine Data Labs


  • Alpine 2.8


Alpine targets a business user persona with a visual workflow-oriented interface (comparable to SAS Enterprise Miner or SPSS Modeler).   Supports a reasonably broad range of analytic features.  Claims to run “in” a number of databases and Hadoop distributions, but company is opaque about how this works.  (Appears to be SQL/HiveQL push-down).   In practice, most customers seem to use Alpine with Greenplum.  Thin sales and customer base relative to claimed feature mix suggests uncertainty about product performance and stability.  Partners with Pivotal, Cloudera and MapR.
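To make the push-down idea concrete, here is a minimal sketch, assuming (as guessed above) that a modeling step is translated into SQL aggregates and executed by the database itself.  This is not Alpine's implementation, which the company does not document.  The table `sales`, its columns, and the helpers `fit_line_pushdown` and `demo` are hypothetical, and SQLite stands in for Greenplum or Hive.

```python
import sqlite3


def fit_line_pushdown(conn, table, x_col, y_col):
    """Fit y = a + b*x using only SQL aggregates computed inside the database.

    Only five numbers leave the database, regardless of table size.
    Identifiers are interpolated directly, so they must come from trusted code.
    """
    sql = (
        f"SELECT COUNT(*), SUM({x_col}), SUM({y_col}), "
        f"SUM({x_col} * {x_col}), SUM({x_col} * {y_col}) FROM {table}"
    )
    n, sx, sy, sxx, sxy = conn.execute(sql).fetchone()
    # Closed-form least squares from the sufficient statistics.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def demo():
    """Build a toy table (hypothetical data) and fit a line to it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
    rows = [(float(x), 3.0 + 2.0 * x) for x in range(10)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return fit_line_pushdown(conn, "sales", "ad_spend", "revenue")
```

The design point is that the client never sees the rows, only the aggregates.  Algorithms that cannot be reduced to a handful of aggregate queries are much harder to push down, which is one reason to vet performance claims in a POC.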

Key Points

  • Reasonable option for users already committed to Greenplum Database
  • Limited partner and user ecosystem
  • Performance and stability should be vetted thoroughly in POC

Oracle




Oracle R Distribution (ORD) is a free distribution of R with bug fixes and performance enhancements; Oracle R Enterprise is a supported version of ORD with additional enhancements (detailed below).

Oracle Advanced Analytics (an option of Oracle Database Enterprise Edition) bundles Oracle Data Mining, a distributed data mining engine that runs in Oracle Database, and Oracle R Enterprise.   Oracle Advanced Analytics provides an R to SQL transparency layer that maps R functions and algorithms to native in-database SQL equivalents.  When in-database equivalents are not available, Oracle Advanced Analytics can run R commands under embedded R mode.

Oracle R Connector for Hadoop (ORCH) is an R interface to Hadoop; it enables the user to write MapReduce tasks in R and interface with Hive.  As of ORCH 2.1.0, there is also a fairly rich collection of machine learning algorithms for supervised and unsupervised learning that can be pushed down into Hadoop.
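For readers unfamiliar with what writing a MapReduce task involves, here is a minimal single-process sketch of the programming model.  It is written in Python rather than R and does not use Oracle's API; the function names `mapper`, `reducer` and `run_job` and the sample records are hypothetical.

```python
from collections import defaultdict


def mapper(record):
    """Emit (key, value) pairs for one input record: here (region, amount)."""
    region, amount = record
    yield region, amount


def reducer(key, values):
    """Combine all values for one key: here the mean amount per region."""
    values = list(values)
    return key, sum(values) / len(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


records = [("east", 10.0), ("west", 4.0), ("east", 20.0)]
result = run_job(records, mapper, reducer)
```

The user has to recast the analysis as independent per-record map steps and per-key reduce steps.  On a real cluster those steps run in parallel across nodes, but the decomposition is the same.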

Key Points

  • Good choice for Oracle-centric organizations
  • Oracle Data Mining is a mature product with an excellent user interface
  • Must move data from Hadoop to Oracle Database to leverage OAA
  • Hadoop push-down from R requires expertise in MapReduce

SAS



  • SAS/ACCESS Interface to Hadoop
  • SAS Scoring Accelerator for Cloudera
  • SAS Visual Analytics/SAS LASR Server
  • SAS High Performance Analytics Server


SAS/ACCESS Interface to Hadoop enables SAS users to pass Hive, Pig or MapReduce commands to Hadoop through a connection and move the results back to the SAS server.   With SAS/ACCESS you can haul your data out of Hadoop, plug it into SAS and use a bunch of other SAS products, but that architecture is pretty much a non-starter for most Strata attendees.   Update:  SAS has announced SAS/ACCESS for Impala.

Visual Analytics is a Tableau-like visualization tool with limited predictive analytic capabilities; LASR Server is the in-memory back end for Visual Analytics.  High Performance Analytics is a suite of distributed in-memory analytics.   LASR Server and HPA Server can be co-located in a Hadoop cluster, but require special hardware.  Partners with Cloudera and Hortonworks.

Key Points

  • Legacy SAS connects to Hadoop, does not run in Hadoop
  • SAS/ACCESS users must know exact Hive, Pig or MapReduce syntax
  • Visual Analytics cannot work with “raw” data in Hadoop
  • Minimum hardware requirements for LASR and HPA significantly exceed standard Hadoop worker node specs
  • High TCO, proprietary architecture for all SAS products

Skytree



  • Skytree Server


Academic machine learning project (FastLab, at Georgia Tech); with VC backing, launched as commercial software vendor January 2013.  Server-based technology, can connect to a range of data sources, including Hadoop.  Programming interface; claims ability to run from R, Weka, C++ and Python.  Good library of algorithms.  Partners with Cloudera, Hortonworks, MapR.  Skytree is opaque about technology and performance claims.

Key Points

  • Limited customer base, no announced sales since company launch
  • Hadoop integration is a connection, not “inside” architecture
  • Performance claims should be carefully vetted