Analytic Startups: 0xdata (Updated May 2014)
Updated May 22, 2014
0xdata (“Hexa-data”) is a small group of smart people from Stanford and Silicon Valley with VC backing and an open source software project for advanced analytics (H2O). Founded in 2011, 0xdata first appeared on analyst dashboards in 2012 and has steadily built a presence in the data science community since then.
0xdata operates on a services business model, and does not offer commercially licensed software. The firm has four public reference customers and claims more than 2,000 users. 0xdata has formal partnerships with Cloudera, Hortonworks, Intel and MapR.
0xdata’s H2O project is a library of distributed algorithms designed for deployment in Hadoop or free-standing clusters. 0xdata licenses H2O under the Apache 2.0 open source license. The development team is very active; in the thirty days ended May 22, 19 contributors pushed 783 commits to the project on Git.
The roadmap is aggressive; as of May 2014 the library includes:
- Summary Statistics
- Generalized Linear Models (Gaussian, Poisson, Binomial, Gamma and Tweedie)
- K-Means Clustering (Randomized, Plus-Plus and Furthest)
- “Random Forests” Ensemble Models
- Principal Components Analysis
- Gradient Boosting Regression and Classification
- Deep Learning with a multi-layer feedforward neural network (trained with SGD backpropagation)
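To make the three K-Means initialization options in the list concrete, here is a minimal pure-Python sketch of what "Randomized", "Plus-Plus" and "Furthest" seeding usually mean. It is not H2O code: the function names (`init_random`, `init_plus_plus`, `init_furthest`) are hypothetical, and it assumes the conventional definitions (uniform random picks, k-means++ sampling weighted by squared distance, and a greedy farthest-point rule), which may differ in detail from H2O's implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_sq_dist(p, centers):
    """Squared distance from p to the closest already-chosen center."""
    return min(sq_dist(p, c) for c in centers)

def init_random(points, k, rng):
    """'Randomized' (assumed meaning): k distinct points chosen uniformly."""
    return rng.sample(points, k)

def init_plus_plus(points, k, rng):
    """'Plus-Plus' (assumed meaning): k-means++ seeding.

    Each new center is sampled with probability proportional to its squared
    distance from the nearest existing center. Raises ValueError if every
    point coincides with a chosen center (all weights zero).
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [nearest_sq_dist(p, centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def init_furthest(points, k, rng):
    """'Furthest' (assumed meaning): greedy farthest-point seeding.

    The first center is random; each later center is the point that lies
    furthest from all centers chosen so far.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: nearest_sq_dist(p, centers)))
    return centers

# Two well-separated groups of points (made-up illustration data).
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
rng = random.Random(42)
for init in (init_random, init_plus_plus, init_furthest):
    print(init.__name__, init(points, 2, rng))
```

On well-separated data like this, the furthest-point rule always places the two seeds in different groups, while purely random seeding may put both in the same group; k-means++ sits in between by favoring, but not forcing, distant seeds.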
For Generalized Linear Models, K-Means and Gradient Boosting, H2O supports a Grid Search feature enabling users to specify multiple models for simultaneous development and comparison. This feature is a significant timesaver when the optimal model parameters are unknown (which is ordinarily the case).
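The idea behind a grid search can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not H2O's API: the helper names (`fit_ridge_1d`, `mse`, `grid_search`), the toy one-variable ridge regression standing in for a real model, and the data are all assumptions made for the example. Each combination of parameter values is fitted on training data and ranked by error on held-out data.

```python
from itertools import product

def fit_ridge_1d(xs, ys, lam, fit_intercept):
    """Closed-form one-variable ridge regression; returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n if fit_intercept else 0.0
    my = sum(ys) / n if fit_intercept else 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(grid, train, valid):
    """Fit one model per parameter combination; return (error, params) pairs
    sorted from lowest to highest validation error."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit_ridge_1d(*train, **params)
        results.append((mse(model, *valid), params))
    results.sort(key=lambda r: r[0])
    return results

# Made-up data generated from y = 2x + 1.
train = ([0, 1, 2, 3], [1, 3, 5, 7])
valid = ([4, 5], [9, 11])
grid = {"lam": [0.0, 1.0, 10.0], "fit_intercept": [True, False]}

for error, params in grid_search(grid, train, valid):
    print(f"{error:10.4f}  {params}")
```

The grid here has 3 x 2 = 6 combinations, so six models are fitted and compared. The sketch runs them one after another; a distributed system can build them in parallel, which is where the time savings described above come from.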
Users interact directly with the software through a web browser or REST API. Alternatively, R users can use the H2O.R package to invoke algorithms from RStudio or an alternative R development environment. Scala users can work with H2O through the Scalala library.
For Hadoop deployment, H2O supports CDH4.x, MapR 2.x and AWS EC2. H2O integrates with HDFS, and is co-located within Hadoop. At present, H2O supports CSV, Gzip-compressed CSV, MS Excel (XLS), ARFF, Hive file format, “and others”.
Each H2O algorithm supports scoring and prediction. There is currently no facility for PMML export; export is unnecessary if H2O is deployed in Hadoop, since one can simply use the native prediction capability.
In March 2014, the Apache Mahout project announced that it would support H2O.