The Year in Machine Learning (Part Four)

This is the fourth installment in a four-part review of 2016 in machine learning and deep learning.

— Part One covered Top Trends in the field, including concerns about bias, interpretability, deep learning’s explosive growth, the democratization of supercomputing, and the emergence of cloud machine learning platforms.

— Part Two surveyed significant developments in Open Source machine learning projects, such as R, Python, Spark, Flink, H2O, TensorFlow, and others.

— Part Three reviewed the machine learning and deep learning initiatives of Big Tech Brands, industry leaders with significant budgets for software development and marketing.

In Part Four, I profile eleven startups in the machine learning and deep learning space. A search for “machine learning” in Crunchbase yields 2,264 companies. This includes companies, such as MemSQL, who offer absolutely no machine learning capability but hype it anyway because Marketing; it also includes application software and service providers, such as Zebra Medical Imaging, who build machine learning into the services they provide.

All of the companies profiled in this post provide machine learning tools as software or services for data scientists or for business users. Within that broad definition, the firms are highly diverse:

Continuum Analytics, Databricks, and H2O.ai drive open source projects (Anaconda, Apache Spark, and H2O, respectively) and deliver commercial support.

Alpine Data, Dataiku, and Domino Data Lab offer commercially licensed collaboration tools for data science teams. All three run on top of an open source platform.

KNIME and RapidMiner originated in Europe, where they have large user communities. Both combine a business user interface with the ability to work with Big Data platforms.

Fuzzy Logix and Skytree provide specialized capabilities primarily for data scientists.

DataRobot delivers a fully automated workflow for predictive analytics that appeals to data scientists and business users. It runs on an open source platform.

Four companies deserve an “honorable mention” but I haven’t profiled them in depth:

— Two startups, BigML and SkyMind, are still in seed funding stage. I don’t profile them below, but they are worth watching. BigML is a cloud-based machine learning service; SkyMind drives the DL4J open source project for deep learning.

— Two additional companies aren’t startups because they’ve been in business for more than thirty years. Salford Systems developed the original software for CART and Random Forests; the company has added more techniques to its suite over time and has a loyal following. Statistica, recently jettisoned by Dell, delivers a statistical package with broad capabilities; the company consistently performs well in user satisfaction surveys.

I’d like to take a moment to thank those who contributed tips and ideas for this series, including Sri Ambati, Betty Candel, Leslie Miller, Bob Muenchen, Thomas Ott, Peter Prettenhofer, Jesus Puente, Dan Putler, David Smith, and Oliver Vagner.

Alpine Data

In 2016, the company formerly known as Alpine Data Labs changed its name and CEO. Alpine dropped the “Labs” from its brand — I guess they didn’t want to be confused with companies that test stool samples — so now it’s just Alpine Data. And, ex-CEO Joe Otto is now an “Advisor,” replaced by Dan Udoutch, a “seasoned executive” with 30+ years of experience in business and zero years of experience in machine learning or advanced analytics. The company also dropped its CFO and head of Sales during the year, presumably because the investors were extremely happy with Alpine’s business results.

Alpine originally built its software to run in the Greenplum database, then ported some of its algorithms to MapReduce in early 2013. Riding a wave of Hadoop buzz, Alpine closed on a venture round in November 2013, just in time for everyone to realize that MapReduce sucks for machine learning. The company quickly turned to Spark (Databricks certified Alpine on Spark in 2014) and has gradually ported its analytics operators to the new framework.


It seems that rebuilding on Spark has been a bit of a slog because Alpine hasn’t raised a fresh round of capital since 2013. As a general rule, startups that make their numbers get fresh rounds every 12-24 months; companies that don’t get fresh funding likely aren’t making their numbers. Investors aren’t stupid and, like the dog that did not bark, a venture capital round that does not happen says a lot about a company’s prospects.

In product news, the company announced Chorus 6, a major release, in May, and Chorus 6.1 in September. Enhancements in the new releases include:

— Integration with Jupyter notebooks.

— Additional machine learning operators.

— Spark auto-tuning. Chorus pushes processing to Spark, and Alpine has developed an optimizer to tune the generated Spark code.

— PFA support for model export. This is excellent, a cutting edge feature.

— Runtime performance improvements.

— Tweaks to the user experience.

Lawrence Spracklen, Alpine’s VP of Engineering, will speak about Spark auto-tuning at the Spark Summit East in Boston.

Prospective users and customers should look for evidence that Alpine is a viable company, such as a new funding round, or audited financials that show positive cash flow.

Continuum Analytics

Continuum Analytics develops and supports Anaconda, an open source Python distribution for data science. The core Anaconda bundle includes Navigator, a desktop GUI that manages applications, packages, environments and channels; 150 Python packages that are widely used in data science; and performance optimizations. Continuum also offers commercially licensed extensions to Anaconda for scalability, high performance and ease of use.


Anaconda 2.5, announced in February, introduced performance optimization with the Intel® Math Kernel Library. Beginning with this release, Continuum bundled Anaconda with Microsoft R Open, an enhanced free R distribution.

In 2016, Continuum introduced two major additions to the Anaconda platform:

— Anaconda Enterprise Notebooks, an enhanced version of Jupyter notebooks.

— Anaconda Mosaic, a tool for cataloging heterogeneous data.

The company also announced partnerships with Cloudera, Intel, and IBM. In September, Continuum disclosed $4 million in equity financing. The company was surprisingly quiet about the round — there was no press release — possibly because it was undersubscribed.

Continuum’s AnacondaCon 2017 conference meets in Austin February 7-9.

Databricks

Databricks leads the development of Apache Spark (profiled in Part Two of this review) and offers a cloud-based managed service built on Spark. The company also offers training and certification, and organizes the Spark Summits.

The team that originally developed Spark founded Databricks in 2013. Company employees continue to play a key role in Apache Spark, holding a plurality of the seats on the Project Management Committee and contributing more new code to the project than any other company.


In 2016, Databricks added a dashboarding tool and a RESTful interface for job and cluster management to its core managed service. The company made major enhancements to the Databricks security framework, completed SOC 2 Type 1 certification for enterprise security, and announced HIPAA compliance and availability in Amazon Web Services’ GovCloud for sensitive data and regulated workloads.

Databricks also launched a free Community edition and a five-part series of free MOOCs, completed its annual survey of the Spark user community, and organized three Spark Summits.

In December, Databricks announced a $60 million “C” round of venture capital. New Enterprise Associates led the round; Andreessen Horowitz participated.

Dataiku

Dataiku develops and markets Data Science Studio (DSS), a workflow and collaboration environment for machine learning and advanced analytics. Users interact with the software through a drag-and-drop interface; DSS pushes processing down to Hadoop and Spark. The product includes connectors to a wide variety of file systems, SQL platforms, cloud data stores and NoSQL databases.


In 2016, Dataiku delivered Releases 3.0 and 3.1. Major new capabilities include H2O integration (through Sparkling Water); additional data sources (IBM Netezza, SAP HANA, Google BigQuery, and Microsoft Azure Data Warehouse); added support for Spark MLlib algorithms; performance improvements; and many other enhancements.

In October, Dataiku closed on a $14 million “A” round of venture capital. FirstMark Capital led the financing, with participation from Serena Capital.

DataRobot

DataRobot, a Boston-based startup founded by insurance industry veterans, offers an automated machine learning platform that combines built-in expertise with a test-and-learn approach.  Leveraging an open source back end, the company’s eponymous software searches through combinations of algorithms, pre-processing steps, features, transformations and tuning parameters to identify the best model for a particular problem.


The company has a team of Kaggle-winning data scientists and leverages this expertise to identify new machine learning algorithms, feature engineering techniques, and optimization methods. In 2016, DataRobot added several new capabilities to its product, including support for Hadoop deployment, deep learning with TensorFlow, reason codes that explain predictions, feature impact analysis, and additional capabilities for model deployment.

DataRobot also announced major alliances with Alteryx and Cloudera. Cloudera awarded the company its top-level certification: the software integrates with Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels.

Earlier in the year, DataRobot closed on $33 million in Series B financing. New Enterprise Associates led the round; Accomplice, Intel Capital, IA Ventures, Recruit Strategic Partners, and New York Life also participated.

Domino Data Lab

Domino Data Lab offers the Domino Data Science Platform (DDSP), a scalable collaboration environment that runs on-premises, in virtual private clouds, or hosted on Domino’s AWS infrastructure.


DDSP provides data scientists with a shared environment for managing projects, scalable computing with a variety of open source and commercially licensed software, job scheduling and tracking, and publication through Shiny and Flask. Domino supports rollbacks, revision history, version control, and reproducibility.

In November, Domino announced that it closed a $10.5 million “A” round led by Sequoia Capital. Bloomberg Beta, In-Q-Tel, and Zetta Venture Partners also participated.

Fuzzy Logix

Fuzzy Logix markets DB Lytix, a library of more than eight hundred functions for machine learning and advanced analytics.  Functions run as database table functions in relational databases (Informix, MySQL, Netezza, ParAccel, SQL Server, Sybase IQ, Teradata Aster and Teradata Database) and in Hadoop through Hive.

Users invoke DB Lytix functions from SQL, from R, through BI tools, or from custom web interfaces. Functions support a broad range of machine learning capabilities, including feature engineering, model training with a rich mix of supported algorithms, plus simulation and Monte Carlo analysis. All functions support native in-database scoring. The software is highly extensible, and Fuzzy Logix offers a team of well-qualified consultants and developers for custom applications.
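
For readers who haven’t seen in-database scoring, here is a minimal sketch of the general idea. It is not DB Lytix: it uses SQLite from the Python standard library as a stand-in database, registers a scalar function rather than a table function, and the function name churn_score, the table, and the coefficients are all hypothetical.

```python
import math
import sqlite3

# Hypothetical logistic-regression coefficients, as if trained elsewhere.
INTERCEPT, W_AGE, W_BALANCE = -3.0, 0.05, 0.0002

def score(age, balance):
    """Logistic score for one row; the SQL engine calls this per row."""
    z = INTERCEPT + W_AGE * age + W_BALANCE * balance
    return 1.0 / (1.0 + math.exp(-z))

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as churn_score(age, balance).
conn.create_function("churn_score", 2, score)
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, balance REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30.0, 1000.0), (2, 60.0, 0.0), (3, 45.0, 5000.0)],
)

# The model is applied by a plain SQL query against the stored table.
rows = conn.execute(
    "SELECT id, churn_score(age, balance) FROM customers ORDER BY id"
).fetchall()
for customer_id, probability in rows:
    print(customer_id, round(probability, 3))
```

The point of the pattern is that the analyst writes SQL and the scoring logic travels to the data, instead of the data being extracted to a separate scoring server.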

In April, the company announced the availability of DB Lytix on Teradata Aster Analytics, a development that excited all three of the people who think Aster has legs.

H2O.ai

H2O.ai develops and supports H2O, the open source machine learning project I profiled in Part Two of this review. As I noted in Part Two, H2O.ai updated Sparkling Water, its Spark integration, for Spark 2.0; released Steam, a model deployment framework, to production; and previewed Deep Water, an interface to GPU-accelerated back ends for deep learning.


In 2016, H2O.ai added 3,200 enterprise organizations and over 43,000 users to its roster, bringing its open source community to over 8,000 enterprises and nearly 70,000 users worldwide. In the annual KDnuggets poll of data scientists, reported usage tripled. New customers include Kaiser Permanente, Progressive, Comcast, HCA, McKesson, Macy’s, and eBay.

KNIME

KNIME.com AG, a commercial enterprise based in Zurich, Switzerland, distributes the KNIME Analytics Platform under a GPL license with an exception permitting third parties to use the API for proprietary extensions. The KNIME Analytics Platform features a graphical user interface with a workflow metaphor.  Users build pipelines of tasks with drag-and-drop tools and run them interactively or in batch.


KNIME offers commercially licensed extensions for scalability, integration with data platforms, collaboration, and productivity. The company provides technical support for the extension software.

During the year, KNIME delivered two dot releases and three maintenance releases. The new features added to the open source edition in Releases 3.2 and 3.3 include Workflow Coach, a recommender based on community usage statistics; streaming execution; feature selection; ensembles of trees and gradient boosted trees; deep learning with DL4J, and many other enhancements. In June, KNIME launched the KNIME Cloud Analytics Platform on Microsoft Azure.

KNIME held its first Summit in the United States in September and announced an online training course available through O’Reilly Media.

RapidMiner

RapidMiner, Inc. of Cambridge, Massachusetts, develops and supports RapidMiner, an easy-to-use package for business analysis, predictive analytics, and optimization. The company launched in 2006 (under the corporate name of Rapid-I) to drive development, support, and distribution for the RapidMiner software project. The company moved its headquarters to the United States in 2013.


The desktop version of the software, branded as RapidMiner Studio, is available in free and commercially licensed editions.  RapidMiner also offers a commercially licensed Server edition, and Radoop, an extension that pushes processing down to Hive, Pig, Spark, and H2O.

RapidMiner introduced Release 7.x in 2016 with an updated user interface. Other enhancements in Releases 7.0 through 7.3 include a new data import facility, Tableau integration, parallel cross-validation, and H2O integration (featuring deep learning, gradient boosted trees and generalized linear models).

The company also introduced a feature called Single Process Pushdown. This capability enables RapidMiner users to supplement native Spark and H2O algorithms with RapidMiner pipelines for execution in Hadoop. RapidMiner supports Spark 2.0 as of Release 7.3.

In January 2016, RapidMiner closed a $16 million equity round led by Nokia Growth Partners. Ascent Venture Partners, Earlybird Venture Capital, Longworth Venture Partners, and OpenOcean also participated.

Skytree

Skytree Inc. develops and markets an eponymous commercially licensed software package for machine learning. Its founders launched the venture in 2012 to monetize an academic machine learning project (Georgia Tech’s FastLab).


The company landed an $18 million venture capital round in 2013 and hasn’t secured any new funding since then. (Read my comments under Alpine Data to see what that indicates.) Moreover, the underlying set of algorithms does not seem to have changed much since then, though Skytree has added and dropped several different add-ons and wrappers.

Users interact with the software through the Skytree Command Line Interface (CLI), Java and Python APIs or a browser-based GUI. Output includes explanations of the model in plain English. Skytree has a grid search feature for parameterization, which it trademarks as AutoModel, labels as “ground-breaking” and is attempting to patent. Analysts who don’t know anything about grid search think this is amazing.
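
For anyone who hasn’t met grid search, the whole technique fits in a few lines. The sketch below is a generic illustration, not Skytree’s AutoModel; the scoring function toy_score and the parameter names are hypothetical stand-ins for "fit a model with these settings and return its validation score."

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every combination in `grid`; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for training a model and scoring it on held-out data.
def toy_score(depth, learning_rate):
    return 1.0 - (depth - 4) ** 2 * 0.01 - (learning_rate - 0.1) ** 2

best_params, best_score = grid_search(
    toy_score,
    {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 1.0]},
)
print(best_params, best_score)
```

The cost is the product of the grid sizes (nine model fits here), which is why the method is simple to implement and expensive to scale.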

In 2016, Skytree introduced a freemium edition, branded as Skytree Express. Hold out another six months and they’ll pay you to try it.

As is the case with Alpine Data, if you like Skytree’s technology, wait for another funding round or ask the company to provide evidence of positive cash flow.

Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.


Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents conducted among a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report.

Top Read of the Week

Guoqiang Jerry Chen, et al. explain real-time data processing at Facebook. Adrian Colyer summarizes.

Explainers

— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.

Perspectives

— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.
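
To give a feel for what "separable" buys you, here is a toy consensus ADMM in plain Python. It does not use the Spark ADMM API, and the problem is deliberately trivial: each block i holds one number a_i and minimizes 0.5 * (x_i - a_i)^2, subject to all blocks agreeing on a common value z. The function name and the parameter rho are my own choices for illustration.

```python
def admm_consensus_mean(a, rho=1.0, iterations=60):
    """Minimize sum_i 0.5 * (x_i - a_i)**2 subject to x_i == z for all i."""
    n = len(a)
    z = 0.0
    u = [0.0] * n  # scaled dual variables, one per block
    for _ in range(iterations):
        # x-update: each block solves its own small problem independently,
        # which is the step a cluster can run in parallel.
        x = [(a_i + rho * (z - u_i)) / (1.0 + rho) for a_i, u_i in zip(a, u)]
        # z-update: gather the blocks and average them.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Dual update: accumulate each block's disagreement with the consensus.
        u = [u_i + x_i - z for u_i, x_i in zip(u, x)]
    return z

z = admm_consensus_mean([1.0, 2.0, 6.0])
print(z)
```

The consensus value converges to the mean of the inputs, which is the known answer for this problem. In a real deployment the x-update would be a model fit on one partition of the data rather than a one-line formula.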

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost the year before. At that rate, Talend may break even in 2020, if nothing else happens in the interim.

Big Analytics Roundup (May 9, 2016)

The big news this week: Teradata’s CEO Mike Koehler walks the plank. TDC stock rises 21% on dismal numbers, which demonstrates how much Wall Street values leadership.

CRN releases its fourth annual Big Data 100 in listicle form to maximize clicks. Criteria for inclusion are “editor’s picks”, so whatever. I got through the As before giving up.

Dave Ramel details five leading Apache Big Data projects: Spark, Tez, Bigtop, REEF and Storm. What? It’s a nice summary of each, but Ramel is a slave to Apache’s silly classifications.

Bullshit Benchmarks

Here are four rules for benchmarks.

  1. Use a standard test protocol, such as TPC-DS.
  2. When there is no available standard, test multiple use cases. Make a decent effort to try a variety of workloads.
  3. Communicate with sponsors for all benchmarked software, or communicate with none of them.
  4. Publish your code and your data. (There’s this thing called GitHub….)
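
Rules 2 and 4 are cheap to follow. As a minimal sketch of what "multiple use cases, published code" can look like, the harness below times every engine on every workload and keeps the best of several runs. The two "engines" and three workloads are hypothetical stand-ins (two ways of computing a sum of squares in plain Python), not Spark or Dataflow.

```python
import time

def run_benchmark(engines, workloads, repeats=3):
    """Time every engine on every workload.

    Returns {(engine_name, workload_name): best_seconds}.
    """
    results = {}
    for engine_name, run in engines.items():
        for workload_name, payload in workloads.items():
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                run(payload)
                timings.append(time.perf_counter() - start)
            # Keep the fastest run to reduce noise from the host machine.
            results[(engine_name, workload_name)] = min(timings)
    return results

# Hypothetical stand-ins: two implementations of the same task.
engines = {
    "generator": lambda data: sum(x * x for x in data),
    "map": lambda data: sum(map(lambda x: x * x, data)),
}
# Several workloads of different size and shape, per rule 2.
workloads = {
    "small": list(range(1_000)),
    "reversed": list(range(10_000, 0, -1)),
    "repetitive": [i % 7 for i in range(50_000)],
}

results = run_benchmark(engines, workloads)
for (engine_name, workload_name), seconds in sorted(results.items()):
    print(f"{engine_name:10s} {workload_name:10s} {seconds:.6f}s")
```

A result table with one row per engine and workload makes it obvious when a headline claim rests on a single favorable case.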

The ironically named Mammoth Data (current headcount: 15) violates all four rules in a Google-commissioned “study,” which concludes that Cloud Dataflow runs one use case faster than Spark. Professional cat herder Andrew Oliver replaces his Mammoth CEO hat with his analyst hat and touts the results.

Go to the back of the class, Andrew. Run more use cases, discuss results with the Spark team as well as the Google team, then let us know what you learned. I don’t doubt that Dataflow is a nifty tool, and look forward to seeing a benchmark we can trust.

Explainers

— Adrian Colyer focuses on time series:

  • Gorilla: a fast, scalable, in-memory time series database.
  • BTrDB (Berkeley Tree Database), optimized storage for time series processing.
  • The Tarzan algorithm, a technique that discovers surprising patterns in a time series database. (Fixed link — h/t Oliver Vagner).

— On BrightTalk, Databricks’ Reynold Xin explains the new bits in Spark 2.0, to be released soon.

— On the DataRobot blog, Quantopian’s Thomas Wiecki explains how to predict out-of-sample performance for trading algorithms.

— Indeed.com’s Preetha Appan explains algorithms and architecture for recommendation engines.

— In a webcast, Sean Owen and Yann Delacourt explain real-time analytics with Spark.

— Microsoft’s Lixun Zhang explains the differences among open source R, Microsoft R Open and Microsoft R Server.

Perspectives

— In Datanami, George Leopold profiles DataRobot, a machine learning startup. One point he gets wrong: DataRobot runs on Hadoop in the cloud, and it also runs on Hadoop on premises.

— On the Google Cloud blog, Tyler Akidau offers Google’s perspective on why they moved Cloud Dataflow development to Apache Beam. DataArtisans chirps support. Here’s what OpenHub has to say about Apache Beam:

[Screenshot: OpenHub project summary for Apache Beam]

— In WSJ’s CIO Journal, Steven Norton interviews Airbnb’s Mike Curtis, who name-drops Apache Spark. In the same venue, Clint Boulton previously reported that Airbnb uses Spark in its Aerosolve project.

— Jim O’Reilly offers a summary of the differences among AWS, Azure and Google Cloud.

— On the Qubole blog, Monique Chmiel tries to summarize the pros and cons of Python, R and Scala for Big Data, and largely fails. None of the three is suitable for Big Data on its own, so you have to evaluate them for their APIs to scalable platforms like Spark. As of today, the Spark APIs for Scala and Python are clearly superior to the R API.

Commercial Announcements

News from commercial software providers, as well as commercial vendors that operate on an open source software model.

— Hortonworks announces that it lost $1.59 for every dollar it sold in Q1, which is slightly better than the $1.85 it lost in Q1 of 2015. At that rate (an improvement of about 26 cents a year), look for HDP to break even in 2022 or so, unless they run out of cash first. Wall Street drives stock down 18%.

— Teradata fires CEO, Wall Street celebrates. Don’t party too hard, guys; the numbers still stink.

Stuff I Really Don’t Care About

— Basho releases Riak TS to open source.

Big Analytics Roundup (March 28, 2016)

Microsoft’s chatbot fail wins the internet this week, but the most important story is Google’s new managed service for machine learning. Also leading the week: Mesosphere’s new funding round led by Microsoft and HPE, and more funding for Domo.

— Google Cloud Platform (GCP) adds the Google Cloud Machine Learning Platform to its suite of managed machine learning services, which already includes Google Cloud Vision API (Beta); Google Cloud Speech API (Limited Preview); and Google Cloud Translate API. GCP still offers the Prediction API, but it’s no longer a top-level service. The Machine Learning platform, currently in Limited Preview, works with TensorFlow models that you train offline and Dataflow for preprocessing, so you can work with data in Google Cloud Storage, BigQuery and other sources. It’s an impressive stack. A cloud of speculation and navel-gazing ensues.

— Mesosphere announces that it has closed a $73.5 million Series C round, with Microsoft and Hewlett Packard Enterprise taking lead roles. Mesosphere also announces version 1.0 of Marathon, a container orchestration service for DCOS, and a new product for source code management called Velocity.

— Domo announces that it has reached $100 million in “billings” and raised another $131 million on its Series D round at a sustained valuation of $2 billion. (Billings typically exceed GAAP revenue due to the effect of prepayments on multi-year contracts.)
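
The gap between billings and revenue is easy to see with a worked example. The sketch below uses hypothetical numbers (a three-year, $300,000 contract paid up front) and a simplified rule that revenue is recognized evenly over the term; it says nothing about Domo’s actual contracts.

```python
def first_year_billings_and_revenue(contract_value, term_years, prepaid=True):
    """Return (billings, recognized_revenue) for year one of a contract.

    Simplifying assumption: revenue is recognized evenly over the term.
    """
    # A prepaid contract is invoiced in full up front.
    billings = contract_value if prepaid else contract_value / term_years
    # Only the portion of service delivered in year one counts as revenue.
    recognized_revenue = contract_value / term_years
    return billings, recognized_revenue

billings, revenue = first_year_billings_and_revenue(300_000, term_years=3)
print(f"billings=${billings:,.0f} revenue=${revenue:,.0f}")
```

In this example year-one billings are three times year-one revenue, so a company that signs a lot of prepaid multi-year deals can report billings well ahead of what its income statement shows.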

Explainers

— In the MIT Technology Review, Rachel Metz explains the Microsoft chatbot fail.

— Facebook’s Arun Sharma explains Dragon, a distributed graph query engine.

— Frances Perry and Tyler Akidau explain runners in Apache Beam.

— On the Netflix Tech Blog, Ben Schmaus et al. explain Mantis, a streaming analytics platform that drives alerts and dashboards.

— At a Flink Meetup in Sao Paulo, Slim Baltagi presents real-world use cases for streaming analytics.

— Two interesting posts on PySpark:

  • On the AWS Big Data Blog, Veronika Megler explains anomaly detection using PySpark, Hive and Hue.
  • On the MapR Blog, Ben Sadeghi explains churn prediction using PySpark, MLlib and ML.

Perspectives

— Eric Kavanagh delivers a nice overview of the history of open source analytics.

— On the Qubole Blog, MediaMath’s Rory Sawyer describes the benefits of cloud-based data science infrastructure.

— In a somewhat turgid essay, Stitch Fix’s Jeff Magnusson argues that data scientists are thinkers and engineers are doers, then argues that engineers (the “doers”) should not do ETL, an argument that rebuts itself.

— Ian Allison profiles Seldon, an open source machine learning platform that specializes in content and product recommenders.

— In Datanami, Alex Woodie writes a confused piece on ‘overcoming Spark performance challenges’ that appears to be mostly about touting some new products.

— Ted Dunning previews his Strata presentation on streaming. Spoiler: he likes it.

— James Haight of Blue Hill Research offers an article teasing five things to watch for at Strata, but only details four. I feel cheated.

— Sam Charrington summarizes insights from Cloudera’s third annual analyst day. If you follow him on Twitter, you’ve already read this.

Open Source Announcements

— Airbnb donates Airflow, a workflow automation system, to Apache.

— KeystoneML, a machine learning pipeline framework that runs on Spark, releases version 0.3, with new solvers, new operators and a number of performance improvements. I continue to wonder why this AMPLab project isn’t part of the Spark ML library.

— Several Apache projects have new releases:

  • Apache Mahout 0.11.2 updates Spark support, includes performance enhancers and bug fixes.
  • BSP framework Apache Hama releases version 0.7.1 with bug fixes and a new scheduler.
  • OLAP-on-Hadoop project Apache Kylin delivers releases 1.3 and 1.5 in quick succession, skipping release 1.4. On the Apache Kylin technical blog, Hongbin Ma details the new bits in Release 1.3, and Li Yang explains Release 1.5.
  • SQL engine MRQL releases version 0.6, with new features for incremental query processing.

Commercial Announcements

— Altiscale announces the Altiscale Insight Cloud, an analytics-as-a-service platform that runs on top of the Altiscale Data Cloud. The service combines a number of popular tools, including Spark, Hive, Pig, Python, R, Mahout, Matlab and H2O. Altiscale also claims to include Revolution R, which is curious since Microsoft acquired and rebranded the product.

— Alteryx and Microsoft announce a partnership, which makes sense for both parties. Alteryx, a Windows-based product, fills a gap in Microsoft’s product line, and Azure greatly expands Alteryx’s market reach.

— DataRobot announces that it is certified on Cloudera and claims to be the only Cloudera partner that is certified on all of Cloudera’s bits, including Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels. George Leopold reports.

— Sense announces that it has been acquired by Cloudera. I’m struggling to understand why I should care.