Big Analytics Roundup (March 28, 2016)
Microsoft’s chatbot fail wins the internet this week, but the most important story is Google’s new managed service for machine learning. Also leading the week: Mesosphere’s new funding round led by Microsoft and HPE, and more funding for Domo.
— Google Cloud Platform (GCP) adds the Google Cloud Machine Learning Platform to its suite of managed machine learning services, which already includes Google Cloud Vision API (Beta); Google Cloud Speech API (Limited Preview); and Google Cloud Translate API. GCP still offers the Prediction API, but it’s no longer a top-level service. The Machine Learning platform, currently in Limited Preview, works with TensorFlow models that you train offline and Dataflow for preprocessing, so you can work with data in Google Cloud Storage, BigQuery and other sources. It’s an impressive stack. A cloud of speculation and navel-gazing ensues.
— Mesosphere announces that it has closed a $73.5 million Series C round, with Microsoft and Hewlett Packard Enterprise taking lead roles. Mesosphere also announces version 1.0 of Marathon, a container orchestration service for DCOS, and a new product for source code management called Velocity.
— Domo announces that it has reached $100 million in “billings” and raised another $131 million on its Series D round at a sustained valuation of $2 billion. (Billings typically exceed GAAP revenue due to the effect of prepayments on multi-year contracts.)
— In the MIT Technology Review, Rachel Metz explains the Microsoft chatbot fail.
— Facebook’s Arun Sharma explains Dragon, a distributed graph query engine.
— Frances Perry and Tyler Akidau explain runners in Apache Beam.
— On the Netflix Tech Blog, Ben Schmaus et al. explain Mantis, a streaming analytics platform that drives alerts and dashboards.
— At a Flink Meetup in São Paulo, Slim Baltagi presents real-world use cases for streaming analytics.
— Two interesting posts on PySpark:
- On the AWS Big Data Blog, Veronika Megler explains anomaly detection using PySpark, Hive and Hue.
- On the MapR Blog, Ben Sadeghi explains churn prediction using PySpark, MLlib and ML.
— Eric Kavanagh delivers a nice overview of the history of open source analytics.
— On the Qubole Blog, MediaMath’s Rory Sawyer describes the benefits of cloud-based data science infrastructure.
— In a somewhat turgid essay, Stitch Fix’s Jeff Magnusson argues that data scientists are thinkers and engineers are doers, then argues that engineers (the “doers”) should not do ETL, an argument that rebuts itself.
— In Datanami, Alex Woodie writes a confused piece on ‘overcoming Spark performance challenges’ that appears to be mostly about touting some new products.
— Ted Dunning previews his Strata presentation on streaming. Spoiler: he likes it.
— James Haight of Blue Hill Research offers an article teasing five things to watch for at Strata, but only details four. I feel cheated.
— Sam Charrington summarizes insights from Cloudera’s third annual analyst day. If you follow him on Twitter, you’ve already read this.
Open Source Announcements
— Airbnb donates Airflow, a workflow automation system, to Apache.
— KeystoneML, a machine learning pipeline framework that runs on Spark, releases version 0.3, with new solvers, new operators and a number of performance improvements. I continue to wonder why this AMPLab project isn’t part of the Spark ML library.
— Several Apache projects have new releases:
- Apache Mahout 0.11.2 updates Spark support and includes performance enhancements and bug fixes.
- BSP framework Apache Hama releases version 0.7.1 with bug fixes and a new scheduler.
- OLAP-on-Hadoop project Apache Kylin delivers releases 1.3 and 1.5 in quick succession, skipping release 1.4. On the Apache Kylin technical blog, Hongbin Ma details the new bits in Release 1.3, and Li Yang explains Release 1.5.
- SQL engine MRQL releases version 0.6, with new features for incremental query processing.
— Altiscale announces the Altiscale Insight Cloud, an analytics-as-a-service platform that runs on top of the Altiscale Data Cloud. The service combines a number of popular tools, including Spark, Hive, Pig, Python, R, Mahout, Matlab and H2O. Altiscale also claims to include Revolution R, which is curious since Microsoft acquired and rebranded the product.
— Alteryx and Microsoft announce a partnership, which makes sense for both parties. Alteryx, a Windows-based product, fills a gap in Microsoft’s product line, and Azure greatly expands Alteryx’s market reach.
— DataRobot announces that it is certified on Cloudera and claims to be the only Cloudera partner that is certified on all of Cloudera’s bits, including Spark, YARN, Cloudera Service Descriptors, and Cloudera Parcels. George Leopold reports.
— Sense announces that it has been acquired by Cloudera. I’m struggling to understand why I should care.