
The Year in Machine Learning (Part One)


This is the first installment in a four-part review of 2016 in machine learning and deep learning.

In the first post, we look back at ML/DL news organized in five high-level topic areas:

  • Concerns about bias
  • Interpretable models
  • Deep learning accelerates
  • Supercomputing goes mainstream
  • Cloud platforms build ML/DL stacks

In Part Two, we cover developments in each of the leading open source machine learning and deep learning projects.

Parts Three and Four will review the machine learning and deep learning moves of commercial software vendors.

Concerns About Bias

As organizations expand the use of machine learning for profiling and automated decisions, there is growing concern about the potential for bias. In 2016, reports in the media documented racial bias in predictive models used for criminal sentencing, discriminatory pricing in automated auto insurance quotes, an image classifier that learned “whiteness” as an attribute of beauty, and hidden stereotypes in Google’s word2vec algorithm.

Two bestsellers were published in 2016 that address the issue. The first, Cathy O’Neil’s Weapons of Math Destruction, is a candidate for the National Book Award. In a review for The Wall Street Journal, Jo Craven McGinty summarizes O’Neil’s arguments as “algorithms aren’t biased, but the people who build them may be.”

A second book, Virtual Competition, written by Ariel Ezrachi and Maurice Stucke, focuses on the ways that machine learning and algorithmic decisions can promote price discrimination and collusion. Burton Malkiel notes in his review that the work “displays a deep understanding of the internet world and is outstandingly researched. The polymath authors illustrate their arguments with relevant case law as well as references to studies in economics and behavioral psychology.”

Most working data scientists are deeply concerned about bias in the work they do. Bias, after all, is a form of error, and a biased algorithm is an inaccurate algorithm. The organizations that employ data scientists, however, may not commit the resources needed for testing and validation, which is how we detect and correct bias. Moreover, people in business suits often exaggerate the accuracy and precision of predictive models or promote their use for inappropriate applications.
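Bias detection of this kind doesn't require exotic tooling. A minimal sketch (the predictions, labels, and group assignments below are made up for illustration) is to compare error rates across groups during validation; a large gap signals a bias that aggregate accuracy hides:

```python
# Sketch: detect bias by comparing a model's error rates across groups.
# The predictions, labels, and group labels below are hypothetical.

def error_rate(predictions, labels):
    """Fraction of predictions that disagree with the true labels."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)

def error_by_group(predictions, labels, groups):
    """Error rate for each group value, e.g. a protected attribute."""
    rates = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = error_rate([predictions[i] for i in idx],
                              [labels[i] for i in idx])
    return rates

# Toy scored population: the model errs on 1 of 4 cases in group A
# but on 2 of 4 in group B -- a gap the overall accuracy conceals.
preds  = [1, 1, 0, 0, 1, 1, 1, 1]
labels = [1, 1, 0, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

for g, r in sorted(error_by_group(preds, labels, groups).items()):
    print(g, r)  # A 0.25, then B 0.5
```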

In Europe, GDPR creates an incentive for organizations that use machine learning to take the potential for bias more seriously. We’ll be hearing more about GDPR in 2017.

Interpretable Models

Speaking of GDPR, beginning in 2018, organizations that use machine learning to drive automated decisions must be prepared to explain those decisions to the affected subjects and to regulators. As a result, in 2016 we saw considerable interest in efforts to develop interpretable machine learning algorithms.

— The MIT Computer Science and Artificial Intelligence Laboratory announced progress in developing neural networks that deliver explanations for their predictions.

— At the International Joint Conference on Artificial Intelligence, David Gunning summarized work to date on explainability.

— MIT selected machine learning startup Rulex as a finalist in its Innovation Showcase. Rulex implements a technique called Switching Neural Networks to learn interpretable rule sets for classification and regression.

— In O’Reilly Radar, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin explained Local Interpretable Model-Agnostic Explanations (LIME), a technique that explains the predictions of any machine learning classifier.

— The Wall Street Journal reported on an effort by Capital One to develop machine learning techniques that account for the reasoning behind their decisions.

In Nautilus, Aaron M. Bornstein asked: Is artificial intelligence permanently inscrutable?  There are several issues, including a lack of clarity about what “interpretability” means.

It is important to draw a distinction between “interpretability by inspection” versus “functional” interpretability. We do not evaluate an automobile by disassembling its engine and examining the parts; we get behind the wheel and take it for a drive. At some point, we’re all going to have to get behind the idea that you evaluate machine learning models by how they behave and not by examining their parts.
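The intuition behind LIME, mentioned above, can be sketched in a few lines. This is an illustration of the idea, not the authors' implementation, and the `black_box` model here is a made-up stand-in: perturb an instance, query the black box, and fit a distance-weighted linear surrogate whose coefficients serve as the local explanation.

```python
import numpy as np

def black_box(x):
    # Stand-in for an opaque model: the score is driven mainly by x[0].
    return 1.0 if 3.0 * x[0] - 1.0 * x[1] > 0.0 else 0.0

def explain_locally(f, x, n_samples=500, scale=0.5, seed=0):
    """Fit a distance-weighted linear surrogate to f around x.

    Returns per-feature coefficients: the local 'explanation'."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(0.0, scale, size=(n_samples, len(x)))    # perturbations
    y = np.array([f(row) for row in X])                         # black-box labels
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * scale**2))  # proximity kernel
    A = np.hstack([X, np.ones((n_samples, 1))])                 # add intercept
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * np.sqrt(w), rcond=None)
    return coef[:-1]  # drop the intercept term

coefs = explain_locally(black_box, np.array([0.2, 0.1]))
# coefs[0] should dominate coefs[1], matching the model's reliance on x[0].
```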

Deep Learning Accelerates

In a September Fortune article, Roger Parloff explains why deep learning is suddenly changing your life. Neural networks and deep learning are not new techniques; we see practical applications emerge now for three reasons:

— Computing power is cheap and getting cheaper; see the discussion below on supercomputing.

— Deep learning works well in “cognitive” applications, such as image classification, speech recognition, and language translation.

— Researchers are finding new ways to design and train deep learning models.

In 2016, the field of DL-driven cognitive applications reached new milestones:

— A Microsoft team developed a system that recognizes conversational speech as well as humans do. The team used convolutional and long short-term memory (LSTM) neural networks built with Microsoft Cognitive Toolkit (CNTK).

— On the Google Research Blog, a Google Brain team announced the launch of the Google Neural Machine Translation System, a system based on deep learning that is currently used for 18 million translations per day.

— In TechCrunch, Ken Weiner reported on advances in DL-driven image recognition and how they will transform business.

Venture capitalists aggressively funded startups that leverage deep learning in applications, especially those that can position themselves in the market for cognitive solutions:

Affectiva, which uses deep learning to read facial expressions in digital video, closed on a $14 million “D” round led by Fenox Venture Capital.

Clarifai, a startup that offers a DL-driven image and video recognition service, landed a $30 million Series B round led by Menlo Ventures.

Zebra Medical Vision, an Israeli startup, uses DL to examine medical images and diagnose diseases of the bones, brain, cardiovascular system, liver, and lungs. Zebra disclosed a $12 million venture round led by Intermountain Health.

There is an emerging ecosystem of startups that are building businesses on deep learning. Here are six examples:

— Deep Genomics, based in Toronto, uses deep learning to understand diseases, disease mutations and genetic therapies.

— Cybersecurity startup Deep Instinct uses deep learning to predict, prevent, and detect threats to enterprise computing systems.

— Ditto Labs uses deep learning to identify brands and logos in images posted to social media.

— Enlitic offers DL-based patient triage, disease screening, and clinical support to make medical professionals more productive.

— Gridspace provides conversational speech recognition systems based on deep learning.

— Indico offers DL-driven tools for text and image analysis in social media.

And, in a sign that commercial development of deep learning isn’t all hype and bubbles, NLP startup Idibon ran out of money and shut down. We can expect further consolidation in the DL tools market as major vendors with deep pockets ramp up their programs. The greatest opportunity for new entrants will be in specialized applications, where the founders can deliver domain expertise and packaged solutions to well-defined problems.

Supercomputing Goes Mainstream

To make deep learning practical, you need a lot of computing horsepower. In 2016, hardware vendors introduced powerful new platforms that are purpose-built for machine learning and deep learning.

While GPUs are currently in the lead, there is a serious debate under way about the relative merits of GPUs and FPGAs for deep learning. Anand Joshi explains the FPGA challenge. In The Next Platform, Nicole Hemsoth describes the potential of a hybrid approach that leverages both types of accelerators. During the year, Microsoft announced plans to use Altera FPGAs, and Baidu said it intends to standardize on Xilinx FPGAs.

NVIDIA Launches the DGX-1

NVIDIA had a monster 2016, tripling its market value in the course of the year. The company released the DGX-1, a deep learning supercomputer. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

NVIDIA also revealed a Deep Learning SDK that includes deep learning primitives, math libraries, tools for multi-GPU communication, the CUDA toolkit and DIGITS, a model training system. The SDK works with popular deep learning frameworks like Caffe, CNTK, TensorFlow, and Theano.

Tech media salivated:

MIT Technology Review interviewed NVIDIA CEO Jen-Hsun Huang, who is now Wall Street’s favorite tech celebrity.

Separately, Karl Freund reports on NVIDIA’s announcements at the SC16 supercomputing show.

Early users of the DGX-1 include BenevolentAI, PartnersHealthCare, Argonne and Oak Ridge Labs, New York University, Stanford University, the University of Toronto, SAP, Fidelity Labs, Baidu, and the Swiss National Supercomputing Centre. Nicole Hemsoth explains how NVIDIA supports cancer research with its deep learning supercomputers.

Cray Releases the Urika-GX

Cray launched the Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high-performance network interconnect. Cray will ship 16-, 32- and 48-node racks in the third quarter, with larger configurations later in the year.

Intel Responds

The headline on the Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. Intel acquired Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reported a price tag of $408 million. The customary tech media unicorn story storm ensues.

Intel said it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana’s neon is YADLF — Yet Another Deep Learning Framework — which ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Paul Alcorn offers additional detail on Intel’s new Xeon CPU and Deep Learning Inference Accelerator. In Fortune, Aaron Pressman argues that Intel’s strategy for machine learning and AI is smart, but lags NVIDIA. Nicole Hemsoth describes Intel’s approach as “war on GPUs.”

Separately, Intel acquired Movidius, the folks who put a deep learning chip on a memory stick.

Cloud Platforms Build ML/DL Stacks

Machine learning use cases are inherently well-suited to cloud platforms. Workloads are ad hoc and project oriented; model training requires huge bursts of computing power for a short period. Inference workloads are a different matter, which is one of many reasons one should always distinguish between training and inference when choosing platforms.
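A back-of-the-envelope comparison makes the distinction concrete (the model size, data volume, and epoch count here are hypothetical): training repeats a full pass over the data many times, while inference touches one record once.

```python
# Sketch (hypothetical numbers) of why training and inference stress a
# platform differently: training is a huge burst, inference a steady trickle.

n_examples, n_features, n_epochs = 1_000_000, 100, 50

# One multiply-add per feature per example, forward + backward, per epoch.
training_flops  = n_examples * n_features * 2 * n_epochs
inference_flops = n_features * 2          # one forward pass, one record

print(f"training:  {training_flops:.1e} FLOPs in one burst")
print(f"inference: {inference_flops} FLOPs per request")
print(f"ratio:     {training_flops // inference_flops:,}x")
```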

Amazon Web Services

After a head fake earlier in the year, when it published DSSTNE, a deep learning project that nobody wants, AWS announced that it will standardize on MXNet for deep learning. Separately, AWS launched three new machine learning managed services:

— Rekognition, for image recognition

— Polly, for text to speech

— Lex, a conversational chatbot development platform

In 2014, AWS was first to market among the cloud platforms with GPU-accelerated computing services. In 2016, AWS added P2 instances with up to 16 NVIDIA Tesla K80 GPUs.

Microsoft Azure

Microsoft rebranded its deep learning framework, released in 2015 as CNTK, as Microsoft Cognitive Toolkit and released Version 2.0, with a new Python API and many other enhancements. The company also launched 22 cognitive APIs in Azure for vision, speech, language, knowledge, and search. Separately, MSFT released its managed service for Spark in Azure HDInsight and continued to enhance Azure Machine Learning.

MSFT also announced the Azure N-Series compute instances powered by NVIDIA GPUs for general availability in December.

Azure is one part of MSFT’s overall strategy in advanced analytics, which I’ll cover in Part Three of this review.

Google Cloud

In February, Google released TensorFlow Serving, an open source inference engine that handles deployment of trained models and manages their lifecycle. On the Google Research Blog, Noah Fiedel explained.

Later in the Spring, Google announced that it was building its own deep learning chips, or Tensor Processing Units (TPUs). In Forbes, HPC expert Karl Freund dissected Google’s announcement. Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs.

Google launched a dedicated team in October to drive Google Cloud Machine Learning, and announced a slew of enhancements to its services:

— Google Cloud Jobs API provides businesses with capabilities to find, match and recommend jobs to candidates. Currently available in a limited alpha.

— Cloud Vision API now runs on Google’s custom Tensor Processing Units; prices reduced by 80%.

— Cloud Translation API will be available in two editions, Standard and Premium.

— Cloud Natural Language API graduates to general availability.

In 2017, GPU-accelerated instances will be available for the Google Compute Engine and Google Cloud Machine Learning. Details here.

IBM Cloud

In 2016, IBM contributed heavily to the growing volume of fake news.

At the Spark Summit in June, IBM announced a service called the IBM Data Science Experience to great fanfare. Experienced observers found the announcement puzzling; the press release described a managed service for Apache Spark with a Jupyter IDE, but IBM already had a managed service for Apache Spark with a Jupyter IDE.

In November, IBM quietly released the service without a press release, which is understandable since there was nothing to crow about. Sure enough, it’s a Spark service with a Jupyter IDE, but also includes an R service with RStudio, some astroturf “community” documents and “curated” data sources that are available for free from a hundred different places. Big Whoop.

In IBM’s other big machine learning move, the company rebranded an existing SPSS service as Watson Machine Learning. Analysts fell all over themselves raving about the new service, apparently without actually logging in and inspecting it.


Of course, IBM says that it has big plans to enhance the service. It’s nice that IBM has plans. We should all aspire to bigger and better things, but keep in mind that while IBM is very good at rebranding stuff other people built, it has never in its history developed a commercially successful software product for advanced analytics.

IBM Cloud is part of a broader strategy for IBM, so I’ll have more to say about the company in Part Three of this review.

Roundup 10/24/2016


Top machine learning (ML) and deep learning (DL) stories from last week, plus new content from Friday and the weekend.

The theme for featured images this week is art produced by deep learning.

ICYMI: Top Stories of Last Week

— AMD, Dell EMC, Google, HPE, IBM, Mellanox, Micron, NVIDIA, and Xilinx launch the OpenCAPI Consortium, an industry group that promotes specs for the next generation of data center hardware.


— Apple hires Carnegie Mellon University professor Ruslan Salakhutdinov as Director of AI research. Linkapalooza here.

— Andrew Oliver proposes dropping seven technologies from the Big Data ecosystem: MapReduce, Storm, Pig, Java, Tez, Oozie, and Flume. He forgets to mention Mahout, which is forgivable since nobody uses it.

— Daniel Gutierrez interviews Jim McHugh of NVIDIA’s Deep Learning Group, who says he wants to collaborate with Databricks to integrate the BIDMach machine learning library with Spark.

— Meanwhile, Gartner announces the top ten strategic technology trends for 2017, and machine learning is right up there at #1 on the list.

— Serdar Yegulalp describes Microsoft’s big bet on FPGAs, explains the potential of FPGAs for machine learning, and notes that existing machine learning software generally does not support FPGA acceleration.

— Meanwhile, however, Baidu announces that it will accelerate its machine learning applications with Xilinx FPGAs.

— Xilinx is on a roll. TeraDeep announces a fast deep learning solution that leverages Xilinx FPGAs.

— Using CNTK, MSFT researchers achieve parity with humans in speech recognition; medialanche ensues.

— In HBR, Tom Davenport explains how to introduce AI into your organization. The next generation of AI will introduce itself.

— Tesla announces that its new cars will include all of the hardware needed for level 5 autonomy. The software isn’t available yet but will be added through over-the-air updates.

Good Reads from Last Week

— Christine Barton et al. explain why companies can’t turn customer insights into growth.

— François Maillet of MLDB.ai explains how to use MLDB for machine learning. MLDB looks like an exciting project.

— Emmanuelle Rieuf reviews Cathy O’Neil’s Weapons of Math Destruction. So does Jo Craven McGinty in the Wall Street Journal.

Microsoft’s Big Bet on FPGAs

— Top analysts chew over Microsoft’s announcement that it uses Field Programmable Gate Arrays (FPGAs) to accelerate servers in its data centers. Karl Freund of Moor Insights and Strategy dissects Microsoft’s approach. In The Next Platform, Stacey Higginbotham delivers a tick-tock covering how MSFT decided to place its bet.

Methods and Techniques

— Alex Handy lists a collection of resources in ML, DL, and AI.

— A community of contributors offers an excellent open guide to Amazon Web Services, including Amazon Machine Learning.

Health and Medical Applications

— Jennifer Bresnick explains the potential impact of Blockchain, IoT and ML on healthcare.

— HealthNextGen, a startup that specializes in ML for health care, announces a partnership with Charité – Universitätsmedizin in Berlin, Europe’s largest university hospital.

— The National Institutes of Health awards a grant of $1.2 million to Xi Luo of Brown University and colleagues at Johns Hopkins and Yale. The grant funds ML-driven research into brain scans and brain function.

Software and Services

— Serdar Yegulalp reports on progress towards a version of TensorFlow that runs on Windows. I wonder if it will force you to upgrade to Windows 10.

— Bernd Bischl et al. describe mlr, a machine learning framework for R with more than 160 learners and support for parallel high-performance computing.


— Nielsen adds a machine learning capability to the Nielsen Marketing Cloud.

— Wall Street mulls Tesla’s partnership with NVIDIA.

ICYMI: Top ML/DL Stories 10/3-10/7


Top stories of the week, compiled from daily roundups.

Top Reads

How to steal a predictive model.
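The linked work covers a family of model-extraction attacks; the simplest case can be sketched directly. A prediction API that returns raw scores from a linear model leaks its parameters in d + 1 queries (the `secret_w` and `secret_b` values below are hypothetical; real attacks handle much richer model classes):

```python
import numpy as np

# Sketch of model "stealing": a linear scoring API with d inputs can be
# recovered exactly from d + 1 well-chosen queries.

d = 3
secret_w = np.array([2.0, -1.0, 0.5])   # hypothetical hidden parameters
secret_b = 4.0

def prediction_api(x):
    # The "victim": the attacker can query it but cannot see w or b.
    return float(secret_w @ x + secret_b)

# Query the origin and each unit vector: d + 1 calls in total.
b_hat = prediction_api(np.zeros(d))
w_hat = np.array([prediction_api(np.eye(d)[i]) - b_hat for i in range(d)])

print(w_hat, b_hat)  # recovers the secret parameters exactly
```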

Nicole Hemsoth describes 26 emerging applications for deep learning across business, scientific and engineering disciplines. Last week, she covered the next wave of deep learning architectures.

Sebastian Raschka publishes Part III of his series on model evaluation, model selection and algorithm selection in machine learning. Parts I and II are here and here.

Gale Morrison’s survey of neural computing in Semiconductor Engineering is an absolute must read. She summarizes intellectual and technical developments in the field, covers market drivers, and briefly discusses what works. Companies mentioned: Amazon, Baidu, Google, Huawei, Intel, Nervana, NVIDIA, and Samsung.

AWS Announces GPU-Accelerated Instances

Amazon Web Services announces the availability of P2 instances designed for computationally intensive science and engineering applications. Details are here. Linkapalooza here.

Denso Invests in Vision Recognition

Global automotive supplier Denso announces an investment in THINCI, which produces low-power vision-recognition technology with embedded DL.

Mitsubishi: We Can Automate Deep Learning 

Mitsubishi Electric announces what it describes as the first Automated Design Deep Learning Algorithm, a system that designs deep learning structures to speed development of AI applications.


Ozzies Deploy Deep Learning Supercomputers

Australia’s Commonwealth Scientific and Industrial Research Organization (CSIRO) deploys two stupidly powerful NVIDIA DGX-1s, each of which has the throughput of 250 servers. First up on the list of projects: sifting through massive quantities of data to understand the impact of environment on disease. In ZDNet, Chris Duckett reports.

NVIDIA, FANUC to Build Smart Robots

NVIDIA and FANUC announce an alliance to embed DL-driven AI in the FANUC Intelligent Edge Link and Drive (FIELD) system for robotics. Adding AI enables robots to teach themselves to perform tasks more efficiently. FANUC will use NVIDIA GPUs and DL software. In MIT Technology Review, Will Knight explains why this is a big deal.

Elsewhere, NVIDIA Eats the News

  • Dave Neal recaps NVIDIA’s GPU Technology Conference (GTC) in Amsterdam, featuring the Xavier SoC for autonomous vehicles.
  • On NVIDIA’s Deep Learning blog, Brian Caulfield describes spider-like robot MANTIS, which uses NVIDIA’s Jetson TX1 system for embedded deep learning and computer vision.
  • AI and robotics dominate NVIDIA’s GTC Japan and GTCx Australia.
  • UK-based computer retailer Scan proposes to offer time on NVIDIA’s stupidly powerful DGX-1, thereby demonstrating that GPU-based systems are the new mainframes.
  • Wall Street notices that NVIDIA is on a roll; in Seeking Alpha, Chris Lau explains.
  • Seeking Alpha’s got a fever, and the only prescription is more NVIDIA.
  • Meanwhile, HPC cloud provider Nimbix announces a partnership with NVIDIA and IBM; Nimbix will power its cloud with IBM Power Systems S822LC servers featuring NVIDIA Pascal GPUs.
  • According to reports, AMD’s Radeon Technologies Group (RTG) will market a dual-GPU chip in 2017. Linkapalooza here.
  • NVIDIA demonstrates an autonomous vehicle driving on unmarked roads at night.

CIA Uses ML/DL to Predict Unrest

In the Fiscal Times, Frank Konkel summarizes how the CIA uses machine learning to predict such things as the flow of illicit cash to extremists. In Defense One, Frank Konkel reports that the CIA says it can predict social unrest 3-5 days in advance. I don’t think this is what folks have in mind when they call for Data Science for Social Good.

Khronos Group Proposes Open Standards for Neural Networks

The Khronos Group, a consortium of hardware and software companies, announces two initiatives to promote the development of neural network techniques. The Neural Network Exchange Format (NNEF) initiative will develop an open standard file format to exchange deep learning models between training and inference systems. The OpenVX Neural Network Extension project will be a high-level architecture specification to run Convolutional Neural Networks (CNN) as OpenVX graphs. Brandon Lewis reports.
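The exchange-format idea is easy to illustrate (with a made-up JSON format, not NNEF itself): the training side serializes architecture and weights to a neutral file, and any inference system that understands the format can reload and run the model.

```python
import json

# Toy illustration of a neutral model-exchange file. The format here is
# invented for this sketch; NNEF itself is a different, richer format.

trained_model = {
    "layers": [
        {"op": "linear", "weights": [[0.5, -0.2], [0.1, 0.3]], "bias": [0.0, 0.1]},
        {"op": "relu"},
    ]
}

exported = json.dumps(trained_model)   # the "file" handed to the inference side

def run(model_json, x):
    """Reload the exchanged model and run a forward pass on vector x."""
    model = json.loads(model_json)
    for layer in model["layers"]:
        if layer["op"] == "linear":
            W, b = layer["weights"], layer["bias"]
            x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
                 for row, bi in zip(W, b)]
        elif layer["op"] == "relu":
            x = [max(0.0, v) for v in x]
    return x

print(run(exported, [1.0, 2.0]))  # roughly [0.1, 0.8]
```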

Machine Learning Roundup (October 3, 2016)


Machine learning (ML) and deep learning (DL) content from Friday and the weekend. Scroll to the bottom for job postings.

ICYMI, the roundup is now daily and focuses solely on machine learning and deep learning.

Top stories from last week:

— Google releases Cloud Machine Learning to public beta.

— NVIDIA introduces System-on-Chip for Autonomous Vehicles.

— Amazon, Facebook, Google, IBM, and Microsoft form a partnership to promote ethical AI (read: restrain trade and stifle innovation).

— IBM announces a data movement tool (no, it hasn’t solved world hunger).


October 21: at UC Berkeley, Sergey Levine delivers a Colloquium on Deep Robotic Learning.

October 26: in Toronto, the MLX Fintech Conference.

Must Read

Nicole Hemsoth describes 26 emerging applications for deep learning across business, scientific and engineering disciplines. Last week, she covered the next wave of deep learning architectures.

Sebastian Raschka publishes Part III of his series on model evaluation, model selection and algorithm selection in machine learning. Parts I and II are here and here.

Yahoo! Releases Porn Filter

On the Yahoo! Engineering blog, Jay Mahadeokar and Gerry Pesavento describe open_nsfw, open source code that detects NSFW images. The system uses CaffeOnSpark. Snark ensues. Linkapalooza here.

AWS Announces Stupidly Powerful GPU-Accelerated Instances

Amazon Web Services announces the availability of P2 instances designed for computationally intensive science and engineering applications. Details are here. Linkapalooza here.


The editors of Inside Big Data go out on a limb and predict that the volume of data will grow. That alone doesn’t get them into this roundup, but they’re also bullish on ML, so OK.

Abinash Tripathy thinks AI will not take over the world. How do we know he’s not a bot?


In a paper available at arXiv.org, Andy Zeng et al. describe a system for robotic warehouse automation, which uses a convolutional neural network to segment and label views of a scene.


In TechCrunch, daco.io’s Claire Bretton explains how deep learning enables computers to see.

Methods and Techniques

In WildML, Google Brain’s Denny Britz offers tips on learning Reinforcement Learning.

Mikio Braun dissects how Zalando, a European fashion retailer, puts machine learning to work.

Rick Fulton, Engineering Lead at Postmates, explains how to predict delivery times.


On NVIDIA’s Deep Learning blog, Brian Caulfield describes MANTIS, a six-legged robot on display last week at the GPU Technology Conference (GTC) in Amsterdam. MANTIS uses NVIDIA’s Jetson TX1 system for embedded deep learning and computer vision.


In Forbes, Gil Press describes how Cox Automotive uses Splunk for machine learning.

In a published paper, Tao Zheng et al. describe an approach that uses machine learning to identify Type 2 diabetes through electronic health records.

Bob Tedeschi reports on how recent developments in machine learning affect the practice of radiology.

The aptly named Mariella Moon reports on space drones that use machine learning.

Ananya Bhattacharya describes how researchers use game theory and machine learning to predict where elephant poachers will strike. Presumably, they strike where the elephants are.


In Forbes, Gil Press describes Baidu’s DeepBench, open-source software designed to evaluate the performance of deep learning operations on different hardware platforms.

Apache MADlib (incubating), an open-source SQL-based machine learning library, delivers release 1.9.1, with pivots, sessionization and prediction quality metrics. On the Pivotal blog, Frank McQuillan explains.

Demandbase announces the availability of DemandGraph, an AI-powered business graph for B2B marketing.


Noel Bambrick discusses key trends driving AI and ML in the enterprise.


Fujitsu announces a new memory technology for GPUs that radically reduces the time needed to train deep learning models. In Forbes, Kevin Murnane explains.


Leslie D’Monte interviews Google executive John Giannandrea, who discusses the role of machine learning in improving Google Search.

In MedCity News, Stephanie Baum surveys new ML-driven startups in Health Care.

In Diginomica, Derek du Preez describes InsideSales, a startup that uses ML to improve sales effectiveness.

Tony Quested reports on several ML startups located in Cambridge (UK), including Prowler.io, ThisWayGlobal, Luminance, Invenia Technical Computing, and Sophia Genetics.


In Dusseldorf, Germany, Trivago seeks an algorithm engineer.

The Voleon Group, a fintech company in Berkeley, CA, seeks a senior researcher for its ML group.

Big Analytics Roundup (August 29, 2016)

Chris Green, a tribal member, and his son, Clayton, get the dogs out early to round up a herd at Big Cypress Reservation.

TechCrunch reports results of a new study, which says that you really don’t need a co-founder after all. Next, they’ll be telling us we don’t need to floss.

Python and R

Matt Asay argues that Python is a gateway language that leads data scientists to R (h/t Oliver Vagner). That’s oversimplified and mostly incorrect. The breadth of R’s analytics functionality tends to draw statisticians and scientists, while Python tends to be an entry language for software developers. While R supports more analytics than Python, Python has better tooling for Big Data; PySpark, for example, does much more than SparkR.

In KDnuggets’ 2016 poll, Python use increased markedly from 2015; this suggests that R users are adding Python to their battery of tools.  More people in the poll use both Python and R than use either one alone.

While R is an excellent tool for personal use, its GPL license discourages adoption by companies that develop products or deliver services built on analytics. Thus, it is very unlikely that R will overtake Python as a development platform for machine learning applications.

Aster on Hadoop

Teradata announces the availability of Aster on Hadoop and AWS. Aster on Hadoop strikes me as a bladeless knife without a handle.

Aster was kind of interesting back in 2012; SQL/MapReduce offered analysts a way to run queries in Hadoop back when Hive was clunky and slow. Today, Aster is rendered obsolete by the likes of Impala, Spark, Presto, Drill, and Hawq. According to DB-Engines, Aster ranks 77th in popularity, well below competitors Vertica, Netezza, and Greenplum.

Teradata’s leadership says that Aster is a great foundation for custom applications. Assuming that is true, for the sake of argument, the logical move is to donate Aster to open source, as Pivotal did with Greenplum.

SAP Acquiring Altiscale?

VentureBeat reports that SAP is acquiring BDaaS provider Altiscale for more than $125 million; neither SAP nor Altiscale confirms. Doug Henschen comments.

Late Summer Reading

In 2012, Amgen researchers disclosed that they were unable to reproduce findings in 47 out of 53 published cancer discoveries. In Nautilus, Ahmed Alkhateeb argues that we should not accept scientific results unless the findings are reproducible.

In a thesis submitted to Sweden’s KTH Royal Institute of Technology, Ahsan Javed Awan reports the results of benchmarking Apache Spark on a single scale-up server. He ran into some scaling issues on machines with more than twelve cores, which he records in some detail.


— Felix Gessert explains the ins and outs of different NoSQL databases and offers a rubric for choosing one.

— On the Google Research Blog, Peter Liu explains text summarization with TensorFlow.

— Joe Osborne interviews Google’s Norm Jouppi, who explains the Tensor Processing Unit (TPU).

— On the Kudu blog, Dan Burkert explains new range partitioning features in Kudu.

— Marco Tulio Ribeiro et al. explain Local Interpretable Model-Agnostic Explanations, a fancy name for partial dependency analysis.

— Stephen J. Bigelow explains the tools available on AWS for BI solutions: S3, RDS, Aurora, DynamoDB, EMR, Redshift, Quicksight and Amazon Machine Learning.

— On Slideshare, Manu Zhang and Sean Zhong explain Apache Gearpump, which is Yet Another Streaming Engine.

— Julie Bort explains why you shouldn’t depend on one cloud service provider.


— On the Confluent blog, Jay Kreps argues that multi-tenancy is the key capability of distributed systems.

— Cynthia Harvey compares AWS and Azure; she misses the big picture. AWS is a software-agnostic IaaS provider; MSFT is a software company with complementary PaaS and SaaS services. There are advantages and disadvantages to each model, but first one must recognize the difference.

— Curt Monash asks if analytic RDBMSs and data warehouse appliances are obsolete.

— SAP’s Ken Tsai opines on the role of Hadoop in digital transformation and IoT. Spoiler: he thinks Hadoop has a role.

— Sam Dean touts Grappa. He should have clicked through.


Open Source News

— Hazelcast announces the general availability of Hazelcast 3.7, with performance improvements and a modular architecture. Hazelcast is an in-memory data grid.

— The Apache Geode team announces release 1.0.0-incubating.M3. Geode is a distributed in-memory database; it is the back end of Pivotal GemFire.

— Apache Ignite completes the hat trick for in-memory bits by announcing Ignite 1.7.0.

— Apache Kudu launches Kudu 0.10.0.

— Microsoft announces the availability of Microsoft R Open (MRO) 3.3.1, with a streamlined installation process, additional packages, and bug fixes. MRO is a free and open source enhanced distribution of R.

Commercial Announcements

— Big-Data-as-a-Service provider BlueData announces a $20 million “C” round led by Intel Capital. The company also announces a partnership with Intel to deliver its software on Xeon processors.

— Google offers several webinars in September for those who want to learn more about BigQuery, Cloud Dataflow, and the Google Cloud Platform.

— Syncsort announces that it has completed the acquisition of Cogito, a maker of mainframe stuff that complements Syncsort’s other mainframe stuff.

Big Analytics Roundup (August 15, 2016)


In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.
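Queries in a streaming SQL engine like Kinesis Analytics typically aggregate over time windows. As a rough illustration of what a tumbling-window COUNT computes, here is a toy version in plain Python; the event data and window size are invented for illustration.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size (tumbling) time window.

    `events` is a list of (timestamp_seconds, payload) pairs -- a stand-in
    for records arriving on a stream. Each event falls into exactly one
    window, keyed by the window's start time.
    """
    counts = Counter()
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (12, "b"), (14, "c"), (29, "d"), (31, "e")]
print(tumbling_window_counts(events, 10))  # {0: 1, 10: 2, 20: 1, 30: 1}
```

A real streaming engine does the same bucketing incrementally and emits each window's result as it closes, rather than scanning a finished list.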

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, which can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana’s neon is YADLF — Yet Another Deep Learning Framework — which ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.


Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to OpenHUB, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what Apache Beam has accomplished in the first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gillad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches.(h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces a maintenance release with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (June 20, 2016)


Light news this week — everyone is catching up from Spark Summit, it seems. We have a nice crop of explainers, and some thoughts on IBM’s “Data Science Experience” announcement.

On his personal blog, Michael Malak recaps the Spark Summit.

Teradata releases a Spark connector for Aster, so Teradata is ready for 2014.

On KDnuggets, Gregory Piatetsky publishes a follow-up to results of his software poll, this time analyzing which tools tend to be used together.

In Datanami, Alex Woodie asks if Spark is overhyped, quoting extensively from some old guy. Woodie notes that it’s difficult to track the number of commercial vendors who have incorporated Spark into their products. Actually, it isn’t:

(Chart: commercial software vendors by Spark integration.)

And yes, there are a few holdouts in the lower left quadrants.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

IBM Data Science Experience

Unless you attended the recent Spark Summit with a bag over your head, you’re aware that IBM announced something. An IBM executive wants to know if I heard the announcement. The answer is yes: I saw the press release and the planted stories, but IBM’s announcements are — shall we say — aspirational: IBM is announcing a concept. The service isn’t in limited release, and IBM has not revealed a date when the service will be available.


It’s hard to evaluate a service that IBM hasn’t defined. Media reports and the press release are inconsistent — all stories mention Spark, Jupyter, RStudio and R; some stories mention H2O, others mention Cplex and other products. Insiders at IBM are in the dark about what components will be included in the first release.

Evaluating the release conceptually:

  • IBM already offers a managed service for Spark; it’s less flexible than Databricks or Qubole, and not as rich as Altiscale or Domino Data.
  • Unlike Qubole and Databricks, IBM plans to use Jupyter notebooks and RStudio rather than creating an integrated development environment of its own.
  • R and RStudio in the cloud are already available in AWS, Azure and Domino. If IBM plans to use a vanilla R distribution, it will be less capable than Microsoft’s enhanced R distribution available in Azure.
  • A managed service for H2O is a good thing, if it happens. There is no formal partnership between IBM and H2O.ai, and insiders at H2O seem surprised by IBM’s announcement. Of course, it’s already possible to implement H2O in any IaaS cloud environment, and H2O has users on AWS, Azure and Google Cloud platforms already.

Bottom line: IBM’s “Data Science Experience” is a marketing wrapper around an existing service, with the possibility of adding new services that may or may not be as good as offerings already in the marketplace. We’ll take another look when IBM actually releases something.

Explainers

— Davies Liu and Herman van Hovell explain SQL subqueries in Spark 2.0.
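The subquery forms Spark 2.0 adds are standard SQL, so the pattern can be illustrated without a cluster using Python’s built-in sqlite3 module (not Spark; the table and data here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 300.0), ("west", 50.0), ("west", 250.0)],
)

# Scalar subquery: compare each row against an aggregate computed in a
# nested SELECT -- one of the subquery forms Spark 2.0 now supports.
rows = conn.execute(
    """
    SELECT region, amount
    FROM sales
    WHERE amount > (SELECT AVG(amount) FROM sales)
    ORDER BY amount
    """
).fetchall()
print(rows)  # [('west', 250.0), ('east', 300.0)]
```

The average here is 175.0, so only the two rows above it survive the filter; Spark evaluates the same query shape by computing the scalar subquery once and broadcasting it to the outer scan.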

— On the MapR blog, Ellen Friedman explains SQL queries on mixed schema data with Apache Drill.

— Bill Chambers publishes the first of three parts on writing Spark applications in Databricks.

— In TechRepublic, Hope Reese explains machine learning to smart people. For everyone else, there’s this.

— Carla Schroder explains how Verizon Labs built a 600-node bare metal Mesos cluster in two weeks.

— On YouTube, H2O.ai’s Arno Candel demonstrates TensorFlow deep learning on an H2O cluster.

— Jessica Davis compiles a listicle of Tech Giants who embrace open source.

— Microsoft’s Dmitry Pechyoni reports results from an analysis of 600 million taxi rides using Microsoft R Server on a single instance of the Data Science Virtual Machine in Azure.

Perspectives

— InformationWeek’s Jessica Davis wonders if Microsoft will keep LinkedIn’s commitment to open source. LinkedIn’s donations to open source have less to do with its “commitment”, and more to do with its understanding that software is not its core business.

— Arthur Cole wonders if open source software will come to rule the enterprise data center as a matter of course. The answer is: it’s already happening.

Open Source Announcements

— Apache Beam (incubating) announces version 0.1.0. Key bits: SDK for Java and runners for Apache Flink, Apache Spark and Google Cloud Dataflow.

— Apache Mahout announces version 0.12.2, a maintenance release.

— Apache SystemML (incubating) announces release 0.10.0.

Commercial Announcements

— Altiscale announces the Real-Time Edition of Altiscale Insight Cloud, which includes Apache HBase and Spark Streaming.

— Databricks announces availability of its managed Spark service on AWS GovCloud (US).

— Qubole announces QDS HBase-as-a-Service on AWS.
