The Year in Machine Learning (Part One)

This is the first installment in a four-part review of 2016 in machine learning and deep learning.

In the first post, we look back at ML/DL news organized in five high-level topic areas:

  • Concerns about bias
  • Interpretable models
  • Deep learning accelerates
  • Supercomputing goes mainstream
  • Cloud platforms build ML/DL stacks

In Part Two, we cover developments in each of the leading open source machine learning and deep learning projects.

Parts Three and Four will review the machine learning and deep learning moves of commercial software vendors.

Concerns About Bias

As organizations expand the use of machine learning for profiling and automated decisions, there is growing concern about the potential for bias. In 2016, reports in the media documented racial bias in predictive models used for criminal sentencing, discriminatory pricing in automated auto insurance quotes, an image classifier that learned “whiteness” as an attribute of beauty, and hidden stereotypes in Google’s word2vec algorithm.
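The word-embedding finding rests on a simple measurement: in the learned vector space, occupation words like "nurse" sit closer to "she" than to "he." The measurement can be sketched with toy vectors (a minimal sketch; the three-dimensional vectors below are invented for illustration, while real embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" -- invented numbers, for illustration only.
vectors = {
    "he":       [0.9, 0.1, 0.0],
    "she":      [-0.9, 0.1, 0.0],
    "engineer": [0.7, 0.6, 0.2],
    "nurse":    [-0.7, 0.6, 0.2],
}

def gender_lean(word):
    """Positive = closer to 'he'; negative = closer to 'she'."""
    return cosine(vectors[word], vectors["he"]) - cosine(vectors[word], vectors["she"])

print(gender_lean("engineer"))  # positive: leans toward "he"
print(gender_lean("nurse"))     # negative: leans toward "she"
```

Run against real word2vec vectors, the same arithmetic surfaces the stereotypes the researchers documented.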

Two bestsellers published in 2016 address the issue. The first, Cathy O’Neil’s Weapons of Math Destruction, was longlisted for the National Book Award. In a review for The Wall Street Journal, Jo Craven McGinty summarizes O’Neil’s argument as “algorithms aren’t biased, but the people who build them may be.”

A second book, Virtual Competition, written by Ariel Ezrachi and Maurice Stucke, focuses on the ways that machine learning and algorithmic decisions can promote price discrimination and collusion. Burton Malkiel notes in his review that the work “displays a deep understanding of the internet world and is outstandingly researched. The polymath authors illustrate their arguments with relevant case law as well as references to studies in economics and behavioral psychology.”

Most working data scientists are deeply concerned about bias in the work they do. Bias, after all, is a form of error, and a biased algorithm is an inaccurate algorithm. The organizations that employ data scientists, however, may not commit the resources needed for testing and validation, which is how we detect and correct bias. Moreover, people in business suits often exaggerate the accuracy and precision of predictive models or promote their use for inappropriate applications.

In Europe, GDPR creates an incentive for organizations that use machine learning to take the potential for bias more seriously. We’ll be hearing more about GDPR in 2017.

Interpretable Models

Speaking of GDPR, beginning in 2018, organizations that use machine learning to drive automated decisions must be prepared to explain those decisions to the affected subjects and to regulators. As a result, in 2016 we saw considerable interest in efforts to develop interpretable machine learning algorithms.

— The MIT Computer Science and Artificial Intelligence Laboratory announced progress in developing neural networks that deliver explanations for their predictions.

— At the International Joint Conference on Artificial Intelligence, David Gunning summarized work to date on explainability.

— MIT selected machine learning startup Rulex as a finalist in its Innovation Showcase. Rulex implements a technique called Switching Neural Networks to learn interpretable rule sets for classification and regression.

— In O’Reilly Radar, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin explained Local Interpretable Model-Agnostic Explanations (LIME), a technique that explains the predictions of any machine learning classifier.

— The Wall Street Journal reported on an effort by Capital One to develop machine learning techniques that account for the reasoning behind their decisions.

— In Nautilus, Aaron M. Bornstein asked: Is artificial intelligence permanently inscrutable? There are several issues, including a lack of clarity about what “interpretability” means.
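Of the techniques above, LIME is the easiest to sketch from first principles: perturb the input, query the black-box model, and fit a simple surrogate to the responses, weighting samples by their proximity to the point being explained. A one-dimensional toy version (my own sketch, not the actual LIME implementation):

```python
import math
import random

random.seed(0)

# A black-box "model": we can query scores but not inspect internals.
def black_box(x):
    return 1.0 if x * x > 4.0 else 0.0

def local_slope(f, x0, num_samples=500, spread=1.0, kernel_width=1.0):
    """Fit a proximity-weighted linear surrogate around x0; return its slope."""
    xs = [x0 + random.gauss(0.0, spread) for _ in range(num_samples)]
    ys = [f(x) for x in xs]
    # Proximity kernel: perturbations near x0 count more.
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    wsum = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / wsum
    ybar = sum(w * y for w, y in zip(ws, ys)) / wsum
    num = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    return num / den

# Near x0 = 1.8, the model's decision boundary at x = 2 makes the
# score locally increasing, so the surrogate slope comes out positive.
print(local_slope(black_box, 1.8))
```

The surrogate is interpretable even though the underlying model is not; that is the whole trick.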

It is important to distinguish between “interpretability by inspection” and “functional” interpretability. We do not evaluate an automobile by disassembling its engine and examining the parts; we get behind the wheel and take it for a drive. At some point, we will all have to accept that machine learning models should be evaluated by how they behave, not by examining their parts.

Deep Learning Accelerates

In a September Fortune article, Roger Parloff explains why deep learning is suddenly changing your life. Neural networks and deep learning are not new techniques; we see practical applications emerge now for three reasons:

— Computing power is cheap and getting cheaper; see the discussion below on supercomputing.

— Deep learning works well in “cognitive” applications, such as image classification, speech recognition, and language translation.

— Researchers are finding new ways to design and train deep learning models.

In 2016, the field of DL-driven cognitive applications reached new milestones:

— A Microsoft team developed a system that recognizes conversational speech as well as humans do. The team used convolutional and long short-term memory (LSTM) neural networks built with Microsoft Cognitive Toolkit (CNTK).

— On the Google Research Blog, a Google Brain team announced the launch of the Google Neural Machine Translation System, a system based on deep learning that is currently used for 18 million translations per day.

— In TechCrunch, Ken Weiner reported on advances in DL-driven image recognition and how they will transform business.
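The LSTM networks behind results like Microsoft’s are, at bottom, a small recurrence: gates decide what to forget, what to store, and what to emit at each step. A minimal numpy sketch of one cell step (random weights, for illustration only; not Microsoft’s system):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, params):
    """One LSTM cell step over input x, hidden state h, cell state c."""
    Wi, Wf, Wo, Wg, bi, bf, bo, bg = params
    z = np.concatenate([x, h])      # combine input with previous hidden state
    i = sigmoid(Wi @ z + bi)        # input gate: what to store
    f = sigmoid(Wf @ z + bf)        # forget gate: what to discard
    o = sigmoid(Wo @ z + bo)        # output gate: what to emit
    g = np.tanh(Wg @ z + bg)        # candidate cell update
    c_new = f * c + i * g           # cell state carries long-term memory
    h_new = o * np.tanh(c_new)      # hidden state is the step's output
    return h_new, c_new

n_in, n_hidden = 4, 8
params = [rng.normal(0.0, 0.1, (n_hidden, n_in + n_hidden)) for _ in range(4)]
params += [np.zeros(n_hidden) for _ in range(4)]

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for _ in range(10):                 # process a 10-step input sequence
    x = rng.normal(0.0, 1.0, n_in)
    h, c = lstm_step(x, h, c, params)

print(h.shape)  # (8,)
```

The cell state c is what lets LSTMs carry information across long sequences, which is why they dominate speech and translation tasks.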

Venture capitalists aggressively funded startups that leverage deep learning in applications, especially those that can position themselves in the market for cognitive solutions:

— Affectiva, which uses deep learning to read facial expressions in digital video, closed on a $14 million “D” round led by Fenox Venture Capital.

— Clarifai, a startup that offers a DL-driven image and video recognition service, landed a $30 million Series B round led by Menlo Ventures.

— Zebra Medical Vision, an Israeli startup, uses DL to examine medical images and diagnose diseases of the bones, brain, cardiovascular system, liver, and lungs. Zebra disclosed a $12 million venture round led by Intermountain Health.

There is an emerging ecosystem of startups that are building businesses on deep learning. Here are six examples:

— Deep Genomics, based in Toronto, uses deep learning to understand diseases, disease mutations, and genetic therapies.

— Cybersecurity startup Deep Instinct uses deep learning to predict, prevent, and detect threats to enterprise computing systems.

— Ditto Labs uses deep learning to identify brands and logos in images posted to social media.

— Enlitic offers DL-based patient triage, disease screening, and clinical support to make medical professionals more productive.

— Gridspace provides conversational speech recognition systems based on deep learning.

— Indico offers DL-driven tools for text and image analysis in social media.

And, in a sign that commercial development of deep learning isn’t all hype and bubbles, NLP startup Idibon ran out of money and shut down. We can expect further consolidation in the DL tools market as major vendors with deep pockets ramp up their programs. The greatest opportunity for new entrants will be in specialized applications, where the founders can deliver domain expertise and packaged solutions to well-defined problems.

Supercomputing Goes Mainstream

To make deep learning practical, you need a lot of computing horsepower. In 2016, hardware vendors introduced powerful new platforms that are purpose-built for machine learning and deep learning.

While GPUs are currently in the lead, there is a serious debate under way about the relative merits of GPUs and FPGAs for deep learning. Anand Joshi explains the FPGA challenge. In The Next Platform, Nicole Hemsoth describes the potential of a hybrid approach that leverages both types of accelerators. During the year, Microsoft announced plans to use Altera FPGAs, and Baidu said it intends to standardize on Xilinx FPGAs.

NVIDIA Launches the DGX-1

NVIDIA had a monster 2016, tripling its market value over the course of the year. The company released the DGX-1, a deep learning supercomputer. The DGX-1 includes eight Tesla P100 GPUs, which NVIDIA says deliver 12X the deep learning training performance of its previous-generation hardware. For $129,000, you get the throughput of 250 CPU-based servers.

NVIDIA also revealed a Deep Learning SDK that includes deep learning primitives, math libraries, tools for multi-GPU communication, the CUDA toolkit, and DIGITS, a model training system. The SDK works with popular deep learning frameworks such as Caffe, CNTK, TensorFlow, and Theano.

Tech media salivated:

MIT Technology Review interviewed NVIDIA CEO Jen-Hsun Huang, who is now Wall Street’s favorite tech celebrity.

Separately, Karl Freund reports on NVIDIA’s announcements at the SC16 supercomputing show.

Early users of the DGX-1 include BenevolentAI, Partners HealthCare, Argonne and Oak Ridge National Laboratories, New York University, Stanford University, the University of Toronto, SAP, Fidelity Labs, Baidu, and the Swiss National Supercomputing Centre. Nicole Hemsoth explains how NVIDIA supports cancer research with its deep learning supercomputers.

Cray Releases the Urika-GX

Cray launched the Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools, and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage, and Cray’s high-performance network interconnect. Cray began shipping 16-, 32-, and 48-node racks in the third quarter, with larger configurations to follow later in the year.

Intel Responds

The headline on the Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. Intel acquired Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reported a price tag of $408 million. The customary tech media unicorn story storm ensued.

Intel said it plans to use Nervana’s software to improve the Math Kernel Library and to market the Nervana Engine alongside the Xeon Phi processor. Nervana’s neon is YADLF — Yet Another Deep Learning Framework — which ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Paul Alcorn offers additional detail on Intel’s new Xeon CPU and Deep Learning Inference Accelerator. In Fortune, Aaron Pressman argues that Intel’s strategy for machine learning and AI is smart, but lags NVIDIA. Nicole Hemsoth describes Intel’s approach as “war on GPUs.”

Separately, Intel acquired Movidius, the folks who put a deep learning chip on a memory stick.

Cloud Platforms Build ML/DL Stacks

Machine learning use cases are inherently well-suited to cloud platforms. Workloads are ad hoc and project oriented; model training requires huge bursts of computing power for a short period. Inference workloads are a different matter, which is one of many reasons one should always distinguish between training and inference when choosing platforms.

Amazon Web Services

After a head fake earlier in the year, when it published DSSTNE, a deep learning project that nobody wanted, AWS announced that it will standardize on MXNet for deep learning. Separately, AWS launched three new managed machine learning services:

— Rekognition, for image recognition

— Polly, for text-to-speech

— Lex, a conversational chatbot development platform

In 2014, AWS was first to market among the cloud platforms with GPU-accelerated computing services. In 2016, AWS added P2 instances with up to 16 Tesla K80 GPUs.

Microsoft Azure

Microsoft rebranded its deep learning framework, released in 2015 as CNTK, as the Microsoft Cognitive Toolkit, and released Version 2.0 with a new Python API and many other enhancements. The company also launched 22 cognitive APIs in Azure for vision, speech, language, knowledge, and search. Separately, MSFT released its managed service for Spark in Azure HDInsight and continued to enhance Azure Machine Learning.

MSFT also announced the Azure N-Series compute instances powered by NVIDIA GPUs for general availability in December.

Azure is one part of MSFT’s overall strategy in advanced analytics, which I’ll cover in Part Three of this review.

Google Cloud

In February, Google released TensorFlow Serving, an open source inference engine that handles model deployment after training and manages model lifecycles. On the Google Research Blog, Noah Fiedel explained the details.

Later in the spring, Google announced that it was building its own deep learning chips, called Tensor Processing Units (TPUs). In Forbes, HPC expert Karl Freund dissected Google’s announcement. Freund believes that TPUs are used for inference rather than model training; in other words, they replace CPUs rather than GPUs.

Google launched a dedicated team in October to drive Google Cloud Machine Learning, and announced a slew of enhancements to its services:

— Google Cloud Jobs API provides businesses with capabilities to find, match and recommend jobs to candidates. Currently available in a limited alpha.

— Cloud Vision API now runs on Google’s custom Tensor Processing Units; prices reduced by 80%.

— Cloud Translation API will be available in two editions, Standard and Premium.

— Cloud Natural Language API graduates to general availability.

In 2017, GPU-accelerated instances will be available for the Google Compute Engine and Google Cloud Machine Learning. Details here.

IBM Cloud

In 2016, IBM contributed heavily to the growing volume of fake news.

At the Spark Summit in June, IBM announced a service called the IBM Data Science Experience to great fanfare. Experienced observers found the announcement puzzling; the press release described a managed service for Apache Spark with a Jupyter IDE, but IBM already had a managed service for Apache Spark with a Jupyter IDE.

In November, IBM quietly released the service without a press release, which is understandable since there was nothing to crow about. Sure enough, it’s a Spark service with a Jupyter IDE, but also includes an R service with RStudio, some astroturf “community” documents and “curated” data sources that are available for free from a hundred different places. Big Whoop.

In IBM’s other big machine learning move, the company rebranded an existing SPSS service as Watson Machine Learning. Analysts fell all over themselves raving about the new service, apparently without actually logging in and inspecting it.


Of course, IBM says that it has big plans to enhance the service. It’s nice that IBM has plans. We should all aspire to bigger and better things, but keep in mind that while IBM is very good at rebranding stuff other people built, it has never in its history developed a commercially successful software product for advanced analytics.

IBM Cloud is part of a broader strategy for IBM, so I’ll have more to say about the company in Part Three of this review.

Big Analytics Roundup (August 15, 2016)

In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, which can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and to market the Nervana Engine alongside the Xeon Phi processor. Nervana’s neon is YADLF — Yet Another Deep Learning Framework — which ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.


Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to OpenHUB, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what Apache Beam has accomplished in its first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gillad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches. (h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces Release 0.10.0.1, with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (June 20, 2016)

Light news this week — everyone is catching up from Spark Summit, it seems. We have a nice crop of explainers, and some thoughts on IBM’s “Data Science Experience” announcement.

On his personal blog, Michael Malak recaps the Spark Summit.

Teradata releases a Spark connector for Aster, so Teradata is ready for 2014.

On KDnuggets, Gregory Piatetsky publishes a follow-up to results of his software poll, this time analyzing which tools tend to be used together.

In Datanami, Alex Woodie asks if Spark is overhyped, quoting extensively from some old guy. Woodie notes that it’s difficult to track the number of commercial vendors who have incorporated Spark into their products. Actually, it isn’t:

[Chart: commercial analytics vendors and their Spark adoption]

And yes, there are a few holdouts in the lower left quadrants.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

IBM Data Science Experience

Unless you attended the recent Spark Summit with a bag over your head, you’re aware that IBM announced something. An IBM executive wants to know if I heard the announcement. The answer is yes: I saw the press release and the planted stories, but IBM’s announcements are — shall we say — aspirational. IBM is announcing a concept. The service isn’t in limited release, and IBM has not revealed a date when the service will be available.


It’s hard to evaluate a service that IBM hasn’t defined. Media reports and the press release are inconsistent — all stories mention Spark, Jupyter, RStudio and R; some stories mention H2O, others mention Cplex and other products. Insiders at IBM are in the dark about what components will be included in the first release.

Evaluating the release conceptually:

  • IBM already offers a managed service for Spark; it’s less flexible than Databricks or Qubole, and not as rich as Altiscale or Domino Data.
  • Unlike Qubole and Databricks, IBM plans to use Jupyter notebooks and RStudio rather than creating an integrated development environment of its own.
  • R and RStudio in the cloud are already available in AWS, Azure and Domino. If IBM plans to use a vanilla R distribution, it will be less capable than Microsoft’s enhanced R distribution available in Azure.
  • A managed service for H2O is a good thing, if it happens. There is no formal partnership between IBM and H2O.ai, and insiders at H2O seem surprised by IBM’s announcement. Of course, it’s already possible to implement H2O in any IaaS cloud environment, and H2O has users on AWS, Azure and Google Cloud platforms already.

Bottom line: IBM’s “Data Science Experience” is a marketing wrapper around an existing service, with the possibility of adding new services that may or may not be as good as offerings already in the marketplace. We’ll take another look when IBM actually releases something.

Explainers

— Davies Liu and Herman van Hovell explain SQL subqueries in Spark 2.0.

— On the MapR blog, Ellen Friedman explains SQL queries on mixed schema data with Apache Drill.

— Bill Chambers publishes the first of three parts on writing Spark applications in Databricks.

— In TechRepublic, Hope Reese explains machine learning to smart people. For everyone else, there’s this.

— Carla Schroder explains how Verizon Labs built a 600-node bare metal Mesos cluster in two weeks.

— On YouTube, H2O.ai’s Arno Candel demonstrates TensorFlow deep learning on an H2O cluster.

— Jessica Davis compiles a listicle of Tech Giants who embrace open source.

— Microsoft’s Dmitry Pechyoni reports results from an analysis of 600 million taxi rides using Microsoft R Server on a single instance of the Data Science Virtual Machine in Azure.

Perspectives

— InformationWeek’s Jessica Davis wonders if Microsoft will keep LinkedIn’s commitment to open source. LinkedIn’s donations to open source have less to do with its “commitment”, and more to do with its understanding that software is not its core business.

— Arthur Cole wonders if open source software will come to rule the enterprise data center as a matter of course. The answer is: it’s already happening.

Open Source Announcements

— Apache Beam (incubating) announces version 0.1.0. Key bits: SDK for Java and runners for Apache Flink, Apache Spark and Google Cloud Dataflow.

— Apache Mahout announces version 0.12.2, a maintenance release.

— Apache SystemML (incubating) announces release 0.10.0.

Commercial Announcements

— Altiscale announces the Real-Time Edition of Altiscale Insight Cloud, which includes Apache HBase and Spark Streaming.

— Databricks announces availability of its managed Spark service on AWS GovCloud (US).

— Qubole announces QDS HBase-as-a-Service on AWS.

Big Analytics Roundup (May 16, 2016)

This week we have more insight into Spark 2.0, scheduled for release just before Spark Summit 2016. (Yes, I’m going.) Also, kudos to BI-on-Hadoop startup AtScale for a new round of funding; Amazon releases YADLF (Yet Another Deep Learning Framework); and there are a number of new faces at H2O.ai.

Plus, we have an extended review of the Palantir story.

Buzzfeed on Palantir

Last week, I deemed Buzzfeed’s story on Palantir too dumb to link. (“Forget it, Jake. It’s Buzzfeed.”) Buzzfeed “news” reporter William Alden, who was all over a story about maggots in Facebook lunches, breathlessly mines a cache of “secret internal documents” and discovers:

  • Palantir expects employee turnover of around 20% for 2016.
  • Palantir lost some clients.
  • Palantir books more work than it bills.

Does Palantir have an employee turnover problem?  No. A 20% turnover rate is slightly above the 17% reported for all industries in 2015, and about on track for Silicon Valley. (There are companies in SV with 100% turnover rates.) On Glassdoor, employees give Palantir high marks.

Does Palantir have a client retention problem? Not exactly. The story cites four clients — American Express, Coca-Cola, Kimberly-Clark and Nasdaq — who engaged Palantir to conduct a pilot, then decided not to proceed with a long-term contract. In other words, these are lost sales, not cancelled contracts. The document Buzzfeed obtained is Palantir’s won/lost analysis, which shows that the company is attempting to learn from its lost sales.

Does Palantir have a revenue problem? No. Palantir’s 2015 revenue was up 50% from the previous year. Buzzfeed obsesses over the difference between Palantir’s bookings of $1.7 billion and its revenue of $420 million. A high book-to-bill ratio is typical for consultancies that pursue large multi-year projects; it is a sign of strong demand for the company’s services. Under GAAP accounting, companies can accrue revenue only as work is performed, even if they bill the work in advance. Note that consulting giant Accenture’s bookings exceed its revenue for its most recent quarter.
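The arithmetic is worth making explicit (the bookings and revenue figures come from the story; the contract below is hypothetical):

```python
# Book-to-bill ratio from the figures Buzzfeed reported.
bookings = 1700.0  # $ millions booked under (largely multi-year) contracts
revenue = 420.0    # $ millions recognized as revenue for the year

book_to_bill = bookings / revenue
print(round(book_to_bill, 2))  # 4.05: roughly four years of contracted work

# Under GAAP, revenue on a prepaid multi-year contract is recognized
# as the work is performed, not when the cash arrives.
contract_value = 100.0                        # hypothetical $100M, 4-year deal
years = 4
recognized_per_year = contract_value / years  # straight-line illustration
print(recognized_per_year)  # 25.0 per year; the rest sits as deferred revenue
```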

Does Palantir have a profitability problem? Possibly. Buzzfeed reports that the company lost $80 million last year on revenue of $420 million. Consulting margins tend to be fairly high, so a loss means that Palantir is “investing” in a lot of unbillable work. It’s hard to say if these “investments” will pay off. Palantir closed another round of funding in December, 2015, so people with more and better information than Buzzfeed obviously think they will, and are backing up their belief with cash.

By the way, you know who has an actual revenue problem? Buzzfeed.

Roger Peng attempts to draw lessons for data scientists from the Buzzfeed story without questioning its premises. He should stick to biostatistics.

Spark 2.0

— Databricks announces preview of Apache Spark 2.0 on Databricks Community Edition.

— From last week: Reynold Xin explains what’s new in Spark 2.0.

— Dave Ramel summarizes the new features, including faster SQL; consolidation of the Dataset and DataFrame APIs; support for ANSI (2003) SQL; and Structured Streaming, an integrated view of tables and streams.

— Now that Spark 2.0 is in preview, MapR offers Spark 1.6.1.

Explainers

— Four from Adrian Colyer:

— Richard Williamson explains how to build a streaming prediction engine with Spark, MADlib, Kudu and Impala.

— On the Cloudera Vision blog, Santosh Kumar explains Hive-on-Spark.

— DataStax’ Dani Traphagen explains data processing with Spark and Cassandra.

— In ZDNet, Andrew Brust explains Microsoft’s R strategy, and gets it right.

Perspectives

— In a planted article on Linux.com, Pam Baker interviews IBM’s Mike Breslin, who answers questions nobody is asking about using Spark and Cloudant.

— Joyce Wells recaps a presentation by Booz Allen’s Jair Aguirre, who touts Apache Drill.

— Alex Woodie attends the Apache: Big Data 2016 conference and discovers open source projects.

— In Business Insider, Sam Shead describes Facebook’s FBLearner Flow, a workbench for machine learning and AI.

— Leslie D’Monte describes some ways companies use machine learning in their operations.

Open Source Announcements

— Google announces release to open source of SyntaxNet, a framework for natural language understanding. Included in the release: an English parser dubbed Parsey McParseface. Journalists respond to the latter like dogs to a squirrel.

— Amazon releases yet another deep learning framework, this one branded as “Deep Scalable Sparse Tensor Network Engine (DSSTNE)” or “Destiny”. Stephanie Condon reports.

— Salesforce donates PredictionIO to Apache.

— Apache Storm announces two new maintenance releases:

  • Storm 0.10.1 has bug fixes.
  • Storm 1.0.1 has performance improvements and bug fixes.

— Apache Flink announces Release 1.0.3, with bug fixes and improved documentation.

— Apache Apex pushes a release to resolve a security issue.

Commercial Announcements

— BI-on-Hadoop startup AtScale announces an $11 million “B” round. Media coverage here.

— H2O.ai announces new hires with a strong orientation towards visualization, suggesting the company plans to add a more robust user interface to its best-in-class machine learning engine.

Big Analytics Roundup (March 7, 2016)

Hortonworks wins the internet this week, beating the drum for its partnership with Hewlett-Packard Enterprise. The story is down under “Commercial Announcements,” just above the story about Hortonworks’ shareholder lawsuit.

Google releases a distributed version of TensorFlow, and HDP releases a new version of DataFlow. We are reaching peak flow.

IBM demonstrates its core values.

Folks who fret about cloud security don’t understand that data is safer in the cloud than it is on premises.  There are simple steps you can take to reduce or eliminate concerns about data security.  Here’s a practical guide to anonymizing your data.

Explainers

In the morning paper, Adrian Colyer explains trajectory data mining.

On the AWS Big Data Blog, Manjeet Chayel explains how to analyze your data on DynamoDB with Spark.

Nicholas Perez explains how to log in Spark.

Altiscale’s Andrew Lee explains memory settings in part 4 of his series of Tips and Tricks for Running Spark on Hadoop.  Parts 1-3 are here, here and here.

Sayantam Dey explains topic modeling using Spark for TF-IDF vectorization.

Slim Baltagi updates us all on the state of the Flink community.

Martin Junghanns explains scalable graph analytics with Neo4j and Flink.

On SlideShare, Vasia Kalavri explains batch and stream graph processing with Flink.

DataTorrent’s Thomas Weise explains exactly-once processing with Apache Apex.

Nishant Singh explains how to get started with Apache Drill.

On the Cloudera Engineering Blog, Xuefu Zhang explains what’s new in Hive 2.0.

On the Google Cloud Platform Blog, Matthieu Mayran explains how to build a recommender on Google Compute Engine.

In TechRepublic, James Sanders explains Amazon Web Services in what he characterizes as a smart person’s guide.  If you’re not smart and still want to use AWS, go here.

Perspectives

We continue to digest analysis from Spark Summit East:

— Altiscale’s Barbara Lewis summarizes her nine favorite sessions.

— Jack Vaughan interviews attendees from CapitalOne, eBay, DataXu and some other guy who touts open source.

— Alex Woodie interviews attendees from Bloomberg and Comcast and grabs quotes from Tony Baer, Mike Gualtieri and Anjul Bhambhri, who all agree that Spark is a thing.

In other matters:

— In KDnuggets, Gregory Piatetsky attacks the idea of the “citizen data scientist” and gives it a good thrashing.

— Paige Roberts probes the true meaning of “real time.”

— MapR’s Jim Scott compares Drill and Spark for SQL and offers his opinion on the strengths of each.

— Sri Ambati describes the road ahead for H2O.ai.

Open Source Announcements

— Google releases Distributed TensorFlow without an announcement.  On KDnuggets, Matthew Mayo applauds.

— Hortonworks announces a new release of Dataflow, which is Apache NiFi with the Hortonworks logo.  New bits include integrated security and support for Apache Kafka and Apache Storm.

— On the Databricks blog, Joseph Bradley et al. introduce GraphFrames, a graph processing library that works with the DataFrames API.  GraphFrames is a Spark Package.

Commercial Announcements

— Hortonworks announces partnership with Hewlett Packard Enterprise to enhance Apache Spark.  HPE claims to have rewritten Spark shuffle for faster performance, and HDP will help them contribute the code back to Spark.  That’s nice.  Not exactly the ground-shaking announcement HDP touted at Spark Summit East, but nice.

— Meanwhile, Hortonworks investors sue the company, claiming it lied in a November 10-Q when it said it had enough cash on hand to fund twelve months of operations.  The basic issue is that Hortonworks burns cash faster than Kim Kardashian out for a spree on Rodeo Drive, spending more than $100 million in the first nine months of 2015, leaving $25 million in the bank.  Hortonworks claims analytic prowess; perhaps it should apply some of that know-how to financial controls.

— OLAP on Hadoop vendor AtScale announces 5X revenue growth in 2015, which isn’t too surprising since they were previously in stealth.  One would expect infinite revenue growth.

Teradata Lays Another Egg

Teradata reports Q3 revenue of $606 million, down 3% in “constant” dollars, down 9% in actual dollars, the kind you can spend.  Product revenue, from selling software and boxes, declined 14%.

In a brutal call with analysts, CEO Mike Koehler noted: “revenue was not what we expected.”  It could have been a recorded message.

Teradata executives tried to blame the weak revenue on the strong dollar.  When pressed, however, they admitted that deferred North American sales drove the shortfall, as companies put off investments in Teradata’s big box solutions.

In other words, the dogs don’t like the dog food.

From the press release:

Teradata is in the process of making transformational changes to improve the long-term performance of the company, including offering more flexibility and options in the way customers buy Teradata products such as a software-only version of Teradata as well as making Teradata accessible in the public cloud. The initial cloud version of Teradata will be available on Amazon’s Web Services in the first quarter of 2016.

An analyst asked about expected margins in the software-only business; Teradata executives clammed up.  The answer is zero.  Teradata without a box is a bladeless knife without a handle, competing directly with open source databases, such as Greenplum.

Another analyst asked about Teradata on AWS, noting that Teradata executives previously declared that their customers would never use AWS.  Response from the executives was more mush.  HP just shuttered its cloud business; Teradata’s move to AWS implies that Teradata Cloud is toast.

Koehler also touted Teradata’s plans to offer Aster on Hadoop, citing “100 pre-built applications”.  Good luck with that.  Aster on Hadoop is a SQL engine that still runs through MapReduce; in other words it’s obsolete, a point reinforced by Teradata’s plans to move forward with Presto.  Buying an analytic database with pre-built applications is like buying a car with pre-built rides.

More from the press release:

“We remain confident in Teradata’s technology, our roadmaps and competitive leadership position in the market and we are taking actions to increase shareholder value.  We are making transformative changes to the company for longer term success, and are also aligning our cost structure for near term improvement,” said Mike Koehler, chief executive officer, Teradata Corporation. 

In other words, expect more layoffs.

“Our Marketing Applications team has made great progress this year, and has market leading solutions. As part of our business transformation, we determined it best to exclusively focus our investments and attention on our core Data and Analytics business.  We are therefore selling our Marketing Applications business. As we go through this process, we will work closely with our customers and employees for continued success.

“We overpaid for Aprimo five years ago, so now we’re looking for some greater fool to buy this dog.”

“In parallel, we are launching key transformation initiatives to better align our Data and Analytics solutions and services with the evolving marketplace and to meet the needs of the new Teradata going forward.”

Update your resumes.

During the quarter, Teradata purchased approximately 8.5 million shares of its stock worth approximately $250 million.  Year to date through September 30, Teradata purchased 15.5 million shares, worth approximately $548 million.

“We have no vision for how to invest in our business, so we’re buying back the stock.”

In early trading, Teradata’s stock plunges.

In 2012, five companies led the data warehousing platform market: Oracle, IBM, Microsoft, Teradata and SAP.  Here’s how their stocks have fared since then:

  • Oracle: Up 24%
  • IBM: Down 29%
  • Microsoft: Up 77%
  • Teradata: Down 61%
  • SAP: Up 22%

Nice work, Teradata!  Making IBM look good…

Big Analytics Roundup (October 12, 2015)

Dell and Silver Lake Partners announce plans to buy EMC for $67 billion, a transaction that is a big deal in the tech world, and mildly interesting for analytics.  Dell acquired StatSoft in 2014, but nothing before or since suggests that Dell knows how to position and sell analytics.  StatSoft is lost inside Dell, and will be even more lost inside Dell/EMC.

EMC acquired Greenplum in 2010; at the time, GP was a credible competitor to Netezza, Aster and Vertica.  It turns out, however, that EMC’s superstar sales reps, accustomed to pushing storage boxes, struggled to sell analytic appliances.  Moreover, with the leading data warehouse appliances vertically integrated with hardware vendors, Greenplum was out there in the middle of nowhere peddling an appliance that wasn’t really an appliance.

EMC shifted the Greenplum assets to its Pivotal Software unit, which subsequently open sourced the software it could not sell and exited the Hadoop distribution business under the ODP fig leaf.  Alpine Data Labs, which used to be tied to Greenplum like bears to honey, figured out a year ago that it could not depend on Greenplum for growth, and has diversified its platform support.

What’s left of Pivotal Software is a consulting business, which is fine — all of the big tech companies have consulting arms.  But I doubt that the software assets — Greenplum, Hawq and MADLib — have legs.

In other news, the Apache Software Foundation announces three interesting software releases:

  • Apache Accumulo: Release 1.6.4, a maintenance release.
  • Apache Ignite: Release 1.4.0, a feature release with SSL and log4j2 support, faster JDBC driver implementation and more.
  • Apache Kafka: Release 0.8.2.2, a maintenance release.

Spark

On the MapR blog, Jim Scott takes a “Spark is a fighter jet” metaphor and flies it until the wings fall off.

Spark Performance

Dave Ramel summarizes a paper he thinks is too long for you to read.  That paper, here, written by scientists affiliated with IBM and several universities, reports on detailed performance tests for MapReduce and Spark across four different workloads.  As I noted in a separate blog post, Ramel’s comment that the paper “calls into question” Spark’s record-setting performance on GraySort is wrong.

Spark Appliances

Ordinarily I don’t link sponsored content, but this article from Numascale is interesting.  Numascale, a Norwegian company, offers analytic appliances with lots of memory; there’s an R appliance, a Spark appliance and a database appliance with MonetDB.

Spark on Amazon EMR

On Slideshare, Amazon’s Jonathan Fritz and Manjeet Chayel summarize best practices for data science with Spark on EMR.  The presentation includes an overview of Spark DataFrames, a guide to running Spark on Amazon EMR, customer use cases, tips for optimizing performance and a plug for Zeppelin notebooks.

Use Cases

In Datanami, Alex Woodie describes how Uber uses Spark and Hadoop.

Stitch Fix offers personalized style recommendations to its customers.  Jas Khela describes how the startup uses Spark.   (h/t Hadoop Weekly)

SQL/OLAP/BI

Apache Drill

MapR’s Neeraja Rentachintala, Director of Product Management, rethinks SQL for Big Data.  Without a trace of irony, she explains how to bring SQL to NoSQL datastores.

Apache Hawq

On the Pivotal Big Data blog, Gavin Sherry touts Apache Hawq and Apache MADLib.  Hawq is a SQL engine that federates queries across Greenplum Database and Hadoop; MADLib is a machine learning library.  MADLib was always open source; Hawq, on the other hand, is a product Pivotal tried, and failed, to sell.  In Datanami, George Leopold reports.

In CIO Today, Jennifer LeClaire speculates that Pivotal is “taking on” Oracle’s traditional database business with this move, which is a colossal pile of horse manure.

At Apache Big Data Europe, Caleb Welton explains Hawq’s architecture in a deep dive.  The endorsement from GE’s Jeffrey Immelt is a bit rich considering GE’s ownership stake in Pivotal, but the rest of the deck is solid.

Apache Phoenix

At Apache Big Data Europe, Nick Dimiduk delivers an overview of Phoenix, a relational database layer for HBase.  Phoenix includes a query engine that transforms SQL into native HBase API calls, a metadata repository and a JDBC driver.  SQL support is broad enough to run TPC benchmark queries.  Dimiduk also introduces Apache Calcite, a query parser, compiler and planner framework currently in incubation.
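The core trick Dimiduk describes is compiling SQL predicates into range scans over HBase’s sorted key space.  A minimal, hypothetical sketch of that idea in Python, with an invented row-key schema and `bisect` standing in for HBase’s scanner:

```python
import bisect

# Phoenix compiles SQL into native scans over HBase's sorted keys.
# Toy version: turn "WHERE user = 'bob'" into a key-range scan over a
# sorted list of (rowkey, value) pairs. Schema and keys are invented.
table = sorted([
    ("alice#2015-01-01", 3),
    ("bob#2015-01-05", 7),
    ("bob#2015-02-11", 2),
    ("carol#2015-03-09", 5),
])
keys = [k for k, _ in table]

def scan_prefix(prefix):
    """Range scan: all rows whose key starts with the given prefix."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return table[lo:hi]

# SELECT * FROM events WHERE user = 'bob'  ->  scan of the 'bob#' range
rows = scan_prefix("bob#")
print(rows)
```

The payoff is the same as in Phoenix: an equality predicate on the leading key column touches only the matching key range, never the whole table.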

Data Blending

On Forbes, Adrian Bridgewater touts the data blending capabilities of ClearStory Data and Alteryx without explaining why data blending is a thing.

Presto

On the AWS Big Data Blog, Songzhi Liu explains how to use Presto and Airpal on EMR.  Airpal is a web-based query tool developed by Airbnb that runs on top of Presto.

Machine Learning

Apache MADLib

MADLib is an open source project for machine learning in SQL.  Developed by people affiliated with Greenplum, MADLib has always been an open source project, but is now part of the Apache community.  Machine learning functionality is quite rich.  Currently, MADLib supports PostgreSQL, Greenplum database and Apache Hawq.  In theory, the software should be able to run in any SQL engine that supports UDFs; since Oracle, IBM and Teradata all have their own machine learning stories, I doubt that we will see MADLib running on those platforms. (h/t Hadoop Weekly)
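MADLib works by packaging machine learning as SQL UDFs that run inside the database, next to the data.  Here is a minimal sketch of that pattern using SQLite’s Python bindings; the table name, columns and logistic coefficients are all invented for illustration, not MADLib’s actual API:

```python
import sqlite3
import math

# MADLib-style pattern: register a (pre-trained) scoring function as a
# SQL UDF, then invoke it from plain SQL over a table of observations.
def logistic_score(x1, x2):
    """Score one row with fixed, made-up logistic-regression coefficients."""
    z = -1.5 + 0.8 * x1 + 0.3 * x2
    return 1.0 / (1.0 + math.exp(-z))

conn = sqlite3.connect(":memory:")
conn.create_function("logistic_score", 2, logistic_score)
conn.execute("CREATE TABLE patients (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(1.0, 2.0), (0.0, 0.0), (3.0, 1.0)])

# Scoring happens inside the SQL engine; no data leaves the database.
rows = conn.execute(
    "SELECT x1, x2, logistic_score(x1, x2) AS p FROM patients").fetchall()
for x1, x2, p in rows:
    print(f"x1={x1} x2={x2} p={p:.3f}")
```

This is why, in principle, MADLib could run on any engine with UDF support: the ML code rides along inside ordinary SQL queries.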

Apache Spark (SparkR)

On the Databricks blog, Eric Liang and Xiangrui Meng review additions to the R interface in Spark 1.5, including support for Generalized Linear Models.

Apache Spark (MLLib)

On the Cloudera blog, Jose Cambronero explains what he did this summer, which included running K-S tests in Spark.

Apache Zeppelin

At Apache Big Data Europe, DataStax’s Duy Hai Doan explains why you should care about Zeppelin’s web-based notebook for interactive analytics.

H2O and Spark (Sparkling Water)

In a guest post on the Cloudera blog, Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai explain how to create an integrated machine learning pipeline using Spark MLLib, H2O and Sparkling Water, H2O’s interface with Spark.

How Yahoo Does Deep Learning on Spark

Cyprien Noel, Jun Shi, Andy Feng and the Yahoo Big ML Team explain how Yahoo does Deep Learning with Caffe on Spark.  Yahoo adds GPU nodes to its Hadoop clusters; each GPU node has 10X the processing power of a commodity Hadoop node.  The GPU nodes connect to the rest of the cluster through Ethernet, while Infiniband provides high-speed connectivity among the GPUs.


Caffe is an open source Deep Learning framework developed by the Berkeley Vision and Learning Center (BVLC).  In Yahoo’s implementation, Spark assembles the data from HDFS, launches multiple Caffe learners running on the GPU nodes, then saves the resulting model back to HDFS. (h/t Hadoop Weekly)
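The orchestration pattern Yahoo describes, a driver that shards the data, launches parallel learners, then combines their results, can be sketched in a few lines of plain Python.  This is a toy stand-in, not Yahoo’s code: Spark plays the driver role, Caffe-on-GPU plays the learner role, and here each “learner” just fits a one-parameter mean model.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def train_shard(shard):
    """Toy learner: the 'model' is simply the mean of the shard."""
    return mean(shard)

data = list(range(1, 13))                # stand-in for HDFS input
shards = [data[i::4] for i in range(4)]  # 4 learners, like 4 GPU nodes

# Driver launches one learner per shard, in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    shard_models = list(pool.map(train_shard, shards))

model = mean(shard_models)               # merge, then save the model back out
print(shard_models, model)
```

In Yahoo’s real setup, the expensive inner loop (Caffe training) runs on the GPU nodes over InfiniBand, and only the assembled model travels back over the ordinary cluster network.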

Streaming Analytics

Apache Flink

On the MapR blog, Henry Saputra recaps an overview of Flink’s stream and graph processing from a recent Meetup.

Apache Kafka

Cloudera’s Gwen Shapira presents Kafka “worst practices”: true stories that happened to Kafka clusters. (h/t Hadoop Weekly)

Apache Spark Streaming

On the MapR blog, Jim Scott offers a guide to Spark Streaming.

Big Analytics Roundup (September 21, 2015)

Top story of the week: release of AtScale’s Hadoop Maturity Survey, which triggered a flurry of analysis.  Meanwhile, the Economist ventures into the world of open source software and venture capital, embarrassing itself in the process; and IBM announces plans to use Spark in its search for extraterrestrial intelligence, a project that would be more useful if pointed toward IBM headquarters.

AtScale Releases Hadoop Adoption Survey

OLAP-on-Hadoop vendor AtScale publishes results of a survey of 2,200 respondents who are either actively working with Hadoop today or planning to do so in the near future.  AtScale partnered with Cloudera, Hortonworks, MapR and Tableau to recruit respondents for the survey.

A copy of the survey report is here; the survey instrument is here.  AtScale will deliver a webinar summarizing results from the survey; you can register here.

There are multiple stories about this survey in the media: here, here, here, here, here, here, here, here, and here.  Some highlights:

  • Andrew Oliver compares this survey to Gartner’s Hadoop assessment back in May and concludes that Gartner blew it.  While I agree that Gartner’s outlook on Hadoop is too conservative (and said so at the time), the two surveys are apples and oranges: while AtScale surveyed people who are either already using Hadoop or plan to do so, Gartner surveyed a panel of CIOs.  Hence, it is not surprising that AtScale’s respondents are more positive about prospects for Hadoop.
  • Matt Asay notes that “Cost saving” is the third most frequently cited reason for adopting Hadoop, after “Scale-out needs” and “New applications.”  This is somewhat surprising, given Hadoop’s reputation as a cheap datastore.  Cost is still a factor driving Hadoop adoption, it’s just not the primary factor.

Here are a few insights from this survey not mentioned by other analysts.  First, look at the difference in BI tool usage between those currently using Hadoop and those planning to use Hadoop.  Compared to current users, planners are significantly more likely to say they want to use Excel and less likely to say they want to use Tableau or SAS.  (Current and planned use of SAP Business Objects and IBM Cognos are about the same.)

[Chart: BI tool usage, current vs. planned Hadoop users]

Also interesting to note differences in Hadoop maturity among the BI users.  SAS users are more likely than others to self-identify as “Low Maturity”:

[Chart: Hadoop maturity by BI tool]

Finally, a significant minority of current Hadoop users cite Management, Security, Performance, Governance and Accessibility as challenges using Hadoop.  However, most who plan to use Hadoop do not anticipate these challenges — which suggests these respondents are in for a rude awakening.

[Chart: challenges cited by current vs. planned Hadoop users]

SQL on Hadoop

For those who like things distilled to sound bites, eWeek offers a point of view on when to select Apache Spark, Hadoop or Hive.   Brevity is the soul of wit, but sometimes it’s just brevity.

Amazon Web Services

Redshift is an OEM version of Actian’s ParAccel columnar database with analytic capabilities removed, which is why data scientists say that Redshift is where data goes to die.  Amazon Web Services has taken baby steps to ameliorate this, adding Python UDFs.  Christopher Crosbie reports, on the AWS Big Data Blog. (h/t Hadoop Weekly)

Apache Apex/DataTorrent

On the DataTorrent blog, Amol Kekre introduces you to Apache Apex, which was just accepted by Apache as an incubator project.  DataTorrent touts Apex as kind of like Spark, only better, thereby demonstrating the importance of timing in life.  (h/t Hadoop Weekly)

If you think that Apex does nothing, Munagala Ramanath shares the good news that Apex supports the Malhar library.  Honestly, though, it still seems to do nothing.

In an email to David Ramel, DataTorrent CEO Phu Hoang identifies flaws in Spark, points to his Apache Apex project as a solution.  Bad move on his part.

Apache Drill

Chloe Green discusses implications of the European Commission’s digital single market, and suggests that retailers will use Apache Drill to analyze the data that will be produced under this regulatory framework.  There are two problems with this article.  First, Green makes no effort to consider alternatives to Drill.  Second, the article itself accepts the premise that more regulation will produce business growth; in fact, the opposite is more likely (except for those in the compliance industry).

The Drill team explains how to implement Drill in ten minutes.

Jim Scott summarizes the benefits of Drill for the BI user.

On O’Reilly Radar, Ellen Friedman recaps the history of Drill as an open source project.

Zygimantas Jacikevicius offers an introduction to Drill and explains why it is useful.

Apache Flink

On the DataArtisans blog, Kostas Tzoumas seeks to position Flink against Spark by arguing that batch is a special case of streaming.  Of course, you can argue the opposite just as easily — that streaming is batch with very small batches.
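Tzoumas’s point, and its converse, fit in a few lines.  A minimal Python sketch, with word count standing in for the computation: one pass over the whole dataset (batch) and a fold over small chunks with carried state (streaming as micro-batches) produce identical results.

```python
from collections import Counter

words = "to stream or not to stream".split()

# Batch: one shot over everything.
batch = Counter(words)

# Streaming: carry state forward, folding in one micro-batch at a time.
stream = Counter()
for i in range(0, len(words), 2):   # micro-batch size 2
    stream.update(words[i:i + 2])

print(batch == stream)
```

The engineering argument is about which formulation the engine treats as primitive, not about which answers it can compute.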

If you care about Off-heap Memory in Apache Flink, Stephan Ewen offers a summary.

At a DC Area Flink Meetup, Capital One’s Slim Baltagi explains unified batch and real-time stream processing with Flink.

Flink sponsor DataArtisans announces partnership with SciSpike, a training and consulting provider.

Apache NiFi

Yves de Montcheuil explains why you should care about Apache NiFi, a project that connects data-generating systems with data processing systems.  Spoiler: it’s all about security and reliability.

Apache Spark

In Fortune, Derrick Harris describes Microsoft’s “Spark-inspired” and “Spark-like” Prajna project, does not explain why MSFT is reinventing the wheel.

Cloudera announces a Spark training curriculum.  For those without prior Hadoop experience, two courses cover data ingestion with Sqoop and Flume, data modeling, data processing with Spark and Spark Streaming with Kafka.  There is also a single shorter course covering the same ground for those with prior Hadoop experience.  Finally, a data science course covers advanced analytics with MLLib.

Document analytics vendor Ephesoft introduces new software built on Spark.

Matt Asay uses the Spark/Fire metaphor once too often.

In a post about DataStax, Curt Monash notes synergies between Spark and Cassandra.

MongoDB offers a white paper which explains, not surprisingly, how to use Spark with Mongo.

On the Basho blog, Korrigan Clark discusses his work using Spark to develop an algorithmic stock trading program.

Here are two items from Cloudera’s Kostas Sakellis on SlideShare.  The first explains why your Spark job fails; the second reviews how to get Spark customers to production.

GraphLab/Dato

Dato, the University of Washington and Coursera announce a machine learning specialization consisting of five courses and a capstone project.  The curriculum is platform neutral, though I suspect that co-creator Carlos Guestrin manages to get in a good word for his project.

H2O/H2O.ai

Two items on slideshare:

  • From a meetup at 6Sense, Mark Landry explains H2O Gradient Boosted Machines for Ad Click Prediction.
  • Avni Wadhwa and Vinod Iyengar demonstrate how to build machine learning applications with Sparkling Water, H2O’s interface to Spark.

Spark Summit 2015: Preliminary Report

So I guess Spark really is enterprise ready.  Nick Heudecker, call your office.

There are several key themes coming from the Summit:

Spark Continues to Mature

Spark and its contributors deserve a round of applause.  Some key measures of growth since the 2014 Summit:

  • Contributor headcount increased from 255 to 730
  • Committed lines of code increased from 175K to 400K

There is increasing evidence of Spark’s scalability:

  • Largest cluster: 8,000 nodes
  • Largest job: 1 petabyte
  • Top streaming intake: 1TB/hour

Project Tungsten aims to make Spark faster and prepare for the next five years; the project has already accomplished significant performance improvements through better use of memory and CPU.

IBM and Spark

IBM drops the big one with its announcement.  Key bits from the announcement:

  • IBM will build Spark into the core of its analytic and commerce products, including IBM Watson Health Cloud.
  • IBM will open source its machine learning library (SystemML) and work with Databricks to port it to Spark.
  • IBM will offer Spark as a cloud service on Bluemix.
  • IBM will commit 3,500 developers to Spark-related projects.
  • IBM (and its partners) will train more than a million people on Spark.

I will post separately on this next week.

Spark is Enterprise-Ready

If IBM’s announcement is not sufficient to persuade skeptics, presentations from Adobe, Airbnb, Baidu, Capital One, CIA, NASA/JPL, NBC Universal, Netflix, Thomson Reuters, Toyota and many others demonstrate that Spark already supports enterprise-level workloads.

In one of the breakouts, Arsalan Tavakoli-Shiraji of Databricks presented results from his analysis of more than 150 production deployments of Spark.  As expected, organizations use Spark for BI and advanced analytics; the big surprise is that 60% use non-HDFS data sources.  These organizations use Spark for data consolidation on the fly, decoupling compute from storage, with unification taking place on the processing layer.

Databricks Cloud is GA

Enough said.

SparkR

Spark 1.4 includes R bindings, opening Spark to the large community of R users.  Out of the gate, the R interface enables the R user to leverage Spark DataFrames; the Spark team plans to extend the capability to include machine learning APIs in Spark 1.5.

Spark’s Expanding Ecosystem

Every major Hadoop distributor showed up this year, but there were no major announcements from the distributors (other than IBM’s bombshell).

In other developments:

  • Amazon Web Services announced availability of a new Spark on EMR service
  • Intel announced a new Streaming SQL project for Spark
  • Lucidworks showcased its Fusion product, with Spark embedded
  • Alteryx announced its plans to integrate with Spark in its Release 10

One interesting footnote — while there were a number of presentations about Tachyon last year, there were none this year.

These are just the key themes.  I’ll publish a more detailed story next week.

Big Analytics Roundup (April 13, 2015)

This week:  Microsoft closes on the acquisition of Revolution Analytics, plus lots of cloud news driven by the AWS Summit in San Francisco.

But the top item for the week is this History of Hadoop, from Marko Bonaci.

Update:  OK, the top item is actually this piece from Dave McClure on unicorns and dinosaurs.

Amazon Web Services

If you thought Amazon would let Microsoft own the cloud-based machine learning space, think again.  Amazon introduces Amazon Machine Learning. (h/t Oliver Vagner)

Apache Drill

In Big Data Quarterly, Jim Scott offers an excellent summary of Apache Drill and its significance for the Hadoop ecosystem

Apache Mahout

The Mahout team announces Release 0.10, which includes a distributed algebraic optimizer, a Scala API and the Spark interface.  The team has optimistically re-branded these capabilities as Samsara, which suggests that we can escape from Mahout by following the Buddhist path.

Apache Spark

Advanced Analytics with Spark, the new book by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, is now available.

Writing in insideBIGDATA, MemSQL CEO Eric Frenkiel champions Spark working together with MemSQL.

Cloud

Writing in ITBusinessEdge, Arthur Cole says analytics is heading toward the cloud.  Newsflash: analytics is already in the cloud, big time.  There are organizations today that run most or all of their advanced analytics in the cloud, and the most sophisticated have done so for years.

Cloud is eating the analytics world because predictive modeling requires large-scale computing power in short bursts; organizations that scale up on-premises computing power to meet peak requirements will own a lot of unused server capacity.  Moreover, cloud enables analysts to radically reduce cycle time and build better models with massively parallel test-and-learn operations.
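To make the unused-capacity point concrete, here is a back-of-envelope sketch; every number in it is invented for illustration.

```python
# Bursty-workload arithmetic: on-prem must be sized for the peak,
# while cloud bills only for node-hours actually consumed.
hours_per_month = 730
peak_nodes = 100      # needed during model-training bursts
burst_hours = 40      # hours of peak demand per month
baseline_nodes = 5    # steady-state scoring workload

onprem_node_hours = peak_nodes * hours_per_month
cloud_node_hours = (peak_nodes * burst_hours
                    + baseline_nodes * (hours_per_month - burst_hours))

utilization = cloud_node_hours / onprem_node_hours
print(f"on-prem pays for {onprem_node_hours} node-hours, "
      f"uses {cloud_node_hours} ({utilization:.0%})")
```

Under these made-up assumptions, the on-premises cluster sits roughly 90% idle, which is the gap cloud bursting closes.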

In an InfoWorld piece headlined “Big Data Is All About the Cloud,” Matt Asay argues that Big Data is about other things, too, like streaming and dedicated task clusters.  He interviews Matt Wood of Amazon Web Services, who thinks cloud is a good thing.

Databricks

Databricks announces that it is now an Amazon Web Services Advanced Technology Partner.

On the Databricks blog, Andy Konwinski recaps Spark Summit East.

Informatica

News of the company’s plan to go private produces a slew of overwrought articles about “generational shifts” in data integration like this one from Alex Woodie in Datanami.   Venture capitalists pay for potential and Wall Street pays for growth, but private owners want recurring revenue and profit margins; hence, private ownership is the best model for firms that are well along in the hype cycle, past the “Trough of Disillusionment” and well into the “Slope of Enlightenment”.  It shouldn’t surprise anyone that SnapLogic, Alteryx, ClearStory Data, Trifacta and Paxata all have higher growth rates than Informatica; after all, 1+1 equals 100% growth.  Nevertheless, the total revenue of those companies amounts to rounding error on Informatica’s 10-K, so grave-dancing seems premature.

[Figure: Gartner hype cycle]

Microsoft

Microsoft closes on its acquisition of Revolution Analytics (previously discussed here, here and here.)   Financial terms are undisclosed, so we will just have to troll through MSFT’s next 10-Q to confirm rumors about the price.  Additional coverage here and here.  Dave Rich, CEO of Revolution Analytics, assumes the role of General Manager, Advanced Analytics for Microsoft.