How to Steal a Predictive Model

In the Proceedings of the 25th USENIX Security Symposium, Florian Tramèr et al. describe how to “steal” machine learning models via prediction APIs. This finding won’t surprise anyone in the business, but Andy Greenberg at Wired and Thomas Claburn at The Register express their amazement.

Here’s how you “steal” a model:

— The prediction API tells you what variables the model uses. The documentation for a prediction API will say something like “submit X1 and X2, and we will return a prediction for Y,” so you know that X1 and X2 are the model’s inputs. The developer can try to fool you by requiring a hundred variables when the model only needs two, but that’s unlikely; most developers make the prediction API as parsimonious as possible.

— Use an experimental design to create test records with a range of values for each variable in the model. You won’t need many records; the number depends on the number of variables in the model and the degree of granularity you want.

— Now, ping the API with each test record and collect the results.

— With the data you just collected, estimate a model that approximates the one behind the prediction API; the sketch below shows the whole loop in miniature.
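Here is a minimal sketch of the attack in Python, assuming a hypothetical query_api function in place of a real vendor endpoint (a genuine attack would make an HTTP call per test record), with invented coefficients so the example is self-contained:

```python
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression

def query_api(x1, x2):
    # Stand-in for the vendor's prediction API; the hidden
    # coefficients are invented for this example.
    return 3.0 * x1 - 2.0 * x2 + 5.0

# Experimental design: a coarse grid over plausible input ranges.
grid = list(itertools.product(np.linspace(0, 10, 5), repeat=2))
X = np.array(grid)

# Ping the API once per test record and collect the predictions.
y = np.array([query_api(x1, x2) for x1, x2 in X])

# Fit a surrogate model to the (input, prediction) pairs.
surrogate = LinearRegression().fit(X, y)
print(surrogate.coef_, surrogate.intercept_)  # ~[3.0, -2.0] and ~5.0
```

Here, 25 probes on a 5 × 5 grid recover the hidden coefficients exactly. A noisier or nonlinear model needs more probes and a more flexible surrogate, but the procedure is the same.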

The authors of the USENIX paper tested this approach with BigML and Amazon Machine Learning, succeeding in both cases. BigML objects; Amazon sleeps.

Legally, it may not be stealing. Model coefficients are intellectual property. If someone hacks into your model repository and steals the model file, or bribes one of your data scientists into providing the coefficients, that is theft. But while IP owners can assert a right over their actual code, it is much harder to assert a right to an application’s observable behavior. Reverse-engineering is legal in the U.S. and the European Union so long as the party that performs the work has legal possession of the relevant artifacts. If someone lawfully purchases predictions from your prediction API, they can reverse-engineer your model.

Restrictive licenses offer limited protection. Intellectual property owners can assert a claim against reverse-engineering if the predictions are delivered under an end-user license that prohibits the practice. The fine print will please your Legal department but is virtually impossible to enforce. Predictions, unlike some other forms of intellectual property, aren’t watermarked; they’re just numbers.

Pricing plays a role. While it may be technically feasible to reverse-engineer a predictive model, it may be prohibitively expensive to do so. Predictions from models with financial consequences, such as consumer credit risk scores, command high prices per record; if, say, a vendor charges a dollar per score and an attacker needs 100,000 probes to approximate the model, the attack costs $100,000 before it yields anything. Arguably, the best way to deter reverse-engineering is to charge a non-cancellable annual subscription fee for access to the API rather than selling predictions by the record. In any event, the risk of reverse-engineering should be a consideration in pricing.

Encryption may be necessary. If you want to do business with trusted parties over an open API, an encryption scheme can scramble the prediction so that only the intended recipient can read it; an eavesdropper who harvests the API’s responses gets nothing useful for reverse-engineering. Of course, the customer must be able to decrypt the prediction at their end of the transaction, with a key exchanged through a separate channel or derived from a shared secret.
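As one concrete illustration, here is how a prediction could be protected in transit using the Fernet recipe from Python’s cryptography package; the prediction value is a placeholder, and in practice the key would be provisioned during customer onboarding:

```python
from cryptography.fernet import Fernet

# The key must reach the customer out of band (e.g., at onboarding);
# anyone without it sees only opaque ciphertext.
key = Fernet.generate_key()

# Server side: encrypt the prediction before returning it.
prediction = b"0.8731"
token = Fernet(key).encrypt(prediction)

# Customer side: decrypt with the shared key.
assert Fernet(key).decrypt(token) == prediction
```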

Access control is essential. The USENIX authors’ key point is that if your prediction API is available “in the wild,” you might as well call it an open source model, because reverse-engineering is easy to do. Of course, if you are in the business of selling predictions, you already have some form of access control so you can meter usage and bill an account. But bad actors have credit cards too; if you are concerned about your predictive model’s IP, you will have to establish tighter control over access to the prediction API.
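What tighter control looks like will vary by business, but a minimal sketch of one measure, a per-key daily quota that makes systematic probing expensive, might run as follows (the quota and names are illustrative):

```python
import time
from collections import defaultdict

DAILY_QUOTA = 1000          # illustrative per-key limit
_calls = defaultdict(list)  # api_key -> timestamps of recent calls

def allow_request(api_key: str) -> bool:
    """Reject queries once a key exceeds its daily quota.

    Heavy, systematic querying is the signature of model
    extraction, so a tight per-key quota raises its cost."""
    now = time.time()
    _calls[api_key] = [t for t in _calls[api_key] if now - t < 86400]
    if len(_calls[api_key]) >= DAILY_QUOTA:
        return False
    _calls[api_key].append(now)
    return True
```

A production system would persist the counters and pair the quota with identity verification, but even this much raises the cost of the systematic probing the paper describes.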

Disruptive Analytics

This is an introduction to my book, Disruptive Analytics, available now from Amazon and Apress.


Disruption: in business, a radical change in an industry or business strategy, especially involving the introduction of a new product or service that creates a new market.

From its birth in 1979, Teradata led the field in data warehousing. The company built a reputation for technical acumen, serving customers like Wal-Mart and Citibank; analysts and implementers alike rated the company’s massively parallel databases “best in class.” After a 2007 spinoff from NCR, the company grew by double digits.

On August 6, 2012, Teradata released its earnings report for the second quarter. Results excelled; revenue was up 18% and earnings per share (EPS) up 28%. Teradata stock traded at $80, five times its value four years earlier.

“We are increasing our guidance for constant currency revenue growth and EPS for 2012,” wrote CEO Mike Koehler.

In retrospect, that moment was Teradata’s peak. Over the next three and a half years, the company lost 75% of its market value, as it repeatedly missed revenue and earnings targets. In 2015, Koehler announced a restructuring and sale of assets; several top executives departed. Finally, after a brutal first quarter earnings report, Koehler himself stepped down in May 2016.

Management blamed many factors for the disappointing sales: long sales cycles, a sluggish economy, and unfavorable currency movement. But worldwide spending on business analytics increased during this period, and some vendors reported double-digit revenue growth.

One can blame Teradata’s struggles on poor leadership, but the truth isn’t that simple. The company’s growth problems in the last few years are not unique: in the same period, Oracle and IBM suffered declining revenue; Microsoft and SAP failed to grow consistently, disappointing investors; and SAS had to walk back embarrassing projections of double-digit growth, recording low single-digit gains.

In short, while businesses continue to invest in analytics, they aren’t buying what the industry leaders are selling.

Meanwhile, a steady stream of innovation creates new value networks in the business analytics marketplace:

Open Source Analytics. With substantial gains in the last several years, open source software makes deep inroads in the analytics community. Surveys show that working data scientists prefer open source R and Python to any brand of commercial software. Technology leaders like Oracle, IBM, and Microsoft rush to get on the bandwagon.

Hadoop and its Ecosystem. As Hadoop matures, it competes successfully with data warehouse appliances, even displacing them. Research firm Gartner estimates that 42% of all enterprises now use Hadoop. A few years ago, data warehousing vendors laughed at Hadoop; they aren’t laughing today.

In-Memory Analytics. As the cost of memory declines, high-performance in-memory analytics come within reach of any organization. Adoption of open source Apache Spark increases exponentially. With more than a thousand contributors, Spark is the most active open source project in Big Data.

Streaming Analytics. Organizations face a growing volume of data in motion, driven in part by the Internet of Things (IoT). Today, there are no fewer than six open source projects for streaming analytics in the Apache ecosystem. In-memory databases position themselves as streaming engines for hybrid transactional/analytical processing (HTAP).

Analytics in the Cloud. When Amazon Web Services introduced its Redshift columnar database in 2012, it lacked many of the features available in competing data warehouses. For many businesses, however, Amazon offered a compelling value proposition: “good enough” functionality, at a fraction of the cost of a Teradata warehouse. The leading cloud services all report double-digit revenue growth; Gartner estimates that 44% of all businesses use the cloud.

Deep Learning. Cheap high-performance computing power makes deep learning practical. Nvidia releases its DGX-1 deep learning system, with the power of 250 servers; Cray announces its Urika-GX appliance with up to 1,728 cores and 35 terabytes of solid-state memory. Meanwhile, Google releases its TensorFlow framework to open source and declares that it uses deep learning in “hundreds” of applications.

Self-Service Analytics. With an easy-to-learn user interface and robust connectors to data sources, Tableau disrupts the business intelligence software industry and grows its revenues tenfold.

We do not hype Big Data in this book; petabytes of data are worthless unless they answer a business question. However, the tsunami of data produced by the digital economy is a fact of life that managers and analysts must address. Whether you manage a multinational or drive a truck, your business generates more data than ever; you will either use it or discard it, but one way or the other, you must decide what to do with it.

In a disrupted business analytics market, managers must focus ruthlessly on needs for insight, then build systems and processes that satisfy those needs. Understanding the innovations described in these chapters is a step towards that end, but the focus must remain on the demand for insight and the value chain that delivers it.

Innovations do not spring fully formed from the mind of an inventor; they are the result of a long process of tinkering. Many of the most significant innovations we describe in this book are more than fifty years old; they emerge today for various reasons, such as the long-run decline of computing costs. We present a historical perspective at several points in this book so the reader can distinguish between that which is new and that which is merely repackaged and rebranded.

In the middle chapters of this book, we present a survey of key innovations in business analytics. These chapters include detailed information about available software products and open source projects. In general, we do not cover offerings from industry leaders, under the premise that these companies have ample marketing budgets to build awareness of their products.

We close the book with a handbook for managers: specific strategies to profit from disruptive innovation. Some of these strategies may seem radical; if this disturbs you, put this book down – it’s not for you. But if you are ready to embrace disruptive innovation, and profit by it, read on.