In the Proceedings of the 25th USENIX Security Symposium, Florian Tramèr et al. describe how to “steal” machine learning models via prediction APIs. This finding won’t surprise anyone in the business, but Andy Greenberg at Wired and Thomas Claburn at The Register express their amazement.
Here’s how you “steal” a model:
— The prediction API tells you what variables the model uses. The packaging for a prediction API will say something like “submit X1 and X2, we will return a prediction for Y,” so you know that X1 and X2 are the variables in the model. The developer can try to fool you by directing you to submit a hundred variables even though the model needs only two, but that’s not likely; most developers make the prediction API as parsimonious as possible.
— Use an experimental design to create test records with a range of values for each variable in the model. You won’t need many records; the number depends on the number of variables in the model and the degree of granularity you want.
— Now, ping the API with each test record and collect the results.
— With the data you just collected, you can estimate a model that approximates the model behind the prediction API.
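The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the USENIX authors' code: it assumes the hidden model is a two-variable logistic regression and that the API returns an unrounded probability. The names `prediction_api`, `_SECRET`, and `extract_model` are made up for the example, and the "API" is a local stand-in function rather than a real service.

```python
import itertools
import math

# Hypothetical "secret" model standing in for the remote prediction API.
# In a real setting the caller never sees these coefficients.
_SECRET = {"intercept": -1.5, "x1": 0.8, "x2": -0.3}


def prediction_api(x1, x2):
    """Stand-in for the remote API: returns P(Y=1) for one record."""
    z = _SECRET["intercept"] + _SECRET["x1"] * x1 + _SECRET["x2"] * x2
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Convert a probability back to log-odds."""
    return math.log(p / (1.0 - p))


def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def extract_model(api, levels=(-2.0, 0.0, 2.0)):
    """Estimate [intercept, b1, b2] from API responses alone."""
    # Experimental design: every combination of the chosen levels.
    design = list(itertools.product(levels, repeat=2))
    # Ping the API with each test record and keep the log-odds.
    rows = [(1.0, x1, x2) for x1, x2 in design]
    targets = [logit(api(x1, x2)) for x1, x2 in design]
    # Least squares on the log-odds via the normal equations (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear_system(xtx, xty)


recovered = extract_model(prediction_api)
print(recovered)
```

Nine queries are enough here because the log-odds of a logistic regression are linear in the inputs. A model of unknown form, or an API that rounds its output, would need more records and a more flexible approximating model.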
The authors of the USENIX paper tested this approach with BigML and Amazon Machine Learning, succeeding in both cases. BigML objects; Amazon sleeps.
Legally, it may not be stealing. Model coefficients are intellectual property. If someone hacks into your model repository and steals the model file, or bribes one of your data scientists into providing the coefficients, that is theft. But while IP owners can assert a right over their actual code, it is much harder to assert a right to an application’s observable behavior. Reverse-engineering is generally legal in the U.S. and the European Union so long as the party that performs the work has legal possession of the relevant artifacts. If someone lawfully purchases predictions from your prediction API, they can reverse-engineer your model.
Restrictive licenses offer limited protection. Intellectual property owners can assert a claim against reverse-engineering if the predictions are under an end-user license that prohibits the practice. The fine print will please your Legal department, but is virtually impossible to enforce. Predictions, unlike other forms of intellectual property, aren’t watermarked; they’re just numbers.
Pricing plays a role. While it may be technically feasible to reverse-engineer a predictive model, it may be prohibitively expensive to do so. Predictions from models of behavior with financial implications, such as consumer credit risk models, are expensive to buy. Arguably, the best way to prevent reverse-engineering is to charge a non-cancellable annual subscription fee for access to the API rather than selling predictions by the record. In any event, the risk of reverse-engineering should be a consideration in pricing.
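A back-of-the-envelope calculation shows how the pricing scheme changes the attacker's bill. The sketch below assumes a full-factorial design (every combination of levels for every variable); the prices and the fee are invented for illustration and do not come from any vendor's price list.

```python
def full_factorial_queries(n_variables, levels_per_variable):
    """Number of test records in a full-factorial experimental design."""
    return levels_per_variable ** n_variables


def extraction_cost(n_queries, price_per_prediction, annual_fee=0.0):
    """Total spend needed to collect n_queries predictions."""
    return annual_fee + n_queries * price_per_prediction


# Hypothetical model: 5 input variables, 4 test levels each.
queries = full_factorial_queries(n_variables=5, levels_per_variable=4)

# Hypothetical per-record pricing versus a flat annual subscription.
per_record_cost = extraction_cost(queries, price_per_prediction=0.50)
subscription_cost = extraction_cost(queries, price_per_prediction=0.0,
                                    annual_fee=10_000.0)

print(f"queries needed:     {queries}")
print(f"per-record pricing: ${per_record_cost:,.2f}")
print(f"annual subscription: ${subscription_cost:,.2f}")
```

Under per-record pricing the attacker pays only for the handful of records the design requires; under a subscription the minimum spend is the full annual fee, however few queries are made.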
Encryption may be necessary. If you want to do business with trusted parties over an open API, an encryption algorithm can scramble the prediction so that anyone without the key sees only noise and cannot use the responses to reverse-engineer the model. (A hash will not do: hashing is one-way, so the customer could never recover the prediction.) Of course, the customer must be able to decrypt the prediction at their end of the transaction, with a key transmitted separately or derived from a common random seed.
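The scramble-and-unscramble exchange can be sketched with nothing but Python's standard library. This is an illustration of the idea only, assuming both parties already hold the same secret key: it derives a one-off keystream from the key and a random nonce using HMAC-SHA256, and adds an authentication tag. The function names are made up for the example, and a production service would use a vetted authenticated-encryption library rather than a hand-rolled scheme like this one.

```python
import hashlib
import hmac
import secrets
import struct

NONCE_LEN = 16   # random bytes sent with every response
FLOAT_LEN = 8    # a prediction packed as a big-endian double


def _keystream(key, nonce, length):
    """Derive a per-message keystream from the shared key and the nonce."""
    return hmac.new(key, b"enc" + nonce, hashlib.sha256).digest()[:length]


def _tag(key, nonce, cipher):
    """Authentication tag so tampering or a wrong key is detected."""
    return hmac.new(key, b"mac" + nonce + cipher, hashlib.sha256).digest()


def encrypt_prediction(key, prediction):
    """Server side: scramble one numeric prediction."""
    nonce = secrets.token_bytes(NONCE_LEN)
    plain = struct.pack(">d", prediction)
    stream = _keystream(key, nonce, len(plain))
    cipher = bytes(p ^ k for p, k in zip(plain, stream))
    return nonce + cipher + _tag(key, nonce, cipher)


def decrypt_prediction(key, message):
    """Customer side: verify the tag, then recover the prediction."""
    nonce = message[:NONCE_LEN]
    cipher = message[NONCE_LEN:NONCE_LEN + FLOAT_LEN]
    tag = message[NONCE_LEN + FLOAT_LEN:]
    if not hmac.compare_digest(tag, _tag(key, nonce, cipher)):
        raise ValueError("prediction failed authentication")
    stream = _keystream(key, nonce, len(cipher))
    plain = bytes(c ^ k for c, k in zip(cipher, stream))
    return struct.unpack(">d", plain)[0]


# The shared key would be exchanged out of band with the trusted customer.
shared_key = secrets.token_bytes(32)
message = encrypt_prediction(shared_key, 0.7312)
print(decrypt_prediction(shared_key, message))
```

Because a fresh nonce is drawn for every response, the same prediction never produces the same bytes twice, so an eavesdropper cannot even tell when two records score alike. The trusted customer, who holds the key, still sees the real numbers, which is why this only helps against parties outside the agreement.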
Access control is key. The central point of the USENIX authors is that if your prediction API is available “in the wild,” you might as well call it an open source model, because reverse-engineering is easy to do. Of course, if you are in the business of selling predictions, you already have some form of access control so you can meter usage and charge an account. Bad actors, however, have credit cards; so, if you are concerned about your predictive model’s IP, you’re going to have to establish tighter control over access to the prediction API.