Go to a trade show for predictive analytics and listen to the presentations; most will focus on building more accurate predictive models. Presenters will differ on how this should be done: some will tell you to purchase their brand of software, others will encourage you to adopt one method or another. But most will agree that accuracy isn't everything; it's the only thing.
I’m not going to argue in this post that accuracy isn’t a good thing (all other things equal), but consider the following scenario: you have a business problem that can be mitigated with a predictive model. You ask three consultants to submit proposals, and here’s what you get:
- Consultant A proposes to spend one day and promises to produce a model that is more accurate than a coin flip
- Consultant B proposes to spend one week, and promises to produce a model that is more accurate than Consultant A’s model
- Consultant C proposes to spend one year, and promises to produce the most accurate model of all
Which one will you choose?
This is an extreme example, of course, but my point is that one rarely hears analysts talk about the time and effort needed to achieve a given level of accuracy, or the time and effort needed to implement a predictive model in production. But in real enterprises, these are essential trade-offs that must be factored into the analytics process. As we evaluate these three proposals, consider the following points:
(1) We can’t know how accurate a prediction will be; we can only know how accurate it was.
We judge the accuracy of a prediction after the event of interest has occurred. In practice, we evaluate the accuracy of a predictive model by examining how accurate it is with historical data. This is a pretty good method, since past patterns often persist into the future. The key word is “often”, as in “not always”; the world isn’t in a steady state, and black swans happen. This does not mean we should abandon predictive modeling, but it does mean we should treat very small differences in model back-testing with skepticism.
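To make the point concrete, here is a minimal back-testing sketch in Python, assuming scikit-learn; the data, features, and model choices are all stand-ins rather than anything from a real engagement. It simply holds out the most recent slice of "historical" data and compares two models on it.

```python
# A minimal back-testing sketch (hypothetical data and column meanings),
# assuming scikit-learn is available. The point: two models that differ
# by a fraction of a percent on historical data are, for practical
# purposes, tied -- the world may shift before either goes to production.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                          # stand-in for historical features
y = (X[:, 0] + rng.normal(size=5000) > 0).astype(int)    # stand-in outcome

# Back-test: train on the older 80% of the history, evaluate on the most recent 20%
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

simple = LogisticRegression().fit(X_train, y_train)
fancy = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("simple model:", accuracy_score(y_test, simple.predict(X_test)))
print("fancy model :", accuracy_score(y_test, fancy.predict(X_test)))
# A small gap here says little about performance after the world changes;
# treat small differences in back-testing with skepticism.
```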
(2) Overall model accuracy is irrelevant.
We live in an asymmetrical world, and errors in prediction are not all alike. Let's suppose that your doctor thinks that you may have a disease that is deadly but treatable with an experimental therapy that is painful and expensive. The doctor gives you the option of two different tests, and tells you that Test X has an overall accuracy rate of 60%, while Test Y has an overall accuracy of 40%.
Hey, you think, that’s a no-brainer; give me Test X.
What the doctor did not tell you is that all of the errors for Test X are false negatives: the test says you don’t have the disease when you actually do. Test Y, on the other hand, produces a lot of false positives, but also correctly predicts all of the actual disease cases.
If you chose Test X, congratulations! You’re dead.
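For readers who like to see the arithmetic, here is a quick worked version of that example. The 40% disease prevalence is an assumption chosen purely so the two stated accuracy figures are consistent; the post itself gives only the accuracies.

```python
# Worked example of why overall accuracy misleads when errors are asymmetric.
# Assumes (for illustration only) a population of 1,000 with 40% prevalence.
population = 1000
sick, healthy = 400, 600

# Test X: every error is a false negative -- it never flags the disease.
x_true_neg = healthy
x_accuracy = x_true_neg / population          # 0.60
x_sensitivity = 0 / sick                      # 0.00 -- misses every actual case

# Test Y: many false positives, but it catches every actual case.
y_true_pos = sick
y_accuracy = y_true_pos / population          # 0.40
y_sensitivity = y_true_pos / sick             # 1.00 -- misses no one

print(f"Test X: accuracy {x_accuracy:.0%}, sensitivity {x_sensitivity:.0%}")
print(f"Test Y: accuracy {y_accuracy:.0%}, sensitivity {y_sensitivity:.0%}")
```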
(3) We can’t know the value of model accuracy without understanding the differential cost of errors.
In the previous example, the differential cost of errors is starkly exaggerated: on the one hand, we risk death; on the other, we undergo painful and expensive treatment. In commercial analytics, the economics tend to be more subtle: in Marketing, for example, we send a promotional message to a customer who does not respond (a false positive), or decline a credit line for a prospective customer who would have paid on time (a false negative). The actual measure of a model, however, isn't its statistical accuracy, but its economic accuracy: the overall impact of the predictive model on the business process it is designed to support.
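Here is a rough sketch of what scoring a model on economic accuracy might look like for the marketing case; the dollar figures and confusion-matrix counts are invented for illustration only.

```python
# A minimal sketch of "economic accuracy": scoring a model by the business
# value of its decisions rather than its hit rate. All dollar values and
# counts below are hypothetical.
def campaign_value(tp, fp, fn, tn,
                   value_per_responder=50.0,   # assumed margin from a responder we mailed
                   cost_per_mailing=2.0):      # assumed cost of each promotional message
    mailed = tp + fp
    return tp * value_per_responder - mailed * cost_per_mailing

# Model A: lower raw accuracy, but its errors are cheap (wasted mailings)
model_a = dict(tp=300, fp=700, fn=100, tn=900)
# Model B: higher raw accuracy, but it skips many would-be responders
model_b = dict(tp=200, fp=100, fn=200, tn=1500)

for name, m in [("A", model_a), ("B", model_b)]:
    accuracy = (m["tp"] + m["tn"]) / sum(m.values())
    print(f"Model {name}: accuracy {accuracy:.0%}, "
          f"campaign value ${campaign_value(**m):,.0f}")
# With these (made-up) economics, the less accurate model is worth more
# to the business -- which is the whole point.
```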
Taking these points into consideration, Consultant A’s quick and dirty approach looks a lot better, for three reasons:
- Better results in back-testing for Consultants B and C may or may not be sustainable in production
- Consultant A’s model benefits the business sooner than the other models
- Absent data on the cost of errors, it’s impossible to say whether Consultants B and C add more value
A fourth point goes to the heart of what Agile Analytics is all about. While Consultants B and C continue to tinker in the sandbox, Consultant A has in-market results to use in building a better predictive model.
The bottom line is this: the first step in any predictive modeling effort must always focus on understanding the economics of the business process to be supported. Unless and until the analyst knows the business value of a correct prediction — and the cost of incorrect predictions — it is impossible to say which predictive model is “best”.