How GDPR Affects Data Science

Adapted from a post originally published on the Cloudera VISION Blog.

If your organization collects data about citizens of the European Union (EU), you probably already know about the General Data Protection Regulation (GDPR). GDPR defines and strengthens data protection for consumers and harmonizes data security rules within the EU. The European Parliament approved the measure on April 27, 2016. It goes into effect in less than a year, on May 25, 2018.

Much of the commentary about GDPR focuses on how the new rules affect the collection and management of personally identifiable information (PII) about consumers. However, GDPR will also change how organizations practice data science. That is the subject of this blog post.

One caveat before we begin. GDPR is complicated. In some areas, GDPR defines high-level outcomes, but delegates detailed compliance rules to a new entity, the European Data Protection Board. GDPR regulations intersect with many national laws and regulations; organizations that conduct business in the United Kingdom must also assess the unknown impacts of Brexit. Organizations subject to GDPR should engage expert management and legal counsel to assist in developing a compliance plan.  

GDPR and Data Science

GDPR affects data science practice in three areas. First, GDPR imposes limits on data processing and consumer profiling. Second, for organizations that use automated decision-making, GDPR creates a “right to an explanation” for consumers. Third, GDPR holds firms accountable for bias and discrimination in automated decisions.  

Data processing and profiling. GDPR imposes controls on data processing and consumer profiling; these rules supplement the requirements for data collection and management. GDPR defines profiling as:

Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular, to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements.

In general, organizations may process personal data when they can demonstrate a legitimate business purpose (such as a customer or employment relationship) that does not conflict with the consumer’s rights and freedoms. Organizations must inform consumers about profiling and its consequences, and provide them with the opportunity to opt out.

The Right to an Explanation. GDPR grants consumers the right “not to be subject to a decision…which is based solely on automated processing and which provides legal effects (on the subject).”  Experts characterize this rule as a “right to an explanation.”  GDPR does not precisely define the scope of decisions covered by this section. The United Kingdom’s Information Commissioner’s Office (ICO) says that the right is “very likely” to apply to credit applications, recruitment, and insurance decisions. Other agencies, law courts or the European Data Protection Board may define the scope differently.

Bias and Discrimination. When organizations use automated decision-making, they must prevent discriminatory effects based on racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect. Moreover, they may not use specific categories of personal data in automated decisions except under defined circumstances.

How GDPR Affects Data Science Practice

How will the new rules affect the way data science teams do their work? Let’s examine the impact in three key areas.

Data Processing and Profiling. The new rules allow organizations to process personal data for specific business purposes, fulfill contractual commitments, and comply with national laws. A credit card issuer may process personal data to determine a cardholder’s available credit; a bank may screen transactions for money laundering as directed by regulators. Consumers may not opt out of processing and profiling performed under these “safe harbors.”

However, organizations may not use personal data for a purpose other than the original intent without securing additional permission from the consumer. This requirement could limit the amount of data available for exploratory data science.

GDPR’s constraints on data processing and profiling apply only to data that identifies an individual consumer.

The principles of data protection should therefore not apply to … personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

The clear implication is that organizations subject to GDPR must build robust anonymization into data engineering and data science processes.

Explainable Decisions. There is some controversy about the impact of this provision. Some cheer it; others disapprove; still others deny that GDPR creates such a right. One expert in EU law argues that the requirement may force data scientists to stop using opaque techniques (such as deep learning), which can be hard to explain and interpret.

There is no question that GDPR will affect how organizations handle certain decisions. The impact on data scientists, however, may be exaggerated:

— The “right to an explanation” is limited in scope. As noted above, one regulator interprets the law to cover credit applications, recruitment, and insurance decisions. Other regulators or law courts may interpret the rules differently, but it’s clear that the right applies in specific settings. It does not apply to every automated decision.

— In many jurisdictions, a “right to an explanation” already exists and has existed for years. For example, regulations governing credit decisions in the United Kingdom are similar to those in the United States, where issuers must provide an explanation for adverse credit decisions based on credit bureau information. GDPR expands the scope of these rules, but tools for compliance are commercially available today.

— Most businesses that decline some customer requests understand that adverse decisions should be explained to customers. This is already common practice in the lending and insurance industries. Smart businesses treat adverse decisions as an opportunity to position an alternate product.

— The need to deliver an explanation affects decision engines but need not influence the choice of methods for model training. Techniques available today make it possible to “reverse-engineer” interpretable explanations for model scores even if the data scientist uses an opaque method to train the model.

Nevertheless, there are good reasons for data scientists to consider using interpretable techniques. Financial services giant Capital One considers them to be a potent weapon against hidden bias (discussed below.) But one should not conclude that GDPR will force data scientists to limit the techniques they use to train predictive models.

Bias and Discrimination. GDPR requires that organizations must avoid discriminatory effects in automated decisions. This rule places an extra burden of due diligence on data scientists who build predictive models, and on the procedures organizations use to approve predictive models for production.

Organizations that use automated decision-making must:

  • Ensure fair and transparent processing
  • Use appropriate mathematical and statistical procedures
  • Establish measures to ensure the accuracy of subject data employed in decisions

GDPR expressly prohibits the use of personal characteristics (such as age, race, ethnicity, and other enumerated classes) in automated decisions. However, it is not sufficient to just avoid using this data. The mandate against discriminatory outcomes means data scientists must also take steps to prevent indirect bias from proxy variables, multicollinearity or other causes. For example, an automated decision that uses a seemingly neutral characteristic, such as a consumer’s residential neighborhood, may inadvertently discriminate against ethnic minorities.

Data scientists must also take affirmative steps to confirm that the data they use when they develop predictive models is accurate; “garbage in/garbage out,” or GIGO, is not a defense. They must also consider whether biased training data on past outcomes can bias models. As a result, data scientists will need to concern themselves with data lineage, to trace the flow of data through all processing steps from source to target. GDPR will also drive greater concern for reproducibility, or the ability to accurately replicate a predictive modeling project.

Your Next Steps

If you do business in the European Union, now is the time to start planning for GDPR. There is much to be done: evaluating the data you collect, implementing compliance procedures, assessing your processing operations and so forth. If you are currently using machine learning for profiling and automated decisions, there are four things you need to do now.

Limit access to personally identifiable information (PII) about consumers.

Implement robust anonymization, so that by default analytic users cannot access PII. Define an exception process that permits access to PII in exceptional cases under proper security.  

Identify predictive models that currently use PII.

In each case, ask:

  • Is this data analytically necessary?
  • Does the PII provide unique and irreplaceable information value?
  • Does the predictive model support a permitted use case?

Inventory consumer-facing automated decisions.

  • Identify decisions that require explanations.
  • Implement procedures to handle consumer questions and concerns.

Establish a data science process that minimizes the risk of errors and bias.

  • Implement a workflow that ensures proper model development and testing.
  • Consider the possibility of bias “built in” to training data.
  • Rigorously test and validate predictive models.
  • Implement peer review for an independent assessment of every model.

Even if your organization is not subject to GDPR, consider implementing these practices anyway. It’s the right way to do business.


  • A thought in model explainability:

    Let’s take the Capital One situation of Credit Decisioning. For a given denied applicant, I would simply try to find the smallest change in his/her input profile, that can change the decision from “Decline” to “Approval”. One can extend this to exploring change in 2-3 variables together to find the minimum change (say measured by mahalanobis distance between the normalized original and modified input vectors).

    The individual variable or combination of 2-3 variables with smallest change resulting in a decision flip “explain” the “Decline” decision

    • Or, you could just compute feature importance (with any of a number of techniques) and use that to create an ordered list of reasons.

      • I believe one issue with using Feature importance is that for a given model, we have one ordered list of feature importance. However, the primary reason why one particular applicant gets denied can vary from one individual to another (the shortest distance from Approve/Decline boundary). Hence, the need to identify which variable(s) were key in driving the decision.

      • Feature importance tells you what features to examine to determine the decline reasons.

      • Agree with that

  • If using Oracle’s machine learning SQL functions, Prediction_Details are designed to help address the need for some level of explanatory rules.. See
    PREDICTION_DETAILS returns prediction details for each row in the selection. The return value is an XML string that describes the attributes of the prediction.

    For regression, the returned details refer to the predicted target value. For classification and anomaly detection, the returned details refer to the highest probability class or the specified class_value.

  • This article has certainly increased my understanding about the General Data Protection Regulation affects on Data Science. I think is the best analysis about the effect of new European data safety regulations on data science.

  • Thanks for the post it was an interesting read. There should be more discussion about the effects of GDPR to data science. How do you come up with this though: “GDPR expressly prohibits the use of personal characteristics (such as age, race, ethnicity, and other enumerated classes) in automated decisions.”

    Automated decisions (including profiling for one) can be done e.g. if it is necessary for entering into a contract or with explicit consent. Even special categories of data (which age is not) can be used if there is an explicit consent and other measures to safeguard data subject’s rights and freedoms.

    • I came up with it by reading the text of the regulations.

      GDPR does not treat automated decisions and profiling in the same way, so one should not conflate them, as you just did. It allows exceptions for profiling that do not apply to automated decisions, and vice-versa.

      Special categories of data defined in Article 9 include such things as health, genetic, and biometric data; it’s hard to argue that a person’s age doesn’t belong in that category.

  • Pingback: Data Science and Machine Learning Predictions | ML/DL

Leave a Reply to Charlie Berger (@CharlieDataMine) Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.