DE Tutorial - Ethics, Data Privacy & Laws (GDPR)

Data Ethics & GDPR in AI

Pascual Vila

Data Engineering Lead // Code Syllabus

"Just because we have the data doesn't mean we should use it." Ethical data engineering goes beyond legal compliance: it builds trust between people and the machine learning models trained on their data.

1. The Core of GDPR

The General Data Protection Regulation (GDPR) fundamentally changed how data pipelines are built. As a data engineer, you must ensure Data Minimization (collecting and processing only what is necessary for the stated purpose) and Purpose Limitation (not reusing data for a purpose incompatible with the one it was collected for, e.g., training a credit-scoring model on marketing data without a new lawful basis such as explicit consent).
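Both principles can be enforced in code at the point where a pipeline reads a record. The following is a minimal sketch, assuming a hypothetical per-purpose allow-list (`ALLOWED_FIELDS`) and made-up field names; a real pipeline would load this mapping from a governed catalog rather than hard-code it.

```python
# Hypothetical allow-list: which fields each declared purpose may read.
ALLOWED_FIELDS = {
    "marketing": {"user_id", "email", "newsletter_opt_in"},
    "credit_scoring": {"user_id", "income", "outstanding_debt"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Return only the fields that the given purpose is allowed to use."""
    if purpose not in ALLOWED_FIELDS:
        # Purpose limitation: refuse processing for an undeclared purpose.
        raise ValueError(f"No declared purpose: {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    # Data minimization: drop everything the purpose does not need.
    return {key: value for key, value in record.items() if key in allowed}


# Illustrative record; all values are invented for the example.
raw = {
    "user_id": 42,
    "email": "ana@example.com",
    "newsletter_opt_in": True,
    "income": 51000,
    "browsing_history": ["/loans", "/cards"],
}

marketing_view = minimize(raw, "marketing")
print(marketing_view)
```

The marketing view never sees `income`, and `browsing_history` reaches no purpose at all because nothing declares a need for it.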

2. Handling PII in Pipelines

Personally Identifiable Information (PII) must be handled with care. Best practices include:

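Anonymization: Irreversibly removing or generalizing identifying data so that individuals can no longer be singled out (e.g., aggregating user ages into brackets: 18-24, 25-34). Generalizing a single field is rarely enough on its own; the remaining fields must not allow re-identification in combination.
Pseudonymization: Replacing direct identifiers such as names with synthetic IDs. The mapping key is kept in a separate, tightly secured database.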
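The two practices above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: `PseudonymVault` is a hypothetical in-memory stand-in for the separate mapping database, and the bracket boundaries beyond 18-24 and 25-34 are made up for the example.

```python
import uuid


class PseudonymVault:
    """In-memory stand-in for the separate, secured mapping store (assumption)."""

    def __init__(self):
        self._id_by_name = {}

    def pseudonym_for(self, name: str) -> str:
        """Return a stable synthetic ID for a name, creating one if needed."""
        if name not in self._id_by_name:
            # Random ID: it reveals nothing about the name without the vault.
            self._id_by_name[name] = uuid.uuid4().hex
        return self._id_by_name[name]


def age_bracket(age: int) -> str:
    """Generalize an exact age into a coarse bracket."""
    if age < 18:
        return "under-18"
    for low, high in [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+"


vault = PseudonymVault()
user = {"name": "Ana Garcia", "age": 29}  # invented example record

safe_row = {
    "user_token": vault.pseudonym_for(user["name"]),
    "age_bracket": age_bracket(user["age"]),
}
print(safe_row["age_bracket"])  # → 25-34
```

The downstream table stores only `safe_row`. Whoever holds the vault can still map the token back to the name, which is why this counts as pseudonymization and not anonymization.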

3. The "Right to be Forgotten" in ML

When a user requests deletion, deleting their row in a Postgres database is easy. But what about the machine learning model trained on their data? Machine Unlearning (removing a data point's influence from a trained model without full retraining) is an emerging field, but the simplest architectural pattern today is to retrain models periodically from compliant datasets from which deleted users have already been scrubbed.
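The retrain-from-scrubbed-data pattern can be sketched as a filter step that runs before every scheduled training job. The names here (`scrub`, `train_mean_model`, `deletion_log`) are hypothetical, and the "model" is a toy mean predictor standing in for a real training run.

```python
def scrub(training_rows, deleted_user_ids):
    """Drop every row that belongs to a user who requested erasure."""
    deleted = set(deleted_user_ids)
    return [row for row in training_rows if row["user_id"] not in deleted]


def train_mean_model(rows):
    """Toy 'model' that predicts the mean label; stands in for a real job."""
    if not rows:
        raise ValueError("No training rows left after scrubbing")
    labels = [row["label"] for row in rows]
    return sum(labels) / len(labels)


# Invented example data.
training_rows = [
    {"user_id": "u1", "label": 1.0},
    {"user_id": "u2", "label": 3.0},
    {"user_id": "u3", "label": 5.0},
]
deletion_log = ["u2"]  # user IDs with honored erasure requests

compliant_rows = scrub(training_rows, deletion_log)
model = train_mean_model(compliant_rows)
print(len(compliant_rows), model)  # → 2 3.0
```

Because the filter runs on every retrain, a deleted user's influence disappears from the next model version without any special unlearning step; the trade-off is that the previous model stays in service until that retrain completes.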

❓ Frequently Asked Questions (AI Compliance)

How does GDPR affect Machine Learning models?

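GDPR is often described as granting a "Right to Explanation" for automated decisions. More precisely, Article 22 restricts decisions based solely on automated processing that have legal or similarly significant effects (like loan approvals), and Articles 13-15 require "meaningful information about the logic involved." Black-box models (like deep neural networks) can be problematic in those settings if their decisions cannot be explained. Also, models must not inadvertently memorize and leak PII.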

What is the difference between Anonymization and Pseudonymization?

Anonymization is a one-way street; the data can NEVER be traced back to the user. GDPR does not apply to truly anonymized data. Pseudonymization hides the identity behind a key/token, but the user CAN be re-identified by anyone who holds the key, including an attacker who compromises it. GDPR still applies to pseudonymized data.

How do data engineers handle DSARs?

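A Data Subject Access Request (DSAR) asks for a copy of a person's data; in practice the same tooling usually serves erasure requests as well. Both require architectures where all user data is traceable. Engineers often build centralized user identity graphs so that, when a request arrives, automated scripts can locate the user's records and, for erasure, purge them across the data lake, Kafka topics, and warehouses.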
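A minimal sketch of the identity-graph idea follows. Everything in it is an assumption for illustration: `IDENTITY_GRAPH`, the `STORES` dictionary, and the location names are hypothetical in-memory stand-ins, and real systems need system-specific deletion logic (an append-only log, for instance, cannot simply be edited in place).

```python
# Hypothetical identity graph: user ID -> every location holding their records.
IDENTITY_GRAPH = {
    "u42": [
        ("warehouse", "analytics.orders"),
        ("data_lake", "s3://example-bucket/events/"),  # made-up path
        ("kafka", "clickstream"),
    ],
}

# In-memory stand-ins for the real systems, keyed by (system, location).
STORES = {
    ("warehouse", "analytics.orders"): [{"user_id": "u42"}, {"user_id": "u7"}],
    ("data_lake", "s3://example-bucket/events/"): [{"user_id": "u42"}, {"user_id": "u42"}],
    ("kafka", "clickstream"): [{"user_id": "u7"}],
}


def purge_user(user_id: str) -> dict:
    """Delete a user's records everywhere the graph points; return counts."""
    report = {}
    for location in IDENTITY_GRAPH.get(user_id, []):
        rows = STORES[location]
        kept = [row for row in rows if row["user_id"] != user_id]
        report[location] = len(rows) - len(kept)  # records removed here
        STORES[location] = kept
    return report


report = purge_user("u42")
for location, removed in report.items():
    print(location, removed)
```

The returned report doubles as an audit trail: it records how many records were removed from each location, which is useful evidence when confirming to the requester that the erasure was carried out.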

Compliance Glossary

PII

Personally Identifiable Information. Any data that could potentially identify a specific individual.

# Example: Names, SSN, IP, Email

GDPR

General Data Protection Regulation (Regulation (EU) 2016/679). A strict EU law on data protection and privacy, applicable since 25 May 2018.

Compliance = Lawful Basis (e.g., Consent) + Security + Data Subject Rights

Data Masking

The process of hiding original data with modified content (characters or other data).

-- PostgreSQL: the 'g' flag replaces every digit, not only the first match
SELECT REGEXP_REPLACE(phone, '[0-9]', 'X', 'g') AS masked_phone FROM users;

DSAR

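Data Subject Access Request. A request made by an individual (e.g., an employee or customer) for access to the personal data an organization holds about them. Deletion is a separate right (erasure, Article 17), though both are often handled through the same workflow.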

SELECT * FROM users WHERE id = 'requested_id';  -- access request
DELETE FROM users WHERE id = 'requested_id';    -- erasure request
