Data Ethics & GDPR in AI

Pascual Vila
Data Engineering Lead // Code Syllabus
"Just because we have the data, doesn't mean we should use it." Ethical data engineering goes beyond legal compliance—it establishes trust between humans and machine learning models.
1. The Core of GDPR
The General Data Protection Regulation (GDPR) fundamentally changed how data pipelines are built. As a Data Engineer, you must ensure Data Minimization (collecting and processing only the data that is strictly necessary for the stated purpose) and Purpose Limitation (not reusing data for an incompatible purpose, such as training a credit-scoring model on marketing data, without a new legal basis such as explicit consent).
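One way to make both principles concrete in a pipeline is a per-purpose allow-list of fields. The following is a minimal sketch, not a compliance mechanism: the purpose names, field names, and the `minimize` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical allow-list: the fields each declared purpose may read.
ALLOWED_FIELDS = {
    "marketing_email": {"user_id", "email", "newsletter_opt_in"},
    "credit_scoring": {"user_id", "income", "outstanding_debt"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Return only the fields permitted for the given purpose."""
    if purpose not in ALLOWED_FIELDS:
        # Purpose limitation: refuse to process for an undeclared purpose.
        raise ValueError(f"No declared purpose: {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    # Data minimization: drop every field that is not on the allow-list.
    return {key: value for key, value in record.items() if key in allowed}


# Illustrative raw record (made-up values).
raw = {
    "user_id": 42,
    "email": "ana@example.com",
    "newsletter_opt_in": True,
    "income": 52000,
    "outstanding_debt": 1200,
    "browsing_history": ["/shoes", "/loans"],
}

print(minimize(raw, "marketing_email"))
# {'user_id': 42, 'email': 'ana@example.com', 'newsletter_opt_in': True}
```

Keeping the allow-list as data rather than scattering column selections through the code makes it easier to review which purpose touches which field.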
2. Handling PII in Pipelines
Personally Identifiable Information (PII) must be handled with care. Best practices include:
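- Anonymization: Irreversibly removing or generalizing identifying data so that individuals can no longer be singled out (e.g., aggregating user ages into brackets: 18-24, 25-34). Generalizing one field is rarely enough on its own; the remaining fields must not allow re-identification in combination.
- Pseudonymization: Replacing direct identifiers such as names with synthetic IDs. The mapping between IDs and identities is kept in a separate, tightly secured database.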
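The two techniques above can be sketched as follows. This is a minimal illustration under stated assumptions: the bracket boundaries beyond 18-24 and 25-34, the `age_bracket` helper, and the `Pseudonymizer` class are hypothetical, and the in-memory dictionary only stands in for the separately secured mapping database.

```python
import uuid


def age_bracket(age: int) -> str:
    """Generalize an exact age into a coarse bracket."""
    if age < 18:
        return "under-18"
    # Brackets past 25-34 are assumed for the sake of the example.
    for low, high in [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+"


class Pseudonymizer:
    """Replaces a direct identifier with a random synthetic ID."""

    def __init__(self) -> None:
        # Stand-in for the separate, secured mapping store.
        self._vault: dict[str, str] = {}

    def token_for(self, identifier: str) -> str:
        # Reuse an existing token so one person always maps to one ID.
        if identifier not in self._vault:
            self._vault[identifier] = uuid.uuid4().hex
        return self._vault[identifier]

    def forget(self, identifier: str) -> None:
        # Dropping the mapping removes the link back to the person.
        self._vault.pop(identifier, None)


pseudonymizer = Pseudonymizer()
row = {"name": "Ana Ruiz", "age": 29}  # made-up record
safe_row = {
    "subject_id": pseudonymizer.token_for(row["name"]),
    "age_bracket": age_bracket(row["age"]),
}
print(safe_row["age_bracket"])  # 25-34
```

Random tokens are used here instead of a plain hash of the name, since an unsalted hash of a guessable value can often be reversed by brute force.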
3. The "Right to be Forgotten" in ML
When a user requests deletion, deleting their row in a Postgres database is easy. But what about the Machine Learning model trained on their data? Machine Unlearning is an emerging field, but the simplest current architectural pattern is to retrain models periodically from compliant datasets from which the deleted users' records have already been removed.
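The periodic-retraining pattern can be sketched as a scrub step that always runs before training. This is a toy illustration, not a real training pipeline: `scrub`, `train`, the `spend` feature, and the "model" (a simple mean) are hypothetical placeholders.

```python
def scrub(rows: list[dict], erased_ids: set[int]) -> list[dict]:
    """Drop every training row that belongs to an erased user."""
    return [row for row in rows if row["user_id"] not in erased_ids]


def train(rows: list[dict]) -> dict:
    """Placeholder for model fitting: here, just the mean of one feature."""
    values = [row["spend"] for row in rows]
    return {"mean_spend": sum(values) / len(values), "n_rows": len(values)}


# Made-up training data and erasure list.
training_rows = [
    {"user_id": 1, "spend": 10.0},
    {"user_id": 2, "spend": 20.0},
    {"user_id": 3, "spend": 60.0},
]
erased_user_ids = {3}  # users who have requested deletion

# Each scheduled retraining run starts from the scrubbed dataset,
# so the new model never sees the erased user's rows.
model = train(scrub(training_rows, erased_user_ids))
print(model)  # {'mean_spend': 15.0, 'n_rows': 2}
```

The previous model, trained before the deletion, still reflects the erased data until it is replaced, so the retraining interval determines how long that influence persists.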
❓ Frequently Asked Questions (AI Compliance)
How does GDPR affect Machine Learning models?
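GDPR restricts decisions based solely on automated processing that have legal or similarly significant effects on a person (Article 22), and it requires that data subjects receive meaningful information about the logic involved. This is often summarized as a "Right to Explanation", although how far that right extends is debated. Black-box models (like deep neural networks) can be problematic if they make decisions affecting users (like loan approvals) without explainability. Also, models must not inadvertently memorize and leak PII.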
What is the difference between Anonymization and Pseudonymization?
Anonymization is a one-way street: the data can NEVER be traced back to the user, and GDPR does not apply to truly anonymized data. Pseudonymization hides the identity behind a key/token, so the user CAN be re-identified by anyone who holds the key, including an attacker if the key is compromised. GDPR therefore still applies to pseudonymized data.
How do data engineers handle DSARs?
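Data Subject Access Requests (DSARs) let users obtain a copy of the personal data held about them; the term is often used loosely to cover related requests such as erasure. Both require architectures where all user data is traceable. Engineers often build a centralized user identity graph so that when an access or deletion request arrives, automated scripts can locate, export, or purge the user's records across the Data Lake, Kafka logs, and Warehouses.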
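A deletion fan-out driven by an identity graph might look like the sketch below. Everything here is a hypothetical stand-in: the `identity_graph` layout, the system names, the local IDs, and the in-memory dictionaries that play the role of real stores.

```python
# Hypothetical identity graph: canonical user ID -> the ID used in each system.
identity_graph = {
    "user-42": {"warehouse": "cust_0042", "data_lake": "u42", "event_log": "42"},
}

# In-memory stand-ins for the real storage systems (made-up contents).
stores = {
    "warehouse": {"cust_0042": {"orders": 3}, "cust_0007": {"orders": 1}},
    "data_lake": {"u42": {"clicks": 120}},
    "event_log": {"42": ["login", "purchase"]},
}


def purge_user(user_id: str) -> dict[str, bool]:
    """Delete the user's records in every known system.

    Returns a per-system report: True if a record was found and removed.
    """
    report = {}
    for system, local_id in identity_graph.get(user_id, {}).items():
        removed = stores[system].pop(local_id, None)
        report[system] = removed is not None
    # Remove the graph entry last, once every system has been visited.
    identity_graph.pop(user_id, None)
    return report


report = purge_user("user-42")
print(report)  # {'warehouse': True, 'data_lake': True, 'event_log': True}
```

The returned report gives an audit trail for the request. For append-only systems such as Kafka, a real purge usually depends on retention or compaction settings rather than a direct delete, which this sketch does not model.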