Data Splitting in Artificial Intelligence

Learn the essential methodology of Train/Validation/Test splitting. Understand reproducibility, stratification, and the golden rule of unbiased evaluation.

A model that performs perfectly on its training data is often useless. Real intelligence is the ability to handle data it has never seen before.

1. The Three Vaults

How do you actually know if a machine learning model is intelligent? If you test a student using the exact same questions they studied, you're testing memory, not intelligence.

To solve this, we divide our dataset into three separate vaults: Training, Validation, and Testing. The model is forbidden from seeing the Test set during its learning phase. This mimics the real-world scenario where a model encounters completely new data.

from sklearn.model_selection import train_test_split

# The Three Vaults:
# 1. Training (70-80%): The Textbook
# 2. Validation (10-15%): The Practice Exams
# 3. Testing (10-15%): The Final Exam
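Because train_test_split only cuts data in two, a three-way split is usually done with two calls. The sketch below is one way to do it, assuming a hypothetical toy dataset generated with NumPy and a 70/15/15 split; the data, sizes, and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 1: keep 70% for training, hold out 30% as a temporary pool.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 2: cut the pool in half, giving roughly 15% validation and 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print("train:", len(X_train), "val:", len(X_val), "test:", len(X_test))
```

Every sample lands in exactly one vault, so nothing the model trains on can leak into the final exam.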

2. The Training Set

The Training set is the largest portion of your data, typically 70 to 80 percent. This is the 'textbook' the model uses to learn patterns, adjust its internal mathematical weights, and understand the relationship between inputs and outputs.

If you evaluate your model's final performance on this same data, you cannot detect 'Overfitting', where the model simply memorizes the training answers instead of learning patterns that generalize to new data.

# Splitting data to get the Training Set
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# X_train is now ready for model.fit()
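To see why training accuracy alone is misleading, here is a minimal sketch. It assumes a synthetic dataset from make_classification with deliberately noisy labels and an unconstrained DecisionTreeClassifier; both are stand-ins chosen for illustration, not part of the lesson's own project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: flip_y=0.2 assigns random labels to about 20% of samples.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# With no depth limit, the tree can keep splitting until it fits every training row.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)    # scored on data the model has seen
heldout_acc = model.score(X_temp, y_temp)    # scored on data the model has not seen

print(f"train accuracy:    {train_acc:.3f}")
print(f"held-out accuracy: {heldout_acc:.3f}")
```

The gap between the two numbers is the signal that the model memorized noise rather than learning a pattern that generalizes.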

3. The Validation & Test Sets

Next is the Validation set. Imagine you are building a model and deciding whether it should have 5 layers or 10 layers. You train both models, then use the Validation set to test them. Whichever performs better on this 'practice quiz' becomes your final architecture.

Finally, the Test set. This is the vault. It is locked away during the entire development process. You run predictions on the Test set only ONCE, at the very end of your project. This gives you an unbiased estimate of how your model will perform in production. If you go back and tune the model after seeing the test score, that estimate is no longer unbiased.

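# Validation: Tune Hyperparameters
# (assumes model_A is a scikit-learn classifier already fit on the training set)
accuracy_val = model_A.score(X_val, y_val)

# Testing: The Unbiased Estimate (Run Once!)
# predict() returns labels, not accuracy; score() compares predictions to y_test
accuracy_test = model.score(X_test, y_test)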
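The full workflow, choosing between candidates on the Validation set and then opening the Test vault once, might look like the sketch below. It assumes synthetic data and two decision trees of different depth as hypothetical stand-ins for the "5 layers vs 10 layers" decision; the names candidates, val_scores, and best_model are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Hypothetical candidates standing in for two competing architectures.
candidates = {
    "shallow": DecisionTreeClassifier(max_depth=2, random_state=0),
    "deep": DecisionTreeClassifier(max_depth=10, random_state=0),
}

# Fit each candidate on the training set, score it on the validation set.
val_scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    val_scores[name] = candidate.score(X_val, y_val)

# Pick the winner using validation scores only.
best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# The test set is touched exactly once, after the choice is final.
test_accuracy = best_model.score(X_test, y_test)
print(val_scores, "->", best_name, f"test accuracy: {test_accuracy:.3f}")
```

Note that the test score plays no part in the selection; it only reports on the decision already made.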

4. Reproducibility (Random State)

Reproducibility is a pillar of science. When we randomly split data, we pass a fixed 'Random State' (often the number 42, though any integer works). This seeds the random number generator so that it produces the exact same split every time the code runs on the same data, allowing other engineers to verify your results.

Without a fixed seed, your model's accuracy can shift from run to run purely because of 'lucky' or 'unlucky' splits, which makes it hard to tell whether a change to the model actually helped.

from sklearn.model_selection import train_test_split

# Using random_state ensures the split is identical next time
X_train, X_temp = train_test_split(X, test_size=0.3, random_state=42)
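The effect of the seed can be checked directly. This is a small sketch assuming a hypothetical array of 100 toy samples; the seed values 42 and 7 are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)

# Two runs with the same seed, one run with a different seed.
a_train, a_test = train_test_split(X, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=42)
c_train, c_test = train_test_split(X, test_size=0.3, random_state=7)

same_seed_match = np.array_equal(a_test, b_test)
different_seed_match = np.array_equal(a_test, c_test)

print("same seed, same split:     ", same_seed_match)
print("different seed, same split:", different_seed_match)
```

Two runs with random_state=42 select the same rows in the same order, while a different seed almost certainly selects different ones.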

5. Stratification for Imbalanced Data

But what if 99% of your emails are normal and only 1% are spam? If you split randomly, one of your sets might end up with few or no spam emails, making it impossible to learn or to evaluate spam detection. 'Stratification' forces the split to keep the original class ratio, as closely as the sample counts allow, in both the Train and Test sets. Each class needs at least two samples for this to work, since every class must appear on both sides of the split.

Failing to stratify imbalanced datasets is a critical flaw. You might build a model that scores 99% accuracy while catching no spam at all, simply because it blindly predicts 'Not Spam' every time, having never seen a spam email during training.
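The 99% accuracy trap is easy to reproduce. The sketch below assumes hypothetical labels (990 normal, 10 spam) and uses scikit-learn's DummyClassifier as the model that always predicts the majority class; it scores the model on the same labels purely to show what the metric hides.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 normal emails (0) and 10 spam emails (1).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# A "model" that always predicts the most frequent class, i.e. Not Spam.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)    # share of all emails labelled correctly
spam_recall = recall_score(y, y_pred)   # share of actual spam that was caught

print(f"accuracy:    {accuracy:.2f}")   # 0.99
print(f"spam recall: {spam_recall:.2f}")  # 0.00
```

On imbalanced data, a per-class metric such as recall exposes what overall accuracy conceals.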

# Stratified Splitting
# stratify=y keeps the % of spam (approximately) equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
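One way to confirm that stratification did its job is to compare the class ratio on each side of the split. This sketch assumes a hypothetical label array with 5% spam; the counts are chosen only so the ratios are easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 950 normal emails (0) and 50 spam emails (1), i.e. 5% spam.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The mean of a 0/1 label array is the fraction of spam in that set.
train_spam_rate = y_train.mean()
test_spam_rate = y_test.mean()

print(f"spam rate overall: {y.mean():.3f}")
print(f"spam rate train:   {train_spam_rate:.3f}")
print(f"spam rate test:    {test_spam_rate:.3f}")
```

Dropping stratify=y from the call and rerunning with a few different seeds shows how far the test-set spam rate can drift on its own.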

Pascual Vila

Frontend Instructor // Code Syllabus

Lesson Glossary

[01] Training Set

The portion of data (usually 70-80%) used to train the machine learning model.

Analogy: The Textbook

[02] Validation Set

The data used to provide an unbiased evaluation of a model fit while tuning hyperparameters.

Analogy: The Practice Quiz

[03] Test Set

Data used ONLY to provide an unbiased evaluation of the final model's performance.

Analogy: The Final Exam

[04] Overfitting

When a model learns the noise and details of the training data too well, failing to generalize to new data.

Analogy: Memorization

[05] Random State

A parameter used to ensure that the random split of data is identical every time the code is run.

Example: random_state=42

[06] Stratification

Ensuring that the resulting splits have the same percentage of samples for each target class as the original dataset.

Example: stratify=y
