Why can't I just use two sets: Train and Test?

If you only have Train and Test sets, you will end up using the Test set to tune your model's hyperparameters and chase a better score. The model then indirectly 'learns' from the Test set, and the test score stops being an honest estimate. A three-way split gives you a Validation set for tuning, leaving the Test set completely unseen until the end.

What is 'Overfitting' and how does splitting prevent it?

Overfitting is when a model memorizes the specific noise and exact answers of the training data instead of learning general patterns. Splitting does not prevent overfitting by itself, but it lets you detect it: when you evaluate the model on a separate Validation/Test set that it has never seen, an overfit model shows a very high training score and a noticeably lower validation score.

When is it absolutely mandatory to use Stratification?

You must use stratification when your dataset is highly imbalanced. For example, if you are predicting credit card fraud and only 0.1% of transactions are fraudulent, a random split can leave one of your sets with very few fraudulent cases, or none at all. Stratification keeps the 0.1% ratio approximately the same in every split.

Data Splitting in Artificial Intelligence

Learn the essential methodology of Train/Validation/Test splitting. Understand reproducibility, stratification, and the golden rule of unbiased evaluation.

A model that performs perfectly on its training data is often useless. Real intelligence is the ability to handle data it has never seen before.

1. The Three Vaults

How do you actually know if a machine learning model is intelligent? If you test a student using the exact same questions they studied, you're testing memory, not intelligence.

To solve this, we divide our dataset into three separate vaults: Training, Validation, and Testing. The model is forbidden from seeing the Test set during its learning phase. This mimics the real-world scenario where a model encounters completely new data.

from sklearn.model_selection import train_test_split

# The Three Vaults:
# 1. Training (70-80%): The Textbook
# 2. Validation (10-15%): The Practice Exams
# 3. Testing (10-15%): The Final Exam
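
The three vaults can be built with two calls to `train_test_split`. This is a minimal sketch, assuming a 70/15/15 split and synthetic stand-in data; the array shapes and the `X`, `y` contents are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 1000 rows, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 1: keep 70% for training, hold out 30% as a temporary pool
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 2: cut the pool in half, giving 15% validation and 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

No row appears in more than one vault, and the three sizes add up to the full dataset.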

2. The Training Set

The Training set is the largest portion of your data, typically 70 to 80 percent. This is the 'textbook' the model uses to learn patterns, adjust its internal mathematical weights, and understand the relationship between inputs and outputs.

If you evaluate your model's final performance on this same data, you will fail to detect 'Overfitting', where the model simply memorizes the answers instead of learning general patterns.

# Splitting data to get the Training Set
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# X_train is now ready for model.fit()
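
To see why scoring on the training set hides overfitting, it can help to compare training and held-out accuracy side by side. The sketch below is illustrative only: it assumes a synthetic noisy dataset from `make_classification` and an unconstrained decision tree as a model that memorizes easily. Neither choice comes from the lesson itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y=0.2 assigns random labels
# to roughly 20% of the rows, so there is noise available to memorize
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A tree with no depth limit can keep splitting until it fits every training row
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on data the model has seen
val_acc = tree.score(X_val, y_val)        # scored on data the model has never seen
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

A large gap between the two numbers is the signal to look for; the training score alone would suggest the model is nearly perfect.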

3. The Validation & Test Sets

Next is the Validation set. Imagine you are building a model and deciding whether it should have 5 layers or 10 layers. You train both models, then use the Validation set to test them. Whichever performs better on this 'practice quiz' becomes your final architecture.

Finally, the Test set. This is the vault. It stays locked away during the entire development process. You run predictions on the Test set only ONCE, at the very end of your project. This gives you an unbiased estimate of how your model will perform in production.

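from sklearn.metrics import accuracy_score

# Validation: compare candidate models to tune hyperparameters
accuracy_val = accuracy_score(y_val, model_A.predict(X_val))

# Testing: the unbiased estimate (run once, at the very end!)
# predict() returns predictions, so they still have to be scored against y_test
accuracy_test = accuracy_score(y_test, model.predict(X_test))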
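
The full workflow described above (pick a configuration on the Validation set, then open the Test vault once) might look like the following. This is a sketch under stated assumptions: decision-tree depth stands in for the "5 layers or 10 layers" choice so the example runs quickly, and the data, the candidate values, and the names `candidates`, `val_scores`, and `final_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Candidate hyperparameter values (hypothetical choices)
candidates = [2, 5, 10]
val_scores = {}
for depth in candidates:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)  # validation set only

# The 'practice exam' decides which configuration wins
best_depth = max(val_scores, key=val_scores.get)

# Retrain the winner, then touch the test set exactly once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_depth, round(test_accuracy, 3))
```

The test set never appears inside the loop, so it plays no part in choosing `best_depth`.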

4. Reproducibility (Random State)

Reproducibility is a pillar of science. When we randomly split data, we use a 'Random State' (often the number 42). This ensures the random number generator produces the exact same split every time the code runs, allowing other engineers to verify your exact results.

Without a fixed seed, your model's accuracy can change from run to run simply because of 'lucky' or 'unlucky' splits, which makes results hard to compare or verify.

from sklearn.model_selection import train_test_split

# Using random_state ensures the split is identical next time
X_train, X_temp = train_test_split(X, test_size=0.3, random_state=42)
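
The effect of the seed can be checked directly by splitting the same data twice. This is a small sketch with toy data; the array contents and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy rows

# Same seed: both calls produce the identical split
a_train, a_test = train_test_split(X, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=42)
same_seed_matches = (
    np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
)

# A different seed will generally shuffle the rows differently
c_train, c_test = train_test_split(X, test_size=0.3, random_state=7)
different_seed_matches = np.array_equal(a_test, c_test)

print(same_seed_matches, different_seed_matches)
```

There is nothing special about 42; any fixed integer works, as long as it is recorded along with the results.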

5. Stratification for Imbalanced Data

But what if only 1% of your emails are spam, say 10 spam emails among 990 normal ones? If you split randomly, one of your sets might end up with almost no spam, or none at all, making it impossible to learn or to measure spam detection. 'Stratification' forces the split to preserve the original class ratio, as closely as the counts allow, in both the Train and Test sets.

Failing to stratify imbalanced datasets is a critical flaw. You might build a model that scores 99% accuracy simply because it blindly predicts 'Not Spam' every time, having never seen a spam email during training.

# Stratified Splitting
# stratify=y keeps the spam percentage (approximately) the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
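
To confirm that stratification did its job, compare the class share in each split after the call. The sketch below assumes a toy label array with 5% spam; the counts and the placeholder feature column are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (an assumption for illustration): 950 normal, 50 spam
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With labels coded 0/1, the mean is the share of spam in each split
print("spam share, train:", y_train.mean())
print("spam share, test: ", y_test.mean())
```

Both shares should match the 5% in the full dataset. Dropping `stratify=y` and rerunning with different seeds is a quick way to see how much the shares can drift without it.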
