A model that scores perfectly on its training data can still be useless. For a model, real intelligence is the ability to handle data it has never seen before.
1. The Three Vaults
How do you actually know if a machine learning model is intelligent? If you test a student using the exact same questions they studied, you're testing memory, not intelligence.
To solve this, we divide our dataset into three separate vaults: Training, Validation, and Testing. The model is forbidden from seeing the Test set during its learning phase. This mimics the real-world scenario where a model encounters completely new data.
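As a minimal sketch of how the three vaults could be carved out, the snippet below calls train_test_split twice. The synthetic X and y, the 70/15/15 ratio, and the variable names are assumptions chosen for illustration, not a prescribed recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First split: keep 70% for training, hold back the remaining 30%.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split: divide the held-back 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting in two stages keeps the Test set physically separate from everything the model touches while it learns.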
from sklearn.model_selection import train_test_split
# The Three Vaults:
# 1. Training (70-80%): The Textbook
# 2. Validation (10-15%): The Practice Exams
# 3. Testing (10-15%): The Final Exam
2. The Training Set
The Training set is the largest portion of your data, typically 70 to 80 percent. This is the 'textbook' the model uses to learn patterns, adjust its internal mathematical weights, and understand the relationship between inputs and outputs.
If you measure your model's final performance on this same data, you cannot detect 'Overfitting': the model memorizes the training answers instead of learning patterns that generalize to new data.
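To make overfitting visible, here is a small sketch that compares accuracy on the training data with accuracy on held-out data. The noisy synthetic dataset and the unconstrained decision tree are assumptions picked because they memorize easily; your own model and data will behave differently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption): flip_y=0.2 reassigns
# the labels of roughly 20% of the samples at random.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0
)
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# With no depth limit, the tree can keep splitting until it has
# memorized every training row, noise included.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
held_acc = model.score(X_held, y_held)
print(f"training accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {held_acc:.3f}")
```

The gap between the two numbers is the signal you would miss if you only scored the model on its own textbook.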
# Splitting data to get the Training Set
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# X_train is now ready for model.fit()
# X_temp and y_temp hold the remaining 30%, still to be split into validation and test
3. The Validation & Test Sets
Next is the Validation set. Imagine you are building a model and deciding whether it should have 5 layers or 10 layers. You train both models, then use the Validation set to test them. Whichever performs better on this 'practice quiz' becomes your final architecture.
Finally, the Test set. This is the locked vault. It stays sealed during the entire development process. You only run predictions on the Test set ONCE, at the very end of your project. This gives you an unbiased estimate of how your model will perform in production.
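The workflow described here can be sketched end to end as follows. Two decision trees of different depth stand in for the "5 layers vs 10 layers" choice; that substitution, the synthetic data, and every name in the snippet are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split 70/15/15.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Two hypothetical candidate architectures.
candidates = {
    "shallow": DecisionTreeClassifier(max_depth=2, random_state=0),
    "deep": DecisionTreeClassifier(max_depth=10, random_state=0),
}

# Practice exam: fit each candidate on the training set only,
# then score it on the validation set.
val_scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    val_scores[name] = candidate.score(X_val, y_val)

best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# Final exam: the test set is used exactly once, after the choice is made.
test_accuracy = best_model.score(X_test, y_test)
print(f"chosen: {best_name}, test accuracy: {test_accuracy:.3f}")
```

Because the test set played no part in choosing between the candidates, its score is not flattered by that choice.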
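# Validation: Tune Hyperparameters
# (test_architecture is not defined here; assumed to be your own evaluation helper)
accuracy_val = test_architecture(model_A, X_val, y_val)
# Testing: The Unbiased Estimate (Run Once!)
# predict() returns predictions, not accuracy; for scikit-learn classifiers, score() returns mean accuracy
accuracy_test = model.score(X_test, y_test)
4. Reproducibility (Random State)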
Reproducibility is a pillar of science. When we randomly split data, we use a 'Random State' (often the number 42). This ensures the random number generator produces the exact same split every time the code runs, allowing other engineers to verify your exact results.
Without a fixed seed, your model's measured accuracy changes from run to run depending on which rows happen to land in each split, so you cannot tell a real improvement from a 'lucky' split.
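A quick way to convince yourself is to split the same data several times and compare the results. The toy array and the second seed (7) below are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 50 rows, 2 columns.
X = np.arange(100).reshape(50, 2)

# Same seed twice, then a different seed.
a_train, a_test = train_test_split(X, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=42)
c_train, c_test = train_test_split(X, test_size=0.3, random_state=7)

same_seed_matches = np.array_equal(a_test, b_test)
different_seed_matches = np.array_equal(a_test, c_test)

print("same seed, same split:     ", same_seed_matches)
print("different seed, same split:", different_seed_matches)
```

The number 42 has no special meaning here; any fixed integer gives a repeatable split.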
from sklearn.model_selection import train_test_split
# Using random_state ensures the split is identical next time
X_train, X_temp = train_test_split(X, test_size=0.3, random_state=42)
5. Stratification for Imbalanced Data
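But what if your dataset has 99 normal emails and 1 spam email? If you split randomly, your training set might get 0 spam emails, making it impossible to learn what spam looks like. 'Stratification' forces the split to keep the original ratio of classes, as closely as the sample counts allow, in both the Train and Test sets. Note that scikit-learn needs at least two samples of every class to stratify, so a class with a single example has to be handled by collecting more data first.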
Failing to stratify imbalanced datasets is a critical flaw. You might build a model that scores 99% accuracy simply because it blindly predicts 'Not Spam' every time, having never seen a spam email during training.
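The 99% trap is easy to reproduce. The sketch below scores a stand-in "model" that always answers Not Spam; the 990/10 label counts are hypothetical, chosen to match a 1% spam rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 normal emails (0) and 10 spam emails (1).
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that blindly predicts Not Spam for every email.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred)  # share of spam actually caught

print(f"accuracy:    {accuracy:.2f}")
print(f"spam recall: {spam_recall:.2f}")
```

Accuracy looks excellent while recall on the spam class shows that not a single spam email was caught, which is why accuracy alone is a poor yardstick on imbalanced data.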
# Stratified Splitting
# stratify=y guarantees the same % of spam in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
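To check that a stratified split really carries the class ratio into both sets, you can count the minority class on each side. The labels below are synthetic (990 normal, 10 spam) and the feature column is a placeholder, both assumptions for the sake of the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): 990 normal (0), 10 spam (1).
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With 1% spam overall, each side should also hold about 1% spam.
print("spam in train:", int(y_train.sum()), "of", len(y_train))
print("spam in test: ", int(y_test.sum()), "of", len(y_test))
```

Both sets end up with spam examples, so the model can learn from them and the evaluation can measure whether it did.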