Feature Scaling: Leveling the Playing Field
In machine learning, features with larger magnitudes can disproportionately influence the model. Scaling ensures that every feature contributes comparably to distance calculations and to gradient descent.
Normalization (Min-Max)
Normalization rescales each feature into the range [0, 1]. This is incredibly useful when you know the boundaries of your data, like image pixels (0-255) or percentage scores. It preserves the shape of the original distribution.
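The formula is x' = (x − min) / (max − min). A minimal NumPy sketch (the pixel values below are illustrative), mirroring what Scikit-Learn's MinMaxScaler computes:

```python
import numpy as np

# Illustrative pixel intensities in the known [0, 255] range
pixels = np.array([0.0, 64.0, 128.0, 255.0])

# Min-Max: x' = (x - min) / (max - min)
normalized = (pixels - pixels.min()) / (pixels.max() - pixels.min())
# The minimum maps to 0.0, the maximum to 1.0; everything else keeps
# its relative position, so the distribution's shape is preserved.
```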
Standardization (Z-Score)
Standardization rescales data to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1. Unlike Min-Max, Standardization doesn't bound data to a specific range. Because it relies on the mean and standard deviation rather than the min and max, it is less distorted by outliers than Min-Max scaling, though it is not immune to them. It's the default go-to for algorithms like SVM, PCA, and Logistic Regression.
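The formula is z = (x − μ) / σ. A minimal NumPy sketch (the sample values are illustrative; this is the same computation Scikit-Learn's StandardScaler performs per feature):

```python
import numpy as np

# Illustrative sample; any numeric feature works
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Z-score: z = (x - mean) / std
z = (x - x.mean()) / x.std()
# The result has mean 0 and standard deviation 1, but no fixed
# min/max bounds, unlike Min-Max normalization.
```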
Critical Pitfall: Data Leakage
Never fit your scaler on the entire dataset! You must split your data into training and test sets first. Then, `fit()` the scaler ONLY on the training data, and use it to `transform()` both the training and test sets. Fitting on the whole dataset leaks information about the test set into your model.
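A minimal Scikit-Learn sketch of the correct order (the synthetic feature matrix and split sizes below are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, for illustration only
X = np.random.default_rng(0).normal(size=(100, 3))

# 1. Split FIRST
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# 2. Fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test set using the training statistics
X_test_scaled = scaler.transform(X_test)
```

Note that the test set is only ever passed to `transform()`, so its mean and standard deviation never influence the scaling parameters.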
❓ Frequently Asked Questions
MinMaxScaler vs StandardScaler: Which one to choose?
Use MinMaxScaler when you need values bounded to a fixed range (e.g., Deep Learning image inputs). Use StandardScaler when your data is roughly normally distributed, or when your algorithm assumes zero-centered data (like Principal Component Analysis).
What do I do if my data has massive outliers?
Both MinMax and Standard scalers are sensitive to extreme outliers. In these cases, use Scikit-Learn's RobustScaler. It centers the data on the median and scales it by the Interquartile Range (IQR), so extreme values have far less influence on the scaling parameters (though they are still transformed along with everything else).
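A minimal sketch of the difference (the 1000.0 value below is a deliberately planted outlier): RobustScaler keeps the inliers sensibly spread out, while MinMaxScaler crushes them toward zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four inliers plus one planted outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(x)   # (x - median) / IQR
minmax = MinMaxScaler().fit_transform(x)   # (x - min) / (max - min)

# RobustScaler maps the inliers to [-1.0, 0.5] (median -> 0),
# while MinMaxScaler squashes all four into roughly [0, 0.003].
```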