Feature Scaling

Level the playing field for your algorithms. Master MinMaxScaler, StandardScaler, and RobustScaler to optimize ML model accuracy.

Tutor: Machine Learning models often struggle when features have different scales. Imagine comparing salary ($100,000) to age (35). The salary will dominate the distance calculation!
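
A quick numeric sketch of the tutor's point (both profiles below are invented for illustration):

```python
import numpy as np

# Two people as [salary in $, age in years]
a = np.array([100_000, 35])
b = np.array([101_000, 60])

# Unscaled Euclidean distance: the $1,000 salary gap (just 1% of salary)
# completely swamps the 25-year age gap, which is huge in age terms.
print(np.linalg.norm(a - b))  # ~1000.3, driven almost entirely by salary
```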

Preprocessing Tree

Min-Max Normalization

Transforms features by scaling each feature to a given range, default [0, 1].

Logic Verification

When is MinMaxScaler NOT ideal?


Community Notebooks

Share Your EDA

Wrangled a messy dataset? Share your Kaggle notebooks and get peer reviews on your preprocessing pipelines!

Feature Scaling: Leveling the Playing Field

In machine learning, features with larger magnitudes can disproportionately influence the model. Scaling your features ensures that each one contributes equally to distance calculations and gradient descent updates.

Normalization (Min-Max)

Normalization rescales values into the range $[0, 1]$ via $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. This is incredibly useful when you know the boundaries of your data, like image pixels (0-255) or percentage scores, and it preserves the shape of the original distribution.
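
A minimal sketch using scikit-learn's `MinMaxScaler` (the toy matrix is invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: salary, age
X = np.array([[100_000, 35],
              [ 50_000, 60],
              [ 75_000, 22]], dtype=float)

scaler = MinMaxScaler()             # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)  # column-wise (x - min) / (max - min)
print(X_scaled)
# [[1.    0.342]
#  [0.    1.   ]
#  [0.5   0.   ]]
```

Passing `feature_range=(-1, 1)` to the constructor changes the target interval if your model expects inputs centered around zero.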

Standardization (Z-Score)

Standardization rescales data to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1 via $z = \frac{x - \mu}{\sigma}$. Unlike Min-Max, standardization doesn't bound data to a specific range, so a single extreme value can't squash every other observation into a tiny slice of $[0, 1]$. Note that it is still affected by outliers, though, since the mean and standard deviation themselves get pulled by extremes (see RobustScaler below). It's the default go-to for algorithms like SVM, PCA, and Logistic Regression.
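
A companion sketch with `StandardScaler` on the same invented data; after the transform, each column has zero mean and unit standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[100_000, 35],
              [ 50_000, 60],
              [ 75_000, 22]], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # column-wise (x - mean) / std

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # [1. 1.] (sklearn uses the population std)
```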

Critical Avoidance: Data Leakage

Never fit your scaler on the entire dataset! You must split your data into training and test sets first. Then, `fit()` the scaler ONLY on the training data, and use it to `transform()` both the training and test sets. Fitting on the whole dataset leaks information about the test set into your model.
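
A leak-free sketch of this workflow (the synthetic regression data is just a stand-in):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration.
X, y = make_regression(n_samples=200, n_features=5, random_state=42)

# 1. Split FIRST so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)

# 3. ...then reuse the learned parameters on both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

In practice, wrapping the scaler and the model in a scikit-learn `Pipeline` enforces this ordering automatically, even inside cross-validation.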

Frequently Asked Questions

MinMaxScaler vs StandardScaler: Which one to choose?

Use MinMaxScaler as your default if you need values bounded (e.g., Deep Learning image inputs). Use StandardScaler if your data is normally distributed or if you are using algorithms that assume zero-centered data (like Principal Component Analysis).

What do I do if my data has massive outliers?

Both MinMax and Standard scalers are sensitive to extreme outliers. In these cases, use scikit-learn's RobustScaler. It removes the median and scales the data according to the interquartile range (IQR), so extreme values have almost no influence on the computed scaling parameters (the outliers themselves remain in the data, just rescaled).
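
A small before/after sketch (the salaries, including the $5M outlier, are invented):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

X = np.array([[50_000], [55_000], [58_000], [60_000], [5_000_000]], dtype=float)

print(MinMaxScaler().fit_transform(X).ravel())
# -> the four typical salaries get crushed into roughly [0, 0.002]

print(RobustScaler().fit_transform(X).ravel())
# -> centered on the median, scaled by the IQR: [-1.6, -0.6, 0.0, 0.4, 988.4]
```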

Preprocessing Glossary

fit()
Computes the parameters (like mean and standard deviation) needed for scaling the data.
transform()
Applies the calculated scaling parameters to the dataset.
fit_transform()
A convenience method that calculates parameters and applies them in a single step. ONLY use on training data.
RobustScaler
Scales features using statistics that are robust to outliers (median and interquartile range).
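
A compact sketch tying these four glossary entries together (the toy arrays are invented, with a deliberate outlier in the test set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [100.0]])  # 100.0 is a deliberate outlier

scaler = StandardScaler()
scaler.fit(X_train)                    # fit(): learns mean and std from X_train
X_train_s = scaler.transform(X_train)  # transform(): applies those parameters
X_test_s = scaler.transform(X_test)    # same parameters reused on the test set

# fit_transform(): fit + transform in one call -- training data only!
X_train_s2 = StandardScaler().fit_transform(X_train)

# RobustScaler: median and IQR instead of mean and std
X_train_r = RobustScaler().fit_transform(X_train)
```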