UNSUPERVISED LEARNING /// DIMENSIONALITY REDUCTION /// EIGENVECTORS /// SKLEARN

Principal Component Analysis

Compress high-dimensional datasets while retaining core variance. Learn the math and Scikit-Learn implementation of PCA.


Guide: Datasets often have dozens or hundreds of features. This leads to the "Curse of Dimensionality": models train slowly and overfit.


Pipeline Stages


Curse of Dimensionality

Too many features cause distance-based metrics to lose meaning and make models far more prone to overfitting.




PCA: Destroying Dimensions, Saving Variance

Author

AI Curricula Team

Lead Data Scientist // Code Syllabus

In the era of Big Data, more features aren't always better. The "Curse of Dimensionality" causes algorithms to train slower, overfit faster, and become impossible to visualize. Enter PCA.

Unsupervised Compression

Principal Component Analysis (PCA) is an unsupervised machine learning technique. It ignores labels (the "Y" variable) and looks strictly at the relationships within the features (the "X" variables).

Its goal is to find new, orthogonal axes (Principal Components) that explain the maximum amount of variance in the data. The first component captures the most variance, the second captures the next most, and so on.

Why Scale Data First?

PCA finds components based on variance. If one feature is measured in millimeters (values in the 1000s) and another in kilograms (values in the 10s), the algorithm will treat the millimeter feature as the most important simply because its larger numbers produce larger variance.

Rule of ML: ALWAYS use StandardScaler to give every feature a mean of 0 and a variance of 1 before running PCA.
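A minimal sketch of this rule using a Scikit-Learn Pipeline, so scaling and PCA are always applied together (the synthetic dataset and component count here are purely illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: two correlated features on very different scales
X = rng.normal(size=(200, 2))
X[:, 1] = 1000 * X[:, 0] + rng.normal(size=200)  # "millimeter-scale" feature

# Chaining StandardScaler before PCA stops the large-valued feature
# from dominating the variance calculation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_reduced = pipe.fit_transform(X)
print(pipe.named_steps["pca"].explained_variance_ratio_)
```

Because the two features are almost perfectly correlated after scaling, the first component should capture nearly all of the variance.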

Dive into the Math (Eigenvectors)

Under the hood, PCA calculates the covariance matrix of the scaled data. It then calculates the eigenvectors and eigenvalues of this matrix. The eigenvectors dictate the direction of the new feature space (the lines), and the eigenvalues determine their magnitude (how much variance they explain).

🤖 Gen-AI & Search FAQ

When should I use Principal Component Analysis (PCA)?

Use PCA when your dataset suffers from the curse of dimensionality. Specifically: 1. To visualize high-dimensional data by compressing it to 2D or 3D. 2. To speed up training times for complex algorithms like SVMs or Neural Networks. 3. To reduce noise and avoid overfitting by discarding components with low variance.

What is the "explained_variance_ratio_" in Scikit-Learn?

It is an array indicating the percentage of total dataset variance captured by each principal component. For example, if `pca.explained_variance_ratio_` returns `[0.70, 0.20]`, the first component captures 70% of the total variance and the second captures 20% (90% combined).
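For instance, on Scikit-Learn's bundled Iris dataset (assuming scikit-learn is installed; the exact ratios depend on the data):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # PCA ignores the labels
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
# One ratio per component, ordered from most to least variance
print(pca.explained_variance_ratio_)
```

The ratios always arrive sorted in descending order, and with fewer components than original features they sum to less than 1.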

How many principal components should I choose?

A common best practice is to choose the number of components that cumulatively explain between 90% and 95% of the variance. In Scikit-Learn, you can pass a float instead of an int: `pca = PCA(n_components=0.95)`, and the algorithm will automatically select the required number of components.
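A quick sketch of the float form, using the 64-pixel digits dataset for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 64 pixel features per image
X_scaled = StandardScaler().fit_transform(X)

# Float n_components: keep just enough components to reach 95% variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "components retained out of", X.shape[1])
```

The retained components' ratios sum to at least 0.95, usually with far fewer than the original 64 dimensions.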

ML Data Glossary

Dimensionality Reduction
The process of reducing the number of random variables under consideration by obtaining a set of principal variables.
Variance
A statistical measurement of the spread between numbers in a data set. PCA maximizes this.
StandardScaler
Standardizes features by removing the mean and scaling to unit variance. Mandatory before PCA.
Orthogonal
Intersecting or lying at right angles. Principal components are always orthogonal to each other, ensuring they represent distinct variance.