PCA: Destroying Dimensions, Saving Variance

AI Curricula Team
Lead Data Scientist // Code Syllabus
In the era of Big Data, more features isn't always better. The "Curse of Dimensionality" makes models train slower and overfit faster, and makes the data impossible to visualize. Enter PCA.
Unsupervised Compression
Principal Component Analysis (PCA) is an unsupervised machine learning technique. It ignores labels (the "Y" variable) and looks strictly at the relationships within the features (the "X" variables).
Its goal is to find new, orthogonal axes (Principal Components) that explain the maximum amount of variance in the data. The first component captures the most variance, the second captures the next most, and so on.
Why Scale Data First?
PCA finds components based on variance. If you measure distance in millimeters (values in the 1000s) and weight in kilograms (values in the 10s), the algorithm will incorrectly assume distance is the most important feature simply because the numbers are larger.
Rule of ML: ALWAYS use StandardScaler to give every feature a mean of 0 and a variance of 1 before running PCA.
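To make the rule concrete, here is a minimal sketch using synthetic data (the "distance in millimeters vs. weight in kilograms" scenario above; the values and feature names are illustrative, not from a real dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data: distance in mm (values in the 1000s) vs. weight in kg (values in the 10s)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(5000, 800, size=200),  # distance_mm
    rng.normal(70, 12, size=200),     # weight_kg
])

# Scale first: every feature gets mean 0 and variance 1
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```

Without the scaling step, the millimeter column's huge variance would dominate and the first component would simply point along the distance axis.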
Dive into the Math (Eigenvectors)
Under the hood, PCA calculates the covariance matrix of the scaled data. It then calculates the eigenvectors and eigenvalues of this matrix. The eigenvectors dictate the direction of the new feature space (the lines), and the eigenvalues determine their magnitude (how much variance they explain).
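The covariance-matrix route can be verified by hand with NumPy and compared against Scikit-Learn. This is a sketch on randomly generated correlated data: the eigenvalues of the covariance matrix should match `pca.explained_variance_` up to floating-point error.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))  # correlated features
X = StandardScaler().fit_transform(X)

# Manual route: covariance matrix -> eigenvalues / eigenvectors
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
eigenvalues = eigenvalues[::-1]                  # sort descending, like PCA does

# Scikit-Learn route
pca = PCA().fit(X)

print(eigenvalues)              # variance along each principal direction
print(pca.explained_variance_)  # same values (up to floating-point error)
```

The eigenvectors may differ from `pca.components_` by a sign flip (a direction and its negative describe the same line), but the eigenvalues, i.e. the variances, agree.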
🤖 Gen-AI & Search FAQ
When should I use Principal Component Analysis (PCA)?
Use PCA when your dataset suffers from the curse of dimensionality. Specifically:
1. To visualize high-dimensional data by compressing it to 2D or 3D.
2. To speed up training times for complex algorithms like SVMs or Neural Networks.
3. To reduce noise and avoid overfitting by discarding components with low variance.
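The visualization use case can be sketched with Scikit-Learn's built-in handwritten-digits dataset, compressing 64 pixel features down to 2 coordinates you could pass to a scatter plot:

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 8x8 digit images flattened to 64 features
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Compress 64 dimensions down to 2 for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X.shape)     # (1797, 64)
print(X_2d.shape)  # (1797, 2)
```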
What is the "explained_variance_ratio_" in Scikit-Learn?
It is an array indicating the percentage of total dataset variance captured by each principal component. For example, if `pca.explained_variance_ratio_` returns `[0.70, 0.20]`, the first component captures 70% of the total variance and the second captures 20% (90% cumulatively).
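A quick sketch on the classic Iris dataset (4 features) shows the attribute in action; the exact percentages depend on the data, but the components always come out in descending order:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
print(pca.explained_variance_ratio_)        # per-component share, largest first
print(pca.explained_variance_ratio_.sum())  # cumulative share kept by these 2 components
```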
How many principal components should I choose?
A common best practice is to choose the number of components that cumulatively explain between 90% and 95% of the variance. In Scikit-Learn, you can pass a float instead of an int: `pca = PCA(n_components=0.95)`, and the algorithm will automatically select the smallest number of components that reaches that threshold.
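For example, on the 64-feature digits dataset, asking for 95% of the variance lets PCA pick the component count for you:

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# A float n_components means "keep this fraction of the variance"
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_scaled.shape[1], "->", pca.n_components_)  # chosen automatically
print(pca.explained_variance_ratio_.sum())         # at least 0.95
```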