
K-Means Clustering

Module 3: Advanced Capabilities. Discover hidden groupings within unlabelled data using mathematical distance and iterative convergence.


Unlike regression, K-Means is an unsupervised learning algorithm: there are no target labels, only data points. The goal is to discover hidden patterns.


K-Means Clustering: Finding Patterns in Chaos

In Unsupervised Learning, we don't hold the algorithm's hand. We give it raw data and say, "organize this." K-Means is the industry workhorse for fast, scalable clustering.

The Core Intuition

Imagine you have a dataset of customer purchase behaviors, but no predefined categories. K-Means partitions this data into K distinct clusters based on feature similarity, using Euclidean distance to measure how close each data point is to virtual center points called centroids.

The Algorithm Step-by-Step

  • Initialization: Choose a value for K. The algorithm randomly places K centroids in the feature space. (Modern libraries use k-means++ to strategically spread them out, preventing poor convergence).
  • Assignment: Every data point is assigned to its closest centroid.
  • Update: The centroids are moved to the mean (average) location of all the points assigned to them.
  • Convergence: The Assignment and Update steps repeat until the centroids stop moving (or move less than a small tolerance).
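The four steps above can be sketched in plain NumPy. This is a minimal illustration, not the optimized scikit-learn implementation: initialization here picks K existing data points at random rather than using k-means++.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (a centroid with no assigned points simply stays where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs this converges in a handful of iterations; production code should prefer `sklearn.cluster.KMeans`, which adds k-means++ initialization and multiple restarts.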

Choosing 'K': The Elbow Method

The biggest challenge is choosing K. We use a metric called Inertia (Within-Cluster Sum of Squares, WCSS), which measures how tight the clusters are. Plot the inertia for a range of K values (e.g., K = 1 to 10); the point where the curve bends sharply, the "elbow", is generally the optimal number of clusters.
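A minimal sketch of the elbow computation, assuming scikit-learn is installed; the three blob centers and sample counts below are illustrative synthetic data, chosen so the elbow lands at K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic data: three well-separated blobs (illustrative values)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Inertia (WCSS) for K = 1..6; the curve should bend sharply at K = 3
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, wcss in zip(range(1, 7), inertias):
    print(f"K={k}: inertia={wcss:.1f}")
```

Plotting `inertias` against K with matplotlib makes the elbow visible at a glance; the drop from K = 2 to K = 3 dwarfs every later improvement.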

Algorithm Limitations

1. Scale Sensitivity: K-Means relies heavily on distances. If features are on different scales (e.g., Age 0-100 vs Income $0-$1M), you must standardize your data first, e.g., with scikit-learn's StandardScaler.

2. Outlier Impact: Because each centroid is the mean of its cluster, outliers drag centroids significantly. Consider removing outliers or using robust scalers.

3. Non-Spherical Data: K-Means assumes clusters are spherical. It fails on complex geometrical shapes (like nested circles). In those cases, algorithms like DBSCAN are superior.

❓ Frequently Asked Questions

What is the difference between Supervised Learning and K-Means?

Supervised learning requires labeled target data (`y`) to train models like Logistic Regression. K-Means is Unsupervised; it only uses input features (`X`) to find hidden groups or structures without any predefined labels.

What is Inertia in K-Means clustering?

Inertia, or Within-Cluster Sum of Squares (WCSS), is the sum of squared distances of samples to their closest cluster center. Lower inertia means tighter clusters, but inertia always decreases as K grows: at the extreme, zero inertia means every point is its own cluster, which is overfitting.

Scikit-Learn Glossary

KMeans(n_clusters=K)
Scikit-Learn class used to instantiate the model, specifying the number of clusters to form.

fit(X)
Method to compute k-means clustering. Does NOT take a y target array.

predict(X)
Method to predict the closest cluster each sample in X belongs to.

cluster_centers_
Attribute returning an array of the coordinates of the cluster centers.

inertia_
Attribute returning the sum of squared distances of samples to their closest cluster center.

k-means++
A smart initialization technique that spreads out the initial centroids to accelerate convergence and avoid poor local optima.
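Putting the glossary together, here is a minimal end-to-end sketch (assumes scikit-learn is installed; the two synthetic blobs stand in for real customer features):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "customer" data: two well-separated blobs in feature space
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0)  # KMeans(n_clusters=K)
model.fit(X)                    # fit(X): no y argument, this is unsupervised
labels = model.predict(X)       # predict(X): closest cluster index per sample

print(model.cluster_centers_)   # cluster_centers_: centroid coordinates
print(model.inertia_)           # inertia_: WCSS of the fitted clustering
```

Calling `model.predict` on new, unseen rows assigns them to the nearest learned centroid, which is how a fitted clustering gets used in production.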