Data points don't exist in isolation. Hierarchical clustering allows us to see the nested relationships between groups, from individual points to the entire population.
1. Building Trees
While K-Means forces you to choose a single number of clusters upfront, Hierarchical Clustering builds a tree of relationships, showing how every point connects to every other point.
Instead of just returning a flat list of groups, it creates a nested hierarchy. This is incredibly useful in biology (like building evolutionary trees) or customer segmentation, where you might want to see both broad groups (e.g., 'Spenders') and specific sub-groups (e.g., 'Weekend Spenders').
# K-Means: Flat groups
# Hierarchical: Nested relationships
print("Building the hierarchy...")
2. Agglomerative Merging
The most common method of hierarchical clustering is 'Agglomerative'. It operates 'bottom-up'.
It starts with every single data point acting as its own individual cluster. Then, it iteratively finds the two closest clusters and merges them into one. It repeats this process, building larger and larger clusters, until everything is merged into a single massive group.
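To make the merge loop concrete, here is a minimal from-scratch sketch. It is an illustration, not how scikit-learn implements it: the function name `naive_agglomerative` and the toy `points` array are made up for this example, and it uses single linkage (distance between the closest members) because that is the simplest rule to write down.

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Uses single linkage (closest pair of points) purely for illustration.
    """
    X = np.asarray(X, dtype=float)
    # Start with every point as its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    while len(clusters) > n_clusters:
        best = (np.inf, -1, -1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = dists[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, then drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy data: two tight pairs and one lone point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
result = naive_agglomerative(points, n_clusters=3)
print(result)
```

The nested loops over cluster pairs inside the `while` loop are the reason this naive version scales so poorly. Library implementations are far more efficient, but the idea is the same.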
from sklearn.cluster import AgglomerativeClustering
# Bottom-up clustering
model = AgglomerativeClustering(n_clusters=3)
model.fit(X)
3. The Dendrogram
To visualize these hierarchical connections, we use a 'Dendrogram'. It's a tree diagram that records the entire sequence of merges.
The horizontal axis lists the individual data points, and the vertical axis shows the distance (dissimilarity) at which clusters are merged. When two branches join, the height of the horizontal bar connecting them tells you how far apart those two clusters were under the chosen linkage. A long vertical stretch before a merge means you are joining two very distinct, dissimilar groups.
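The numbers behind a dendrogram live in SciPy's linkage matrix, so you can read the merge heights directly without drawing anything. The sketch below uses a tiny made-up array (`X_demo`) of two tight pairs that sit far apart; the variable names are assumptions for this example only.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: two tight pairs that sit far apart.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

Z_demo = sch.linkage(X_demo, method='ward')

# Each row of Z records one merge:
# [cluster index a, cluster index b, merge height, size of new cluster]
for a, b, height, size in Z_demo:
    print(f"merge {int(a)} + {int(b)} at height {height:.2f} -> {int(size)} points")

# no_plot=True returns the tree layout without needing matplotlib.
tree = sch.dendrogram(Z_demo, no_plot=True)
print(tree['ivl'])  # leaf labels in left-to-right order
```

The first two rows are the low merges inside each pair. The last row is the tall merge that joins the two pairs, which is the long vertical stretch you would see at the top of the plot.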
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
# Generate the tree (drawn on the current matplotlib axes)
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.show()
4. Cutting the Tree
The true power of a dendrogram is that you can 'cut' the tree at different heights to get different numbers of clusters without recomputing the merges.
If you want highly specific groups, you make a low horizontal cut (resulting in many clusters). If you want broad categories, you make a high cut (resulting in few clusters). It gives you the flexibility to explore the data and choose the right scale for your specific business problem.
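One way to make these cuts is SciPy's `fcluster`, which takes an existing linkage matrix and returns flat labels. The sketch below builds the tree once on made-up data (three synthetic blobs, an assumption for this example) and then cuts it three different ways.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Build the tree once.
Z_demo = sch.linkage(X_demo, method='ward')

# High cut: ask for at most 2 clusters (broad categories).
broad = sch.fcluster(Z_demo, t=2, criterion='maxclust')
# Lower cut: ask for at most 6 clusters (finer groups).
fine = sch.fcluster(Z_demo, t=6, criterion='maxclust')
# Alternatively, cut at an explicit height on the dendrogram's vertical axis.
by_height = sch.fcluster(Z_demo, t=20.0, criterion='distance')

print(len(np.unique(broad)), len(np.unique(fine)), len(np.unique(by_height)))
```

Because every cut comes from the same tree, the fine groups are always sub-groups of the broad ones: no fine cluster ever straddles two broad clusters.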
# No 'n_clusters' required for linkage calculation!
Z = sch.linkage(X, 'ward')
# Cut the tree later to decide K
5. Linkage and Computational Cost
When merging groups, how do you measure the distance between them? This choice is called the 'Linkage'. 'Ward' linkage merges, at each step, the pair of clusters whose union increases the total within-cluster variance the least, leading to tight, compact groups.
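To see what a linkage rule actually computes, it helps to work one out by hand. The sketch below compares Ward's criterion with three other common rules (single, complete, and average linkage, which this lesson does not cover but which SciPy also offers as `method` options). The two clusters are made up so the arithmetic is easy to check.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical clusters on a line, chosen so the arithmetic is easy to follow.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All point-to-point distances between the two clusters.
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

# Ward: increase in within-cluster sum of squares if the two were merged.
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))
ward_increase = (n_a * n_b) / (n_a + n_b) * centroid_gap ** 2

print(single, complete, average, ward_increase)
```

The same two clusters look "closer" or "farther" depending on the rule, which is why the choice of linkage can change the shape of the whole tree.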
However, hierarchical clustering has a major downside: computational cost. A standard implementation needs the distance between every pair of points, which takes memory proportional to n², and then repeatedly searches for the closest pair of clusters. As a result it is much slower than K-Means and is generally impractical on datasets with millions of rows.
# Ward linkage for compact clusters
model = AgglomerativeClustering(linkage='ward')
# Warning: O(n^3) time in the worst case, O(n^2) memory for pairwise distances
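A quick back-of-the-envelope calculation shows why millions of rows are out of reach. The helper below (`pairwise_cost`, a name made up for this example) counts the point pairs and estimates the memory needed to store one 8-byte float per pair, assuming a dense pairwise distance array such as the one SciPy's `pdist` produces.

```python
def pairwise_cost(n_rows, bytes_per_value=8):
    """Return (number of point pairs, bytes to store one value per pair)."""
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs, n_pairs * bytes_per_value

for n in (1_000, 100_000, 1_000_000):
    pairs, n_bytes = pairwise_cost(n)
    print(f"{n:>9,} rows -> {pairs:,} pairs, about {n_bytes / 1e9:,.1f} GB")
```

The pair count grows with the square of the row count, so multiplying the data by 10 multiplies the memory by roughly 100. That is before any of the merging work has even started.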