Why can't Graph Neural Networks be 100 layers deep like ResNets? When you try to stack GNN layers, you hit the hard mathematical limits of graph topology.
1. Over-smoothing: The Feature Collapse
Message passing is fundamentally a low-pass filtering operation: it smooths out variations between neighboring nodes. Each added layer makes a node's features a weighted average over a larger neighborhood. In a shallow model (2-3 layers), this builds vital local context. However, if you push a standard GCN to 32 layers, information diffuses so far that, on most graphs, every node ends up 'seeing' the entire graph.
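To make the low-pass behaviour concrete, here is a minimal sketch. It assumes a toy 6-node path graph and plain mean aggregation over each node and its neighbors, with no learned weights or nonlinearity (a real GCN layer has both). The helper names `meanAggregate` and `featureSpread` are illustrative and do not come from any library.

```javascript
// Toy illustration (assumed setup): a 6-node path graph 0-1-2-3-4-5,
// one scalar feature per node, mean aggregation over self + neighbors.
const neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]];

// One round of message passing: each node becomes the mean of
// itself and its neighbors (a simple low-pass filter).
function meanAggregate(features) {
  return features.map((own, i) => {
    const group = [own, ...neighbors[i].map(j => features[j])];
    return group.reduce((sum, v) => sum + v, 0) / group.length;
  });
}

// Gap between the largest and smallest node feature: a crude
// measure of how distinguishable the nodes still are.
function featureSpread(features) {
  return Math.max(...features) - Math.min(...features);
}

let h = [1, 0, 0, 0, 0, 0]; // only node 0 starts with a signal
const spreadByDepth = [featureSpread(h)];
for (let layer = 1; layer <= 32; layer++) {
  h = meanAggregate(h);
  spreadByDepth.push(featureSpread(h));
}

console.log("spread after 2 layers: ", spreadByDepth[2].toFixed(4));
console.log("spread after 32 layers:", spreadByDepth[32].toFixed(4));
```

The spread never increases from one layer to the next, and by 32 layers it is a small fraction of its 2-layer value: the six nodes have become nearly indistinguishable even though only one of them carried a signal at the start.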
Mathematically, repeated propagation drives the node features toward a fixed point that depends only on graph structure (node degrees and connected components), and the Dirichlet energy, which measures how much features differ across edges, decays toward zero. The graph becomes a blurry 'soup' where a fraudulent transaction node looks almost exactly like a normal transaction node because their 32-hop neighborhoods overlap almost completely. To fix this, we use Initial Residuals (as in the GCNII architecture). By explicitly feeding the original, un-smoothed node features (H_0) back into every layer, we force the model to remember each node's original identity, which is what lets GCNII-style models train at 64 or more layers.
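The sketch below tracks the Dirichlet energy with and without an initial residual. It rests on several assumptions: a toy 6-node path graph, an unnormalized energy (the plain sum of squared feature differences across edges), and mean aggregation standing in for the normalized adjacency. Learned weights and nonlinearities are left out, so this illustrates only the propagation step, not a full GCNII layer. The names `dirichletEnergy`, `smooth`, and `propagate` are hypothetical helpers.

```javascript
// Assumed toy graph: path 0-1-2-3-4-5 with one scalar feature per node.
const edges = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]];
const neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]];

// Unnormalized Dirichlet energy: sum over edges of (h_i - h_j)^2.
// It is zero exactly when connected nodes carry identical features.
function dirichletEnergy(h) {
  return edges.reduce((sum, [i, j]) => sum + (h[i] - h[j]) ** 2, 0);
}

// Mean aggregation over self + neighbors (stand-in for A_norm times H).
function smooth(h) {
  return h.map((own, i) => {
    const group = [own, ...neighbors[i].map(j => h[j])];
    return group.reduce((s, v) => s + v, 0) / group.length;
  });
}

// Propagate for `layers` steps, mixing the input back in with weight alpha.
// alpha = 0 recovers plain propagation.
function propagate(h0, layers, alpha) {
  let h = h0;
  for (let l = 0; l < layers; l++) {
    const s = smooth(h);
    h = s.map((v, i) => (1 - alpha) * v + alpha * h0[i]);
  }
  return h;
}

const h0 = [1, 0, 1, 0, 1, 0]; // alternating, high-frequency signal
const plainEnergy = dirichletEnergy(propagate(h0, 64, 0));
const residualEnergy = dirichletEnergy(propagate(h0, 64, 0.1));

console.log("input energy:          ", dirichletEnergy(h0));
console.log("64 layers, alpha = 0:  ", plainEnergy.toExponential(2));
console.log("64 layers, alpha = 0.1:", residualEnergy.toExponential(2));
```

With `alpha = 0` the energy falls to practically zero after 64 layers, while `alpha = 0.1` keeps it several orders of magnitude higher, because a fixed share of the original signal is re-injected at every step.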
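// GCNII-style layer: Preventing Over-smoothing
// H_0: Original input features (N x d), kept fixed across layers
// alpha: Identity preservation weight, between 0 and 1
// A_norm: normalized adjacency (N x N); W: layer weights (d x d)
// Matrices are plain arrays of rows.
const matmul = (A, B) =>
  A.map(row => B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0)));

function GCNII_Layer(H_prev, H_0, A_norm, W, alpha) {
  // 1. Standard neighbor aggregation
  const smoothed = matmul(A_norm, H_prev);
  // 2. Initial Residual Connection:
  //    mix smoothed features with the original identity
  const restored = smoothed.map((row, i) =>
    row.map((v, j) => (1 - alpha) * v + alpha * H_0[i][j]));
  // 3. Transformation followed by ReLU
  //    (the full GCNII layer also blends W with the identity matrix; omitted here)
  return matmul(restored, W).map(row => row.map(v => Math.max(0, v)));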
}
2. Over-squashing: The Topological Choke Point
Over-squashing is a related but distinct structural problem. It occurs when the number of nodes within a given radius grows exponentially with that radius, as in tree-like regions of a graph (which have negative curvature). A 5-layer GNN gives each node a receptive field that reaches 5 hops away. In a densely connected network, that 5-hop radius might contain 10,000 nodes. The GNN is forced to compress ('squash') the information from all 10,000 nodes into a single fixed-size vector, say 64 dimensions, at the target node.
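To see how quickly a receptive field can grow, the sketch below counts the nodes within r hops using breadth-first search. Both example graphs are assumptions chosen for illustration: a tree in which every node has 6 children (numbered heap-style) and an unbounded path graph. The function `receptiveFieldSizes` is a hypothetical helper.

```javascript
// Count how many nodes fall inside each r-hop receptive field using BFS.
// `neighborsOf` is a function, so the graph never has to be materialized.
function receptiveFieldSizes(start, neighborsOf, maxHops) {
  const seen = new Set([start]);
  let frontier = [start];
  const sizes = [1]; // radius 0: just the node itself
  for (let hop = 1; hop <= maxHops; hop++) {
    const next = [];
    for (const u of frontier) {
      for (const v of neighborsOf(u)) {
        if (!seen.has(v)) {
          seen.add(v);
          next.push(v);
        }
      }
    }
    frontier = next;
    sizes.push(seen.size);
  }
  return sizes;
}

// Assumed example graphs:
// a tree where node i has children b*i+1 .. b*i+b (heap-style numbering),
// and a path graph on the integers.
const treeNeighbors = b => i => {
  const children = Array.from({ length: b }, (_, k) => b * i + k + 1);
  return i === 0 ? children : [Math.floor((i - 1) / b), ...children];
};
const pathNeighbors = i => [i - 1, i + 1];

const treeSizes = receptiveFieldSizes(0, treeNeighbors(6), 5);
const pathSizes = receptiveFieldSizes(0, pathNeighbors, 5);

console.log("6-ary tree, hops 0..5:", treeSizes.join(", "));
console.log("path graph, hops 0..5:", pathSizes.join(", "));
```

Starting from the root of the 6-ary tree, the 5-hop field already holds 9,331 nodes, while the path graph holds 11. A fixed-size vector at the target node has to summarize both.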
This bottleneck can cause critical long-range dependencies to be lost. Strategies to fix it include Graph Rewiring (adding synthetic edges that bridge distant parts of the graph, which reduces the topological distance between them) and DropEdge (randomly deleting edges during training). DropEdge keeps the model from overfitting to dense local structure and acts as a regularizer; it was introduced mainly to slow over-smoothing, and its benefit for over-squashing is less well established.
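The effect of rewiring on topological distance can be sketched with a breadth-first search. The 12-node path graph and the hand-picked shortcut edge below are assumptions for illustration only; published rewiring methods choose edges with a criterion such as curvature or spectral gap. `buildAdjacency` and `hopDistance` are hypothetical helper names.

```javascript
// Build an adjacency list for an undirected graph from an edge list.
function buildAdjacency(numNodes, edges) {
  const adj = Array.from({ length: numNodes }, () => []);
  for (const [u, v] of edges) {
    adj[u].push(v);
    adj[v].push(u);
  }
  return adj;
}

// Breadth-first search: number of hops from source to target, which is
// also the minimum number of message-passing layers needed for
// information at `source` to reach `target`. Returns -1 if unreachable.
function hopDistance(adj, source, target) {
  const dist = new Array(adj.length).fill(-1);
  dist[source] = 0;
  const queue = [source];
  for (let head = 0; head < queue.length; head++) {
    const u = queue[head];
    if (u === target) return dist[u];
    for (const v of adj[u]) {
      if (dist[v] === -1) {
        dist[v] = dist[u] + 1;
        queue.push(v);
      }
    }
  }
  return -1;
}

// Assumed toy graph: a 12-node path 0-1-...-11.
const pathEdges = Array.from({ length: 11 }, (_, i) => [i, i + 1]);
const original = buildAdjacency(12, pathEdges);

// Rewiring step: add one synthetic shortcut edge between nodes 2 and 9.
const rewired = buildAdjacency(12, [...pathEdges, [2, 9]]);

const hopsBefore = hopDistance(original, 0, 11);
const hopsAfter = hopDistance(rewired, 0, 11);
console.log("hops 0 -> 11 before rewiring:", hopsBefore);
console.log("hops 0 -> 11 after rewiring: ", hopsAfter);
```

One added edge cuts the distance between the two ends from 11 hops to 5, so a 5-layer model can now carry information between them instead of needing 11 layers.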
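// DropEdge: Structural Regularization
// Run again at every training epoch; use the full graph at inference time.
// adjMatrix: { edges: [{ src, dst, weight }, ...] } (an assumed edge-list format)
function applyDropEdge(adjMatrix, p_drop) {
  if (p_drop < 0 || p_drop >= 1) {
    throw new RangeError("p_drop must be in [0, 1)");
  }
  const scale = 1 / (1 - p_drop);
  const keptEdges = [];
  for (const edge of adjMatrix.edges) {
    // Only keep edge with probability (1 - p_drop)
    if (Math.random() >= p_drop) {
      // Rescale the weight so its expected value is unchanged
      keptEdges.push({ ...edge, weight: edge.weight * scale });
    }
  }
  return { ...adjMatrix, edges: keptEdges };
}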