When should I use Sum pooling instead of Mean pooling for graph classification?

You should use Sum pooling when the absolute size or volume of the graph is highly correlated with the target you are predicting. For example, if you are predicting the molecular mass of a chemical compound, or the total traffic volume of a road network, Sum pooling naturally scales with the number of nodes. Mean pooling divides by the number of nodes, making a 5-node graph indistinguishable from a structurally identical 500-node graph in the aggregated feature space.

Why is DiffPool so memory intensive?

DiffPool learns a soft assignment matrix S of size N × C, where N is the number of nodes in the current layer and C is the number of clusters in the next layer. This matrix is dense, meaning every node has a small probability of belonging to every cluster. Multiplying the sparse adjacency matrix A by this dense assignment matrix (S^T A S) results in a dense coarsened adjacency matrix for the next layer. For large graphs, storing and computing on this dense matrix causes severe Out-Of-Memory (OOM) errors.

How does SAGPool solve the memory issues of DiffPool?

Instead of clustering nodes together using a dense assignment matrix, SAGPool simply drops nodes. It computes an attention score for each node, ranks them, and completely removes the bottom X% of nodes and all their connected edges. The remaining graph is just a subset of the original sparse graph, so it remains sparse and highly memory efficient. The trade-off is that information from the dropped nodes is permanently lost, whereas DiffPool retains all information in its clusters.

🚀 LEVEL UP TO SENIOR:Unlock 500+ Advanced Practical Challenges & Exercises.

🎓 COURSERA PARTNER:Earn professional Google, Meta, and IBM certificates to supercharge your resume.

Tutorials

HTML MASTER CLASS /// LEARN TAGS /// BUILD STRUCTURE /// SEMANTIC WEB /// HTML MASTER CLASS /// LEARN TAGS ///

⚡ Total XP: 0|💻 artificialintelligence XP: 0

Graph Pooling and Hierarchical Reps in AI & Artificial Intelligence

Master the architectures used for graph-level summarization. Explore Global Readout operations (Sum, Mean, Max) and their theoretical expressivity limitations. Dive into advanced Hierarchical Pooling techniques like DiffPool and SAGPool, learning how to preserve structural motifs and community groupings while compressing network dimensionality.

LOADING ENGINE...

Skill Matrix

UNLOCK NODES BY LEARNING NEW TAGS.

Pooling Hub

Summary logic.

Quick Quiz //

What is the primary purpose of a Readout Layer in a GNN?

How do you represent a complex, million-node graph with a single vector? Pooling is the critical bridge from local node-level features to global, graph-level intelligence.

1Global Readout: The Flat Summary

For graph-level tasks — like predicting whether a molecule is toxic, or classifying a sub-reddit's community type — you eventually need to compress all node embeddings into a single fixed-size vector. This operation is called the Readout Layer or Global Pooling.

The simplest approaches are order-invariant aggregations over the entire graph: Global Mean, Global Sum, and Global Max. Each has different properties. Mean calculates the average feature distribution but is blind to graph size (a graph with 10 carbon atoms looks identical to a graph with 100 carbon atoms). Sum preserves graph volume, which is critical for chemistry tasks where atom count dictates physical properties. Max extracts the strongest isolated signals (useful for detecting a single toxic substructure). However, all global readouts are 'flat' — they instantly destroy all community and motif structures. You compress a complex topology into a single dense vector in one brutal step.

—

// Simple Global Readout
// h_nodes shape: [num_nodes, features]

function graphReadout(h_nodes, method) {
  if (method === 'sum') {
    return sumCols(h_nodes); // Volume aware
  }
  if (method === 'mean') {
    return meanCols(h_nodes); // Distribution
  }
  if (method === 'max') {
    return maxCols(h_nodes); // Strongest signal
  }
}
// Result: single [features] vector
// Topology is completely forgotten.

localhost:3000

localhost:3000/global-readout

Molecule Property Task

Mean Pooling: FAILS on mass prediction

Sum Pooling: PASSES on mass prediction

Always match pooling to the physics.

2Hierarchical Pooling (DiffPool & SAGPool)

Hierarchical Pooling attempts to mirror the natural organization of networks, much like a CNN pools pixels into edges, edges into textures, and textures into objects. A graph is pooled down progressively across multiple layers.

DiffPool (Differentiable Pooling) trains a secondary GNN to learn an assignment matrix *S*. This matrix maps the N nodes of the current graph into C clusters (where C < N). The features of the nodes in a cluster are aggregated to form a new super-node, and the edges between clusters form a new coarsened adjacency matrix. This preserves structural motifs (like functional groups in a molecule) at intermediate layers.

SAGPool (Self-Attention Graph Pooling) takes a different approach: node dropping. It uses a GNN to calculate an attention score for every node. It simply sorts the nodes by score, drops the bottom 50% (the 'noise'), and keeps the top 50% (the 'signal'), preserving the edges between the retained nodes. This is much more memory efficient than DiffPool, which requires dense N×C assignment matrices.

—

// SAGPool: Top-K Node Dropping
function sagPool(X, A, ratio = 0.5) {
  // 1. Score nodes using a GCN
  const scores = sigmoid(A_norm @ X @ W_att);
  
  // 2. Find cut-off threshold
  const k = Math.floor(X.length * ratio);
  const topKIndices = argsort(scores).slice(0, k);
  
  // 3. Drop uninformative nodes
  const X_pooled = X[topKIndices] * scores[topKIndices];
  const A_pooled = A[topKIndices][topKIndices];
  
  return { X: X_pooled, A: A_pooled };
}

localhost:3000

localhost:3000/sag-pool

SAGPool Reduction (Ratio = 0.5)

Input: 10,000 nodes, 45,000 edges

Output: 5,000 nodes, 11,200 edges

Computations halved for next layer ✓