Not all neighbors are equal. GAT brings selective, learnable attention to graph neural networks, allowing models to dynamically focus on the neighbors that matter most for each specific task.
1Anisotropic Filtering: Learning to Focus
Unlike GCNs, which are Isotropic (treating all neighbors identically, weighted only by graph structure), GATs are Anisotropic. The importance of a neighbor is learned from data rather than fixed by topology. For every directed edge (j → i), the attention coefficient is computed as: e_ij = LeakyReLU(aᵀ · [Wh_i || Wh_j]), where W is a shared learnable weight matrix, a is a learnable attention vector, and || denotes concatenation. The raw score e_ij is then normalized with Softmax over all neighbors: α_ij = exp(e_ij) / Σ_k exp(e_ik). This produces a valid probability distribution over the neighborhood, and the node's new embedding is the α-weighted sum of its neighbors' transformed features.
Consider why this matters. In a citation network, a paper about deep learning has many neighboring papers. Some are tightly related (transformer architectures), and some are only loosely related (early symbolic AI). A GCN treats all these citations equally, diluting the signal. A GAT learns to assign high α to the relevant neighbors and near-zero α to irrelevant ones, dramatically improving classification precision on noisy, real-world graphs.
// GAT Attention Coefficient
function computeAttention(h_i, h_j, W, a) {
const Whi = matMul(W, h_i);
const Whj = matMul(W, h_j);
// Concatenate projected features
const concat = [...Whi, ...Whj];
// Raw attention score
const e_ij = leakyRelu(dot(a, concat));
return e_ij;
}
// Normalize over neighborhood
// α_ij = exp(e_ij) / sum_k(exp(e_ik))
const alpha = softmax(neighbors.map(
j => computeAttention(h_i, h[j], W, a)
));2Multi-Head Attention for Stability
Attention mechanisms can be unstable. A single attention head may collapse — assigning nearly all weight to one neighbor and effectively ignoring the rest. This is especially problematic in early training when the attention vector 'a' is randomly initialized. Multi-Head Attention solves this with ensemble diversity. K independent attention processes run in parallel. Each head learns a different projection matrix W^(k) and attention vector a^(k), so it specializes in a different aspect of the neighborhood: one head might capture topical similarity, another captures structural proximity, a third focuses on node degree.
For hidden layers, the K heads' outputs are concatenated: h_i' = ||_{k=1}^{K} σ(Σ_j α_ij^(k) W^(k) h_j). This produces a richer embedding of K×F' dimensions. For the final output layer, they are averaged to keep dimensionality manageable. In the original GAT paper, K=8 heads with F'=8 features each were used on Cora, achieving 83.0% accuracy — outperforming GCN's 81.5%.
// Multi-Head GAT (K=8 heads)
class MultiHeadGAT {
constructor(K, in_dim, out_dim) {
this.heads = Array.from({length: K},
() => new GATHead(in_dim, out_dim)
);
}
forward(h, edgeList) {
const outputs = this.heads.map(
head => head.forward(h, edgeList)
);
// Hidden layers: CONCATENATE
return concatenate(outputs);
// → K * out_dim features per node
}
}
// K=8, out_dim=8 → 64-dim embedding