Graph Attention Networks: Computing Importance
Not all edges are created equal. By leveraging masked self-attention over graphs, GATs allow nodes to attend to their neighborhoods dynamically, overcoming the structural rigidities of standard Graph Convolutional Networks.
The Core Concept
In standard Graph Convolutional Networks (GCNs), a node updates its representation by aggregating features from its neighbors. However, the aggregation weights are strictly determined by the graph's structure (like node degrees).
Graph Attention Networks (GATs) introduce a learnable attention mechanism. They compute an attention coefficient $\alpha_{ij}$ that represents the importance of node $j$'s features to node $i$. This allows the model to selectively focus on the most relevant neighbors regardless of structural density.
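The contrast can be written in one line. In the standard formulations (with $\hat{d}$ denoting node degree after adding self-loops), a GCN layer uses fixed normalization constants exactly where a GAT layer uses the learned coefficients $\alpha_{ij}$:

$$\text{GCN: } h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i \cup \{i\}} \tfrac{1}{\sqrt{\hat{d}_i \hat{d}_j}}\, \mathbf{W} h_j\Big) \qquad \text{GAT: } h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, \mathbf{W} h_j\Big)$$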
Mathematical Blueprint
The attention score between nodes $i$ and $j$ is computed by first applying a shared linear transformation parameterized by a weight matrix $\mathbf{W}$. The transformed features are then concatenated and multiplied by a learnable weight vector $\mathbf{a}$, followed by a LeakyReLU activation:

$$e_{ij} = \mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W} h_i \,\Vert\, \mathbf{W} h_j\right]\right)$$
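As a concrete illustration, here is a minimal raw-tensor sketch of that score computation in PyTorch. The feature sizes are arbitrary placeholders; the slope of 0.2 matches the value used in the original GAT paper:

```python
import torch
import torch.nn.functional as F

F_in, F_out = 16, 8
W = torch.randn(F_out, F_in)   # shared linear transformation
a = torch.randn(2 * F_out)     # learnable attention weight vector

h_i, h_j = torch.randn(F_in), torch.randn(F_in)  # raw node features

# Transform both nodes, concatenate, project to a scalar, apply LeakyReLU
z_i, z_j = W @ h_i, W @ h_j
e_ij = F.leaky_relu(a @ torch.cat([z_i, z_j]), negative_slope=0.2)
```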
To make coefficients easily comparable across different nodes, they are normalized using a softmax function over all neighbors $j \in \mathcal{N}_i$.
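Written out, this is the standard softmax normalization from the original paper:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$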
Multi-Head Attention
To stabilize the learning process of self-attention, GAT utilizes Multi-Head Attention. Instead of learning a single attention function, the network learns $K$ independent attention mechanisms.
- Hidden Layers: The outputs of the $K$ heads are concatenated. If each head outputs a vector of size $F'$, the new feature size is $K \times F'$.
- Final Layer: For the network's prediction layer (classification or regression), concatenation is replaced by averaging to yield the final output vector, as written out below.
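In the notation used above, these are the two aggregation modes as given in the original paper, concatenation (left) versus averaging (right):

$$h_i' = \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, \mathbf{W}^{k} h_j\Big) \qquad \text{vs.} \qquad h_i' = \sigma\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, \mathbf{W}^{k} h_j\Big)$$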
❓ FAQ
What is the main difference between GCN and GAT?
While Graph Convolutional Networks (GCNs) use fixed, structurally derived weights (like degree normalizations) to aggregate neighbor features, Graph Attention Networks (GATs) use a trainable self-attention mechanism. This allows GATs to assign dynamic importance to different neighbors based on their node features, making them more expressive for complex datasets.
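For readers who think in code, the contrast shows up directly in how the two PyTorch Geometric layers are constructed; the channel sizes below are arbitrary placeholders:

```python
from torch_geometric.nn import GCNConv, GATConv

# GCN: neighbor weights are fixed by the normalized adjacency structure
gcn = GCNConv(in_channels=16, out_channels=8)

# GAT: neighbor weights are attention coefficients learned from node features
gat = GATConv(in_channels=16, out_channels=8, heads=1)
```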
Why is LeakyReLU used in the GAT attention mechanism?
LeakyReLU is applied before the softmax normalization so that negative attention logits keep a small, non-zero gradient. If a standard ReLU were used, every negative score would be flattened to exactly zero, causing the gradients to die during backpropagation and halting learning for those specific neighbor pairs.
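A short PyTorch check makes this concrete; the slope of 0.2 matches the value used in the GAT paper, and the input value is arbitrary:

```python
import torch
import torch.nn.functional as F

e = torch.tensor(-1.5, requires_grad=True)  # a negative attention logit

F.leaky_relu(e, negative_slope=0.2).backward()
print(e.grad)  # tensor(0.2000) -- gradient survives

e.grad = None
F.relu(e).backward()
print(e.grad)  # tensor(0.) -- gradient is dead
```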
How to implement Multi-Head Attention in PyTorch Geometric?
In PyTorch Geometric, implementing multi-head attention is as simple as setting the `heads` argument of the `GATConv` layer. By default, PyG concatenates the head outputs, so the hidden dimension grows by a factor of `heads`. For the final layer, set `concat=False` to average the heads instead.
from torch_geometric.nn import GATConv

# Hidden layer: 8 heads, outputs concatenated (the default, concat=True)
conv1 = GATConv(in_channels, hidden, heads=8)
# Output layer: heads are averaged rather than concatenated
conv2 = GATConv(hidden * 8, out_classes, heads=1, concat=False)
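Putting the two layers together, a minimal end-to-end sketch might look like the following. The ELU activation and the 0.6 dropout on attention coefficients mirror the original Cora setup, but treat all sizes and rates here as assumptions to tune for your own data:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_classes, heads=8):
        super().__init__()
        # Hidden layer: multi-head outputs concatenated -> hidden * heads channels
        self.conv1 = GATConv(in_channels, hidden, heads=heads, dropout=0.6)
        # Output layer: single head, concat=False keeps out_classes channels
        self.conv2 = GATConv(hidden * heads, out_classes, heads=1,
                             concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))  # ELU as in the original paper
        return self.conv2(x, edge_index)      # raw logits; pair with cross-entropy
```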