Attention Is All You Need

Pascual Vila
AI & ML Engineer // Code Syllabus
In 2017, a team at Google Brain published a paper that fundamentally altered Artificial Intelligence. By dropping recurrence (RNNs) entirely in favor of "Self-Attention", they gave birth to the Transformer: the architecture powering ChatGPT, BERT, and modern AI.
The Death of Sequential Processing
Traditional recurrent models (RNNs, including LSTMs) read text like humans do: left to right, word by word. To understand the end of a sentence, they must "remember" the beginning through a hidden state bottleneck. This makes them slow to train on GPUs, which excel at parallel processing. Transformers solved this by processing all tokens in a sequence simultaneously.
Positional Encodings
If you process all words at once, how does the model know that "The dog bit the man" is different from "The man bit the dog"? Since transformers lack recurrence, they inject absolute or relative positional data directly into the word embeddings. In the original paper, this is done with sine and cosine waves whose frequencies vary across the embedding dimensions, giving every position a unique fingerprint.
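The sinusoidal scheme can be sketched in a few lines of NumPy. The wavelengths form a geometric progression (the 10000 constant comes from the original paper); even embedding dimensions get sines, odd dimensions get cosines:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    # Wavelengths grow geometrically from 2*pi up to 10000 * 2*pi.
    angle_rates = 1.0 / np.power(10000, dims / d_model)
    angles = positions * angle_rates              # (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)  # (50, 64)
```

In practice this matrix is simply added to the word embeddings before the first attention layer, so the same word carries a different vector at each position.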
The Heart: Self-Attention
Self-attention computes a "focus" matrix. For every word, it asks: "How relevant is every other word to me?" It does this by creating three vectors for each word:
- Queries (Q): What I am looking for.
- Keys (K): What I contain.
- Values (V): My actual content/meaning.
The attention scores are computed by taking the dot product of Q and K, scaling it by the square root of the key dimension, and applying a softmax; the resulting weights are then used to take a weighted sum of the V vectors. This mechanism is mathematically elegant and highly parallelizable.
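The whole mechanism fits in a few lines. Here is a minimal NumPy sketch of scaled dot-product attention for a single sequence (toy random vectors stand in for learned Q/K/V projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every token to every other
    weights = softmax(scores, axis=-1)  # each row sums to 1: a "focus" distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 tokens, 8-dimensional queries
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

The 4x4 `weights` matrix is exactly the "focus" matrix described above: row i tells you how much token i attends to each of the other tokens.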
FAQ
What is Multi-Head Attention?
Instead of computing attention once, transformers run multiple self-attention mechanisms in parallel (heads). For example, one head might learn to focus on grammar (subject-verb agreement), while another focuses on sentiment, and another on entities. Their outputs are then concatenated.
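A toy illustration of the head-splitting idea, building on the single-head function above (random matrices stand in for the learned per-head projections; real implementations use one fused projection for efficiency):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Toy multi-head self-attention with random (untrained) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # each head works in a smaller subspace
    outputs = []
    for _ in range(num_heads):
        # Each head gets its own Q/K/V projections, so it can learn to
        # attend to a different kind of relationship.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V)
    # Concatenate the heads back to the full model dimension.
    return np.concatenate(outputs, axis=-1)

X = np.random.default_rng(0).standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multi_head_attention(X, num_heads=4, rng=np.random.default_rng(1))
print(out.shape)  # (5, 16)
```

Note that the concatenated output has the same dimension as the input, which is what lets transformer layers be stacked.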
Encoder vs Decoder Transformers?
Encoder (BERT): Reads the entire sequence forward and backward (bidirectional). Excellent for classification and understanding context.
Decoder (GPT): Generates text autoregressively. It uses "masked" attention to prevent it from looking at future words while generating the current word.
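The "mask" in masked attention is just a matrix of -inf values above the diagonal, added to the scores before the softmax. A minimal sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)  # exp(-inf) = 0, so masked positions get zero weight
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # -inf above the diagonal: position i may attend to positions 0..i,
    # never to future positions.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# With uniform scores, masking makes each token spread its attention
# evenly over itself and the tokens before it.
scores = np.zeros((4, 4)) + causal_mask(4)
weights = softmax(scores)
print(np.round(weights, 2))
```

Row 0 puts all its weight on token 0, row 1 splits it 50/50 between tokens 0 and 1, and so on: the model literally cannot "see" the words it has not generated yet.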
Why are Transformers so data hungry?
Because transformers lack the inductive biases (built-in assumptions) of CNNs (spatial locality) or RNNs (sequentiality), they have to learn everything from scratch. This makes them extremely powerful but requires massive amounts of text to learn grammatical structure and world knowledge.