Attention Is All You Need
TL;DR
The paper that introduced the Transformer architecture. Proposes that sequence-to-sequence tasks can be solved using attention mechanisms alone, without recurrence or convolutions. This is the foundation of every modern LLM (GPT, Claude, GLM, DeepSeek, Llama, etc.).
Key Facts
- Published: June 2017 (Google Brain / Google Research / University of Toronto)
- Architecture: Transformer (encoder-decoder)
- Model size: d_model=512, N=6 layers, h=8 heads, d_ff=2048 (base)
- Results: 28.4 BLEU on EN-DE (WMT 2014), 41.8 BLEU on EN-FR
- Training: 8x NVIDIA P100 GPUs, base model ~12 hours, big model ~3.5 days
Core Concept: Attention
Intuition: For each token in a sequence, determine how much it should "look at" every other token to understand context. In "The cat sat on the mat because it was tired" — "it" needs to attend strongly to "cat".
Every token is projected into three vectors:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What information do I provide?"
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Step by step:
- QK^T — dot product of every query with every key = similarity scores
- / sqrt(d_k) — scaling factor. Without it, for large d_k, dot products grow large (variance = d_k when Q and K components are independent with mean 0 and variance 1), pushing softmax into regions with extremely small gradients. Dividing by sqrt(d_k) normalizes the variance back to 1.
- softmax — converts scores to probability weights (sum to 1)
- multiply by V — weighted combination of values = context-aware output
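The four steps above can be sketched in NumPy. This is a toy, unbatched version with random matrices standing in for learned projections; the `mask` argument is an assumption added here to foreshadow the decoder's masking, not part of the formula itself:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v). Toy shapes, no batching.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (L_q, L_k) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions -> ~0 weight
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights

# Tiny example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is one token's probability distribution over all tokens, and `out` is the corresponding weighted mix of the value vectors.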
Complexity: O(L²) — every token attends to every other token. This is the bottleneck that later work like DeepSeek Sparse Attention, FlashAttention, etc. address.
Multi-Head Attention
Instead of one attention function, run h parallel attention operations with different learned projections:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h) * W^O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
- Projection matrices: W_i^Q, W_i^K in R^(d_model x d_k), W_i^V in R^(d_model x d_v), W^O in R^(h*d_v x d_model)
- Paper uses h=8, d_k = d_v = d_model/h = 64
- Why multiple heads? "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this." Each head can learn different relationship patterns — syntax, coreference, position, etc.
- Total compute stays similar to single-head attention because each head operates on a smaller dimension (d_k = d_model/h)
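A minimal sketch of the multi-head computation, assuming the common packing trick where all heads' projections live side by side in single (d_model x d_model) matrices (the names `Wq`, `Wk`, `Wv`, `Wo` and the random initialization are illustrative, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Concat(head_1, ..., head_h) @ Wo, each head on d_model/h dims."""
    L, d_model = x.shape
    d_k = d_model // h

    def split(t):  # (L, d_model) -> (h, L, d_k): one slice per head
        return t.reshape(L, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, L, L)
    heads = softmax(scores) @ V                        # (h, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, h, L = 512, 8, 5
x = rng.normal(size=(L, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * d_model**-0.5 for _ in range(4)]
y = multi_head_attention(x, *Ws, h=h)
```

Note how the per-head score matrices are (h, L, d_k) x (h, d_k, L): eight attention maps computed in parallel, each over a 64-dimensional subspace, then concatenated back to 512 dimensions.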
Self-Attention
When Q, K, V all come from the same sequence (output of previous layer), it's self-attention — tokens within a sequence attend to each other.
Three uses in the Transformer:
- Encoder self-attention: each position attends to all positions in the encoder
- Decoder self-attention: masked so tokens can't attend to future positions (preserves autoregressive property)
- Encoder-decoder attention (cross-attention): decoder queries attend to encoder key/value outputs
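The decoder's masking can be shown with a lower-triangular boolean matrix. A small sketch (scores are set to zero here just to make the resulting weights easy to read; in practice they would be QK^T / sqrt(d_k)):

```python
import numpy as np

L = 4
# True where attention is allowed: token i may see positions j <= i only.
causal_mask = np.tril(np.ones((L, L), dtype=bool))

scores = np.zeros((L, L))                        # stand-in for raw scores
scores = np.where(causal_mask, scores, -np.inf)  # future positions -> -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row 0 attends only to itself; row 3 spreads weight over all 4 positions.
```

Setting masked scores to -inf before the softmax makes their weights exactly zero, which is what preserves the autoregressive property.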
The Transformer Architecture
Encoder (N=6 identical layers)
Each layer has two sub-layers:
- Multi-head self-attention
- Position-wise feed-forward network:
FFN(x) = max(0, xW1 + b1)W2 + b2, with d_ff=2048
Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)): a residual connection followed by layer normalization. All sub-layers and embeddings produce outputs of dimension d_model=512.
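The sub-layer wrapping and the FFN can be sketched together. This is a simplified version: the learned layer-norm gain/bias and dropout are omitted, and the weight shapes and initialization are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's features to mean 0, std 1 (gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, xW1 + b1)W2 + b2 — applied identically at each position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    """Post-norm wrapping used in the paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + fn(x))

rng = np.random.default_rng(0)
d_model, d_ff, L = 512, 2048, 3
x = rng.normal(size=(L, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * d_model**-0.5
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * d_ff**-0.5
b2 = np.zeros(d_model)
y = sublayer(x, lambda t: ffn(t, W1, b1, W2, b2))
```

The residual path means each sub-layer only has to learn a correction to the identity, which is part of what makes stacking N=6 layers trainable.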
Decoder (N=6 identical layers)
Each layer has three sub-layers:
- Masked multi-head self-attention (prevents attending to future positions)
- Multi-head encoder-decoder attention (queries from decoder, K/V from encoder)
- Position-wise feed-forward network
Same residual connections and layer normalization.
Positional Encoding
Since there's no recurrence, position must be explicitly encoded. Added to input embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Each dimension corresponds to a sinusoid with wavelengths from 2pi to 10000*2pi (geometric progression)
- For any fixed offset k, PE(pos+k) is a linear function of PE(pos) — enables learning relative positions
- Tested learned positional embeddings: "nearly identical results", but sinusoidal may extrapolate to longer sequences than seen in training
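A direct transcription of the two formulas, interleaving sin (even dims) and cos (odd dims). Pure stdlib; assumes an even d_model, as in the paper:

```python
import math

def positional_encoding(pos, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); cos for dim 2i+1."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

pe0 = positional_encoding(0, 512)   # position 0: alternating 0, 1, 0, 1, ...
pe5 = positional_encoding(5, 512)
```

Low dimensions oscillate fast (wavelength 2pi) and high dimensions slowly (wavelength 10000 * 2pi), so together the 512 values uniquely identify a position across a long range.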
Training Details
- Optimizer: Adam (beta1=0.9, beta2=0.98, eps=1e-9)
- LR schedule: lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)), with warmup_steps=4000. Linear warmup for the first 4000 steps, then inverse-sqrt decay.
- Dropout: P_drop=0.1 applied to sub-layer outputs (before the residual add + norm) and to the sums of embeddings and positional encodings
- Label smoothing: eps=0.1 — hurts perplexity (model becomes less certain) but improves accuracy and BLEU
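The LR schedule above is a one-liner; at step = warmup_steps the two terms inside the min coincide, which is where the peak sits:

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear ramp until step 4000, then 1/sqrt(step) decay.
peak = lrate(4000)
```

During warmup the second term wins, so the rate is exactly linear in the step count (half the peak at step 2000); afterwards the first term takes over.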
Key Ablation Findings
- Single-head vs 8-head: -0.9 BLEU. Multiple heads matter significantly.
- Reducing d_k hurts quality: "determining compatibility is not easy" — needs enough capacity
- Larger models > smaller models consistently
- Learned vs sinusoidal positional encodings: nearly identical performance
- Generalizes beyond translation: outperforms BerkeleyParser on constituency parsing even when trained on only 40K sentences
Why This Paper Changed Everything
- Replaced RNNs/LSTMs as the dominant sequence architecture
- Self-attention connects all positions in O(1) sequential steps vs O(n) for RNNs — massively parallelizable
- This parallelism is what made scaling to billions/trillions of parameters feasible
- Every modern LLM is a Transformer descendant
- The O(L²) cost is the bottleneck that spawned an entire research field: FlashAttention, sparse attention, linear attention, DSA, etc.
Connections
- DeepSeek-V3.2 Paper — introduces DSA to make attention sub-quadratic
- GLM-5 - From Vibe Coding to Agentic Engineering — uses DSA for 200K context