Attention Is All You Need

TL;DR

The paper that introduced the Transformer architecture. Proposes that sequence-to-sequence tasks can be solved using attention mechanisms alone, without recurrence or convolutions. This is the foundation of every modern LLM (GPT, Claude, GLM, DeepSeek, Llama, etc.).

Key Facts

  • Published: June 2017 (Google Brain / Google Research / University of Toronto)
  • Architecture: Transformer (encoder-decoder)
  • Model size: d_model=512, N=6 layers, h=8 heads, d_ff=2048 (base)
  • Results: 28.4 BLEU on EN-DE (WMT 2014), 41.8 BLEU on EN-FR (big model)
  • Training: 8x NVIDIA P100 GPUs, base model ~12 hours, big model ~3.5 days

Core Concept: Attention

Intuition: For each token in a sequence, determine how much it should "look at" every other token to understand context. In "The cat sat on the mat because it was tired" — "it" needs to attend strongly to "cat".

Every token is projected into three vectors:

  • Q (Query): "What am I looking for?"
  • K (Key): "What do I contain?"
  • V (Value): "What information do I provide?"

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Step by step:

  1. QK^T — dot product of every query with every key = similarity scores
  2. / sqrt(d_k) — scaling factor. Without it, for large d_k, dot products grow large (variance = d_k when Q and K components are independent with mean 0 and variance 1), pushing softmax into regions with extremely small gradients. Dividing by sqrt(d_k) normalizes the variance back to 1.
  3. softmax — converts scores to probability weights (sum to 1)
  4. multiply by V — weighted combination of values = context-aware output
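The four steps above can be sketched directly in NumPy. This is an illustrative implementation, not the paper's code; the function name and shapes are my own.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (L, d_k), V: (L, d_v). Returns (L, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 1-2: similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: row-wise softmax
    return weights @ V                              # step 4: weighted values

L, d_k, d_v = 4, 8, 8
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(L, d_k)),
                                   rng.normal(size=(L, d_k)),
                                   rng.normal(size=(L, d_v)))
print(out.shape)  # (4, 8)
```

Note the max-subtraction inside the softmax: it changes nothing mathematically but keeps the exponentials numerically stable.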

Complexity: O(L²) in sequence length — every token attends to every other token. This is the bottleneck that later work (FlashAttention, sparse attention such as DeepSeek Sparse Attention, etc.) addresses.

Multi-Head Attention

Instead of one attention function, run h parallel attention operations with different learned projections:

MultiHead(Q,K,V) = Concat(head_1, ..., head_h) * W^O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
  • Projection matrices: W_i^Q, W_i^K in R^(d_model x d_k), W_i^V in R^(d_model x d_v), W^O in R^(h*d_v x d_model)
  • Paper uses h=8, d_k = d_v = d_model/h = 64
  • Why multiple heads? "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this." Each head can learn different relationship patterns — syntax, coreference, position, etc.
  • Total compute stays similar to single-head because each head operates on smaller dimension
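A minimal sketch of the multi-head formula above, assuming per-head projection matrices stored as lists (the random initialization and names are illustrative, not the paper's code):

```python
import numpy as np

def attention(Q, K, V):
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head(x, Wq, Wk, Wv, Wo, h=8):
    # head_i = Attention(x W_i^Q, x W_i^K, x W_i^V), then Concat(...) W^O
    heads = [attention(x @ Wq[i], x @ Wk[i], x @ Wv[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo

d_model, h = 512, 8
d_k = d_model // h                       # 64, as in the paper
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))       # 10 tokens, self-attention: Q=K=V=x
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head(x, Wq, Wk, Wv, Wo)
print(out.shape)  # (10, 512)
```

Each head works in d_k=64 dimensions, so the total matmul cost is about the same as one full-width head — the point made in the last bullet.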

Self-Attention

When Q, K, V all come from the same sequence (output of previous layer), it's self-attention — tokens within a sequence attend to each other.

Three uses in the Transformer:

  1. Encoder self-attention: each position attends to all positions in the encoder
  2. Decoder self-attention: masked so tokens can't attend to future positions (preserves autoregressive property)
  3. Encoder-decoder attention (cross-attention): decoder queries attend to encoder key/value outputs
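The decoder mask in use 2 can be sketched as follows: future positions get a score of -inf before the softmax, so they receive zero weight (NumPy, illustrative):

```python
import numpy as np

L = 4
mask = np.triu(np.ones((L, L)), k=1).astype(bool)  # True strictly above the diagonal
scores = np.zeros((L, L))                          # uniform scores for illustration
scores[mask] = -np.inf                             # block attention to the future
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
```

Row i attends uniformly over positions 0..i and puts exactly zero weight on positions after i, which is what preserves the autoregressive property.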

The Transformer Architecture

Encoder (N=6 identical layers)

Each layer has two sub-layers:

  1. Multi-head self-attention
  2. Position-wise feed-forward network: FFN(x) = max(0, xW1+b1)W2+b2 with d_ff=2048

Each sub-layer is wrapped as: LayerNorm(x + Sublayer(x)) — residual connection + layer norm. All sub-layers and embeddings produce d_model=512 dimensions.
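The sub-layer wrapping can be sketched with the FFN as the sub-layer. A minimal NumPy version, assuming random weights and omitting LayerNorm's learned gain/bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)      # learned gain/bias omitted for brevity

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # FFN(x) = max(0, xW1+b1)W2+b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
out = layer_norm(x + ffn(x, W1, b1, W2, b2))      # LayerNorm(x + Sublayer(x))
print(out.shape)  # (10, 512)
```

The residual add happens before the norm ("post-norm"); many later Transformers moved the norm before the sub-layer ("pre-norm") for training stability.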

Decoder (N=6 identical layers)

Each layer has three sub-layers:

  1. Masked multi-head self-attention (prevents attending to future positions)
  2. Multi-head encoder-decoder attention (queries from decoder, K/V from encoder)
  3. Position-wise feed-forward network

Same residual connections and layer normalization.

Positional Encoding

Since there's no recurrence, position must be explicitly encoded. Added to input embeddings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • Each dimension corresponds to a sinusoid with wavelengths from 2pi to 10000*2pi (geometric progression)
  • For any fixed offset k, PE(pos+k) is a linear function of PE(pos) — enables learning relative positions
  • Tested learned positional embeddings: "nearly identical results", but sinusoidal may extrapolate to longer sequences than seen in training
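The two formulas above fill the even and odd embedding dimensions respectively. A direct NumPy sketch (illustrative, not the paper's code):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angle = pos / 10000 ** (i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)              # PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 512)
print(pe.shape, pe[0, 0], pe[0, 1])  # (50, 512) 0.0 1.0
```

Position 0 encodes as (sin 0, cos 0, ...) = (0, 1, 0, 1, ...); each later position rotates each sin/cos pair by a fixed angle, which is why PE(pos+k) is a linear function of PE(pos).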

Training Details

  • Optimizer: Adam (beta1=0.9, beta2=0.98, eps=1e-9)
  • LR schedule: lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)) with warmup_steps=4000. Linear warmup for first 4000 steps, then inverse sqrt decay.
  • Dropout: P_drop=0.1 applied to sub-layer outputs (before residual add + norm) and to embedding + positional encoding sums
  • Label smoothing: eps=0.1 — hurts perplexity (model becomes less certain) but improves accuracy and BLEU
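The LR schedule above is a one-liner (pure Python, illustrative):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """Linear warmup to step=warmup_steps, then inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly, peaks at step 4000, then decays as 1/sqrt(step):
print(lrate(1), lrate(4000), lrate(100000))
```

Note the peak LR is d_model^(-0.5) * warmup_steps^(-0.5): larger models and longer warmups both lower it automatically.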

Key Ablation Findings

  • Single-head vs 8-head: -0.9 BLEU. Multiple heads matter significantly.
  • Reducing d_k hurts quality: "determining compatibility is not easy" — needs enough capacity
  • Larger models > smaller models consistently
  • Learned vs sinusoidal positional encodings: nearly identical performance
  • Generalizes beyond translation: outperforms BerkeleyParser on constituency parsing even when trained on only 40K sentences

Why This Paper Changed Everything

  • Replaced RNNs/LSTMs as the dominant sequence architecture
  • Self-attention connects all positions in O(1) sequential steps vs O(n) for RNNs — massively parallelizable
  • This parallelism is what made scaling to billions/trillions of parameters feasible
  • Every modern LLM is a Transformer descendant
  • The O(L²) cost is the bottleneck that spawned an entire research field: FlashAttention, sparse attention, linear attention, DSA, etc.

Connections

My Notes