Lesson 3 of 5

Transformers and the Attention Mechanism


The Problem with Sequences

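Before transformers, recurrent neural networks (RNNs) processed text one token at a time. This created two critical problems. First, computation was sequential: each step had to wait for the previous one, so training was slow and could not be parallelized across the sequence. Second, long-range dependencies were hard to learn, because information from early tokens fades as the sequence grows. Attention addresses both.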
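
To make the sequential bottleneck concrete, here is a minimal sketch of a plain tanh RNN in NumPy. The function name, weight names, and sizes are illustrative assumptions, not taken from any particular library. The point is the loop: step t needs the hidden state from step t-1, so the steps cannot run at the same time.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, h0):
    """Run a plain tanh RNN over x of shape (seq_len, d_in); return every hidden state."""
    h = h0
    states = []
    for x_t in x:  # step t cannot start until step t-1 has produced h
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_in, d_h = 6, 4, 3
x = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_h)) * 0.5
W_hh = rng.normal(size=(d_h, d_h)) * 0.5

states = rnn_forward(x, W_xh, W_hh, np.zeros(d_h))
print(states.shape)  # (6, 3): one hidden state per token
```

Everything the network knows about the first token has to pass through every later update of h, which is why early information tends to fade in long sequences.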

Attention: The Core Innovation

The attention mechanism answers a simple question: when processing a word, which other words in the sentence should I focus on?

For example, in "The cat sat on the mat because it was tired" — what does "it" refer to? Attention lets the model learn that "it" should attend strongly to "cat" rather than "mat."

Mathematically, attention computes three vectors for each token, each obtained from the token's embedding through a learned linear projection:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

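The attention score between two tokens is the dot product of one token's Query with the other's Key, divided by the square root of the key dimension. A softmax over a token's scores against all tokens turns them into weights that sum to 1, and these weights determine how much of each token's Value gets mixed into the output.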
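
In matrix form this is usually written as Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. The following is a minimal NumPy sketch of that formula for a single sequence. The random embeddings X and the projection matrices W_q, W_k, W_v are stand-ins for learned values, and the sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_tokens, n_tokens) Query-Key similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted mix of Values, plus the weights

# Toy data: 5 tokens with 8-dimensional embeddings, projected down to d_k = 4.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.shape)  # (5, 4) (5, 5)
```

Row i of weights shows how strongly token i attends to each token in the sequence. In a trained model, the row for "it" would ideally put most of its weight on "cat".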

Self-Attention and Multi-Head Attention

Self-attention means every token attends to every token in the same sequence, itself included. Because any pair of tokens is compared directly, transformers can capture relationships regardless of distance: "it" can attend to "cat" whether they're 3 or 300 tokens apart, as long as both fit in the model's context window.

Multi-head attention runs several attention computations in parallel, each with different learned weights. Different heads learn to focus on different types of relationships — one head might track syntactic dependencies, another might track coreference, another might track semantic similarity.
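
As a rough sketch of the idea, the NumPy code below splits the projected Queries, Keys, and Values into heads, runs scaled dot-product attention in each head independently, then concatenates the results and applies an output projection. The names (multi_head_attention, W_o) and the sizes are illustrative assumptions, and the weights are random rather than learned.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Self-attention over X of shape (n_tokens, d_model) using n_heads heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own feature slice
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                   # one attention map per head
    heads = weights @ V                                  # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each head has its own attention map (weights[0], weights[1], ...), which is what lets different heads specialize in different relationships.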

The Transformer Architecture

The original 2017 "Attention Is All You Need" paper described an encoder-decoder architecture:

  • Encoder: Processes the input sequence, building rich contextual representations
  • Decoder: Generates the output sequence, attending to both its own previous outputs and the encoder's representations

Modern LLMs like GPT use only the decoder part (autoregressive generation). Models like BERT use only the encoder (bidirectional understanding). T5 and the original translation models use both.
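
What makes a decoder autoregressive is a causal mask: each position may attend only to itself and earlier positions, never to tokens that come later. The sketch below shows one common way to implement this, by setting the masked scores to negative infinity before the softmax. The function name and toy sizes are assumptions made for illustration.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights where token i can only attend to tokens 0..i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                  # exp(-inf) == 0, so masked weights vanish
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 3
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))

w = causal_attention_weights(Q, K)
print(np.round(w, 2))  # lower-triangular: zeros above the diagonal
```

An encoder-only model such as BERT omits this mask, which is what makes its attention bidirectional.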

Why Transformers Won

  1. Parallelization: Unlike RNNs, all tokens are processed simultaneously during training
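  2. Long-range attention: Any two tokens in the context can interact directly within a single layer, so information does not have to survive a long chain of recurrent steps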
  3. Scalability: Performance improves predictably with more data and parameters
  4. Transfer learning: Pre-trained transformers adapt to new tasks with minimal fine-tuning

The transformer architecture hasn't fundamentally changed since 2017 — what changed is scale. GPT-3 has 175 billion parameters. The architecture works so well that the main research question became "how big can we make it?"
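
A quick back-of-the-envelope calculation shows where a number like 175 billion comes from. The sketch below uses a common approximation of about 12 × d_model² weights per layer (4 × d_model² for the attention projections, 8 × d_model² for a feed-forward block with a 4x hidden expansion) and ignores biases and layer norms. The configuration values (96 layers, d_model of 12288, a vocabulary of 50257 tokens, a context of 2048) are the published GPT-3 settings as best I know them, not figures from this lesson, so treat them as assumptions.

```python
def approx_decoder_params(n_layers, d_model, vocab_size=0, context_len=0):
    """Rough parameter count for a decoder-only transformer (biases and norms ignored)."""
    attn = 4 * d_model * d_model  # W_q, W_k, W_v, and the output projection
    mlp = 8 * d_model * d_model   # two matrices with a 4x hidden expansion
    embeddings = (vocab_size + context_len) * d_model  # token + position embeddings
    return n_layers * (attn + mlp) + embeddings

# Assumed GPT-3 configuration (not stated in the lesson).
total = approx_decoder_params(n_layers=96, d_model=12288,
                              vocab_size=50257, context_len=2048)
print(f"{total / 1e9:.1f}B parameters")
```

The estimate lands close to the 175 billion figure, and it makes clear that almost all of the parameters sit in the repeated attention and feed-forward layers, so width and depth are what drive model size.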

To see a transformer in action, explore GPT-Visual — an interactive 3D visualization that lets you trace how tokens flow through attention heads, projections, and feed-forward layers.