Lesson 4 of 5

LLMs: GPT, Claude, and Beyond

What Makes a Language Model "Large"

A large language model (LLM) is a transformer trained on massive text datasets to predict the next token. The "large" refers to parameter count — the number of learnable weights in the network.

| Model | Parameters | Training Data |
|-------|-----------|--------------|
| GPT-2 (2019) | 1.5B | 40GB text |
| GPT-3 (2020) | 175B | 570GB text |
| GPT-4 (2023) | ~1.8T (estimated) | ~13T tokens |
| Llama 3 (2024) | 8B-405B | 15T tokens |
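As a rough sanity check on the parameter counts above, the sketch below uses a common rule of thumb: a decoder-only transformer has about 12 × layers × d_model² weights outside its embeddings. The function name is hypothetical, and the formula is an approximation that ignores biases, layer norms, and embedding tables. The layer counts and widths are the published configurations of the largest GPT-2 and GPT-3 models.

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a decoder-only transformer.

    Per layer: about 4 * d_model^2 weights for attention (query, key, value,
    and output projections) plus about 8 * d_model^2 for a feed-forward block
    with hidden size 4 * d_model. Biases, layer norms, and embeddings are
    ignored, so this is an estimate rather than an exact count.
    """
    return 12 * n_layers * d_model ** 2


# Published configurations for the largest GPT-2 and GPT-3 models.
gpt2_xl = approx_transformer_params(n_layers=48, d_model=1600)
gpt3 = approx_transformer_params(n_layers=96, d_model=12288)

print(f"GPT-2 XL: ~{gpt2_xl / 1e9:.2f}B parameters")
print(f"GPT-3:    ~{gpt3 / 1e9:.1f}B parameters")
```

Both estimates land close to the 1.5B and 175B figures usually quoted, which shows that "parameter count" is driven mostly by depth and width.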

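Scaling laws (Kaplan et al., 2020) showed that a language model's test loss falls as a smooth power law in each of three factors: parameter count, dataset size, and training compute, as long as the other two are not the bottleneck. Because the trend is predictable, results from small training runs can be extrapolated to much larger ones. This insight drove the "bigger is better" approach that produced today's frontier models.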
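To make "improves predictably" concrete, here is a minimal sketch of a power-law curve of the form L(N) = (N_c / N)^alpha for parameter count N. The function name is hypothetical, and the default constants are illustrative assumptions of roughly the magnitude reported for parameter scaling, not authoritative fitted values.

```python
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Power-law loss L(N) = (N_c / N) ** alpha for a model with N parameters.

    n_c and alpha are illustrative defaults (assumptions); real values are
    fitted to a specific model family, dataset, and tokenizer.
    """
    return (n_c / n_params) ** alpha


for n in (1.5e9, 175e9):
    print(f"{n:.1e} params -> predicted loss {predicted_loss(n):.2f}")

# Under a power law, every 10x increase in N scales the loss by the same factor.
ratio = predicted_loss(1e10) / predicted_loss(1e9)
print(f"10x more parameters multiplies loss by {ratio:.3f}")
```

The key property is that each tenfold increase in parameters multiplies the predicted loss by the same factor, 10^(-alpha). This is what lets a curve fitted on small models be extended to larger ones.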

The Training Pipeline

LLM training happens in stages:

1. Pre-training: The model learns language by predicting the next token across trillions of tokens from the internet, books, and code. This is the expensive stage — GPT-4's pre-training reportedly cost over $100 million in compute.

2. Supervised Fine-Tuning (SFT): The pre-trained model is fine-tuned on curated prompt-response pairs (instructions matched with high-quality answers) so that it learns to follow instructions rather than just continue text.

3. RLHF (Reinforcement Learning from Human Feedback): Human raters compare multiple model outputs for the same prompt and rank them by quality. A reward model is trained to predict these preferences, and the LLM is then optimized with reinforcement learning (commonly PPO) to produce outputs the reward model rates highly.
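The three stages above differ mainly in what loss is minimized. The sketch below shows one plausible form of each, using only the standard library. All function names are hypothetical, the logits and loss values are made up, and masking prompt tokens during SFT is an assumed convention rather than something stated here. The reinforcement learning update itself is not shown, only the pairwise loss that trains the reward model.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_index):
    """Stage 1 (pre-training): cross-entropy of the true next token."""
    return -math.log(softmax(logits)[target_index])


def sft_loss(per_token_losses, is_response_token):
    """Stage 2 (SFT): average the same loss over response tokens only.

    Ignoring prompt tokens is an assumed convention for this sketch.
    """
    kept = [loss for loss, keep in zip(per_token_losses, is_response_token) if keep]
    return sum(kept) / len(kept) if kept else 0.0


def preference_loss(reward_chosen, reward_rejected):
    """Stage 3 (reward model): -log(sigmoid(r_chosen - r_rejected))."""
    diff = reward_chosen - reward_rejected
    # Stable softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Toy 4-token vocabulary with made-up logits; index 1 has the highest score.
logits = [0.5, 2.0, 0.1, -1.0]
likely = next_token_loss(logits, target_index=1)
unlikely = next_token_loss(logits, target_index=3)
print(f"loss when the target is the top-scored token: {likely:.3f}")
print(f"loss when the target is a low-scored token:   {unlikely:.3f}")

# Two prompt tokens (masked out) followed by two response tokens.
print(sft_loss([3.0, 2.0, 1.0, 0.5], [False, False, True, True]))

# The loss is small when the reward model agrees with the human ranking.
print(preference_loss(2.0, 0.0), preference_loss(0.0, 2.0))
```

Pre-training and SFT share the same next-token objective and differ in the data and in which tokens count. The reward model instead learns from comparisons, so its loss depends only on the difference between two scores.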

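The combination of SFT and RLHF is what turns a raw text predictor into a helpful assistant. Without these steps, a base model tends to continue a prompt as if it were the start of a document, for example by following a question with more questions, rather than responding to it as an instruction.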

The Major Model Families

OpenAI (GPT series): Pioneered the decoder-only LLM approach. GPT-4 and GPT-5 are multimodal (text + images). Known for strong coding and reasoning.

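Anthropic (Claude): Focuses on safety, honesty, and harmlessness through Constitutional AI, a method that trains the model against a written set of principles using AI-generated feedback (RLAIF). Claude 4 introduced sustained agentic capabilities, meaning it can work through long multi-step tasks. Known for nuanced analysis and long-context handling.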

Google (Gemini): Natively multimodal (trained on text, images, audio, video together). Tight integration with Google ecosystem.

Meta (Llama): Open-weight models that democratized LLM access. The 405B model (released as Llama 3.1) is competitive with proprietary models. The community can download, fine-tune, and deploy the weights under the terms of Meta's community license.

Emergent Capabilities

As models scale, they develop unexpected abilities not present in smaller versions:

  • In-context learning: Learning from examples in the prompt without weight updates
  • Chain-of-thought reasoning: Solving multi-step problems by writing out intermediate reasoning steps before the final answer
  • Code generation: Writing functional programs from natural language descriptions
  • Tool use: Learning to call APIs and external tools to accomplish tasks
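In-context learning, the first item in the list above, is easiest to see in how a few-shot prompt is assembled. The sketch below builds one for a sentiment task. The function name, the "Review:"/"Sentiment:" template, and the example reviews are all hypothetical, and no model is called. The point is that the "training examples" live entirely in the prompt text.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples plus an unlabeled query into a single prompt.

    The model is expected to infer the task from the examples and complete
    the final 'Sentiment:' line; no weights are updated.
    """
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)


# Made-up demonstration data.
examples = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Setup was quick and painless.")
print(prompt)
```

Swapping the examples changes the task the model performs, for instance from sentiment labeling to translation, without any retraining.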

These emergent abilities are why LLMs feel qualitatively different from earlier AI systems. They suggest that intelligence may arise from scale and learning rather than explicit programming.

The Open vs Closed Debate

A crucial tension in the LLM ecosystem: should the most powerful models be open (Meta, Mistral) or closed (OpenAI, Anthropic)? Open models enable innovation and transparency but also lower barriers for misuse. Closed models allow safety controls but concentrate power. This debate shapes AI policy worldwide.