Dataset Engineering
Data Quality Beats Data Quantity
If there is one lesson that the fine-tuning community has learned the hard way, it is this: a small dataset of high-quality examples will usually outperform a large dataset of mediocre ones. The LIMA paper (Less Is More for Alignment) showed that fine-tuning on just 1,000 carefully curated examples could produce a model competitive with models trained on 50,000+ examples. Your dataset is not just an input; it is the single most important factor determining whether your fine-tuned model succeeds or fails.
This lesson covers the entire data pipeline: collecting raw examples, cleaning them into training-ready samples, choosing the right format, and understanding how many examples you actually need.
Collecting Training Data
Manual Curation
The gold standard. Domain experts create input-output pairs that represent exactly what the model should produce. This is labor-intensive but produces the highest-quality data.
Process:
- Define your task clearly with 3-5 examples of ideal behavior
- Create a rubric that explains what makes an example "good"
- Have domain experts write 50-100 examples following the rubric
- Review every example for consistency and quality
- Iterate on the rubric based on patterns you find
Tip: Start by writing 20 examples yourself. This forces you to confront edge cases and ambiguities in your task definition before scaling to more contributors.
Existing Logs and Production Data
If your application is already running (even if it relies only on prompt engineering), your production logs are a dataset waiting to happen. User queries paired with model outputs that users rated positively or that humans verified become training examples.
# Example: extracting training data from production logs
# Assumes production_logs is a list of dicts already loaded from your log store
training_examples = []
for log in production_logs:
    # Keep only responses that users rated highly and a human verified
    if log["user_rating"] >= 4 and log["human_verified"]:
        training_examples.append({
            "instruction": log["user_query"],
            "output": log["model_response"],
        })

# Next, filter for diversity: near-duplicate examples
# would cause overfitting
Warning: Production data often has a bias toward common queries. Make sure to include examples from the tail of the distribution — the unusual requests that trip up your current system.
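One simple way to keep common queries from dominating is to cap how many examples any single query category may contribute. The sketch below assumes each record carries a `category` field, which is hypothetical and not one of the log fields used above; it could come from a topic classifier or a clustering step. The function name `cap_per_category` is likewise illustrative.

```python
import random
from collections import Counter, defaultdict

def cap_per_category(examples, max_per_category, seed=0):
    """Keep at most max_per_category examples from each category.

    Assumes every example has a "category" key (a hypothetical field,
    e.g. produced by a topic classifier or a clustering step).
    """
    rng = random.Random(seed)  # Fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex["category"]].append(ex)

    balanced = []
    for items in by_category.values():
        if len(items) > max_per_category:
            items = rng.sample(items, max_per_category)
        balanced.extend(items)
    return balanced

# Toy data: one very common category and one rare category
logs = (
    [{"category": "password_reset", "instruction": f"q{i}"} for i in range(50)]
    + [{"category": "data_export", "instruction": f"rare{i}"} for i in range(3)]
)
balanced = cap_per_category(logs, max_per_category=10)
counts = Counter(ex["category"] for ex in balanced)
print(dict(counts))
```

Capping the head rather than duplicating the tail avoids introducing repeated examples, at the cost of discarding some usable data.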
Domain Expert Interviews
For specialized domains, structured interviews with experts can generate high-quality examples. Ask the expert to walk through real scenarios: "What would you say to a patient asking about X?" or "How would you classify this financial transaction?"
Record the conversation, extract the input-output pairs, and have the expert validate the final formatted examples.
Synthetic Data from Stronger Models
Using a more capable model (GPT-4o, Claude) to generate training data for a smaller model is a common and effective strategy. The key is generating diverse, high-quality examples that you validate before training.
import openai

client = openai.OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

# Generate diverse training examples
seed_topics = ["contract review", "NDA analysis", "liability clause"]
generated = []
for topic in seed_topics:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a legal document analyst. Generate a realistic "
                       "user query and ideal response for the following topic."
        }, {
            "role": "user",
            "content": f"Topic: {topic}. Generate a specific, realistic example."
        }],
        temperature=0.9,  # Higher temperature for diversity
    )
    # Collect the raw text so it can be reviewed before training
    generated.append({"topic": topic, "text": response.choices[0].message.content})

# Always validate generated examples manually
Critical rule: Always validate synthetic data. Even a stronger model can produce plausible-looking outputs that contain subtle errors. Budget time for human review of every synthetic example.
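A cheap automated pre-screen can remove obviously unusable generations before a human spends time on them. The sketch below assumes the generator was asked to return a JSON object with `query` and `response` keys; that schema is an assumption for illustration, and the generation prompt above does not request it. It does not replace human review, it only shrinks the review queue.

```python
import json

REQUIRED_KEYS = ("query", "response")  # Hypothetical schema for generated pairs

def parse_generated(raw_text):
    """Return a dict with the required keys, or None if the text is unusable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return {key: data[key].strip() for key in REQUIRED_KEYS}

raw_outputs = [
    '{"query": "Is this NDA mutual?", "response": "Yes, both parties are bound."}',
    "Sure! Here is an example: ...",               # Not JSON
    '{"query": "", "response": "Missing query"}',  # Empty field
]
parsed = (parse_generated(raw) for raw in raw_outputs)
review_queue = [pair for pair in parsed if pair is not None]
print(f"{len(review_queue)} of {len(raw_outputs)} outputs go to human review")
```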
Cleaning Your Data
Raw data is never training-ready. Here is the cleaning pipeline you should apply:
Deduplication
Near-duplicate examples cause the model to memorize rather than generalize. Use both exact match and semantic similarity deduplication.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

# Compute embeddings for all examples
# (examples is the list of {"instruction", "output"} dicts collected earlier)
texts = [ex["instruction"] + " " + ex["output"] for ex in examples]
embeddings = model.encode(texts)

# Find near-duplicates (cosine similarity > 0.95)
# Note: the full similarity matrix is O(n^2) in memory
sim_matrix = cosine_similarity(embeddings)
duplicates = set()
for i in range(len(sim_matrix)):
    if i in duplicates:
        continue  # Already removed, so do not use it as a reference
    for j in range(i + 1, len(sim_matrix)):
        if sim_matrix[i, j] > 0.95:
            duplicates.add(j)  # Keep the first, mark later as duplicate

cleaned = [ex for i, ex in enumerate(examples) if i not in duplicates]
print(f"Removed {len(duplicates)} near-duplicates from {len(examples)} examples")
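The exact-match half of deduplication can be handled with a hash of the normalized text. The following is a minimal sketch using only the standard library; the normalization (lowercasing and collapsing whitespace) is an assumption about what counts as "the same" example, and the helper names are illustrative. Running it before the embedding-based pass means the quadratic similarity comparison sees fewer examples.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def exact_dedup(examples):
    """Drop examples whose normalized instruction + output was already seen."""
    seen = set()
    unique = []
    for ex in examples:
        key_text = normalize(ex["instruction"]) + "\n" + normalize(ex["output"])
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)  # The first occurrence is kept
    return unique

sample = [
    {"instruction": "Summarize the clause", "output": "It limits liability."},
    {"instruction": "summarize  the clause ", "output": "It limits liability."},
    {"instruction": "Summarize the clause", "output": "It caps damages."},
]
deduped = exact_dedup(sample)
print(f"Kept {len(deduped)} of {len(sample)} examples")
```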
Filtering Low-Quality Samples
Remove examples that are too short, too long, contain formatting errors, or are inconsistent with your task definition.
def quality_filter(example):
    output = example["output"]
    word_count = len(output.split())

    # Too short to be useful (tune both thresholds to your task)
    if word_count < 20:
        return False

    # Too long (might be noisy or off-topic)
    if word_count > 2000:
        return False

    # Contains obvious formatting errors
    if output.count("```") % 2 != 0:  # Unclosed code blocks
        return False

    # Starts with refusal patterns we don't want
    refusal_patterns = ("I cannot", "I'm sorry", "As an AI")
    if output.lstrip().startswith(refusal_patterns):
        return False

    return True

cleaned = [ex for ex in examples if quality_filter(ex)]
Consistency Checks
All examples should follow the same conventions. If some examples use markdown headers and others do not, or some end with periods and others do not, the model receives conflicting signals about the expected style.
Write a style guide for your dataset and enforce it programmatically where possible, manually where needed.
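As an illustration of programmatic enforcement, the sketch below checks each output against a few rules and reports the violations. The rules themselves (no markdown headers, terminal punctuation, no stray whitespace) are placeholders chosen for the example, not a recommended style guide; substitute the conventions from your own guide.

```python
import re

def style_violations(output):
    """Return a list of style-guide violations for one output string.

    The rules here are illustrative; replace them with your own guide.
    """
    problems = []
    if re.search(r"^#{1,6} ", output, flags=re.MULTILINE):
        problems.append("uses markdown headers")
    if not output.rstrip().endswith((".", "!", "?")):
        problems.append("does not end with terminal punctuation")
    if output != output.strip():
        problems.append("has leading or trailing whitespace")
    return problems

outputs = [
    "This clause limits liability to direct damages.",
    "## Summary\nThe clause limits liability",
]
for text in outputs:
    print(style_violations(text))
```

Returning a list of problems rather than a single boolean makes it easier to decide which examples can be fixed automatically and which need a manual pass.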
Data Formats
Different training objectives require different data formats. Here are the three main ones:
Instruction Format (Supervised Fine-Tuning)
The simplest format. Each example pairs an instruction, optionally accompanied by a separate input that supplies the text to work on, with a completion (output). Used for teaching the model to follow specific instructions.
{
  "instruction": "Summarize this legal clause in plain language",
  "input": "The indemnifying party shall hold harmless and indemnify...",
  "output": "This clause means that one party agrees to cover any losses..."
}
Chat Format (Multi-Turn Conversations)
For conversational models, each example contains a full conversation with system, user, and assistant messages.
{
  "conversations": [
    {"role": "system", "content": "You are a medical triage assistant..."},
    {"role": "user", "content": "I have a headache and mild fever..."},
    {"role": "assistant", "content": "Based on your symptoms..."},
    {"role": "user", "content": "Should I go to the ER?"},
    {"role": "assistant", "content": "Given the mild nature of your symptoms..."}
  ]
}
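Data collected in the instruction format can be converted into a single-turn record in this chat format. The sketch below is one possible mapping, not a framework requirement: joining the instruction and input with a blank line is an assumption, and `instruction_to_chat` is an illustrative helper name.

```python
def instruction_to_chat(example, system_prompt=None):
    """Convert an instruction-format record into a single-turn chat record."""
    user_content = example["instruction"]
    if example.get("input"):
        # Assumption: instruction and input are joined with a blank line
        user_content += "\n\n" + example["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"conversations": messages}

record = {
    "instruction": "Summarize this legal clause in plain language",
    "input": "The indemnifying party shall hold harmless and indemnify...",
    "output": "This clause means that one party agrees to cover any losses...",
}
chat = instruction_to_chat(record, system_prompt="You are a legal document analyst.")
for message in chat["conversations"]:
    print(message["role"])
```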
Preference Format (DPO/RLHF)
For alignment training, each example includes a prompt, a chosen (preferred) response, and a rejected (worse) response.
{
  "prompt": "Explain quantum computing to a 10-year-old",
  "chosen": "Imagine you have a magic coin that can be both heads AND tails...",
  "rejected": "Quantum computing leverages quantum mechanical phenomena such as superposition and entanglement to perform computations on qubits..."
}
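Preference records are easy to get subtly wrong, for example by leaving a field empty or pairing a response with itself. The following is a minimal validation sketch for the three fields shown above; the specific checks and the `validate_preference` name are illustrative assumptions.

```python
def validate_preference(example):
    """Return a list of problems found in one preference-format record."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        value = example.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # Only compare the two responses when all fields are present
    if not problems and example["chosen"].strip() == example["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

good = {
    "prompt": "Explain quantum computing to a 10-year-old",
    "chosen": "Imagine you have a magic coin...",
    "rejected": "Quantum computing leverages quantum mechanical phenomena...",
}
bad = {"prompt": "Explain recursion", "chosen": "Same text", "rejected": "Same text"}
print(validate_preference(good))
print(validate_preference(bad))
```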
How Much Data Do You Need?
There is no universal answer, but here are practical guidelines based on real-world experience:
| Task Type | Minimum Examples | Recommended | Notes |
|-----------|------------------|-------------|-------|
| Style/tone adaptation | 100-200 | 500-1,000 | Consistent style is key |
| Output format | 200-500 | 1,000-2,000 | Cover all edge cases |
| Domain knowledge | 500-1,000 | 2,000-5,000 | Diverse examples matter |
| Complex reasoning | 1,000-2,000 | 5,000-10,000 | Chain-of-thought helps |
| Multi-task | 500+ per task | 1,000+ per task | Balance across tasks |
Important: These numbers assume high-quality, diverse examples. 200 carefully curated examples will usually outperform 2,000 noisy ones. When in doubt, invest in quality over quantity.
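For multi-task datasets, a quick count per task shows whether any task falls below the chosen minimum. This sketch assumes each example carries a `task` label, which is a hypothetical field, and uses the 500-per-task minimum from the table as its default threshold.

```python
from collections import Counter

def underrepresented_tasks(examples, min_per_task=500):
    """Return {task: count} for every task with fewer than min_per_task examples.

    Assumes each example has a "task" key (a hypothetical label).
    """
    counts = Counter(ex["task"] for ex in examples)
    return {task: n for task, n in counts.items() if n < min_per_task}

# Toy dataset: one well-covered task and one thin task
dataset = (
    [{"task": "summarization"} for _ in range(600)]
    + [{"task": "classification"} for _ in range(120)]
)
print(underrepresented_tasks(dataset))
```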
Practical Tip
Before starting any data collection effort, create 10 "golden examples" — perfect input-output pairs that represent exactly what you want the model to do. Use these as your north star throughout the process. Every new example should be as good as your golden set. If it is not, fix it or discard it.
In the next lesson, we will take your cleaned data and format it for specific training frameworks — Hugging Face, Axolotl, and Unsloth — with complete code examples.