Preparing Training Data
From Clean Data to Training-Ready Format
You have collected and cleaned your data. Now you need to format it so that training frameworks can consume it. Different tools expect different JSONL schemas, different models use different conversation templates, and small formatting mistakes can silently degrade training quality. This lesson walks through code for each step so you can format your data correctly the first time.
JSONL Formats by Framework
Hugging Face (SFTTrainer)
The Hugging Face trl library's SFTTrainer expects data in one of two formats:
Format 1: Conversational (recommended)
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is..."}]}
{"messages": [{"role": "user", "content": "Explain fine-tuning"}, {"role": "assistant", "content": "Fine-tuning is the process of..."}]}
Format 2: Text field (for pre-formatted text)
{"text": "<|system|>\nYou are a helpful assistant.\n<|user|>\nWhat is LoRA?\n<|assistant|>\nLoRA is..."}
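Because formatting mistakes fail silently, it can help to check each JSONL line before training. The following is a minimal sketch, not part of trl: the function name validate_record and the specific rules (known roles only, system message first, assistant message last) are assumptions chosen for illustration, and your dataset may legitimately need different rules.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
        # Assumed rule: a system message may only appear first
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")

    # Assumed rule: each training example ends with the assistant's reply
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

good = '{"messages": [{"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA is..."}]}'
bad = '{"messages": [{"role": "human", "content": "What is LoRA?"}]}'
print(validate_record(good))
print(validate_record(bad))
```

Running a check like this over every line of a file and printing the line numbers that return problems is usually enough to catch schema drift early.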
Axolotl
Axolotl supports multiple formats through its YAML configuration. The most common:
{"conversations": [{"from": "system", "value": "You are a helpful assistant."}, {"from": "human", "value": "What is LoRA?"}, {"from": "gpt", "value": "LoRA is..."}]}
Note the different key names: from/value instead of role/content, and human/gpt instead of user/assistant.
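If your data is already in the from/value layout and you need the role/content layout (or the other way around), the conversion is a simple key and role rename. A minimal sketch, assuming only the three speaker names shown above; the helper name sharegpt_to_messages and the ROLE_MAP table are illustrative, not part of Axolotl:

```python
# Assumed mapping, based only on the speaker names shown in the example above
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(record):
    """Convert {"conversations": [{"from", "value"}]} to {"messages": [{"role", "content"}]}."""
    messages = []
    for turn in record["conversations"]:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            # Fail loudly instead of silently writing an unknown role
            raise ValueError(f"unknown speaker: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return {"messages": messages}

example = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "What is LoRA?"},
        {"from": "gpt", "value": "LoRA is..."},
    ]
}
converted = sharegpt_to_messages(example)
print(converted["messages"][1])
```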
Unsloth
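Unsloth works with the standard Hugging Face conversational format but also has its own optimized path. Note that the example below uses a top-level conversations key rather than messages, while keeping role/content inside each turn: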
{"conversations": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA is..."}]}
Conversation Templates
Different models expect different special tokens to wrap messages. Using the wrong template is one of the most common fine-tuning bugs: training runs without errors, but the model produces garbled output because the tokens it was trained with do not match the ones it expects.
ChatML (Qwen)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is LoRA?<|im_end|>
<|im_start|>assistant
LoRA is...<|im_end|>
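To make the structure of the template above concrete, here is a minimal sketch that renders a message list into that layout by hand. The function name render_chatml is hypothetical, and a real model's template may differ in details such as default system prompts, so treat this as an illustration of the format rather than a replacement for the model's own template.

```python
def render_chatml(messages):
    """Wrap each message as <|im_start|>{role}\\n{content}<|im_end|>\\n."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA is..."},
]
rendered = render_chatml(messages)
print(rendered)
```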
Llama 3 Template
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is LoRA?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

LoRA is...<|eot_id|>
Alpaca Template (Legacy)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is LoRA?

### Response:
LoRA is...
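For datasets stored as instruction/output pairs, the Alpaca layout can be produced with a plain string template. A minimal sketch: the names ALPACA_TEMPLATE and format_alpaca are illustrative, and the instruction and output field names are assumed to match your dataset.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_alpaca(example):
    """Fill the Alpaca-style template from an {"instruction", "output"} dict."""
    return ALPACA_TEMPLATE.format(
        instruction=example["instruction"],
        output=example["output"],
    )

alpaca_text = format_alpaca({"instruction": "What is LoRA?", "output": "LoRA is..."})
print(alpaca_text)
```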
Critical tip: Always use the tokenizer's built-in apply_chat_template method when possible. It handles all the special tokens automatically:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."}
]

# This handles all special tokens correctly
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
Data Augmentation Techniques
When your dataset is small, augmentation can increase effective diversity without collecting new data.
Paraphrasing Instructions
Rewrite the instruction (input) side while keeping the output the same. This teaches the model that different phrasings should produce the same output.
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def paraphrase_instruction(original_instruction):
    """Ask the model for one reworded version of an instruction."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Rewrite the following instruction to say the same thing "
                       "differently. Keep the meaning identical but change the "
                       "wording, structure, and phrasing. Return ONLY the rewritten "
                       "instruction."
        }, {
            "role": "user",
            "content": original_instruction
        }],
        temperature=0.8
    )
    return response.choices[0].message.content

# Augment each example with 2 paraphrases.
# original_dataset is assumed to be a list of {"instruction", "output"} dicts.
augmented_dataset = []
for example in original_dataset:
    augmented_dataset.append(example)  # Keep original
    for _ in range(2):
        new_instruction = paraphrase_instruction(example["instruction"])
        augmented_dataset.append({
            "instruction": new_instruction,
            "output": example["output"]
        })
Adding System Prompt Variations
If your model will be used with different system prompts in production, include variations in training:
system_prompts = [
    "You are a helpful legal assistant.",
    "You are an expert in contract law. Provide clear, actionable advice.",
    "As a legal AI, analyze the following with precision and clarity.",
]

augmented = []
for example in dataset:
    for sys_prompt in system_prompts:
        augmented.append({
            "messages": [
                {"role": "system", "content": sys_prompt},
                {"role": "user", "content": example["user"]},
                {"role": "assistant", "content": example["assistant"]}
            ]
        })
Synthetic Data Generation at Scale
For larger synthetic datasets, use a structured pipeline:
import json
import random

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_training_batch(seed_examples, num_generate, output_file):
    """Generate a batch of synthetic training examples."""
    generated = []
    for _ in range(num_generate):
        # Pick a random seed example for style reference
        seed = random.choice(seed_examples)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "Generate a NEW training example similar in style and "
                    "complexity to the reference below, but with completely "
                    "different content. The example should be realistic and "
                    "diverse.\n\n"
                    "Reference:\n"
                    f"User: {seed['user']}\n"
                    f"Assistant: {seed['assistant']}\n\n"
                    "Return JSON with 'user' and 'assistant' keys."
                )
            }],
            response_format={"type": "json_object"},
            temperature=1.0
        )
        try:
            example = json.loads(response.choices[0].message.content)
        except json.JSONDecodeError:
            continue  # Skip malformed outputs
        # Keep only outputs that contain both requested keys
        if isinstance(example, dict) and {"user", "assistant"} <= example.keys():
            generated.append(example)

    # Write to JSONL
    with open(output_file, "w", encoding="utf-8") as f:
        for ex in generated:
            f.write(json.dumps(ex) + "\n")
    return generated
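Generating many examples from a small pool of seeds can yield repeated prompts, which add no diversity. One simple safeguard is to drop examples whose user prompt has already been seen. This is a minimal sketch that only catches exact repeats after lowercasing and whitespace normalization; the helper name dedupe_examples is hypothetical, and near-duplicates would need a fuzzier comparison.

```python
def dedupe_examples(examples):
    """Drop examples whose user prompt repeats an earlier one (ignoring case and spacing)."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize: lowercase and collapse runs of whitespace
        key = " ".join(ex["user"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(ex)
    return unique

batch = [
    {"user": "What is LoRA?", "assistant": "LoRA is..."},
    {"user": "what is   lora?", "assistant": "It is a fine-tuning method."},
    {"user": "Explain fine-tuning", "assistant": "Fine-tuning is..."},
]
deduped = dedupe_examples(batch)
print(len(batch), "->", len(deduped))
```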
Train/Validation/Test Splits
Never train on all your data. You need held-out sets for evaluation.
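import json
import os
import random
from collections import defaultdict

def stratified_split(dataset, train_ratio=0.85, val_ratio=0.10, test_ratio=0.05,
                     category_key="category", seed=42):
    """Split dataset, preserving the category distribution if categories are present."""
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group examples by category. "category" is an assumed field name; examples
    # without it share one group, which reduces to a plain random split.
    groups = defaultdict(list)
    for example in dataset:
        groups[example.get(category_key)].append(example)

    train, val, test = [], [], []
    for items in groups.values():
        rng.shuffle(items)
        train_end = int(len(items) * train_ratio)
        val_end = train_end + int(len(items) * val_ratio)
        train.extend(items[:train_end])
        val.extend(items[train_end:val_end])
        test.extend(items[val_end:])  # remainder goes to test

    print(f"Train: {len(train)}, Validation: {len(val)}, Test: {len(test)}")
    return train, val, test

train_data, val_data, test_data = stratified_split(full_dataset)

# Save each split
os.makedirs("data", exist_ok=True)
for split_name, split_data in [("train", train_data), ("val", val_data), ("test", test_data)]:
    with open(f"data/{split_name}.jsonl", "w") as f:
        for example in split_data:
            f.write(json.dumps(example) + "\n")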
Guidelines for splits:
- Training (85%): The data the model learns from
- Validation (10%): Used during training to detect overfitting
- Test (5%): Never seen during training. Used for final evaluation only
For small datasets (under 500 examples), use 90/10 train/val and manually curate a separate test set of 20-30 high-quality examples.
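The guidelines above translate into concrete counts as follows. This is a small sketch, assuming the 500-example threshold and the ratios stated above; the function name split_sizes is illustrative, and integer arithmetic is used so the counts do not depend on floating-point rounding.

```python
def split_sizes(n, small_threshold=500):
    """Return (train, val, test) counts for a dataset of n examples."""
    if n < small_threshold:
        # Small dataset: 90/10 train/val; the test set is curated by hand separately
        train = n * 90 // 100
        return train, n - train, 0
    train = n * 85 // 100
    val = n * 10 // 100
    return train, val, n - train - val  # remainder goes to test

print(split_sizes(1000))  # (850, 100, 50)
print(split_sizes(300))   # (270, 30, 0)
```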
Tokenization Considerations
Tokenization affects training more than most people realize. Key points:
Sequence Length
Most fine-tuning runs use a maximum sequence length (e.g., 2048 or 4096 tokens). Examples that exceed this length get truncated, potentially cutting off the model's response. Always check your dataset's token length distribution:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lengths = []
for example in dataset:
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    # The template already inserts special tokens, so do not add them again
    tokens = tokenizer.encode(text, add_special_tokens=False)
    lengths.append(len(tokens))

print("Token length stats:")
print(f"  Min: {min(lengths)}, Max: {max(lengths)}, Mean: {sum(lengths)/len(lengths):.0f}")
print(f"  Examples > 2048 tokens: {sum(1 for n in lengths if n > 2048)}")
print(f"  Examples > 4096 tokens: {sum(1 for n in lengths if n > 4096)}")
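Once you have a list of token lengths, percentiles are often more useful than the mean for picking a maximum sequence length. The sketch below uses a nearest-rank percentile and a made-up list of lengths purely for illustration; the helper names length_percentile and truncated_fraction are hypothetical.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile: smallest length that covers pct% of examples."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def truncated_fraction(lengths, max_seq_length):
    """Fraction of examples that would be truncated at max_seq_length."""
    return sum(1 for n in lengths if n > max_seq_length) / len(lengths)

# Illustrative values only, not measurements from a real dataset
sample_lengths = [120, 340, 512, 800, 950, 1100, 1500, 2100, 3000, 5000]
print(length_percentile(sample_lengths, 90))
print(truncated_fraction(sample_lengths, 2048))
```

If a small fraction of examples is far longer than the rest, it may be cheaper to drop or shorten those examples than to raise the maximum length for the whole run.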
Padding and Packing
Two strategies for handling variable-length sequences:
- Padding: Pad shorter sequences to the batch's maximum length. Simple but wastes compute on padding tokens.
- Packing: Concatenate multiple short examples into a single sequence. More efficient but requires careful implementation to avoid cross-contamination between examples.
Most modern training frameworks handle this automatically. In trl, setting packing=True enables sequence packing for SFTTrainer (in recent versions the option is set on SFTConfig).
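To see why packing saves compute, the sketch below compares the share of padding tokens under per-batch padding with a simple greedy first-fit packing of the same examples. It works on token counts only and uses made-up lengths; it is not how any particular framework implements packing, and real packing also needs attention masks or position handling to keep examples from attending to each other. The names padding_waste and pack_first_fit are hypothetical.

```python
def padding_waste(lengths, batch_size):
    """Fraction of token slots that are padding when each batch is padded to its longest example."""
    total_slots = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total_slots += max(batch) * len(batch)
    return 1 - sum(lengths) / total_slots

def pack_first_fit(lengths, max_seq_length):
    """Greedy first-fit: put each example into the first packed sequence with enough room."""
    bins = []  # each bin holds the lengths of the examples packed together
    for n in lengths:
        if n > max_seq_length:
            raise ValueError(f"example of {n} tokens exceeds max_seq_length")
        for b in bins:
            if sum(b) + n <= max_seq_length:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing sequence had room
    return bins

# Illustrative token counts only
example_lengths = [300, 1200, 150, 900, 600, 2000, 450, 100]
waste = padding_waste(example_lengths, batch_size=4)
packed = pack_first_fit(example_lengths, max_seq_length=2048)
print(f"Padding waste: {waste:.1%}")
print(f"Packed into {len(packed)} sequences: {packed}")
```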
Complete Formatting Pipeline
Here is a complete script that takes raw data and produces training-ready files:
import json
import random
from pathlib import Path
from transformers import AutoTokenizer

def format_dataset(input_file, output_dir, model_name, max_seq_length=2048):
    """Complete pipeline: load, format, validate, split, and save."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load raw data, skipping blank lines
    with open(input_file) as f:
        raw_data = [json.loads(line) for line in f if line.strip()]

    # Format and validate
    formatted = []
    skipped = 0
    for example in raw_data:
        messages = example["messages"]
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        # The template already inserts special tokens, so do not add them again
        token_count = len(tokenizer.encode(text, add_special_tokens=False))
        if token_count > max_seq_length:
            skipped += 1
            continue
        formatted.append(example)
    print(f"Kept {len(formatted)}/{len(raw_data)} examples ({skipped} too long)")

    # Split
    random.shuffle(formatted)
    train_end = int(len(formatted) * 0.9)
    train_data = formatted[:train_end]
    val_data = formatted[train_end:]

    # Save
    for name, data in [("train", train_data), ("val", val_data)]:
        path = output_dir / f"{name}.jsonl"
        with open(path, "w") as f:
            for ex in data:
                f.write(json.dumps(ex) + "\n")
        print(f"Saved {len(data)} examples to {path}")

format_dataset("raw_data.jsonl", "prepared_data/", "meta-llama/Llama-3.1-8B-Instruct")
In the next lesson, we dive deep into the mechanics of LoRA and QLoRA — understanding exactly how they work will help you configure them optimally for your use case.