Lesson 4 of 12

Preparing Training Data

From Clean Data to Training-Ready Format

You have collected and cleaned your data. Now you need to format it so that training frameworks can consume it. Different tools expect different JSONL schemas, different models use different conversation templates, and small formatting mistakes can silently degrade training quality. This lesson walks through the common formats and provides working code for each preparation step.

JSONL Formats by Framework

Hugging Face (SFTTrainer)

The Hugging Face trl library's SFTTrainer expects data in one of two formats:

Format 1: Conversational (recommended)

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is..."}]}
{"messages": [{"role": "user", "content": "Explain fine-tuning"}, {"role": "assistant", "content": "Fine-tuning is the process of..."}]}

Format 2: Text field (for pre-formatted text)

{"text": "<|system|>\nYou are a helpful assistant.\n<|user|>\nWhat is LoRA?\n<|assistant|>\nLoRA is..."}
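Because malformed records tend to fail silently rather than loudly, it can help to check each JSONL line before training. The following is a minimal sketch for the conversational format; the function name validate_record, the allowed role set, and the rule that the last turn must come from the assistant are illustrative assumptions, not requirements imposed by trl.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set for this sketch

def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # For supervised fine-tuning, the last turn is normally the target response
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems
```

Running this over every line of a file and printing the non-empty results gives a quick report of which records need attention before a training run.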

Axolotl

Axolotl supports multiple formats through its YAML configuration. The most common:

{"conversations": [{"from": "system", "value": "You are a helpful assistant."}, {"from": "human", "value": "What is LoRA?"}, {"from": "gpt", "value": "LoRA is..."}]}

Note the different key names: from/value instead of role/content, and human/gpt instead of user/assistant.
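If your data is already in the role/content format, the renaming described above is mechanical. The sketch below converts one record; the function name messages_to_sharegpt and the ROLE_TO_FROM mapping are hypothetical helpers written for this lesson (the from/value layout is commonly called the ShareGPT style), not part of Axolotl.

```python
# Maps Hugging Face style roles to the from/value speaker names
ROLE_TO_FROM = {"system": "system", "user": "human", "assistant": "gpt"}

def messages_to_sharegpt(record):
    """Convert {"messages": [{role, content}]} to {"conversations": [{from, value}]}."""
    return {
        "conversations": [
            # An unknown role raises KeyError, which surfaces bad data early
            {"from": ROLE_TO_FROM[m["role"]], "value": m["content"]}
            for m in record["messages"]
        ]
    }
```

Applying this to each parsed line and writing the result back out with json.dumps produces a file in the from/value layout.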

Unsloth

Unsloth works with the standard Hugging Face conversational format but also has its own optimized path:

{"conversations": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA is..."}]}

Conversation Templates

Different models expect different special tokens to wrap messages. Using the wrong template is one of the most common fine-tuning bugs: training runs without errors, but the fine-tuned model produces garbled output at inference because the tokens it saw during training do not match the ones it is prompted with.

ChatML (Qwen)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is LoRA?<|im_end|>
<|im_start|>assistant
LoRA is...<|im_end|>
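To make the structure of the template concrete, here is a hand-written renderer that reproduces the layout shown above. It is for illustration only: render_chatml is a hypothetical helper, and a model's real template may differ in details such as trailing newlines or a default system prompt, so the tokenizer's own template remains the safer choice for actual training data.

```python
def render_chatml(messages):
    """Render role/content messages as a ChatML-style string."""
    parts = [
        # Each turn: start token, role, newline, content, end token
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts)
```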

Llama 3 Template

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is LoRA?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

LoRA is...<|eot_id|>

Alpaca Template (Legacy)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is LoRA?

### Response:
LoRA is...
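The Alpaca layout is a fixed prompt with two slots, so it can be expressed as a format string. This is a minimal sketch assuming examples with "instruction" and "output" keys; ALPACA_PROMPT and render_alpaca are names chosen for this example.

```python
ALPACA_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def render_alpaca(example):
    """Fill the Alpaca prompt with one instruction/output pair."""
    return ALPACA_PROMPT.format(
        instruction=example["instruction"],
        output=example["output"],
    )
```

The result can be stored under a "text" key to produce records in the pre-formatted text layout.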

Critical tip: Always use the tokenizer's built-in apply_chat_template method when possible. It handles all the special tokens automatically:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."}
]

# This handles all special tokens correctly
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)

Data Augmentation Techniques

When your dataset is small, augmentation can increase effective diversity without collecting new data.

Paraphrasing Instructions

Rewrite the instruction (input) side while keeping the output the same. This teaches the model that different phrasings should produce the same output.

import openai

client = openai.OpenAI()

def paraphrase_instruction(original_instruction):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Rewrite the following instruction to say the same thing "
                       "differently. Keep the meaning identical but change the "
                       "wording, structure, and phrasing. Return ONLY the rewritten "
                       "instruction."
        }, {
            "role": "user",
            "content": original_instruction
        }],
        temperature=0.8
    )
    return response.choices[0].message.content

# Augment each example with 2 paraphrases.
# original_dataset: list of {"instruction": ..., "output": ...} dicts loaded earlier
augmented_dataset = []
for example in original_dataset:
    augmented_dataset.append(example)  # Keep original
    for _ in range(2):
        new_instruction = paraphrase_instruction(example["instruction"])
        augmented_dataset.append({
            "instruction": new_instruction,
            "output": example["output"]
        })
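A paraphrasing model sometimes returns the original wording, or the same rewrite twice, which adds duplicates rather than diversity. One possible safeguard is to drop examples whose instruction matches an earlier one after light normalization. The helpers normalize and dedupe_instructions below are hypothetical, and the normalization rule (lowercase, no punctuation, collapsed whitespace) is an assumption you may want to tighten or loosen.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def dedupe_instructions(examples):
    """Keep the first example for each (normalized instruction, output) pair."""
    seen = set()
    unique = []
    for ex in examples:
        key = (normalize(ex["instruction"]), ex["output"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```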

Adding System Prompt Variations

If your model will be used with different system prompts in production, include variations in training:

system_prompts = [
    "You are a helpful legal assistant.",
    "You are an expert in contract law. Provide clear, actionable advice.",
    "As a legal AI, analyze the following with precision and clarity.",
]

augmented = []
for example in dataset:
    for sys_prompt in system_prompts:
        augmented.append({
            "messages": [
                {"role": "system", "content": sys_prompt},
                {"role": "user", "content": example["user"]},
                {"role": "assistant", "content": example["assistant"]}
            ]
        })

Synthetic Data Generation at Scale

For larger synthetic datasets, use a structured pipeline:

import json
import random

import openai

client = openai.OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_training_batch(seed_examples, num_generate, output_file):
    """Generate a batch of synthetic training examples."""
    generated = []

    for i in range(num_generate):
        # Pick a random seed example for style reference
        seed = random.choice(seed_examples)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "Generate a NEW training example similar in style and "
                    "complexity to the reference below, but with completely "
                    "different content. The example should be realistic and "
                    "diverse.\n\n"
                    f"Reference:\n"
                    f"User: {seed['user']}\n"
                    f"Assistant: {seed['assistant']}\n\n"
                    "Return JSON with 'user' and 'assistant' keys."
                )
            }],
            response_format={"type": "json_object"},
            temperature=1.0
        )

        try:
            example = json.loads(response.choices[0].message.content)
            generated.append(example)
        except json.JSONDecodeError:
            continue  # Skip malformed outputs

    # Write to JSONL
    with open(output_file, "w") as f:
        for ex in generated:
            f.write(json.dumps(ex) + "\n")

    return generated
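Valid JSON is not the same as a usable example: the model can omit a key, return an empty answer, or echo the seed. A post-generation filter along the following lines can catch those cases. This is a sketch, with clean_generated as a hypothetical helper and exact-match comparison on the lowercased user prompt as a deliberately simple duplicate test.

```python
def clean_generated(examples, seed_examples):
    """Drop generated examples that are malformed, empty, or duplicate a seed."""
    seed_prompts = {s["user"].strip().lower() for s in seed_examples}
    kept, seen = [], set()
    for ex in examples:
        if not isinstance(ex, dict):
            continue
        user, assistant = ex.get("user"), ex.get("assistant")
        # Both fields must be present and must be strings
        if not (isinstance(user, str) and isinstance(assistant, str)):
            continue
        key = user.strip().lower()
        if not key or not assistant.strip():
            continue
        # Skip copies of a seed prompt and repeats within the batch
        if key in seed_prompts or key in seen:
            continue
        seen.add(key)
        kept.append({"user": user.strip(), "assistant": assistant.strip()})
    return kept
```

Comparing the number of kept examples with the number generated also gives a rough signal of how well the generation prompt is working.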

Train/Validation/Test Splits

Never train on all your data. You need held-out sets for evaluation.

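import json
import random
from collections import defaultdict
from pathlib import Path

def stratified_split(dataset, train_ratio=0.85, val_ratio=0.10, test_ratio=0.05,
                     category_key="category", seed=42):
    """Split dataset, preserving the category distribution if examples carry one.

    category_key is an assumed field name; examples without it form one group,
    which reduces to a plain shuffled split.
    """
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("Split ratios must sum to 1")
    rng = random.Random(seed)  # Fixed seed keeps the split reproducible

    # Group examples by category so each split gets a proportional share
    groups = defaultdict(list)
    for example in dataset:
        groups[example.get(category_key)].append(example)

    train, val, test = [], [], []
    for examples in groups.values():
        rng.shuffle(examples)  # Shuffles the group list, not the caller's list
        train_end = int(len(examples) * train_ratio)
        val_end = train_end + int(len(examples) * val_ratio)
        train.extend(examples[:train_end])
        val.extend(examples[train_end:val_end])
        test.extend(examples[val_end:])  # Remainder (including tiny groups) goes to test

    # Interleave categories within each split
    for split in (train, val, test):
        rng.shuffle(split)

    print(f"Train: {len(train)}, Validation: {len(val)}, Test: {len(test)}")
    return train, val, test

# full_dataset: the list of cleaned examples loaded earlier
train_data, val_data, test_data = stratified_split(full_dataset)

# Save each split
Path("data").mkdir(exist_ok=True)
for split_name, split_data in [("train", train_data), ("val", val_data), ("test", test_data)]:
    with open(f"data/{split_name}.jsonl", "w") as f:
        for example in split_data:
            f.write(json.dumps(example) + "\n")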

Guidelines for splits:

  • Training (85%): The data the model learns from
  • Validation (10%): Used during training to detect overfitting
  • Test (5%): Never seen during training. Used for final evaluation only

For small datasets (under 500 examples), use 90/10 train/val and manually curate a separate test set of 20-30 high-quality examples.
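A held-out set is only informative if none of its examples also appear in training, which is easy to violate when near-duplicates survive cleaning. The sketch below flags held-out examples whose user turns match a training example; user_fingerprint and find_overlap are hypothetical helpers, and matching on lowercased, whitespace-collapsed user text is an assumption that will miss looser paraphrases.

```python
def user_fingerprint(example):
    """Join the normalized user turns of a 'messages' example."""
    turns = [m["content"] for m in example["messages"] if m["role"] == "user"]
    return " ".join(" ".join(t.lower().split()) for t in turns)

def find_overlap(train, held_out):
    """Return held-out examples whose user turns also appear in the training set."""
    train_keys = {user_fingerprint(ex) for ex in train}
    return [ex for ex in held_out if user_fingerprint(ex) in train_keys]
```

An empty result for both the validation and test sets is the outcome to aim for; any hits are candidates to remove from the held-out side.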

Tokenization Considerations

Tokenization affects training more than most people realize. Key points:

Sequence Length

Most fine-tuning runs use a maximum sequence length (e.g., 2048 or 4096 tokens). Examples that exceed this length get truncated, potentially cutting off the model's response. Always check your dataset's token length distribution:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

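# dataset: list of {"messages": [...]} examples loaded earlier
lengths = []
for example in dataset:
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    # The template already inserts special tokens, so do not add them a second time
    tokens = tokenizer.encode(text, add_special_tokens=False)
    lengths.append(len(tokens))

print("Token length stats:")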
print(f"  Min: {min(lengths)}, Max: {max(lengths)}, Mean: {sum(lengths)/len(lengths):.0f}")
print(f"  Examples > 2048 tokens: {sum(1 for l in lengths if l > 2048)}")
print(f"  Examples > 4096 tokens: {sum(1 for l in lengths if l > 4096)}")
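Minimum, maximum, and mean can hide a long tail, so percentiles are often more useful when choosing a maximum sequence length. The following sketch works on any list of token counts such as the lengths list above; length_percentile and truncated_fraction are hypothetical helpers, and the nearest-rank percentile definition is one of several in common use.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile: smallest length that covers pct% of examples."""
    if not lengths:
        raise ValueError("lengths is empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def truncated_fraction(lengths, max_seq_length):
    """Fraction of examples that would be cut off at max_seq_length."""
    return sum(1 for n in lengths if n > max_seq_length) / len(lengths)
```

For example, if length_percentile(lengths, 95) comes out well below 2048, a 2048-token limit truncates only a small share of the data, and truncated_fraction(lengths, 2048) tells you exactly how much.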

Padding and Packing

Two strategies for handling variable-length sequences:

  • Padding: Pad shorter sequences to the batch's maximum length. Simple but wastes compute on padding tokens.
  • Packing: Concatenate multiple short examples into a single sequence. More efficient but requires careful implementation to avoid cross-contamination between examples.

Most modern training frameworks handle this automatically. In trl, setting packing=True enables sequence packing (on SFTConfig in recent releases; older releases accepted it as an SFTTrainer argument).
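To see why packing saves compute, the sketch below compares the two strategies on a list of token counts. It uses a simple first-fit-decreasing heuristic, which is an illustrative stand-in and not the algorithm any particular framework uses; it also ignores the attention masking needed to keep packed examples from attending to each other. All function names are hypothetical.

```python
def pack_first_fit(lengths, max_seq_length):
    """Greedily place example lengths into bins of capacity max_seq_length."""
    bins = []  # each bin is a list of example lengths sharing one sequence
    for n in sorted(lengths, reverse=True):
        if n > max_seq_length:
            raise ValueError(f"example of {n} tokens exceeds {max_seq_length}")
        for b in bins:
            if sum(b) + n <= max_seq_length:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing bin had room
    return bins

def padded_utilization(lengths):
    """Share of real tokens when every example is padded to the longest one."""
    return sum(lengths) / (len(lengths) * max(lengths))

def packed_utilization(bins, max_seq_length):
    """Share of real tokens when each bin is one sequence of max_seq_length."""
    return sum(sum(b) for b in bins) / (len(bins) * max_seq_length)
```

With lengths of 1000, 600, 500, 400, 300, and 200 tokens and a 1024-token limit, padding to the longest example fills half of the positions with real tokens, while this packing fits the same data into three sequences.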

Complete Formatting Pipeline

Here is a complete script that takes cleaned data in the messages format, drops examples that exceed the sequence limit, and writes train and validation files:

import json
import random
from pathlib import Path
from transformers import AutoTokenizer

def format_dataset(input_file, output_dir, model_name, max_seq_length=2048):
    """Complete pipeline: load, format, validate, split, and save."""

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load raw data, skipping blank lines
    with open(input_file, encoding="utf-8") as f:
        raw_data = [json.loads(line) for line in f if line.strip()]

    # Format and validate
    formatted = []
    skipped = 0
    for example in raw_data:
        messages = example["messages"]
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        # The template already inserts special tokens, so do not add them again
        token_count = len(tokenizer.encode(text, add_special_tokens=False))

        if token_count > max_seq_length:
            skipped += 1
            continue

        formatted.append(example)

    print(f"Kept {len(formatted)}/{len(raw_data)} examples ({skipped} too long)")

    # Split (fixed seed so the split is reproducible)
    random.Random(42).shuffle(formatted)
    train_end = int(len(formatted) * 0.9)
    train_data = formatted[:train_end]
    val_data = formatted[train_end:]

    # Save
    for name, data in [("train", train_data), ("val", val_data)]:
        path = output_dir / f"{name}.jsonl"
        with open(path, "w") as f:
            for ex in data:
                f.write(json.dumps(ex) + "\n")
        print(f"Saved {len(data)} examples to {path}")

format_dataset("raw_data.jsonl", "prepared_data/", "meta-llama/Llama-3.1-8B-Instruct")

In the next lesson, we dive deep into the mechanics of LoRA and QLoRA. Understanding exactly how they work will help you configure them for your use case.