Project: Fine-Tune Your Own Model — Fine-Tuning LLMs: From Data to Deployment

Your End-to-End Fine-Tuning Project

This is the capstone lesson. You will apply everything from the previous 11 lessons in a single, complete project: choose a use case, create a dataset, fine-tune a model with QLoRA, evaluate it against the base model, merge the LoRA adapter, convert to GGUF, deploy with Ollama, and test through an API. Steps 2 through 9 each include code you can run and adapt.

Step 1: Choose Your Use Case

For this project, we will build a technical documentation assistant — a model fine-tuned to generate clear, structured explanations of technical concepts in a specific format. You can adapt this to your own use case by changing the dataset.

Target behavior:

Always start with a one-sentence summary
Use bullet points for key details
Include a practical example when relevant
End with a "Common Pitfalls" section
Keep responses concise (200-400 words)
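
The target behaviors above can be turned into a rough automated check. The sketch below is one possible heuristic, not a definitive grader: the function name check_format and the specific rules (bullet markers, the literal phrase "Common Pitfalls", a whitespace word count) are assumptions made for illustration. The "practical example when relevant" rule is left out because it is conditional and hard to detect mechanically.

```python
import re

def check_format(response: str) -> dict:
    """Return a pass/fail flag for each target behavior (heuristics only)."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    first_line = lines[0] if lines else ""
    word_count = len(response.split())
    return {
        # First non-empty line should be prose, not a heading or a bullet
        "starts_with_summary": bool(first_line)
        and not re.match(r"(#|[-*•]\s)", first_line),
        # At least one line begins with a bullet marker
        "has_bullets": any(re.match(r"[-*•]\s+", ln) for ln in lines),
        # The section title appears somewhere in the response
        "has_pitfalls": "common pitfalls" in response.lower(),
        # Whitespace-delimited word count within the 200-400 target
        "word_count_ok": 200 <= word_count <= 400,
    }
```

Running this over training answers and over model outputs gives a cheap compliance signal to read alongside the LLM-as-judge scores in Step 5.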

Step 2: Create Your Dataset

We will generate 300 synthetic training examples with the OpenAI API (gpt-4o-mini writes the questions, gpt-4o writes the answers), then manually validate a subset.

"""
Step 2: Generate training dataset.
Requires: pip install openai
"""
import openai
import json
import random
import time
from pathlib import Path

client = openai.OpenAI()

# Topics to generate training examples for
topics = [
    "REST API authentication methods",
    "Database indexing strategies",
    "Docker container networking",
    "Git branching strategies",
    "Load balancer algorithms",
    "Caching strategies (Redis, Memcached)",
    "Message queues (RabbitMQ, Kafka)",
    "CI/CD pipeline design",
    "Kubernetes pod scheduling",
    "SSL/TLS certificate management",
    "WebSocket vs Server-Sent Events",
    "Database migration strategies",
    "API rate limiting techniques",
    "Microservice communication patterns",
    "Log aggregation and monitoring",
    # Add 30+ more topics for diversity
]

system_prompt = """You are a technical documentation writer. When explaining a topic:
1. Start with a one-sentence summary
2. Use bullet points for key details
3. Include a practical example when relevant
4. End with a "Common Pitfalls" section listing 2-3 mistakes
5. Keep your response between 200-400 words
6. Be precise and practical, not theoretical"""

def generate_example(topic):
    """Generate a single training example."""
    # Create a realistic user question about this topic
    question_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a realistic, specific technical question about: {topic}. "
                       f"Return ONLY the question, nothing else."
        }],
        temperature=0.9,
    )
    question = question_response.choices[0].message.content.strip()

    # Generate the ideal answer
    answer_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    answer = answer_response.choices[0].message.content.strip()

    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

# Generate examples
Path("data").mkdir(exist_ok=True)
examples = []
for i in range(300):
    topic = random.choice(topics)
    try:
        examples.append(generate_example(topic))
        if (i + 1) % 10 == 0:
            print(f"Generated {i + 1}/300 examples")
    except Exception as e:
        print(f"Error at {i}: {e}")
    time.sleep(0.5)  # Rate limiting (also applies after a failed request)

# Split into train/val (90/10). Use a ratio rather than a fixed index,
# because failed requests can leave fewer than 300 examples.
random.shuffle(examples)
split = int(len(examples) * 0.9)
train_data = examples[:split]
val_data = examples[split:]

# Save
for name, data in [("train", train_data), ("val", val_data)]:
    with open(f"data/{name}.jsonl", "w") as f:
        for ex in data:
            f.write(json.dumps(ex) + "\n")

print(f"Saved {len(train_data)} train and {len(val_data)} val examples")
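
The loop above drops an example whenever a request fails. If you would rather retry transient failures, a small wrapper with exponential backoff is one option. This is a minimal sketch: with_retries and its parameters are hypothetical names, not part of the openai library, and the delays are illustrative defaults.

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            # Exponential backoff with a little jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Inside the generation loop, the call would become with_retries(lambda: generate_example(topic)). The sleep parameter exists only so the wait can be swapped out in tests.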

Step 3: Validate Your Data

Always spot-check before training:

"""
Step 3: Validate dataset quality.
"""
import json
import random

with open("data/train.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Check 5 random examples
for ex in random.sample(examples, 5):
    user_msg = ex["messages"][1]["content"]
    assistant_msg = ex["messages"][2]["content"]
    word_count = len(assistant_msg.split())

    print(f"Q: {user_msg[:80]}...")
    print(f"A: ({word_count} words) {assistant_msg[:100]}...")
    print(f"Has bullet points: {'- ' in assistant_msg or '* ' in assistant_msg}")
    print(f"Has Common Pitfalls: {'Common Pitfalls' in assistant_msg or 'pitfall' in assistant_msg.lower()}")
    print("---")
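
Spot checks catch obvious problems, but with 300 examples drawn from a short topic list, duplicate questions and malformed records are easier to find with a pass over the whole file. The sketch below assumes the messages layout written in Step 2; the function names load_jsonl and audit are illustrative.

```python
import json
from collections import Counter

EXPECTED_ROLES = ["system", "user", "assistant"]

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def audit(examples):
    """Count structural problems, duplicate questions, and off-target answer lengths."""
    bad_structure = 0
    out_of_range = 0
    questions = Counter()
    for ex in examples:
        roles = [m.get("role") for m in ex.get("messages", [])]
        if roles != EXPECTED_ROLES:
            bad_structure += 1
            continue
        question = ex["messages"][1]["content"]
        answer = ex["messages"][2]["content"]
        # Normalize case and whitespace so near-identical questions collide
        questions[" ".join(question.lower().split())] += 1
        if not 200 <= len(answer.split()) <= 400:
            out_of_range += 1
    duplicates = sum(count - 1 for count in questions.values() if count > 1)
    return {
        "total": len(examples),
        "bad_structure": bad_structure,
        "duplicate_questions": duplicates,
        "length_out_of_range": out_of_range,
    }
```

Calling audit(load_jsonl("data/train.jsonl")) returns a small summary dict; nonzero counts are a prompt to inspect and regenerate the affected examples before training.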

Step 4: Fine-Tune with QLoRA

"""
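Step 4: Fine-tune Llama 3.1 8B with QLoRA using Unsloth.
Requires: pip install unsloth trl datasets
Hardware: 16GB+ VRAM
Note: written against the TRL API where SFTTrainer accepts tokenizer= and
max_seq_length= directly; newer TRL releases may expect processing_class=
and an SFTConfig instead.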
"""
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Load dataset
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="./output/tech-docs-assistant",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        optim="adamw_8bit",
        weight_decay=0.01,
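        bf16=True,  # Needs a GPU with bfloat16 support (Ampere or newer); otherwise use fp16=True
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=25,  # Training runs ~50 optimizer steps in total, so this evaluates only once or twice
        save_strategy="steps",
        save_steps=50,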
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        seed=42,
        report_to="tensorboard",
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    max_seq_length=2048,
)

trainer.train()

# Save adapter
model.save_pretrained("./output/tech-docs-assistant/final")
tokenizer.save_pretrained("./output/tech-docs-assistant/final")
print("Training complete. Adapter saved.")
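
It helps to work out how many optimizer steps these settings produce before training. The sketch below does the arithmetic for a single GPU with 270 training examples. The helper training_schedule is a hypothetical name, and the low/high range reflects an assumption that library versions differ in whether a trailing partial accumulation window counts as a step.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, eval_steps):
    """Estimate step counts for a single-GPU run (no sequence packing)."""
    effective_batch = per_device_batch * grad_accum
    micro_batches = math.ceil(num_examples / per_device_batch)
    # A trailing partial accumulation window is either dropped (low)
    # or counted as a full optimizer step (high)
    steps_low = max(micro_batches // grad_accum, 1) * epochs
    steps_high = math.ceil(micro_batches / grad_accum) * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": (steps_low, steps_high),
        "warmup_steps": math.ceil(steps_high * warmup_ratio),
        "evaluations": (steps_low // eval_steps, steps_high // eval_steps),
    }

# Values taken from the TrainingArguments above
print(training_schedule(270, 2, 8, 3, 0.05, 25))
```

With an effective batch of 16, three epochs amount to roughly 50 optimizer steps. On a dataset this small you may prefer lower eval_steps and save_steps values so that load_best_model_at_end has more than one or two checkpoints to choose from.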

Step 5: Evaluate Against Base Model

"""
Step 5: Compare fine-tuned model vs base model.
"""
import json
import openai

client = openai.OpenAI()

# Test prompts (NOT from training data)
test_prompts = [
    "How does connection pooling work in PostgreSQL?",
    "What are the tradeoffs between gRPC and REST?",
    "Explain blue-green deployments vs canary releases.",
    "How do you handle database deadlocks in production?",
    "What is the difference between horizontal and vertical scaling?",
]

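# Generate outputs from the fine-tuned and base models.
# This step continues the Step 4 session, so model, tokenizer, and
# FastLanguageModel are already in scope.
FastLanguageModel.for_inference(model)

def generate(prompt):
    """Generate one response with whichever weights are currently active."""
    messages = [
        # Use the same system prompt as in training (Step 2)
        {"role": "system", "content": "You are a technical documentation writer..."},
        {"role": "user", "content": prompt},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to("cuda")
    outputs = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

finetuned_outputs = [generate(p) for p in test_prompts]

# Base-model outputs: PEFT's disable_adapter() context manager switches the
# LoRA weights off temporarily (assumes your PEFT/Unsloth versions support it)
with model.disable_adapter():
    base_outputs = [generate(p) for p in test_prompts]

def judge(prompt, response):
    """Score one response with an LLM judge and return the parsed JSON."""
    judge_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Rate this technical response on a 1-10 scale for:
- Format compliance (follows the summary/bullets/example/pitfalls structure)
- Technical accuracy
- Conciseness (200-400 words)

Question: {prompt}

Response: {response}

Return JSON: {{"format": X, "accuracy": X, "conciseness": X, "overall": X}}"""
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(judge_response.choices[0].message.content)

# Use LLM-as-judge to compare
for i, prompt in enumerate(test_prompts):
    print(f"Q: {prompt[:60]}...")
    print(f"Base:       {judge(prompt, base_outputs[i])}")
    print(f"Fine-tuned: {judge(prompt, finetuned_outputs[i])}")
    print("---")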
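
Per-prompt scores are hard to read at a glance, so it is worth averaging them once you have judge results for two sets of outputs (for example, base and fine-tuned). The sketch below assumes score dicts shaped like the JSON the judge is asked to return. The helper names and the numbers are placeholders for illustration, not real evaluation results.

```python
from statistics import mean

def summarize(scores):
    """Average each criterion across a list of judge score dicts."""
    keys = scores[0].keys()
    return {k: round(mean(s[k] for s in scores), 2) for k in keys}

def compare(base_scores, tuned_scores):
    """Per-criterion difference; positive means the fine-tuned model scored higher."""
    base, tuned = summarize(base_scores), summarize(tuned_scores)
    return {k: round(tuned[k] - base[k], 2) for k in base}

# Placeholder judge outputs for two prompts (illustrative values only)
base_scores = [
    {"format": 4, "accuracy": 8, "conciseness": 5, "overall": 6},
    {"format": 5, "accuracy": 7, "conciseness": 6, "overall": 6},
]
tuned_scores = [
    {"format": 9, "accuracy": 8, "conciseness": 8, "overall": 8},
    {"format": 8, "accuracy": 7, "conciseness": 9, "overall": 8},
]
print(compare(base_scores, tuned_scores))
```

With only five test prompts the averages are noisy, so treat small differences as inconclusive and read the individual responses as well.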

Step 6: Merge LoRA Adapter

"""
Step 6: Merge LoRA into base model.
"""
# Using Unsloth's built-in merge
model.save_pretrained_merged(
    "./output/tech-docs-merged",
    tokenizer,
    save_method="merged_16bit",
)
print("Merged model saved in float16.")

Step 7: Convert to GGUF

"""
Step 7: Convert to GGUF for Ollama.
"""
model.save_pretrained_gguf(
    "./output/tech-docs-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
print("GGUF model saved with Q4_K_M quantization.")

Step 8: Deploy with Ollama

# Create the Modelfile (check the .gguf filename in ./output/tech-docs-gguf first; it can vary by Unsloth version)
cat > Modelfile << 'MODELEOF'
FROM ./output/tech-docs-gguf/unsloth.Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

SYSTEM """You are a technical documentation writer. When explaining a topic:
1. Start with a one-sentence summary
2. Use bullet points for key details
3. Include a practical example when relevant
4. End with a "Common Pitfalls" section listing 2-3 mistakes
5. Keep your response between 200-400 words
6. Be precise and practical, not theoretical"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
MODELEOF

# Create and test the model
ollama create tech-docs -f Modelfile
ollama run tech-docs "Explain how database connection pooling works"
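
To see what the TEMPLATE in the Modelfile produces, it can help to render it by hand. The sketch below mirrors the template's conditionals in Python, up to the point where the model starts generating. The function render_llama3_prompt is a hypothetical helper for inspection only; Ollama performs this rendering itself.

```python
def render_llama3_prompt(system: str, prompt: str) -> str:
    """Mimic the Modelfile TEMPLATE up to the start of the assistant turn."""
    parts = []
    if system:  # {{ if .System }} ... {{ end }}
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    if prompt:  # {{ if .Prompt }} ... {{ end }}
        parts.append(
            f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
        )
    # The assistant header is always emitted; the response follows it
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(render_llama3_prompt(
    "You are a technical documentation writer.",
    "Explain how database connection pooling works",
))
```

The headers and <|eot_id|> markers should match the chat format the model saw during fine-tuning. A mismatch here is a common reason a deployed model ignores its trained format.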

Step 9: Test via API

"""
Step 9: Test the deployed model through the API.
"""
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

test_questions = [
    "What is circuit breaking in microservices?",
    "How do you implement retry logic with exponential backoff?",
    "Explain the CAP theorem with a practical example.",
]

for question in test_questions:
    response = client.chat.completions.create(
        model="tech-docs",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
        max_tokens=512,
    )

    answer = response.choices[0].message.content
    print(f"Q: {question}")
    print(f"A: {answer}")
    print(f"Word count: {len(answer.split())}")
    print("=" * 60)

Checklist: What You Have Built

By completing this project, you have:

[ ] Defined a clear use case with measurable behavior requirements
[ ] Generated and validated a training dataset (300 examples)
[ ] Fine-tuned Llama 3.1 8B with QLoRA (rank 16)
[ ] Evaluated the fine-tuned model against the base model using LLM-as-judge
[ ] Merged the LoRA adapter into the base model
[ ] Converted to GGUF format (Q4_K_M quantization)
[ ] Deployed with Ollama as a local service
[ ] Tested through an OpenAI-compatible API

Next Steps

With this foundation, you can:

Scale your dataset to 1,000-2,000 examples for even better quality
Try DPO training with preference pairs to further align behavior
Deploy to the cloud with vLLM for production throughput
Publish to Hugging Face Hub to share with the community
Train on your own domain — swap the topics and system prompt for your specific use case

The entire pipeline you built in this project is reusable. Change the dataset and you can fine-tune a model for any domain: medical documentation, customer support, code review, financial analysis, or anything else where consistent, high-quality LLM behavior matters.

Congratulations on completing the course. You now have the practical skills to take any LLM from general-purpose to specialized — from data all the way to deployment.