Merging and Exporting Models — Fine-Tuning LLMs: From Data to Deployment

From LoRA Adapter to Deployable Model

After training, you have a LoRA adapter: a small file (typically 10-100 MB) that contains only the learned low-rank weight updates. To deploy, you either serve the adapter alongside the base model or merge it into a standalone model. This lesson covers merging strategies, export formats, and publishing workflows.

LoRA Merging Strategies

Basic Merge: merge_and_unload

This is the simplest and most common approach: it folds the scaled LoRA update into the base model weights and removes the LoRA layers, leaving a plain Transformers model.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model in 16-bit precision (not 4-bit) for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./output/lora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("./output/merged-model")
tokenizer.save_pretrained("./output/merged-model")

Important: When merging a QLoRA adapter, do not merge into the 4-bit base model used during training. Reload the base model in 16-bit precision (float16 or bfloat16), as in the example above, and merge there. This avoids baking quantization artifacts into the final model.
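
To make the merge concrete, here is a minimal sketch of the arithmetic for a single layer, assuming the usual LoRA parameterization in which the merged weight is W + (alpha / r) * B A. The matrices and the helper names (matmul, merge_lora) are made up for illustration and are not part of PEFT.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy layer: 2 outputs, 3 inputs, rank-1 adapter (all values are illustrative)
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x in_features
B = [[0.5], [-1.0]]     # out_features x r
alpha, r = 2, 1

W_merged = merge_lora(W, A, B, alpha, r)

# Compare the two inference paths on one input column vector
x = [[1.0], [1.0], [1.0]]
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]
merged_out = matmul(W_merged, x)
```

Both paths produce the same output, which is why a merged model needs no LoRA layers at inference time.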

TIES Merging

TIES (TrIm, Elect Sign & Merge) is a strategy for combining multiple LoRA adapters. It reduces interference between adapters in three steps: trim each adapter's update to its largest-magnitude values, elect a sign for each parameter, and average only the values that agree with the elected sign.

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# TIES merging with multiple adapters
adapters = ["./adapter-legal", "./adapter-medical", "./adapter-finance"]
weights = [0.4, 0.3, 0.3]  # Weight each adapter's contribution
density = 0.5  # Keep top 50% of parameters by magnitude

model = PeftModel.from_pretrained(base_model, adapters[0], adapter_name="legal")
model.load_adapter(adapters[1], adapter_name="medical")
model.load_adapter(adapters[2], adapter_name="finance")

# Merge using TIES (for this combination type, PEFT expects all adapters to share the same rank r)
model.add_weighted_adapter(
    adapters=["legal", "medical", "finance"],
    weights=weights,
    adapter_name="merged",
    combination_type="ties",
    density=density,
)
model.set_adapter("merged")
merged = model.merge_and_unload()
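
The three TIES steps can be sketched on small vectors. This is a simplified illustration of the idea, not PEFT's implementation: it ignores per-adapter weights, works on plain Python lists, and the function names and numbers are hypothetical.

```python
def trim(vec, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = int(round(density * len(vec)))
    top = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in top else 0.0 for i, v in enumerate(vec)]


def ties_merge(task_vectors, density):
    """Trim each vector, elect a sign per position, average the agreeing values."""
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for values in zip(*trimmed):
        # Elect the sign with the larger total magnitude at this position
        sign = 1.0 if sum(values) >= 0 else -1.0
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Illustrative weight updates from three adapters (values are made up)
legal = [0.9, -0.1, 0.4, 0.05]
medical = [-0.8, 0.6, 0.5, -0.02]
finance = [0.7, 0.5, -0.03, 0.01]

merged_update = ties_merge([legal, medical, finance], density=0.5)
```

At the first position the adapters disagree in sign; the positive side has more total mass, so the negative value is discarded and the two positive values are averaged.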

DARE Merging

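DARE (Drop And REscale) randomly drops a fraction of each adapter's update and rescales the surviving values to compensate, which can reduce interference when combining multiple adapters. In PEFT, density is the fraction of values kept. This call takes the place of the add_weighted_adapter call in the TIES example; the surrounding setup and the final set_adapter and merge_and_unload steps stay the same:

model.add_weighted_adapter(
    adapters=["legal", "medical", "finance"],
    weights=weights,
    adapter_name="merged",
    combination_type="dare_ties",  # DARE combined with TIES
    density=0.5,  # Keep a random 50% of values
)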
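
The drop-and-rescale step can be sketched in a few lines. This is an illustrative toy on a plain list, assuming each value is kept independently with probability equal to the density; the function name and numbers are hypothetical.

```python
import random


def dare(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1 / density."""
    return [v / density if rng.random() < density else 0.0 for v in delta]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
delta = [0.2, -0.4, 0.1, 0.3, -0.6, 0.5]
sparse = dare(delta, density=0.5, rng=rng)

# Averaged over many random masks, the rescaled update approximates the original
trials = 20000
mean = [0.0] * len(delta)
for _ in range(trials):
    for i, s in enumerate(dare(delta, 0.5, rng)):
        mean[i] += s / trials
```

The rescaling is what keeps the expected update unchanged even though half of the values are zeroed in any single draw.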

When to use multi-adapter merging: When you have trained separate LoRA adapters for different tasks or domains and want to create a single model that combines all capabilities.

Export Formats

Safetensors (Default)

Safetensors is the standard format for Hugging Face models. It stores raw tensors without executable pickle code, loads quickly, and supports memory mapping:

# Already saved in safetensors format by default
merged_model.save_pretrained("./output/merged-model")
# Creates: model-00001-of-00002.safetensors, model-00002-of-00002.safetensors, etc.
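
Part of what makes the format quick to inspect is its layout: as far as I understand it, a file starts with an 8-byte little-endian header length, followed by a JSON header describing each tensor, followed by the raw tensor bytes. The sketch below reads only that header using the standard library. The helper name and the tiny hand-built file are illustrative; real checkpoints should be written with the safetensors library.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        return json.loads(f.read(header_len).decode("utf-8"))


# Build a tiny file by hand so the sketch runs without any model on disk
header = {"weight": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
payload = struct.pack("<2f", 1.0, 2.0)  # two float32 values

path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + payload)

info = read_safetensors_header(path)
```

Pointing the same helper at one of the shards in ./output/merged-model should list tensor names, dtypes, and shapes without reading the weights themselves.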

GGUF (for llama.cpp and Ollama)

GGUF is the format used by llama.cpp, Ollama, LM Studio, and other local inference tools. It supports a range of built-in quantization types for efficient CPU and GPU inference.

Method 1: Using Unsloth (easiest)

# If you trained with Unsloth
model.save_pretrained_gguf(
    "output/gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality and size
)

Method 2: Using llama.cpp directly

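# Clone llama.cpp and install the converter's Python dependencies
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Build the command-line tools (llama-quantize ends up in build/bin/)
cmake -B build
cmake --build build --config Release

# Convert from HF format to GGUF
python convert_hf_to_gguf.py ../output/merged-model --outfile model-f16.gguf --outtype f16

# Quantize to smaller sizes
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
./build/bin/llama-quantize model-f16.gguf model-q8_0.gguf Q8_0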
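
After conversion, a quick sanity check is to read the fixed-size start of the file. The sketch below assumes GGUF version 2 or later in little-endian layout (4-byte magic, 32-bit version, then 64-bit tensor and metadata counts); the helper name and the fake header values are illustrative.

```python
import struct


def read_gguf_header(data):
    """Parse the first 20 bytes of a GGUF file: magic, version, and two counts."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic bytes")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Fake header bytes so the sketch runs without a real model file
fake = b"GGUF" + struct.pack("<IQQ", 3, 291, 25)
header = read_gguf_header(fake)

# With a real file, read just the first 20 bytes:
# with open("model-q4_k_m.gguf", "rb") as f:
#     print(read_gguf_header(f.read(20)))
```

A tensor count of zero or a failed magic check usually means the conversion did not finish or the wrong file was picked up.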

GGUF quantization methods explained:

| Method | Size (7B) | Quality   | Speed   | Use Case                     |
|--------|-----------|-----------|---------|------------------------------|
| Q4_K_M | ~4.1 GB   | Good      | Fast    | Best balance for most users  |
| Q5_K_M | ~4.8 GB   | Better    | Fast    | When you need higher quality |
| Q6_K   | ~5.5 GB   | Very good | Medium  | Quality-sensitive tasks      |
| Q8_0   | ~7.2 GB   | Near-FP16 | Slower  | When quality is critical     |
| F16    | ~14 GB    | Baseline  | Slowest | Reference / development      |
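
The sizes in the table follow roughly from parameters times bits per weight, divided by 8. The sketch below reproduces them under two assumptions that are not from this lesson: a parameter count of about 6.74 billion for a "7B" model, and approximate effective bits-per-weight figures for each method. Real files also contain metadata and some tensors stored at other precisions, so treat the output as an estimate.

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Rough file size in decimal gigabytes: params * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 6.74e9  # assumed parameter count for a "7B" model

# Approximate effective bits per weight (assumed values; K-quants mix precisions)
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}

sizes = {
    name: round(estimate_gguf_size_gb(N_PARAMS, bpw), 1)
    for name, bpw in BITS_PER_WEIGHT.items()
}

for name, size in sizes.items():
    print(f"{name}: ~{size} GB")
```

The same function gives a quick estimate for the 8B model used in this lesson by changing N_PARAMS.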

AWQ (Activation-aware Weight Quantization)

AWQ produces 4-bit quantized models optimized for GPU inference with vLLM:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained(
    "./output/merged-model",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./output/merged-model")

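# Quantization settings: 4-bit weights, zero-point (asymmetric) scheme, groups of 128 weights
# Calibration uses AutoAWQ's default dataset unless you pass calib_data to quantize()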
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./output/merged-model-awq")
tokenizer.save_pretrained("./output/merged-model-awq")
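
To show what w_bit and q_group_size control, here is a minimal sketch of group-wise asymmetric quantization on a plain list. It illustrates only the storage idea (one scale and one offset per group of weights). It does not implement AWQ's activation-aware scaling, and the function names, group size, and values are chosen for illustration.

```python
def quantize_groups(weights, q_group_size, w_bit=4):
    """Quantize weights group by group to integers in [0, 2**w_bit - 1]."""
    q_max = 2 ** w_bit - 1
    groups = []
    for start in range(0, len(weights), q_group_size):
        group = weights[start:start + q_group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / q_max or 1.0  # avoid dividing by zero for a constant group
        q = [min(q_max, max(0, round((v - lo) / scale))) for v in group]
        groups.append({"q": q, "scale": scale, "min": lo})
    return groups


def dequantize_groups(groups):
    """Map the stored integers back to approximate float weights."""
    return [g["min"] + qi * g["scale"] for g in groups for qi in g["q"]]


# Eight illustrative weights, quantized in groups of 4 (the lesson's config uses 128)
weights = [-0.8, -0.3, 0.0, 0.1, 0.25, 0.4, 0.55, 0.7]
groups = quantize_groups(weights, q_group_size=4)
restored = dequantize_groups(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Smaller groups mean each scale covers a narrower range of values, which lowers the rounding error at the cost of storing more scales.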

GPTQ

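GPTQ also produces 4-bit weights for GPU inference, but with a different algorithm: it quantizes each layer's weights in sequence and uses calibration data to adjust the not-yet-quantized weights so they compensate for the rounding error. Passing a GPTQConfig to from_pretrained runs the quantization at load time; this path relies on the optimum package and a GPTQ backend being installed:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("./output/merged-model")

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # Calibration dataset
    tokenizer=tokenizer,
)

# Quantization happens while the model loads
model = AutoModelForCausalLM.from_pretrained(
    "./output/merged-model",
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("./output/merged-model-gptq")
tokenizer.save_pretrained("./output/merged-model-gptq")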

Publishing to Hugging Face Hub

Creating a Model Card

Every published model should have a model card (README.md) that documents its purpose, training details, and limitations:

# The trailing backslash avoids a leading blank line, so the YAML front matter starts on line 1
card_content = """\
---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - fine-tuned
  - legal
  - LoRA
---

# Legal Assistant - Llama 3.1 8B Fine-Tuned

## Model Description
Fine-tuned Llama 3.1 8B for legal document analysis, trained on 2,000 examples
of contract review, clause summarization, and risk assessment.

## Training Details
- **Base model:** meta-llama/Llama-3.1-8B-Instruct
- **Method:** QLoRA (4-bit, rank 16, alpha 32)
- **Dataset:** 2,000 curated legal examples
- **Epochs:** 3
- **Hardware:** 1x RTX 3090

## Intended Use
Legal document analysis for English-language contracts and agreements.

## Limitations
- Not a substitute for professional legal advice
- Trained primarily on US contract law
- May not generalize to other legal systems
"""

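Before uploading, it can help to write the card to README.md and confirm that the front matter is well formed. The sketch below uses only the standard library and a shortened, hypothetical card; the helper name split_front_matter is illustrative, and the key check is a rough scan, not a YAML parser.

```python
import tempfile
from pathlib import Path


def split_front_matter(text):
    """Split a model card into (front_matter_lines, body); raise if there is none."""
    lines = text.strip().splitlines()
    if not lines or lines[0] != "---":
        raise ValueError("model card must start with a '---' front matter block")
    end = lines.index("---", 1)  # raises ValueError if the block is never closed
    return lines[1:end], "\n".join(lines[end + 1:]).strip()


# Shortened example card (hypothetical content)
sample_card = """
---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - legal
---

# Legal Assistant
"""

front, body = split_front_matter(sample_card)
top_level_keys = [line.split(":")[0] for line in front if not line.startswith(" ")]

# Write the card as README.md, stripped so the file begins with '---'
readme_path = Path(tempfile.mkdtemp()) / "README.md"
readme_path.write_text(sample_card.strip() + "\n", encoding="utf-8")
```

In a real workflow the README.md would be written into the merged model folder so that it is uploaded together with the weights.
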
Pushing to the Hub

from huggingface_hub import HfApi

# Assumes you are already authenticated, for example via `huggingface-cli login`
api = HfApi()

# Create repo and upload merged model
api.create_repo("your-username/legal-llama3-8b", private=False, exist_ok=True)

merged_model.push_to_hub("your-username/legal-llama3-8b")
tokenizer.push_to_hub("your-username/legal-llama3-8b")

# Upload GGUF separately (optional); the target repo must exist before uploading
api.create_repo("your-username/legal-llama3-8b-gguf", private=False, exist_ok=True)
api.upload_file(
    path_or_fileobj="output/gguf/model-q4_k_m.gguf",
    path_in_repo="legal-llama3-8b-q4_k_m.gguf",
    repo_id="your-username/legal-llama3-8b-gguf",
)

Licensing Considerations

When publishing fine-tuned models, you must comply with the base model's license:

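Llama 3.1: Llama 3.1 Community License. Commercial use is allowed unless your products exceed 700 million monthly active users. Derivatives must display "Built with Llama" and include "Llama" at the start of the model name. You must accept the license to download the weights.
Mistral: Apache 2.0 for models such as Mistral 7B and Mixtral. Very permissive; commercial use allowed. Some other Mistral models ship under more restrictive licenses.
Qwen: Apache 2.0 or a Qwen-specific license, depending on model version and size.
Gemma: Gemma Terms of Use. Commercial use allowed, subject to Google's prohibited-use policy.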

Always check the base model's license before publishing your fine-tuned version. Your fine-tuned model inherits the base model's license restrictions.

Version Control for Models

Track your models systematically:

models/
  legal-assistant/
    v1.0/
      adapter/         # LoRA adapter files
      merged/          # Full merged model
      gguf/            # GGUF quantized versions
      eval_results.json  # Evaluation scores
      training_config.json  # Hyperparameters used
    v1.1/
      ...

Store training configurations alongside models so you can reproduce any version:

import json
from pathlib import Path

config = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "lora_rank": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "epochs": 3,
    "dataset": "legal-contracts-v2",
    "dataset_size": 2000,
    "eval_loss": 0.847,
    "eval_win_rate_vs_base": 0.78,
}

# Create the version folder if it does not exist yet
version_dir = Path("models/legal-assistant/v1.0")
version_dir.mkdir(parents=True, exist_ok=True)

with open(version_dir / "training_config.json", "w") as f:
    json.dump(config, f, indent=2)
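
One way to make a version verifiable later is to record a checksum for every file in it. The sketch below builds such a manifest with the standard library. The manifest.json file name and the helper names are assumptions for illustration, and the demo runs against a temporary folder instead of a real model directory.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(version_dir):
    """Map each file's path (relative to version_dir) to its SHA-256 digest."""
    version_dir = Path(version_dir)
    return {
        p.relative_to(version_dir).as_posix(): sha256_of(p)
        for p in sorted(version_dir.rglob("*"))
        if p.is_file()
    }


# Demo on a throwaway folder that mimics the layout above
demo_dir = Path(tempfile.mkdtemp()) / "v1.0"
(demo_dir / "adapter").mkdir(parents=True)
(demo_dir / "adapter" / "adapter_config.json").write_text('{"r": 16}')

manifest = build_manifest(demo_dir)
(demo_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
```

Rebuilding the manifest later and comparing it with the stored one shows whether any artifact in that version has changed.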

In the next lesson, we will take your exported model and deploy it with production inference engines — vLLM for high throughput and Ollama for local deployment.