Deployment and Inference — Fine-Tuning LLMs: From Data to Deployment

From Trained Model to Production API

Your model is fine-tuned, evaluated, merged, and exported. Now you need to serve it to users. The inference engine you choose determines your latency, throughput, cost, and operational complexity. This lesson covers the four main deployment paths and helps you choose the right one for your use case.

Inference Engine Comparison

| Engine | Best For | Throughput | Ease of Use | GPU Required |
|--------|----------|-----------|-------------|-------------|
| vLLM | High-throughput production | Highest | Medium | Yes |
| TGI (HuggingFace) | HF ecosystem integration | High | Medium | Yes |
| Ollama | Local dev / small-scale | Medium | Easiest | Optional |
| llama.cpp | Edge / CPU inference | Low-Medium | Medium | No |

vLLM: High-Throughput Production

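vLLM uses PagedAttention and continuous batching to reach some of the highest throughput among open-source inference engines. PagedAttention manages the KV cache in fixed-size blocks, much like virtual memory pages, which reduces memory fragmentation. Continuous batching admits new requests into the running batch as soon as earlier ones finish, instead of waiting for the whole batch to complete. vLLM is a common choice for production deployments handling many concurrent requests.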
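
The benefit of continuous batching can be illustrated with a toy model. The sketch below is not vLLM's scheduler: it counts decode steps only, ignores prefill and memory limits, and the function names are hypothetical. It compares static batching, where each batch waits for its longest sequence, with continuous batching, where a freed slot is refilled immediately.

```python
import heapq


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest sequence ends."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled right away."""
    # Each heap entry is the step at which a slot becomes free.
    slots = [0] * batch_size
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)   # earliest slot to free up
        heapq.heappush(slots, free_at + n)
    return max(slots)


# Output lengths in tokens for eight requests (illustrative values).
lengths = [10, 100, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

In this toy workload the short requests no longer wait behind the long ones, so the same eight requests finish in 110 steps instead of 200.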

Installation and Basic Serving

pip install vllm

# Serve your fine-tuned model
vllm serve ./output/merged-model \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --dtype auto

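This starts an OpenAI-compatible API server. Existing code that calls GPT-4 through the OpenAI client can switch to your fine-tuned model by changing the base URL and the model name. By default, the model name is the path passed to vllm serve (the --served-model-name flag sets a friendlier one):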

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require auth by default
)

response = client.chat.completions.create(
    model="./output/merged-model",
    messages=[
        {"role": "system", "content": "You are a legal document analyst."},
        {"role": "user", "content": "Summarize this contract clause..."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

AWQ Models with vLLM

To cut weight memory to roughly a quarter of FP16, serve an AWQ-quantized (4-bit) model:

vllm serve ./output/merged-model-awq \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90

Key vLLM Configuration

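# --tensor-parallel-size: split the model across 2 GPUs
# --max-num-seqs: maximum concurrent sequences
# --max-num-batched-tokens: maximum tokens per batch
# --enable-prefix-caching: cache common prompt prefixes
# --gpu-memory-utilization: use up to 90% of GPU memory
vllm serve ./output/merged-model \
    --tensor-parallel-size 2 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.90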
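
These settings interact through the KV cache: whatever GPU memory remains after loading the weights bounds how many tokens can be cached across all running sequences. The following is a back-of-the-envelope sketch, not vLLM's own accounting. The model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) and the 8 GiB budget are assumptions chosen for illustration, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_gib, bytes_per_token):
    """How many tokens fit in a KV-cache budget given in GiB."""
    return int(kv_budget_gib * 1024**3 // bytes_per_token)


# Assumed shape of an 8B-class model with grouped-query attention.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
tokens = max_cached_tokens(kv_budget_gib=8, bytes_per_token=per_token)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{tokens} tokens fit, or {tokens // 4096} sequences at 4096 tokens each")
```

If the estimate shows far fewer full-length sequences than --max-num-seqs allows, the server can still admit that many requests, but only while they stay short; lowering --max-model-len or adding GPUs raises the ceiling.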

Text-Generation-Inference (TGI)

TGI is Hugging Face's production inference server. It integrates closely with the Hugging Face ecosystem and is designed to be deployed with Docker.

# Using Docker (recommended)
docker run --gpus all -p 8080:80 \
    -v "$(pwd)/output/merged-model:/data" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data \
    --max-input-length 2048 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4096

# Client usage
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize this contract clause: ...",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
        }
    }
)
print(response.json()["generated_text"])
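
The /generate endpoint takes raw text in "inputs" and does not apply a chat template, so a chat-tuned model usually needs the prompt formatted by hand. Here is a minimal sketch, assuming the model was fine-tuned with the Llama 3 style template used in the Ollama Modelfile later in this lesson and that the tokenizer adds the begin-of-text token itself. The helper names are hypothetical.

```python
def build_llama3_prompt(user, system=None):
    """Format one system/user turn and leave the assistant turn open."""
    parts = []
    if system:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>")
    # The model continues from here and should end its turn with <|eot_id|>.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def build_generate_payload(user, system=None, max_new_tokens=256, temperature=0.7):
    """JSON body for a TGI /generate request using the formatted prompt."""
    return {
        "inputs": build_llama3_prompt(user, system),
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "stop": ["<|eot_id|>"],  # stop at the end-of-turn marker
        },
    }


payload = build_generate_payload(
    "Summarize this contract clause: ...",
    system="You are a legal document analyst.",
)
print(payload["inputs"])
```

The resulting dictionary can be passed as the json argument of the requests.post call shown above. If your model was trained with a different template, the prompt builder has to change to match it.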

Ollama: Local and Small-Scale Deployment

Ollama is the easiest way to run models locally. It handles model management, quantization, and serving through a simple CLI.

Creating a Custom Model

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./output/gguf/model-q4_k_m.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

SYSTEM "You are a legal document analyst specializing in contract review."

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF

# Create the model
ollama create legal-assistant -f Modelfile

# Run it
ollama run legal-assistant "Summarize this NDA clause..."

Ollama API

Ollama exposes a REST API on port 11434:

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "legal-assistant",
        "messages": [
            {"role": "user", "content": "Summarize this clause..."}
        ],
        "stream": False,
    }
)
print(response.json()["message"]["content"])
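
The request above sets "stream" to False to get a single JSON object back. With streaming enabled, /api/chat instead returns newline-delimited JSON, one object per chunk, with a "done" flag on the last one. A small sketch of how those lines could be reassembled follows; the function name is hypothetical and the sample lines are illustrative, not captured server output.

```python
import json


def collect_ollama_stream(lines):
    """Join the message content from newline-delimited /api/chat chunks."""
    pieces = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        chunk = json.loads(raw)
        pieces.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(pieces)


# Illustrative chunks in the shape the streaming endpoint uses.
sample = [
    b'{"message": {"role": "assistant", "content": "The clause "}, "done": false}',
    b'{"message": {"role": "assistant", "content": "limits liability."}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_ollama_stream(sample))
```

With requests, the same function can consume response.iter_lines() from a post call made with stream=True and "stream": True in the JSON body.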

OpenAI-Compatible Endpoint

Ollama also provides an OpenAI-compatible API:

import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="legal-assistant",
    messages=[{"role": "user", "content": "Analyze this contract..."}],
)

llama.cpp: Edge and CPU Inference

For environments without GPUs or for edge deployment:

# Run the GGUF model directly
./llama-cli -m model-q4_k_m.gguf \
    -p "Summarize this legal clause:" \
    -n 256 \
    --temp 0.7 \
    -ngl 0  # 0 = CPU only, increase for GPU layers

For a server setup:

./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096
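
Because vLLM and Ollama both speak the OpenAI chat API, and llama-server also serves OpenAI-style routes under /v1, client code can treat the backend as configuration. The sketch below collects the endpoints used in this lesson in one place. The Backend class and get_backend helper are hypothetical, and the llama-server model name is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str
    api_key: str
    model: str


# Ports and model names follow the examples in this lesson.
BACKENDS = {
    "vllm": Backend("http://localhost:8000/v1", "not-needed", "./output/merged-model"),
    "ollama": Backend("http://localhost:11434/v1", "ollama", "legal-assistant"),
    # Placeholder model name for the llama-server instance started above.
    "llama-server": Backend("http://localhost:8080/v1", "not-needed", "model-q4_k_m.gguf"),
}


def get_backend(name):
    """Look up connection settings for a named inference backend."""
    try:
        return BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown backend {name!r}; expected one of: {known}") from None


backend = get_backend("ollama")
print(backend.base_url, backend.model)
```

A client is then built with openai.OpenAI(base_url=backend.base_url, api_key=backend.api_key), and backend.model is passed as the model argument, so switching engines does not touch the calling code.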

Building a FastAPI Wrapper

For custom API logic, wrap your inference engine in a FastAPI application:

from fastapi import FastAPI
from pydantic import BaseModel
import openai

app = FastAPI()

# Connect to your inference backend (vLLM, Ollama, etc.)
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful legal assistant."
    temperature: float = 0.7
    max_tokens: int = 512

class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: int

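@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    # Declared with plain `def` so FastAPI runs it in a worker thread;
    # the synchronous OpenAI client would otherwise block the event loop.
    response = client.chat.completions.create(
        # Must match the name the backend serves. For the vLLM server above
        # this is the path passed to `vllm serve`; for Ollama, "legal-assistant".
        model="./output/merged-model",
        messages=[
            {"role": "system", "content": request.system_prompt},
            {"role": "user", "content": request.message},
        ],
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )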

    return ChatResponse(
        response=response.choices[0].message.content,
        model=response.model,
        tokens_used=response.usage.total_tokens,
    )

# Start the server (assumes the code above is saved as api.py)
uvicorn api:app --host 0.0.0.0 --port 3000
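
One example of custom logic a wrapper can add is input sanitation before anything reaches the GPU. The sketch below uses only the standard library so it can be read on its own. The limits are hypothetical values to tune for your service, and the 0 to 2 temperature range is an assumption borrowed from the OpenAI API.

```python
from dataclasses import dataclass

MAX_TOKENS_CAP = 1024        # hypothetical per-request generation cap
MAX_MESSAGE_CHARS = 20_000   # hypothetical input size limit


@dataclass
class SanitizedRequest:
    message: str
    temperature: float
    max_tokens: int


def sanitize_request(message, temperature=0.7, max_tokens=512):
    """Reject unusable input and clamp sampling settings to safe ranges."""
    text = message.strip()
    if not text:
        raise ValueError("message must not be empty")
    if len(text) > MAX_MESSAGE_CHARS:
        raise ValueError("message is too long")
    temperature = min(max(float(temperature), 0.0), 2.0)
    max_tokens = min(max(int(max_tokens), 1), MAX_TOKENS_CAP)
    return SanitizedRequest(text, temperature, max_tokens)


print(sanitize_request("  Review this NDA clause.  ", temperature=5.0, max_tokens=999_999))
```

Inside the /chat handler, the sanitized values would replace the raw request fields, and a ValueError would typically be turned into an HTTP 422 response.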

Quantization for Production

Choosing the right quantization for deployment:

GPU with plenty of VRAM: Use AWQ 4-bit with vLLM. Best throughput.
GPU with limited VRAM: Use GGUF Q4_K_M with Ollama or llama.cpp with GPU offloading.
CPU only: Use GGUF Q4_K_M with llama.cpp. Slower but functional.
Edge devices: Use GGUF Q4_0 or Q3_K_S for minimum size.
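
The rules above can be written down as a small decision helper. This is a sketch under stated assumptions: "plenty of VRAM" is taken to mean at least twice the estimated 4-bit weight size, a hypothetical threshold that leaves room for the KV cache, and the weight estimate ignores quantization metadata and activations.

```python
def estimate_weight_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3


def choose_quantization(params_billion, vram_gib=0.0, edge=False):
    """Map hardware to the format and engine suggested in this lesson."""
    if edge:
        return "GGUF Q4_0 or Q3_K_S (llama.cpp)"
    if vram_gib <= 0:
        return "GGUF Q4_K_M (llama.cpp, CPU)"
    weights = estimate_weight_gib(params_billion, bits_per_weight=4.0)
    if vram_gib >= 2 * weights:  # hypothetical headroom threshold
        return "AWQ 4-bit (vLLM)"
    return "GGUF Q4_K_M (Ollama or llama.cpp with GPU offloading)"


print(f"7B at 4-bit: about {estimate_weight_gib(7, 4.0):.1f} GiB of weights")
print(choose_quantization(7, vram_gib=24))
print(choose_quantization(7, vram_gib=6))
print(choose_quantization(7))
```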

Monitoring Latency and Throughput

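Track median latency, tail (P95) latency, and throughput:

import math
import statistics
import time

import openai

# Assumes the Ollama endpoint from earlier; change base_url and model for vLLM.
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Replace with prompts that are representative of production traffic.
test_prompts = ["Summarize this contract clause..."] * 20

latencies = []

for prompt in test_prompts:
    start = time.perf_counter()  # monotonic, high-resolution timer
    response = client.chat.completions.create(
        model="legal-assistant",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    latencies.append(time.perf_counter() - start)

# Nearest-rank P95: the smallest sample that at least 95% of samples do not exceed
p95 = sorted(latencies)[max(0, math.ceil(0.95 * len(latencies)) - 1)]

print(f"Median latency: {statistics.median(latencies):.2f}s")
print(f"P95 latency: {p95:.2f}s")
# Requests are sent one at a time, so this is sequential throughput,
# not what the server sustains under concurrent load.
print(f"Sequential throughput: {len(latencies)/sum(latencies):.1f} requests/sec")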
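
A loop that sends one request at a time understates what a batching server can sustain. To measure throughput under load, requests have to overlap. The sketch below does this with a thread pool; run_load_test and its send_request parameter are hypothetical names, and the stand-in request function only sleeps so that the example runs without a server.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request, prompt):
    """Return the wall-clock latency of a single request in seconds."""
    start = time.perf_counter()
    send_request(prompt)
    return time.perf_counter() - start


def run_load_test(send_request, prompts, concurrency=8):
    """Send prompts with overlapping requests and summarize the timings."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda p: timed_call(send_request, p), prompts))
    wall = time.perf_counter() - wall_start
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest rank
    return {
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "throughput_rps": len(ordered) / wall,  # completed requests per wall second
    }


def fake_request(prompt):
    """Stand-in for a real API call; replace with a call to your client."""
    time.sleep(0.02)


print(run_load_test(fake_request, ["prompt"] * 16, concurrency=8))
```

Against a real server, send_request would wrap client.chat.completions.create. Raising concurrency until P95 latency becomes unacceptable gives a rough picture of usable capacity.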

Cost Optimization

Key strategies for reducing inference costs:

Quantize aggressively. Q4_K_M is usually sufficient and cuts weight memory by roughly 70% relative to FP16.
Use shorter system prompts. Fine-tuning bakes behavior in, so you need fewer prompt tokens.
Batch requests. vLLM's continuous batching handles this automatically.
Cache common prefixes. vLLM's prefix caching avoids recomputing shared system prompts.
Right-size your GPU. An A10G (24GB, roughly $1/hr on-demand on AWS at the time of writing; check current pricing) handles a 7B model at Q4 with room to spare.
Use spot instances. For workloads that tolerate interruption, such as batch or offline inference, spot instances can cut costs by 60-70%.
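
To compare these options on one scale, it helps to convert an hourly GPU price into a cost per million generated tokens. The sketch below shows the arithmetic. The hourly price and the 1,000 tokens/sec throughput are placeholder inputs, not measurements, and the function names are hypothetical.

```python
def cost_per_million_tokens(hourly_usd, tokens_per_second):
    """Dollars per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def spot_price(hourly_usd, discount):
    """Hourly price after a fractional spot discount (0.65 means 65% off)."""
    return hourly_usd * (1 - discount)


on_demand = 1.00   # placeholder hourly price in USD
throughput = 1000  # placeholder aggregate tokens/sec under batching

print(f"On-demand: ${cost_per_million_tokens(on_demand, throughput):.3f} per 1M tokens")
print(f"Spot (65% off): ${cost_per_million_tokens(spot_price(on_demand, 0.65), throughput):.3f} per 1M tokens")
```

Because cost scales inversely with throughput, anything that raises sustained tokens per second on the same hardware, such as batching or prefix caching, lowers the per-token cost directly.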

In the final lesson, we bring everything together in a complete capstone project — from dataset creation to deployed API.