Deployment and Inference
From Trained Model to Production API
Your model is fine-tuned, evaluated, merged, and exported. Now you need to serve it to users. The inference engine you choose determines your latency, throughput, cost, and operational complexity. This lesson covers the four main deployment paths and helps you choose the right one for your use case.
Inference Engine Comparison
| Engine | Best For | Throughput | Ease of Use | GPU Required |
|--------|----------|------------|-------------|--------------|
| vLLM | High-throughput production | Highest | Medium | Yes |
| TGI (HuggingFace) | HF ecosystem integration | High | Medium | Yes |
| Ollama | Local dev / small-scale | Medium | Easiest | Optional |
| llama.cpp | Edge / CPU inference | Low-Medium | Medium | No |
vLLM: High-Throughput Production
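vLLM uses PagedAttention and continuous batching to achieve some of the highest throughput among open-source inference engines. PagedAttention stores the KV cache in fixed-size blocks to reduce memory fragmentation, and continuous batching admits new requests into a running batch as soon as earlier ones finish. It is a common choice for production deployments handling many concurrent requests.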
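To see why continuous batching helps, here is a toy comparison of decode steps under static and continuous batching. It is a simplified model, not vLLM's scheduler: it ignores prefill, memory limits, and PagedAttention, and the function names and request lengths are made up for illustration.

```python
import heapq

def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    slots = []  # min-heap of the step at which each busy slot becomes free
    for length in output_lens:
        if len(slots) < batch_size:
            start = 0  # a slot is still empty
        else:
            start = heapq.heappop(slots)  # wait for the earliest slot to free up
        heapq.heappush(slots, start + length)
    return max(slots) if slots else 0

# Hypothetical output lengths (in tokens) for eight requests
lens = [10, 200, 15, 20, 180, 12, 25, 30]
print(static_batching_steps(lens, 4), continuous_batching_steps(lens, 4))  # 380 200
```

In the static case, short requests wait for the longest one in their batch. In the continuous case, their slots are reused, so the total is bounded by the longest-running slot.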
Installation and Basic Serving
pip install vllm
# Serve your fine-tuned model
vllm serve ./output/merged-model \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--dtype auto
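This starts an OpenAI-compatible API server. Existing code that calls GPT-4 through the OpenAI client can switch to your fine-tuned model by changing the base URL and the model name. By default, vLLM uses the path passed to vllm serve as the model name: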
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require auth by default
)

response = client.chat.completions.create(
    model="./output/merged-model",
    messages=[
        {"role": "system", "content": "You are a legal document analyst."},
        {"role": "user", "content": "Summarize this contract clause..."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
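For interactive use you will usually want tokens as they are generated rather than one final response. The sketch below assumes a client like the one above and uses the OpenAI client's `stream=True` option; the helper name `stream_reply` is hypothetical.

```python
def stream_reply(client, model, messages, **kwargs):
    """Yield text fragments from a streaming chat completion as they arrive."""
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True, **kwargs
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example a role-only chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage, assuming `client` points at the vLLM server started above:
# for piece in stream_reply(
#     client,
#     "./output/merged-model",
#     [{"role": "user", "content": "Summarize this contract clause..."}],
#     max_tokens=512,
# ):
#     print(piece, end="", flush=True)
```

Streaming does not reduce total generation time, but it cuts the time to first visible token, which is what users tend to notice.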
AWQ Models with vLLM
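AWQ 4-bit quantization shrinks the model weights substantially, which leaves more GPU memory for the KV cache and concurrent requests. To serve an AWQ-quantized model: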
vllm serve ./output/merged-model-awq \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.90
Key vLLM Configuration
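# --tensor-parallel-size 2: split the model across 2 GPUs
# --max-num-seqs 256: maximum concurrent sequences
# --max-num-batched-tokens 8192: maximum tokens per batch
# --enable-prefix-caching: cache common prefixes
# --gpu-memory-utilization 0.90: use up to 90% of GPU memory
vllm serve ./output/merged-model \
--tensor-parallel-size 2 \
--max-num-seqs 256 \
--max-num-batched-tokens 8192 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90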
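These settings interact through the KV cache: whatever GPU memory remains after loading the weights limits how many tokens can be in flight at once. The following back-of-envelope estimate is a sketch, not how vLLM sizes its cache (vLLM profiles memory at startup). The model shape, the 16 GB weight size, and the 1 GB overhead are assumptions for a Llama-3-8B-style model in FP16.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_budget(gpu_gb, utilization, weights_gb, per_token_bytes, overhead_gb=1.0):
    """Rough number of tokens the KV cache can hold across all sequences."""
    free_bytes = (gpu_gb * utilization - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache
per_token = kv_bytes_per_token(32, 8, 128)
budget = kv_token_budget(
    gpu_gb=24, utilization=0.90, weights_gb=16, per_token_bytes=per_token
)
print(f"{per_token} bytes/token, ~{budget} tokens, "
      f"~{budget // 4096} full-length 4096-token sequences")
```

Under these assumptions a 24 GB card holds only a handful of full-length sequences, which is why lowering --max-model-len or quantizing the weights can raise concurrency.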
Text-Generation-Inference (TGI)
TGI is Hugging Face's production inference server. It integrates closely with the HF ecosystem and is designed for Docker-based deployment.
# Using Docker (recommended)
# Mount the model with an absolute path; older Docker versions reject relative paths in -v
docker run --gpus all -p 8080:80 \
-v "$(pwd)/output/merged-model:/data" \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id /data \
--max-input-length 2048 \
--max-total-tokens 4096 \
--max-batch-prefill-tokens 4096
# Client usage (Python)
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize this contract clause: ...",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
        },
    },
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors before reading the body
print(response.json()["generated_text"])
Ollama: Local and Small-Scale Deployment
Ollama is the easiest way to run models locally. It handles model management, quantization, and serving through a simple CLI.
Creating a Custom Model
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./output/gguf/model-q4_k_m.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
SYSTEM "You are a legal document analyst specializing in contract review."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF
# Create the model
ollama create legal-assistant -f Modelfile
# Run it
ollama run legal-assistant "Summarize this NDA clause..."
Ollama API
Ollama exposes a REST API on port 11434:
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "legal-assistant",
        "messages": [
            {"role": "user", "content": "Summarize this clause..."}
        ],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
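With "stream" set to True, Ollama returns newline-delimited JSON instead, one object per fragment, with a final object marked "done". A minimal sketch for assembling such a stream follows; the helper name `collect_ollama_stream` is hypothetical.

```python
import json

def collect_ollama_stream(lines):
    """Join the content fragments of an Ollama /api/chat streaming response."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        parts.append(event.get("message", {}).get("content", ""))
        if event.get("done"):
            break  # the final object signals the end of the reply
    return "".join(parts)

# Usage with requests (iter_lines yields one JSON object per line):
# with requests.post(
#     "http://localhost:11434/api/chat",
#     json={"model": "legal-assistant",
#           "messages": [{"role": "user", "content": "Summarize this clause..."}],
#           "stream": True},
#     stream=True,
# ) as r:
#     print(collect_ollama_stream(r.iter_lines()))
```

In a real UI you would print each fragment as it arrives rather than joining them at the end.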
OpenAI-Compatible Endpoint
Ollama also provides an OpenAI-compatible API:
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="legal-assistant",
    messages=[{"role": "user", "content": "Analyze this contract..."}],
)
print(response.choices[0].message.content)
llama.cpp: Edge and CPU Inference
For environments without GPUs or for edge deployment:
# Run the GGUF model directly
./llama-cli -m model-q4_k_m.gguf \
-p "Summarize this legal clause:" \
-n 256 \
--temp 0.7 \
-ngl 0 # 0 = CPU only, increase for GPU layers
For a server setup:
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096
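llama-server also exposes an OpenAI-style chat endpoint, so a client needs nothing beyond the standard library. This sketch assumes the server command above (port 8080); the helper names are hypothetical.

```python
import json
import urllib.request

def build_chat_request(base_url, user_message, max_tokens=256, temperature=0.7):
    """Build a POST request for an OpenAI-style /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with llama-server running locally:
# req = build_chat_request("http://localhost:8080", "Summarize this legal clause: ...")
# print(send_chat(req))
```

CPU inference can be slow, so the generous timeout is deliberate.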
Building a FastAPI Wrapper
For custom API logic, wrap your inference engine in a FastAPI application:
from fastapi import FastAPI
from pydantic import BaseModel
import openai

app = FastAPI()

# Connect to your inference backend (vLLM, Ollama, etc.).
# The async client keeps the event loop free while a request is in flight.
client = openai.AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Must match the name the backend serves. For the vLLM command above this is
# the model path; with Ollama it would be "legal-assistant".
MODEL_NAME = "./output/merged-model"

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful legal assistant."
    temperature: float = 0.7
    max_tokens: int = 512

class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    response = await client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": request.system_prompt},
            {"role": "user", "content": request.message},
        ],
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )
    return ChatResponse(
        response=response.choices[0].message.content,
        model=response.model,
        tokens_used=response.usage.total_tokens,
    )

Save this as api.py and start the server with:
uvicorn api:app --host 0.0.0.0 --port 3000
Quantization for Production
Choosing the right quantization for deployment:
- GPU with plenty of VRAM: Use AWQ 4-bit with vLLM. Best throughput.
- GPU with limited VRAM: Use GGUF Q4_K_M with Ollama or llama.cpp with GPU offloading.
- CPU only: Use GGUF Q4_K_M with llama.cpp. Slower but functional.
- Edge devices: Use GGUF Q4_0 or Q3_K_S for minimum size.
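The choices above come down to whether the weights fit in memory with room left for the KV cache. The sketch below estimates weight size from bits per weight. The bits-per-weight figures are approximations (real file sizes vary by architecture), and the 1.5 GB headroom is an assumed placeholder.

```python
# Approximate bits per weight for each format (assumed values, not exact)
BITS_PER_WEIGHT = {"FP16": 16.0, "Q4_K_M": 4.85, "Q4_0": 4.55, "Q3_K_S": 3.5}

def weight_memory_gb(num_params, fmt):
    """Estimated size of the model weights in GB (1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

def formats_that_fit(num_params, available_gb, headroom_gb=1.5):
    """Formats whose weights fit with headroom for the KV cache and runtime."""
    return [
        fmt for fmt in BITS_PER_WEIGHT
        if weight_memory_gb(num_params, fmt) + headroom_gb <= available_gb
    ]

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:7s} {weight_memory_gb(7e9, fmt):5.1f} GB")

# Which formats fit a 7B model on a device with 8 GB of memory?
print(formats_that_fit(7e9, available_gb=8))
```

For a 7B model this puts FP16 at about 14 GB and the 4-bit formats near 4 GB, which is why 4-bit GGUF is the usual starting point on small GPUs and CPUs.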
Monitoring Latency and Throughput
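Track median latency, tail (P95) latency, and throughput. The script below measures all three by sending test prompts one at a time: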
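import statistics
import time

# Assumes `client` is the OpenAI-compatible client created earlier.
# Replace these placeholder prompts with a representative sample of real traffic.
test_prompts = [
    "Summarize this contract clause...",
    "Summarize this NDA clause...",
]

latencies = []
for prompt in test_prompts:
    start = time.perf_counter()  # monotonic clock, suited to timing
    response = client.chat.completions.create(
        model="legal-assistant",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    latencies.append(time.perf_counter() - start)

latencies.sort()
p95_index = min(int(len(latencies) * 0.95), len(latencies) - 1)
print(f"Median latency: {statistics.median(latencies):.2f}s")
print(f"P95 latency: {latencies[p95_index]:.2f}s")
# Requests are sent one at a time, so this is sequential throughput,
# not what the server sustains under concurrent load.
print(f"Sequential throughput: {len(latencies)/sum(latencies):.1f} requests/sec")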
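A sequential loop understates what a batching server can sustain. To measure throughput under load, send requests concurrently. The sketch below takes any callable that sends one request, so it is independent of the backend; `load_test` and its parameters are hypothetical names.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send, prompts, concurrency=8):
    """Send prompts concurrently; return latency percentiles and throughput."""
    def timed(prompt):
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, prompts))
    wall = time.perf_counter() - wall_start

    # 19 cut points at 5%, 10%, ..., 95%; requires at least two samples
    cuts = statistics.quantiles(latencies, n=20, method="inclusive")
    return {
        "p50": statistics.median(latencies),
        "p95": cuts[18],
        "requests_per_sec": len(latencies) / wall,
    }

# Usage, assuming `client` and `test_prompts` from the script above:
# send = lambda p: client.chat.completions.create(
#     model="legal-assistant",
#     messages=[{"role": "user", "content": p}],
#     max_tokens=256,
# )
# print(load_test(send, test_prompts, concurrency=16))
```

Throughput here is completed requests divided by wall-clock time, so it reflects the benefit of batching. Expect per-request latency to rise as concurrency increases.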
Cost Optimization
Key strategies for reducing inference costs:
- Quantize aggressively. Q4_K_M is usually sufficient and cuts weight memory by roughly 70% relative to FP16.
- Use shorter system prompts. Fine-tuning bakes behavior in, so you need fewer prompt tokens.
- Batch requests. vLLM's continuous batching handles this automatically.
- Cache common prefixes. vLLM's prefix caching avoids recomputing shared system prompts.
- Right-size your GPU. An A10G (24GB, about $1/hr on-demand on AWS as a g5.xlarge) handles a 7B model at Q4 with room to spare.
- Use spot instances. For workloads that tolerate interruption, such as batch jobs, spot instances can cut costs by 60-70%.
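To compare these strategies, convert the hourly GPU price and measured throughput into a cost per million generated tokens. The figures in this sketch are placeholders, not benchmarks; substitute your own instance price, measured tokens per second, and expected utilization.

```python
def cost_per_million_tokens(hourly_usd, tokens_per_sec, utilization=1.0):
    """USD per one million generated tokens at a sustained throughput.

    utilization is the fraction of each hour the GPU spends serving traffic.
    """
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1_000_000

# Placeholder numbers: $1.00/hr on-demand, $0.35/hr spot, 1000 tokens/sec, 50% busy
on_demand = cost_per_million_tokens(hourly_usd=1.00, tokens_per_sec=1000, utilization=0.5)
spot = cost_per_million_tokens(hourly_usd=0.35, tokens_per_sec=1000, utilization=0.5)
print(f"on-demand ${on_demand:.2f}/M tokens, spot ${spot:.2f}/M tokens")
```

Utilization often matters as much as the hourly price: an idle GPU costs the same as a busy one, so batching and right-sizing tend to pay off before switching instance types does.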
In the final lesson, we bring everything together in a complete capstone project, from dataset creation to deployed API.