The Model Is Trained — Now What?
So far you’ve fine-tuned the model, evaluated it, and you’re happy with the results. But a model sitting on your laptop isn’t useful to anyone. It needs to be served — meaning accessible through an API.
In this episode, we’ll walk through the entire path from saving the model to deploying it in a production environment.
Model Storage Formats
1. SafeTensors
The standard Hugging Face format. Safe, fast, and reliable:
# Save in safetensors format (default)
model.save_pretrained("./my-model", safe_serialization=True)
tokenizer.save_pretrained("./my-model")
# Approximate size (8B model):
# FP16: ~16 GB
# FP32: ~32 GB
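Loading the checkpoint back is symmetric. A minimal sketch using the standard transformers API (assumes transformers and accelerate are installed; the path and dtype match the example above):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    torch_dtype=torch.float16,  # matches the ~16 GB FP16 footprint above
    device_map="auto",          # requires accelerate
)
tokenizer = AutoTokenizer.from_pretrained("./my-model")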
2. GGUF — For llama.cpp
GGUF is the format used by llama.cpp and Ollama. Its advantage is that you can store the model in a quantized format:
# Save in GGUF format with Unsloth
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit quantization
)
# Quantization methods:
# "q4_k_m" -> ~4.5 GB (recommended — balance of quality and size)
# "q5_k_m" -> ~5.5 GB (higher quality)
# "q8_0" -> ~8.5 GB (close to original)
# "f16" -> ~16 GB (no quantization)
# Or with llama.cpp directly:
# python convert_hf_to_gguf.py ./my-model --outtype q4_k_m
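Once you have a GGUF file, you can smoke-test it locally with llama.cpp itself. A sketch (the binary names below follow recent llama.cpp releases, and the exact .gguf file name depends on what the export actually produced):
# Quick interactive test
# llama-cli -m ./my-model-gguf/my-model.Q4_K_M.gguf -p "Hello"
# Or serve over HTTP
# llama-server -m ./my-model-gguf/my-model.Q4_K_M.gguf --port 8080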
Choosing the Right Format
- SafeTensors: For serving with vLLM, TGI, or Hugging Face
- GGUF: For serving with llama.cpp, Ollama, or on CPU
Serving with vLLM
vLLM is one of the fastest and most widely used engines for serving language models. It uses PagedAttention to manage KV-cache memory efficiently and continuous batching to achieve very high throughput:
# Installation
# pip install vllm
# Start server
# vllm serve ./my-model --port 8000
# Or in Python:
from vllm import LLM, SamplingParams
# Load model
llm = LLM(
    model="./my-model",
    tensor_parallel_size=1,  # Number of GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Configure generation parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
)
# Generate response
prompts = ["Hello! I have a question..."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
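One caveat: llm.generate takes raw text, while instruction-tuned models expect their chat template. A minimal sketch that builds the prompt with the tokenizer's template first (assumes the saved tokenizer ships a chat template):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my-model")
messages = [{"role": "user", "content": "Explain LoRA."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a string, not token IDs
    add_generation_prompt=True,  # append the assistant-turn marker
)
outputs = llm.generate([prompt], sampling_params)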
API Server with vLLM
# Start an OpenAI-compatible server
# vllm serve ./my-model \
# --host 0.0.0.0 \
# --port 8000 \
# --api-key my-secret-key
# Use the API (compatible with OpenAI SDK)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="my-secret-key",
)

response = client.chat.completions.create(
    model="./my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain LoRA."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
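The same endpoint supports streaming, which matters for perceived latency in chat UIs. A sketch with the OpenAI SDK's streaming mode against the server above:
stream = client.chat.completions.create(
    model="./my-model",
    messages=[{"role": "user", "content": "Explain LoRA."}],
    stream=True,  # yield chunks as tokens are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)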
Building an API with FastAPI
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
import uvicorn
app = FastAPI(title="My Fine-tuned Model API")

# Load the model once at startup
llm = LLM(model="./my-model", gpu_memory_utilization=0.9)

def format_messages(messages: list[dict]) -> str:
    """Build a plain prompt from chat messages.
    In production, use the model's own chat template
    (e.g., tokenizer.apply_chat_template) instead.
    """
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    return "\n".join(lines) + "\nassistant: "

class ChatRequest(BaseModel):
    messages: list[dict]
    temperature: float = 0.7
    max_tokens: int = 512

class ChatResponse(BaseModel):
    response: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    # llm.generate is blocking, so this endpoint is sync;
    # FastAPI runs it in a worker thread.
    try:
        # Build prompt from messages
        prompt = format_messages(request.messages)
        sampling_params = SamplingParams(
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        outputs = llm.generate([prompt], sampling_params)
        response_text = outputs[0].outputs[0].text
        return ChatResponse(
            response=response_text,
            tokens_used=len(outputs[0].outputs[0].token_ids),
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "my-fine-tuned-model"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
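A quick smoke test once the server is running; the payload mirrors the ChatRequest schema above:
# curl http://localhost:8000/chat \
#   -H "Content-Type: application/json" \
#   -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'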
Deploy with Docker
# Dockerfile
"""
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python, pip, and curl (curl is needed by the container healthcheck)
RUN apt-get update && apt-get install -y python3 python3-pip curl
RUN pip3 install vllm fastapi uvicorn

# Copy model and code (server.py loads "./my-model" relative to /app)
COPY ./my-model /app/my-model
COPY ./server.py /app/server.py

# Expose port
EXPOSE 8000

# Run
CMD ["python3", "server.py"]
"""
# docker-compose.yml
"""
version: '3.8'
services:
  model-server:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      # Mounting over the baked-in copy lets you swap models without a rebuild
      - ./my-model:/app/my-model
    environment:
      - CUDA_VISIBLE_DEVICES=0
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
"""
# Build and run
# docker compose build
# docker compose up -d
# Check status
# docker compose logs -f
# curl http://localhost:8000/health
Deploy with Ollama (Simplest Method)
# If you've converted the model to GGUF format:
# 1. Create Modelfile
"""
# Modelfile
FROM ./my-model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
"""
# 2. Create Ollama model
# ollama create my-model -f Modelfile
# 3. Run
# ollama run my-model
# 4. API (automatic)
# curl http://localhost:11434/api/chat -d '{
# "model": "my-model",
# "messages": [{"role": "user", "content": "Hello"}]
# }'
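The same API is easy to call from Python. A minimal sketch with the requests library (non-streaming; Ollama streams by default, so we set "stream": false):
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["message"]["content"])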
Monitoring
import time
import logging
from collections import defaultdict
# Simplest form of monitoring
class ModelMonitor:
    def __init__(self):
        self.metrics = defaultdict(list)
        self.logger = logging.getLogger("model_monitor")

    def log_request(self, prompt, response, latency, tokens):
        self.metrics["latency"].append(latency)
        self.metrics["tokens"].append(tokens)
        self.metrics["requests"].append(time.time())
        # Alert on high latency
        if latency > 5.0:
            self.logger.warning(f"High latency: {latency:.2f}s")

    def get_stats(self):
        latencies = self.metrics["latency"]
        if not latencies:
            return {}
        return {
            "total_requests": len(latencies),
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],  # approximate p95
            "avg_tokens": sum(self.metrics["tokens"]) / len(self.metrics["tokens"]),
        }
monitor = ModelMonitor()
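To use it, time each generation and feed the numbers in. A short sketch (reusing llm and sampling_params from the vLLM example above):
start = time.perf_counter()
outputs = llm.generate(["Hello!"], sampling_params)
latency = time.perf_counter() - start

out = outputs[0].outputs[0]
monitor.log_request("Hello!", out.text, latency, len(out.token_ids))
print(monitor.get_stats())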
Cost Optimization
# Cost reduction strategies
# 1. Model quantization — reduce GPU usage
# GGUF Q4_K_M: ~4x memory reduction with good quality
# 2. Batching — process multiple requests simultaneously
# vLLM automatically does continuous batching
# 3. Smaller model — if quality is acceptable
# 8B vs 70B: GPU cost is approximately 10x lower
# 4. Caching — store repeated responses
import hashlib

response_cache = {}

def get_cached_response(prompt, temperature=0.0):
    """Cache lookups only for temperature=0 (deterministic output)."""
    if temperature > 0:
        return None
    key = hashlib.md5(prompt.encode()).hexdigest()
    return response_cache.get(key)

def set_cached_response(prompt, response):
    """Store a deterministic response for later reuse."""
    key = hashlib.md5(prompt.encode()).hexdigest()
    response_cache[key] = response
# 5. Auto-scaling — scale resources based on traffic
# With Kubernetes and GPU autoscaler
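Putting the cache in front of generation then looks like this; a sketch that reuses llm from the vLLM example and the two helper functions defined above:
def generate_with_cache(prompt, temperature=0.0):
    cached = get_cached_response(prompt, temperature)
    if cached is not None:
        return cached  # cache hit: no GPU work at all
    params = SamplingParams(temperature=temperature, max_tokens=512)
    response = llm.generate([prompt], params)[0].outputs[0].text
    if temperature == 0:
        set_cached_response(prompt, response)
    return response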
Deployment Checklist
- Model saved in the appropriate format (SafeTensors or GGUF)
- API endpoint works and returns correct responses
- Health check endpoint is in place
- Monitoring is active (latency, error rate, throughput)
- Rate limiting is enabled (a minimal sketch follows this list)
- Authentication/API key is set up
- Error handling is properly implemented
- Docker image is built and tested
- Backup of model and config exists
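For the rate-limiting item above, here is what a minimal version can look like in the FastAPI server from earlier: an in-memory sliding-window limiter per client IP. This is a sketch only; production deployments usually enforce limits at a gateway or with a dedicated library.
import time
from collections import deque
from fastapi import Depends, HTTPException, Request

WINDOW_SECONDS = 60  # window length
MAX_REQUESTS = 30    # allowed requests per window per IP
request_log: dict[str, deque] = {}

def check_rate_limit(request: Request):
    """Reject with 429 once a client exceeds the window quota."""
    ip = request.client.host
    now = time.time()
    timestamps = request_log.setdefault(ip, deque())
    # Evict timestamps older than the window
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    timestamps.append(now)

# Attach it to the endpoint as a dependency:
# @app.post("/chat", response_model=ChatResponse, dependencies=[Depends(check_rate_limit)])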
Series Summary
Congratulations! You’ve completed the entire fine-tuning journey:
- Episode 1: Understood what fine-tuning is and when it’s needed
- Episode 2: Learned the three stages of model training
- Episode 3: Mastered LoRA — efficient fine-tuning
- Episode 4: QLoRA — fine-tuning with a regular GPU
- Episode 5: Dataset preparation — the most critical part
- Episode 6: Unsloth — fast and practical tooling
- Episode 7: Evaluating your fine-tuned model
- Episode 8: DPO — simple alignment
- Episode 9: Persian language challenges
- Episode 10: From fine-tuning to deployment
Fine-tuning is a practical skill. Reading isn’t enough — you need to get your hands dirty with code. Pick a small model, build a simple dataset, and start. Making mistakes is part of the learning process.
You now have the knowledge and tools you need. Go ahead and build your model your way!