The Model Is Trained — Now What?
So far you’ve fine-tuned the model, evaluated it, and you’re happy with the results. But a model sitting on your laptop isn’t useful to anyone. It needs to be served — meaning accessible through an API.
In this episode, we’ll walk through the entire path from saving the model to deploying it in a production environment.
Model Storage Formats
1. SafeTensors
The standard Hugging Face format. Safe, fast, and reliable:
# Save in safetensors format (default)
model.save_pretrained("./my-model", safe_serialization=True)
tokenizer.save_pretrained("./my-model")
# Approximate size (8B model):
# FP16: ~16 GB
# FP32: ~32 GB
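Loading the checkpoint back is symmetric. A minimal sketch using the standard transformers API (assumes transformers and accelerate are installed; the path and dtype match the example above):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    torch_dtype=torch.float16,  # matches the ~16 GB FP16 footprint above
    device_map="auto",          # requires accelerate
)
tokenizer = AutoTokenizer.from_pretrained("./my-model")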
2. GGUF — For llama.cpp
GGUF is the format used by llama.cpp and Ollama. Its advantage is that you can store the model in a quantized format:
# Save in GGUF format with Unsloth
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit quantization
)
# Quantization methods:
# "q4_k_m" -> ~4.5 GB (recommended — balance of quality and size)
# "q5_k_m" -> ~5.5 GB (higher quality)
# "q8_0" -> ~8.5 GB (close to original)
# "f16" -> ~16 GB (no quantization)
# Or with llama.cpp directly:
# python convert_hf_to_gguf.py ./my-model --outtype q4_k_m
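Once you have a GGUF file, you can smoke-test it locally with llama.cpp itself. A sketch (the binary names below follow recent llama.cpp releases, and the exact .gguf file name depends on what the export actually produced):
# Quick interactive test
# llama-cli -m ./my-model-gguf/my-model.Q4_K_M.gguf -p "Hello"
# Or serve over HTTP
# llama-server -m ./my-model-gguf/my-model.Q4_K_M.gguf --port 8080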
Choosing the Right Format
- SafeTensors: For serving with vLLM, TGI, or Hugging Face
- GGUF: For serving with llama.cpp, Ollama, or on CPU
Serving with vLLM
vLLM is one of the fastest and most widely used engines for serving language models. It uses PagedAttention to manage KV-cache memory efficiently and continuous batching to achieve very high throughput:
# Installation
# pip install vllm
# Start server
# vllm serve ./my-model --port 8000
# Or in Python:
from vllm import LLM, SamplingParams
# Load model
llm = LLM(
    model="./my-model",
    tensor_parallel_size=1,  # Number of GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Configure generation parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
)
# Generate response
prompts = ["Hello! I have a question..."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
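One caveat: llm.generate takes raw text, while instruction-tuned models expect their chat template. A minimal sketch that builds the prompt with the tokenizer's template first (assumes the saved tokenizer ships a chat template):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my-model")
messages = [{"role": "user", "content": "Explain LoRA."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a string, not token IDs
    add_generation_prompt=True,  # append the assistant-turn marker
)
outputs = llm.generate([prompt], sampling_params)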
API Server with vLLM
# Start an OpenAI-compatible server
# vllm serve ./my-model \
# --host 0.0.0.0 \
# --port 8000 \
# --api-key my-secret-key
# Use the API (compatible with OpenAI SDK)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="my-secret-key",
)

response = client.chat.completions.create(
    model="./my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain LoRA."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
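The same endpoint supports streaming, which matters for perceived latency in chat UIs. A sketch with the OpenAI SDK's streaming mode against the server above:
stream = client.chat.completions.create(
    model="./my-model",
    messages=[{"role": "user", "content": "Explain LoRA."}],
    stream=True,  # yield chunks as tokens are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)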
Building an API with FastAPI
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
import uvicorn
app = FastAPI(title="My Fine-tuned Model API")

# Load the model once at startup
llm = LLM(model="./my-model", gpu_memory_utilization=0.9)

def format_messages(messages: list[dict]) -> str:
    """Build a plain prompt from chat messages.
    In production, use the model's own chat template
    (e.g., tokenizer.apply_chat_template) instead.
    """
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    return "\n".join(lines) + "\nassistant: "

class ChatRequest(BaseModel):
    messages: list[dict]
    temperature: float = 0.7
    max_tokens: int = 512

class ChatResponse(BaseModel):
    response: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    # llm.generate is blocking, so this endpoint is sync;
    # FastAPI runs it in a worker thread.
    try:
        # Build prompt from messages
        prompt = format_messages(request.messages)
        sampling_params = SamplingParams(
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        outputs = llm.generate([prompt], sampling_params)
        response_text = outputs[0].outputs[0].text
        return ChatResponse(
            response=response_text,
            tokens_used=len(outputs[0].outputs[0].token_ids),
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "my-fine-tuned-model"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
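A quick smoke test once the server is running; the payload mirrors the ChatRequest schema above:
# curl http://localhost:8000/chat \
#   -H "Content-Type: application/json" \
#   -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'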
Deploy with Docker
# Dockerfile
"""
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python, pip, and curl (curl is needed by the container healthcheck)
RUN apt-get update && apt-get install -y python3 python3-pip curl
RUN pip3 install vllm fastapi uvicorn

# Copy model and code (server.py loads "./my-model" relative to /app)
COPY ./my-model /app/my-model
COPY ./server.py /app/server.py

# Expose port
EXPOSE 8000

# Run
CMD ["python3", "server.py"]
"""
# docker-compose.yml
"""
version: '3.8'
services:
  model-server:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      # Mounting over the baked-in copy lets you swap models without a rebuild
      - ./my-model:/app/my-model
    environment:
      - CUDA_VISIBLE_DEVICES=0
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
"""
# Build and run
# docker compose build
# docker compose up -d
# Check status
# docker compose logs -f
# curl http://localhost:8000/health
Deploy with Ollama (Simplest Method)
# If you've converted the model to GGUF format:
# 1. Create Modelfile
"""
# Modelfile
FROM ./my-model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
"""
# 2. Create Ollama model
# ollama create my-model -f Modelfile
# 3. Run
# ollama run my-model
# 4. API (automatic)
# curl http://localhost:11434/api/chat -d '{
# "model": "my-model",
# "messages": [{"role": "user", "content": "Hello"}]
# }'
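The same API is easy to call from Python. A minimal sketch with the requests library (non-streaming; Ollama streams by default, so we set "stream": false):
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["message"]["content"])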
Monitoring
import time
import logging
from collections import defaultdict
# Simplest form of monitoring
class ModelMonitor:
    def __init__(self):
        self.metrics = defaultdict(list)
        self.logger = logging.getLogger("model_monitor")

    def log_request(self, prompt, response, latency, tokens):
        self.metrics["latency"].append(latency)
        self.metrics["tokens"].append(tokens)
        self.metrics["requests"].append(time.time())
        # Alert on high latency
        if latency > 5.0:
            self.logger.warning(f"High latency: {latency:.2f}s")

    def get_stats(self):
        latencies = self.metrics["latency"]
        if not latencies:
            return {}
        return {
            "total_requests": len(latencies),
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],  # approximate p95
            "avg_tokens": sum(self.metrics["tokens"]) / len(self.metrics["tokens"]),
        }
monitor = ModelMonitor()
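To use it, time each generation and feed the numbers in. A short sketch (reusing llm and sampling_params from the vLLM example above):
start = time.perf_counter()
outputs = llm.generate(["Hello!"], sampling_params)
latency = time.perf_counter() - start

out = outputs[0].outputs[0]
monitor.log_request("Hello!", out.text, latency, len(out.token_ids))
print(monitor.get_stats())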
Cost Optimization
# Cost reduction strategies
# 1. Model quantization — reduce GPU usage
# GGUF Q4_K_M: ~4x memory reduction with good quality
# 2. Batching — process multiple requests simultaneously
# vLLM automatically does continuous batching
# 3. Smaller model — if quality is acceptable
# 8B vs 70B: GPU cost is approximately 10x lower
# 4. Caching — store repeated responses
import hashlib

response_cache = {}

def get_cached_response(prompt, temperature=0.0):
    """Cache lookups only for temperature=0 (deterministic output)."""
    if temperature > 0:
        return None
    key = hashlib.md5(prompt.encode()).hexdigest()
    return response_cache.get(key)

def set_cached_response(prompt, response):
    """Store a deterministic response for later reuse."""
    key = hashlib.md5(prompt.encode()).hexdigest()
    response_cache[key] = response
# 5. Auto-scaling — scale resources based on traffic
# With Kubernetes and GPU autoscaler
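Putting the cache in front of generation then looks like this; a sketch that reuses llm from the vLLM example and the two helper functions defined above:
def generate_with_cache(prompt, temperature=0.0):
    cached = get_cached_response(prompt, temperature)
    if cached is not None:
        return cached  # cache hit: no GPU work at all
    params = SamplingParams(temperature=temperature, max_tokens=512)
    response = llm.generate([prompt], params)[0].outputs[0].text
    if temperature == 0:
        set_cached_response(prompt, response)
    return response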
Deployment Checklist
- Model saved in the appropriate format (SafeTensors or GGUF)
- API endpoint works and returns correct responses
- Health check endpoint is in place
- Monitoring is active (latency, error rate, throughput)
- Rate limiting is enabled (a minimal sketch follows this list)
- Authentication/API key is set up
- Error handling is properly implemented
- Docker image is built and tested
- Backup of model and config exists
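For the rate-limiting item above, here is what a minimal version can look like in the FastAPI server from earlier: an in-memory sliding-window limiter per client IP. This is a sketch only; production deployments usually enforce limits at a gateway or with a dedicated library.
import time
from collections import deque
from fastapi import Depends, HTTPException, Request

WINDOW_SECONDS = 60  # window length
MAX_REQUESTS = 30    # allowed requests per window per IP
request_log: dict[str, deque] = {}

def check_rate_limit(request: Request):
    """Reject with 429 once a client exceeds the window quota."""
    ip = request.client.host
    now = time.time()
    timestamps = request_log.setdefault(ip, deque())
    # Evict timestamps older than the window
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    timestamps.append(now)

# Attach it to the endpoint as a dependency:
# @app.post("/chat", response_model=ChatResponse, dependencies=[Depends(check_rate_limit)])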
Series Summary
Congratulations! You’ve completed the entire fine-tuning journey:
- Episode 1: Understood what fine-tuning is and when it’s needed
- Episode 2: Learned the three stages of model training
- Episode 3: Mastered LoRA — efficient fine-tuning
- Episode 4: QLoRA — fine-tuning with a regular GPU
- Episode 5: Dataset preparation — the most critical part
- Episode 6: Unsloth — fast and practical tooling
- Episode 7: Evaluating your fine-tuned model
- Episode 8: DPO — simple alignment
- Episode 9: Persian language challenges
- Episode 10: From fine-tuning to deployment
Fine-tuning is a practical skill. Reading isn’t enough — you need to get your hands dirty with code. Pick a small model, build a simple dataset, and start. Making mistakes is part of the learning process.
You now have the knowledge and tools you need. Go ahead and build your model your way!