From Fine-tune to Deploy: Serve Your Model

The model is trained. Now what?

So far you have fine-tuned the model, evaluated it, and you are happy with the result. But a model sitting on your laptop is no use to anyone. It has to be served, meaning it must be reachable through an API.

In this episode we walk the whole path from saving the model to deploying it in a real environment.

Model storage formats

1. SafeTensors

The standard Hugging Face format. It is safe (unlike pickle-based checkpoints, loading it does not execute arbitrary code), fast to load, and reliable:

# Save in safetensors format (the default)
model.save_pretrained("./my-model", safe_serialization=True)
tokenizer.save_pretrained("./my-model")

# Approximate size on disk (8B model):
# FP16: ~16 GB
# FP32: ~32 GB
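
The sizes above follow directly from parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic. The helper name weight_memory_gb is made up for illustration, and it counts weights only (no activations, no KV cache), using 1 GB = 1e9 bytes:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# An 8B-parameter model at two common precisions
for name, bits in [("FP32", 32), ("FP16", 16)]:
    print(f"{name}: ~{weight_memory_gb(8e9, bits):.0f} GB")
```

The same function gives a rough lower bound for quantized formats if you plug in their average bits per weight, but real files come out a bit larger because of metadata and tensors kept at higher precision.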

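2. GGUF, for llama.cpp

GGUF is the format that llama.cpp and Ollama use. Its advantage is that you can store the model already quantized:

# Save to GGUF with Unsloth
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit quantization
)

# Quantization methods (approximate sizes for an 8B model):
# "q4_k_m"  → ~4.5 GB (recommended: balances quality and size)
# "q5_k_m"  → ~5.5 GB (higher quality)
# "q8_0"    → ~8.5 GB (close to the original)
# "f16"     → ~16 GB (no quantization)

# Or with llama.cpp directly. convert_hf_to_gguf.py does not accept K-quant
# types such as q4_k_m for --outtype, so convert to f16 first, then quantize:
# python convert_hf_to_gguf.py ./my-model --outtype f16 --outfile my-model.f16.gguf
# ./llama-quantize my-model.f16.gguf my-model.Q4_K_M.gguf Q4_K_M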

Choosing the right format

SafeTensors: for serving with vLLM, TGI, or Hugging Face
GGUF: for serving with llama.cpp, Ollama, or on CPU

Serving with vLLM

vLLM is one of the fastest and most widely used tools for serving language models. It uses PagedAttention to manage KV-cache memory in pages, which gives it very high throughput:

# Install
# pip install vllm

# Start the server
# vllm serve ./my-model --port 8000

# Or from Python:
from vllm import LLM, SamplingParams

# Load the model
llm = LLM(
    model="./my-model",
    tensor_parallel_size=1,    # number of GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Set the generation parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
)

# Generate a response
prompts = ["Hi! I have a question..."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
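
To get a feel for what max_model_len costs in GPU memory, here is a rough sketch of the KV-cache arithmetic. The architecture numbers (32 layers, 8 KV heads, head dimension 128, FP16 cache) are an assumption matching a Llama-3-8B-style model; read the real values from your model's config.json. The helper name is made up for illustration:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per KV head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed architecture (Llama-3-8B-like); not read from the actual model
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence_mib = per_token * 4096 / 2**20  # one full 4096-token sequence

print(f"{per_token // 1024} KiB per token, {per_sequence_mib:.0f} MiB per full sequence")
```

Under these assumptions every concurrent full-length sequence needs about half a GiB on top of the weights, which is why lowering max_model_len lets the server keep more requests in flight.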

API server with vLLM

# Start an OpenAI-compatible server
# vllm serve ./my-model \
#   --host 0.0.0.0 \
#   --port 8000 \
#   --api-key my-secret-key

# Call the API (compatible with the OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="my-secret-key",
)

response = client.chat.completions.create(
    model="./my-model",
    messages=[
        {"role": "system", "content": "You are an intelligent assistant."},
        {"role": "user", "content": "Explain LoRA."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

Building an API with FastAPI

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
import uvicorn

app = FastAPI(title="My Fine-tuned Model API")

# Load the model once, when the module is imported
llm = LLM(model="./my-model", gpu_memory_utilization=0.9)

class ChatRequest(BaseModel):
    messages: list[dict]
    temperature: float = 0.7
    max_tokens: int = 512

class ChatResponse(BaseModel):
    response: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
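        # Build the prompt from the messages.
        # Note: llm.generate() is a blocking call, so this simple server
        # handles one request at a time.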
        prompt = format_messages(request.messages)
        
        sampling_params = SamplingParams(
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        
        outputs = llm.generate([prompt], sampling_params)
        response_text = outputs[0].outputs[0].text
        
        return ChatResponse(
            response=response_text,
            tokens_used=len(outputs[0].outputs[0].token_ids),
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "my-fine-tuned-model"}

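def format_messages(messages):
    """Convert messages to a prompt using the model's own chat template."""
    # A hand-written template only works if it matches the one used in training,
    # so reuse the template saved with the tokenizer (it must define one).
    tokenizer = llm.get_tokenizer()
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )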

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Deploying with Docker

# Dockerfile
"""
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python, curl (used by the compose healthcheck), and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install vllm fastapi uvicorn

# Copy the model and code. server.py loads "./my-model", which resolves to /app/my-model
COPY ./my-model /app/my-model
COPY ./server.py /app/server.py

# Expose the port
EXPOSE 8000

# Run
CMD ["python3", "server.py"]
"""

# docker-compose.yml
"""
version: '3.8'
services:
  model-server:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      # Mounted over the copy baked into the image, so you can swap models without a rebuild
      - ./my-model:/app/my-model
    environment:
      - CUDA_VISIBLE_DEVICES=0
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      # Loading the model takes a while; do not count failures during startup
      start_period: 120s
"""

# Build and run
# docker compose build
# docker compose up -d

# Check status
# docker compose logs -f
# curl http://localhost:8000/health
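
A large model can take a minute or more to load, so the first curl right after docker compose up often fails. Here is a small sketch of a script that polls the /health endpoint until it answers. The function names and the default of 30 attempts are assumptions for illustration; the probe is passed in as a callable so the retry loop can be tested without a running server:

```python
import time
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError
        return False


def wait_until_healthy(probe, attempts: int = 30, interval: float = 2.0,
                       sleep=time.sleep) -> bool:
    """Call `probe()` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # wait before the next try
    return False


# Example usage against the container started above:
# ok = wait_until_healthy(lambda: http_probe("http://localhost:8000/health"))
# print("server is up" if ok else "server did not become healthy")
```

Running something like this in a deploy script gives you a clear pass or fail before you send real traffic.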

Deploying with Ollama (the simplest route)

# If you converted the model to GGUF:

# 1. Create a Modelfile
"""
# Modelfile
FROM ./my-model.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are an intelligent Persian-speaking assistant."
"""

# 2. Create the Ollama model
# ollama create my-model -f Modelfile

# 3. Run it
# ollama run my-model

# 4. API (available automatically)
# curl http://localhost:11434/api/chat -d '{
#   "model": "my-model",
#   "messages": [{"role": "user", "content": "Hello"}]
# }'
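
The same call can be made from Python with nothing but the standard library. This is a sketch, assuming Ollama is listening on its default port and the model is named my-model as above. The function names are made up for illustration. Note that /api/chat streams its answer by default, so the payload sets "stream" to false to get a single JSON object back:

```python
import json
import urllib.request


def build_chat_request(model, messages, host="http://localhost:11434"):
    """Build (but do not send) a POST request for Ollama's /api/chat endpoint."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_chat(model, messages, host="http://localhost:11434"):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(model, messages, host)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Example usage (needs a running Ollama server):
# print(ollama_chat("my-model", [{"role": "user", "content": "Hello"}]))
```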

Monitoring

import time
import logging
from collections import defaultdict

# The simplest form of monitoring
class ModelMonitor:
    def __init__(self):
        self.metrics = defaultdict(list)
        self.logger = logging.getLogger("model_monitor")
    
    def log_request(self, prompt, response, latency, tokens):
        self.metrics["latency"].append(latency)
        self.metrics["tokens"].append(tokens)
        self.metrics["requests"].append(time.time())
        
        # Warn on high latency
        if latency > 5.0:
            self.logger.warning(f"High latency: {latency:.2f}s")
    
    def get_stats(self):
        latencies = self.metrics["latency"]
        if not latencies:
            return {}
        
        return {
            "total_requests": len(latencies),
            "avg_latency": sum(latencies) / len(latencies),
            "p95_latency": sorted(latencies)[int(len(latencies) * 0.95)],
            "avg_tokens": sum(self.metrics["tokens"]) / len(self.metrics["tokens"]),
        }

monitor = ModelMonitor()

# Using it in the API (this replaces the /chat handler defined earlier)
@app.post("/chat")
async def chat(request: ChatRequest):
    start = time.time()
    
    # ... generate the response ...
    
    latency = time.time() - start
    monitor.log_request(
        prompt=request.messages[-1]["content"],
        response=response_text,
        latency=latency,
        tokens=len(token_ids),
    )
    
    return ChatResponse(response=response_text, tokens_used=len(token_ids))

@app.get("/metrics")
async def metrics():
    return monitor.get_stats()
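
The monitor above keeps every latency forever, which grows without bound on a long-running server. A possible variation, sketched here under the assumption that a rolling window of the most recent 1000 requests is enough, uses a bounded deque and a nearest-rank percentile. The names percentile and RollingLatency are made up for illustration:

```python
import math
from collections import deque


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


class RollingLatency:
    """Keeps only the most recent `window` latencies."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def add(self, seconds):
        self.samples.append(seconds)

    def p95(self):
        return percentile(self.samples, 95)


tracker = RollingLatency(window=3)
for latency in (1.0, 2.0, 3.0, 10.0):
    tracker.add(latency)
print(tracker.p95())  # computed over the last three samples only
```

A rolling window also means the reported p95 reflects current behavior instead of being diluted by hours of old traffic.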

Cost optimization

# Strategies for cutting cost

# 1. Quantize the model to reduce GPU memory use
# GGUF Q4_K_M: roughly 3.5x less memory than FP16 (about 4.5 GB vs 16 GB for an 8B model), with good quality

# 2. Batching: process several requests at the same time
# vLLM does continuous batching automatically

# 3. A smaller model, if its quality is acceptable
# 8B vs 70B: GPU cost is roughly 10x lower

# 4. Caching: store the responses to repeated prompts
import hashlib

response_cache = {}

def get_cached_response(prompt, temperature=0.0):
    """Cache only when temperature=0 (deterministic output)."""
    if temperature > 0:
        return None
    
    key = hashlib.md5(prompt.encode()).hexdigest()
    return response_cache.get(key)

def cache_response(prompt, response, temperature=0.0):
    if temperature == 0:
        key = hashlib.md5(prompt.encode()).hexdigest()
        response_cache[key] = response

# 5. Auto-scaling: add and remove resources based on traffic
# With Kubernetes and a GPU autoscaler
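
The dictionary cache in strategy 4 never evicts anything and keys only on the prompt, so two requests with different max_tokens would share an entry. Here is one possible refinement, a sketch and not the only way to do it: a size-bounded LRU cache whose key includes the generation parameters. The class name ResponseCache and the limit of 1024 entries are assumptions for illustration:

```python
import hashlib
import json
from collections import OrderedDict


class ResponseCache:
    """LRU cache keyed on the prompt plus the generation parameters."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def make_key(prompt, **params):
        # sort_keys makes the key independent of keyword-argument order
        raw = json.dumps({"prompt": prompt, "params": params},
                         sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, prompt, **params):
        key = self.make_key(prompt, **params)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, prompt, response, **params):
        key = self.make_key(prompt, **params)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry


cache = ResponseCache(max_entries=2)
cache.put("What is LoRA?", "LoRA is ...", max_tokens=64)
print(cache.get("What is LoRA?", max_tokens=64))
```

As in the original version, this only makes sense for deterministic generation (temperature 0); the caller is expected to skip the cache otherwise.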

Deploy checklist

The model is saved in the right format (SafeTensors or GGUF)
The API endpoint works and returns correct answers
You have a health check endpoint
Monitoring is active (latency, error rate, throughput)
Rate limiting is active
You have authentication or an API key
Error handling is implemented properly
The Docker image is built and tested
You have a backup of the model and its config
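
Two checklist items, rate limiting and API keys, have no code in this episode, so here is a minimal framework-independent sketch of each. It assumes a single-process server and one shared key. The names TokenBucket and api_key_is_valid are made up for illustration, and in a real deployment you would more likely lean on a gateway or an existing middleware:

```python
import hmac
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill according to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def api_key_is_valid(provided, expected):
    """Constant-time comparison, so response timing does not leak the key."""
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


bucket = TokenBucket(rate=5, capacity=10)
print(bucket.allow(), api_key_is_valid("my-secret-key", "my-secret-key"))
```

In the FastAPI server you would call both checks at the top of the /chat handler and raise HTTPException with status 401 or 429 when one fails.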

Series wrap-up

Congratulations! You have covered the whole fine-tuning path:

Episode 1: You learned what fine-tuning is and when you need it
Episode 2: You got to know the three stages of model training
Episode 3: You learned LoRA, efficient fine-tuning
Episode 4: QLoRA, fine-tuning on an ordinary GPU
Episode 5: Dataset preparation, the most important part
Episode 6: Unsloth, a fast and practical tool
Episode 7: Evaluating the fine-tuned model
Episode 8: DPO, simple alignment
Episode 9: The challenges of the Persian language
Episode 10: From fine-tune to deploy

Fine-tuning is a practical skill. Reading is not enough; you have to get your hands on the code. Pick a small model, build a simple dataset, and start. Making mistakes is part of learning.

You now have the knowledge and the tools you need. Go and make your model your own!