model-deployment


Model Deployment

Deploy LLMs to production with optimal performance.

Quick Start

vLLM Server

```bash
# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1
```

Query (OpenAI-compatible)

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello, how are you?", "max_tokens": 100}'
```

Text Generation Inference (TGI)

```bash
# Docker deployment
docker run --gpus all -p 8080:80 \
    -v ./data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192
```

Query

```bash
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'
```

Ollama (Local Deployment)

```bash
# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'
```

Deployment Options Comparison

| Platform    | Ease | Cost           | Scale     | Latency | Best For              |
|-------------|------|----------------|-----------|---------|-----------------------|
| vLLM        | ⭐⭐   | Self-host      | High      | Low     | Production            |
| TGI         | ⭐⭐   | Self-host      | High      | Low     | HuggingFace ecosystem |
| Ollama      | ⭐⭐⭐  | Free           | Low       | Medium  | Local dev             |
| OpenAI      | ⭐⭐⭐  | Pay-per-token  | Very High | Low     | Quick start           |
| AWS Bedrock | ⭐⭐   | Pay-per-token  | Very High | Medium  | Enterprise            |
| Replicate   | ⭐⭐⭐  | Pay-per-second | High      | Medium  | Prototyping           |

FastAPI Inference Server

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load model at startup
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    text: str
    tokens_used: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True
            )

        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        new_tokens = len(outputs[0]) - len(inputs.input_ids[0])

        return GenerateResponse(
            text=generated,
            tokens_used=new_tokens
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": model_name}
```

Docker Deployment

Dockerfile

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Download model (or mount volume)
RUN python3 -c "from transformers import AutoModelForCausalLM; \
    AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Docker Compose

```yaml
version: '3.8'
services:
  llm-server:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: llm
        image: llm-inference:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
```

Optimization Techniques

Quantization

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"

# 8-bit quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization (QLoRA-style)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config_4bit,
    device_map="auto"
)
```
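To sanity-check whether a quantized model fits your GPU, note that weight memory scales linearly with bits per parameter. A rough, illustrative estimate (the helper name is ours; it ignores KV cache, activations, and quantization overhead):

```python
def estimate_weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate GPU memory for model weights alone."""
    return num_params * bits / 8 / 1e9

# A 7B-parameter model: 14.0 GB at fp16, 7.0 GB at 8-bit, 3.5 GB at 4-bit
for bits in (16, 8, 4):
    print(f"{bits}-bit: {estimate_weight_memory_gb(7e9, bits):.1f} GB")
```

This is why 4-bit quantization lets a 7B model run on a consumer 8 GB card while fp16 needs a 16 GB+ GPU.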

Batching

```python
# Dynamic batching with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# vLLM automatically batches concurrent requests
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
sampling = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(prompts, sampling)  # Batched execution
```
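When the serving stack does not batch for you, the client can still group requests itself. A minimal sketch of fixed-size client-side batching (illustrative helper, not a vLLM API):

```python
def chunk_prompts(prompts: list, batch_size: int) -> list:
    """Split a prompt list into consecutive batches of at most batch_size."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

batches = chunk_prompts(["p1", "p2", "p3", "p4", "p5"], 2)
# → [["p1", "p2"], ["p3", "p4"], ["p5"]]
```

Each batch can then be sent as one request, amortizing model invocation overhead across prompts.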

KV Cache Optimization

```python
# vLLM with PagedAttention
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=4096
)
```
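To see why `gpu_memory_utilization` matters, a back-of-the-envelope estimate of KV-cache size helps (illustrative helper; assumes Llama-2-7B's 32 layers, 32 KV heads, head dim 128, and fp16 storage):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache per sequence token: one K and one V vector per layer (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Llama-2-7B: 524288 bytes ≈ 0.5 MB per token, so a 4096-token context
# costs ~2 GB of KV cache per sequence on top of the weights.
per_token = kv_cache_bytes_per_token(32, 32, 128)
```

PagedAttention attacks exactly this cost by allocating the cache in small pages instead of one contiguous max-length block per sequence.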

Monitoring

```python
from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter('inference_requests_total', 'Total requests')
LATENCY = Histogram('inference_latency_seconds', 'Request latency')
TOKENS = Counter('tokens_generated_total', 'Total tokens generated')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    REQUEST_COUNT.inc()
    start = time.time()
    response = await call_next(request)
    LATENCY.observe(time.time() - start)
    return response

# Start metrics server
start_http_server(9090)
```

Best Practices

  1. Use quantization: 4-bit for dev, 8-bit for production
  2. Implement batching: vLLM/TGI handle this automatically
  3. Monitor everything: Latency, throughput, errors, GPU utilization
  4. Cache responses: For repeated queries
  5. Set timeouts: Prevent hung requests
  6. Load balance: Multiple replicas for high availability
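Response caching (practice 4) can start as simply as an in-memory TTL cache keyed on the prompt and sampling parameters. A minimal sketch (illustrative, single-process only; the class and its methods are ours, not a library API):

```python
import hashlib
import time

class ResponseCache:
    """TTL cache for repeated prompts. Not thread-safe or size-bounded."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (insert_time, text)

    def _key(self, prompt: str, max_tokens: int) -> str:
        # Include sampling parameters in the key so different settings don't collide
        return hashlib.sha256(f"{prompt}|{max_tokens}".encode()).hexdigest()

    def get(self, prompt: str, max_tokens: int):
        entry = self._store.get(self._key(prompt, max_tokens))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, prompt: str, max_tokens: int, text: str):
        self._store[self._key(prompt, max_tokens)] = (time.time(), text)
```

Check the cache before calling the model and store the result after; for multi-replica deployments a shared store such as Redis plays the same role.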

Error Handling & Retry

```python
from tenacity import retry, stop_after_attempt, wait_exponential
import httpx

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def call_inference_api(prompt: str):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/generate",
            json={"prompt": prompt},
            timeout=30.0
        )
        return response.json()
```

Troubleshooting

| Symptom            | Cause              | Solution              |
|--------------------|--------------------|-----------------------|
| OOM on load        | Model too large    | Use quantization      |
| High latency       | No batching        | Enable vLLM batching  |
| Connection refused | Server not started | Check health endpoint |
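For the "connection refused" row in particular: large models can take minutes to load, so deployments often poll the health endpoint before routing traffic. A small polling helper (illustrative sketch; the probe is any zero-argument callable, e.g. `lambda: requests.get("http://localhost:8000/health").ok`):

```python
import time

def wait_until_healthy(probe, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the model is still loading
        time.sleep(interval)
    return False
```

This mirrors what the Kubernetes liveness probe above does, but from the client side.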

Unit Test Template

```python
from fastapi.testclient import TestClient
from main import app  # the FastAPI inference server (main.py)

client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"
```