model-deployment


Model Deployment

Deploy LLMs to production with optimal performance.

Quick Start

vLLM Server

```bash
# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 1
```

Query (OpenAI-compatible)

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello, how are you?", "max_tokens": 100}'
```

Text Generation Inference (TGI)

```bash
# Docker deployment
docker run --gpus all -p 8080:80 \
    -v ./data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192
```

Query

```bash
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 100}}'
```

Ollama (Local Deployment)

```bash
# Install and run
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'
```

Deployment Options Comparison

| Platform    | Ease | Cost           | Scale     | Latency | Best For              |
|-------------|------|----------------|-----------|---------|-----------------------|
| vLLM        | ⭐⭐   | Self-host      | High      | Low     | Production            |
| TGI         | ⭐⭐   | Self-host      | High      | Low     | HuggingFace ecosystem |
| Ollama      | ⭐⭐⭐  | Free           | Low       | Medium  | Local dev             |
| OpenAI      | ⭐⭐⭐  | Pay-per-token  | Very High | Low     | Quick start           |
| AWS Bedrock | ⭐⭐   | Pay-per-token  | Very High | Medium  | Enterprise            |
| Replicate   | ⭐⭐⭐  | Pay-per-second | High      | Medium  | Prototyping           |

FastAPI Inference Server

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load model at startup
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    text: str
    tokens_used: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True
            )

        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        new_tokens = len(outputs[0]) - len(inputs.input_ids[0])

        return GenerateResponse(
            text=generated,
            tokens_used=new_tokens
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": model_name}
```

Docker Deployment

Dockerfile

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Download model (or mount volume)
RUN python3 -c "from transformers import AutoModelForCausalLM; \
    AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Docker Compose

```yaml
version: '3.8'
services:
  llm-server:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: llm
        image: llm-inference:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
```

Optimization Techniques

Quantization

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"

# 8-bit quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization (QLoRA-style)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config_4bit,
    device_map="auto"
)
```
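To sanity-check whether a quantized model fits your GPU, note that weight memory scales linearly with bits per parameter. A rough, illustrative estimate (the helper name is ours; it ignores KV cache, activations, and quantization overhead):

```python
def estimate_weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate GPU memory for model weights alone."""
    return num_params * bits / 8 / 1e9

# A 7B-parameter model: 14.0 GB at fp16, 7.0 GB at 8-bit, 3.5 GB at 4-bit
for bits in (16, 8, 4):
    print(f"{bits}-bit: {estimate_weight_memory_gb(7e9, bits):.1f} GB")
```

This is why 4-bit quantization lets a 7B model run on a consumer 8 GB card while fp16 needs a 16 GB+ GPU.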

Batching

```python
# Dynamic batching with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# vLLM automatically batches concurrent requests
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
sampling = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(prompts, sampling)  # Batched execution
```
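When the serving stack does not batch for you, the client can still group requests itself. A minimal sketch of fixed-size client-side batching (illustrative helper, not a vLLM API):

```python
def chunk_prompts(prompts: list, batch_size: int) -> list:
    """Split a prompt list into consecutive batches of at most batch_size."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

batches = chunk_prompts(["p1", "p2", "p3", "p4", "p5"], 2)
# → [["p1", "p2"], ["p3", "p4"], ["p5"]]
```

Each batch can then be sent as one request, amortizing model invocation overhead across prompts.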

KV Cache Optimization

```python
# vLLM with PagedAttention
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=4096
)
```
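To see why `gpu_memory_utilization` matters, a back-of-the-envelope estimate of KV-cache size helps (illustrative helper; assumes Llama-2-7B's 32 layers, 32 KV heads, head dim 128, and fp16 storage):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache per sequence token: one K and one V vector per layer (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Llama-2-7B: 524288 bytes ≈ 0.5 MB per token, so a 4096-token context
# costs ~2 GB of KV cache per sequence on top of the weights.
per_token = kv_cache_bytes_per_token(32, 32, 128)
```

PagedAttention attacks exactly this cost by allocating the cache in small pages instead of one contiguous max-length block per sequence.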

Monitoring

```python
from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter('inference_requests_total', 'Total requests')
LATENCY = Histogram('inference_latency_seconds', 'Request latency')
TOKENS = Counter('tokens_generated_total', 'Total tokens generated')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    REQUEST_COUNT.inc()
    start = time.time()
    response = await call_next(request)
    LATENCY.observe(time.time() - start)
    return response

# Start metrics server
start_http_server(9090)
```

Best Practices

  1. Use quantization: 4-bit for dev, 8-bit for production
  2. Implement batching: vLLM/TGI handle this automatically
  3. Monitor everything: Latency, throughput, errors, GPU utilization
  4. Cache responses: For repeated queries
  5. Set timeouts: Prevent hung requests
  6. Load balance: Multiple replicas for high availability
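Response caching (practice 4) can start as simply as an in-memory TTL cache keyed on the prompt and sampling parameters. A minimal sketch (illustrative, single-process only; the class and its methods are ours, not a library API):

```python
import hashlib
import time

class ResponseCache:
    """TTL cache for repeated prompts. Not thread-safe or size-bounded."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (insert_time, text)

    def _key(self, prompt: str, max_tokens: int) -> str:
        # Include sampling parameters in the key so different settings don't collide
        return hashlib.sha256(f"{prompt}|{max_tokens}".encode()).hexdigest()

    def get(self, prompt: str, max_tokens: int):
        entry = self._store.get(self._key(prompt, max_tokens))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, prompt: str, max_tokens: int, text: str):
        self._store[self._key(prompt, max_tokens)] = (time.time(), text)
```

Check the cache before calling the model and store the result after; for multi-replica deployments a shared store such as Redis plays the same role.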

Error Handling & Retry

```python
from tenacity import retry, stop_after_attempt, wait_exponential
import httpx

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def call_inference_api(prompt: str):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/generate",
            json={"prompt": prompt},
            timeout=30.0
        )
        return response.json()
```

Troubleshooting

| Symptom            | Cause              | Solution              |
|--------------------|--------------------|-----------------------|
| OOM on load        | Model too large    | Use quantization      |
| High latency       | No batching        | Enable vLLM batching  |
| Connection refused | Server not started | Check health endpoint |
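For the "connection refused" row in particular: large models can take minutes to load, so deployments often poll the health endpoint before routing traffic. A small polling helper (illustrative sketch; the probe is any zero-argument callable, e.g. `lambda: requests.get("http://localhost:8000/health").ok`):

```python
import time

def wait_until_healthy(probe, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the model is still loading
        time.sleep(interval)
    return False
```

This mirrors what the Kubernetes liveness probe above does, but from the client side.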

Unit Test Template

```python
from fastapi.testclient import TestClient
from main import app  # the FastAPI inference server (main.py)

client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"
```