model-serving

Model Serving

Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

When to Use

  • Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
  • Building AI APIs with streaming responses
  • Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
  • Implementing RAG pipelines with vector databases
  • Optimizing inference throughput and latency
  • Integrating LLM serving with frontend chat interfaces

Model Serving Selection

LLM Serving Engines

vLLM (Recommended Primary)
  • PagedAttention memory management (up to ~24x throughput vs. naive HuggingFace serving, per the vLLM paper)
  • Continuous batching for dynamic request handling
  • OpenAI-compatible API endpoints
  • Use for: Most self-hosted LLM deployments
TensorRT-LLM
  • Maximum GPU efficiency (often faster than vLLM on NVIDIA hardware; the gain is workload-dependent)
  • Requires model conversion and optimization
  • Use for: Production workloads needing absolute maximum throughput
Ollama
  • Local development without GPUs
  • Simple CLI interface
  • Use for: Prototyping, laptop development, educational purposes
Decision Framework:
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
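The decision framework above can be encoded as a small helper; a sketch only, with illustrative return strings:

```python
def pick_serving_engine(self_hosted: bool,
                        local_dev_only: bool = False,
                        need_max_gpu_efficiency: bool = False) -> str:
    """Encode the decision tree: managed API, Ollama, TensorRT-LLM, or vLLM."""
    if not self_hosted:
        return "managed API (no serving layer needed)"
    if local_dev_only:
        return "Ollama"
    if need_max_gpu_efficiency:
        return "TensorRT-LLM"
    return "vLLM"  # default for most self-hosted deployments

print(pick_serving_engine(self_hosted=True))                       # vLLM
print(pick_serving_engine(self_hosted=True, local_dev_only=True))  # Ollama
```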

ML Model Serving (Non-LLM)

BentoML (Recommended)
  • Python-native, easy deployment
  • Adaptive batching for throughput
  • Multi-framework support (scikit-learn, PyTorch, XGBoost)
  • Use for: Most traditional ML model deployments
Triton Inference Server
  • Multi-model serving on same GPU
  • Model ensembles (chain multiple models)
  • Use for: NVIDIA GPU optimization, serving 10+ models

LLM Orchestration

LangChain
  • General-purpose workflows, agents, RAG
  • 100+ integrations (LLMs, vector DBs, tools)
  • Use for: Most RAG and agent applications
LlamaIndex
  • RAG-focused with advanced retrieval strategies
  • 100+ data connectors (PDF, Notion, web)
  • Use for: Applications where RAG is the primary use case

Quick Start Examples

vLLM Server Setup

bash
# Install
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000

**Key Parameters:**
- `--dtype`: Model precision (auto, float16, bfloat16)
- `--max-model-len`: Context window size
- `--gpu-memory-utilization`: GPU memory fraction (0.8-0.95)
- `--tensor-parallel-size`: Number of GPUs for model parallelism

Streaming Responses (SSE Pattern)

Backend (FastAPI):
python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel
import json

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class ChatRequest(BaseModel):
    message: str  # read from the JSON body, matching the frontend's fetch()

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        stream = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": request.message}],
            stream=True,
            max_tokens=512
        )

        for chunk in stream:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"

        yield f"data: {json.dumps({'done': True})}\n\n"

    # Sync generator: FastAPI iterates it in a threadpool, so the blocking
    # OpenAI stream does not stall the event loop
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )
Frontend (React):
typescript
// Integration with ai-chat skill
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })

  const reader = response.body!.getReader()
  const decoder = new TextDecoder()
  let buffer = ''  // a network chunk may split an SSE event; buffer across reads

  while (true) {
    const { done, value } = await reader.read()
    if (done) break

    buffer += decoder.decode(value, { stream: true })
    const events = buffer.split('\n\n')
    buffer = events.pop() ?? ''  // keep the trailing partial event

    for (const event of events) {
      if (event.startsWith('data: ')) {
        const data = JSON.parse(event.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}

BentoML Service

python
import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']] for f in features])
        predictions = self.model.predict(X)
        classes = ['setosa', 'versicolor', 'virginica']
        # Map each integer prediction to its class label
        return [classes[int(p)] for p in predictions]

LangChain RAG Pipeline

python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks, embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query
result = qa_chain({"query": "What is PagedAttention?"})

Performance Optimization

GPU Memory Estimation

Rule of thumb for LLMs:
GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2
Examples:
  • Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
  • Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (exceeds 2× A100 80 GB; plan for 4-way tensor parallelism)
Quantization reduces memory:
  • FP16: 2 bytes per parameter
  • INT8: 1 byte per parameter (halves memory vs. FP16)
  • INT4: 0.5 bytes per parameter (quarters memory vs. FP16)
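The rule of thumb above, as a quick sketch:

```python
def estimate_gpu_memory_gb(params_billion: float,
                           bytes_per_param: float,
                           overhead: float = 1.2) -> float:
    """Parameters x precision x ~20% overhead for KV cache and activations."""
    return params_billion * bytes_per_param * overhead

print(estimate_gpu_memory_gb(8, 2.0))   # FP16 Llama-3.1-8B: 8 x 2 x 1.2 = 19.2 GB
print(estimate_gpu_memory_gb(70, 2.0))  # FP16 Llama-3.1-70B: 70 x 2 x 1.2 = 168 GB
print(estimate_gpu_memory_gb(8, 0.5))   # INT4 8B: 8 x 0.5 x 1.2 = 4.8 GB
```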

vLLM Optimization

bash
# Enable quantization (AWQ for 4-bit)
vllm serve TheBloke/Llama-3.1-8B-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.9

# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Batching Strategies

Continuous batching (vLLM default):
  • Dynamically adds/removes requests from batch
  • Higher throughput than static batching
  • No configuration needed
Adaptive batching (BentoML):
python
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000  # Wait max 1s to fill batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches requests
    return self.model.predict(np.array(inputs))
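The fill-or-timeout behaviour behind adaptive batching can be sketched with stdlib primitives (illustrative only; BentoML does this internally):

```python
import queue
import time

def collect_batch(q: "queue.Queue[str]",
                  max_batch_size: int,
                  max_latency_s: float) -> list[str]:
    """Collect requests until the batch is full or the latency budget expires."""
    batch = [q.get()]  # block for the first request
    deadline = time.monotonic() + max_latency_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    q.put(f"req-{i}")
print(collect_batch(q, max_batch_size=3, max_latency_s=0.05))  # ['req-0', 'req-1', 'req-2']
```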

Production Deployment

Kubernetes Deployment

See examples/k8s-vllm-deployment/ for complete YAML manifests.
Key considerations:
  • GPU resource requests: nvidia.com/gpu: 1
  • Health checks: /health endpoint
  • Horizontal Pod Autoscaling based on queue depth
  • Persistent volume for model caching

API Gateway Pattern

For production, add rate limiting, authentication, and monitoring:
Kong Configuration:
yaml
services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus

Monitoring Metrics

Essential LLM metrics:
  • Tokens per second (throughput)
  • Time to first token (TTFT)
  • Inter-token latency
  • GPU utilization and memory
  • Queue depth
Prometheus instrumentation:
python
import time

from prometheus_client import Counter, Histogram

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

@app.post("/chat")
async def chat(request):
    requests_total.inc()
    start = time.time()
    response = await generate(request)
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.time() - start)
    return response

Integration Patterns

Frontend (ai-chat) Integration

This skill provides the backend serving layer for the ai-chat skill.
Flow:
Frontend (React) → API Gateway → vLLM Server → GPU Inference
     ↑                                                  ↓
     └─────────── SSE Stream (tokens) ─────────────────┘
See references/streaming-sse.md for complete implementation patterns.

RAG with Vector Databases

Architecture:
User Query → LangChain
              ├─> Vector DB (Qdrant) for retrieval
              ├─> Combine context + query
              └─> LLM (vLLM) for generation
See references/langchain-orchestration.md and examples/langchain-rag-qdrant/ for complete patterns.

Async Inference Queue

For batch processing or non-real-time inference:
Client → API → Message Queue (Celery) → Workers (vLLM) → Results DB
Useful for:
  • Batch document processing
  • Background summarization
  • Non-interactive workflows
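The flow above, sketched with a stdlib queue and a worker thread as a stand-in for Celery; call_vllm() is a hypothetical placeholder for the real vLLM HTTP call:

```python
import queue
import threading

jobs: "queue.Queue[tuple[int, str]]" = queue.Queue()
results: dict = {}  # stand-in for the results DB

def call_vllm(prompt: str) -> str:
    # Placeholder: in practice, call the vLLM OpenAI-compatible endpoint
    return f"[completion for: {prompt}]"

def worker() -> None:
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = call_vllm(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# API side: enqueue, then poll (or notify) for the result
jobs.put((1, "Summarize this document"))
jobs.join()
print(results[1])  # [completion for: Summarize this document]
```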

Benchmarking

Use scripts/benchmark_inference.py to measure the deployment:
bash
python scripts/benchmark_inference.py \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --concurrency 32 \
  --requests 1000
Outputs:
  • Requests per second
  • P50/P95/P99 latency
  • Tokens per second
  • GPU memory usage
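The latency percentiles in that output can be reproduced from raw per-request latencies with the stdlib (illustrative data):

```python
import statistics

# Example per-request latencies in milliseconds (illustrative)
latencies_ms = [12, 15, 11, 90, 14, 13, 200, 16, 12, 15]

# quantiles(n=100) returns the 99 percentile cut points
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.1f}ms P95={p95:.1f}ms P99={p99:.1f}ms")
```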

Bundled Resources

Detailed Guides:
  • references/vllm.md - vLLM setup, PagedAttention, optimization
  • references/tgi.md - Text Generation Inference patterns
  • references/bentoml.md - BentoML deployment patterns
  • references/langchain-orchestration.md - LangChain RAG and agents
  • references/inference-optimization.md - Quantization, batching, GPU tuning
Working Examples:
  • examples/vllm-serving/ - Complete vLLM + FastAPI streaming setup
  • examples/ollama-local/ - Local development with Ollama
  • examples/langchain-agents/ - LangChain agent patterns
Utility Scripts:
  • scripts/benchmark_inference.py - Throughput and latency benchmarking
  • scripts/validate_model_config.py - Validate deployment configurations

Common Patterns

Migration from OpenAI API

vLLM provides OpenAI-compatible endpoints for easy migration:
python
# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (vLLM): same client, different base_url
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Same API calls work!
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

Multi-Model Serving

Route requests to different models based on task:
python
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",  # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct", # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"    # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING[task]
    # Route to appropriate vLLM instance
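One way the routing stub above could be completed, assuming each model runs as a separate vLLM instance; the base URLs are illustrative assumptions:

```python
# Hypothetical per-task routing table: model name + base URL of its vLLM instance
MODEL_ENDPOINTS = {
    "small": ("meta-llama/Llama-3.1-8B-Instruct", "http://vllm-8b:8000/v1"),
    "large": ("meta-llama/Llama-3.1-70B-Instruct", "http://vllm-70b:8000/v1"),
    "code":  ("codellama/CodeLlama-34b-Instruct", "http://vllm-code:8000/v1"),
}

def resolve(task: str) -> tuple:
    # Unknown task types fall back to the cheap model
    return MODEL_ENDPOINTS.get(task, MODEL_ENDPOINTS["small"])

model, base_url = resolve("code")
print(model)  # codellama/CodeLlama-34b-Instruct
```

The handler would then build an OpenAI client against base_url and forward the request.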

Cost Optimization

Track token usage:
python
import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float):
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # $5 per 1M tokens
self_hosted_cost = 0  # marginal token cost is ~0; you pay a fixed GPU cost instead
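To turn that comparison into a decision, a rough break-even sketch (all numbers are illustrative assumptions):

```python
gpu_cost_per_hour = 2.0          # assumed hourly cost of a rented GPU
api_price_per_1k_tokens = 0.005  # $5 per 1M tokens

def breakeven_tokens_per_hour() -> float:
    """Token volume at which a dedicated GPU matches the per-token API bill."""
    return gpu_cost_per_hour / api_price_per_1k_tokens * 1000

print(f"{breakeven_tokens_per_hour():,.0f} tokens/hour")  # 400,000 tokens/hour
```

Above that sustained volume, self-hosting wins on cost; below it, the managed API does.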

Troubleshooting

Out of GPU memory:
  • Reduce --max-model-len
  • Lower --gpu-memory-utilization (try 0.8)
  • Enable quantization (--quantization awq)
  • Use smaller model variant
Low throughput:
  • Increase --gpu-memory-utilization (try 0.95)
  • Enable continuous batching (vLLM default)
  • Check GPU utilization (should be >80%)
  • Consider tensor parallelism for multi-GPU
High latency:
  • Reduce batch size if using static batching
  • Check network latency to GPU server
  • Profile with scripts/benchmark_inference.py

Next Steps

  1. Local Development: Start with examples/ollama-local/ for GPU-free testing
  2. Production Setup: Deploy vLLM with examples/vllm-serving/
  3. RAG Integration: Add vector DB with examples/langchain-rag-qdrant/
  4. Kubernetes: Scale with examples/k8s-vllm-deployment/
  5. Monitoring: Add metrics with Prometheus and Grafana