# Model Serving

## Purpose
Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.
## When to Use
- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces
## Model Serving Selection
### LLM Serving Engines
**vLLM (Recommended Primary)**
- PagedAttention memory management (20-30x throughput improvement)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints
- Use for: Most self-hosted LLM deployments

**TensorRT-LLM**
- Maximum GPU efficiency (2-8x faster than vLLM)
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput

**Ollama**
- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes
**Decision Framework:**

```
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
```
### ML Model Serving (Non-LLM)
**BentoML (Recommended)**
- Python-native, easy deployment
- Adaptive batching for throughput
- Multi-framework support (scikit-learn, PyTorch, XGBoost)
- Use for: Most traditional ML model deployments

**Triton Inference Server**
- Multi-model serving on the same GPU
- Model ensembles (chain multiple models)
- Use for: NVIDIA GPU optimization, serving 10+ models
### LLM Orchestration
**LangChain**
- General-purpose workflows, agents, RAG
- 100+ integrations (LLMs, vector DBs, tools)
- Use for: Most RAG and agent applications

**LlamaIndex**
- RAG-focused with advanced retrieval strategies
- 100+ data connectors (PDF, Notion, web)
- Use for: Applications where RAG is the primary use case
## Quick Start Examples
### vLLM Server Setup
```bash
# Install
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --dtype auto \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9 \
    --port 8000
```
**Key Parameters:**
- `--dtype`: Model precision (`auto`, `float16`, `bfloat16`)
- `--max-model-len`: Context window size
- `--gpu-memory-utilization`: GPU memory fraction (0.8-0.95)
- `--tensor-parallel-size`: Number of GPUs for model parallelism
### Streaming Responses (SSE Pattern)
**Backend (FastAPI):**

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

@app.post("/chat/stream")
async def chat_stream(message: str):
    def generate():  # sync generator: FastAPI runs it in a threadpool
        stream = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": message}],
            stream=True,
            max_tokens=512,
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'done': True})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"},
    )
```

**Frontend (React):**
```typescript
// Integration with ai-chat skill; setResponse is the React state setter
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })
  const reader = response.body!.getReader()
  const decoder = new TextDecoder()

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    const chunk = decoder.decode(value)
    const lines = chunk.split('\n\n')
    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}
```
### BentoML Service
```python
import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        labels = ['setosa', 'versicolor', 'virginica']
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']] for f in features])
        predictions = self.model.predict(X)
        return [labels[int(p)] for p in predictions]
```
### LangChain RAG Pipeline
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents (`documents` loaded earlier with a document loader)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs",
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)

# Query
result = qa_chain({"query": "What is PagedAttention?"})
```

## Performance Optimization
### GPU Memory Estimation
**Rule of thumb for LLMs:**

```
GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2
```

Examples:
- Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
- Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (requires 2-4 A100s)

Quantization reduces memory:
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter (2x memory reduction)
- INT4: 0.5 bytes per parameter (4x memory reduction)
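The rule of thumb above is easy to encode as a helper; `estimate_gpu_memory_gb` is a hypothetical name for illustration, not part of vLLM or any serving library:

```python
# Rule-of-thumb LLM serving memory: weights × precision × ~1.2 overhead
# (the overhead covers KV cache and activations, so treat it as a floor,
# not an exact figure).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_gpu_memory_gb(params_billion: float, precision: str = "fp16",
                           overhead: float = 1.2) -> float:
    return params_billion * BYTES_PER_PARAM[precision] * overhead

print(estimate_gpu_memory_gb(8))          # Llama-3.1-8B, FP16  -> ~19.2 GB
print(estimate_gpu_memory_gb(70))         # Llama-3.1-70B, FP16 -> ~168 GB
print(estimate_gpu_memory_gb(8, "int4"))  # 4-bit quantized 8B  -> ~4.8 GB
```

In practice, longer `--max-model-len` values inflate the KV-cache share, so leave headroom beyond the estimate.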
### vLLM Optimization
```bash
# Enable quantization (AWQ for 4-bit)
vllm serve TheBloke/Llama-3.1-8B-AWQ \
    --quantization awq \
    --gpu-memory-utilization 0.9

# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9
```

### Batching Strategies
**Continuous batching (vLLM default):**
- Dynamically adds/removes requests from the batch
- Higher throughput than static batching
- No configuration needed

**Adaptive batching (BentoML):**

```python
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000,  # Wait max 1s to fill batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches concurrent requests
    return self.model.predict(np.array(inputs))
```
## Production Deployment
### Kubernetes Deployment
See `examples/k8s-vllm-deployment/` for complete YAML manifests.

Key considerations:
- GPU resource requests: `nvidia.com/gpu: 1`
- Health checks: `/health` endpoint
- Horizontal Pod Autoscaling based on queue depth
- Persistent volume for model caching
### API Gateway Pattern
For production, add rate limiting, authentication, and monitoring.

**Kong configuration:**

```yaml
services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus
```

### Monitoring Metrics
**Essential LLM metrics:**
- Tokens per second (throughput)
- Time to first token (TTFT)
- Inter-token latency
- GPU utilization and memory
- Queue depth

**Prometheus instrumentation:**

```python
import time
from prometheus_client import Counter, Histogram

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

# `app` and `generate` as in the FastAPI serving example
@app.post("/chat")
async def chat(request):
    requests_total.inc()
    start = time.time()
    response = await generate(request)
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.time() - start)
    return response
```
## Integration Patterns
### Frontend (ai-chat) Integration
This skill provides the backend serving layer for the `ai-chat` skill.

Flow:

```
Frontend (React) → API Gateway → vLLM Server → GPU Inference
        ↑                                          ↓
        └─────────── SSE Stream (tokens) ──────────┘
```

See `references/streaming-sse.md` for complete implementation patterns.

### RAG with Vector Databases
Architecture:

```
User Query → LangChain
              ├─> Vector DB (Qdrant) for retrieval
              ├─> Combine context + query
              └─> LLM (vLLM) for generation
```

See `references/langchain-orchestration.md` and `examples/langchain-rag-qdrant/` for complete patterns.

### Async Inference Queue
For batch processing or non-real-time inference:

```
Client → API → Message Queue (Celery) → Workers (vLLM) → Results DB
```

Useful for:
- Batch document processing
- Background summarization
- Non-interactive workflows
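The queue pattern can be sketched in-process with the standard library's `queue.Queue` standing in for a real broker such as Celery over Redis; the worker function and job shape here are illustrative assumptions, and the "inference" is faked with a string transform:

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
results: dict = {}  # stand-in for a results DB

def worker():
    # In production this loop would be a Celery worker calling the vLLM
    # HTTP endpoint; here we fake inference with a trivial transform.
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            break
        job_id, prompt = job
        results[job_id] = f"summary of: {prompt}"  # fake model output
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# API side: enqueue and return immediately; the client polls for the result.
jobs.put(("job-1", "long document text..."))
jobs.join()
print(results["job-1"])  # summary of: long document text...
```

The key property the real system shares with this sketch: the API returns as soon as the job is enqueued, and completion is observed asynchronously via the results store.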
## Benchmarking
Use `scripts/benchmark_inference.py` to measure the deployment:

```bash
python scripts/benchmark_inference.py \
    --endpoint http://localhost:8000/v1/chat/completions \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --concurrency 32 \
    --requests 1000
```

Outputs:
- Requests per second
- P50/P95/P99 latency
- Tokens per second
- GPU memory usage
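The latency percentiles the benchmark reports can be computed from raw per-request timings; this is a generic sketch using the standard library, not the actual logic of `scripts/benchmark_inference.py`:

```python
import statistics

def latency_percentiles(latencies_ms: list) -> dict:
    # quantiles(n=100) returns the 99 cut points P1..P99 (exclusive method)
    q = statistics.quantiles(latencies_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Dummy latencies of 1..100 ms; P50 lands midway between 50 and 51.
lat = [float(x) for x in range(1, 101)]
print(latency_percentiles(lat))
```

P95 and P99 are far more sensitive to stragglers than the mean, which is why they are the headline numbers for interactive serving.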
## Bundled Resources
**Detailed Guides:**
- `references/vllm.md` - vLLM setup, PagedAttention, optimization
- `references/tgi.md` - Text Generation Inference patterns
- `references/bentoml.md` - BentoML deployment patterns
- `references/langchain-orchestration.md` - LangChain RAG and agents
- `references/inference-optimization.md` - Quantization, batching, GPU tuning

**Working Examples:**
- `examples/vllm-serving/` - Complete vLLM + FastAPI streaming setup
- `examples/ollama-local/` - Local development with Ollama
- `examples/langchain-agents/` - LangChain agent patterns

**Utility Scripts:**
- `scripts/benchmark_inference.py` - Throughput and latency benchmarking
- `scripts/validate_model_config.py` - Validate deployment configurations
## Common Patterns
### Migration from OpenAI API
vLLM provides OpenAI-compatible endpoints for easy migration:

```python
# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (vLLM): same client, different base_url
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Same API calls work!
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
```

### Multi-Model Serving
Route requests to different models based on task:

```python
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",   # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct",  # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"     # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING[task]
    # Route to appropriate vLLM instance
```

### Cost Optimization
Track token usage:

```python
import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float) -> float:
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # $5 per 1M tokens
self_hosted_cost = 0  # Fixed GPU cost, unlimited tokens
```

## Troubleshooting
**Out of GPU memory:**
- Reduce `--max-model-len`
- Lower `--gpu-memory-utilization` (try 0.8)
- Enable quantization (`--quantization awq`)
- Use a smaller model variant

**Low throughput:**
- Increase `--gpu-memory-utilization` (try 0.95)
- Enable continuous batching (vLLM default)
- Check GPU utilization (should be >80%)
- Consider tensor parallelism for multi-GPU

**High latency:**
- Reduce batch size if using static batching
- Check network latency to the GPU server
- Profile with `scripts/benchmark_inference.py`
## Next Steps
- **Local Development**: Start with `examples/ollama-local/` for GPU-free testing
- **Production Setup**: Deploy vLLM with `examples/vllm-serving/`
- **RAG Integration**: Add a vector DB with `examples/langchain-rag-qdrant/`
- **Kubernetes**: Scale with `examples/k8s-vllm-deployment/`
- **Monitoring**: Add metrics with Prometheus and Grafana