model-serving

Model Serving

Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

When to Use

  • Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
  • Building AI APIs with streaming responses
  • Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
  • Implementing RAG pipelines with vector databases
  • Optimizing inference throughput and latency
  • Integrating LLM serving with frontend chat interfaces

Model Serving Selection

LLM Serving Engines

vLLM (Recommended Primary)
  • PagedAttention memory management (up to ~24x throughput vs. naive HuggingFace serving, per the vLLM paper)
  • Continuous batching for dynamic request handling
  • OpenAI-compatible API endpoints
  • Use for: Most self-hosted LLM deployments
TensorRT-LLM
  • Maximum GPU efficiency (often faster than vLLM on NVIDIA hardware; the gain is workload-dependent)
  • Requires model conversion and optimization
  • Use for: Production workloads needing absolute maximum throughput
Ollama
  • Local development without GPUs
  • Simple CLI interface
  • Use for: Prototyping, laptop development, educational purposes
Decision Framework:
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
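The decision framework above can be encoded as a small helper; a sketch only, with illustrative return strings:

```python
def pick_serving_engine(self_hosted: bool,
                        local_dev_only: bool = False,
                        need_max_gpu_efficiency: bool = False) -> str:
    """Encode the decision tree: managed API, Ollama, TensorRT-LLM, or vLLM."""
    if not self_hosted:
        return "managed API (no serving layer needed)"
    if local_dev_only:
        return "Ollama"
    if need_max_gpu_efficiency:
        return "TensorRT-LLM"
    return "vLLM"  # default for most self-hosted deployments

print(pick_serving_engine(self_hosted=True))                       # vLLM
print(pick_serving_engine(self_hosted=True, local_dev_only=True))  # Ollama
```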

ML Model Serving (Non-LLM)

BentoML (Recommended)
  • Python-native, easy deployment
  • Adaptive batching for throughput
  • Multi-framework support (scikit-learn, PyTorch, XGBoost)
  • Use for: Most traditional ML model deployments
Triton Inference Server
  • Multi-model serving on same GPU
  • Model ensembles (chain multiple models)
  • Use for: NVIDIA GPU optimization, serving 10+ models

LLM Orchestration

LangChain
  • General-purpose workflows, agents, RAG
  • 100+ integrations (LLMs, vector DBs, tools)
  • Use for: Most RAG and agent applications
LlamaIndex
  • RAG-focused with advanced retrieval strategies
  • 100+ data connectors (PDF, Notion, web)
  • Use for: Applications where RAG is the primary use case

Quick Start Examples

vLLM Server Setup

bash
# Install
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000

**Key Parameters:**
- `--dtype`: Model precision (auto, float16, bfloat16)
- `--max-model-len`: Context window size
- `--gpu-memory-utilization`: GPU memory fraction (0.8-0.95)
- `--tensor-parallel-size`: Number of GPUs for model parallelism

Streaming Responses (SSE Pattern)

Backend (FastAPI):
python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel
import json

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class ChatRequest(BaseModel):
    message: str  # read from the JSON body, matching the frontend's fetch()

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        stream = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": request.message}],
            stream=True,
            max_tokens=512
        )

        for chunk in stream:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"

        yield f"data: {json.dumps({'done': True})}\n\n"

    # Sync generator: FastAPI iterates it in a threadpool, so the blocking
    # OpenAI stream does not stall the event loop
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )
Frontend (React):
typescript
// Integration with ai-chat skill
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })

  const reader = response.body!.getReader()
  const decoder = new TextDecoder()
  let buffer = ''  // a network chunk may split an SSE event; buffer across reads

  while (true) {
    const { done, value } = await reader.read()
    if (done) break

    buffer += decoder.decode(value, { stream: true })
    const events = buffer.split('\n\n')
    buffer = events.pop() ?? ''  // keep the trailing partial event

    for (const event of events) {
      if (event.startsWith('data: ')) {
        const data = JSON.parse(event.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}

BentoML Service

python
import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']] for f in features])
        predictions = self.model.predict(X)
        classes = ['setosa', 'versicolor', 'virginica']
        # Map each integer prediction to its class label
        return [classes[int(p)] for p in predictions]

LangChain RAG Pipeline

python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks, embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query
result = qa_chain({"query": "What is PagedAttention?"})

Performance Optimization

GPU Memory Estimation

Rule of thumb for LLMs:
GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2
Examples:
  • Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
  • Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (exceeds 2× A100 80 GB; plan for 4-way tensor parallelism)
Quantization reduces memory:
  • FP16: 2 bytes per parameter
  • INT8: 1 byte per parameter (halves memory vs. FP16)
  • INT4: 0.5 bytes per parameter (quarters memory vs. FP16)
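The rule of thumb above, as a quick sketch:

```python
def estimate_gpu_memory_gb(params_billion: float,
                           bytes_per_param: float,
                           overhead: float = 1.2) -> float:
    """Parameters x precision x ~20% overhead for KV cache and activations."""
    return params_billion * bytes_per_param * overhead

print(estimate_gpu_memory_gb(8, 2.0))   # FP16 Llama-3.1-8B: 8 x 2 x 1.2 = 19.2 GB
print(estimate_gpu_memory_gb(70, 2.0))  # FP16 Llama-3.1-70B: 70 x 2 x 1.2 = 168 GB
print(estimate_gpu_memory_gb(8, 0.5))   # INT4 8B: 8 x 0.5 x 1.2 = 4.8 GB
```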

vLLM Optimization

bash
# Enable quantization (AWQ for 4-bit)
vllm serve TheBloke/Llama-3.1-8B-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.9

# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Batching Strategies

Continuous batching (vLLM default):
  • Dynamically adds/removes requests from batch
  • Higher throughput than static batching
  • No configuration needed
Adaptive batching (BentoML):
python
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000  # Wait max 1s to fill batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches requests
    return self.model.predict(np.array(inputs))
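The fill-or-timeout behaviour behind adaptive batching can be sketched with stdlib primitives (illustrative only; BentoML does this internally):

```python
import queue
import time

def collect_batch(q: "queue.Queue[str]",
                  max_batch_size: int,
                  max_latency_s: float) -> list[str]:
    """Collect requests until the batch is full or the latency budget expires."""
    batch = [q.get()]  # block for the first request
    deadline = time.monotonic() + max_latency_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    q.put(f"req-{i}")
print(collect_batch(q, max_batch_size=3, max_latency_s=0.05))  # ['req-0', 'req-1', 'req-2']
```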

Production Deployment

Kubernetes Deployment

See examples/k8s-vllm-deployment/ for complete YAML manifests.
Key considerations:
  • GPU resource requests: nvidia.com/gpu: 1
  • Health checks: /health endpoint
  • Horizontal Pod Autoscaling based on queue depth
  • Persistent volume for model caching

API Gateway Pattern

For production, add rate limiting, authentication, and monitoring:
Kong Configuration:
yaml
services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus

Monitoring Metrics

Essential LLM metrics:
  • Tokens per second (throughput)
  • Time to first token (TTFT)
  • Inter-token latency
  • GPU utilization and memory
  • Queue depth
Prometheus instrumentation:
python
import time

from prometheus_client import Counter, Histogram

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

@app.post("/chat")
async def chat(request):
    requests_total.inc()
    start = time.time()
    response = await generate(request)
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.time() - start)
    return response

Integration Patterns

Frontend (ai-chat) Integration

This skill provides the backend serving layer for the ai-chat skill.
Flow:
Frontend (React) → API Gateway → vLLM Server → GPU Inference
     ↑                                                  ↓
     └─────────── SSE Stream (tokens) ─────────────────┘
See references/streaming-sse.md for complete implementation patterns.

RAG with Vector Databases

Architecture:
User Query → LangChain
              ├─> Vector DB (Qdrant) for retrieval
              ├─> Combine context + query
              └─> LLM (vLLM) for generation
See references/langchain-orchestration.md and examples/langchain-rag-qdrant/ for complete patterns.

Async Inference Queue

For batch processing or non-real-time inference:
Client → API → Message Queue (Celery) → Workers (vLLM) → Results DB
Useful for:
  • Batch document processing
  • Background summarization
  • Non-interactive workflows
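The flow above, sketched with a stdlib queue and a worker thread as a stand-in for Celery; call_vllm() is a hypothetical placeholder for the real vLLM HTTP call:

```python
import queue
import threading

jobs: "queue.Queue[tuple[int, str]]" = queue.Queue()
results: dict = {}  # stand-in for the results DB

def call_vllm(prompt: str) -> str:
    # Placeholder: in practice, call the vLLM OpenAI-compatible endpoint
    return f"[completion for: {prompt}]"

def worker() -> None:
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = call_vllm(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# API side: enqueue, then poll (or notify) for the result
jobs.put((1, "Summarize this document"))
jobs.join()
print(results[1])  # [completion for: Summarize this document]
```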

Benchmarking

Use scripts/benchmark_inference.py to measure the deployment:
bash
python scripts/benchmark_inference.py \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --concurrency 32 \
  --requests 1000
Outputs:
  • Requests per second
  • P50/P95/P99 latency
  • Tokens per second
  • GPU memory usage
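The latency percentiles in that output can be reproduced from raw per-request latencies with the stdlib (illustrative data):

```python
import statistics

# Example per-request latencies in milliseconds (illustrative)
latencies_ms = [12, 15, 11, 90, 14, 13, 200, 16, 12, 15]

# quantiles(n=100) returns the 99 percentile cut points
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.1f}ms P95={p95:.1f}ms P99={p99:.1f}ms")
```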

Bundled Resources

Detailed Guides:
  • references/vllm.md - vLLM setup, PagedAttention, optimization
  • references/tgi.md - Text Generation Inference patterns
  • references/bentoml.md - BentoML deployment patterns
  • references/langchain-orchestration.md - LangChain RAG and agents
  • references/inference-optimization.md - Quantization, batching, GPU tuning
Working Examples:
  • examples/vllm-serving/ - Complete vLLM + FastAPI streaming setup
  • examples/ollama-local/ - Local development with Ollama
  • examples/langchain-agents/ - LangChain agent patterns
Utility Scripts:
  • scripts/benchmark_inference.py - Throughput and latency benchmarking
  • scripts/validate_model_config.py - Validate deployment configurations

Common Patterns

Migration from OpenAI API

vLLM provides OpenAI-compatible endpoints for easy migration:
python
# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (vLLM): same client, different base_url
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Same API calls work!
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

Multi-Model Serving

Route requests to different models based on task:
python
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",  # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct", # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"    # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING[task]
    # Route to appropriate vLLM instance
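One way the routing stub above could be completed, assuming each model runs as a separate vLLM instance; the base URLs are illustrative assumptions:

```python
# Hypothetical per-task routing table: model name + base URL of its vLLM instance
MODEL_ENDPOINTS = {
    "small": ("meta-llama/Llama-3.1-8B-Instruct", "http://vllm-8b:8000/v1"),
    "large": ("meta-llama/Llama-3.1-70B-Instruct", "http://vllm-70b:8000/v1"),
    "code":  ("codellama/CodeLlama-34b-Instruct", "http://vllm-code:8000/v1"),
}

def resolve(task: str) -> tuple:
    # Unknown task types fall back to the cheap model
    return MODEL_ENDPOINTS.get(task, MODEL_ENDPOINTS["small"])

model, base_url = resolve("code")
print(model)  # codellama/CodeLlama-34b-Instruct
```

The handler would then build an OpenAI client against base_url and forward the request.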

Cost Optimization

Track token usage:
python
import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float):
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # $5 per 1M tokens
self_hosted_cost = 0  # marginal token cost is ~0; you pay a fixed GPU cost instead
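To turn that comparison into a decision, a rough break-even sketch (all numbers are illustrative assumptions):

```python
gpu_cost_per_hour = 2.0          # assumed hourly cost of a rented GPU
api_price_per_1k_tokens = 0.005  # $5 per 1M tokens

def breakeven_tokens_per_hour() -> float:
    """Token volume at which a dedicated GPU matches the per-token API bill."""
    return gpu_cost_per_hour / api_price_per_1k_tokens * 1000

print(f"{breakeven_tokens_per_hour():,.0f} tokens/hour")  # 400,000 tokens/hour
```

Above that sustained volume, self-hosting wins on cost; below it, the managed API does.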

Troubleshooting

Out of GPU memory:
  • Reduce --max-model-len
  • Lower --gpu-memory-utilization (try 0.8)
  • Enable quantization (--quantization awq)
  • Use smaller model variant
Low throughput:
  • Increase --gpu-memory-utilization (try 0.95)
  • Enable continuous batching (vLLM default)
  • Check GPU utilization (should be >80%)
  • Consider tensor parallelism for multi-GPU
High latency:
  • Reduce batch size if using static batching
  • Check network latency to GPU server
  • Profile with scripts/benchmark_inference.py

Next Steps

  1. Local Development: Start with examples/ollama-local/ for GPU-free testing
  2. Production Setup: Deploy vLLM with examples/vllm-serving/
  3. RAG Integration: Add vector DB with examples/langchain-rag-qdrant/
  4. Kubernetes: Scale with examples/k8s-vllm-deployment/
  5. Monitoring: Add metrics with Prometheus and Grafana