runpod-deployment

<objective>
Deploy and manage GPU workloads on RunPod infrastructure:
- Serverless Workers - Scale-to-zero handlers with pay-per-second billing
- vLLM Endpoints - OpenAI-compatible LLM serving with 2-3x throughput
- Pod Management - Dedicated GPU instances for development/training
- Cost Optimization - GPU selection, spot instances, budget controls
Key deliverables:
- Production-ready serverless handlers with streaming
- vLLM deployment with OpenAI API compatibility
- Cost-optimized GPU selection for any model size
- Health monitoring and auto-scaling configuration </objective>
<quick_start>
Minimal Serverless Handler (v1.8.1):

```python
import runpod

def handler(job):
    """Basic handler - receives job, returns result."""
    job_input = job["input"]
    prompt = job_input.get("prompt", "")
    # Your inference logic here
    result = process(prompt)
    return {"output": result}

runpod.serverless.start({"handler": handler})
```

Streaming Handler:

```python
import runpod

def streaming_handler(job):
    """Generator for streaming responses."""
    for chunk in generate_chunks(job["input"]):
        yield {"token": chunk, "finished": False}
    yield {"token": "", "finished": True}

runpod.serverless.start({
    "handler": streaming_handler,
    "return_aggregate_stream": True
})
```

vLLM OpenAI-Compatible Client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
```
</quick_start>
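Once a handler like the one above is deployed, clients invoke it over RunPod's REST interface; the synchronous `/runsync` route blocks until the job completes. A minimal sketch using only the standard library (`ENDPOINT_ID` is a placeholder; the actual HTTP call is left commented):

```python
import json
import os
import urllib.request

def build_runsync_request(endpoint_id: str, payload: dict, api_key: str):
    """Build the URL, headers, and JSON body for a synchronous RunPod job."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"input": payload}  # the handler receives this as job["input"]
    return url, headers, body

url, headers, body = build_runsync_request(
    "ENDPOINT_ID", {"prompt": "Hello"}, os.environ.get("RUNPOD_API_KEY", "")
)
# req = urllib.request.Request(url, data=json.dumps(body).encode(),
#                              headers=headers, method="POST")
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp))  # e.g. {"status": "COMPLETED", "output": {...}}
```

Note that the `{"input": ...}` wrapper is mandatory — it is exactly what the handler unpacks as `job["input"]`.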
<success_criteria>
A RunPod deployment is successful when:
- Handler processes requests without errors
- Endpoint scales appropriately (0 → N workers)
- Cold start time is acceptable for use case
- Cost stays within budget projections
- Health checks pass consistently </success_criteria>
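The last two criteria can be spot-checked programmatically against the JSON returned by the endpoint's `/health` route. A sketch of a pass/fail gate — the `workers`/`jobs` field names reflect the commonly documented health payload, but treat them as assumptions and verify against your SDK version:

```python
def health_ok(health: dict, max_queue: int = 10) -> bool:
    """Rough health gate: workers available, queue not backed up,
    failures not dominating completions."""
    workers = health.get("workers", {})
    jobs = health.get("jobs", {})
    available = workers.get("ready", 0) + workers.get("idle", 0)
    completed = jobs.get("completed", 0)
    failed = jobs.get("failed", 0)
    failure_rate = failed / max(completed + failed, 1)
    return (available > 0
            and jobs.get("inQueue", 0) <= max_queue
            and failure_rate < 0.05)

sample = {"workers": {"idle": 2, "ready": 1},
          "jobs": {"completed": 980, "failed": 3, "inQueue": 1}}
print(health_ok(sample))  # True
```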
<m1_mac_critical>
M1/M2 Mac: Cannot Build Docker Locally

ARM architecture is incompatible with RunPod's x86 GPUs.
Solution: GitHub Actions builds for you:

```bash
# Push code - Actions builds the x86 image
git add . && git commit -m "Deploy" && git push
```

> See `reference/cicd.md` for the complete GitHub Actions workflow.

**Never run `docker build` locally for RunPod on Apple Silicon.**
</m1_mac_critical>
<gpu_selection>
GPU Selection Matrix (January 2025)

| GPU | VRAM | Secure $/hr | Spot $/hr | Best For |
|---|---|---|---|---|
| RTX A4000 | 16GB | $0.36 | $0.18 | Embeddings, small models |
| RTX 4090 | 24GB | $0.44 | $0.22 | 7B-8B inference |
| A40 | 48GB | $0.65 | $0.39 | 13B-30B, fine-tuning |
| A100 80GB | 80GB | $1.89 | $0.89 | 70B models, production |
| H100 80GB | 80GB | $4.69 | $1.88 | 70B+ training |

Quick Selection:

```python
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"   # $0.36/hr
    if effective <= 8: return "RTX_4090"    # $0.44/hr
    if effective <= 30: return "A40"        # $0.65/hr
    if effective <= 70: return "A100_80GB"  # $1.89/hr
    return "H100_80GB"                      # $4.69/hr
```

> See `reference/cost-optimization.md` for detailed pricing and budget controls.
</gpu_selection>
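The parameter thresholds in `select_gpu` roughly track weight memory. As a back-of-envelope check, fp16 weights take about 2 bytes per parameter, plus roughly 20% overhead for KV cache and activations — a rule of thumb, not a guarantee:

```python
def estimate_vram_gb(model_params_b: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.2) -> float:
    """Approximate serving VRAM: fp16 weights plus ~20% for KV cache/activations."""
    return model_params_b * bytes_per_param * overhead

print(round(estimate_vram_gb(8), 1))   # 19.2 -> fits a 24GB RTX 4090
print(round(estimate_vram_gb(70), 1))  # 168.0 -> needs multi-GPU or quantization
```

This is why a 70B model lands on an A100 80GB only with tensor parallelism or quantization (awq/gptq halve `bytes_per_param`).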
<handler_patterns>
Handler Patterns

Progress Updates (Long-Running Tasks)

```python
import runpod

def long_task_handler(job):
    total_steps = job["input"].get("steps", 10)
    for step in range(total_steps):
        process_step(step)
        runpod.serverless.progress_update(
            job_id=job["id"],
            progress=int((step + 1) / total_steps * 100)
        )
    return {"status": "complete", "steps": total_steps}

runpod.serverless.start({"handler": long_task_handler})
```

Error Handling

```python
import runpod
import torch
import traceback

def safe_handler(job):
    try:
        # Validate input
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}
        result = process(job["input"])
        return {"output": result}
    except torch.cuda.OutOfMemoryError:
        return {"error": "GPU OOM - reduce input size", "retry": False}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

runpod.serverless.start({"handler": safe_handler})
```

> See `reference/serverless-workers.md` for async patterns, batching, and advanced handlers.
</handler_patterns>
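For the async patterns the reference covers, the same job contract applies — a handler may be an `async def` coroutine. A minimal sketch; `validate` is a hypothetical helper, the `await asyncio.sleep(0)` stands in for real non-blocking inference, and the `runpod.serverless.start` registration is left commented so the snippet runs standalone:

```python
import asyncio

def validate(job_input: dict):
    """Return an error string, or None if the input is usable."""
    if not isinstance(job_input, dict) or "prompt" not in job_input:
        return "Missing required field: prompt"
    return None

async def async_handler(job):
    error = validate(job.get("input") or {})
    if error:
        return {"error": error}
    # Placeholder for non-blocking inference (e.g. an async HTTP call)
    await asyncio.sleep(0)
    return {"output": f"processed: {job['input']['prompt']}"}

result = asyncio.run(async_handler({"id": "t1", "input": {"prompt": "hi"}}))
print(result)  # {'output': 'processed: hi'}

# In the worker image, register it the usual way:
# import runpod
# runpod.serverless.start({"handler": async_handler})
```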
<vllm_deployment>
vLLM Deployment

Note: vLLM uses the OpenAI-compatible API FORMAT but connects to YOUR RunPod endpoint, NOT OpenAI's servers. Models run on your GPU (Llama, Qwen, Mistral, etc.).

Environment Configuration

```python
vllm_env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-70B-Instruct",
    "HF_TOKEN": "${HF_TOKEN}",
    "TENSOR_PARALLEL_SIZE": "2",       # Multi-GPU
    "MAX_MODEL_LEN": "16384",
    "GPU_MEMORY_UTILIZATION": "0.95",
    "QUANTIZATION": "awq",             # Optional: awq, gptq
}
```

OpenAI-Compatible Streaming

```python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Direct RunPod Streaming

```python
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/run"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}
response = requests.post(url, headers=headers, json={
    "input": {"prompt": "Hello", "stream": True}
})
job_id = response.json()["id"]

# Stream results
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
with requests.get(stream_url, headers=headers, stream=True) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
```

> See `reference/model-deployment.md` for HuggingFace, TGI, and custom model patterns.
</vllm_deployment>
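Instead of streaming, a job submitted to `/run` can be polled on the `/status/{job_id}` route until it reaches a terminal state. A sketch — the terminal state names follow RunPod's documented job statuses, but verify them against your API version:

```python
import time

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def is_terminal(status: str) -> bool:
    """True once a job can no longer change state."""
    return status in TERMINAL

def poll(endpoint_id: str, job_id: str, headers: dict, interval: float = 2.0):
    """Poll until the job leaves IN_QUEUE/IN_PROGRESS (requires `requests`)."""
    import requests
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    while True:
        data = requests.get(url, headers=headers).json()
        if is_terminal(data["status"]):
            return data
        time.sleep(interval)

print(is_terminal("COMPLETED"), is_terminal("IN_PROGRESS"))  # True False
```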
<auto_scaling>
Auto-Scaling Configuration

Scaler Types

| Type | Best For |
|---|---|
| QUEUE_DELAY | Variable traffic |
| REQUEST_COUNT | Predictable load |

Configuration Patterns

```python
configs = {
    "interactive_api": {
        "workers_min": 1,        # Always warm
        "workers_max": 5,
        "idle_timeout": 120,
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 1,       # 1s latency target
    },
    "batch_processing": {
        "workers_min": 0,
        "workers_max": 20,
        "idle_timeout": 30,
        "scaler_type": "REQUEST_COUNT",
        "scaler_value": 5,
    },
    "cost_optimized": {
        "workers_min": 0,
        "workers_max": 3,
        "idle_timeout": 15,      # Aggressive scale-down
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 5,
    },
}
```

> See `reference/pod-management.md` for pod lifecycle and scaling details.
</auto_scaling>
<health_monitoring>
Health & Monitoring

Quick Health Check

```python
import runpod

async def check_health(endpoint_id: str):
    endpoint = runpod.Endpoint(endpoint_id)
    health = await endpoint.health()
    return {
        "status": health.status,
        "workers_ready": health.workers.ready,
        "queue_depth": health.queue.in_queue,
        "avg_latency_ms": health.metrics.avg_execution_time,
    }
```

GraphQL Metrics Query

```graphql
query GetEndpoint($id: String!) {
  endpoint(id: $id) {
    status
    workers { ready running pending throttled }
    queue { inQueue inProgress completed failed }
    metrics {
      requestsPerMinute
      avgExecutionTimeMs
      p95ExecutionTimeMs
      successRate
    }
  }
}
```

> See `reference/monitoring.md` for structured logging, alerts, and dashboards.
</health_monitoring>
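A GraphQL query like the one above is sent as an ordinary JSON POST. A sketch with the standard library — the `https://api.runpod.io/graphql` URL and `api_key` query parameter reflect common usage, but treat them as assumptions and verify against current docs (the network call is left commented):

```python
import json
import urllib.request

QUERY = """
query GetEndpoint($id: String!) {
  endpoint(id: $id) { status workers { ready running } }
}
"""

def build_graphql_payload(endpoint_id: str) -> bytes:
    """Serialize the query plus variables as a GraphQL POST body."""
    return json.dumps({"query": QUERY, "variables": {"id": endpoint_id}}).encode()

payload = build_graphql_payload("ENDPOINT_ID")
# req = urllib.request.Request(
#     "https://api.runpod.io/graphql?api_key=YOUR_KEY",
#     data=payload, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["data"]["endpoint"])
```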
<dockerfile_pattern>
Dockerfile Template

```dockerfile
FROM runpod/pytorch:2.1.0-py3.10-cuda12.1.1-devel-ubuntu22.04

WORKDIR /app

# Install dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# RunPod entrypoint
CMD ["python", "-u", "handler.py"]
```

> See `reference/templates.md` for runpod.toml and requirements.txt patterns.
</dockerfile_pattern>
<file_locations>
Reference Files

Core Patterns:
- reference/serverless-workers.md - Handler patterns, streaming, async
- reference/model-deployment.md - vLLM, TGI, HuggingFace deployment
- reference/pod-management.md - GPU types, scaling, lifecycle

Operations:
- reference/cost-optimization.md - Budget controls, right-sizing
- reference/monitoring.md - Health checks, logging, GraphQL
- reference/troubleshooting.md - Common issues and solutions

DevOps:
- reference/cicd.md - GitHub Actions for M1 Mac builds
- reference/templates.md - Dockerfile, runpod.toml configs
- templates/runpod-worker.py - Production handler template
</file_locations>
<routing>
Request Routing

User wants serverless deployment:
→ Provide handler pattern, Dockerfile, deployment steps
→ Reference: reference/serverless-workers.md

User wants vLLM endpoint:
→ Provide vLLM env config, OpenAI client setup
→ Reference: reference/model-deployment.md

User wants cost optimization:
→ Provide GPU selection matrix, spot pricing, budget controls
→ Reference: reference/cost-optimization.md

User on M1/M2 Mac:
→ CRITICAL: Must use GitHub Actions for builds
→ Reference: reference/cicd.md

User has deployment issues:
→ Check health endpoint, review logs
→ Reference: reference/troubleshooting.md
</routing>
<cost_quick_ref>
Cost Quick Reference

Storage Costs:
- Container disk: $0.10/GB/month (billed in 5-min intervals)
- Network volumes: $0.07/GB/month (first 1TB), $0.05/GB after

Cost Estimation:

```python
def estimate_monthly_cost(gpu_type, daily_requests, avg_time_s):
    rates = {"RTX_4090": 0.44, "A40": 0.65, "A100_80GB": 1.89}
    daily_hours = (daily_requests * avg_time_s) / 3600
    return daily_hours * 30 * rates.get(gpu_type, 1.0)

# Example: 1000 requests/day, 5s each, RTX 4090
# = (1000 * 5) / 3600 * 30 * 0.44 = $18.33/month
```
</cost_quick_ref>
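The tiered network-volume rates listed above work out as follows — a small helper using the quoted rates, treating 1TB as 1000GB and ignoring billing granularity:

```python
def network_volume_monthly(gb: float) -> float:
    """First 1TB at $0.07/GB/month, remainder at $0.05/GB/month."""
    first_tier = min(gb, 1000.0)
    rest = max(gb - 1000.0, 0.0)
    return first_tier * 0.07 + rest * 0.05

print(round(network_volume_monthly(500), 2))   # 35.0
print(round(network_volume_monthly(1500), 2))  # 95.0  (1000*0.07 + 500*0.05)
```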
<cli_commands>
Quick CLI Commands

```bash
# Install
pip install runpod

# Deploy endpoint
runpodctl project deploy --name my-endpoint --gpu-type "NVIDIA RTX 4090"

# Health check
runpod endpoint health ENDPOINT_ID

# View logs
runpod endpoint logs ENDPOINT_ID

# Scale workers
runpod endpoint scale ENDPOINT_ID --min 1 --max 10

# Local testing
python handler.py --rp_serve_api
```
</cli_commands>
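`--rp_serve_api` starts a local HTTP test server, but the quickest smoke test needs no server at all: a handler is just a function of a job dict, so it can be called directly with a fabricated job. A sketch using a stand-in handler shaped like the quick-start one:

```python
def handler(job):
    """Stand-in for the real handler in handler.py."""
    prompt = job["input"].get("prompt", "")
    return {"output": prompt.upper()}

# Simulate exactly what RunPod would pass in
fake_job = {"id": "local-test", "input": {"prompt": "hello runpod"}}
result = handler(fake_job)
print(result)  # {'output': 'HELLO RUNPOD'}
assert "output" in result and "error" not in result
```

Running this check before `git push` catches input-contract bugs without paying for a GPU build cycle.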
<example_session>
Example Session

User: "Deploy a Llama 3.1 8B model to RunPod serverless"

Claude:
- First, select a GPU - an 8B model fits on an RTX 4090 ($0.44/hr):

```python
gpu_type = "NVIDIA GeForce RTX 4090"
```

- Use vLLM for 2-3x throughput. Create the endpoint with:

```python
env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "MAX_MODEL_LEN": "8192",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
```

- Access via the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
```

- Cost estimate: ~$0.44/hr compute, scale-to-zero when idle.
</example_session>