runpod-deployment

<objective> Deploy and manage GPU workloads on RunPod infrastructure:
  1. Serverless Workers - Scale-to-zero handlers with pay-per-second billing
  2. vLLM Endpoints - OpenAI-compatible LLM serving with 2-3x throughput
  3. Pod Management - Dedicated GPU instances for development/training
  4. Cost Optimization - GPU selection, spot instances, budget controls
Key deliverables:
  • Production-ready serverless handlers with streaming
  • vLLM deployment with OpenAI API compatibility
  • Cost-optimized GPU selection for any model size
  • Health monitoring and auto-scaling configuration </objective>
<quick_start> Minimal Serverless Handler (v1.8.1):
python
import runpod

def handler(job):
    """Basic handler - receives job, returns result."""
    job_input = job["input"]
    prompt = job_input.get("prompt", "")

    # Your inference logic here
    result = process(prompt)

    return {"output": result}

runpod.serverless.start({"handler": handler})
Streaming Handler:
python
import runpod

def streaming_handler(job):
    """Generator for streaming responses."""
    for chunk in generate_chunks(job["input"]):
        yield {"token": chunk, "finished": False}
    yield {"token": "", "finished": True}

runpod.serverless.start({
    "handler": streaming_handler,
    "return_aggregate_stream": True
})
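Because a streaming handler is just a Python generator, it can be exercised locally without the RunPod runtime; a minimal sketch, assuming a stub `generate_chunks` that splits the prompt into words (a stand-in for real inference):

```python
def generate_chunks(job_input):
    """Stub chunker: yields the prompt word by word (stand-in for real inference)."""
    for word in job_input.get("prompt", "").split():
        yield word + " "

def streaming_handler(job):
    """Same shape as the handler above: chunk dicts, then a final marker."""
    for chunk in generate_chunks(job["input"]):
        yield {"token": chunk, "finished": False}
    yield {"token": "", "finished": True}

# Iterate the generator directly - no RunPod runtime required
chunks = list(streaming_handler({"input": {"prompt": "hello streaming world"}}))
text = "".join(c["token"] for c in chunks)
assert chunks[-1]["finished"] is True
print(text.strip())  # hello streaming world
```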
vLLM OpenAI-Compatible Client:
python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
</quick_start>
<success_criteria> A RunPod deployment is successful when:
  • Handler processes requests without errors
  • Endpoint scales appropriately (0 → N workers)
  • Cold start time is acceptable for use case
  • Cost stays within budget projections
  • Health checks pass consistently </success_criteria>
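The criteria above can be turned into a simple automated gate; a sketch, assuming a health snapshot flattened into the dict shape shown here (field names are illustrative - the real SDK/REST payload differs):

```python
def deployment_ok(health: dict, max_cold_start_s: float = 30.0,
                  monthly_budget_usd: float = 100.0) -> bool:
    """Evaluate the success criteria against a flattened health snapshot."""
    return bool(
        health.get("error_rate", 1.0) < 0.01            # requests succeed
        and health.get("workers_max", 0) > 0            # can scale 0 -> N
        and health.get("cold_start_s", 1e9) <= max_cold_start_s
        and health.get("projected_monthly_usd", 1e9) <= monthly_budget_usd
        and health.get("health_checks_passing", False)
    )

snapshot = {"error_rate": 0.0, "workers_max": 5, "cold_start_s": 12.0,
            "projected_monthly_usd": 18.33, "health_checks_passing": True}
print(deployment_ok(snapshot))  # True
```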
<m1_mac_critical>

M1/M2 Mac: Cannot Build Docker Locally

ARM architecture is incompatible with RunPod's x86 GPUs.
Solution: let GitHub Actions build for you:
bash
# Push code - Actions builds the x86 image
git add . && git commit -m "Deploy" && git push

> See `reference/cicd.md` for the complete GitHub Actions workflow.

**Never run `docker build` locally for RunPod on Apple Silicon.**
</m1_mac_critical>

<gpu_selection>

GPU Selection Matrix (January 2025)

| GPU       | VRAM | Secure $/hr | Spot $/hr | Best For                 |
|-----------|------|-------------|-----------|--------------------------|
| RTX A4000 | 16GB | $0.36       | $0.18     | Embeddings, small models |
| RTX 4090  | 24GB | $0.44       | $0.22     | 7B-8B inference          |
| A40       | 48GB | $0.65       | $0.39     | 13B-30B, fine-tuning     |
| A100 80GB | 80GB | $1.89       | $0.89     | 70B models, production   |
| H100 80GB | 80GB | $4.69       | $1.88     | 70B+ training            |
Quick Selection:
python
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"      # $0.36/hr
    if effective <= 8: return "RTX_4090"       # $0.44/hr
    if effective <= 30: return "A40"           # $0.65/hr
    if effective <= 70: return "A100_80GB"     # $1.89/hr
    return "H100_80GB"                         # $4.69/hr
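For instance, quantization halves the effective parameter count, which can drop a model a full tier. A usage sketch (the selection function is repeated so the example runs standalone):

```python
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    # Same selection rule as above, repeated for a standalone example
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"
    if effective <= 8: return "RTX_4090"
    if effective <= 30: return "A40"
    if effective <= 70: return "A100_80GB"
    return "H100_80GB"

print(select_gpu(8))                   # RTX_4090 - fp16 8B fits in 24GB
print(select_gpu(13, quantized=True))  # RTX_4090 - 13B @ 4-bit ~ 6.5B effective
print(select_gpu(70))                  # A100_80GB
print(select_gpu(70, quantized=True))  # A100_80GB - 35B effective still needs 80GB
```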
See `reference/cost-optimization.md` for detailed pricing and budget controls. </gpu_selection>
<handler_patterns>

Handler Patterns

Progress Updates (Long-Running Tasks)

python
import runpod

def long_task_handler(job):
    total_steps = job["input"].get("steps", 10)

    for step in range(total_steps):
        process_step(step)
        runpod.serverless.progress_update(
            job_id=job["id"],
            progress=int((step + 1) / total_steps * 100)
        )

    return {"status": "complete", "steps": total_steps}

runpod.serverless.start({"handler": long_task_handler})

Error Handling

python
import runpod
import torch
import traceback

def safe_handler(job):
    try:
        # Validate input
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}

        result = process(job["input"])
        return {"output": result}

    except torch.cuda.OutOfMemoryError:
        return {"error": "GPU OOM - reduce input size", "retry": False}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

runpod.serverless.start({"handler": safe_handler})
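Handlers like `safe_handler` can be unit-tested locally by calling them with fake job dicts; a sketch with a trivial `process` stub (the `torch.cuda.OutOfMemoryError` branch is dropped so it runs without a GPU):

```python
import traceback

def process(job_input):
    """Stub inference - echo the prompt uppercased (stand-in for the model call)."""
    return job_input["prompt"].upper()

def safe_handler(job):
    try:
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}
        result = process(job["input"])
        return {"output": result}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

# Happy path, validation path, and exception path - no RunPod runtime needed
assert safe_handler({"input": {"prompt": "hi"}}) == {"output": "HI"}
assert safe_handler({"input": {}})["error"].startswith("Missing")
assert "error" in safe_handler({"input": {"prompt": 123}})  # .upper() on int fails
```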
See `reference/serverless-workers.md` for async patterns, batching, and advanced handlers. </handler_patterns>
<vllm_deployment>

vLLM Deployment

Note: vLLM uses OpenAI-compatible API FORMAT but connects to YOUR RunPod endpoint, NOT OpenAI servers. Models run on your GPU (Llama, Qwen, Mistral, etc.)

Environment Configuration

python
vllm_env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-70B-Instruct",
    "HF_TOKEN": "${HF_TOKEN}",
    "TENSOR_PARALLEL_SIZE": "2",        # Multi-GPU
    "MAX_MODEL_LEN": "16384",
    "GPU_MEMORY_UTILIZATION": "0.95",
    "QUANTIZATION": "awq",              # Optional: awq, gptq
}
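The `TENSOR_PARALLEL_SIZE: "2"` above reflects a rough sizing rule: fp16 weights take about 2 bytes per parameter (roughly 0.5 with 4-bit quantization such as AWQ) and must fit within `GPU_MEMORY_UTILIZATION` of total VRAM. A back-of-envelope sketch (it ignores KV cache and activation overhead, so treat it as a lower bound, not a guarantee):

```python
def weights_fit(params_b: float, vram_gb: float, n_gpus: int = 1,
                gpu_mem_util: float = 0.95, bytes_per_param: float = 2.0) -> bool:
    """Rough check: do the model weights alone fit in usable VRAM?"""
    weights_gb = params_b * bytes_per_param
    usable_gb = vram_gb * n_gpus * gpu_mem_util
    return weights_gb <= usable_gb

print(weights_fit(70, 80, n_gpus=1))                       # False: 140GB > 76GB
print(weights_fit(70, 80, n_gpus=2))                       # True:  140GB <= 152GB
print(weights_fit(70, 80, n_gpus=1, bytes_per_param=0.5))  # True:  AWQ ~35GB
```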

OpenAI-Compatible Streaming

python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Direct RunPod Streaming

python
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/run"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}

response = requests.post(url, headers=headers, json={
    "input": {"prompt": "Hello", "stream": True}
})
job_id = response.json()["id"]

Stream results:
python
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
with requests.get(stream_url, headers=headers, stream=True) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())

> See `reference/model-deployment.md` for HuggingFace, TGI, and custom model patterns.
</vllm_deployment>

<auto_scaling>

Auto-Scaling Configuration

Scaler Types

| Type          | Best For         | Config                        |
|---------------|------------------|-------------------------------|
| QUEUE_DELAY   | Variable traffic | scaler_value=2 (2s target)    |
| REQUEST_COUNT | Predictable load | scaler_value=5 (5 req/worker) |

Configuration Patterns

python
configs = {
    "interactive_api": {
        "workers_min": 1,      # Always warm
        "workers_max": 5,
        "idle_timeout": 120,
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 1,     # 1s latency target
    },
    "batch_processing": {
        "workers_min": 0,
        "workers_max": 20,
        "idle_timeout": 30,
        "scaler_type": "REQUEST_COUNT",
        "scaler_value": 5,
    },
    "cost_optimized": {
        "workers_min": 0,
        "workers_max": 3,
        "idle_timeout": 15,    # Aggressive scale-down
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 5,
    },
}
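Under REQUEST_COUNT scaling, the target worker count is roughly queue depth divided by `scaler_value`, clamped to the min/max bounds. A sketch of that arithmetic (the exact scheduler behavior is RunPod's, so this is only an approximation of the intent):

```python
import math

def target_workers(in_queue: int, scaler_value: int,
                   workers_min: int, workers_max: int) -> int:
    """Approximate REQUEST_COUNT scaling: ~scaler_value requests per worker."""
    wanted = math.ceil(in_queue / scaler_value) if in_queue else 0
    return max(workers_min, min(workers_max, wanted))

cfg = {"workers_min": 0, "workers_max": 20, "scaler_value": 5}  # batch_processing
print(target_workers(0, cfg["scaler_value"], cfg["workers_min"], cfg["workers_max"]))    # 0
print(target_workers(12, cfg["scaler_value"], cfg["workers_min"], cfg["workers_max"]))   # 3
print(target_workers(500, cfg["scaler_value"], cfg["workers_min"], cfg["workers_max"]))  # 20
```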
See `reference/pod-management.md` for pod lifecycle and scaling details. </auto_scaling>
<health_monitoring>

Health & Monitoring

Quick Health Check

python
import runpod

async def check_health(endpoint_id: str):
    endpoint = runpod.Endpoint(endpoint_id)
    health = await endpoint.health()

    return {
        "status": health.status,
        "workers_ready": health.workers.ready,
        "queue_depth": health.queue.in_queue,
        "avg_latency_ms": health.metrics.avg_execution_time,
    }

GraphQL Metrics Query

graphql
query GetEndpoint($id: String!) {
    endpoint(id: $id) {
        status
        workers { ready running pending throttled }
        queue { inQueue inProgress completed failed }
        metrics {
            requestsPerMinute
            avgExecutionTimeMs
            p95ExecutionTimeMs
            successRate
        }
    }
}
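To run a query like this from Python, POST it alongside a `variables` object. A sketch of just the payload construction (the GraphQL endpoint URL and auth scheme are not shown here - check RunPod's API docs for the current values before wiring up the request):

```python
import json

ENDPOINT_QUERY = """
query GetEndpoint($id: String!) {
    endpoint(id: $id) {
        status
        workers { ready running pending throttled }
        metrics { requestsPerMinute p95ExecutionTimeMs successRate }
    }
}
"""

def build_payload(endpoint_id: str) -> str:
    """Serialize query + variables the way a GraphQL server expects them."""
    return json.dumps({"query": ENDPOINT_QUERY, "variables": {"id": endpoint_id}})

body = build_payload("my-endpoint-id")
print(json.loads(body)["variables"])  # {'id': 'my-endpoint-id'}
```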
See `reference/monitoring.md` for structured logging, alerts, and dashboards. </health_monitoring>
<dockerfile_pattern>

Dockerfile Template

dockerfile
FROM runpod/pytorch:2.1.0-py3.10-cuda12.1.1-devel-ubuntu22.04

WORKDIR /app

# Install dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# RunPod entrypoint
CMD ["python", "-u", "handler.py"]

> See `reference/templates.md` for runpod.toml, requirements.txt patterns.
</dockerfile_pattern>

<file_locations>

Reference Files

Core Patterns:
  • reference/serverless-workers.md
    - Handler patterns, streaming, async
  • reference/model-deployment.md
    - vLLM, TGI, HuggingFace deployment
  • reference/pod-management.md
    - GPU types, scaling, lifecycle
Operations:
  • reference/cost-optimization.md
    - Budget controls, right-sizing
  • reference/monitoring.md
    - Health checks, logging, GraphQL
  • reference/troubleshooting.md
    - Common issues and solutions
DevOps:
  • reference/cicd.md
    - GitHub Actions for M1 Mac builds
  • reference/templates.md
    - Dockerfile, runpod.toml configs
  • templates/runpod-worker.py
    - Production handler template </file_locations>
<routing>

Request Routing

User wants serverless deployment:
  → Provide handler pattern, Dockerfile, deployment steps
  → Reference: reference/serverless-workers.md
User wants vLLM endpoint:
  → Provide vLLM env config, OpenAI client setup
  → Reference: reference/model-deployment.md
User wants cost optimization:
  → Provide GPU selection matrix, spot pricing, budget controls
  → Reference: reference/cost-optimization.md
User on M1/M2 Mac:
  → CRITICAL: Must use GitHub Actions for builds
  → Reference: reference/cicd.md
User has deployment issues:
  → Check health endpoint, review logs
  → Reference: reference/troubleshooting.md
</routing>
<cost_quick_ref>

Cost Quick Reference

Storage Costs:
  • Container disk: $0.10/GB/month (billed in 5-min intervals)
  • Network volumes: $0.07/GB/month (first 1TB), $0.05/GB after
Cost Estimation:
python
def estimate_monthly_cost(gpu_type, daily_requests, avg_time_s):
    rates = {"RTX_4090": 0.44, "A40": 0.65, "A100_80GB": 1.89}
    daily_hours = (daily_requests * avg_time_s) / 3600
    return daily_hours * 30 * rates.get(gpu_type, 1.0)

Example: 1000 requests/day, 5s each, RTX 4090

= (1000 * 5) / 3600 * 30 * 0.44 = $18.33/month
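Spot (interruptible) pricing from the GPU table roughly halves that figure. A sketch comparing the two, with the estimator extended to carry both rates (rates repeated from the table so the example runs standalone):

```python
def estimate_monthly_cost(gpu_type, daily_requests, avg_time_s, spot=False):
    # (secure $/hr, spot $/hr) pairs from the GPU selection matrix
    rates = {"RTX_4090": (0.44, 0.22), "A40": (0.65, 0.39), "A100_80GB": (1.89, 0.89)}
    secure_rate, spot_rate = rates.get(gpu_type, (1.0, 0.5))
    daily_hours = (daily_requests * avg_time_s) / 3600
    return daily_hours * 30 * (spot_rate if spot else secure_rate)

print(round(estimate_monthly_cost("RTX_4090", 1000, 5), 2))             # 18.33
print(round(estimate_monthly_cost("RTX_4090", 1000, 5, spot=True), 2))  # 9.17
```

Spot workers can be reclaimed mid-job, so the saving only applies to workloads that tolerate retries.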


</cost_quick_ref>

<cli_commands>

Quick CLI Commands

bash
# Install
pip install runpod

# Deploy endpoint
runpodctl project deploy --name my-endpoint --gpu-type "NVIDIA RTX 4090"

# Health check
runpod endpoint health ENDPOINT_ID

# View logs
runpod endpoint logs ENDPOINT_ID

# Scale workers
runpod endpoint scale ENDPOINT_ID --min 1 --max 10

# Local testing
python handler.py --rp_serve_api
</cli_commands>

<example_session>

Example Session

User: "Deploy a Llama 3.1 8B model to RunPod serverless"
Claude:
  1. First, select a GPU - an 8B model fits on an RTX 4090 ($0.44/hr):
python
gpu_type = "NVIDIA GeForce RTX 4090"
  2. Use vLLM for 2-3x throughput. Create the endpoint with:
python
env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "MAX_MODEL_LEN": "8192",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
  3. Access via the OpenAI-compatible API:
python
from openai import OpenAI
client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
  4. Cost estimate: ~$0.44/hr compute, scale-to-zero when idle. </example_session>