
Modal Serverless GPU


Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.

When to use Modal


Use Modal when:
  • Running GPU-intensive ML workloads without managing infrastructure
  • Deploying ML models as auto-scaling APIs
  • Running batch processing jobs (training, inference, data processing)
  • Paying per second for GPU time, with no idle costs
  • Prototyping ML applications quickly
  • Running scheduled jobs (cron-like workloads)
Key features:
  • Serverless GPUs: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
  • Python-native: Define infrastructure in Python code, no YAML
  • Auto-scaling: Scale to zero, scale to 100+ GPUs instantly
  • Sub-second cold starts: Rust-based infrastructure for fast container launches
  • Container caching: Image layers cached for rapid iteration
  • Web endpoints: Deploy functions as REST APIs with zero-downtime updates
Consider alternatives instead:
  • RunPod: For longer-running pods with persistent state
  • Lambda Labs: For reserved GPU instances
  • SkyPilot: For multi-cloud orchestration and cost optimization
  • Kubernetes: For complex multi-service architectures

Quick start


Installation


```bash
pip install modal
modal setup  # Opens browser for authentication
```

Hello World with GPU


```python
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())
```

Run:

```bash
modal run hello_gpu.py
```

Basic inference endpoint


```python
import modal

app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
```

Core concepts


Key components


| Component | Purpose |
|---|---|
| App | Container for functions and resources |
| Function | Serverless function with compute specs |
| Cls | Class-based functions with lifecycle hooks |
| Image | Container image definition |
| Volume | Persistent storage for models/data |
| Secret | Secure credential storage |

Execution modes


| Command | Description |
|---|---|
| `modal run script.py` | Execute and exit |
| `modal serve script.py` | Development with live reload |
| `modal deploy script.py` | Persistent cloud deployment |

GPU configuration


Available GPUs


| GPU | VRAM | Best For |
|---|---|---|
| T4 | 16GB | Budget inference, small models |
| L4 | 24GB | Inference, Ada Lovelace arch |
| A10G | 24GB | Training/inference, 3.3x faster than T4 |
| L40S | 48GB | Recommended for inference (best cost/perf) |
| A100-40GB | 40GB | Large model training |
| A100-80GB | 80GB | Very large models |
| H100 | 80GB | Fastest, FP8 + Transformer Engine |
| H200 | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
| B200 | — | Latest Blackwell architecture |
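A quick way to use this table is a weights-only VRAM estimate. The rule of thumb below is an assumption for illustration, not from Modal's docs; real usage adds activations, KV cache, and CUDA overhead on top:

```python
def weights_vram_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only VRAM: params x bytes per param (4 = fp32, 2 = fp16/bf16, 1 = int8).
    Real usage is higher: activations, KV cache, CUDA context."""
    return num_params * bytes_per_param / 1024**3

# A 7B-parameter model in fp16 needs ~13 GiB for weights alone:
# tight on a 16GB T4, comfortable on a 24GB L4/A10G.
print(round(weights_vram_gib(7e9), 1))
```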

GPU specification patterns



Single GPU

```python
@app.function(gpu="A100")
```

Specific memory variant

```python
@app.function(gpu="A100-80GB")
```

Multiple GPUs (up to 8)

```python
@app.function(gpu="H100:4")
```

GPU with fallbacks

```python
@app.function(gpu=["H100", "A100", "L40S"])
```

Any available GPU

```python
@app.function(gpu="any")
```

Container images

Basic image with pip

```python
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)
```

From CUDA base

```python
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04", add_python="3.11"
).pip_install("torch", "transformers")
```

With system packages

```python
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
```

Persistent storage


```python
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)
```

Web endpoints


FastAPI endpoint decorator


```python
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}
```

Full ASGI app


```python
from fastapi import FastAPI
web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
```

Web endpoint types


| Decorator | Use Case |
|---|---|
| `@modal.fastapi_endpoint()` | Simple function → API |
| `@modal.asgi_app()` | Full FastAPI/Starlette apps |
| `@modal.wsgi_app()` | Django/Flask apps |
| `@modal.web_server(port)` | Arbitrary HTTP servers |

Dynamic batching


```python
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs automatically batched
    return model.batch_predict(inputs)
```
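Under the hood, `@modal.batched` accumulates individual calls until the batch is full or `wait_ms` elapses, then invokes the function once with the whole batch. A size-only sketch of that grouping in plain Python (the time window is ignored here):

```python
def make_batches(items, max_batch_size=32):
    """Greedy grouping: flush whenever the batch reaches max_batch_size.
    The real batcher also flushes a partial batch after wait_ms."""
    batch, batches = [], []
    for item in items:
        batch.append(item)
        if len(batch) == max_batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # trailing partial batch
    return batches

# 70 queued inputs become two full batches and one partial one.
sizes = [len(b) for b in make_batches(range(70))]
```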

Secrets management


Create secret

```bash
modal secret create huggingface HF_TOKEN=hf_xxx
```

```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
```

Scheduling


```python
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
```
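`modal.Cron` takes a standard five-field cron expression: minute, hour, day of month, month, day of week. To make the semantics concrete, here is a toy matcher supporting only `*` and literal numbers (an illustration, not Modal's parser; real cron also allows ranges, steps, and lists):

```python
from datetime import datetime

def cron_matches(spec: str, dt: datetime) -> bool:
    """True if dt matches a 5-field cron spec (minute hour dom month dow).
    Only '*' and bare integers are supported in this sketch."""
    fields = spec.split()
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "0 0 * * *" fires at midnight, any day.
midnight = cron_matches("0 0 * * *", datetime(2024, 1, 1, 0, 0))
one_am = cron_matches("0 0 * * *", datetime(2024, 1, 1, 1, 0))
```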

Performance optimization


Cold start mitigation


```python
@app.function(
    container_idle_timeout=300,  # Keep warm 5 min
    allow_concurrent_inputs=10,  # Handle concurrent requests
)
def inference():
    pass
```

Model loading best practices


```python
@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Run once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up

    @modal.method()
    def predict(self, x):
        return self.model(x)
```
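The payoff of `@modal.enter()` is amortization: the model loads once per container and then serves many requests. A Modal-free sketch of that lifecycle, with a counter to show the load happens only once (names here are illustrative):

```python
class CachedModel:
    """Local analogy for the @modal.enter() pattern."""
    load_count = 0  # class-level counter to observe loads

    def __init__(self):
        # Runs once, like @modal.enter() at container start.
        CachedModel.load_count += 1
        self.weights = "loaded"

    def predict(self, x):
        # Per-request work reuses the already-loaded weights.
        return (self.weights, x)

model = CachedModel()  # one container, one load
results = [model.predict(i) for i in range(3)]
```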

Parallel processing


```python
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
```
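Locally, the same fan-out shape can be tried with a thread pool: `.map` on Modal distributes across containers, while `ThreadPoolExecutor.map` distributes across threads (an analogy only; the Modal call also handles serialization and autoscaling):

```python
from concurrent.futures import ThreadPoolExecutor

def square(item: int) -> int:
    return item * item  # stand-in for expensive_computation

# Fan out 10 items across up to 8 workers; map preserves input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(square, range(10)))
```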

Common configuration


```python
@app.function(
    gpu="A100",
    memory=32768,                # 32GB RAM
    cpu=4,                       # 4 CPU cores
    timeout=3600,                # 1 hour max
    container_idle_timeout=120,  # Keep warm 2 min
    retries=3,                   # Retry on failure
    concurrency_limit=10,        # Max concurrent containers
)
def my_function():
    pass
```

Debugging



Test locally


```python
if __name__ == "__main__":
    result = my_function.local()
```

View logs


```bash
modal app logs my-app
```

Common issues


| Issue | Solution |
|---|---|
| Cold start latency | Increase `container_idle_timeout`, use `@modal.enter()` |
| GPU OOM | Use a larger GPU (`A100-80GB`), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase `timeout`, add checkpointing |

References


  • Advanced Usage - Multi-GPU, distributed training, cost optimization
  • Troubleshooting - Common issues and solutions

Resources
