build-models

When to use this skill

  • You have model code, weights, or a HuggingFace/GitHub project you want to host on Replicate.
  • You're writing or editing a `cog.yaml`, `predict.py`, or `train.py`.
  • For pushing a built model to Replicate, see `publish-models`.
  • For running existing Replicate models, see `run-models`.

Prerequisites

  • Docker running locally.
  • Cog installed: `brew install replicate/tap/cog` or `sh <(curl -fsSL https://cog.run/install.sh)`.
  • Optional: `cog init` to scaffold `cog.yaml` and `predict.py`.

Project layout

The canonical Replicate model layout:

```
cog.yaml
predict.py
weights.py                 # optional download helpers
requirements.txt
cog-safe-push-configs/
  default.yaml             # see publish-models skill
.github/workflows/
  ci.yaml
script/                    # github.com/github/scripts-to-rule-them-all
  lint
  test
  push
```

cog.yaml essentials

A modern config for a GPU model:

```yaml
build:
  gpu: true
  cuda: "12.8"
  python_version: "3.12"
  python_requirements: requirements.txt
  system_packages:
    - libgl1
    - libglib2.0-0
predict: predict.py:Predictor
```
Notes:
  • Pin Python to a specific minor version, and pin every line in `requirements.txt`. Floating versions break cold boots. (A pinned example follows this list.)
  • Use `python_requirements` over inline `python_packages` once the list grows.
  • `cuda` follows your torch wheel (e.g. `12.8` paired with `torch==2.7.1+cu128`).
  • Add `train: train.py:train` if your model is fine-tunable.
  • Add `image: r8.im/owner/name` to enable bare `cog push`.
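Since every line must be pinned, a matching `requirements.txt` might look like this (the torch pin echoes the wheel above; the other packages and exact versions are illustrative):

```
torch==2.7.1
torchvision==0.22.1
transformers==4.53.2
diffusers==0.34.0
safetensors==0.5.3
```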
For async predictors with continuous batching:

```yaml
concurrency:
  max: 32
```

predict.py essentials

```python
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        """One-time loads. Heavy work goes here, not in predict()."""
        self.model = load_model("weights/")

    def predict(
        self,
        prompt: str = Input(description="Text prompt for generation"),
        seed: int = Input(description="Random seed; leave blank for random", default=None),
        num_steps: int = Input(description="Number of denoising steps", ge=1, le=50, default=20),
        output_format: str = Input(description="Output image format", choices=["webp", "jpg", "png"], default="webp"),
    ) -> Path:
        """Run a single prediction."""
        if not prompt.strip():
            raise ValueError("prompt cannot be empty")
        out = self.model.generate(prompt, seed=seed, steps=num_steps)
        return Path(out)
```
Input rules:
  • Every input needs a `description`. The description shows up in the model schema and on Replicate's web UI.
  • Use `ge`/`le` for numeric bounds, `choices=[...]` for enums, `regex=` for strings.
  • Use `cog.Path` for file inputs and outputs, never raw bytes.
  • Use `cog.Secret` for any token-like input (HF tokens, API keys), never plain `str`; see the sketch after this list.
  • Provide a default that's inside `choices` for categorical inputs.
  • Validate inputs early in `predict()` and raise `ValueError`.
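A sketch of the `cog.Secret` rule in practice (the input name and downstream use are illustrative):

```python
from cog import BasePredictor, Input, Path, Secret

class Predictor(BasePredictor):
    def predict(
        self,
        hf_token: Secret = Input(description="HuggingFace token for gated models", default=None),
    ) -> Path:
        # Secrets are redacted in logs and the API; unwrap explicitly where needed.
        token = hf_token.get_secret_value() if hf_token else None
        ...
```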
Streaming text output (for LLMs):

```python
from cog import BasePredictor, Input, ConcatenateIterator

class Predictor(BasePredictor):
    def predict(self, prompt: str = Input(description="Prompt")) -> ConcatenateIterator[str]:
        for token in self.model.stream(prompt):
            yield token
```
Async predictor with continuous batching (paired with `concurrency.max` in `cog.yaml`):

```python
from cog import BasePredictor, Input, AsyncConcatenateIterator

class Predictor(BasePredictor):
    async def setup(self) -> None:
        self.engine = await load_async_engine()

    async def predict(
        self,
        prompt: str = Input(description="Prompt"),
    ) -> AsyncConcatenateIterator[str]:
        async for token in self.engine.generate(prompt):
            yield token
```
Dynamic `choices` from on-disk assets (e.g. a `voices/` directory of audio samples):

```python
from pathlib import Path as _P

from cog import BasePredictor, Input, Path

AVAILABLE_VOICES = sorted(p.stem for p in _P("voices").glob("*.wav"))

class Predictor(BasePredictor):
    def predict(
        self,
        speaker: str = Input(description="Voice", choices=AVAILABLE_VOICES, default=AVAILABLE_VOICES[0]),
    ) -> Path: ...
```

Loading weights fast

Cold boot dominates user-perceived latency. Three patterns, ranked by simplicity:

1. Bake weights into the image at build time

Best for small or medium weights (< 5GB) where you want zero cold-boot download time.

For torchvision:

```python
import os
os.environ["TORCH_HOME"] = "."  # set before importing torch
import torch
from torchvision import models
```

For HuggingFace:

```python
import os
os.environ["HF_HUB_CACHE"] = "./.cache"
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
```

Then download once during `cog build` (e.g. in a `run:` step or by running a small fetcher script as part of the build). The weights become part of the image layer.
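A minimal fetcher that such a `run:` step could invoke (the script path and repo id are illustrative):

```python
# script/download_weights -- executed during `cog build`, so the weights bake into the image
import os

os.environ["HF_HUB_CACHE"] = "./.cache"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download("owner/model")  # hypothetical repo id; lands in ./.cache
```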

2. Pull from weights.replicate.delivery with pget

Best for large weights, or when you want to share weights across multiple models. `pget` is Replicate's parallel HTTP fetcher.

In `cog.yaml`:

```yaml
build:
  run:
    - curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.8.2/pget_linux_x86_64"
    - chmod +x /usr/local/bin/pget
```

In `setup()`:

```python
import subprocess
from pathlib import Path

WEIGHTS_URL = "https://weights.replicate.delivery/default/my-model/weights.tar"
WEIGHTS_DIR = Path("weights")

class Predictor(BasePredictor):
    def setup(self) -> None:
        if not WEIGHTS_DIR.exists():
            # -x extracts tar in-memory; default concurrency is 4 * NumCPU
            subprocess.check_call(["pget", "-x", WEIGHTS_URL, str(WEIGHTS_DIR)])
        self.model = load_from(WEIGHTS_DIR)
```

For multiple files in one shot:

```python
import subprocess

base = "https://weights.replicate.delivery/default/my-model"
manifest = "\n".join([
    f"{base}/unet.safetensors weights/unet.safetensors",
    f"{base}/vae.safetensors weights/vae.safetensors",
    f"{base}/text_encoder.safetensors weights/text_encoder.safetensors",
])
subprocess.run(["pget", "multifile", "-"], input=manifest, text=True, check=True)
```

3. HuggingFace Hub with hf_transfer

Set `HF_HUB_ENABLE_HF_TRANSFER=1` and use `huggingface_hub.snapshot_download` or `from_pretrained`. Faster than vanilla HF downloads. Use a `cog.Secret` input for gated models.
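A sketch of that download path (the repo id is illustrative; for a gated repo, pass a token from a `cog.Secret` input):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download("owner/model", local_dir="weights")  # hypothetical repo id
```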

Weight cache for user-supplied weights

For LoRAs or any weights URL the user passes at predict time, use a sha256-keyed disk cache with LRU eviction:

```python
import hashlib, shutil, subprocess
from pathlib import Path

class WeightsDownloadCache:
    def __init__(self, cache_dir: str = "/tmp/weights-cache", min_disk_free_gb: int = 10):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_disk_free = min_disk_free_gb * 1024**3

    def ensure(self, url: str) -> Path:
        key = hashlib.sha256(url.encode()).hexdigest()
        target = self.cache_dir / key
        if target.exists():
            target.touch()  # bump LRU mtime
            return target
        self._evict_until_room()
        subprocess.check_call(["pget", url, str(target)])
        return target

    def _evict_until_room(self) -> None:
        while shutil.disk_usage(self.cache_dir).free < self.min_disk_free:
            entries = sorted(self.cache_dir.iterdir(), key=lambda p: p.stat().st_mtime)
            if not entries:
                return
            entries[0].unlink()
```

See replicate/cog-flux/weights.py for a production version that handles HF, CivitAI, Replicate, and arbitrary `.safetensors` URLs.

Multi-LoRA composition

Reload only when the URL changes; compose two LoRAs with separate scales:

```python
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        self.pipe = load_base_pipeline()
        self.cache = WeightsDownloadCache()  # see the weight-cache section above
        self.loaded = {"main": None, "extra": None}

    def _ensure_lora(self, slot: str, url: str | None) -> None:
        if url == self.loaded[slot]:
            return
        if self.loaded[slot] is not None:
            self.pipe.delete_adapters(slot)  # drop just this adapter, not all LoRAs
        if url:
            path = self.cache.ensure(url)
            self.pipe.load_lora_weights(str(path), adapter_name=slot)
        self.loaded[slot] = url

    def predict(
        self,
        prompt: str = Input(description="Prompt"),
        lora_url: str = Input(description="Primary LoRA URL", default=None),
        lora_scale: float = Input(description="Primary LoRA scale", ge=0.0, le=2.0, default=1.0),
        extra_lora_url: str = Input(description="Optional second LoRA URL", default=None),
        extra_lora_scale: float = Input(description="Second LoRA scale", ge=0.0, le=2.0, default=1.0),
    ) -> Path:
        self._ensure_lora("main", lora_url)
        self._ensure_lora("extra", extra_lora_url)
        adapters = [s for s, u in self.loaded.items() if u]
        scales = [lora_scale if s == "main" else extra_lora_scale for s in adapters]
        if adapters:
            self.pipe.set_adapters(adapters, adapter_weights=scales)
        out = "/tmp/out.png"
        self.pipe(prompt).images[0].save(out)  # PIL's save() returns None, so save then wrap
        return Path(out)
```

Cold-boot tricks

From production diffusion models like replicate/cog-flux and replicate/cog-flux-kontext:

  • Set perf flags once in `setup()`:

    ```python
    import torch
    torch.set_float32_matmul_precision("high")
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    ```

  • Compile and warm up:

    ```python
    self.model = torch.compile(self.model, dynamic=True)
    _ = self.predict(prompt="warmup", num_steps=1)  # absorbs compile cost in setup
    ```

  • Load big weights with the meta device + `assign=True` to avoid double-allocating:

    ```python
    with torch.device("meta"):
        model = build_model_skeleton()
    state = torch.load("weights.pt", map_location="cpu")
    model.load_state_dict(state, assign=True)
    ```

  • Share the VAE / text encoder across multiple pipelines (e.g. base + img2img + inpaint) instead of loading three copies; see the sketch after this list.
  • For fp8/int8, save quantized weights ahead of time and load directly; don't quantize at boot.
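With diffusers-style pipelines, that sharing can be one call (a sketch; the repo id is illustrative):

```python
from diffusers import AutoPipelineForImage2Image, AutoPipelineForText2Image

base = AutoPipelineForText2Image.from_pretrained("owner/model")  # hypothetical repo id
# from_pipe reuses the already-loaded VAE, text encoder, and UNet -- no second copy in memory.
img2img = AutoPipelineForImage2Image.from_pipe(base)
```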

Local development

```
cog init                                    # scaffold cog.yaml + predict.py
cog predict -i prompt="hello"               # build + run a single prediction
cog predict -i image=@input.jpg -o out.png  # file inputs and outputs
cog serve -p 8393                           # HTTP server matching production
cog exec python                             # interactive shell inside the build env
```

Building

```
cog build -t my-model
cog build --separate-weights -t my-model    # weights in their own image layer
cog build --secret id=hf,src=$HOME/.hf_token -t my-model
```

Tips:
  • Use `--separate-weights` for any model with weights > ~1GB. It speeds up cold boots and registry pushes.
  • Use `--mount=type=cache,target=/root/.cache/pip` in `run:` steps to cache pip across builds.
  • Use `--secret` instead of `ARG` to keep tokens out of image history.
  • The default Cog base image (`--use-cog-base-image=true`) is faster than rolling your own.

Training

If your model supports fine-tuning, add `train: train.py:train` to `cog.yaml` and write a `train()` function that returns `TrainingOutput(weights=Path("model.tar"))`. The predictor then accepts the weights URL via `setup(self, weights)` or the `COG_WEIGHTS` env var. See https://cog.run/training and replicate/flux-fine-tuner for a full example.
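A sketch of the `train()` side, following that pattern (`fine_tune` and the inputs are illustrative):

```python
import subprocess

from cog import BaseModel, Input, Path

class TrainingOutput(BaseModel):
    weights: Path

def train(
    input_images: Path = Input(description="Zip of training images"),
    steps: int = Input(description="Training steps", ge=100, le=5000, default=1000),
) -> TrainingOutput:
    fine_tune(input_images, output_dir="checkpoint", steps=steps)  # hypothetical training loop
    # Pack the checkpoint into the tarball the predictor will receive.
    subprocess.check_call(["tar", "-cf", "/tmp/model.tar", "-C", "checkpoint", "."])
    return TrainingOutput(weights=Path("/tmp/model.tar"))
```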

Guidelines

  • Keep `setup()` for one-time loads; keep `predict()` fast and deterministic in shape.
  • Pin Python and every dependency. Use `numpy<2` if your torch is older.
  • Always describe every input. Schemas without descriptions are unusable on the web UI.
  • Use `cog.Path` for files and `cog.Secret` for tokens.
  • Pin `pget` to a specific release (`v0.8.2`) for reproducibility.
  • Set `HF_HUB_ENABLE_HF_TRANSFER=1` whenever you call HuggingFace Hub.
  • Set `TRANSFORMERS_OFFLINE=1` after weights are loaded to prevent runtime HF lookups.
  • Test with `cog predict` before pushing. If it doesn't work locally, it won't work in production.

Production references
