build-models

When to use this skill

  • You have model code, weights, or a HuggingFace/GitHub project you want to host on Replicate.
  • You're writing or editing a `cog.yaml`, `predict.py`, or `train.py`.
  • For pushing a built model to Replicate, see `publish-models`.
  • For running existing Replicate models, see `run-models`.

Prerequisites

  • Docker running locally.
  • Cog installed: `brew install replicate/tap/cog` or `sh <(curl -fsSL https://cog.run/install.sh)`.
  • Optional: `cog init` to scaffold `cog.yaml` and `predict.py`.

Project layout

The canonical Replicate model layout:

```
cog.yaml
predict.py
weights.py                 # optional download helpers
requirements.txt
cog-safe-push-configs/
  default.yaml             # see publish-models skill
.github/workflows/
  ci.yaml
script/                    # github.com/github/scripts-to-rule-them-all
  lint
  test
  push
```

cog.yaml essentials

A modern config for a GPU model:

```yaml
build:
  gpu: true
  cuda: "12.8"
  python_version: "3.12"
  python_requirements: requirements.txt
  system_packages:
    - libgl1
    - libglib2.0-0
predict: predict.py:Predictor
```
Notes:
  • Pin Python to a specific minor version, and pin every line in `requirements.txt`. Floating versions break cold boots. (A pinned example follows this list.)
  • Use `python_requirements` over inline `python_packages` once the list grows.
  • `cuda` follows your torch wheel (e.g. `12.8` paired with `torch==2.7.1+cu128`).
  • Add `train: train.py:train` if your model is fine-tunable.
  • Add `image: r8.im/owner/name` to enable bare `cog push`.
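Since every line must be pinned, a matching `requirements.txt` might look like this (the torch pin echoes the wheel above; the other packages and exact versions are illustrative):

```
torch==2.7.1
torchvision==0.22.1
transformers==4.53.2
diffusers==0.34.0
safetensors==0.5.3
```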
For async predictors with continuous batching:

```yaml
concurrency:
  max: 32
```

predict.py essentials

```python
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        """One-time loads. Heavy work goes here, not in predict()."""
        self.model = load_model("weights/")

    def predict(
        self,
        prompt: str = Input(description="Text prompt for generation"),
        seed: int = Input(description="Random seed; leave blank for random", default=None),
        num_steps: int = Input(description="Number of denoising steps", ge=1, le=50, default=20),
        output_format: str = Input(description="Output image format", choices=["webp", "jpg", "png"], default="webp"),
    ) -> Path:
        """Run a single prediction."""
        if not prompt.strip():
            raise ValueError("prompt cannot be empty")
        out = self.model.generate(prompt, seed=seed, steps=num_steps)
        return Path(out)
```
Input rules:
  • Every input needs a `description`. The description shows up in the model schema and on Replicate's web UI.
  • Use `ge`/`le` for numeric bounds, `choices=[...]` for enums, `regex=` for strings.
  • Use `cog.Path` for file inputs and outputs, never raw bytes.
  • Use `cog.Secret` for any token-like input (HF tokens, API keys), never plain `str`; see the sketch after this list.
  • Provide a default that's inside `choices` for categorical inputs.
  • Validate inputs early in `predict()` and raise `ValueError`.
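A sketch of the `cog.Secret` rule in practice (the input name and downstream use are illustrative):

```python
from cog import BasePredictor, Input, Path, Secret

class Predictor(BasePredictor):
    def predict(
        self,
        hf_token: Secret = Input(description="HuggingFace token for gated models", default=None),
    ) -> Path:
        # Secrets are redacted in logs and the API; unwrap explicitly where needed.
        token = hf_token.get_secret_value() if hf_token else None
        ...
```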
Streaming text output (for LLMs):

```python
from cog import BasePredictor, Input, ConcatenateIterator

class Predictor(BasePredictor):
    def predict(self, prompt: str = Input(description="Prompt")) -> ConcatenateIterator[str]:
        for token in self.model.stream(prompt):
            yield token
```
Async predictor with continuous batching (paired with `concurrency.max` in `cog.yaml`):

```python
from cog import BasePredictor, Input, AsyncConcatenateIterator

class Predictor(BasePredictor):
    async def setup(self) -> None:
        self.engine = await load_async_engine()

    async def predict(
        self,
        prompt: str = Input(description="Prompt"),
    ) -> AsyncConcatenateIterator[str]:
        async for token in self.engine.generate(prompt):
            yield token
```
Dynamic `choices` from on-disk assets (e.g. a `voices/` directory of audio samples):

```python
from pathlib import Path as _P

from cog import BasePredictor, Input, Path

AVAILABLE_VOICES = sorted(p.stem for p in _P("voices").glob("*.wav"))

class Predictor(BasePredictor):
    def predict(
        self,
        speaker: str = Input(description="Voice", choices=AVAILABLE_VOICES, default=AVAILABLE_VOICES[0]),
    ) -> Path: ...
```

Loading weights fast

Cold boot dominates user-perceived latency. Three patterns, ranked by simplicity:

1. Bake weights into the image at build time

Best for small or medium weights (< 5GB) where you want zero cold-boot download time.

For torchvision:

```python
import os
os.environ["TORCH_HOME"] = "."  # set before importing torch
import torch
from torchvision import models
```

For HuggingFace:

```python
import os
os.environ["HF_HUB_CACHE"] = "./.cache"
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
```

Then download once during `cog build` (e.g. in a `run:` step or by running a small fetcher script as part of the build). The weights become part of the image layer.
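A minimal fetcher that such a `run:` step could invoke (the script path and repo id are illustrative):

```python
# script/download_weights -- executed during `cog build`, so the weights bake into the image
import os

os.environ["HF_HUB_CACHE"] = "./.cache"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download("owner/model")  # hypothetical repo id; lands in ./.cache
```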

2. Pull from weights.replicate.delivery with pget

Best for large weights, or when you want to share weights across multiple models. `pget` is Replicate's parallel HTTP fetcher.

In `cog.yaml`:

```yaml
build:
  run:
    - curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.8.2/pget_linux_x86_64"
    - chmod +x /usr/local/bin/pget
```

In `setup()`:

```python
import subprocess
from pathlib import Path

WEIGHTS_URL = "https://weights.replicate.delivery/default/my-model/weights.tar"
WEIGHTS_DIR = Path("weights")

class Predictor(BasePredictor):
    def setup(self) -> None:
        if not WEIGHTS_DIR.exists():
            # -x extracts tar in-memory; default concurrency is 4 * NumCPU
            subprocess.check_call(["pget", "-x", WEIGHTS_URL, str(WEIGHTS_DIR)])
        self.model = load_from(WEIGHTS_DIR)
```

For multiple files in one shot:

```python
import subprocess

base = "https://weights.replicate.delivery/default/my-model"
manifest = "\n".join([
    f"{base}/unet.safetensors weights/unet.safetensors",
    f"{base}/vae.safetensors weights/vae.safetensors",
    f"{base}/text_encoder.safetensors weights/text_encoder.safetensors",
])
subprocess.run(["pget", "multifile", "-"], input=manifest, text=True, check=True)
```

3. HuggingFace Hub with hf_transfer

Set `HF_HUB_ENABLE_HF_TRANSFER=1` and use `huggingface_hub.snapshot_download` or `from_pretrained`. Faster than vanilla HF downloads. Use a `cog.Secret` input for gated models.
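A sketch of that download path (the repo id is illustrative; for a gated repo, pass a token from a `cog.Secret` input):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download("owner/model", local_dir="weights")  # hypothetical repo id
```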

Weight cache for user-supplied weights

For LoRAs or any weights URL the user passes at predict time, use a sha256-keyed disk cache with LRU eviction:

```python
import hashlib, shutil, subprocess
from pathlib import Path

class WeightsDownloadCache:
    def __init__(self, cache_dir: str = "/tmp/weights-cache", min_disk_free_gb: int = 10):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_disk_free = min_disk_free_gb * 1024**3

    def ensure(self, url: str) -> Path:
        key = hashlib.sha256(url.encode()).hexdigest()
        target = self.cache_dir / key
        if target.exists():
            target.touch()  # bump LRU mtime
            return target
        self._evict_until_room()
        subprocess.check_call(["pget", url, str(target)])
        return target

    def _evict_until_room(self) -> None:
        while shutil.disk_usage(self.cache_dir).free < self.min_disk_free:
            entries = sorted(self.cache_dir.iterdir(), key=lambda p: p.stat().st_mtime)
            if not entries:
                return
            entries[0].unlink()
```

See replicate/cog-flux/weights.py for a production version that handles HF, CivitAI, Replicate, and arbitrary `.safetensors` URLs.

Multi-LoRA composition

Reload only when the URL changes; compose two LoRAs with separate scales:

```python
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        self.pipe = load_base_pipeline()
        self.cache = WeightsDownloadCache()  # see the weight-cache section above
        self.loaded = {"main": None, "extra": None}

    def _ensure_lora(self, slot: str, url: str | None) -> None:
        if url == self.loaded[slot]:
            return
        if self.loaded[slot] is not None:
            self.pipe.delete_adapters(slot)  # drop just this adapter, not all LoRAs
        if url:
            path = self.cache.ensure(url)
            self.pipe.load_lora_weights(str(path), adapter_name=slot)
        self.loaded[slot] = url

    def predict(
        self,
        prompt: str = Input(description="Prompt"),
        lora_url: str = Input(description="Primary LoRA URL", default=None),
        lora_scale: float = Input(description="Primary LoRA scale", ge=0.0, le=2.0, default=1.0),
        extra_lora_url: str = Input(description="Optional second LoRA URL", default=None),
        extra_lora_scale: float = Input(description="Second LoRA scale", ge=0.0, le=2.0, default=1.0),
    ) -> Path:
        self._ensure_lora("main", lora_url)
        self._ensure_lora("extra", extra_lora_url)
        adapters = [s for s, u in self.loaded.items() if u]
        scales = [lora_scale if s == "main" else extra_lora_scale for s in adapters]
        if adapters:
            self.pipe.set_adapters(adapters, adapter_weights=scales)
        out = "/tmp/out.png"
        self.pipe(prompt).images[0].save(out)  # PIL's save() returns None, so save then wrap
        return Path(out)
```

Cold-boot tricks

From production diffusion models like replicate/cog-flux and replicate/cog-flux-kontext:

  • Set perf flags once in `setup()`:

    ```python
    import torch
    torch.set_float32_matmul_precision("high")
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    ```

  • Compile and warm up:

    ```python
    self.model = torch.compile(self.model, dynamic=True)
    _ = self.predict(prompt="warmup", num_steps=1)  # absorbs compile cost in setup
    ```

  • Load big weights with the meta device + `assign=True` to avoid double-allocating:

    ```python
    with torch.device("meta"):
        model = build_model_skeleton()
    state = torch.load("weights.pt", map_location="cpu")
    model.load_state_dict(state, assign=True)
    ```

  • Share the VAE / text encoder across multiple pipelines (e.g. base + img2img + inpaint) instead of loading three copies; see the sketch after this list.
  • For fp8/int8, save quantized weights ahead of time and load directly; don't quantize at boot.
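With diffusers-style pipelines, that sharing can be one call (a sketch; the repo id is illustrative):

```python
from diffusers import AutoPipelineForImage2Image, AutoPipelineForText2Image

base = AutoPipelineForText2Image.from_pretrained("owner/model")  # hypothetical repo id
# from_pipe reuses the already-loaded VAE, text encoder, and UNet -- no second copy in memory.
img2img = AutoPipelineForImage2Image.from_pipe(base)
```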

Local development

```
cog init                                    # scaffold cog.yaml + predict.py
cog predict -i prompt="hello"               # build + run a single prediction
cog predict -i image=@input.jpg -o out.png  # file inputs and outputs
cog serve -p 8393                           # HTTP server matching production
cog exec python                             # interactive shell inside the build env
```

Building

```
cog build -t my-model
cog build --separate-weights -t my-model    # weights in their own image layer
cog build --secret id=hf,src=$HOME/.hf_token -t my-model
```

Tips:
  • Use `--separate-weights` for any model with weights > ~1GB. It speeds up cold boots and registry pushes.
  • Use `--mount=type=cache,target=/root/.cache/pip` in `run:` steps to cache pip across builds.
  • Use `--secret` instead of `ARG` to keep tokens out of image history.
  • The default Cog base image (`--use-cog-base-image=true`) is faster than rolling your own.

Training

If your model supports fine-tuning, add `train: train.py:train` to `cog.yaml` and write a `train()` function that returns `TrainingOutput(weights=Path("model.tar"))`. The predictor then accepts the weights URL via `setup(self, weights)` or the `COG_WEIGHTS` env var. See https://cog.run/training and replicate/flux-fine-tuner for a full example.
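A sketch of the `train()` side, following that pattern (`fine_tune` and the inputs are illustrative):

```python
import subprocess

from cog import BaseModel, Input, Path

class TrainingOutput(BaseModel):
    weights: Path

def train(
    input_images: Path = Input(description="Zip of training images"),
    steps: int = Input(description="Training steps", ge=100, le=5000, default=1000),
) -> TrainingOutput:
    fine_tune(input_images, output_dir="checkpoint", steps=steps)  # hypothetical training loop
    # Pack the checkpoint into the tarball the predictor will receive.
    subprocess.check_call(["tar", "-cf", "/tmp/model.tar", "-C", "checkpoint", "."])
    return TrainingOutput(weights=Path("/tmp/model.tar"))
```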

Guidelines

  • Keep `setup()` for one-time loads; keep `predict()` fast and deterministic in shape.
  • Pin Python and every dependency. Use `numpy<2` if your torch is older.
  • Always describe every input. Schemas without descriptions are unusable on the web UI.
  • Use `cog.Path` for files and `cog.Secret` for tokens.
  • Pin `pget` to a specific release (`v0.8.2`) for reproducibility.
  • Set `HF_HUB_ENABLE_HF_TRANSFER=1` whenever you call HuggingFace Hub.
  • Set `TRANSFORMERS_OFFLINE=1` after weights are loaded to prevent runtime HF lookups.
  • Test with `cog predict` before pushing. If it doesn't work locally, it won't work in production.

Production references
