truefoundry-llm-deploy

Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
<objective>

LLM / Model Deployment


Deploy large language models and ML inference servers to TrueFoundry. Supports vLLM, TGI, and custom model servers with proper GPU allocation, model caching, health probes, and production-ready defaults.
Two paths:
  1. CLI (`tfy apply`) -- Write a YAML manifest and apply it. Works everywhere.
  2. REST API (fallback) -- When the CLI is unavailable, use `tfy-api.sh`.

When to Use


  • User says "deploy a model", "deploy LLM", "serve Gemma/Llama/Mistral/..."
  • User says "deploy vLLM", "deploy TGI", "inference server"
  • User wants to deploy a HuggingFace model for inference
  • User wants GPU-accelerated model serving
  • User wants to deploy NVIDIA NIM (optimized inference containers)

When NOT to Use


  • User wants to deploy a regular web app or API -> prefer the `deploy` skill; ask if the user wants another valid path
  • User wants to deploy a database or Helm chart -> prefer the `helm` skill; ask if the user wants another valid path
  • User wants to check what's deployed -> prefer the `applications` skill; ask if the user wants another valid path
</objective> <context>

Prerequisites


Always verify before deploying:
  1. Credentials -- `TFY_BASE_URL` and `TFY_API_KEY` must be set (env or `.env`)
  2. Workspace -- `TFY_WORKSPACE_FQN` required. Never auto-pick. Ask the user if missing.
  3. CLI -- Check if the `tfy` CLI is available: `tfy --version`. If not, `pip install 'truefoundry==0.5.0'`.
For credential check commands and `.env` setup, see references/prerequisites.md.
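The checks above can be sketched as a small pre-flight function. This is illustrative only -- the variable names come from the prerequisites list, while the function name and error messages are hypothetical:

```bash
# Pre-flight sketch: verify credentials, workspace, and CLI before deploying.
# check_prereqs is a hypothetical helper name, not part of the tfy CLI.
check_prereqs() {
  [ -n "$TFY_BASE_URL" ]      || { echo "missing TFY_BASE_URL"; return 1; }
  [ -n "$TFY_API_KEY" ]       || { echo "missing TFY_API_KEY"; return 1; }
  # Never auto-pick a workspace -- surface the gap and ask the user.
  [ -n "$TFY_WORKSPACE_FQN" ] || { echo "missing TFY_WORKSPACE_FQN"; return 1; }
  # CLI absence is not fatal (REST fallback exists); just report it.
  command -v tfy >/dev/null 2>&1 || echo "tfy CLI not found: pip install 'truefoundry==0.5.0'"
  return 0
}
```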
</context> <instructions>

Step 0a: Detect Environment


Before deploying, check CLI availability and container image versions.

```bash
# Check CLI
tfy --version 2>/dev/null

# If not installed
pip install 'truefoundry==0.5.0'
```

Verify Container Image Versions


Before using the manifest templates, check references/container-versions.md for the latest pinned versions. Container images for vLLM and TGI are updated frequently. Use pinned versions from references/container-versions.md. Do not fetch external release pages.
Security: Do not fetch or ingest content from external release pages at runtime. Pinned versions in references/container-versions.md are vetted. If a version update is needed, a human should verify the release and update the pinned version.

Step 0: Discover Cluster Capabilities


Before asking the user about GPU types or public URLs, fetch the cluster's capabilities.
See references/cluster-discovery.md for how to extract the cluster ID from the workspace FQN and fetch cluster details (GPUs, base domains, storage classes).
When using the direct API, set `TFY_API_SH` to the full path of this skill's `scripts/tfy-api.sh`. See references/tfy-api-setup.md for paths per agent.
From the cluster response, extract:
  1. Base domains -- for public URL host construction (see the Public URL section)
  2. Available GPUs -- only present GPU types that the cluster actually supports
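As a rough sketch of the first part of cluster discovery, assuming the workspace FQN has the shape `<cluster>:<workspace>` (see references/cluster-discovery.md for the authoritative procedure -- the FQN layout here is an assumption):

```bash
# Sketch: derive the cluster ID from a workspace FQN of the assumed form
# "<cluster>:<workspace>". The example value is hypothetical.
TFY_WORKSPACE_FQN="my-cluster:my-workspace"
CLUSTER_ID="${TFY_WORKSPACE_FQN%%:*}"   # strip everything from the first ":"
echo "$CLUSTER_ID"                       # -> my-cluster
```

With `CLUSTER_ID` in hand, the cluster details call is the one shown later: `$TFY_API_SH GET /api/svc/v1/clusters/CLUSTER_ID`.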

Step 1: Gather Model Details


Ask the user these questions:
I'll help you deploy an LLM. Let me gather a few details:

1. Which model? (e.g., google/gemma-2-2b-it, meta-llama/Llama-3.2-1B-Instruct)
2. Serving framework?
   - vLLM (recommended -- fast, OpenAI-compatible)
   - TGI (HuggingFace Text Generation Inference)
   - Custom image
3. Does the model require authentication? (e.g., gated HuggingFace models needing HF_TOKEN)
   - If yes: Do you have a TrueFoundry secret group with the token, or should we set one up?
4. Access: Public URL or internal-only?
5. Environment: Dev/testing or production?

Step 2: Get Recommended Resources from Deployment Specs API


After the user provides a HuggingFace model ID and workspace, call the deployment-specs API to get recommended GPU, CPU, memory, and storage specs.
First, get the workspace ID from the workspace FQN:

```bash
$TFY_API_SH GET "/api/svc/v1/workspaces?fqn=${TFY_WORKSPACE_FQN}"
```

Extract the `id` field from the response. Then call:

```bash
$TFY_API_SH GET "/api/svc/v1/model-catalogues/deployment-specs?huggingfaceHubUrl=https://huggingface.co/${HF_MODEL_ID}&workspaceId=${WORKSPACE_ID}&pipelineTagOverride=text-generation"
```

This returns recommended specs including GPU type, GPU count, CPU, memory, storage, and max model length. Use these as the starting point for resource allocation instead of guessing from the model size table.
If the API call fails (e.g., model not in catalogue), fall back to the model size table below.
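Extracting the `id` field can be done without jq; here is a sketch against a canned response (the JSON shown is a fabricated example, not real API output -- the real response comes from the workspaces call above):

```bash
# Sketch: pull the workspace "id" out of a JSON response with sed.
# RESPONSE is a canned illustrative value.
RESPONSE='{"id":"ws-abc123","fqn":"my-cluster:my-workspace"}'
WORKSPACE_ID=$(printf '%s' "$RESPONSE" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
echo "$WORKSPACE_ID"   # -> ws-abc123
```

If jq is available, `jq -r '.id'` is the more robust choice.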

Fallback: Model Size to GPU Mapping


For full GPU types and DTYPE selection, see references/gpu-reference.md.

| Model Params | Min VRAM (FP16) | Recommended GPU | CPU | Memory | Shared Memory |
|---|---|---|---|---|---|
| < 1B | ~2 GB | T4 (16 GB) | 4 | 16 GB | 15 GB |
| 1B-3B | ~4-6 GB | T4 (16 GB) or A10_8GB | 4-8 | 32 GB | 30 GB |
| 3B-7B | ~6-14 GB | T4 (16 GB) or A10_24GB | 8-10 | 64 GB | 60 GB |
| 7B-13B | ~14-26 GB | A10_24GB or A100_40GB | 10-12 | 90 GB | 88 GB |
| 13B-30B | ~26-60 GB | A100_40GB or A100_80GB | 12-16 | 128 GB | 120 GB |
| 30B-70B | ~60-140 GB | A100_80GB or H100 (multi-GPU) | 16+ | 200 GB+ | 190 GB+ |

Present a resource suggestion table showing GPU, CPU, memory, shared memory, ephemeral storage, and max model length. Include the list of available GPUs from the cluster. If deployment-specs returned values, show those as "Recommended by TrueFoundry" alongside the table.
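The fallback table can be mirrored as a small lookup, useful when scripting the suggestion. The function name is hypothetical; the tiers are taken directly from the table above:

```bash
# Sketch: map a parameter count (in billions) to the fallback table's GPU tier.
pick_gpu() {
  if   [ "$1" -le 3 ];  then echo "T4 (16 GB)"
  elif [ "$1" -le 7 ];  then echo "T4 (16 GB) or A10_24GB"
  elif [ "$1" -le 13 ]; then echo "A10_24GB or A100_40GB"
  elif [ "$1" -le 30 ]; then echo "A100_40GB or A100_80GB"
  else                       echo "A100_80GB or H100 (multi-GPU)"
  fi
}
pick_gpu 7   # 3B-7B row
```

This is a starting point only; the deployment-specs API recommendation always takes precedence when available.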

Important: Shared Memory


vLLM and TGI require large shared memory (`/dev/shm`). Without it, the model server will crash or perform poorly. Set `shared_memory_size` to roughly 90-95% of `memory_request`.
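The 90-95% guidance is simple integer arithmetic; for example, at the low end of the range:

```bash
# Sketch: size /dev/shm at ~90% of the memory request (values in MiB;
# 64000 is an illustrative request, roughly the 3B-7B table row).
MEMORY_REQUEST_MIB=64000
SHARED_MEMORY_MIB=$((MEMORY_REQUEST_MIB * 90 / 100))
echo "$SHARED_MEMORY_MIB"   # -> 57600
```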

Important: Memory vs VRAM


System memory (RAM) must be much larger than GPU VRAM because:
  • Model weights load into CPU RAM first before transferring to GPU
  • KV cache and request batching use CPU memory
  • Rule of thumb: RAM should be 2-4x the model's VRAM footprint
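Putting the rule of thumb into numbers for a 7B FP16 model (the 2 bytes/param FP16 estimate and the 3x midpoint are the assumptions here):

```bash
# Sketch: rough sizing for a 7B model in FP16.
PARAMS_B=7                    # billions of parameters
VRAM_GB=$((PARAMS_B * 2))     # FP16 ~= 2 bytes/param -> ~14 GB of weights
RAM_GB=$((VRAM_GB * 3))       # midpoint of the 2-4x RAM rule of thumb
echo "${VRAM_GB} ${RAM_GB}"   # -> 14 42
```

This lands in the same range as the 3B-7B row of the fallback table (64 GB RAM), which also budgets for KV cache and batching headroom.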

Step 3: Build the YAML Manifest


For complete manifest templates (vLLM, TGI, NVIDIA NIM), template variables reference, DTYPE selection guide, artifacts download configuration, and common vLLM flags, see references/llm-manifest-templates.md.
Key framework defaults:

| Framework | Default Image | Health Path |
|---|---|---|
| vLLM | public.ecr.aws/truefoundrycloud/vllm/vllm-openai:v0.13.0 | /health |
| TGI | ghcr.io/huggingface/text-generation-inference:2.4.1 | /health |
| NVIDIA NIM | nvcr.io/nim/{model-path}:{version} | /v1/health/ready |

Check references/container-versions.md for the latest pinned versions. Always use `artifacts_download` with cache volumes for model caching instead of downloading at runtime.
Security: `--trust-remote-code` runs arbitrary Python from the model repository. Only use this flag with models from trusted sources. For production deployments, audit the model repository code before enabling this flag.
The vLLM manifest MUST include:
  • `artifacts_download` with `huggingface-hub` type and `cache_volume` for model caching
  • `labels`: `tfy_model_server`, `tfy_openapi_path`, `tfy_sticky_session_header_name`, `huggingface_model_task`
  • `rollout_strategy`, `startup_probe`, `readiness_probe`, `liveness_probe`
  • Env vars: `DTYPE`, `GPU_COUNT`, `MAX_MODEL_LENGTH`, `VLLM_NO_USAGE_STATS`, `NVIDIA_REQUIRE_CUDA`, `GPU_MEMORY_UTILIZATION`, `MODEL_NAME`, `VLLM_CACHE_ROOT`
Health probes are mandatory for all LLM deployments. The manifest templates include LLM-tuned probe values (startup threshold of 35 retries for ~350s tolerance). For general probe configuration, see references/health-probes.md. For large models (30B+), increase the startup `failure_threshold` to 60+.

Step 3a: Write Manifest


Write the YAML manifest to `tfy-manifest.yaml`. Reference references/llm-manifest-templates.md for complete templates and references/manifest-schema.md for field definitions.

Step 4: Preview and Apply


```bash
# Preview
tfy apply -f tfy-manifest.yaml --dry-run --show-diff

# Apply after the user confirms
tfy apply -f tfy-manifest.yaml
```

Fallback: REST API


If the `tfy` CLI is not available, convert the YAML manifest to JSON and deploy via the REST API. See references/cli-fallback.md for the conversion process.

```bash
TFY_API_SH=~/.claude/skills/truefoundry-llm-deploy/scripts/tfy-api.sh

# Get workspace ID
$TFY_API_SH GET "/api/svc/v1/workspaces?fqn=${TFY_WORKSPACE_FQN}"

# Deploy (JSON body)
$TFY_API_SH PUT /api/svc/v1/apps '{ "manifest": { ... JSON version of the YAML manifest ... }, "workspaceId": "WORKSPACE_ID_HERE" }'
```

Via Tool Call


```
tfy_applications_create_deployment(
    manifest={ ... manifest dict ... },
    options={"workspace_id": "ws-internal-id", "force_deploy": false}
)
```

Step 5: Verify Deployment & Return URL


CRITICAL: Always fetch and return the deployment URL and status to the user. A deployment without a reported URL is incomplete. Do this automatically after deploy, without asking an extra verification prompt.

Poll Deployment Status


After submitting the manifest, poll for status. Prefer MCP tool calls first:

```
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "MODEL_NAME"})
```

If MCP tool calls are unavailable, fall back to the API:

```bash
$TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=MODEL_NAME'
```

LLM deployments take longer than regular services:
  • GPU node provisioning: 5-15 min (if scaling up)
  • Model download: 2-10 min (depends on model size and cache)
  • Model loading into GPU: 1-5 min
  • Total: typically 10-30 min for first deployment
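Given those timings, a poll loop wants a generous timeout. A sketch, with `get_status` as a hypothetical stub standing in for the status call above (the real loop would parse the application status from the API or MCP response):

```bash
# Poll-loop sketch: ~30 min ceiling at 30s intervals.
get_status() { echo "RUNNING"; }   # stub; replace with the $TFY_API_SH call
STATUS=""
for i in $(seq 1 60); do
  STATUS=$(get_status)
  [ "$STATUS" = "RUNNING" ] && break
  sleep 30
done
echo "$STATUS"
```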

Report to User


Always present this summary after deployment:
LLM Deployment submitted!

Model: {hf-model-id}
Service: {service-name}
Framework: vLLM / TGI / NIM
Workspace: {workspace-fqn}
GPU: {gpu-count}x {gpu-type}
Status: {BUILDING|DEPLOYING|RUNNING}

Endpoints:
  Public URL:   https://{host} (available once RUNNING)
  Internal DNS: {service-name}.{namespace}.svc.cluster.local:8000

OpenAI-compatible API (once RUNNING):
  curl https://{host}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "{model-name}", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'

Health check:
  curl https://{host}/health

Note: LLM deployments typically take 10-30 minutes for first deploy
(GPU provisioning + model download + loading). Check status with
the applications skill.

Test Once Running


When the service reaches RUNNING status:

```bash
# Health check
curl https://{HOST}/health

# OpenAI-compatible completion (vLLM/TGI)
curl https://{HOST}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "{MODEL_NAME}", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100 }'
```

Public URL


Same as the `deploy` skill -- look up cluster base domains and construct the host.
  1. Fetch cluster base domains: `$TFY_API_SH GET /api/svc/v1/clusters/CLUSTER_ID`
  2. Pick a wildcard domain, strip `*.` to get the base domain
  3. Construct the host: `{model-name}-{workspace-name}.{base_domain}`
  4. Alternative: path-based routing -- use the cluster's base domain directly as `host` and set a unique `path` prefix.
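Steps 2-3 are plain string manipulation; a sketch with a hypothetical wildcard domain and names:

```bash
# Sketch: strip the wildcard prefix and build the public host.
# The domain and names below are illustrative placeholders.
WILDCARD_DOMAIN='*.apps.example.com'       # from the cluster response
BASE_DOMAIN="${WILDCARD_DOMAIN#\*.}"       # drop the leading "*."
MODEL_NAME="gemma-2b"
WORKSPACE_NAME="dev-ws"
HOST="${MODEL_NAME}-${WORKSPACE_NAME}.${BASE_DOMAIN}"
echo "$HOST"   # -> gemma-2b-dev-ws.apps.example.com
```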

Deployment Flow Summary


  1. Check credentials + workspace (Step 0a, prerequisites)
  2. Discover cluster capabilities -- GPUs, base domains (Step 0)
  3. Get model info -- HuggingFace model ID from the user (Step 1)
  4. Call the deployment-specs API to get recommended resources (Step 2)
  5. Generate the YAML manifest referencing references/llm-manifest-templates.md (Step 3)
  6. Write to `tfy-manifest.yaml` (Step 3a)
  7. Preview: `tfy apply -f tfy-manifest.yaml --dry-run --show-diff` (Step 4)
  8. Apply: `tfy apply -f tfy-manifest.yaml` (Step 4)
  9. Verify deployment and return the URL (Step 5)

User Confirmation Checklist


Before deploying, confirm these with the user:
  • Model -- HuggingFace model ID and revision
  • Framework -- vLLM, TGI, or NVIDIA NIM
  • GPU type & count -- from deployment-specs API or cluster GPUs (Step 2)
  • Resources -- CPU, memory, shared memory (deployment-specs recommendation + cluster availability)
  • DTYPE -- float16 or bfloat16 (based on GPU)
  • Max model length -- context window size
  • Access -- public URL or internal-only
  • Authentication -- HF token for gated models (from TrueFoundry secrets)
  • Environment -- dev (1 replica) or production (2+ replicas)
  • Service name -- what to call the deployment
  • Auto-shutdown -- Should the deployment auto-stop after inactivity? (useful for dev/staging to save GPU costs)
</instructions>
<success_criteria>

Success Criteria


  • The LLM deployment has been submitted and the user can see its status in TrueFoundry
  • The agent has reported the deployment URL (public or internal DNS), model name, framework, GPU type, and workspace
  • Deployment status is verified automatically immediately after apply/deploy (no extra prompt)
  • The user has been provided an OpenAI-compatible API curl command to test the model once it is running
  • The agent has confirmed GPU type, resource sizing, DTYPE, and model configuration with the user before deploying
  • Health probes are configured with appropriate startup thresholds for the model size
</success_criteria>
<references>

Composability


  • Find workspace first: Use the `workspaces` skill to get the workspace FQN
  • Check cluster GPUs: Use the `workspaces` skill for GPU type reference
  • Manage secrets: Use the `secrets` skill to create/find HF token secret groups
  • Check deployment status: Use the `applications` skill after deploying
  • Test after deployment: Use the `service-test` skill to validate the endpoint
  • View logs: Use the `logs` skill to debug startup issues
  • Deploy a database alongside: Use the `helm` skill for vector DBs, caches, etc.
  • Benchmark performance: Run load tests against the deployed endpoint to measure throughput/latency
  • Fine-tune first: Fine-tune externally and deploy the resulting model artifact with this skill
  • AI Gateway (optional): For unified API access, multi-model routing, and rate limiting, install `npx skills add truefoundry/tfy-gateway-skills`
</references> <troubleshooting>

Error Handling


For common LLM deployment errors (GPU not available, OOM, CUDA errors, model download failures, probe timeouts, invalid GPU types, host configuration issues) and their fixes, see references/llm-errors.md.

CLI Errors


  • `tfy: command not found` -- Install with `pip install 'truefoundry==0.5.0'`
  • `tfy apply` validation errors -- Check YAML syntax, ensure required fields are present
  • Manifest validation failures -- Check references/llm-manifest-templates.md for correct field names
</troubleshooting>