truefoundry-llm-deploy

Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
<objective>

LLM / Model Deployment


Deploy large language models and ML inference servers to TrueFoundry. Supports vLLM, TGI, and custom model servers with proper GPU allocation, model caching, health probes, and production-ready defaults.
Two paths:
  1. CLI (`tfy apply`) -- Write a YAML manifest and apply it. Works everywhere.
  2. REST API (fallback) -- When the CLI is unavailable, use `tfy-api.sh`.

When to Use


  • User says "deploy a model", "deploy LLM", "serve Gemma/Llama/Mistral/..."
  • User says "deploy vLLM", "deploy TGI", "inference server"
  • User wants to deploy a HuggingFace model for inference
  • User wants GPU-accelerated model serving
  • User wants to deploy NVIDIA NIM (optimized inference containers)

When NOT to Use


  • User wants to deploy a regular web app or API -> prefer the `deploy` skill; ask if the user wants another valid path
  • User wants to deploy a database or Helm chart -> prefer the `helm` skill; ask if the user wants another valid path
  • User wants to check what's deployed -> prefer the `applications` skill; ask if the user wants another valid path
</objective> <context>

Prerequisites


Always verify before deploying:
  1. Credentials -- `TFY_BASE_URL` and `TFY_API_KEY` must be set (env or `.env`)
  2. Workspace -- `TFY_WORKSPACE_FQN` required. Never auto-pick. Ask the user if missing.
  3. CLI -- Check if the `tfy` CLI is available: `tfy --version`. If not, `pip install 'truefoundry==0.5.0'`.
For credential check commands and `.env` setup, see references/prerequisites.md.
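The checks above can be sketched as a small pre-flight function. This is illustrative only -- the variable names come from the prerequisites list, while the function name and error messages are hypothetical:

```bash
# Pre-flight sketch: verify credentials, workspace, and CLI before deploying.
# check_prereqs is a hypothetical helper name, not part of the tfy CLI.
check_prereqs() {
  [ -n "$TFY_BASE_URL" ]      || { echo "missing TFY_BASE_URL"; return 1; }
  [ -n "$TFY_API_KEY" ]       || { echo "missing TFY_API_KEY"; return 1; }
  # Never auto-pick a workspace -- surface the gap and ask the user.
  [ -n "$TFY_WORKSPACE_FQN" ] || { echo "missing TFY_WORKSPACE_FQN"; return 1; }
  # CLI absence is not fatal (REST fallback exists); just report it.
  command -v tfy >/dev/null 2>&1 || echo "tfy CLI not found: pip install 'truefoundry==0.5.0'"
  return 0
}
```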
</context> <instructions>

Step 0a: Detect Environment


Before deploying, check CLI availability and container image versions.

```bash
# Check CLI
tfy --version 2>/dev/null

# If not installed
pip install 'truefoundry==0.5.0'
```

Verify Container Image Versions


Before using the manifest templates, check references/container-versions.md for the latest pinned versions. Container images for vLLM and TGI are updated frequently. Use pinned versions from references/container-versions.md. Do not fetch external release pages.
Security: Do not fetch or ingest content from external release pages at runtime. Pinned versions in references/container-versions.md are vetted. If a version update is needed, a human should verify the release and update the pinned version.

Step 0: Discover Cluster Capabilities


Before asking the user about GPU types or public URLs, fetch the cluster's capabilities.
See references/cluster-discovery.md for how to extract the cluster ID from the workspace FQN and fetch cluster details (GPUs, base domains, storage classes).
When using the direct API, set `TFY_API_SH` to the full path of this skill's `scripts/tfy-api.sh`. See references/tfy-api-setup.md for paths per agent.
From the cluster response, extract:
  1. Base domains -- for public URL host construction (see the Public URL section)
  2. Available GPUs -- only present GPU types that the cluster actually supports
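As a rough sketch of the first part of cluster discovery, assuming the workspace FQN has the shape `<cluster>:<workspace>` (see references/cluster-discovery.md for the authoritative procedure -- the FQN layout here is an assumption):

```bash
# Sketch: derive the cluster ID from a workspace FQN of the assumed form
# "<cluster>:<workspace>". The example value is hypothetical.
TFY_WORKSPACE_FQN="my-cluster:my-workspace"
CLUSTER_ID="${TFY_WORKSPACE_FQN%%:*}"   # strip everything from the first ":"
echo "$CLUSTER_ID"                       # -> my-cluster
```

With `CLUSTER_ID` in hand, the cluster details call is the one shown later: `$TFY_API_SH GET /api/svc/v1/clusters/CLUSTER_ID`.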

Step 1: Gather Model Details


Ask the user these questions:
I'll help you deploy an LLM. Let me gather a few details:

1. Which model? (e.g., google/gemma-2-2b-it, meta-llama/Llama-3.2-1B-Instruct)
2. Serving framework?
   - vLLM (recommended -- fast, OpenAI-compatible)
   - TGI (HuggingFace Text Generation Inference)
   - Custom image
3. Does the model require authentication? (e.g., gated HuggingFace models needing HF_TOKEN)
   - If yes: Do you have a TrueFoundry secret group with the token, or should we set one up?
4. Access: Public URL or internal-only?
5. Environment: Dev/testing or production?

Step 2: Get Recommended Resources from Deployment Specs API


After the user provides a HuggingFace model ID and workspace, call the deployment-specs API to get recommended GPU, CPU, memory, and storage specs.
First, get the workspace ID from the workspace FQN:

```bash
$TFY_API_SH GET "/api/svc/v1/workspaces?fqn=${TFY_WORKSPACE_FQN}"
```

Extract the `id` field from the response. Then call:

```bash
$TFY_API_SH GET "/api/svc/v1/model-catalogues/deployment-specs?huggingfaceHubUrl=https://huggingface.co/${HF_MODEL_ID}&workspaceId=${WORKSPACE_ID}&pipelineTagOverride=text-generation"
```

This returns recommended specs including GPU type, GPU count, CPU, memory, storage, and max model length. Use these as the starting point for resource allocation instead of guessing from the model size table.
If the API call fails (e.g., model not in catalogue), fall back to the model size table below.
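Extracting the `id` field can be done without jq; here is a sketch against a canned response (the JSON shown is a fabricated example, not real API output -- the real response comes from the workspaces call above):

```bash
# Sketch: pull the workspace "id" out of a JSON response with sed.
# RESPONSE is a canned illustrative value.
RESPONSE='{"id":"ws-abc123","fqn":"my-cluster:my-workspace"}'
WORKSPACE_ID=$(printf '%s' "$RESPONSE" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
echo "$WORKSPACE_ID"   # -> ws-abc123
```

If jq is available, `jq -r '.id'` is the more robust choice.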

Fallback: Model Size to GPU Mapping


For full GPU types and DTYPE selection, see references/gpu-reference.md.

| Model Params | Min VRAM (FP16) | Recommended GPU | CPU | Memory | Shared Memory |
|---|---|---|---|---|---|
| < 1B | ~2 GB | T4 (16 GB) | 4 | 16 GB | 15 GB |
| 1B-3B | ~4-6 GB | T4 (16 GB) or A10_8GB | 4-8 | 32 GB | 30 GB |
| 3B-7B | ~6-14 GB | T4 (16 GB) or A10_24GB | 8-10 | 64 GB | 60 GB |
| 7B-13B | ~14-26 GB | A10_24GB or A100_40GB | 10-12 | 90 GB | 88 GB |
| 13B-30B | ~26-60 GB | A100_40GB or A100_80GB | 12-16 | 128 GB | 120 GB |
| 30B-70B | ~60-140 GB | A100_80GB or H100 (multi-GPU) | 16+ | 200 GB+ | 190 GB+ |

Present a resource suggestion table showing GPU, CPU, memory, shared memory, ephemeral storage, and max model length. Include the list of available GPUs from the cluster. If deployment-specs returned values, show those as "Recommended by TrueFoundry" alongside the table.
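The fallback table can be mirrored as a small lookup, useful when scripting the suggestion. The function name is hypothetical; the tiers are taken directly from the table above:

```bash
# Sketch: map a parameter count (in billions) to the fallback table's GPU tier.
pick_gpu() {
  if   [ "$1" -le 3 ];  then echo "T4 (16 GB)"
  elif [ "$1" -le 7 ];  then echo "T4 (16 GB) or A10_24GB"
  elif [ "$1" -le 13 ]; then echo "A10_24GB or A100_40GB"
  elif [ "$1" -le 30 ]; then echo "A100_40GB or A100_80GB"
  else                       echo "A100_80GB or H100 (multi-GPU)"
  fi
}
pick_gpu 7   # 3B-7B row
```

This is a starting point only; the deployment-specs API recommendation always takes precedence when available.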

Important: Shared Memory


vLLM and TGI require large shared memory (`/dev/shm`). Without it, the model server will crash or perform poorly. Set `shared_memory_size` to roughly 90-95% of `memory_request`.
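The 90-95% guidance is simple integer arithmetic; for example, at the low end of the range:

```bash
# Sketch: size /dev/shm at ~90% of the memory request (values in MiB;
# 64000 is an illustrative request, roughly the 3B-7B table row).
MEMORY_REQUEST_MIB=64000
SHARED_MEMORY_MIB=$((MEMORY_REQUEST_MIB * 90 / 100))
echo "$SHARED_MEMORY_MIB"   # -> 57600
```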

Important: Memory vs VRAM


System memory (RAM) must be much larger than GPU VRAM because:
  • Model weights load into CPU RAM first before transferring to GPU
  • KV cache and request batching use CPU memory
  • Rule of thumb: RAM should be 2-4x the model's VRAM footprint
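Putting the rule of thumb into numbers for a 7B FP16 model (the 2 bytes/param FP16 estimate and the 3x midpoint are the assumptions here):

```bash
# Sketch: rough sizing for a 7B model in FP16.
PARAMS_B=7                    # billions of parameters
VRAM_GB=$((PARAMS_B * 2))     # FP16 ~= 2 bytes/param -> ~14 GB of weights
RAM_GB=$((VRAM_GB * 3))       # midpoint of the 2-4x RAM rule of thumb
echo "${VRAM_GB} ${RAM_GB}"   # -> 14 42
```

This lands in the same range as the 3B-7B row of the fallback table (64 GB RAM), which also budgets for KV cache and batching headroom.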

Step 3: Build the YAML Manifest


For complete manifest templates (vLLM, TGI, NVIDIA NIM), template variables reference, DTYPE selection guide, artifacts download configuration, and common vLLM flags, see references/llm-manifest-templates.md.
Key framework defaults:

| Framework | Default Image | Health Path |
|---|---|---|
| vLLM | public.ecr.aws/truefoundrycloud/vllm/vllm-openai:v0.13.0 | /health |
| TGI | ghcr.io/huggingface/text-generation-inference:2.4.1 | /health |
| NVIDIA NIM | nvcr.io/nim/{model-path}:{version} | /v1/health/ready |

Check references/container-versions.md for the latest pinned versions. Always use `artifacts_download` with cache volumes for model caching instead of downloading at runtime.
Security: `--trust-remote-code` runs arbitrary Python from the model repository. Only use this flag with models from trusted sources. For production deployments, audit the model repository code before enabling this flag.
The vLLM manifest MUST include:
  • `artifacts_download` with `huggingface-hub` type and `cache_volume` for model caching
  • `labels`: `tfy_model_server`, `tfy_openapi_path`, `tfy_sticky_session_header_name`, `huggingface_model_task`
  • `rollout_strategy`, `startup_probe`, `readiness_probe`, `liveness_probe`
  • Env vars: `DTYPE`, `GPU_COUNT`, `MAX_MODEL_LENGTH`, `VLLM_NO_USAGE_STATS`, `NVIDIA_REQUIRE_CUDA`, `GPU_MEMORY_UTILIZATION`, `MODEL_NAME`, `VLLM_CACHE_ROOT`
Health probes are mandatory for all LLM deployments. The manifest templates include LLM-tuned probe values (startup threshold of 35 retries for ~350s tolerance). For general probe configuration, see references/health-probes.md. For large models (30B+), increase the startup `failure_threshold` to 60+.

Step 3a: Write Manifest


Write the YAML manifest to `tfy-manifest.yaml`. Reference references/llm-manifest-templates.md for complete templates and references/manifest-schema.md for field definitions.

Step 4: Preview and Apply


```bash
# Preview
tfy apply -f tfy-manifest.yaml --dry-run --show-diff

# Apply after the user confirms
tfy apply -f tfy-manifest.yaml
```

Fallback: REST API


If the `tfy` CLI is not available, convert the YAML manifest to JSON and deploy via the REST API. See references/cli-fallback.md for the conversion process.

```bash
TFY_API_SH=~/.claude/skills/truefoundry-llm-deploy/scripts/tfy-api.sh

# Get workspace ID
$TFY_API_SH GET "/api/svc/v1/workspaces?fqn=${TFY_WORKSPACE_FQN}"

# Deploy (JSON body)
$TFY_API_SH PUT /api/svc/v1/apps '{ "manifest": { ... JSON version of the YAML manifest ... }, "workspaceId": "WORKSPACE_ID_HERE" }'
```

Via Tool Call


```
tfy_applications_create_deployment(
    manifest={ ... manifest dict ... },
    options={"workspace_id": "ws-internal-id", "force_deploy": false}
)
```

Step 5: Verify Deployment & Return URL


CRITICAL: Always fetch and return the deployment URL and status to the user. A deployment without a reported URL is incomplete. Do this automatically after deploy, without asking an extra verification prompt.

Poll Deployment Status


After submitting the manifest, poll for status. Prefer MCP tool calls first:

```
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "MODEL_NAME"})
```

If MCP tool calls are unavailable, fall back to the API:

```bash
$TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=MODEL_NAME'
```

LLM deployments take longer than regular services:
  • GPU node provisioning: 5-15 min (if scaling up)
  • Model download: 2-10 min (depends on model size and cache)
  • Model loading into GPU: 1-5 min
  • Total: typically 10-30 min for first deployment
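Given those timings, a poll loop wants a generous timeout. A sketch, with `get_status` as a hypothetical stub standing in for the status call above (the real loop would parse the application status from the API or MCP response):

```bash
# Poll-loop sketch: ~30 min ceiling at 30s intervals.
get_status() { echo "RUNNING"; }   # stub; replace with the $TFY_API_SH call
STATUS=""
for i in $(seq 1 60); do
  STATUS=$(get_status)
  [ "$STATUS" = "RUNNING" ] && break
  sleep 30
done
echo "$STATUS"
```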

Report to User


Always present this summary after deployment:
LLM Deployment submitted!

Model: {hf-model-id}
Service: {service-name}
Framework: vLLM / TGI / NIM
Workspace: {workspace-fqn}
GPU: {gpu-count}x {gpu-type}
Status: {BUILDING|DEPLOYING|RUNNING}

Endpoints:
  Public URL:   https://{host} (available once RUNNING)
  Internal DNS: {service-name}.{namespace}.svc.cluster.local:8000

OpenAI-compatible API (once RUNNING):
  curl https://{host}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "{model-name}", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'

Health check:
  curl https://{host}/health

Note: LLM deployments typically take 10-30 minutes for first deploy
(GPU provisioning + model download + loading). Check status with
the applications skill.

Test Once Running


When the service reaches RUNNING status:

```bash
# Health check
curl https://{HOST}/health

# OpenAI-compatible completion (vLLM/TGI)
curl https://{HOST}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "{MODEL_NAME}", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100 }'
```

Public URL


Same as the `deploy` skill -- look up cluster base domains and construct the host.
  1. Fetch cluster base domains: `$TFY_API_SH GET /api/svc/v1/clusters/CLUSTER_ID`
  2. Pick a wildcard domain, strip `*.` to get the base domain
  3. Construct the host: `{model-name}-{workspace-name}.{base_domain}`
  4. Alternative: path-based routing -- use the cluster's base domain directly as `host` and set a unique `path` prefix.
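Steps 2-3 are plain string manipulation; a sketch with a hypothetical wildcard domain and names:

```bash
# Sketch: strip the wildcard prefix and build the public host.
# The domain and names below are illustrative placeholders.
WILDCARD_DOMAIN='*.apps.example.com'       # from the cluster response
BASE_DOMAIN="${WILDCARD_DOMAIN#\*.}"       # drop the leading "*."
MODEL_NAME="gemma-2b"
WORKSPACE_NAME="dev-ws"
HOST="${MODEL_NAME}-${WORKSPACE_NAME}.${BASE_DOMAIN}"
echo "$HOST"   # -> gemma-2b-dev-ws.apps.example.com
```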

Deployment Flow Summary


  1. Check credentials + workspace (Step 0a, prerequisites)
  2. Discover cluster capabilities -- GPUs, base domains (Step 0)
  3. Get model info -- HuggingFace model ID from the user (Step 1)
  4. Call the deployment-specs API to get recommended resources (Step 2)
  5. Generate the YAML manifest referencing references/llm-manifest-templates.md (Step 3)
  6. Write to `tfy-manifest.yaml` (Step 3a)
  7. Preview: `tfy apply -f tfy-manifest.yaml --dry-run --show-diff` (Step 4)
  8. Apply: `tfy apply -f tfy-manifest.yaml` (Step 4)
  9. Verify deployment and return the URL (Step 5)

User Confirmation Checklist


Before deploying, confirm these with the user:
  • Model -- HuggingFace model ID and revision
  • Framework -- vLLM, TGI, or NVIDIA NIM
  • GPU type & count -- from deployment-specs API or cluster GPUs (Step 2)
  • Resources -- CPU, memory, shared memory (deployment-specs recommendation + cluster availability)
  • DTYPE -- float16 or bfloat16 (based on GPU)
  • Max model length -- context window size
  • Access -- public URL or internal-only
  • Authentication -- HF token for gated models (from TrueFoundry secrets)
  • Environment -- dev (1 replica) or production (2+ replicas)
  • Service name -- what to call the deployment
  • Auto-shutdown -- Should the deployment auto-stop after inactivity? (useful for dev/staging to save GPU costs)
</instructions>
<success_criteria>

Success Criteria


  • The LLM deployment has been submitted and the user can see its status in TrueFoundry
  • The agent has reported the deployment URL (public or internal DNS), model name, framework, GPU type, and workspace
  • Deployment status is verified automatically immediately after apply/deploy (no extra prompt)
  • The user has been provided an OpenAI-compatible API curl command to test the model once it is running
  • The agent has confirmed GPU type, resource sizing, DTYPE, and model configuration with the user before deploying
  • Health probes are configured with appropriate startup thresholds for the model size
</success_criteria>
<references>

Composability


  • Find workspace first: Use the `workspaces` skill to get the workspace FQN
  • Check cluster GPUs: Use the `workspaces` skill for GPU type reference
  • Manage secrets: Use the `secrets` skill to create/find HF token secret groups
  • Check deployment status: Use the `applications` skill after deploying
  • Test after deployment: Use the `service-test` skill to validate the endpoint
  • View logs: Use the `logs` skill to debug startup issues
  • Deploy a database alongside: Use the `helm` skill for vector DBs, caches, etc.
  • Benchmark performance: Run load tests against the deployed endpoint to measure throughput/latency
  • Fine-tune first: Fine-tune externally and deploy the resulting model artifact with this skill
  • AI Gateway (optional): For unified API access, multi-model routing, and rate limiting, install `npx skills add truefoundry/tfy-gateway-skills`
</references> <troubleshooting>

Error Handling


For common LLM deployment errors (GPU not available, OOM, CUDA errors, model download failures, probe timeouts, invalid GPU types, host configuration issues) and their fixes, see references/llm-errors.md.

CLI Errors


  • `tfy: command not found` -- Install with `pip install 'truefoundry==0.5.0'`
  • `tfy apply` validation errors -- Check YAML syntax, ensure required fields are present
  • Manifest validation failures -- Check references/llm-manifest-templates.md for correct field names
</troubleshooting>