truefoundry-llm-deploy
Routing note: For ambiguous user intents, use the shared clarification templates in `references/intent-clarification.md`.
LLM / Model Deployment
Deploy large language models and ML inference servers to TrueFoundry. Supports vLLM, TGI, and custom model servers with proper GPU allocation, model caching, health probes, and production-ready defaults.

Two paths:
- CLI (`tfy apply`) -- Write a YAML manifest and apply it. Works everywhere.
- REST API (fallback) -- When the CLI is unavailable, use `tfy-api.sh`.
When to Use
- User says "deploy a model", "deploy LLM", "serve Gemma/Llama/Mistral/..."
- User says "deploy vLLM", "deploy TGI", "inference server"
- User wants to deploy a HuggingFace model for inference
- User wants GPU-accelerated model serving
- User wants to deploy NVIDIA NIM (optimized inference containers)
When NOT to Use
- User wants to deploy a regular web app or API -> prefer the `deploy` skill; ask if the user wants another valid path
- User wants to deploy a database or Helm chart -> prefer the `helm` skill; ask if the user wants another valid path
- User wants to check what's deployed -> prefer the `applications` skill; ask if the user wants another valid path
Prerequisites
Always verify before deploying:

- Credentials -- `TFY_BASE_URL` and `TFY_API_KEY` must be set (env or `.env`).
- Workspace -- `TFY_WORKSPACE_FQN` is required. Never auto-pick. Ask the user if missing.
- CLI -- Check if the `tfy` CLI is available: `tfy --version`. If not, `pip install 'truefoundry==0.5.0'`.

For credential check commands and .env setup, see `references/prerequisites.md`.
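The checks above can be sketched as a single pre-flight script. This is a minimal inline variant for illustration; the canonical credential-check commands live in `references/prerequisites.md`.

```bash
# check_env VAR... -> records missing variable names in $missing, returns 0 if none are missing
check_env() {
  missing=""
  for var in "$@"; do
    if eval "[ -z \"\${$var}\" ]"; then
      missing="$missing $var"
    fi
  done
  [ -z "$missing" ]
}

if check_env TFY_BASE_URL TFY_API_KEY TFY_WORKSPACE_FQN; then
  echo "Credentials and workspace configured."
else
  echo "Missing required settings:$missing"
fi

# CLI availability: prefer tfy, otherwise note the REST fallback.
if command -v tfy >/dev/null 2>&1; then
  tfy --version
else
  echo "tfy CLI not found -- install with: pip install 'truefoundry==0.5.0'"
fi
```

If any variable is missing, stop and ask the user rather than guessing -- in particular, never auto-pick a workspace.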
Step 0a: Detect Environment
Before deploying, check CLI availability and container image versions.

```bash
# Check CLI
tfy --version 2>/dev/null

# If not installed
pip install 'truefoundry==0.5.0'
```

Verify Container Image Versions

Before using the manifest templates, check `references/container-versions.md` for the latest pinned versions. Container images for vLLM and TGI are updated frequently. Use pinned versions from `references/container-versions.md`. Do not fetch external release pages.

Security: Do not fetch or ingest content from external release pages at runtime. Pinned versions in `references/container-versions.md` are vetted. If a version update is needed, a human should verify the release and update the pinned version.
Step 0: Discover Cluster Capabilities
Before asking the user about GPU types or public URLs, fetch the cluster's capabilities.

See `references/cluster-discovery.md` for how to extract the cluster ID from the workspace FQN and fetch cluster details (GPUs, base domains, storage classes).

When using the direct API, set `TFY_API_SH` to the full path of this skill's `scripts/tfy-api.sh`. See `references/tfy-api-setup.md` for paths per agent.

From the cluster response, extract:
- Base domains -- for public URL host construction (see Public URL section)
- Available GPUs -- only present GPU types that the cluster actually supports
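As a sketch of the first step, assuming the workspace FQN puts the cluster identifier before the first colon (verify this against `references/cluster-discovery.md` before relying on it; the FQN value below is hypothetical):

```bash
# Extract the cluster id from a workspace FQN of the form "<cluster-id>:<workspace-name>".
TFY_WORKSPACE_FQN="demo-cluster:llm-workspace"   # hypothetical example value
CLUSTER_ID="${TFY_WORKSPACE_FQN%%:*}"            # everything before the first ':'
echo "cluster id: $CLUSTER_ID"

# Then fetch cluster details (GPUs, base domains, storage classes):
# $TFY_API_SH GET "/api/svc/v1/clusters/${CLUSTER_ID}"
```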
Step 1: Gather Model Details
Ask the user these questions:
I'll help you deploy an LLM. Let me gather a few details:
1. Which model? (e.g., google/gemma-2-2b-it, meta-llama/Llama-3.2-1B-Instruct)
2. Serving framework?
- vLLM (recommended -- fast, OpenAI-compatible)
- TGI (HuggingFace Text Generation Inference)
- Custom image
3. Does the model require authentication? (e.g., gated HuggingFace models needing HF_TOKEN)
- If yes: Do you have a TrueFoundry secret group with the token, or should we set one up?
4. Access: Public URL or internal-only?
5. Environment: Dev/testing or production?

Step 2: Get Recommended Resources from Deployment Specs API
After the user provides a HuggingFace model ID and workspace, call the deployment-specs API to get recommended GPU, CPU, memory, and storage specs.

First, get the workspace ID from the workspace FQN:

```bash
$TFY_API_SH GET "/api/svc/v1/workspaces?fqn=${TFY_WORKSPACE_FQN}"
```

Extract the `id` field from the response. Then call:

```bash
$TFY_API_SH GET "/api/svc/v1/model-catalogues/deployment-specs?huggingfaceHubUrl=https://huggingface.co/${HF_MODEL_ID}&workspaceId=${WORKSPACE_ID}&pipelineTagOverride=text-generation"
```

This returns recommended specs including GPU type, GPU count, CPU, memory, storage, and max model length. Use these as the starting point for resource allocation instead of guessing from the model size table.

If the API call fails (e.g., model not in catalogue), fall back to the model size table below.
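Extracting the `id` can be done with `jq` if it is installed; the sed variant below is a dependency-free sketch, and the JSON shape is a canned illustration -- check the real payload before relying on it.

```bash
# Canned example of a workspaces response; the real shape may differ.
response='[{"id": "ws-abc123", "fqn": "demo-cluster:llm-workspace"}]'
# Pull out the first "id" value (prefer `jq -r '.[0].id'` when jq is available).
WORKSPACE_ID="$(printf '%s' "$response" | sed -n 's/.*"id": *"\([^"]*\)".*/\1/p')"
echo "workspace id: $WORKSPACE_ID"
```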
Fallback: Model Size to GPU Mapping
For full GPU types and DTYPE selection, see `references/gpu-reference.md`.

| Model Params | Min VRAM (FP16) | Recommended GPU | CPU | Memory | Shared Memory |
|---|---|---|---|---|---|
| < 1B | ~2 GB | T4 (16 GB) | 4 | 16 GB | 15 GB |
| 1B-3B | ~4-6 GB | T4 (16 GB) or A10_8GB | 4-8 | 32 GB | 30 GB |
| 3B-7B | ~6-14 GB | T4 (16 GB) or A10_24GB | 8-10 | 64 GB | 60 GB |
| 7B-13B | ~14-26 GB | A10_24GB or A100_40GB | 10-12 | 90 GB | 88 GB |
| 13B-30B | ~26-60 GB | A100_40GB or A100_80GB | 12-16 | 128 GB | 120 GB |
| 30B-70B | ~60-140 GB | A100_80GB or H100 (multi-GPU) | 16+ | 200 GB+ | 190 GB+ |

Present a resource suggestion table showing GPU, CPU, memory, shared memory, ephemeral storage, and max model length. Include the list of available GPUs from the cluster. If deployment-specs returned values, show those as "Recommended by TrueFoundry" alongside the table.
Important: Shared Memory
vLLM and TGI require large shared memory (`/dev/shm`). Without it, the model server will crash or perform poorly. Set `shared_memory_size` to roughly 90-95% of `memory_request`.

Important: Memory vs VRAM
System memory (RAM) must be much larger than GPU VRAM because:
- Model weights load into CPU RAM first before transferring to GPU
- KV cache and request batching use CPU memory
- Rule of thumb: RAM should be 2-4x the model's VRAM footprint
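The two rules above can be combined into a rough sizing sketch: RAM at ~3x the VRAM footprint (midpoint of the 2-4x rule of thumb) and shared memory at ~90% of RAM. The numbers are illustrative starting points, not a substitute for the deployment-specs API.

```bash
# size_for_vram <vram_gb> -> suggested memory_request and shared_memory_size
size_for_vram() {
  vram_gb="$1"
  mem_gb=$((vram_gb * 3))          # RAM ~= 3x VRAM footprint
  shm_gb=$((mem_gb * 90 / 100))    # shared memory ~= 90% of memory_request
  echo "memory_request=${mem_gb}Gi shared_memory_size=${shm_gb}Gi"
}

size_for_vram 14   # e.g., a ~7B model in FP16 (~14 GB of weights)
```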
Step 3: Build the YAML Manifest
For complete manifest templates (vLLM, TGI, NVIDIA NIM), template variables reference, DTYPE selection guide, artifacts download configuration, and common vLLM flags, see references/llm-manifest-templates.md.

Key framework defaults:

| Framework | Default Image | Health Path |
|---|---|---|
| vLLM | | |
| TGI | | |
| NVIDIA NIM | | |

Check `references/container-versions.md` for the latest pinned versions. Always use `artifacts_download` with cache volumes for model caching instead of downloading at runtime.

Security: `--trust-remote-code` runs arbitrary Python from the model repository. Only use this flag with models from trusted sources. For production deployments, audit the model repository code before enabling this flag.

The vLLM manifest MUST include:
- `artifacts_download` with type `huggingface-hub` and `cache_volume` for model caching
- `labels`: `tfy_model_server`, `tfy_openapi_path`, `tfy_sticky_session_header_name`, `huggingface_model_task`
- `rollout_strategy`, `startup_probe`, `readiness_probe`, `liveness_probe`
- Env vars: `DTYPE`, `GPU_COUNT`, `MAX_MODEL_LENGTH`, `VLLM_NO_USAGE_STATS`, `NVIDIA_REQUIRE_CUDA`, `GPU_MEMORY_UTILIZATION`, `MODEL_NAME`, `VLLM_CACHE_ROOT`

Health probes are mandatory for all LLM deployments. The manifest templates include LLM-tuned probe values (startup threshold of 35 retries for ~350s tolerance). For general probe configuration, see `references/health-probes.md`. For large models (30B+), increase the startup `failure_threshold` to 60+.
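To make the checklist concrete, here is an illustrative fragment only -- the field names follow the checklist above, but the authoritative schema lives in `references/llm-manifest-templates.md` and `references/manifest-schema.md`, and the service name and values are hypothetical placeholders.

```bash
# Write a partial manifest fragment (NOT a complete, deployable manifest).
cat > tfy-manifest.yaml <<'EOF'
name: gemma-2-2b-vllm            # hypothetical service name
type: service
labels:
  tfy_model_server: vLLM
  huggingface_model_task: text-generation
env:
  MODEL_NAME: google/gemma-2-2b-it
  DTYPE: bfloat16
  GPU_COUNT: "1"
  MAX_MODEL_LENGTH: "8192"
EOF
echo "wrote tfy-manifest.yaml"
```

Fill in the remaining required blocks (`artifacts_download`, probes, `rollout_strategy`, resources) from the templates before applying.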
Step 3a: Write Manifest

Write the YAML manifest to `tfy-manifest.yaml`. Reference `references/llm-manifest-templates.md` for complete templates and `references/manifest-schema.md` for field definitions.

Step 4: Preview and Apply
```bash
# Preview
tfy apply -f tfy-manifest.yaml --dry-run --show-diff

# Apply after user confirms
tfy apply -f tfy-manifest.yaml
```
Fallback: REST API

If the `tfy` CLI is not available, convert the YAML manifest to JSON and deploy via REST API. See `references/cli-fallback.md` for the conversion process.

```bash
TFY_API_SH=~/.claude/skills/truefoundry-llm-deploy/scripts/tfy-api.sh

# Get workspace ID
$TFY_API_SH GET "/api/svc/v1/workspaces?fqn=${TFY_WORKSPACE_FQN}"

# Deploy (JSON body)
$TFY_API_SH PUT /api/svc/v1/apps '{
  "manifest": { ... JSON version of the YAML manifest ... },
  "workspaceId": "WORKSPACE_ID_HERE"
}'
```
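Wrapping an already-converted JSON manifest into the request body can be sketched as follows; the YAML-to-JSON conversion itself is covered in `references/cli-fallback.md`, and the manifest below is a placeholder, not a deployable one.

```bash
# Build the PUT /api/svc/v1/apps request body from a pre-converted manifest.
manifest_json='{"name": "gemma-2-2b-vllm", "type": "service"}'   # placeholder manifest
body=$(printf '{"manifest": %s, "workspaceId": "%s"}' "$manifest_json" "ws-abc123")
echo "$body"
# Then: $TFY_API_SH PUT /api/svc/v1/apps "$body"
```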
Via Tool Call

```
tfy_applications_create_deployment(
  manifest={ ... manifest dict ... },
  options={"workspace_id": "ws-internal-id", "force_deploy": false}
)
```

Step 5: Verify Deployment & Return URL
CRITICAL: Always fetch and return the deployment URL and status to the user. A deployment without a reported URL is incomplete.
Do this automatically after deploy, without asking an extra verification prompt.
Poll Deployment Status
After submitting the manifest, poll for status. Prefer MCP tool calls first:

```
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "MODEL_NAME"})
```

If MCP tool calls are unavailable, fall back to the API:

```bash
$TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=MODEL_NAME'
```

LLM deployments take longer than regular services:
- GPU node provisioning: 5-15 min (if scaling up)
- Model download: 2-10 min (depends on model size and cache)
- Model loading into GPU: 1-5 min
- Total: typically 10-30 min for first deployment
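Given those timings, polling needs a generous deadline. This loop is a sketch: `get_status` is a hypothetical helper standing in for the API call above, since the exact status field in the `/apps` response should be checked against the real payload.

```bash
# poll_status <max_attempts> -> returns 0 once the status is RUNNING, 1 on timeout.
poll_status() {
  max_attempts="$1"; attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    # In practice, derive status from:
    # $TFY_API_SH GET "/api/svc/v1/apps?workspaceFqn=${TFY_WORKSPACE_FQN}&applicationName=${MODEL_NAME}"
    status="$(get_status)"   # hypothetical helper standing in for the call above
    echo "attempt ${attempt}: ${status}"
    if [ "$status" = "RUNNING" ]; then return 0; fi
    sleep "${POLL_INTERVAL:-60}"
  done
  return 1
}
```

With a 60s interval, `poll_status 30` covers the typical 10-30 minute first deploy.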
Report to User
Always present this summary after deployment:

```
LLM Deployment submitted!
Model: {hf-model-id}
Service: {service-name}
Framework: vLLM / TGI / NIM
Workspace: {workspace-fqn}
GPU: {gpu-count}x {gpu-type}
Status: {BUILDING|DEPLOYING|RUNNING}

Endpoints:
  Public URL: https://{host} (available once RUNNING)
  Internal DNS: {service-name}.{namespace}.svc.cluster.local:8000

OpenAI-compatible API (once RUNNING):
  curl https://{host}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "{model-name}", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'

Health check:
  curl https://{host}/health

Note: LLM deployments typically take 10-30 minutes for first deploy
(GPU provisioning + model download + loading). Check status with
the applications skill.
```
Test Once Running
When the service reaches RUNNING status:

```bash
# Health check
curl https://{HOST}/health

# OpenAI-compatible completion (vLLM/TGI)
curl https://{HOST}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "{MODEL_NAME}", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100 }'
```
Public URL

Same as the `deploy` skill -- look up cluster base domains and construct the host.

- Fetch cluster base domains: `$TFY_API_SH GET /api/svc/v1/clusters/CLUSTER_ID`
- Pick a wildcard domain, strip `*.` to get the base domain
- Construct host: `{model-name}-{workspace-name}.{base_domain}`
- Alternative: path-based routing -- use the cluster's base domain directly as `host` and set a unique `path` prefix.
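The host-construction steps above can be sketched as shell string operations; the wildcard domain below is a hypothetical example of what a cluster's base-domains list might contain.

```bash
wildcard="*.apps.example-cluster.truefoundry.cloud"   # hypothetical wildcard entry
base_domain="${wildcard#\*.}"                         # strip the leading "*."
host="gemma-2-2b-vllm-llm-workspace.${base_domain}"   # {model-name}-{workspace-name}.{base_domain}
echo "$host"
```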
Deployment Flow Summary
- Check credentials + workspace (Step 0a, prerequisites)
- Discover cluster capabilities -- GPUs, base domains (Step 0)
- Get model info -- HuggingFace model ID from user (Step 1)
- Call deployment-specs API to get recommended resources (Step 2)
- Generate YAML manifest referencing `references/llm-manifest-templates.md` (Step 3)
- Write to `tfy-manifest.yaml` (Step 3a)
- Preview: `tfy apply -f tfy-manifest.yaml --dry-run --show-diff` (Step 4)
- Apply: `tfy apply -f tfy-manifest.yaml` (Step 4)
- Verify deployment and return URL (Step 5)
User Confirmation Checklist
Before deploying, confirm these with the user:
- Model -- HuggingFace model ID and revision
- Framework -- vLLM, TGI, or NVIDIA NIM
- GPU type & count -- from deployment-specs API or cluster GPUs (Step 2)
- Resources -- CPU, memory, shared memory (deployment-specs recommendation + cluster availability)
- DTYPE -- float16 or bfloat16 (based on GPU)
- Max model length -- context window size
- Access -- public URL or internal-only
- Authentication -- HF token for gated models (from TrueFoundry secrets)
- Environment -- dev (1 replica) or production (2+ replicas)
- Service name -- what to call the deployment
- Auto-shutdown -- Should the deployment auto-stop after inactivity? (useful for dev/staging to save GPU costs)
Success Criteria

- The LLM deployment has been submitted and the user can see its status in TrueFoundry
- The agent has reported the deployment URL (public or internal DNS), model name, framework, GPU type, and workspace
- Deployment status is verified automatically immediately after apply/deploy (no extra prompt)
- The user has been provided an OpenAI-compatible API curl command to test the model once it is running
- The agent has confirmed GPU type, resource sizing, DTYPE, and model configuration with the user before deploying
- Health probes are configured with appropriate startup thresholds for the model size
Composability

- Find workspace first: Use the `workspaces` skill to get the workspace FQN
- Check cluster GPUs: Use the `workspaces` skill for GPU type reference
- Manage secrets: Use the `secrets` skill to create/find HF token secret groups
- Check deployment status: Use the `applications` skill after deploying
- Test after deployment: Use the `service-test` skill to validate the endpoint
- View logs: Use the `logs` skill to debug startup issues
- Deploy a database alongside: Use the `helm` skill for vector DBs, caches, etc.
- Benchmark performance: Run load tests against the deployed endpoint to measure throughput/latency
- Fine-tune first: Fine-tune externally and deploy the resulting model artifact with this skill
- AI Gateway (optional): For unified API access, multi-model routing, and rate limiting, install `npx skills add truefoundry/tfy-gateway-skills`
Error Handling
For common LLM deployment errors (GPU not available, OOM, CUDA errors, model download failures, probe timeouts, invalid GPU types, host configuration issues) and their fixes, see references/llm-errors.md.

CLI Errors

- `tfy: command not found` -- Install with `pip install 'truefoundry==0.5.0'`
- `tfy apply` validation errors -- Check YAML syntax, ensure required fields are present
- Manifest validation failures -- Check `references/llm-manifest-templates.md` for correct field names