building-with-llms

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Building with LLMs

基于LLM的应用开发

Scope

适用范围

Covers

Building and shipping LLM-powered features/apps (assistant, copilot, light agent workflows)
Prompt + tool contract design (instructions, schemas, examples, guardrails)
Data quality + evaluation (test sets, rubrics, red teaming, iteration loop)
Production readiness (latency/cost budgets, logging, fallbacks, safety/security checks)
Using coding agents (Codex/Claude Code) to accelerate engineering safely

When to use

“Turn this LLM feature idea into a build plan with prompts, evals, and launch checks.”
“We need a system prompt + tool definitions + output schema for our LLM workflow.”
“Our LLM is flaky—design an eval plan and iteration loop to stabilize quality.”
“Design a RAG/tool-using agent approach with safety and monitoring.”
“We want to use an AI coding agent to implement this—set constraints and review gates.”

When NOT to use

You need product/portfolio strategy and positioning (use
```
ai-product-strategy
```
).
You need a full PRD/spec set for cross-functional alignment (use
```
writing-prds
```
/
```
writing-specs-designs
```
).

You need primary user research (use

conducting-user-interviews

usability-testing

You are doing model training/research, infra architecture, or bespoke model tuning (delegate to ML/eng; this skill assumes API models).
You only want “which model/provider should we pick?” (treat as an input; if it dominates, do a separate evaluation doc).

涵盖内容

构建并交付基于LLM的功能/应用（助手、Copilot、轻量Agent工作流）
提示词+工具契约设计（指令、Schema、示例、防护规则）
数据质量与评估（测试集、评分标准、红队测试、迭代循环）
生产就绪性（延迟/成本预算、日志记录、降级方案、安全/合规检查）
安全使用代码Agent（Codex/Claude Code）加速工程开发

适用场景

“将这个LLM功能想法转化为包含提示词、评估方案和上线检查项的开发计划。”
“我们需要为LLM工作流设计系统提示词+工具定义+输出Schema。”
“我们的LLM应用不稳定——设计一套评估方案和迭代循环来稳定质量。”
“设计一套带有安全机制和监控的RAG/工具调用型Agent方案。”
“我们希望使用AI代码Agent来实现这个功能——设定约束条件和审核关卡。”

不适用场景

你需要产品/组合战略与定位（请使用
```
ai-product-strategy
```
）。
你需要用于跨职能对齐的完整PRD/规格文档（请使用
```
writing-prds
```
/
```
writing-specs-designs
```
）。
你需要开展初级用户研究（请使用
```
conducting-user-interviews
```
/
```
usability-testing
```
）。
你正在进行模型训练/研究、基础设施架构设计或定制模型调优（请委托给ML/工程团队；本技能基于API调用型模型）。
你仅想了解“我们应该选择哪个模型/服务商？”（将其视为输入；若该问题占主导，请单独撰写评估文档）。

Inputs

输入要求

Minimum required

Use case + target user + what “good” looks like (success metrics + failure modes)
The LLM’s job: generate text, transform data, classify, extract, plan, or take actions via tools
Constraints: privacy/compliance, data sensitivity, latency, cost, reliability, supported regions
Integration surface: UI/workflow, downstream systems/APIs/tools, and any required output schema

Missing-info strategy

Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
If details remain missing, proceed with explicit assumptions and provide 2–3 options (prompting vs RAG vs tool use; autonomy level).
If asked to write code or run commands, request confirmation and use least privilege (no secrets; avoid destructive changes).

最低必填信息

用例+目标用户+“合格”的定义（成功指标+失败场景）
LLM的具体任务：生成文本、转换数据、分类、提取信息、规划，或通过工具执行操作
约束条件：隐私/合规要求、数据敏感度、延迟、成本、可靠性、支持区域
集成场景：UI/工作流、下游系统/API/工具，以及任何要求的输出Schema

缺失信息处理策略

从references/INTAKE.md中提出最多5个问题（每次3–5个）。
若仍有信息缺失，基于明确假设推进，并提供2–3种可选方案（提示词工程 vs RAG vs 工具调用；自主权限等级）。
若被要求编写代码或执行命令，需先请求确认，并遵循最小权限原则（不处理密钥；避免破坏性操作）。

Outputs (deliverables)

输出成果（交付物）

Produce an LLM Build Pack (in chat; or as files if requested), in this order:

Feature brief (goal, users, non-goals, constraints, success + guardrails)
System design sketch (pattern + architecture, context strategy, budgets, failure handling)
Prompt + tool contract (system prompt, tool schemas, output schema, examples, refusal/guardrails)
Data + evaluation plan (test set, rubrics, automated checks, red-team suite, acceptance thresholds)
Build + iteration plan (prototype slice, instrumentation, debugging loop, how to use coding agents safely)
Launch + monitoring plan (logging, dashboards/alerts, fallback/rollback, incident playbook hooks)
Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

生成一份LLM Build Pack（聊天形式；或按要求生成文件），顺序如下：

功能简介（目标、用户群体、非目标、约束条件、成功指标+防护规则）
系统设计草图（模式+架构、上下文策略、预算、故障处理）
提示词+工具契约（系统提示词、工具Schema、输出Schema、示例、拒绝规则/防护机制）
数据+评估方案（测试集、评分标准、自动化检查、红队测试套件、验收阈值）
开发+迭代计划（原型切片、监控埋点、调试循环、如何安全使用代码Agent）
上线+监控方案（日志记录、仪表盘/告警、降级/回滚方案、事件响应手册钩子）
风险/待解决问题/下一步行动（必须包含）

模板：references/TEMPLATES.md

Workflow (8 steps)

工作流程（8个步骤）

1) Frame the job, boundary, and “good”

1) 明确任务、边界与“合格”标准

Inputs: Use case, target user, constraints.
Actions: Write a crisp job statement (“The LLM must…”) + 3–5 non-goals. Define success metrics and guardrails (quality, safety, cost, latency).
Outputs: Draft Feature brief.
Checks: A stakeholder can restate what the LLM does and does not do, and how success is measured.

输入： 用例、目标用户、约束条件。
行动： 撰写清晰的任务说明（“LLM必须……”）+3–5条非目标。定义成功指标和防护规则（质量、安全、成本、延迟）。
输出： 草稿版功能简介。
校验： 利益相关者能够重述LLM的能做与不能做，以及如何衡量成功。

2) Choose the minimum viable autonomy pattern

2) 选择最小可行自主权限模式

Inputs: Workflow + risk tolerance.
Actions: Decide assistant vs copilot vs agent-like tool use. Identify “human control points” (review/approve moments) and what the model is never allowed to do.
Outputs: Autonomy decisions captured in Feature brief.
Checks: Any action-taking behavior has explicit permissions, confirmations, and an undo/rollback story.

输入： 工作流+风险容忍度。
行动： 决定采用助手、Copilot还是类Agent的工具调用模式。确定“人工控制点”（审核/批准环节）以及模型绝对禁止执行的操作。
输出： 自主权限决策记录在功能简介中。
校验： 任何执行操作的行为都有明确的权限、确认机制，以及撤销/回滚方案。

3) Design the context strategy (prompting → RAG → tools)

3) 设计上下文策略（提示词工程 → RAG → 工具调用）

Inputs: Data sources, integration points, constraints.
Actions: Decide how the model gets reliable context: instruction hierarchy, retrieval strategy, tool calls, structured inputs. Define the “source of truth” and how conflicts are handled.
Outputs: Draft System design sketch.
Checks: You can explain (a) what data is used, (b) where it comes from, (c) how freshness/authority is enforced.

输入： 数据源、集成点、约束条件。
行动： 确定模型获取可靠上下文的方式：指令层级、检索策略、工具调用、结构化输入。定义“事实来源”以及冲突处理规则。
输出： 草稿版系统设计草图。
校验： 能够解释（a）使用哪些数据，（b）数据来源，（c）如何保证数据的新鲜度/权威性。

4) Draft the prompt + tool contract (make the system legible)

4) 起草提示词+工具契约（提升系统可解释性）

Inputs: Job statement + context strategy + output schema needs.
Actions: Write the system prompt, tool descriptions, and output schema. Add examples and explicit DO/DO NOT rules. Include safe failure behavior (ask clarifying questions, abstain, cite sources).
Outputs: Prompt + tool contract.
Checks: A reviewer can predict behavior for 5–10 representative inputs; contract includes at least 3 hard constraints and examples.

输入： 任务说明+上下文策略+输出Schema需求。
行动： 编写系统提示词、工具描述和输出Schema。添加示例和明确的允许/禁止规则。包含安全故障处理逻辑（询问澄清问题、拒绝执行、引用来源）。
输出： 提示词+工具契约。
校验： 审核人员能够预测5–10个典型输入对应的模型行为；契约包含至少3条硬性约束和示例。

5) Build the eval set + rubric (debug like software)

5) 构建测试集+评分标准（像调试软件一样调试LLM）

Inputs: Expected behaviors + failure modes + edge cases.
Actions: Create a test set covering normal cases, tricky cases, and red-team cases. Define a scoring rubric and acceptance thresholds. Add automated checks where possible (schema validity, citation presence, forbidden content).
Outputs: Data + evaluation plan.
Checks: You can run the same prompts repeatedly and measure improvement/regression; evals cover the top failure modes.

输入： 预期行为+失败场景+边缘案例。
行动： 创建覆盖常规场景、复杂场景和红队测试场景的测试集。定义评分标准和验收阈值。尽可能添加自动化检查（Schema有效性、引用存在性、禁止内容检测）。
输出： 数据+评估方案。
校验： 能够重复运行相同提示词并衡量改进/退化情况；评估覆盖主要失败场景。

6) Prototype a thin slice, using coding agents safely

6) 构建最小可行原型，安全使用代码Agent

Inputs: System sketch + prompt contract + eval plan.
Actions: Implement the smallest end-to-end slice. Use coding agents for “lower hanging fruit” tasks, but keep tight constraints: small diffs, tests, code review, no secret handling.
Outputs: Build + iteration plan (and optionally a prototype plan/checklist).
Checks: You can explain what the agent changed, why, and how it was validated (tests, evals, manual review).

输入： 系统设计草图+提示词契约+评估方案。
行动： 实现最小端到端切片。使用代码Agent完成“低难度”任务，但需严格约束：小范围代码变更、测试、代码审核、不处理密钥。
输出： 开发+迭代计划（可选包含原型计划/检查清单）。
校验： 能够解释Agent做了哪些变更、原因，以及如何验证（测试、评估、人工审核）。

7) Production readiness: budgets, monitoring, and failure handling

7) 生产就绪：预算、监控与故障处理

Inputs: Prototype learnings + constraints.
Actions: Define cost/latency budgets, fallbacks, rate limits, logging fields, and alert thresholds. Address prompt injection/tool misuse risks; add safeguards and review processes.
Outputs: Launch + monitoring plan.
Checks: There is a clear path to detect regressions, cap cost, and safely degrade when the model misbehaves.

输入： 原型学习成果+约束条件。
行动： 定义成本/延迟预算、降级方案、速率限制、日志字段和告警阈值。解决提示词注入/工具滥用风险；添加防护措施和审核流程。
输出： 上线+监控方案。
校验： 存在清晰的路径来检测退化情况、控制成本，并在模型行为异常时安全降级。

8) Quality gate + finalize

8) 质量校验+最终定稿

Inputs: Full draft pack.
Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Tighten unclear contracts, add missing tests, and always include Risks / Open questions / Next steps.
Outputs: Final LLM Build Pack.
Checks: A team can execute the plan without a meeting; unknowns are explicit and owned.

输入： 完整的Build Pack草稿。
行动： 运行references/CHECKLISTS.md并使用references/RUBRIC.md评分。优化模糊的契约，补充缺失的测试，并且必须包含风险/待解决问题/下一步行动。
输出： 最终版LLM Build Pack。
校验： 团队无需开会即可执行该计划；未知事项明确且有负责人。

Quality gate (required)

质量校验（必填）

Use references/CHECKLISTS.md and references/RUBRIC.md.
Always include: Risks, Open questions, Next steps.

使用references/CHECKLISTS.md和references/RUBRIC.md。
必须包含：风险、待解决问题、下一步行动。

Examples

示例

Example 1 (RAG copilot): “Use

building-with-llms

to plan a support-response copilot that drafts replies using our internal KB. Constraints: no PII leakage; must cite sources; p95 latency < 3s; cost < $0.10/ticket.”
Expected: LLM Build Pack with prompt/tool contract, eval set (including privacy red-team cases), and monitoring/rollback plan.

Example 2 (tool-using workflow): “Use

building-with-llms

to design an LLM workflow that turns meeting notes into action items and Jira tickets (human review required). Output must be valid JSON.”
Expected: output schema + tool contract + eval plan for structured extraction + guardrails against over-creation.

Boundary example: “Fine-tune/train a new LLM from scratch.”
Response: out of scope; propose an API-model approach and highlight what ML/infra work is required if training is truly needed.

示例1（RAG辅助助手）： “使用

building-with-llms

来规划一个支持回复助手，该助手利用我们的内部知识库起草回复。约束条件：不得泄露PII；必须引用来源；p95延迟<3秒；单工单成本<$0.10。”
预期产出：包含提示词/工具契约、测试集（含隐私红队测试案例）以及监控/回滚方案的LLM Build Pack。

示例2（工具调用工作流）： “使用

building-with-llms

设计一个LLM工作流，将会议纪要转化为行动项和Jira工单（需人工审核）。输出必须为合法JSON。”
预期产出：输出Schema + 工具契约 + 结构化提取评估方案 + 防止过度创建的防护规则。

边界示例： “从头开始微调/训练一个新的LLM。”
回复：超出本技能范围；建议采用API调用型模型方案，若确实需要训练，则说明所需的ML/基础设施工作内容。