building-with-llms

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Building with LLMs

基于LLM的应用开发

Scope

适用范围

Covers
  • Building and shipping LLM-powered features/apps (assistant, copilot, light agent workflows)
  • Prompt + tool contract design (instructions, schemas, examples, guardrails)
  • Data quality + evaluation (test sets, rubrics, red teaming, iteration loop)
  • Production readiness (latency/cost budgets, logging, fallbacks, safety/security checks)
  • Using coding agents (Codex/Claude Code) to accelerate engineering safely
When to use
  • “Turn this LLM feature idea into a build plan with prompts, evals, and launch checks.”
  • “We need a system prompt + tool definitions + output schema for our LLM workflow.”
  • “Our LLM is flaky—design an eval plan and iteration loop to stabilize quality.”
  • “Design a RAG/tool-using agent approach with safety and monitoring.”
  • “We want to use an AI coding agent to implement this—set constraints and review gates.”
When NOT to use
  • You need product/portfolio strategy and positioning (use
    ai-product-strategy
    ).
  • You need a full PRD/spec set for cross-functional alignment (use
    writing-prds
    /
    writing-specs-designs
    ).
  • You need primary user research (use
    conducting-user-interviews
    /
    usability-testing
    ).
  • You are doing model training/research, infra architecture, or bespoke model tuning (delegate to ML/eng; this skill assumes API models).
  • You only want “which model/provider should we pick?” (treat as an input; if it dominates, do a separate evaluation doc).
涵盖内容
  • 构建并交付基于LLM的功能/应用(助手、Copilot、轻量Agent工作流)
  • 提示词+工具契约设计(指令、Schema、示例、防护规则)
  • 数据质量与评估(测试集、评分标准、红队测试、迭代循环)
  • 生产就绪性(延迟/成本预算、日志记录、降级方案、安全/合规检查)
  • 安全使用代码Agent(Codex/Claude Code)加速工程开发
适用场景
  • “将这个LLM功能想法转化为包含提示词、评估方案和上线检查项的开发计划。”
  • “我们需要为LLM工作流设计系统提示词+工具定义+输出Schema。”
  • “我们的LLM应用不稳定——设计一套评估方案和迭代循环来稳定质量。”
  • “设计一套带有安全机制和监控的RAG/工具调用型Agent方案。”
  • “我们希望使用AI代码Agent来实现这个功能——设定约束条件和审核关卡。”
不适用场景
  • 你需要产品/组合战略与定位(请使用
    ai-product-strategy
    )。
  • 你需要用于跨职能对齐的完整PRD/规格文档(请使用
    writing-prds
    /
    writing-specs-designs
    )。
  • 你需要开展初级用户研究(请使用
    conducting-user-interviews
    /
    usability-testing
    )。
  • 你正在进行模型训练/研究、基础设施架构设计或定制模型调优(请委托给ML/工程团队;本技能基于API调用型模型)。
  • 你仅想了解“我们应该选择哪个模型/服务商?”(将其视为输入;若该问题占主导,请单独撰写评估文档)。

Inputs

输入要求

Minimum required
  • Use case + target user + what “good” looks like (success metrics + failure modes)
  • The LLM’s job: generate text, transform data, classify, extract, plan, or take actions via tools
  • Constraints: privacy/compliance, data sensitivity, latency, cost, reliability, supported regions
  • Integration surface: UI/workflow, downstream systems/APIs/tools, and any required output schema
Missing-info strategy
  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 options (prompting vs RAG vs tool use; autonomy level).
  • If asked to write code or run commands, request confirmation and use least privilege (no secrets; avoid destructive changes).
最低必填信息
  • 用例+目标用户+“合格”的定义(成功指标+失败场景)
  • LLM的具体任务:生成文本、转换数据、分类、提取信息、规划,或通过工具执行操作
  • 约束条件:隐私/合规要求、数据敏感度、延迟、成本、可靠性、支持区域
  • 集成场景:UI/工作流、下游系统/API/工具,以及任何要求的输出Schema
缺失信息处理策略
  • references/INTAKE.md中提出最多5个问题(每次3–5个)。
  • 若仍有信息缺失,基于明确假设推进,并提供2–3种可选方案(提示词工程 vs RAG vs 工具调用;自主权限等级)。
  • 若被要求编写代码或执行命令,需先请求确认,并遵循最小权限原则(不处理密钥;避免破坏性操作)。

Outputs (deliverables)

输出成果(交付物)

Produce an LLM Build Pack (in chat; or as files if requested), in this order:
  1. Feature brief (goal, users, non-goals, constraints, success + guardrails)
  2. System design sketch (pattern + architecture, context strategy, budgets, failure handling)
  3. Prompt + tool contract (system prompt, tool schemas, output schema, examples, refusal/guardrails)
  4. Data + evaluation plan (test set, rubrics, automated checks, red-team suite, acceptance thresholds)
  5. Build + iteration plan (prototype slice, instrumentation, debugging loop, how to use coding agents safely)
  6. Launch + monitoring plan (logging, dashboards/alerts, fallback/rollback, incident playbook hooks)
  7. Risks / Open questions / Next steps (always included)
Templates: references/TEMPLATES.md
生成一份LLM Build Pack(聊天形式;或按要求生成文件),顺序如下:
  1. 功能简介(目标、用户群体、非目标、约束条件、成功指标+防护规则)
  2. 系统设计草图(模式+架构、上下文策略、预算、故障处理)
  3. 提示词+工具契约(系统提示词、工具Schema、输出Schema、示例、拒绝规则/防护机制)
  4. 数据+评估方案(测试集、评分标准、自动化检查、红队测试套件、验收阈值)
  5. 开发+迭代计划(原型切片、监控埋点、调试循环、如何安全使用代码Agent)
  6. 上线+监控方案(日志记录、仪表盘/告警、降级/回滚方案、事件响应手册钩子)
  7. 风险/待解决问题/下一步行动(必须包含)
模板:references/TEMPLATES.md

Workflow (8 steps)

工作流程(8个步骤)

1) Frame the job, boundary, and “good”

1) 明确任务、边界与“合格”标准

  • Inputs: Use case, target user, constraints.
  • Actions: Write a crisp job statement (“The LLM must…”) + 3–5 non-goals. Define success metrics and guardrails (quality, safety, cost, latency).
  • Outputs: Draft Feature brief.
  • Checks: A stakeholder can restate what the LLM does and does not do, and how success is measured.
  • 输入: 用例、目标用户、约束条件。
  • 行动: 撰写清晰的任务说明(“LLM必须……”)+3–5条非目标。定义成功指标和防护规则(质量、安全、成本、延迟)。
  • 输出: 草稿版功能简介
  • 校验: 利益相关者能够重述LLM的能做与不能做,以及如何衡量成功。

2) Choose the minimum viable autonomy pattern

2) 选择最小可行自主权限模式

  • Inputs: Workflow + risk tolerance.
  • Actions: Decide assistant vs copilot vs agent-like tool use. Identify “human control points” (review/approve moments) and what the model is never allowed to do.
  • Outputs: Autonomy decisions captured in Feature brief.
  • Checks: Any action-taking behavior has explicit permissions, confirmations, and an undo/rollback story.
  • 输入: 工作流+风险容忍度。
  • 行动: 决定采用助手、Copilot还是类Agent的工具调用模式。确定“人工控制点”(审核/批准环节)以及模型绝对禁止执行的操作。
  • 输出: 自主权限决策记录在功能简介中。
  • 校验: 任何执行操作的行为都有明确的权限、确认机制,以及撤销/回滚方案。

3) Design the context strategy (prompting → RAG → tools)

3) 设计上下文策略(提示词工程 → RAG → 工具调用)

  • Inputs: Data sources, integration points, constraints.
  • Actions: Decide how the model gets reliable context: instruction hierarchy, retrieval strategy, tool calls, structured inputs. Define the “source of truth” and how conflicts are handled.
  • Outputs: Draft System design sketch.
  • Checks: You can explain (a) what data is used, (b) where it comes from, (c) how freshness/authority is enforced.
  • 输入: 数据源、集成点、约束条件。
  • 行动: 确定模型获取可靠上下文的方式:指令层级、检索策略、工具调用、结构化输入。定义“事实来源”以及冲突处理规则。
  • 输出: 草稿版系统设计草图
  • 校验: 能够解释(a)使用哪些数据,(b)数据来源,(c)如何保证数据的新鲜度/权威性。

4) Draft the prompt + tool contract (make the system legible)

4) 起草提示词+工具契约(提升系统可解释性)

  • Inputs: Job statement + context strategy + output schema needs.
  • Actions: Write the system prompt, tool descriptions, and output schema. Add examples and explicit DO/DO NOT rules. Include safe failure behavior (ask clarifying questions, abstain, cite sources).
  • Outputs: Prompt + tool contract.
  • Checks: A reviewer can predict behavior for 5–10 representative inputs; contract includes at least 3 hard constraints and examples.
  • 输入: 任务说明+上下文策略+输出Schema需求。
  • 行动: 编写系统提示词、工具描述和输出Schema。添加示例和明确的允许/禁止规则。包含安全故障处理逻辑(询问澄清问题、拒绝执行、引用来源)。
  • 输出: 提示词+工具契约
  • 校验: 审核人员能够预测5–10个典型输入对应的模型行为;契约包含至少3条硬性约束和示例。

5) Build the eval set + rubric (debug like software)

5) 构建测试集+评分标准(像调试软件一样调试LLM)

  • Inputs: Expected behaviors + failure modes + edge cases.
  • Actions: Create a test set covering normal cases, tricky cases, and red-team cases. Define a scoring rubric and acceptance thresholds. Add automated checks where possible (schema validity, citation presence, forbidden content).
  • Outputs: Data + evaluation plan.
  • Checks: You can run the same prompts repeatedly and measure improvement/regression; evals cover the top failure modes.
  • 输入: 预期行为+失败场景+边缘案例。
  • 行动: 创建覆盖常规场景、复杂场景和红队测试场景的测试集。定义评分标准和验收阈值。尽可能添加自动化检查(Schema有效性、引用存在性、禁止内容检测)。
  • 输出: 数据+评估方案
  • 校验: 能够重复运行相同提示词并衡量改进/退化情况;评估覆盖主要失败场景。

6) Prototype a thin slice, using coding agents safely

6) 构建最小可行原型,安全使用代码Agent

  • Inputs: System sketch + prompt contract + eval plan.
  • Actions: Implement the smallest end-to-end slice. Use coding agents for “lower hanging fruit” tasks, but keep tight constraints: small diffs, tests, code review, no secret handling.
  • Outputs: Build + iteration plan (and optionally a prototype plan/checklist).
  • Checks: You can explain what the agent changed, why, and how it was validated (tests, evals, manual review).
  • 输入: 系统设计草图+提示词契约+评估方案。
  • 行动: 实现最小端到端切片。使用代码Agent完成“低难度”任务,但需严格约束:小范围代码变更、测试、代码审核、不处理密钥。
  • 输出: 开发+迭代计划(可选包含原型计划/检查清单)。
  • 校验: 能够解释Agent做了哪些变更、原因,以及如何验证(测试、评估、人工审核)。

7) Production readiness: budgets, monitoring, and failure handling

7) 生产就绪:预算、监控与故障处理

  • Inputs: Prototype learnings + constraints.
  • Actions: Define cost/latency budgets, fallbacks, rate limits, logging fields, and alert thresholds. Address prompt injection/tool misuse risks; add safeguards and review processes.
  • Outputs: Launch + monitoring plan.
  • Checks: There is a clear path to detect regressions, cap cost, and safely degrade when the model misbehaves.
  • 输入: 原型学习成果+约束条件。
  • 行动: 定义成本/延迟预算、降级方案、速率限制、日志字段和告警阈值。解决提示词注入/工具滥用风险;添加防护措施和审核流程。
  • 输出: 上线+监控方案
  • 校验: 存在清晰的路径来检测退化情况、控制成本,并在模型行为异常时安全降级。

8) Quality gate + finalize

8) 质量校验+最终定稿

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Tighten unclear contracts, add missing tests, and always include Risks / Open questions / Next steps.
  • Outputs: Final LLM Build Pack.
  • Checks: A team can execute the plan without a meeting; unknowns are explicit and owned.
  • 输入: 完整的Build Pack草稿。
  • 行动: 运行references/CHECKLISTS.md并使用references/RUBRIC.md评分。优化模糊的契约,补充缺失的测试,并且必须包含风险/待解决问题/下一步行动
  • 输出: 最终版LLM Build Pack
  • 校验: 团队无需开会即可执行该计划;未知事项明确且有负责人。

Quality gate (required)

质量校验(必填)

  • Use references/CHECKLISTS.md and references/RUBRIC.md.
  • Always include: Risks, Open questions, Next steps.
  • 使用references/CHECKLISTS.mdreferences/RUBRIC.md
  • 必须包含:风险待解决问题下一步行动

Examples

示例

Example 1 (RAG copilot): “Use
building-with-llms
to plan a support-response copilot that drafts replies using our internal KB. Constraints: no PII leakage; must cite sources; p95 latency < 3s; cost < $0.10/ticket.”
Expected: LLM Build Pack with prompt/tool contract, eval set (including privacy red-team cases), and monitoring/rollback plan.
Example 2 (tool-using workflow): “Use
building-with-llms
to design an LLM workflow that turns meeting notes into action items and Jira tickets (human review required). Output must be valid JSON.”
Expected: output schema + tool contract + eval plan for structured extraction + guardrails against over-creation.
Boundary example: “Fine-tune/train a new LLM from scratch.”
Response: out of scope; propose an API-model approach and highlight what ML/infra work is required if training is truly needed.
示例1(RAG辅助助手): “使用
building-with-llms
来规划一个支持回复助手,该助手利用我们的内部知识库起草回复。约束条件:不得泄露PII;必须引用来源;p95延迟<3秒;单工单成本<$0.10。”
预期产出:包含提示词/工具契约、测试集(含隐私红队测试案例)以及监控/回滚方案的LLM Build Pack。
示例2(工具调用工作流): “使用
building-with-llms
设计一个LLM工作流,将会议纪要转化为行动项和Jira工单(需人工审核)。输出必须为合法JSON。”
预期产出:输出Schema + 工具契约 + 结构化提取评估方案 + 防止过度创建的防护规则。
边界示例: “从头开始微调/训练一个新的LLM。”
回复:超出本技能范围;建议采用API调用型模型方案,若确实需要训练,则说明所需的ML/基础设施工作内容。