prompt-engineer

Prompt Engineer

Comprehensive prompt and context engineering. Every recommendation grounded in research.

Canonical Vocabulary

Use these terms exactly throughout all modes:

| Term | Definition |
|---|---|
| system prompt | The top-level instruction block sent before user messages; sets model behavior |
| context window | The full token budget: system prompt + conversation history + tool results + retrieved docs |
| context engineering | Designing the entire context window, not just the prompt text — write, select, compress, isolate |
| template | A reusable prompt structure with variable slots (`{{input}}`, `$ARGUMENTS`) |
| rubric | A scoring framework with dimensions, levels (1-5), and concrete examples per level |
| few-shot example | An input/output pair included in the prompt to demonstrate desired behavior |
| chain-of-thought (CoT) | Explicit step-by-step reasoning scaffolding; beneficial for instruction-following models, harmful for reasoning models |
| model class | Either "instruction-following" or "reasoning" — determines which techniques apply |
| injection | Untrusted input that manipulates model behavior outside intended boundaries |
| anti-pattern | A prompt construction that reliably degrades output quality |
| over-specification | Adding constraints beyond the S*~0.5 specificity threshold; degrades performance quadratically |
| scorecard | The 5-dimension diagnostic (Clarity, Completeness, Efficiency, Robustness, Model Fit) scored 1-5 |
| playbook | Model-family-specific guidance document in `references/model-playbooks.md` |
| prefix caching | Cost optimization by placing static content early so API providers cache the prefix |

Dispatch

| $ARGUMENTS | Action |
|---|---|
| `craft <description>` | Mode A: Craft a new prompt from scratch |
| `analyze <prompt or path>` | Mode B: Analyze and improve an existing prompt |
| `audit <prompt or path>` | Mode B: Analyze, report only (no changes) |
| `convert <source-model> <target-model> <prompt or path>` | Mode C: Convert between model families |
| `evaluate <prompt or path>` | Mode D: Build evaluation framework |
| Raw prompt text (XML tags, role definitions, multi-section structure) | Auto-detect: Mode B (Analyze, report only) |
| Natural-language request describing desired behavior | Auto-detect: Mode A (Craft) |
| Empty | Show mode menu with examples |

Auto-Detection Heuristic

If no explicit mode keyword is provided:
  1. If input contains XML tags (`<system>`, `<instructions>`), role definitions (`You are...`, `Act as...`), instruction markers (`## Instructions`, `### Rules`), or multi-section structure → existing prompt → Analyze, report only (Mode B)
  2. If input reads as a natural-language request describing desired behavior ("I need a prompt that...", "Create a system prompt for...") → new prompt request → Craft (Mode A)
  3. If ambiguous → ask the user which mode they want
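The heuristic above can be sketched as a small dispatcher. This is an illustrative sketch only: the marker patterns and return labels are assumptions, not an exhaustive classifier.

```python
import re

# Illustrative markers for "this input is already a prompt"; not exhaustive.
PROMPT_MARKERS = [
    r"</?(system|instructions)>",       # XML tags
    r"\b(You are|Act as)\b",            # role definitions
    r"^#{2,3}\s+(Instructions|Rules)",  # instruction headers
]

def detect_mode(argument: str) -> str:
    """Return 'analyze' for existing prompts, 'craft' for requests,
    'menu' for empty input, 'ask' when ambiguous."""
    if not argument.strip():
        return "menu"
    if any(re.search(p, argument, re.MULTILINE | re.IGNORECASE)
           for p in PROMPT_MARKERS):
        return "analyze"  # existing prompt -> Mode B, report only
    if re.search(r"\b(I need a prompt|Create a system prompt|prompt (that|for))\b",
                 argument, re.IGNORECASE):
        return "craft"    # natural-language request -> Mode A
    return "ask"          # ambiguous -> ask the user
```

In practice the ambiguous branch should hand control back to the user rather than guess.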

Example Invocations

/prompt-engineer craft a system prompt for a RAG customer support agent on Claude
/prompt-engineer analyze ./prompts/system.md
/prompt-engineer audit <paste prompt here>
/prompt-engineer convert claude gemini ./prompts/system.md
/prompt-engineer evaluate ./prompts/agent-system.md

Empty Arguments

When `$ARGUMENTS` is empty, present the mode menu:

| Mode | Command | Purpose |
|---|---|---|
| Craft | `craft <description>` | Build a new prompt from scratch |
| Analyze | `analyze <prompt>` | Diagnose and improve an existing prompt |
| Analyze (report only) | `audit <prompt>` | Read-only review for anti-patterns and security |
| Convert | `convert <src> <tgt> <prompt>` | Port between model families |
| Evaluate | `evaluate <prompt>` | Build test suite and evaluation rubric |

Paste a prompt, describe what you need, or pick a mode above.

Core Principles

Non-negotiable constraints governing all modes. Violations are bugs.
Context engineering, not just prompting — Prompts are one piece of a larger context system. Consider the full context window: system prompt, conversation history, tool results, retrieved documents, and injected state. Most production failures are context failures, not prompt failures. Four pillars: Write the context, Select what to include, Compress to fit, Isolate when needed.
Model-class awareness — Instruction-following models (GPT-4o, Claude 3.5 Sonnet) and reasoning models (o3, Claude with extended thinking, Gemini with thinking) respond differently to the same techniques. Techniques that help instruction-followers can actively hurt reasoning models (Prompting Inversion). Always detect model class first.
Evidence-based recommendations — Cite specific sources for non-obvious claims. Do not present anecdotal patterns as established best practice. Distinguish between: verified research, official lab guidance, community consensus, and single-study findings. Read `references/model-playbooks.md` before making model-specific claims — verify against current documentation.
Empirical iteration — Prompts are hypotheses, not solutions. Every prompt needs testing against edge cases. The first draft is never the final version. Recommend eval frameworks for any non-trivial prompt.
Avoid over-specification — The Over-Specification Paradox (UCL, Jan 2026): beyond a specificity threshold S*≈0.5, additional detail degrades performance quadratically. This is a single-study finding (arXiv:2601.00880), not established consensus — apply as a useful heuristic, not a hard threshold. Less is more once intent is clear. Measure specificity: if the prompt is >50% constraint language vs. task language, consider trimming.
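The ">50% constraint language" check above can be approximated mechanically. A minimal sketch, assuming a hand-picked keyword list as a stand-in for a real constraint-vs-task classifier; both the word list and the word-level granularity are illustrative assumptions.

```python
# Hypothetical marker words that read as constraint language; illustrative only.
CONSTRAINT_MARKERS = {"must", "never", "always", "only", "exactly",
                      "not", "don't", "avoid", "required", "forbidden"}

def constraint_ratio(prompt: str) -> float:
    """Fraction of words that look like constraint language rather than task language."""
    words = prompt.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,:;!") in CONSTRAINT_MARKERS)
    return hits / len(words)

def should_trim(prompt: str, threshold: float = 0.5) -> bool:
    # Heuristic flag, not a hard rule, per the single-study caveat above.
    return constraint_ratio(prompt) > threshold
```

A prompt that trips the flag is a candidate for trimming, not automatically broken.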

Model-Class Detection

Mandatory first step for all modes. Determine the target model class before any analysis or generation. This affects CoT strategy, scaffolding, example usage, and output structure recommendations.

Classification

Heuristic: If the model has a native reasoning/thinking mode → Reasoning. Otherwise → Instruction-following. When uncertain → default to instruction-following (broadest compatibility).
Reasoning: Claude 4.x (extended thinking), GPT-5.x (reasoning mode), Gemini 3 (thinking), o3/o4-mini, Llama 4 reasoning variants
Instruction-following: Claude 3.5 Sonnet/Haiku, GPT-4o/4.1, Gemini 2 Flash, Llama 4 standard
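The classification heuristic can be sketched as a substring lookup over the families listed above. The hint table is an assumption that mirrors this list and will need updating as model lineups change.

```python
# Substrings that indicate a native reasoning/thinking mode; mirrors the
# families named above, illustrative rather than authoritative.
REASONING_HINTS = ("o3", "o4-mini", "claude-4", "gpt-5", "gemini-3")

def model_class(model_name: str) -> str:
    """Classify a model name as 'reasoning' or 'instruction-following'."""
    name = model_name.lower()
    if any(hint in name for hint in REASONING_HINTS):
        return "reasoning"
    # Unknown or no native thinking mode: default to the
    # broadest-compatibility class, as the heuristic above prescribes.
    return "instruction-following"
```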

Model-Class Behavioral Differences

| Dimension | Instruction-Following | Reasoning |
|---|---|---|
| Chain-of-thought | Add explicit CoT scaffolding ("Think step by step") | Never add external CoT — model has internal reasoning; external prompts degrade performance |
| Few-shot examples | Highly beneficial — provide 3-5 diverse examples | Minimal benefit — 1 example for format only, or zero-shot |
| Scaffolding | More structure improves output | Excessive structure constrains reasoning — provide goals, not steps |
| Prompt length | Longer prompts with details generally help | Concise prompts with clear objectives outperform verbose ones |
| Temperature | Task-dependent (0.0-1.0) | Often fixed internally; external temp has less effect |
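The chain-of-thought row can be made concrete with the same task phrased for each class. The wording below is an illustrative example, not a fixed recipe.

```python
# One task, two phrasings: CoT scaffolding for instruction-followers,
# goal-only phrasing for reasoning models. Example wording only.
TASK = "Classify this support ticket as billing, technical, or other."

INSTRUCTION_FOLLOWING_PROMPT = (
    TASK + "\nThink step by step: first identify key phrases, then map "
    "them to a category, then state your final answer."
)

# Reasoning models get the goal and the output contract, nothing more.
REASONING_PROMPT = TASK + "\nRespond with only the category name."
```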

Mode A: Craft

Build a new prompt from scratch. For when the user has no existing prompt.

Steps

  1. Requirements gathering — Ask targeted questions:
    • What is the task? (Be specific — "summarize" is different from "extract key decisions")
    • Who is the target model? (Detect model class)
    • What is the deployment context? (Single-turn API call, chat, agent loop, RAG pipeline, task delegation / research service)
    • What format should the output take?
    • What are the failure modes to prevent?
    • Who provides the input? (Trusted internal vs. untrusted external — determines security needs)
  2. Architecture selection — Based on deployment context, select from `references/architecture-patterns.md`:
    • Single-turn: 4-block pattern (Context/Task/Constraints/Output)
    • Multi-turn: Conversation-aware with state management
    • Agent: ReAct with 3-instruction pattern (persistence + tool-calling + planning)
    • RAG: Grounding instructions with citation patterns
    • Multi-agent: Orchestration with role isolation
    • Task delegation: 4-block pattern with emphasis on scope definition and output structure
  3. Draft prompt — Write the prompt using the selected architecture. Apply model-class-specific guidance from `references/model-playbooks.md`. Use XML tags as the default structuring format (cross-model compatible). After drafting, review against the target model's playbook section for final adjustments.
  4. Structure for cacheability — Arrange content for prompt caching efficiency:
    • Static content (system instructions, role definitions, tool descriptions) → early in the prompt
    • Dynamic content (user input, retrieved documents, conversation history) → late in the prompt
    • This ordering enables 50-90% cost reduction via prefix caching (Anthropic explicit breakpoints, OpenAI automatic prefix matching)
  5. Harden — Run through `references/hardening-checklist.md`:
    • If input source is untrusted → apply injection resistance patterns
    • If output is user-facing → add safety constraints
    • If tool-calling → apply permission minimization
    • Add edge case handling for expected failure modes
  6. Present — Format per `references/output-formats.md` Craft template. Recommend Mode D (Evaluate) to build a test suite.
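For the single-turn case, steps 2-4 above can be sketched together: a hypothetical 4-block template in XML tags, ordered so static blocks come first and dynamic input last for prefix caching. Tag names and wording are illustrative assumptions, not prescribed by the reference files.

```python
# Static 4-block prefix (Context/Task/Constraints/Output). Placing this
# first lets API providers cache the prefix across calls.
STATIC_PREFIX = """\
<context>
You are a customer support agent for an internal ticketing system.
</context>

<task>
Answer the user's question using only the retrieved documents.
</task>

<constraints>
If the documents do not contain the answer, say so; do not speculate.
</constraints>

<output_format>
A short answer followed by bracketed citations, e.g. [doc-2].
</output_format>
"""

def build_prompt(retrieved_docs: str, user_question: str) -> str:
    # Dynamic content goes last so the static prefix stays byte-identical.
    return (STATIC_PREFIX
            + f"\n<documents>\n{retrieved_docs}\n</documents>"
            + f"\n<question>\n{user_question}\n</question>")
```

The same ordering principle applies whether the provider uses explicit cache breakpoints or automatic prefix matching.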

Mode B: Analyze

Diagnose an existing prompt and optionally improve it. Dispatched as `analyze` (with fixes) or `audit` (report only, no changes).

Steps

  1. Ingest — Read the prompt from `$ARGUMENTS` text or file path. If a file path is provided, read the file.
  2. Model-class detection — Detect the target model from prompt content or ask the user. Run Model-Class Detection above. Flag any model-class mismatches (e.g., CoT scaffolding sent to a reasoning model).
  3. Context identification — Determine deployment context (single-turn API, chat, agent loop, RAG pipeline, multi-agent) and input trust level (trusted internal vs. untrusted external).
  4. Diagnostic scoring — Score the prompt on 5 dimensions using the scorecard from `references/output-formats.md`:

    | Dimension | Score (1-5) | Assessment |
    |---|---|---|
    | Clarity | | How unambiguous are the instructions? |
    | Completeness | | Are all necessary constraints and context provided? |
    | Efficiency | | Is every token earning its keep? (Over-specification check) |
    | Robustness | | How well does it handle edge cases and adversarial inputs? |
    | Model Fit | | Is it optimized for the target model class? |

    Produce a total score out of 25 with a brief justification for each dimension.
  5. Four-lens analysis — Examine the prompt through each lens:
    • Ambiguity lens: Identify instructions that could be interpreted multiple ways. Flag missing context that the model would need to guess. Check for conflicting instructions.
    • Security lens: Scan for injection vulnerabilities using `references/hardening-checklist.md`. Assess input trust boundaries. Check for information leakage risks.
    • Robustness lens: Identify edge cases not covered. Check for brittle patterns that break with unexpected input. Assess graceful degradation.
    • Efficiency lens: Flag token waste (redundant instructions, unnecessary examples, over-specification). Assess cacheability. Check for the Over-Specification Paradox.
  6. Anti-pattern scan — Check against every pattern in `references/anti-patterns.md`. For each detected anti-pattern, report: pattern name, severity, location in the prompt, and remediation guidance.
  7. Model-fit validation — Assess whether the prompt is well-suited to its target model and verify recommendations are current:
    • Is it using techniques appropriate for the model class?
    • Are there model-specific features it should leverage but does not?
    • Are there anti-patterns specific to this model? (e.g., prefilled responses on Claude 4.x)
    • Read `references/model-playbooks.md` for the target model and note the "last verified" date
    • If any recommendation is older than 3 months, flag it: "Verify this against current [model] documentation before deploying"
Report-only mode (`audit`): Present findings per `references/output-formats.md` audit template. Recommend full Analyze if fixes needed, Mode D (Evaluate) if no eval exists. Stop here.
Full mode (`analyze`): Continue with steps 8-9.
  8. Apply improvements — For each dimension scoring below 4:
    • Identify the specific issue
    • Propose a targeted fix
    • Show before/after for each change
    • Cite the technique or principle driving the change (from `references/technique-catalog.md` or `references/anti-patterns.md`)
  9. Present — Format per `references/output-formats.md` Analyze template. Recommend Mode D (Evaluate) if no eval exists.
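The 5-dimension scorecard used in step 4 and the "below 4" trigger in the improvement step can be sketched as a small data structure. A minimal sketch; the real templates live in `references/output-formats.md`.

```python
from dataclasses import dataclass

DIMENSIONS = ("Clarity", "Completeness", "Efficiency", "Robustness", "Model Fit")

@dataclass
class Scorecard:
    scores: dict  # dimension name -> score in 1..5

    def total(self) -> int:
        """Total out of 25, validating dimensions and score range."""
        assert set(self.scores) == set(DIMENSIONS)
        assert all(1 <= s <= 5 for s in self.scores.values())
        return sum(self.scores.values())

    def needs_improvement(self) -> list:
        # Full-mode Analyze applies fixes to every dimension scoring below 4.
        return [d for d, s in self.scores.items() if s < 4]
```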

Mode C: Convert

Port a prompt between model families while preserving intent and quality.

Steps

  1. Ingest — Read the prompt from `$ARGUMENTS` text or file path. If a file path is provided, read the file.
  2. Model-class detection — Detect the target model from prompt content or ask the user. Run Model-Class Detection above. Flag any model-class mismatches (e.g., CoT scaffolding sent to a reasoning model).
  3. Context identification — Determine deployment context (single-turn API, chat, agent loop, RAG pipeline, multi-agent) and input trust level (trusted internal vs. untrusted external).
  4. Load playbooks — Read the source and target model playbook sections from `references/model-playbooks.md`. Note key differences:
    • Structural format preferences (XML vs. markdown vs. JSON)
    • System prompt conventions
    • Feature availability (prefill, caching, thinking modes)
    • Known behavioral differences
  5. Build conversion plan — Create a conversion checklist:
    • Features that map directly (rename/restructure)
    • Features that require adaptation (different mechanism, same intent)
    • Features that have no equivalent (must be removed or simulated)
    • New features to leverage (target model has capabilities source lacks)
  6. Execute conversion — Apply the plan. For each change:
    • Show the source pattern
    • Show the target pattern
    • Explain why the change is needed
  7. Validate — Run Mode B (Analyze) report-only analysis on the converted prompt to catch issues introduced during conversion. Present per `references/output-formats.md` Convert template. Recommend Mode D (Evaluate) using same test cases on both models.
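The step-5 checklist can be sketched as set arithmetic over feature inventories. This simplification collapses "maps directly" and "requires adaptation" into one bucket (telling them apart needs per-feature judgment); the feature names in the test are hypothetical examples for a claude-to-gemini port.

```python
def build_conversion_plan(source_features: set, target_features: set) -> dict:
    """Bucket source-prompt features for a model-family port."""
    return {
        # Present in both: rename/restructure, or adapt the mechanism.
        "map_or_adapt": source_features & target_features,
        # No equivalent on the target: must be removed or simulated.
        "no_equivalent": source_features - target_features,
        # Target-only capabilities worth leveraging in the new prompt.
        "new_to_leverage": target_features - source_features,
    }
```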

Mode D: Evaluate

Build an evaluation framework for a prompt. Does not run the evaluations — produces the eval design.

Steps

  1. Ingest — Read the prompt from `$ARGUMENTS` text or file path. If a file path is provided, read the file.
  2. Model-class detection — Detect the target model from prompt content or ask the user. Run Model-Class Detection above. Flag any model-class mismatches (e.g., CoT scaffolding sent to a reasoning model).
  3. Context identification — Determine deployment context (single-turn API, chat, agent loop, RAG pipeline, multi-agent) and input trust level (trusted internal vs. untrusted external).
  4. Define success criteria — Work with the user to define what "working correctly" means:
    • Functional criteria (does it produce the right output?)
    • Quality criteria (is the output good enough?)
    • Safety criteria (does it avoid harmful outputs?)
    • Edge case criteria (does it handle unusual inputs?)
  5. Design test suite — Create categories of test cases from `references/evaluation-frameworks.md`:
    • Golden set: 5-10 representative inputs with expected outputs
    • Edge cases: Boundary conditions, empty inputs, extremely long inputs
    • Adversarial: Injection attempts, out-of-scope requests, ambiguous inputs
    • Regression: Cases that previously failed (if optimizing an existing prompt)
  6. Generate test cases — For each category, produce concrete test cases:
    • Input (the exact text to send)
    • Expected behavior (what the model should do)
    • Failure indicators (what would indicate the prompt is broken)
  7. Build rubric — Create a scoring rubric per `references/output-formats.md`:
    • Dimensions with clear definitions
    • Score levels (1-5) with concrete examples for each level
    • LLM-as-judge prompt for automated evaluation (if applicable)
    • Human evaluation protocol for subjective dimensions
  8. Present — Format per `references/output-formats.md` Evaluate template. Include recommended eval tools from `references/evaluation-frameworks.md` and CI/CD integration pattern.
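A test case from step 6 can be sketched as a record with the three required fields, plus a minimal check. The sample case, substring matching, and failure-indicator strings are illustrative assumptions; a real suite would use the rubric or an LLM-as-judge for quality dimensions.

```python
# Golden-set test case mirroring step 6: input, expected behavior,
# failure indicators. Contents are illustrative.
CASES = [
    {
        "category": "golden",
        "input": "Summarize: the meeting approved the Q3 budget.",
        "expected_substring": "Q3 budget",
        "failure_indicators": ["I cannot", "as an AI"],
    },
]

def check_case(case: dict, model_output: str) -> bool:
    """Fail fast on any failure indicator, then check expected behavior."""
    if any(ind in model_output for ind in case["failure_indicators"]):
        return False
    return case["expected_substring"] in model_output
```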

Reference File Index

| File | Content | Read When |
|---|---|---|
| `references/technique-catalog.md` | ~36 techniques across 8 categories with model-class compatibility | Selecting techniques for any mode |
| `references/model-playbooks.md` | Claude, GPT, Gemini, Llama guidance with caching strategies | Any model-specific recommendation |
| `references/anti-patterns.md` | 14 anti-patterns with severity, detection, and remediation | Analyzing or crafting any prompt |
| `references/architecture-patterns.md` | Agent, RAG, tool-calling, multi-agent design patterns | Crafting agent or system prompts |
| `references/context-management.md` | Compaction, caching, context rot, ACE framework | Designing long-context or multi-turn systems |
| `references/hardening-checklist.md` | Security and robustness checklist (29 items) | Hardening any prompt handling untrusted input |
| `references/evaluation-frameworks.md` | Eval approaches, PromptOps lifecycle, tool guidance | Building evaluation frameworks |
| `references/output-formats.md` | Templates for all skill outputs (scorecards, reports, diffs) | Formatting any skill output |

Read reference files as indicated by the "Read When" column above. Do not rely on memory or prior knowledge of their contents. Reference files are the source of truth. If a reference file does not exist, proceed without it but note the gap.

Critical Rules

  1. Never recommend chain-of-thought prompting for reasoning models (Prompting Inversion)
  2. Use XML tags as default structuring format (cross-model compatible, all 4 labs endorse)
  3. Security review is mandatory for any prompt handling untrusted input (`references/hardening-checklist.md`)
  4. Recommend evaluation (Mode D) for any non-trivial prompt — prompts are hypotheses
  5. Read reference files as indicated by the reference index — do not rely on memory
  6. Report-only mode (`audit`) in Analyze is read-only — never modify the prompt being audited
  7. The 3-instruction pattern (persistence + tool-calling + planning) is the highest-impact single change for agent prompts — recommend it by default for agent architectures