prompt-lab
Prompt Lab
Replaces trial-and-error prompt engineering with a structured methodology: objective
definition, current prompt analysis, variant generation (instruction clarity, example
strategies, output format specification), evaluation rubric design, test case creation,
and failure mode identification.
Reference Files
| File | Contents | Load When |
|---|---|---|
| | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |
Prerequisites
- Clear objective: what should the prompt accomplish?
- Target model (GPT-4, Claude, open-source) — prompting techniques vary by model
- Current prompt (if improving) or task description (if creating)
Workflow
Phase 1: Define Objective
- Task specification — What should the LLM produce? Be specific: "Classify customer support tickets into 5 categories" not "Handle support tickets."
- Success criteria — How do you know the output is correct? Define measurable criteria before writing any prompt.
- Failure modes — What does a bad output look like? Missing information? Wrong format? Hallucinated content? Refusal to answer?
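One lightweight way to hold Phase 1 to the "measurable" bar is to write the objective down as data before drafting any prompt. A minimal sketch; the field names and the example task are illustrative, not a required format:

```python
from dataclasses import dataclass

@dataclass
class PromptObjective:
    """Phase 1 output: pin down the task before writing any prompt."""
    task: str                    # specific, not vague
    success_criteria: list[str]  # each one measurable
    failure_modes: list[str]     # what a bad output looks like

spec = PromptObjective(
    task="Classify customer support tickets into 5 categories",
    success_criteria=[
        "Output is exactly one of the 5 category labels",
        "At least 90% agreement with human labels on the test set",
    ],
    failure_modes=[
        "invents a sixth category",
        "returns prose instead of a bare label",
    ],
)
```

A vague task like "Handle support tickets" would have nothing measurable to put in `success_criteria`, which is exactly the signal to stop and sharpen the objective first.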
Phase 2: Analyze Current Prompt
If an existing prompt is provided:
- Structure assessment — Is the instruction clear? Are examples provided? Is the output format specified?
- Ambiguity detection — Where could the model misinterpret the instruction?
- Missing components — What's not specified that should be? (output format, tone, length constraints, edge case handling)
- Failure mode mapping — Which known failure patterns (see references/failure-modes.md) apply to this prompt?
Phase 3: Generate Variants
Create 2-4 prompt variants, each testing a different hypothesis:
| Variant Type | Hypothesis | When to Use |
|---|---|---|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |
For each variant:
- State the hypothesis (why this variant might work)
- Identify the risk (what could go wrong)
- Provide the complete prompt text
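The per-variant checklist above (hypothesis, risk, complete prompt text) can be kept honest by representing each variant as a record. A minimal sketch with hypothetical prompt text for a ticket-classification task:

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str        # e.g. "A: direct instruction"
    hypothesis: str  # why this variant might work
    risk: str        # what could go wrong
    prompt: str      # the complete prompt text

variants = [
    Variant(
        name="A: direct instruction",
        hypothesis="A clear instruction is sufficient for a capable model",
        risk="Edge cases stay ambiguous with no examples to anchor them",
        prompt="Classify the ticket into one of: billing, bug, feature, account, other.",
    ),
    Variant(
        name="B: few-shot",
        hypothesis="Examples improve label consistency",
        risk="Model copies surface features of the examples",
        prompt=(
            "Classify the ticket into one of: billing, bug, feature, account, other.\n"
            "Ticket: 'I was charged twice' -> billing\n"
            "Ticket: 'App crashes on login' -> bug\n"
        ),
    ),
]

# Refuse to test a variant that doesn't state both its hypothesis and its risk.
for v in variants:
    assert v.hypothesis and v.risk, f"{v.name} is missing a hypothesis or risk"
```

Note that B changes exactly one thing relative to A (added examples), which keeps the later comparison interpretable.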
Phase 4: Design Evaluation
- Rubric — Define weighted criteria:

| Criterion | What It Measures | Typical Weight |
|---|---|---|
| Correctness | Output matches expected answer | 30-50% |
| Format compliance | Follows specified structure | 15-25% |
| Completeness | All required elements present | 15-25% |
| Conciseness | No unnecessary content | 5-15% |
| Tone/style | Matches requested voice | 5-10% |

- Test cases — Minimum 5 cases covering:
  - Happy path (standard input)
  - Edge cases (unusual but valid input)
  - Adversarial cases (inputs designed to confuse)
  - Boundary cases (minimum/maximum input)
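The rubric combines naturally into a single weighted score per test case. A minimal sketch; the concrete weights are one choice within the typical ranges above, and per-criterion scores are assumed to use a 0-3 scale:

```python
# Weights chosen from the typical ranges in the rubric table; they sum to 1.0.
RUBRIC = {
    "correctness": 0.40,
    "format_compliance": 0.20,
    "completeness": 0.20,
    "conciseness": 0.10,
    "tone_style": 0.10,
}

def weighted_score(scores: dict[str, float], rubric: dict[str, float] = RUBRIC) -> float:
    """Combine per-criterion scores (0-3 scale) into one weighted 0-3 score."""
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    return sum(rubric[criterion] * scores[criterion] for criterion in rubric)

# One variant's scores on a single test case:
case_scores = {
    "correctness": 3,
    "format_compliance": 2,
    "completeness": 3,
    "conciseness": 2,
    "tone_style": 3,
}
print(round(weighted_score(case_scores), 2))  # prints 2.7
```

Averaging this score over all test cases (not just the happy-path ones) gives the per-variant number used to pick a winner in Phase 5.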
Phase 5: Output
Present variants, rubric, and test cases in a structured format ready for execution.
Output Format
Prompt Lab: {Task Name}

Objective

{What the prompt should achieve — specific and measurable}

Success Criteria

- {Criterion 1 — measurable}
- {Criterion 2 — measurable}

Current Prompt Analysis

{If existing prompt provided}
- Strengths: {what works}
- Weaknesses: {what fails or is ambiguous}
- Missing: {what's not specified}

Variants

Variant A: {Strategy Name}

```text
{Complete prompt text}
```

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

Variant B: {Strategy Name}

```text
{Complete prompt text}
```

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

Variant C: {Strategy Name}

```text
{Complete prompt text}
```

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

Evaluation Rubric

| Criterion | Weight | Scoring |
|---|---|---|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |

Test Cases

| # | Input | Expected Output | Tests Criteria |
|---|---|---|---|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |

Failure Modes to Monitor

- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}

Recommended Next Steps

- Run all variants against the test suite
- Score using the rubric
- Select the highest-scoring variant
- Iterate on the winner with targeted improvements
Calibration Rules
- One variable per variant. Each variant should change ONE thing from the baseline. Changing instruction style AND examples AND format simultaneously makes results uninterpretable.
- Test before declaring success. A prompt that works on 3 examples may fail on the 4th. Minimum 5 diverse test cases before concluding a variant works.
- Failure modes are more valuable than successes. Understanding WHY a prompt fails guides improvement more than confirming it works.
- Model-specific optimization. A prompt optimized for GPT-4 may not work for Claude or Llama. Always note the target model.
- Simplest effective prompt wins. If a zero-shot prompt scores as well as a few-shot prompt, use the zero-shot. Fewer tokens = lower cost + latency.
Error Handling
| Problem | Resolution |
|---|---|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |
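For the strict-format row above, a validation layer in code is the safety net whether or not a structured output mode is available. A stdlib-only sketch; the expected keys belong to a hypothetical ticket-classifier schema:

```python
import json

# Hypothetical schema for a ticket classifier's output.
REQUIRED_KEYS = {"category", "confidence"}

def parse_strict(raw: str) -> dict:
    """Reject anything that isn't a JSON object with exactly the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"expected a JSON object with keys {sorted(REQUIRED_KEYS)}")
    return data

parse_strict('{"category": "billing", "confidence": 0.92}')  # passes
# parse_strict('Sure! The category is billing.')              # raises ValueError
```

Failed parses are themselves rubric data: a variant whose outputs frequently trip this check is failing the format-compliance criterion, regardless of how good its content is.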
When NOT to Use
Push back if:
- The task doesn't need an LLM (deterministic rules, regex, SQL) — use the right tool
- The user wants prompt execution, not design — this skill designs and evaluates, it doesn't run prompts
- The prompt is for safety-critical decisions without human review — LLM output should not be the sole input