prompt-lab
Prompt Lab
Replaces trial-and-error prompt engineering with a structured methodology: objective
definition, current prompt analysis, variant generation (instruction clarity, example
strategies, output format specification), evaluation rubric design, test case creation,
and failure mode identification.
Reference Files
| File | Contents | Load When |
|---|---|---|
| | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |
Prerequisites
- Clear objective: what should the prompt accomplish?
- Target model (GPT-4, Claude, open-source) — prompting techniques vary by model
- Current prompt (if improving) or task description (if creating)
Workflow
Phase 1: Define Objective
- Task specification — What should the LLM produce? Be specific: "Classify customer support tickets into 5 categories" not "Handle support tickets."
- Success criteria — How do you know the output is correct? Define measurable criteria before writing any prompt.
- Failure modes — What does a bad output look like? Missing information? Wrong format? Hallucinated content? Refusal to answer?
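One lightweight way to hold Phase 1 to the "measurable" bar is to write the objective down as data before drafting any prompt. A minimal sketch; the field names and the example task are illustrative, not a required format:

```python
from dataclasses import dataclass

@dataclass
class PromptObjective:
    """Phase 1 output: pin down the task before writing any prompt."""
    task: str                    # specific, not vague
    success_criteria: list[str]  # each one measurable
    failure_modes: list[str]     # what a bad output looks like

spec = PromptObjective(
    task="Classify customer support tickets into 5 categories",
    success_criteria=[
        "Output is exactly one of the 5 category labels",
        "At least 90% agreement with human labels on the test set",
    ],
    failure_modes=[
        "invents a sixth category",
        "returns prose instead of a bare label",
    ],
)
```

A vague task like "Handle support tickets" would have nothing measurable to put in `success_criteria`, which is exactly the signal to stop and sharpen the objective first.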
Phase 2: Analyze Current Prompt
If an existing prompt is provided:
- Structure assessment — Is the instruction clear? Are examples provided? Is the output format specified?
- Ambiguity detection — Where could the model misinterpret the instruction?
- Missing components — What's not specified that should be? (output format, tone, length constraints, edge case handling)
- Failure mode mapping — Which known failure patterns (see references/failure-modes.md) apply to this prompt?
Phase 3: Generate Variants
Create 2-4 prompt variants, each testing a different hypothesis:
| Variant Type | Hypothesis | When to Use |
|---|---|---|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |
For each variant:
- State the hypothesis (why this variant might work)
- Identify the risk (what could go wrong)
- Provide the complete prompt text
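The per-variant checklist above (hypothesis, risk, complete prompt text) can be kept honest by representing each variant as a record. A minimal sketch with hypothetical prompt text for a ticket-classification task:

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str        # e.g. "A: direct instruction"
    hypothesis: str  # why this variant might work
    risk: str        # what could go wrong
    prompt: str      # the complete prompt text

variants = [
    Variant(
        name="A: direct instruction",
        hypothesis="A clear instruction is sufficient for a capable model",
        risk="Edge cases stay ambiguous with no examples to anchor them",
        prompt="Classify the ticket into one of: billing, bug, feature, account, other.",
    ),
    Variant(
        name="B: few-shot",
        hypothesis="Examples improve label consistency",
        risk="Model copies surface features of the examples",
        prompt=(
            "Classify the ticket into one of: billing, bug, feature, account, other.\n"
            "Ticket: 'I was charged twice' -> billing\n"
            "Ticket: 'App crashes on login' -> bug\n"
        ),
    ),
]

# Refuse to test a variant that doesn't state both its hypothesis and its risk.
for v in variants:
    assert v.hypothesis and v.risk, f"{v.name} is missing a hypothesis or risk"
```

Note that B changes exactly one thing relative to A (added examples), which keeps the later comparison interpretable.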
Phase 4: Design Evaluation
- Rubric — Define weighted criteria:

| Criterion | What It Measures | Typical Weight |
|---|---|---|
| Correctness | Output matches expected answer | 30-50% |
| Format compliance | Follows specified structure | 15-25% |
| Completeness | All required elements present | 15-25% |
| Conciseness | No unnecessary content | 5-15% |
| Tone/style | Matches requested voice | 5-10% |

- Test cases — Minimum 5 cases covering:
  - Happy path (standard input)
  - Edge cases (unusual but valid input)
  - Adversarial cases (inputs designed to confuse)
  - Boundary cases (minimum/maximum input)
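The rubric combines naturally into a single weighted score per test case. A minimal sketch; the concrete weights are one choice within the typical ranges above, and per-criterion scores are assumed to use a 0-3 scale:

```python
# Weights chosen from the typical ranges in the rubric table; they sum to 1.0.
RUBRIC = {
    "correctness": 0.40,
    "format_compliance": 0.20,
    "completeness": 0.20,
    "conciseness": 0.10,
    "tone_style": 0.10,
}

def weighted_score(scores: dict[str, float], rubric: dict[str, float] = RUBRIC) -> float:
    """Combine per-criterion scores (0-3 scale) into one weighted 0-3 score."""
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    return sum(rubric[criterion] * scores[criterion] for criterion in rubric)

# One variant's scores on a single test case:
case_scores = {
    "correctness": 3,
    "format_compliance": 2,
    "completeness": 3,
    "conciseness": 2,
    "tone_style": 3,
}
print(round(weighted_score(case_scores), 2))  # prints 2.7
```

Averaging this score over all test cases (not just the happy-path ones) gives the per-variant number used to pick a winner in Phase 5.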
Phase 5: Output
Present variants, rubric, and test cases in a structured format ready for execution.
Output Format
Prompt Lab: {Task Name}

Objective

{What the prompt should achieve — specific and measurable}

Success Criteria

- {Criterion 1 — measurable}
- {Criterion 2 — measurable}

Current Prompt Analysis

{If existing prompt provided}
- Strengths: {what works}
- Weaknesses: {what fails or is ambiguous}
- Missing: {what's not specified}

Variants

Variant A: {Strategy Name}

```text
{Complete prompt text}
```

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

Variant B: {Strategy Name}

```text
{Complete prompt text}
```

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

Variant C: {Strategy Name}

```text
{Complete prompt text}
```

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

Evaluation Rubric

| Criterion | Weight | Scoring |
|---|---|---|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |

Test Cases

| # | Input | Expected Output | Tests Criteria |
|---|---|---|---|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |

Failure Modes to Monitor

- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}

Recommended Next Steps

- Run all variants against the test suite
- Score using the rubric
- Select the highest-scoring variant
- Iterate on the winner with targeted improvements
Calibration Rules
- One variable per variant. Each variant should change ONE thing from the baseline. Changing instruction style AND examples AND format simultaneously makes results uninterpretable.
- Test before declaring success. A prompt that works on 3 examples may fail on the 4th. Minimum 5 diverse test cases before concluding a variant works.
- Failure modes are more valuable than successes. Understanding WHY a prompt fails guides improvement more than confirming it works.
- Model-specific optimization. A prompt optimized for GPT-4 may not work for Claude or Llama. Always note the target model.
- Simplest effective prompt wins. If a zero-shot prompt scores as well as a few-shot prompt, use the zero-shot. Fewer tokens = lower cost + latency.
Error Handling
| Problem | Resolution |
|---|---|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |
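For the strict-format row above, a validation layer in code is the safety net whether or not a structured output mode is available. A stdlib-only sketch; the expected keys belong to a hypothetical ticket-classifier schema:

```python
import json

# Hypothetical schema for a ticket classifier's output.
REQUIRED_KEYS = {"category", "confidence"}

def parse_strict(raw: str) -> dict:
    """Reject anything that isn't a JSON object with exactly the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"expected a JSON object with keys {sorted(REQUIRED_KEYS)}")
    return data

parse_strict('{"category": "billing", "confidence": 0.92}')  # passes
# parse_strict('Sure! The category is billing.')              # raises ValueError
```

Failed parses are themselves rubric data: a variant whose outputs frequently trip this check is failing the format-compliance criterion, regardless of how good its content is.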
When NOT to Use
Push back if:
- The task doesn't need an LLM (deterministic rules, regex, SQL) — use the right tool
- The user wants prompt execution, not design — this skill designs and evaluates, it doesn't run prompts
- The prompt is for safety-critical decisions without human review — LLM output should not be the sole input