prompt-engineering

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Prompt Engineering

提示词工程

When to Use

适用场景

  • Crafting or refining prompts for LLM-based features
  • Improving output quality, consistency, or reliability
  • Designing system prompts for AI agents or chatbots
  • Implementing structured output (JSON, specific formats)
  • Defending against prompt injection attacks
  • Building prompt templates for reusable workflows
  • Evaluating and iterating on prompt performance
  • 为基于LLM的功能编写或优化Prompt
  • 提升输出质量、一致性或可靠性
  • 为AI Agent或聊天机器人设计系统提示词
  • 实现结构化输出(JSON、特定格式)
  • 防御Prompt注入攻击
  • 为可复用工作流构建Prompt模板
  • 对Prompt性能进行评估和迭代

Instructions

使用指南

1. Prompt Structure Fundamentals

1. Prompt结构基础

A well-structured prompt has these components (in order of importance):
  1. Role/Context — Who is the model? What domain expertise applies?
  2. Task — What exactly should it do? Be specific and unambiguous.
  3. Constraints — Format, length, tone, what to avoid.
  4. Examples — Input/output pairs demonstrating desired behavior.
  5. Input — The actual data to process.
Principles:
  • Be explicit. LLMs do not read minds — state what you want and what you do not want.
  • Put the most important instructions first and last (primacy and recency effects).
  • Use delimiters to separate sections:
    ###
    ,
    ---
    , XML tags, triple backticks.
  • Shorter is not always better — a well-structured 500-word prompt beats an ambiguous 50-word one.
一个结构完善的Prompt包含以下组件(按重要性排序):
  1. 角色/上下文 —— 模型需要扮演什么身份?适用哪些领域专业知识?
  2. 任务 —— 具体需要完成什么工作?表述要具体清晰,无歧义。
  3. 约束条件 —— 格式、长度、语气、需要避免的内容等。
  4. 示例 —— 展示预期输出效果的输入/输出对。
  5. 输入 —— 需要处理的实际数据。
设计原则:
  • 表述明确。LLM不会读心——清楚说明你想要和不想要的内容。
  • 最重要的指令放在开头和结尾(首因和近因效应)。
  • 使用分隔符区分不同模块:
    ###
    ---
    、XML标签、反引号块。
  • 短不一定好:结构清晰的500字Prompt比表述模糊的50字Prompt效果更好。

2. Core Techniques

2. 核心技术

See
references/techniques-catalog.md
for detailed templates and examples.
Zero-shot: Direct instruction with no examples. Works for simple, well-defined tasks.
Few-shot: Provide 2-5 input/output examples before the actual input. The model learns the pattern from examples. Choose diverse, representative examples. Order matters — put the most similar example last.
Chain-of-Thought (CoT): Add "Let's think step by step" or provide reasoning examples. Dramatically improves math, logic, and multi-step tasks. Can be combined with few-shot (show reasoning in examples).
Self-consistency: Generate multiple responses with temperature > 0, then take the majority answer. Best for factual or reasoning tasks where there is one correct answer.
Structured output: Request JSON, XML, or specific formats. Use JSON mode when available. Provide the exact schema in the prompt. Validate output programmatically.
详细模板和示例可查看
references/techniques-catalog.md
Zero-shot: 不提供示例,直接给出指令。适用于简单、定义清晰的任务。
Few-shot: 在实际输入前提供2-5组输入/输出示例,模型会从示例中学习规则。选择多样化、有代表性的示例,顺序很重要——把最相似的示例放在最后。
Chain-of-Thought (CoT): 加入"Let's think step by step"语句,或提供推理过程示例,能大幅提升数学、逻辑和多步骤任务的效果,可与few-shot结合使用(在示例中展示推理过程)。
Self-consistency: 用temperature > 0的参数生成多个回复,然后取多数答案。最适合只有唯一正确答案的事实类或推理类任务。
Structured output: 要求输出JSON、XML或特定格式,支持JSON模式时请开启,在Prompt中提供明确的Schema,通过编程方式校验输出。

3. System Prompt Design

3. 系统提示词设计

System prompts set persistent behavior for the entire conversation:
  • Define the persona, expertise, and communication style
  • Set hard constraints (what the model must never do)
  • Establish output format expectations
  • Include domain-specific knowledge or rules
Best practices:
  • Keep system prompts focused — one clear role, not five
  • Use positive instructions ("always do X") over negative ("never do Y") where possible
  • Test with adversarial inputs to ensure constraints hold
  • Version your system prompts and track changes like code
系统提示词会为整个对话设定固定的行为规则:
  • 定义人设、专业领域和沟通风格
  • 设定硬性约束(模型绝对不能做的事)
  • 明确输出格式要求
  • 加入领域特定的知识或规则
最佳实践:
  • 系统提示词要聚焦——只设定一个清晰的角色,不要同时设置多个
  • 尽量使用正向指令("always do X")而非负向指令("never do Y")
  • 用对抗性输入测试,确保约束生效
  • 对系统提示词做版本管理,像代码一样跟踪变更

4. Temperature and Sampling Parameters

4. Temperature和采样参数

  • Temperature (0.0 - 2.0): Controls randomness. 0.0 = deterministic, 1.0 = default creative, >1.0 = very random.
    • Use 0.0-0.3 for factual tasks, code generation, structured output
    • Use 0.5-0.8 for creative writing, brainstorming
    • Use 0.0 for reproducible evaluations
  • Top-p (0.0 - 1.0): Nucleus sampling. 0.9 means consider tokens comprising top 90% probability. Alternative to temperature — do not adjust both simultaneously.
  • Max tokens: Set to expected output length + buffer. Too low truncates output; too high wastes quota.
  • Stop sequences: Define strings that halt generation. Useful for structured extraction.
  • Temperature (0.0 - 2.0): 控制输出随机性。0.0 = 确定性输出,1.0 = 默认创意度,>1.0 = 高度随机。
    • 事实类任务、代码生成、结构化输出使用0.0-0.3
    • 创意写作、头脑风暴使用0.5-0.8
    • 可复现的评估使用0.0
  • Top-p (0.0 - 1.0): 核采样。0.9表示仅考虑累计概率占前90%的token,是Temperature的替代方案——不要同时调整两个参数。
  • Max tokens: 设置为预期输出长度+缓冲值。值太低会截断输出,太高会浪费配额。
  • Stop sequences: 定义停止生成的字符串,适用于结构化提取场景。

5. Prompt Injection Defense

5. Prompt注入防御

Prompt injection is when user input manipulates the model's behavior by overriding instructions.
Defense layers:
  1. Input sanitization: Strip or escape known injection patterns. Detect
    ignore previous instructions
    type phrases.
  2. Delimited input: Wrap user input in clear delimiters and instruct the model to treat the delimited content as data only, never as instructions.
  3. Output validation: Verify output conforms to expected format. Reject unexpected formats.
  4. Privilege separation: Use separate LLM calls for different trust levels. Do not mix system logic and user input in one prompt.
  5. Canary tokens: Include a secret token in the system prompt. If it appears in output, injection may have occurred.
No defense is perfect. Layer multiple approaches and assume breach.
Prompt注入是指用户输入通过覆盖原有指令来操纵模型行为的攻击方式。
防御层:
  1. 输入清洗: 移除或转义已知的注入模式,检测
    ignore previous instructions
    这类短语。
  2. 输入分隔: 用明确的分隔符包裹用户输入,指示模型仅将分隔符内的内容视为数据,永远不要当作指令。
  3. 输出校验: 验证输出是否符合预期格式,拒绝不符合格式的输出。
  4. 权限隔离: 不同信任级别的任务调用单独的LLM接口,不要在同一个Prompt中混合系统逻辑和用户输入。
  5. 金丝雀令牌: 在系统提示词中加入一个秘密令牌,如果输出中出现该令牌,说明可能发生了注入攻击。
没有绝对完美的防御方案,需要叠加多种方法,做好被突破的预案。

6. Prompt Templates and Iteration

6. Prompt模板与迭代

Build reusable templates with variable slots:
You are a {role} specializing in {domain}.

Analyze the following {input_type}:
---
{input}
---

Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)
Iteration process:
  1. Start with a simple prompt that captures the core task
  2. Test on 10-20 diverse inputs
  3. Identify failure modes (wrong format, missing info, hallucination)
  4. Add constraints or examples to address each failure mode
  5. Retest — ensure fixes do not break previously working cases
  6. Document the prompt version and test results
构建带变量插槽的可复用模板:
You are a {role} specializing in {domain}.

Analyze the following {input_type}:
---
{input}
---

Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)
迭代流程:
  1. 从能覆盖核心任务的简单Prompt开始
  2. 用10-20组多样化输入测试
  3. 识别失败模式(格式错误、信息缺失、幻觉)
  4. 增加约束或示例来解决每一类失败模式
  5. 重新测试——确保修复不会破坏之前正常的场景
  6. 记录Prompt版本和测试结果

7. Evaluation

7. 评估

Measure prompt quality systematically:
  • Build an eval set of 20-50 input/expected-output pairs
  • Score each output (binary pass/fail, or rubric-based 1-5)
  • Track metrics across prompt versions: accuracy, format compliance, latency
  • Use LLM-as-judge for subjective quality (see llm-evaluation skill)
  • Automate eval runs in CI when prompt changes are deployed
系统化衡量Prompt质量:
  • 构建包含20-50组输入/预期输出对的评估集
  • 为每个输出打分(二进制通过/失败,或基于评分规则的1-5分)
  • 跟踪不同Prompt版本的指标:准确率、格式合规率、延迟
  • 对主观质量使用LLM-as-judge(参考llm-evaluation skill)
  • 部署Prompt变更时,在CI中自动运行评估

Examples

示例

Designing a Classification Prompt

设计分类Prompt

User needs to classify support tickets into categories. Design a few-shot prompt with 3-5 example tickets per category. Include edge cases. Use temperature 0.0 for consistency. Request JSON output:
{"category": "...", "confidence": "high|medium|low"}
. Validate output schema programmatically. Measure accuracy against labeled test set.
用户需要将支持工单分类到不同类目。设计一个few-shot Prompt,每个类目提供3-5个示例工单,包含边界case。使用temperature 0.0保证一致性,要求JSON输出:
{"category": "...", "confidence": "high|medium|low"}
。通过编程方式校验输出Schema,基于标注测试集计算准确率。

Building a Code Review Agent

构建代码审查Agent

User wants an LLM-powered code review assistant. Design a system prompt defining the reviewer persona (senior engineer, specific language expertise). Include review criteria: correctness, performance, readability, security. Use structured output for findings. Add injection defense for code that might contain adversarial comments. Test with intentionally bad code to verify the agent catches issues.
用户想要一个LLM驱动的代码审查助手。设计系统提示词,定义审查者人设(资深工程师,特定编程语言专家),包含审查标准:正确性、性能、可读性、安全性。使用结构化输出展示发现的问题,为可能包含对抗性评论的代码添加注入防御。用故意编写的劣质代码测试,验证Agent能识别问题。

Optimizing an Underperforming Prompt

优化效果不佳的Prompt

User reports their summarization prompt produces inconsistent output. Diagnose: test on 20 inputs, categorize failures (too long, misses key points, wrong tone). Add length constraints, provide few-shot examples of ideal summaries, add chain-of-thought for complex documents. A/B test the old vs. new prompt on the eval set. Track improvement in format compliance and content accuracy.
用户反馈他们的摘要Prompt输出不稳定。诊断方法:用20组输入测试,对失败场景分类(过长、遗漏关键点、语气错误)。添加长度约束,提供few-shot理想摘要示例,为复杂文档加入chain-of-thought要求。在评估集上对新旧Prompt做A/B测试,跟踪格式合规率和内容准确率的提升情况。