prompt-engineering

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Prompt Engineering

提示词工程

When to Use

适用场景

Crafting or refining prompts for LLM-based features
Improving output quality, consistency, or reliability
Designing system prompts for AI agents or chatbots
Implementing structured output (JSON, specific formats)
Defending against prompt injection attacks
Building prompt templates for reusable workflows
Evaluating and iterating on prompt performance

为基于LLM的功能编写或优化Prompt
提升输出质量、一致性或可靠性
为AI Agent或聊天机器人设计系统提示词
实现结构化输出（JSON、特定格式）
防御Prompt注入攻击
为可复用工作流构建Prompt模板
对Prompt性能进行评估和迭代

Instructions

使用指南

1. Prompt Structure Fundamentals

1. Prompt结构基础

A well-structured prompt has these components (in order of importance):

Role/Context — Who is the model? What domain expertise applies?
Task — What exactly should it do? Be specific and unambiguous.
Constraints — Format, length, tone, what to avoid.
Examples — Input/output pairs demonstrating desired behavior.
Input — The actual data to process.

Principles:

Be explicit. LLMs do not read minds — state what you want and what you do not want.
Put the most important instructions first and last (primacy and recency effects).
Use delimiters to separate sections:
```
###
```
,
```
---
```
, XML tags, triple backticks.
Shorter is not always better — a well-structured 500-word prompt beats an ambiguous 50-word one.

一个结构完善的Prompt包含以下组件（按重要性排序）：

角色/上下文 —— 模型需要扮演什么身份？适用哪些领域专业知识？
任务 —— 具体需要完成什么工作？表述要具体清晰，无歧义。
约束条件 —— 格式、长度、语气、需要避免的内容等。
示例 —— 展示预期输出效果的输入/输出对。
输入 —— 需要处理的实际数据。

设计原则：

表述明确。LLM不会读心——清楚说明你想要和不想要的内容。
最重要的指令放在开头和结尾（首因和近因效应）。
使用分隔符区分不同模块：
```
###
```
、
```
---
```
、XML标签、反引号块。
短不一定好：结构清晰的500字Prompt比表述模糊的50字Prompt效果更好。

2. Core Techniques

2. 核心技术

See

references/techniques-catalog.md

for detailed templates and examples.

Zero-shot: Direct instruction with no examples. Works for simple, well-defined tasks.

Few-shot: Provide 2-5 input/output examples before the actual input. The model learns the pattern from examples. Choose diverse, representative examples. Order matters — put the most similar example last.

Chain-of-Thought (CoT): Add "Let's think step by step" or provide reasoning examples. Dramatically improves math, logic, and multi-step tasks. Can be combined with few-shot (show reasoning in examples).

Self-consistency: Generate multiple responses with temperature > 0, then take the majority answer. Best for factual or reasoning tasks where there is one correct answer.

Structured output: Request JSON, XML, or specific formats. Use JSON mode when available. Provide the exact schema in the prompt. Validate output programmatically.

详细模板和示例可查看

references/techniques-catalog.md

。

Zero-shot： 不提供示例，直接给出指令。适用于简单、定义清晰的任务。

Few-shot： 在实际输入前提供2-5组输入/输出示例，模型会从示例中学习规则。选择多样化、有代表性的示例，顺序很重要——把最相似的示例放在最后。

Chain-of-Thought (CoT)： 加入"Let's think step by step"语句，或提供推理过程示例，能大幅提升数学、逻辑和多步骤任务的效果，可与few-shot结合使用（在示例中展示推理过程）。

Self-consistency： 用temperature > 0的参数生成多个回复，然后取多数答案。最适合只有唯一正确答案的事实类或推理类任务。

Structured output： 要求输出JSON、XML或特定格式，支持JSON模式时请开启，在Prompt中提供明确的Schema，通过编程方式校验输出。

3. System Prompt Design

3. 系统提示词设计

System prompts set persistent behavior for the entire conversation:

Define the persona, expertise, and communication style
Set hard constraints (what the model must never do)
Establish output format expectations
Include domain-specific knowledge or rules

Best practices:

Keep system prompts focused — one clear role, not five
Use positive instructions ("always do X") over negative ("never do Y") where possible
Test with adversarial inputs to ensure constraints hold
Version your system prompts and track changes like code

系统提示词会为整个对话设定固定的行为规则：

定义人设、专业领域和沟通风格
设定硬性约束（模型绝对不能做的事）
明确输出格式要求
加入领域特定的知识或规则

最佳实践：

系统提示词要聚焦——只设定一个清晰的角色，不要同时设置多个
尽量使用正向指令（"always do X"）而非负向指令（"never do Y"）
用对抗性输入测试，确保约束生效
对系统提示词做版本管理，像代码一样跟踪变更

4. Temperature and Sampling Parameters

4. Temperature和采样参数

Temperature (0.0 - 2.0): Controls randomness. 0.0 = deterministic, 1.0 = default creative, >1.0 = very random.
- Use 0.0-0.3 for factual tasks, code generation, structured output
- Use 0.5-0.8 for creative writing, brainstorming
- Use 0.0 for reproducible evaluations
Top-p (0.0 - 1.0): Nucleus sampling. 0.9 means consider tokens comprising top 90% probability. Alternative to temperature — do not adjust both simultaneously.
Max tokens: Set to expected output length + buffer. Too low truncates output; too high wastes quota.
Stop sequences: Define strings that halt generation. Useful for structured extraction.

Temperature (0.0 - 2.0)： 控制输出随机性。0.0 = 确定性输出，1.0 = 默认创意度，>1.0 = 高度随机。
- 事实类任务、代码生成、结构化输出使用0.0-0.3
- 创意写作、头脑风暴使用0.5-0.8
- 可复现的评估使用0.0
Top-p (0.0 - 1.0)： 核采样。0.9表示仅考虑累计概率占前90%的token，是Temperature的替代方案——不要同时调整两个参数。
Max tokens： 设置为预期输出长度+缓冲值。值太低会截断输出，太高会浪费配额。
Stop sequences： 定义停止生成的字符串，适用于结构化提取场景。

5. Prompt Injection Defense

5. Prompt注入防御

Prompt injection is when user input manipulates the model's behavior by overriding instructions.

Defense layers:

Input sanitization: Strip or escape known injection patterns. Detect
```
ignore previous instructions
```
type phrases.
Delimited input: Wrap user input in clear delimiters and instruct the model to treat the delimited content as data only, never as instructions.
Output validation: Verify output conforms to expected format. Reject unexpected formats.
Privilege separation: Use separate LLM calls for different trust levels. Do not mix system logic and user input in one prompt.
Canary tokens: Include a secret token in the system prompt. If it appears in output, injection may have occurred.

No defense is perfect. Layer multiple approaches and assume breach.

Prompt注入是指用户输入通过覆盖原有指令来操纵模型行为的攻击方式。

防御层：

输入清洗： 移除或转义已知的注入模式，检测
```
ignore previous instructions
```
这类短语。
输入分隔： 用明确的分隔符包裹用户输入，指示模型仅将分隔符内的内容视为数据，永远不要当作指令。
输出校验： 验证输出是否符合预期格式，拒绝不符合格式的输出。
权限隔离： 不同信任级别的任务调用单独的LLM接口，不要在同一个Prompt中混合系统逻辑和用户输入。
金丝雀令牌： 在系统提示词中加入一个秘密令牌，如果输出中出现该令牌，说明可能发生了注入攻击。

没有绝对完美的防御方案，需要叠加多种方法，做好被突破的预案。

6. Prompt Templates and Iteration

6. Prompt模板与迭代

Build reusable templates with variable slots:

You are a {role} specializing in {domain}.

Analyze the following {input_type}:
---
{input}
---

Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)

Iteration process:

Start with a simple prompt that captures the core task
Test on 10-20 diverse inputs
Identify failure modes (wrong format, missing info, hallucination)
Add constraints or examples to address each failure mode
Retest — ensure fixes do not break previously working cases
Document the prompt version and test results

构建带变量插槽的可复用模板：

You are a {role} specializing in {domain}.

Analyze the following {input_type}:
---
{input}
---

Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)

迭代流程：

从能覆盖核心任务的简单Prompt开始
用10-20组多样化输入测试
识别失败模式（格式错误、信息缺失、幻觉）
增加约束或示例来解决每一类失败模式
重新测试——确保修复不会破坏之前正常的场景
记录Prompt版本和测试结果

7. Evaluation

7. 评估

Measure prompt quality systematically:

Build an eval set of 20-50 input/expected-output pairs
Score each output (binary pass/fail, or rubric-based 1-5)
Track metrics across prompt versions: accuracy, format compliance, latency
Use LLM-as-judge for subjective quality (see llm-evaluation skill)
Automate eval runs in CI when prompt changes are deployed

系统化衡量Prompt质量：

构建包含20-50组输入/预期输出对的评估集
为每个输出打分（二进制通过/失败，或基于评分规则的1-5分）
跟踪不同Prompt版本的指标：准确率、格式合规率、延迟
对主观质量使用LLM-as-judge（参考llm-evaluation skill）
部署Prompt变更时，在CI中自动运行评估

Examples

示例

Designing a Classification Prompt

设计分类Prompt

User needs to classify support tickets into categories. Design a few-shot prompt with 3-5 example tickets per category. Include edge cases. Use temperature 0.0 for consistency. Request JSON output:

{"category": "...", "confidence": "high|medium|low"}

. Validate output schema programmatically. Measure accuracy against labeled test set.

用户需要将支持工单分类到不同类目。设计一个few-shot Prompt，每个类目提供3-5个示例工单，包含边界case。使用temperature 0.0保证一致性，要求JSON输出：

{"category": "...", "confidence": "high|medium|low"}

。通过编程方式校验输出Schema，基于标注测试集计算准确率。

Building a Code Review Agent

构建代码审查Agent

User wants an LLM-powered code review assistant. Design a system prompt defining the reviewer persona (senior engineer, specific language expertise). Include review criteria: correctness, performance, readability, security. Use structured output for findings. Add injection defense for code that might contain adversarial comments. Test with intentionally bad code to verify the agent catches issues.

用户想要一个LLM驱动的代码审查助手。设计系统提示词，定义审查者人设（资深工程师，特定编程语言专家），包含审查标准：正确性、性能、可读性、安全性。使用结构化输出展示发现的问题，为可能包含对抗性评论的代码添加注入防御。用故意编写的劣质代码测试，验证Agent能识别问题。

Optimizing an Underperforming Prompt

优化效果不佳的Prompt

User reports their summarization prompt produces inconsistent output. Diagnose: test on 20 inputs, categorize failures (too long, misses key points, wrong tone). Add length constraints, provide few-shot examples of ideal summaries, add chain-of-thought for complex documents. A/B test the old vs. new prompt on the eval set. Track improvement in format compliance and content accuracy.

用户反馈他们的摘要Prompt输出不稳定。诊断方法：用20组输入测试，对失败场景分类（过长、遗漏关键点、语气错误）。添加长度约束，提供few-shot理想摘要示例，为复杂文档加入chain-of-thought要求。在评估集上对新旧Prompt做A/B测试，跟踪格式合规率和内容准确率的提升情况。