prompt-engineering
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChinesePrompt Engineering
提示词工程
When to Use
适用场景
- Crafting or refining prompts for LLM-based features
- Improving output quality, consistency, or reliability
- Designing system prompts for AI agents or chatbots
- Implementing structured output (JSON, specific formats)
- Defending against prompt injection attacks
- Building prompt templates for reusable workflows
- Evaluating and iterating on prompt performance
- 为基于LLM的功能编写或优化Prompt
- 提升输出质量、一致性或可靠性
- 为AI Agent或聊天机器人设计系统提示词
- 实现结构化输出(JSON、特定格式)
- 防御Prompt注入攻击
- 为可复用工作流构建Prompt模板
- 对Prompt性能进行评估和迭代
Instructions
使用指南
1. Prompt Structure Fundamentals
1. Prompt结构基础
A well-structured prompt has these components (in order of importance):
- Role/Context — Who is the model? What domain expertise applies?
- Task — What exactly should it do? Be specific and unambiguous.
- Constraints — Format, length, tone, what to avoid.
- Examples — Input/output pairs demonstrating desired behavior.
- Input — The actual data to process.
Principles:
- Be explicit. LLMs do not read minds — state what you want and what you do not want.
- Put the most important instructions first and last (primacy and recency effects).
- Use delimiters to separate sections: ,
###, XML tags, triple backticks.--- - Shorter is not always better — a well-structured 500-word prompt beats an ambiguous 50-word one.
一个结构完善的Prompt包含以下组件(按重要性排序):
- 角色/上下文 —— 模型需要扮演什么身份?适用哪些领域专业知识?
- 任务 —— 具体需要完成什么工作?表述要具体清晰,无歧义。
- 约束条件 —— 格式、长度、语气、需要避免的内容等。
- 示例 —— 展示预期输出效果的输入/输出对。
- 输入 —— 需要处理的实际数据。
设计原则:
- 表述明确。LLM不会读心——清楚说明你想要和不想要的内容。
- 最重要的指令放在开头和结尾(首因和近因效应)。
- 使用分隔符区分不同模块:、
###、XML标签、反引号块。--- - 短不一定好:结构清晰的500字Prompt比表述模糊的50字Prompt效果更好。
2. Core Techniques
2. 核心技术
See for detailed templates and examples.
references/techniques-catalog.mdZero-shot: Direct instruction with no examples. Works for simple, well-defined tasks.
Few-shot: Provide 2-5 input/output examples before the actual input. The model learns the pattern from examples. Choose diverse, representative examples. Order matters — put the most similar example last.
Chain-of-Thought (CoT): Add "Let's think step by step" or provide reasoning examples. Dramatically improves math, logic, and multi-step tasks. Can be combined with few-shot (show reasoning in examples).
Self-consistency: Generate multiple responses with temperature > 0, then take the majority answer. Best for factual or reasoning tasks where there is one correct answer.
Structured output: Request JSON, XML, or specific formats. Use JSON mode when available. Provide the exact schema in the prompt. Validate output programmatically.
详细模板和示例可查看。
references/techniques-catalog.mdZero-shot: 不提供示例,直接给出指令。适用于简单、定义清晰的任务。
Few-shot: 在实际输入前提供2-5组输入/输出示例,模型会从示例中学习规则。选择多样化、有代表性的示例,顺序很重要——把最相似的示例放在最后。
Chain-of-Thought (CoT): 加入"Let's think step by step"语句,或提供推理过程示例,能大幅提升数学、逻辑和多步骤任务的效果,可与few-shot结合使用(在示例中展示推理过程)。
Self-consistency: 用temperature > 0的参数生成多个回复,然后取多数答案。最适合只有唯一正确答案的事实类或推理类任务。
Structured output: 要求输出JSON、XML或特定格式,支持JSON模式时请开启,在Prompt中提供明确的Schema,通过编程方式校验输出。
3. System Prompt Design
3. 系统提示词设计
System prompts set persistent behavior for the entire conversation:
- Define the persona, expertise, and communication style
- Set hard constraints (what the model must never do)
- Establish output format expectations
- Include domain-specific knowledge or rules
Best practices:
- Keep system prompts focused — one clear role, not five
- Use positive instructions ("always do X") over negative ("never do Y") where possible
- Test with adversarial inputs to ensure constraints hold
- Version your system prompts and track changes like code
系统提示词会为整个对话设定固定的行为规则:
- 定义人设、专业领域和沟通风格
- 设定硬性约束(模型绝对不能做的事)
- 明确输出格式要求
- 加入领域特定的知识或规则
最佳实践:
- 系统提示词要聚焦——只设定一个清晰的角色,不要同时设置多个
- 尽量使用正向指令("always do X")而非负向指令("never do Y")
- 用对抗性输入测试,确保约束生效
- 对系统提示词做版本管理,像代码一样跟踪变更
4. Temperature and Sampling Parameters
4. Temperature和采样参数
- Temperature (0.0 - 2.0): Controls randomness. 0.0 = deterministic, 1.0 = default creative, >1.0 = very random.
- Use 0.0-0.3 for factual tasks, code generation, structured output
- Use 0.5-0.8 for creative writing, brainstorming
- Use 0.0 for reproducible evaluations
- Top-p (0.0 - 1.0): Nucleus sampling. 0.9 means consider tokens comprising top 90% probability. Alternative to temperature — do not adjust both simultaneously.
- Max tokens: Set to expected output length + buffer. Too low truncates output; too high wastes quota.
- Stop sequences: Define strings that halt generation. Useful for structured extraction.
- Temperature (0.0 - 2.0): 控制输出随机性。0.0 = 确定性输出,1.0 = 默认创意度,>1.0 = 高度随机。
- 事实类任务、代码生成、结构化输出使用0.0-0.3
- 创意写作、头脑风暴使用0.5-0.8
- 可复现的评估使用0.0
- Top-p (0.0 - 1.0): 核采样。0.9表示仅考虑累计概率占前90%的token,是Temperature的替代方案——不要同时调整两个参数。
- Max tokens: 设置为预期输出长度+缓冲值。值太低会截断输出,太高会浪费配额。
- Stop sequences: 定义停止生成的字符串,适用于结构化提取场景。
5. Prompt Injection Defense
5. Prompt注入防御
Prompt injection is when user input manipulates the model's behavior by overriding instructions.
Defense layers:
- Input sanitization: Strip or escape known injection patterns. Detect type phrases.
ignore previous instructions - Delimited input: Wrap user input in clear delimiters and instruct the model to treat the delimited content as data only, never as instructions.
- Output validation: Verify output conforms to expected format. Reject unexpected formats.
- Privilege separation: Use separate LLM calls for different trust levels. Do not mix system logic and user input in one prompt.
- Canary tokens: Include a secret token in the system prompt. If it appears in output, injection may have occurred.
No defense is perfect. Layer multiple approaches and assume breach.
Prompt注入是指用户输入通过覆盖原有指令来操纵模型行为的攻击方式。
防御层:
- 输入清洗: 移除或转义已知的注入模式,检测这类短语。
ignore previous instructions - 输入分隔: 用明确的分隔符包裹用户输入,指示模型仅将分隔符内的内容视为数据,永远不要当作指令。
- 输出校验: 验证输出是否符合预期格式,拒绝不符合格式的输出。
- 权限隔离: 不同信任级别的任务调用单独的LLM接口,不要在同一个Prompt中混合系统逻辑和用户输入。
- 金丝雀令牌: 在系统提示词中加入一个秘密令牌,如果输出中出现该令牌,说明可能发生了注入攻击。
没有绝对完美的防御方案,需要叠加多种方法,做好被突破的预案。
6. Prompt Templates and Iteration
6. Prompt模板与迭代
Build reusable templates with variable slots:
You are a {role} specializing in {domain}.
Analyze the following {input_type}:
---
{input}
---
Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)Iteration process:
- Start with a simple prompt that captures the core task
- Test on 10-20 diverse inputs
- Identify failure modes (wrong format, missing info, hallucination)
- Add constraints or examples to address each failure mode
- Retest — ensure fixes do not break previously working cases
- Document the prompt version and test results
构建带变量插槽的可复用模板:
You are a {role} specializing in {domain}.
Analyze the following {input_type}:
---
{input}
---
Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)迭代流程:
- 从能覆盖核心任务的简单Prompt开始
- 用10-20组多样化输入测试
- 识别失败模式(格式错误、信息缺失、幻觉)
- 增加约束或示例来解决每一类失败模式
- 重新测试——确保修复不会破坏之前正常的场景
- 记录Prompt版本和测试结果
7. Evaluation
7. 评估
Measure prompt quality systematically:
- Build an eval set of 20-50 input/expected-output pairs
- Score each output (binary pass/fail, or rubric-based 1-5)
- Track metrics across prompt versions: accuracy, format compliance, latency
- Use LLM-as-judge for subjective quality (see llm-evaluation skill)
- Automate eval runs in CI when prompt changes are deployed
系统化衡量Prompt质量:
- 构建包含20-50组输入/预期输出对的评估集
- 为每个输出打分(二进制通过/失败,或基于评分规则的1-5分)
- 跟踪不同Prompt版本的指标:准确率、格式合规率、延迟
- 对主观质量使用LLM-as-judge(参考llm-evaluation skill)
- 部署Prompt变更时,在CI中自动运行评估
Examples
示例
Designing a Classification Prompt
设计分类Prompt
User needs to classify support tickets into categories. Design a few-shot prompt with 3-5 example tickets per category. Include edge cases. Use temperature 0.0 for consistency. Request JSON output: . Validate output schema programmatically. Measure accuracy against labeled test set.
{"category": "...", "confidence": "high|medium|low"}用户需要将支持工单分类到不同类目。设计一个few-shot Prompt,每个类目提供3-5个示例工单,包含边界case。使用temperature 0.0保证一致性,要求JSON输出:。通过编程方式校验输出Schema,基于标注测试集计算准确率。
{"category": "...", "confidence": "high|medium|low"}Building a Code Review Agent
构建代码审查Agent
User wants an LLM-powered code review assistant. Design a system prompt defining the reviewer persona (senior engineer, specific language expertise). Include review criteria: correctness, performance, readability, security. Use structured output for findings. Add injection defense for code that might contain adversarial comments. Test with intentionally bad code to verify the agent catches issues.
用户想要一个LLM驱动的代码审查助手。设计系统提示词,定义审查者人设(资深工程师,特定编程语言专家),包含审查标准:正确性、性能、可读性、安全性。使用结构化输出展示发现的问题,为可能包含对抗性评论的代码添加注入防御。用故意编写的劣质代码测试,验证Agent能识别问题。
Optimizing an Underperforming Prompt
优化效果不佳的Prompt
User reports their summarization prompt produces inconsistent output. Diagnose: test on 20 inputs, categorize failures (too long, misses key points, wrong tone). Add length constraints, provide few-shot examples of ideal summaries, add chain-of-thought for complex documents. A/B test the old vs. new prompt on the eval set. Track improvement in format compliance and content accuracy.
用户反馈他们的摘要Prompt输出不稳定。诊断方法:用20组输入测试,对失败场景分类(过长、遗漏关键点、语气错误)。添加长度约束,提供few-shot理想摘要示例,为复杂文档加入chain-of-thought要求。在评估集上对新旧Prompt做A/B测试,跟踪格式合规率和内容准确率的提升情况。