prompt-engineering

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

When this skill is activated, always start your first response with the 🧢 emoji.

激活此技能后，你的首次回复请始终以🧢表情符号开头。

Prompt Engineering

提示词工程

Prompt engineering is the practice of designing inputs to language models to reliably elicit high-quality, accurate, and appropriately formatted outputs. It covers everything from writing system instructions to multi-step reasoning pipelines and retrieval-augmented generation. Effective prompting reduces hallucinations, improves consistency, and unlocks capabilities the model already has but needs guidance to apply. The techniques here apply across providers (OpenAI, Anthropic, Google) with minor syntactic differences.

提示词工程是指设计语言模型输入，以稳定获取高质量、准确且格式恰当的输出的实践。它涵盖了从编写系统指令到多步推理管道和检索增强生成的所有内容。有效的提示词可以减少幻觉、提高一致性，并解锁模型本身具备但需要引导才能发挥的能力。这里的技术适用于所有提供商（OpenAI、Anthropic、Google），仅存在细微的语法差异。

When to use this skill

何时使用此技能

Trigger this skill when the task involves:

Writing or refining a system prompt for an agent or chatbot
Implementing chain-of-thought reasoning to improve accuracy on hard tasks
Designing few-shot examples to steer model behavior
Building a RAG pipeline (retrieval + context injection + generation)
Getting structured JSON/schema output from a model reliably
Chaining multiple LLM calls (decomposition, routing, verification)
Evaluating or benchmarking prompt quality across dimensions
Choosing between zero-shot, few-shot, fine-tuning, or RAG approaches
Debugging inconsistent or hallucinated model outputs

Do NOT trigger this skill for:

Model training, fine-tuning infrastructure, or RLHF pipelines (those are ML engineering)
Framework-specific agent wiring (use the
```
mastra
```
or relevant framework skill instead)

当任务涉及以下内容时，触发此技能：

为Agent或聊天机器人编写或优化系统提示词
实现思维链推理以提升复杂任务的准确性
设计少样本示例以引导模型行为
构建RAG管道（检索 + 上下文注入 + 生成）
稳定获取模型输出的结构化JSON/ Schema
链式调用多个LLM（分解、路由、验证）
从多维度评估或基准测试提示词质量
在零样本、少样本、微调或RAG方法之间做选择
调试模型输出不一致或产生幻觉的问题

请勿在以下场景触发此技能：

模型训练、微调基础设施或RLHF管道（这些属于机器学习工程范畴）
特定框架的Agent连接配置（请使用
```
mastra
```
或相关框架技能）

Key principles

核心原则

Be specific and explicit - Vague instructions produce vague outputs. State the audience, format, length, tone, and constraints in every prompt.
Provide context before instruction - Background and examples before the task reduces ambiguity. The model reads top-to-bottom; front-load what matters.
Use structured output - Request JSON, markdown tables, or a fixed schema when downstream code will consume the response. Pair with schema validation and retries.
Iterate and evaluate - Treat prompts as code. Version them, test against a golden eval set, and measure regressions before deploying changes.
Decompose complex tasks - A single prompt asking the model to research, reason, and format simultaneously degrades quality. Break into sequential or parallel calls.

具体明确 - 模糊的指令会产生模糊的输出。在每个提示词中说明受众、格式、长度、语气和约束条件。
先提供上下文，再给出指令 - 任务前的背景信息和示例可减少歧义。模型是从上到下读取内容的，将重要信息前置。
使用结构化输出 - 当下游代码需要处理响应时，请求返回JSON、Markdown表格或固定Schema。配合Schema验证和重试机制。
迭代与评估 - 将提示词视为代码。对其进行版本控制，针对黄金评估集测试，并在部署变更前衡量回归情况。
分解复杂任务 - 单个提示词同时要求模型进行研究、推理和格式化会降低输出质量。将任务拆分为顺序或并行调用。

Core concepts

核心概念

System / user / assistant roles

系统/用户/助手角色

Role	Purpose	Notes
`system`	Persistent instructions, persona, constraints	Set once; applies to full conversation
`user`	The human turn - questions, tasks, data	Can include injected context (RAG, tool output)
`assistant`	Model response (or prefill to steer format)	Prefilling forces a specific start token

角色	用途	注意事项
`system`	持久化指令、角色设定、约束条件	只需设置一次；适用于整个对话
`user`	人类的交互轮次 - 问题、任务、数据	可包含注入的上下文（RAG、工具输出）
`assistant`	模型响应（或预填充内容以引导格式）	预填充可强制使用特定的起始标记

Temperature and sampling

温度参数与采样

```
temperature: 0
```
- Deterministic, best for factual extraction and structured output
```
temperature: 0.3-0.7
```
- Balanced creativity and coherence; good for most tasks
```
temperature: 1.0+
```
- High diversity; useful for brainstorming, risky for factual tasks
```
top_p
```
(nucleus sampling) - Alternative to temperature; values 0.9-0.95 are common
Never set both
```
temperature
```
and
```
top_p
```
to non-default at the same time

```
temperature: 0
```
- 确定性输出，最适合事实提取和结构化输出
```
temperature: 0.3-0.7
```
- 平衡创造性和连贯性；适用于大多数任务
```
temperature: 1.0+
```
- 高多样性；适用于头脑风暴，但不适合事实类任务
```
top_p
```
（核采样）- 温度参数的替代方案；常用值为0.9-0.95
切勿同时将
```
temperature
```
和
```
top_p
```
设置为非默认值

Token economics

Token成本控制

Input tokens cost less than output tokens on most providers - keep outputs focused
Longer context = slower TTFT (time to first token) and higher cost
Few-shot examples consume significant tokens; choose examples carefully
Use
```
max_tokens
```
to cap runaway responses

在大多数提供商处，输入Token的成本低于输出Token - 请聚焦输出内容
上下文越长，TTFT（首个Token生成时间）越慢，成本越高
少样本示例会消耗大量Token；请谨慎选择示例
使用
```
max_tokens
```
限制过长的响应

Context window management

上下文窗口管理

Modern models: 128K-1M token windows, but quality degrades near limits ("lost in the middle")
Place critical instructions at the start and end of long prompts
For RAG: inject only top-K retrieved chunks, not entire documents
Summarize long conversation history rather than passing raw transcripts

现代模型：128K-1M Token窗口，但接近上限时质量会下降（“中间信息丢失”）
将关键指令放在长提示词的开头和结尾
对于RAG：仅注入Top-K检索到的片段，而非整个文档
总结长对话历史，而非传递原始转录内容

Prompt vs fine-tuning decision

提示词与微调的决策对比

Scenario	Approach
New behavior, few examples	Zero-shot or few-shot prompting
Consistent style/format needed	Few-shot or system prompt
Thousands of labeled examples + consistent task	Fine-tuning
Domain knowledge too large for context	RAG
Latency-critical, repeated same task	Fine-tune for smaller/faster model

场景	方法
新行为，示例较少	零样本或少样本提示词
需要一致的风格/格式	少样本或系统提示词
数千个标注示例 + 任务一致	微调
领域知识超出上下文容量	RAG
延迟敏感、重复执行相同任务	微调以使用更小/更快的模型

Common tasks

常见任务

Write effective system prompts

编写有效的系统提示词

Template:

You are [PERSONA] helping [AUDIENCE] with [DOMAIN].

Your responsibilities:
- [CORE TASK 1]
- [CORE TASK 2]

Constraints:
- [HARD RULE 1 - what to never do]
- [HARD RULE 2]

Output format: [FORMAT DESCRIPTION]

Concrete example:

You are a senior code reviewer helping software engineers improve TypeScript code quality.

Your responsibilities:
- Identify bugs, logic errors, and type safety issues
- Suggest idiomatic improvements with brief reasoning
- Flag security vulnerabilities explicitly

Constraints:
- Never rewrite the entire file unprompted; focus on the diff
- Do not praise code unless it exemplifies a non-obvious pattern worth reinforcing

Output format: Return a markdown list of findings. Each item: [SEVERITY] - description.

Anti-patterns:

"Be helpful, harmless, and honest" (too generic - the model already knows this)
Contradictory constraints ("be concise" and "explain everything in detail")
No output format specification when downstream parsing is required

模板：

You are [PERSONA] helping [AUDIENCE] with [DOMAIN].

Your responsibilities:
- [CORE TASK 1]
- [CORE TASK 2]

Constraints:
- [HARD RULE 1 - what to never do]
- [HARD RULE 2]

Output format: [FORMAT DESCRIPTION]

具体示例：

You are a senior code reviewer helping software engineers improve TypeScript code quality.

Your responsibilities:
- Identify bugs, logic errors, and type safety issues
- Suggest idiomatic improvements with brief reasoning
- Flag security vulnerabilities explicitly

Constraints:
- Never rewrite the entire file unprompted; focus on the diff
- Do not praise code unless it exemplifies a non-obvious pattern worth reinforcing

Output format: Return a markdown list of findings. Each item: [SEVERITY] - description.

反模式：

“请提供帮助、无害且诚实的回复”（过于通用 - 模型本身已了解此要求）
相互矛盾的约束（“保持简洁”和“详细解释所有内容”）
当下游需要解析时未指定输出格式

Implement chain-of-thought

实现思维链

Zero-shot CoT - append "Let's think step by step." to trigger reasoning:

User: A store has 3 boxes of apples, each containing 12 apples. They sell 15 apples.
      How many remain? Let's think step by step.

Structured CoT - define explicit reasoning steps:

System: When solving math or logic problems, follow this structure:
  1. UNDERSTAND: Restate what is being asked
  2. PLAN: List the operations needed
  3. EXECUTE: Work through each step
  4. ANSWER: State the final answer clearly

User: [problem]

Self-consistency (sample multiple reasoning paths, majority-vote the answer):

python

answers = []
for _ in range(5):
    response = llm.complete(cot_prompt, temperature=0.7)
    answers.append(extract_answer(response))
final_answer = Counter(answers).most_common(1)[0][0]

Use CoT for arithmetic, logic, multi-step planning, and ambiguous classification. Skip CoT for simple lookup tasks - it adds tokens without benefit.

零样本思维链 - 追加“Let's think step by step.”触发推理：

User: A store has 3 boxes of apples, each containing 12 apples. They sell 15 apples.
      How many remain? Let's think step by step.

结构化思维链 - 定义明确的推理步骤：

System: When solving math or logic problems, follow this structure:
  1. UNDERSTAND: Restate what is being asked
  2. PLAN: List the operations needed
  3. EXECUTE: Work through each step
  4. ANSWER: State the final answer clearly

User: [problem]

自一致性（采样多条推理路径，以多数投票确定答案）：

python

answers = []
for _ in range(5):
    response = llm.complete(cot_prompt, temperature=0.7)
    answers.append(extract_answer(response))
final_answer = Counter(answers).most_common(1)[0][0]

思维链适用于算术、逻辑、多步规划和模糊分类任务。简单的查询任务无需使用思维链 - 这会增加Token消耗却无收益。

Design few-shot examples

设计少样本示例

Selection criteria:

Cover the most common input patterns (not edge cases for initial shot selection)
Include at least one negative/refusal example if the model should decline certain inputs
Keep formatting identical across all examples - models learn from structural patterns

Ordering:

Most representative examples first; most recent (closest to the query) last
For classification: interleave classes rather than grouping them

Formatting template:

System: Classify the sentiment of customer reviews as POSITIVE, NEGATIVE, or NEUTRAL.

User: Review: "The product arrived on time but the packaging was damaged."
Assistant: NEGATIVE

User: Review: "Exactly as described, fast shipping. Very happy!"
Assistant: POSITIVE

User: Review: "It works."
Assistant: NEUTRAL

User: Review: "{actual_review}"

3-8 examples typically saturate few-shot gains. More examples rarely help and consume context budget that could be used for the actual input.

选择标准：

覆盖最常见的输入模式（初始选择示例时无需考虑边缘情况）
如果模型需要拒绝某些输入，至少包含一个负面/拒绝示例
所有示例的格式保持一致 - 模型会从结构模式中学习

排序规则：

最具代表性的示例放在前面；与查询最相关的最新示例放在最后
分类任务：交替展示不同类别，而非按类别分组

格式模板：

System: Classify the sentiment of customer reviews as POSITIVE, NEGATIVE, or NEUTRAL.

User: Review: "The product arrived on time but the packaging was damaged."
Assistant: NEGATIVE

User: Review: "Exactly as described, fast shipping. Very happy!"
Assistant: POSITIVE

User: Review: "It works."
Assistant: NEUTRAL

User: Review: "{actual_review}"

3-8个示例通常就能达到少样本的最大收益。更多示例几乎没有帮助，还会占用可用于实际输入的上下文预算。

Build a RAG prompt pipeline

构建RAG提示词管道

Step 1 - Retrieval: embed the query and fetch top-K chunks from a vector store.

Step 2 - Context injection:

System: You are a documentation assistant. Answer questions using ONLY the provided
        context. If the answer is not in the context, say "I don't have that information."

Context:
---
{retrieved_chunk_1}
---
{retrieved_chunk_2}
---

User: {user_question}

Step 3 - Generation with citation:

System: [...as above...]
        After your answer, list sources as: Sources: [chunk title or ID]

User: How do I configure authentication?

Key decisions:

Chunk size: 256-512 tokens for precision; 1024 for broader context
Overlap: 10-20% of chunk size to avoid cutting mid-sentence
Reranking: use a cross-encoder reranker after initial retrieval to improve top-K quality
Query rewriting: expand ambiguous queries before embedding for better recall

Never inject raw retrieved text without a clear delimiter. Models need structural separation to distinguish context from instructions.

步骤1 - 检索： 将查询语句嵌入后，从向量库中获取Top-K片段。

步骤2 - 上下文注入：

System: You are a documentation assistant. Answer questions using ONLY the provided
        context. If the answer is not in the context, say "I don't have that information."

Context:
---
{retrieved_chunk_1}
---
{retrieved_chunk_2}
---

User: {user_question}

步骤3 - 带引用的生成：

System: [...as above...]
        After your answer, list sources as: Sources: [chunk title or ID]

User: How do I configure authentication?

关键决策：

片段大小：256-512 Token以保证精度；1024 Token可获取更广泛的上下文
重叠率：片段大小的10-20%，避免在句子中间截断
重排序：初始检索后使用交叉编码器重排序器提升Top-K质量
查询重写：在嵌入前扩展模糊查询以提高召回率

切勿在没有明确分隔符的情况下注入原始检索文本。模型需要结构分隔来区分上下文和指令。

Get structured JSON output

获取结构化JSON输出

Schema enforcement via function calling / structured output (preferred):

python

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract person info from: Alice Smith, 32, engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "role": {"type": "string"}
                },
                "required": ["name", "age", "role"]
            }
        }
    }
)

Prompt-based fallback with retry:

python

def extract_json(prompt: str, schema: dict, max_retries=3) -> dict:
    for attempt in range(max_retries):
        raw = llm.complete(f"{prompt}\n\nRespond with valid JSON matching: {schema}")
        try:
            data = json.loads(raw)
            validate(data, schema)  # jsonschema
            return data
        except (json.JSONDecodeError, ValidationError) as e:
            prompt += f"\n\nPrevious response was invalid: {e}. Fix and retry."
    raise RuntimeError("Failed to get valid JSON after retries")

Always validate parsed JSON against a schema - do not trust model-generated structure blindly. Use
response_format: json_object
as a minimum guardrail.

通过函数调用/结构化输出强制Schema（推荐）：

python

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract person info from: Alice Smith, 32, engineer"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "role": {"type": "string"}
                },
                "required": ["name", "age", "role"]
            }
        }
    }
)

基于提示词的回退方案（带重试）：

python

def extract_json(prompt: str, schema: dict, max_retries=3) -> dict:
    for attempt in range(max_retries):
        raw = llm.complete(f"{prompt}\n\nRespond with valid JSON matching: {schema}")
        try:
            data = json.loads(raw)
            validate(data, schema)  # jsonschema
            return data
        except (json.JSONDecodeError, ValidationError) as e:
            prompt += f"\n\nPrevious response was invalid: {e}. Fix and retry."
    raise RuntimeError("Failed to get valid JSON after retries")

始终根据Schema验证解析后的JSON - 不要盲目信任模型生成的结构。至少使用
response_format: json_object
作为防护措施。

Implement prompt chaining

实现提示词链式调用

Decomposition pattern - split a complex task into sequential LLM calls:

python

undefined

分解模式 - 将复杂任务拆分为多个顺序LLM调用：

python

undefined

Step 1: Research

research = llm.complete(f"List key facts about: {topic}")

Step 2: Outline

outline = llm.complete(f"Given these facts:\n{research}\n\nCreate a structured outline.")

Step 3: Write

article = llm.complete(f"Outline:\n{outline}\n\nWrite the full article.")


**Routing pattern** - use a classifier call to select the right downstream prompt:
```python
intent = llm.complete(
    f"Classify this request as one of [refund, technical, billing, other]: {user_message}"
)
handler_prompt = PROMPTS[intent.strip().lower()]
response = llm.complete(handler_prompt.format(message=user_message))

Verification pattern - add a critic call after generation:

python

draft = llm.complete(task_prompt)
critique = llm.complete(
    f"Review this output for accuracy and completeness:\n{draft}\n\n"
    "List any errors or missing information. If none, respond 'APPROVED'."
)
if "APPROVED" not in critique:
    final = llm.complete(f"Revise based on this critique:\n{critique}\n\nDraft:\n{draft}")

article = llm.complete(f"Outline:\n{outline}\n\nWrite the full article.")


**路由模式** - 使用分类器调用选择合适的下游提示词：
```python
intent = llm.complete(
    f"Classify this request as one of [refund, technical, billing, other]: {user_message}"
)
handler_prompt = PROMPTS[intent.strip().lower()]
response = llm.complete(handler_prompt.format(message=user_message))

验证模式 - 生成后添加校验调用：

python

draft = llm.complete(task_prompt)
critique = llm.complete(
    f"Review this output for accuracy and completeness:\n{draft}\n\n"
    "List any errors or missing information. If none, respond 'APPROVED'."
)
if "APPROVED" not in critique:
    final = llm.complete(f"Revise based on this critique:\n{critique}\n\nDraft:\n{draft}")

Evaluate prompt quality

评估提示词质量

Metric	How to measure	Target
Accuracy	Compare to golden answers on eval set	Task-dependent; establish baseline
Consistency	Run same prompt N times, measure output variance	< 10% divergence for deterministic tasks
Format compliance	Parse output programmatically; count failures	> 99% for production structured output
Latency	P50/P95 TTFT and total response time	Set SLA before optimizing
Cost	Input + output tokens x price per token	Track per-request; alert on spikes
Hallucination rate	Human eval or reference-based metrics (RAGAS for RAG)	Establish red lines

Eval harness pattern:

python

results = []
for case in eval_set:
    output = llm.complete(prompt.format(**case["inputs"]))
    results.append({
        "id": case["id"],
        "pass": case["expected"] in output,
        "output": output,
    })
print(f"Pass rate: {sum(r['pass'] for r in results) / len(results):.1%}")

指标	测量方式	目标
准确性	与评估集的黄金答案对比	取决于任务；先建立基准
一致性	同一提示词运行N次，测量输出差异	确定性任务的差异<10%
格式合规性	以编程方式解析输出；统计失败次数	生产环境结构化输出的合规性>99%
延迟	P50/P95的TTFT和总响应时间	优化前先设置SLA
成本	输入+输出Token数 × 每Token价格	跟踪每个请求的成本；异常时触发告警
幻觉率	人工评估或基于参考的指标（RAG场景使用RAGAS）	设定红线阈值

评估框架模式：

python

results = []
for case in eval_set:
    output = llm.complete(prompt.format(**case["inputs"]))
    results.append({
        "id": case["id"],
        "pass": case["expected"] in output,
        "output": output,
    })
print(f"Pass rate: {sum(r['pass'] for r in results) / len(results):.1%}")

Anti-patterns / common mistakes

反模式/常见错误

Anti-pattern	Problem	Fix
Asking multiple unrelated questions in one prompt	Model answers one well, ignores others	One task per prompt; chain calls
System prompt with no output format	Responses vary wildly across runs	Always specify format, length, structure
Using temperature > 0 for structured extraction	JSON parse failures increase dramatically	Set `temperature: 0` for deterministic tasks
Injecting entire documents into context	"Lost in the middle" - model ignores center of context	Chunk and retrieve only relevant passages
No eval set before shipping a prompt	No way to detect regressions	Build a 20+ case eval set before production
Trusting model output without validation	Downstream failures, security issues	Parse + validate + retry on failure

反模式	问题	修复方案
单个提示词中包含多个不相关问题	模型会很好地回答一个问题，忽略其他问题	每个提示词对应一个任务；使用链式调用
系统提示词未指定输出格式	不同运行的响应差异极大	始终指定格式、长度和结构
结构化提取任务使用temperature>0	JSON解析失败率大幅上升	确定性任务设置 `temperature: 0`
将整个文档注入上下文	“中间信息丢失” - 模型会忽略上下文中间部分	拆分文档并仅注入相关片段
上线提示词前未准备评估集	无法检测回归问题	上线前构建包含20+案例的评估集
未验证就信任模型输出	导致下游故障、安全问题	解析+验证+失败重试

References

参考资料

For a comprehensive catalog of 15+ individual prompting techniques with examples and effectiveness notes, load:

```
references/techniques-catalog.md
```
- zero-shot, CoT, self-consistency, ToT, ReAct, meta-prompting, role prompting, and more

Only load the references file when selecting or comparing specific techniques - it is long and will consume context.

如需获取包含15+种提示词技术的综合目录（含示例和有效性说明），请加载：

```
references/techniques-catalog.md
```
- 零样本、思维链、自一致性、树状思维、ReAct、元提示词、角色提示词等更多技术

仅在选择或比较特定技术时加载参考文件 - 该文件较长，会占用上下文容量。

prompt-engineering

Original

Translation

Prompt Engineering

提示词工程

When to use this skill

何时使用此技能

Key principles

核心原则

Core concepts

核心概念

System / user / assistant roles

系统/用户/助手角色

Temperature and sampling

温度参数与采样

Token economics

Token成本控制

Context window management

上下文窗口管理

Prompt vs fine-tuning decision

提示词与微调的决策对比

Common tasks

常见任务

Write effective system prompts

编写有效的系统提示词

Implement chain-of-thought

实现思维链

Design few-shot examples

设计少样本示例

Build a RAG prompt pipeline

构建RAG提示词管道

Get structured JSON output

获取结构化JSON输出

Implement prompt chaining

实现提示词链式调用

Step 1: Research

Step 1: Research

Step 2: Outline

Step 2: Outline

Step 3: Write

Step 3: Write

Evaluate prompt quality

评估提示词质量

Anti-patterns / common mistakes

反模式/常见错误

References

参考资料

Related skills

相关技能