Senior Prompt Engineer
Overview
Design, test, and optimize prompts for large language models. This skill covers systematic prompt engineering including few-shot example design, chain-of-thought reasoning, system prompt architecture, structured output specification, parameter tuning, evaluation methodology, A/B testing, and prompt version management.
Announce at start: "I'm using the senior-prompt-engineer skill for prompt design and optimization."
Phase 1: Requirements
Goal: Define the task objective, quality criteria, and constraints before writing any prompt.
Actions
- Define the task objective clearly
- Identify input/output format requirements
- Determine quality criteria (accuracy, tone, format)
- Assess edge cases and failure modes
- Choose model and parameter constraints
STOP — Do NOT proceed to Phase 2 until:
- Task objective and input/output formats are defined
- Quality criteria and edge cases are documented
- Target model and parameter constraints are chosen
Phase 2: Prompt Design
Goal: Draft the prompt with proper architecture, examples, and constraints.
Actions
- Draft system prompt with role, constraints, and format
- Design few-shot examples (3-5 representative cases)
- Add chain-of-thought scaffolding if reasoning is needed
- Specify output structure (JSON, markdown, etc.)
- Add error handling instructions
Prompt Architecture Layers
| Layer | Purpose | Example |
|---|---|---|
| 1. Identity | Who the model is | "You are a sentiment classifier..." |
| 2. Context | What it knows/has access to | "You have access to product reviews..." |
| 3. Task | What to do | "Classify each review as positive/negative/neutral" |
| 4. Constraints | What NOT to do | "Never include PII in output" |
| 5. Format | How to structure output | "Respond in JSON: {classification, confidence}" |
| 6. Examples | Demonstrations | 3-5 representative input/output pairs |
| 7. Metacognition | Handling uncertainty | "If uncertain, classify as neutral and explain" |
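The seven layers compose, in order, into a single system prompt. A minimal assembly sketch (the `build_system_prompt` helper and its parameter names are illustrative, not a standard API):

```python
def build_system_prompt(
    identity: str,
    context: str,
    task: str,
    constraints: list,
    output_format: str,
    examples: list,
    metacognition: str,
) -> str:
    """Assemble the seven architecture layers, in order, into one prompt."""
    constraint_block = "\n".join("- " + c for c in constraints)
    example_block = "\n\n".join(examples)
    sections = [identity, context, task, constraint_block,
                output_format, example_block, metacognition]
    # Drop empty layers so optional sections can be omitted cleanly
    return "\n\n".join(s for s in sections if s)
```

Keeping the layer order fixed makes diffs between prompt versions easy to read.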
System Prompt Template
[Role] You are a [specific role] that [specific capability].
[Context] You have access to [tools/knowledge]. The user will provide [input type].
[Instructions]
1. First, [step 1]
2. Then, [step 2]
3. Finally, [step 3]
[Constraints]
- Always [requirement]
- Never [prohibition]
- If uncertain, [fallback behavior]
[Output Format]
Respond in the following format:
[format specification]
[Examples]
<example>
Input: [sample input]
Output: [sample output]
</example>
STOP — Do NOT proceed to Phase 3 until:
- System prompt covers all seven architecture layers
- Few-shot examples and output structure are specified
- Error handling instructions are in place
Phase 3: Evaluation and Iteration
Goal: Measure prompt quality and iterate toward targets.
Actions
- Create evaluation dataset (50+ examples minimum)
- Define scoring rubric (automated + human metrics)
- Run baseline evaluation
- Iterate on prompt with targeted improvements
- A/B test promising variants
- Version and document the final prompt
STOP — Evaluation complete when:
- Metrics meet targets on the full evaluation dataset
- Winning variant passes statistical significance testing
- Final prompt is versioned with its metrics documented
Few-Shot Example Design
Selection Criteria Decision Table
| Criterion | Explanation | Example |
|---|---|---|
| Representative | Cover most common input types | Include typical emails, not just edge cases |
| Diverse | Include edge cases and boundaries | Short + long, positive + negative |
| Ordered | Simple to complex progression | Obvious case first, ambiguous last |
| Balanced | Equal representation of categories | Not 4 positive and 1 negative |
Example Count Guidelines
| Task Complexity | Examples Needed |
|---|---|
| Simple classification | 2-3 |
| Moderate generation | 3-5 |
| Complex reasoning | 5-8 |
| Format-sensitive | 3-5 (focus on format consistency) |
Example Format
<example>
<input>
[Representative input]
</input>
<reasoning>
[Optional: show the thinking process]
</reasoning>
<output>
[Expected output in exact target format]
</output>
</example>
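A small helper keeps every few-shot example in this exact shape so format drift cannot creep in (the function name is illustrative):

```python
def render_example(input_text: str, output_text: str, reasoning: str = "") -> str:
    """Format one few-shot example using the XML layout above.

    The <reasoning> section is emitted only when reasoning text is given.
    """
    parts = ["<example>", "<input>\n" + input_text + "\n</input>"]
    if reasoning:
        parts.append("<reasoning>\n" + reasoning + "\n</reasoning>")
    parts.append("<output>\n" + output_text + "\n</output>")
    parts.append("</example>")
    return "\n".join(parts)
```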
Chain-of-Thought Patterns
CoT Pattern Decision Table
| Pattern | Use When | Example |
|---|---|---|
| Standard CoT | Multi-step reasoning | "Think step by step: 1. Identify... 2. Analyze..." |
| Structured CoT | Need parseable reasoning | XML tags: `<thinking>` then `<answer>` |
| Self-Consistency | High-stakes decisions | Generate 3 solutions, pick most common |
| No CoT | Simple factual lookups, format conversion | Skip reasoning overhead |
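The self-consistency pattern reduces to a majority vote over repeated samples. A sketch, where `generate` stands in for whatever model-call function you use (hypothetical, not a real API; sampling must run at temperature > 0 for the votes to differ):

```python
from collections import Counter

def self_consistent_answer(generate, prompt, n=3):
    """Sample the model n times and return the most frequent final answer.

    `generate` is a placeholder: a callable that takes a prompt and
    returns the model's final answer as a string.
    """
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Extracting a short, canonical answer string from each sample (rather than voting on full responses) is what makes the vote meaningful.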
When to Use CoT
| Task Type | Use CoT? | Rationale |
|---|---|---|
| Mathematical reasoning | Yes | Step-by-step prevents errors |
| Multi-step logic | Yes | Makes reasoning transparent |
| Classification with justification | Yes | Improves accuracy and explainability |
| Simple factual lookup | No | Adds latency without accuracy gain |
| Direct format conversion | No | No reasoning needed |
| Very short responses | No | CoT overhead exceeds benefit |
Structured Output
Output Format Decision Table
| Format | Use When | Parsing |
|---|---|---|
| JSON | Machine-consumed output | JSON parser |
| Markdown | Human-readable structured text | Regex or markdown parser |
| XML tags | Sections need clear boundaries | XML parser or regex |
| YAML | Configuration-like output | YAML parser |
| Plain text | Simple, unstructured response | No parsing needed |
JSON Output Example
Respond with a JSON object matching this schema:
{
"classification": "positive" | "negative" | "neutral",
"confidence": number between 0 and 1,
"reasoning": "brief explanation",
"key_phrases": ["array", "of", "phrases"]
}
Do not include any text outside the JSON object.
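On the consuming side, parse and validate the reply before trusting it. A minimal sketch against the schema above (the function name is illustrative):

```python
import json

def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and validate it against the schema.

    Raises ValueError (or json.JSONDecodeError, its subclass) if the
    output is unparseable or violates the schema.
    """
    result = json.loads(raw)
    if result.get("classification") not in {"positive", "negative", "neutral"}:
        raise ValueError("classification outside allowed values")
    if not 0 <= result.get("confidence", -1) <= 1:
        raise ValueError("confidence must be a number between 0 and 1")
    if not isinstance(result.get("key_phrases"), list):
        raise ValueError("key_phrases must be an array")
    return result
```

Validation failures are a useful automated metric in Phase 3 (see JSON validity below).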
Temperature and Top-P Tuning
| Use Case | Temperature | Top-P | Rationale |
|---|---|---|---|
| Code generation | 0.0-0.2 | 0.9 | Deterministic, correct |
| Classification | 0.0 | 1.0 | Consistent results |
| Creative writing | 0.7-1.0 | 0.95 | Diverse, interesting |
| Summarization | 0.2-0.4 | 0.9 | Faithful but fluent |
| Brainstorming | 0.8-1.2 | 0.95 | Maximum diversity |
| Data extraction | 0.0 | 0.9 | Precise, reliable |
Rules
- Temperature 0 for tasks requiring consistency and correctness
- Higher temperature for creative tasks
- Top-P rarely needs tuning (keep at 0.9-1.0)
- Do not use both high temperature AND low top-p (contradictory)
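The table translates directly into per-task sampling presets. A sketch (preset names are illustrative, and exact parameter spellings and ranges vary by provider API):

```python
# Illustrative presets derived from the table above; check your
# provider's API documentation for exact parameter names and ranges.
SAMPLING_PRESETS = {
    "code_generation":  {"temperature": 0.1, "top_p": 0.9},
    "classification":   {"temperature": 0.0, "top_p": 1.0},
    "creative_writing": {"temperature": 0.9, "top_p": 0.95},
    "summarization":    {"temperature": 0.3, "top_p": 0.9},
    "brainstorming":    {"temperature": 1.0, "top_p": 0.95},
    "data_extraction":  {"temperature": 0.0, "top_p": 0.9},
}

def sampling_params(use_case: str) -> dict:
    """Look up a preset, defaulting to deterministic settings."""
    return SAMPLING_PRESETS.get(use_case, {"temperature": 0.0, "top_p": 1.0})
```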
Evaluation Metrics
Automated Metrics
| Metric | Measures | Use For |
|---|---|---|
| Exact Match | Output equals expected | Classification, extraction |
| F1 Score | Precision + recall balance | Multi-label tasks |
| BLEU/ROUGE | N-gram overlap | Summarization, translation |
| JSON validity | Parseable structured output | Structured generation |
| Regex match | Output matches pattern | Format compliance |
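Exact match and per-example F1 need no external libraries. A sketch:

```python
def exact_match(pred: str, gold: str) -> bool:
    """Whitespace-insensitive exact match for classification/extraction."""
    return pred.strip() == gold.strip()

def f1_multilabel(pred: set, gold: set) -> float:
    """F1 over one example's predicted vs gold label sets."""
    if not pred and not gold:
        return 1.0  # both empty: perfect agreement
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Averaging these per-example scores over the evaluation dataset gives the headline metric for each prompt version.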
Human Evaluation Dimensions
| Dimension | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Factual correctness |
| Relevance | 1-5 | Addresses the actual question |
| Coherence | 1-5 | Logical flow and structure |
| Completeness | 1-5 | Covers all required aspects |
| Tone | 1-5 | Matches desired voice |
| Conciseness | 1-5 | No unnecessary content |
Evaluation Dataset Requirements
- Minimum 50 examples for statistical significance
- Cover all input categories proportionally
- Include edge cases (10-20% of dataset)
- Gold labels reviewed by 2+ evaluators
- Version-controlled alongside prompts
A/B Testing Process
- Define hypothesis: "Prompt B will improve [metric] by [amount]"
- Hold all variables constant except the prompt change
- Run both variants on the same evaluation set
- Calculate metric differences with confidence intervals
- Require statistical significance (p < 0.05) before adopting
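For pass-rate metrics (accuracy, format compliance), significance can be checked with a two-proportion z-test using only the standard library. A sketch (for small evaluation sets, prefer an exact test such as Fisher's):

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test comparing pass rates of prompt variants A and B.

    Returns (z, p_value); adopt B only if p_value < 0.05 and B's rate
    is actually higher.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:  # all passes or all failures in both variants
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```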
What to A/B Test
| Variable | Expected Impact |
|---|---|
| Instruction phrasing (imperative vs descriptive) | Moderate |
| Number of few-shot examples | Moderate |
| Example ordering | Low-moderate |
| CoT presence/absence | High for reasoning tasks |
| Output format specification | High for structured output |
| Constraint placement (beginning vs end) | Low |
Prompt Versioning
Version File Format
```yaml
id: classify-sentiment
version: 2.1
model: claude-sonnet-4-20250514
temperature: 0.0
created: 2025-03-01
author: team
changelog: "Added edge case examples for sarcasm detection"
metrics:
  accuracy: 0.94
  f1: 0.92
  eval_dataset: sentiment-eval-v3
system_prompt: |
  You are a sentiment classifier...
examples:
  - input: "..."
    output: "..."
```
Versioning Rules
- Semantic versioning: major.minor (major = behavior change, minor = refinement)
- Every version includes evaluation metrics
- Link to evaluation dataset version
- Document what changed and why
- Keep previous versions for rollback
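The major/minor rule can be made mechanical (a sketch; `bump_version` is a hypothetical helper, not part of any tool):

```python
def bump_version(version: str, behavior_change: bool) -> str:
    """Apply the major.minor rule: a behavior change bumps major and
    resets minor; a refinement bumps minor only."""
    major, minor = (int(part) for part in version.split("."))
    return f"{major + 1}.0" if behavior_change else f"{major}.{minor + 1}"
```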
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Vague instructions ("be helpful") | Unreliable, inconsistent output | Specific instructions with examples |
| Contradictory constraints | Model cannot satisfy both | Review for consistency |
| Examples that do not match task | Confuses the model | Examples must reflect real use |
| Over-engineering simple tasks | Wasted tokens, slower | Match prompt complexity to task complexity |
| No evaluation framework | Guessing at quality | Define metrics before iterating |
| Optimizing for single example | Overfitting to one case | Optimize for the distribution |
| Assuming cross-model portability | Different models need different prompts | Test on target model |
| Skipping version control | Cannot rollback or compare | Version every prompt with metrics |
Integration Points
- LLM-as-judge evaluation assesses prompt output quality
- Prompt evaluation datasets serve as acceptance tests
- Prompt testing follows the evaluation methodology
- Statistical testing validates A/B results
- Prompt changes are reviewed like code changes
- Prompt readability follows clean code naming principles
Skill Type
FLEXIBLE — Adapt prompting techniques to the specific model, task, and quality requirements. The evaluation and versioning practices are strongly recommended but can be scaled to project size. Always version production prompts.