senior-prompt-engineer

Use when the user needs prompt design, optimization, few-shot examples, chain-of-thought patterns, structured output, evaluation metrics, or prompt versioning. Triggers: new prompt creation, prompt optimization, few-shot example design, structured output specification, A/B testing prompts, evaluation framework setup.


NPX Install

npx skill4agent add pixel-process-ug/superkit-agents senior-prompt-engineer

Senior Prompt Engineer

Overview

Design, test, and optimize prompts for large language models. This skill covers systematic prompt engineering including few-shot example design, chain-of-thought reasoning, system prompt architecture, structured output specification, parameter tuning, evaluation methodology, A/B testing, and prompt version management.
Announce at start: "I'm using the senior-prompt-engineer skill for prompt design and optimization."

Phase 1: Requirements

Goal: Define the task objective, quality criteria, and constraints before writing any prompt.

Actions

  1. Define the task objective clearly
  2. Identify input/output format requirements
  3. Determine quality criteria (accuracy, tone, format)
  4. Assess edge cases and failure modes
  5. Choose model and parameter constraints

STOP — Do NOT proceed to Phase 2 until:

  • Task objective is stated in one sentence
  • Input format and output format are defined
  • Quality criteria are measurable
  • Edge cases are listed
  • Model selection is justified

Phase 2: Prompt Design

Goal: Draft the prompt with proper architecture, examples, and constraints.

Actions

  1. Draft system prompt with role, constraints, and format
  2. Design few-shot examples (3-5 representative cases)
  3. Add chain-of-thought scaffolding if reasoning is needed
  4. Specify output structure (JSON, markdown, etc.)
  5. Add error handling instructions

Prompt Architecture Layers

| Layer | Purpose | Example |
| --- | --- | --- |
| 1. Identity | Who the model is | "You are a sentiment classifier..." |
| 2. Context | What it knows/has access to | "You have access to product reviews..." |
| 3. Task | What to do | "Classify each review as positive/negative/neutral" |
| 4. Constraints | What NOT to do | "Never include PII in output" |
| 5. Format | How to structure output | "Respond in JSON: {classification, confidence}" |
| 6. Examples | Demonstrations | 3-5 representative input/output pairs |
| 7. Metacognition | Handling uncertainty | "If uncertain, classify as neutral and explain" |

System Prompt Template

[Role] You are a [specific role] that [specific capability].

[Context] You have access to [tools/knowledge]. The user will provide [input type].

[Instructions]
1. First, [step 1]
2. Then, [step 2]
3. Finally, [step 3]

[Constraints]
- Always [requirement]
- Never [prohibition]
- If uncertain, [fallback behavior]

[Output Format]
Respond in the following format:
[format specification]

[Examples]
<example>
Input: [sample input]
Output: [sample output]
</example>
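Filling this template by hand invites inconsistency; as an illustration, the layers can also be assembled programmatically. A minimal Python sketch — the `build_system_prompt` helper and its field names are illustrative, not part of any library:

```python
def build_system_prompt(role, context, instructions, constraints,
                        output_format, examples):
    """Assemble a system prompt from the layered template above."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(instructions, 1))
    rules = "\n".join(f"- {c}" for c in constraints)
    shots = "\n".join(
        f"<example>\nInput: {inp}\nOutput: {out}\n</example>"
        for inp, out in examples
    )
    return (
        f"{role}\n\n{context}\n\n"
        f"Instructions:\n{steps}\n\n"
        f"Constraints:\n{rules}\n\n"
        f"Output format:\n{output_format}\n\n"
        f"Examples:\n{shots}"
    )

prompt = build_system_prompt(
    role="You are a sentiment classifier that labels product reviews.",
    context="The user will provide one review per message.",
    instructions=["Read the review",
                  "Weigh positive vs negative cues",
                  "Emit the label"],
    constraints=["Always output exactly one label",
                 "Never include PII in output",
                 "If uncertain, classify as neutral and explain"],
    output_format='JSON: {"classification": ..., "confidence": ...}',
    examples=[("Great battery life!",
               '{"classification": "positive", "confidence": 0.95}')],
)
```

Building the prompt from structured fields also makes it trivial to diff and version each layer independently.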

STOP — Do NOT proceed to Phase 3 until:

  • All 7 layers are addressed (or intentionally omitted with rationale)
  • Examples are representative and diverse
  • Output format is unambiguous
  • Constraints are specific (not vague)

Phase 3: Evaluation and Iteration

Goal: Measure prompt quality and iterate toward targets.

Actions

  1. Create evaluation dataset (50+ examples minimum)
  2. Define scoring rubric (automated + human metrics)
  3. Run baseline evaluation
  4. Iterate on prompt with targeted improvements
  5. A/B test promising variants
  6. Version and document the final prompt

STOP — Evaluation complete when:

  • Evaluation dataset covers all input categories
  • Metrics meet defined quality thresholds
  • A/B test shows statistical significance (p < 0.05)
  • Final prompt is versioned with metrics
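A baseline evaluation run (action 3) can start as a simple loop over the dataset. A hedged Python sketch, with `fake_model` standing in for a real LLM call against the prompt under test:

```python
def evaluate(predict, dataset):
    """Run a prediction function over (input, gold) pairs; return accuracy.

    `predict` stands in for a call to the model with the prompt under test.
    """
    correct = sum(1 for text, gold in dataset if predict(text) == gold)
    return correct / len(dataset)

# Toy stand-in for an LLM call, for illustration only.
def fake_model(text):
    return "positive" if "love" in text else "negative"

dataset = [
    ("I love it", "positive"),
    ("Broke in a day", "negative"),
    ("love the color", "positive"),
    ("Meh", "positive"),       # the toy model misses this one
]
accuracy = evaluate(fake_model, dataset)  # 3 of 4 correct -> 0.75
```

The same harness can then be rerun unchanged on each prompt variant, which is what makes the later A/B comparison apples-to-apples.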

Few-Shot Example Design

Selection Criteria Decision Table

| Criterion | Explanation | Example |
| --- | --- | --- |
| Representative | Cover most common input types | Include typical emails, not just edge cases |
| Diverse | Include edge cases and boundaries | Short + long, positive + negative |
| Ordered | Simple-to-complex progression | Obvious case first, ambiguous last |
| Balanced | Equal representation of categories | Not 4 positive and 1 negative |

Example Count Guidelines

| Task Complexity | Examples Needed |
| --- | --- |
| Simple classification | 2-3 |
| Moderate generation | 3-5 |
| Complex reasoning | 5-8 |
| Format-sensitive | 3-5 (focus on format consistency) |

Example Format

<example>
<input>
[Representative input]
</input>
<reasoning>
[Optional: show the thinking process]
</reasoning>
<output>
[Expected output in exact target format]
</output>
</example>
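Rendering every shot from data keeps the tag structure identical across examples. A minimal Python sketch — the `render_example` helper is illustrative, not a library function:

```python
def render_example(inp, output, reasoning=None):
    """Render one few-shot example in the XML-tagged format above.

    The <reasoning> section is optional, matching the template.
    """
    parts = [f"<input>\n{inp}\n</input>"]
    if reasoning is not None:
        parts.append(f"<reasoning>\n{reasoning}\n</reasoning>")
    parts.append(f"<output>\n{output}\n</output>")
    body = "\n".join(parts)
    return f"<example>\n{body}\n</example>"

shot = render_example(
    "Great phone, terrible charger.",
    "neutral",
    reasoning="One strong positive and one strong negative cue balance out.",
)
```

Generating shots this way also makes it easy to enforce the ordering and balance criteria from the table above.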

Chain-of-Thought Patterns

CoT Pattern Decision Table

| Pattern | Use When | Example |
| --- | --- | --- |
| Standard CoT | Multi-step reasoning | "Think step by step: 1. Identify... 2. Analyze..." |
| Structured CoT | Need parseable reasoning | XML tags: `<analysis>...</analysis>` then `<answer>...</answer>` |
| Self-Consistency | High-stakes decisions | Generate 3 solutions, pick most common |
| No CoT | Simple factual lookups, format conversion | Skip reasoning overhead |
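Self-consistency reduces to a majority vote over independently sampled answers. A minimal Python sketch — the sampled final answers are illustrative stand-ins for model outputs:

```python
from collections import Counter

def self_consistent_answer(samples):
    """Self-consistency: given several independently sampled final
    answers, return the most common one."""
    return Counter(samples).most_common(1)[0][0]

# Final answers from three sampled reasoning chains (illustrative):
answer = self_consistent_answer(["42", "41", "42"])  # majority -> "42"
```

In practice each sample would be a separate model call at nonzero temperature, with only the final answer extracted for the vote.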

When to Use CoT

| Task Type | Use CoT? | Rationale |
| --- | --- | --- |
| Mathematical reasoning | Yes | Step-by-step prevents errors |
| Multi-step logic | Yes | Makes reasoning transparent |
| Classification with justification | Yes | Improves accuracy and explainability |
| Simple factual lookup | No | Adds latency without accuracy gain |
| Direct format conversion | No | No reasoning needed |
| Very short responses | No | CoT overhead exceeds benefit |

Structured Output

Output Format Decision Table

| Format | Use When | Parsing |
| --- | --- | --- |
| JSON | Machine-consumed output | `JSON.parse()` |
| Markdown | Human-readable structured text | Regex or markdown parser |
| XML tags | Sections need clear boundaries | XML parser or regex |
| YAML | Configuration-like output | YAML parser |
| Plain text | Simple, unstructured response | No parsing needed |

JSON Output Example

Respond with a JSON object matching this schema:
{
  "classification": "positive" | "negative" | "neutral",
  "confidence": number between 0 and 1,
  "reasoning": "brief explanation",
  "key_phrases": ["array", "of", "phrases"]
}

Do not include any text outside the JSON object.
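On the consuming side, the schema above can be enforced with a small validator so malformed outputs fail loudly instead of propagating. A Python sketch — the `parse_classification` helper is illustrative, not a library function:

```python
import json

def parse_classification(raw):
    """Validate model output against the JSON schema in the prompt above."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on non-JSON text
    assert obj["classification"] in {"positive", "negative", "neutral"}
    assert 0.0 <= obj["confidence"] <= 1.0
    assert isinstance(obj["key_phrases"], list)
    return obj

result = parse_classification(
    '{"classification": "positive", "confidence": 0.9,'
    ' "reasoning": "enthusiastic tone", "key_phrases": ["love it"]}'
)
```

A validation failure here is itself a useful automated metric (see "JSON validity" below): the fraction of outputs that parse cleanly.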

Temperature and Top-P Tuning

| Use Case | Temperature | Top-P | Rationale |
| --- | --- | --- | --- |
| Code generation | 0.0-0.2 | 0.9 | Deterministic, correct |
| Classification | 0.0 | 1.0 | Consistent results |
| Creative writing | 0.7-1.0 | 0.95 | Diverse, interesting |
| Summarization | 0.2-0.4 | 0.9 | Faithful but fluent |
| Brainstorming | 0.8-1.2 | 0.95 | Maximum diversity |
| Data extraction | 0.0 | 0.9 | Precise, reliable |

Rules

  • Temperature 0 for tasks requiring consistency and correctness
  • Higher temperature for creative tasks
  • Top-P rarely needs tuning (keep at 0.9-1.0)
  • Do not use both high temperature AND low top-p (contradictory)

Evaluation Metrics

Automated Metrics

| Metric | Measures | Use For |
| --- | --- | --- |
| Exact Match | Output equals expected | Classification, extraction |
| F1 Score | Precision + recall balance | Multi-label tasks |
| BLEU/ROUGE | N-gram overlap | Summarization, translation |
| JSON validity | Parseable structured output | Structured generation |
| Regex match | Output matches pattern | Format compliance |
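Exact match and binary F1 need no external libraries. A Python sketch over toy predictions, assuming a single positive class:

```python
def exact_match(preds, golds):
    """Fraction of predictions that equal the gold label exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def f1(preds, golds, positive="positive"):
    """Binary F1 for one positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

preds = ["positive", "negative", "positive", "negative"]
golds = ["positive", "positive", "positive", "negative"]
em = exact_match(preds, golds)  # 3/4 = 0.75
score = f1(preds, golds)        # precision 1.0, recall 2/3 -> 0.8
```

For true multi-label tasks, macro- or micro-averaged F1 (e.g. from scikit-learn) is the usual generalization.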

Human Evaluation Dimensions

| Dimension | Scale | Description |
| --- | --- | --- |
| Accuracy | 1-5 | Factual correctness |
| Relevance | 1-5 | Addresses the actual question |
| Coherence | 1-5 | Logical flow and structure |
| Completeness | 1-5 | Covers all required aspects |
| Tone | 1-5 | Matches desired voice |
| Conciseness | 1-5 | No unnecessary content |

Evaluation Dataset Requirements

  • Minimum 50 examples for statistical significance
  • Cover all input categories proportionally
  • Include edge cases (10-20% of dataset)
  • Gold labels reviewed by 2+ evaluators
  • Version-controlled alongside prompts

A/B Testing Process

  1. Define hypothesis: "Prompt B will improve [metric] by [amount]"
  2. Hold all variables constant except the prompt change
  3. Run both variants on the same evaluation set
  4. Calculate metric differences with confidence intervals
  5. Require statistical significance (p < 0.05) before adopting
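Step 5's significance check can be done without external libraries via a paired sign-flip permutation test on per-example score differences. A Python sketch — the 0/1 per-example scores are illustrative:

```python
import random

def permutation_test(scores_a, scores_b, n_iter=10_000, seed=0):
    """Approximate two-sided p-value for a paired comparison.

    Under the null hypothesis the per-example differences are equally
    likely to be positive or negative, so we randomly flip their signs
    and count how often the shuffled effect is at least as large as
    the observed one.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = sum(
        1 for _ in range(n_iter)
        if abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
    )
    return hits / n_iter

# Per-example correctness (0/1) for prompts A and B on the SAME eval set:
a = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
b = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]
p = permutation_test(a, b)  # B wins on 4 examples, never loses
```

With only 4 discordant examples the test cannot reach p < 0.05 (the best achievable is 2/2^4 = 0.125), which is exactly why the dataset requirements above call for 50+ examples.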

What to A/B Test

| Variable | Expected Impact |
| --- | --- |
| Instruction phrasing (imperative vs descriptive) | Moderate |
| Number of few-shot examples | Moderate |
| Example ordering | Low-moderate |
| CoT presence/absence | High for reasoning tasks |
| Output format specification | High for structured output |
| Constraint placement (beginning vs end) | Low |

Prompt Versioning

Version File Format

```yaml
id: classify-sentiment
version: 2.1
model: claude-sonnet-4-20250514
temperature: 0.0
created: 2025-03-01
author: team
changelog: "Added edge case examples for sarcasm detection"
metrics:
  accuracy: 0.94
  f1: 0.92
  eval_dataset: sentiment-eval-v3
system_prompt: |
  You are a sentiment classifier...
examples:
  - input: "..."
    output: "..."
```
Versioning Rules

  • Semantic versioning: major.minor (major = behavior change, minor = refinement)
  • Every version includes evaluation metrics
  • Link to evaluation dataset version
  • Document what changed and why
  • Keep previous versions for rollback

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Is Wrong | Correct Approach |
| --- | --- | --- |
| Vague instructions ("be helpful") | Unreliable, inconsistent output | Specific instructions with examples |
| Contradictory constraints | Model cannot satisfy both | Review for consistency |
| Examples that do not match the task | Confuses the model | Examples must reflect real use |
| Over-engineering simple tasks | Wasted tokens, slower | Match prompt complexity to task complexity |
| No evaluation framework | Guessing at quality | Define metrics before iterating |
| Optimizing for a single example | Overfitting to one case | Optimize for the distribution |
| Assuming cross-model portability | Different models need different prompts | Test on the target model |
| Skipping version control | Cannot roll back or compare | Version every prompt with metrics |

Integration Points

| Skill | Relationship |
| --- | --- |
| llm-as-judge | LLM-as-judge evaluates prompt output quality |
| acceptance-testing | Prompt evaluation datasets serve as acceptance tests |
| testing-strategy | Prompt testing follows the evaluation methodology |
| senior-data-scientist | Statistical testing validates A/B results |
| code-review | Prompt changes reviewed like code changes |
| clean-code | Prompt readability follows clean code naming principles |

Skill Type

FLEXIBLE — Adapt prompting techniques to the specific model, task, and quality requirements. The evaluation and versioning practices are strongly recommended but can be scaled to project size. Always version production prompts.