Prompt Engineering
When to Use
- Crafting or refining prompts for LLM-based features
- Improving output quality, consistency, or reliability
- Designing system prompts for AI agents or chatbots
- Implementing structured output (JSON, specific formats)
- Defending against prompt injection attacks
- Building prompt templates for reusable workflows
- Evaluating and iterating on prompt performance
Instructions
1. Prompt Structure Fundamentals
A well-structured prompt has these components (in order of importance):
- Role/Context — Who is the model? What domain expertise applies?
- Task — What exactly should it do? Be specific and unambiguous.
- Constraints — Format, length, tone, what to avoid.
- Examples — Input/output pairs demonstrating desired behavior.
- Input — The actual data to process.
Principles:
- Be explicit. LLMs do not read minds — state what you want and what you do not want.
- Put the most important instructions first and last (primacy and recency effects).
- Use delimiters to separate sections: , , XML tags, triple backticks.
- Shorter is not always better — a well-structured 500-word prompt beats an ambiguous 50-word one.
2. Core Techniques
See
references/techniques-catalog.md
for detailed templates and examples.
Zero-shot: Direct instruction with no examples. Works for simple, well-defined tasks.
Few-shot: Provide 2-5 input/output examples before the actual input. The model learns the pattern from examples. Choose diverse, representative examples. Order matters — put the most similar example last.
Chain-of-Thought (CoT): Add "Let's think step by step" or provide reasoning examples. Dramatically improves math, logic, and multi-step tasks. Can be combined with few-shot (show reasoning in examples).
Self-consistency: Generate multiple responses with temperature > 0, then take the majority answer. Best for factual or reasoning tasks where there is one correct answer.
Structured output: Request JSON, XML, or specific formats. Use JSON mode when available. Provide the exact schema in the prompt. Validate output programmatically.
3. System Prompt Design
System prompts set persistent behavior for the entire conversation:
- Define the persona, expertise, and communication style
- Set hard constraints (what the model must never do)
- Establish output format expectations
- Include domain-specific knowledge or rules
Best practices:
- Keep system prompts focused — one clear role, not five
- Use positive instructions ("always do X") over negative ("never do Y") where possible
- Test with adversarial inputs to ensure constraints hold
- Version your system prompts and track changes like code
4. Temperature and Sampling Parameters
- Temperature (0.0 - 2.0): Controls randomness. 0.0 = deterministic, 1.0 = default creative, >1.0 = very random.
- Use 0.0-0.3 for factual tasks, code generation, structured output
- Use 0.5-0.8 for creative writing, brainstorming
- Use 0.0 for reproducible evaluations
- Top-p (0.0 - 1.0): Nucleus sampling. 0.9 means consider tokens comprising top 90% probability. Alternative to temperature — do not adjust both simultaneously.
- Max tokens: Set to expected output length + buffer. Too low truncates output; too high wastes quota.
- Stop sequences: Define strings that halt generation. Useful for structured extraction.
5. Prompt Injection Defense
Prompt injection is when user input manipulates the model's behavior by overriding instructions.
Defense layers:
- Input sanitization: Strip or escape known injection patterns. Detect
ignore previous instructions
type phrases.
- Delimited input: Wrap user input in clear delimiters and instruct the model to treat the delimited content as data only, never as instructions.
- Output validation: Verify output conforms to expected format. Reject unexpected formats.
- Privilege separation: Use separate LLM calls for different trust levels. Do not mix system logic and user input in one prompt.
- Canary tokens: Include a secret token in the system prompt. If it appears in output, injection may have occurred.
No defense is perfect. Layer multiple approaches and assume breach.
6. Prompt Templates and Iteration
Build reusable templates with variable slots:
You are a {role} specializing in {domain}.
Analyze the following {input_type}:
---
{input}
---
Provide your analysis in the following format:
- Summary: (1-2 sentences)
- Key findings: (bullet points)
- Recommendations: (numbered list)
Iteration process:
- Start with a simple prompt that captures the core task
- Test on 10-20 diverse inputs
- Identify failure modes (wrong format, missing info, hallucination)
- Add constraints or examples to address each failure mode
- Retest — ensure fixes do not break previously working cases
- Document the prompt version and test results
7. Evaluation
Measure prompt quality systematically:
- Build an eval set of 20-50 input/expected-output pairs
- Score each output (binary pass/fail, or rubric-based 1-5)
- Track metrics across prompt versions: accuracy, format compliance, latency
- Use LLM-as-judge for subjective quality (see llm-evaluation skill)
- Automate eval runs in CI when prompt changes are deployed
Examples
Designing a Classification Prompt
User needs to classify support tickets into categories. Design a few-shot prompt with 3-5 example tickets per category. Include edge cases. Use temperature 0.0 for consistency. Request JSON output:
{"category": "...", "confidence": "high|medium|low"}
. Validate output schema programmatically. Measure accuracy against labeled test set.
Building a Code Review Agent
User wants an LLM-powered code review assistant. Design a system prompt defining the reviewer persona (senior engineer, specific language expertise). Include review criteria: correctness, performance, readability, security. Use structured output for findings. Add injection defense for code that might contain adversarial comments. Test with intentionally bad code to verify the agent catches issues.
Optimizing an Underperforming Prompt
User reports their summarization prompt produces inconsistent output. Diagnose: test on 20 inputs, categorize failures (too long, misses key points, wrong tone). Add length constraints, provide few-shot examples of ideal summaries, add chain-of-thought for complex documents. A/B test the old vs. new prompt on the eval set. Track improvement in format compliance and content accuracy.