Prompt Writer
Write and improve production-grade prompts for any LLM, grounded in official documentation, peer-reviewed research, and patterns from production systems like Claude Code, Cursor, Manus, and Devin.
Step 0: Determine Mode
Before starting, identify the task:
| User intent | Mode | What to do |
|---|
| "Write a prompt for..." | Write | Follow Steps 1-5 below |
| "Improve/fix/optimize this prompt" | Improve | See Improve Mode below, then references/prompt-improvement.md
for full workflow |
| "Review this prompt" | Review | Run Step 4 (Review and Harden) on their prompt |
| "Write a tool description" | Tool | Jump to the Tool Descriptions section in Step 3 |
Current as of March 2026. Model guidance reflects Claude 4.6, GPT-5.4, Gemini 3, Llama 4, DeepSeek R1/V3. Verify against official docs for newer releases.
Improve Mode (Quick Summary)
When the user has an existing prompt that isn't working:
- Diagnose — run through the 6-dimension defect checklist (specification, input, structure, context, performance, maintainability)
- Quick fixes — apply the principled instructions checklist (remove politeness, add delimiters, specify audience/format, reframe negatives)
- Meta-prompt — give an LLM the prompt + 3-5 failure examples and ask it to rewrite
- Rewrite — apply patterns: add specificity, add role+audience, add output schema, add contrastive examples
- Evaluate — compare before/after on test cases, check for regressions
For the full workflow with detailed patterns and examples, see
references/prompt-improvement.md
.
Boundaries
This skill covers writing and improving prompts. It does not cover:
- Prompt injection red-teaming (use security-specific tools)
- Full RAG pipeline architecture (prompt is one piece; retrieval design is separate)
- Fine-tuning or training data curation
Step 1: Understand the Context
Before writing, gather:
- Target model — Which model family? (See Model Decision Matrix below)
- Use case — System prompt, task prompt, agent prompt, tool description?
- Tool access — What tools/functions will the agent have?
- Audience — Who interacts with this? Technical users, consumers, internal?
- Constraints — Length, tone, safety requirements, domain boundaries?
- Failure modes — What should the agent refuse or avoid?
If the user hasn't specified these, ask before writing. A prompt without context is a prompt that will fail.
Step 2: Choose the Right Structure
Universal structure adapted from Anthropic, OpenAI, and Google's official guides:
1. Role & Identity — Who is this agent?
2. Core Instructions — What are the behavioral rules?
3. Tool/Capability Guide — How to use available tools
4. Reasoning Guidance — When and how to think step-by-step
5. Output Format — What shape should responses take?
6. Examples — 2-5 demonstrations of desired behavior
7. Context/Reference — Domain knowledge, loaded last
8. Final Reminders — Closing behavioral anchors
Not every prompt needs all sections. A simple chatbot needs 1, 2, 5, 6. A complex agent needs all eight. A task prompt may need only 2, 5, 6.
Step 3: Write Each Section
Role & Identity
One to three sentences. Anchors all subsequent behavior.
You are a senior backend engineer specializing in Python and Django.
You help developers debug production issues by analyzing logs,
tracing errors, and suggesting minimal fixes.
Audience framing amplifies this — stating who the output is for improves relevance by up to 100% in benchmarks.
Core Instructions
Structure with markdown headers or XML tags. Key principles:
Explain the why, not just the what. Models generalize better from motivation than bare commands. Instead of "Never use ellipses", write "Avoid ellipses because the TTS engine cannot pronounce them."
Use positive framing. "Write in complete sentences" outperforms "Don't use fragments." Tell the agent what TO do. Research confirms positive instructions actively boost desired token probabilities, while negatives only weakly suppress.
Be specific enough to verify. "Use 2-space indentation" is testable. "Format code properly" is not.
Resolve contradictions. Models waste reasoning tokens reconciling conflicts. Add explicit priority: "If rule A and rule B conflict, rule A takes precedence."
Match language intensity to model — this is critical:
| Model Family | Guidance Style |
|---|
| Claude 4.x | Conversational. Explain reasoning. Avoid CAPS/MUST/NEVER — causes overtriggering. Use "prefer", "because". |
| GPT-5.x | Direct but not aggressive. Naturally thorough — don't over-prompt for thoroughness. |
| GPT-4.1 | More literal than predecessors. A single clarifying sentence corrects behavior. |
| Gemini 3 | Short, direct specs only. Logic over persuasion. No flowery language. |
| DeepSeek R1 | NO system prompt. Put everything in user message. No few-shot. No "think step by step." |
| Mistral | Hierarchical markdown sections. Explicit step-by-step helps. |
| Cohere | Minimal preamble. Pass docs via API, not prompt. RAG-first. |
| Qwen/QwQ | Standard. For QwQ: non-greedy sampling required (temp=0.6, top_p=0.95). |
| Grok | Verbose, detailed system prompts work best. XML/MD both fine. |
| Nova | CAPS DO/MUST/DO NOT encouraged (opposite of Claude!). Numbered steps. |
| Smaller/Mini | Most literal. Critical rules FIRST. Numbered steps. Full execution order. |
For full model-specific guidance, see
references/model-specific-tuning.md
.
The Prompting Inversion (Critical Finding)
Techniques effective for mid-tier models can actively harm frontier models. As capability increases, simpler prompts work better. Constraints that prevent common-sense errors in weaker models induce hyper-literalism in stronger ones.
Rule: always calibrate prompt complexity to model capability. When in doubt, start simple.
Tool Descriptions
Tool descriptions are prompts — 97.1% of tool descriptions in production have quality defects. Each tool needs a 3-4 sentence contract:
- What it does and what it returns
- When to use it (activation criteria)
- When NOT to use it
- Key parameter guidance and limitations
search_codebase: Searches the repository for code matching a query.
Returns matching lines with file paths and line numbers. Use when
you need to find function definitions, usage patterns, or specific
snippets. Do not use for reading entire files — use read_file instead.
Query supports regex. Results are limited to 50 matches.
The six essentials (from MCP tool description research, 2026):
- Clear purpose with return data specified
- Activation criteria (when to use)
- Documented limitations and failure modes
- Parameter intent, not just data types
- 3-4+ sentences proportional to complexity
- Examples showing both successful and failing cases
Reasoning Guidance
Match reasoning instructions to the model:
- Reasoning models (Claude extended thinking, GPT o-series, DeepSeek R1, Gemini with thinking, QwQ): Already reason internally. "Think step by step" provides marginal gains at 35-600% more latency. Use "verify your answer against [criteria]" instead.
- Non-reasoning models (GPT-4.1, Claude Haiku, Mistral Small): "Think step by step" provides ~50% improvement on math/symbolic tasks. Worth the cost.
- Simple tasks on any model: Skip reasoning entirely. CoT can hurt performance on straightforward tasks.
The Think Tool pattern (+54% on complex policy tasks): Give agents a no-op tool called
— a callable scratchpad between tool calls. Different from extended thinking. Include domain-specific examples showing HOW to use it.
Chain-of-Draft (cost-efficient alternative to CoT): When reasoning is needed but cost/latency matters, limit each reasoning step to ~5 words. Matches CoT accuracy at 7.6% of the tokens: "Think step by step, but keep only a minimum draft for each step, 5 words at most."
For agents, three reminders consistently improve performance (+20%):
- Persistence: "Continue working until the task is fully resolved."
- Tool use: "Use tools to gather information rather than guessing."
- Planning: "Plan your approach before taking action, and reflect on results after."
Output Format
Be explicit. Without format guidance, models add prose, markdown fences, or unnecessary structure.
The CTCO pattern (from OpenAI's GPT-5.4 guide — most reliable anti-hallucination structure):
- Context — Who is the model? What is the background state?
- Task — The single, atomic action required
- Constraints — Negative constraints and scope limits
- Output — Exact format specification (JSON schema, sections, length)
Respond with a JSON object containing:
- "summary": one-sentence description
- "severity": "low" | "medium" | "high" | "critical"
- "fix": the recommended code change as a unified diff
Examples (Few-Shot)
Examples are the highest-impact technique — up to 90% accuracy improvement.
The 2-5 rule: Major gains after 2; diminishing returns after 5. More can paradoxically degrade performance (the "few-shot dilemma" — Gemma 7B dropped from 77.9% to 39.9%).
Quality over quantity: TF-IDF-based example selection outperforms random sampling by 1%+ while avoiding degradation. Choose examples relevant to the actual use case.
Contrastive examples dramatically improve reasoning: showing both a wrong approach (with why it's wrong) and the correct approach. GSM8K: 35.9% → 88.8% with GPT-4 using this pattern.
Order matters: Alternate positive and negative examples to avoid bias.
Format by model:
- Claude:
<examples><example><user>...</user><assistant>...</assistant></example></examples>
- GPT/others: Markdown headers or numbered examples
- DeepSeek R1: NO examples (degrades reasoning)
Context/Reference
Place long-form reference material AFTER instructions and examples. Queries placed after long context improve quality by up to 30%.
Lost-in-the-middle: Models attend more to the beginning and end of prompts (architectural, not training artifact). Place highest-priority content at the start and end. Never bury critical instructions in the middle.
For Claude, use XML tags:
,
,
.
Final Reminders
Close with 1-3 critical behavioral anchors. Models weight content at the beginning and end more heavily.
Step 4: Review and Harden
Structural quality:
Language quality:
Robustness:
Efficiency:
See
references/security-patterns.md
for defense patterns.
Step 5: Test and Iterate
- Try adversarial inputs — edge cases, ambiguous requests, rule-breaking attempts
- Check failure modes — graceful or catastrophic?
- Run 5x on same input — check consistency
- Iterate one variable at a time — changing model AND prompt simultaneously makes attribution impossible
For systematic prompt improvement, see
references/prompt-improvement.md
.
Model Decision Matrix
Quick reference for choosing prompting strategy by model:
| Model | System Prompt | Delimiter Style | Thinking | Key Differentiator |
|---|
| Claude 4.x | Separate role | XML tags | Extended thinking | Soft language, no CAPS |
| GPT-5.x | System + instructions | Markdown | o-series reasoning | CTCO pattern, reasoning_effort |
| GPT-4.1 | System | Markdown | Manual CoT | Ultra-literal, sandwich method |
| Gemini 3 | System instruction | Hierarchy/outline | thinking_level | Temp must stay 1.0, concise |
| Llama 4 | Special tokens | Special tokens | None | Exact template required |
| DeepSeek R1 | NO system prompt | Plain text in user msg | Auto <think> | Zero-shot only, no examples |
| DeepSeek V3 | Yes | Standard | None | Use for speed, R1 for depth |
| Mistral | Prepended to user | Markdown sections | None | Prefix injection, JSON dual-instruct |
| Cohere | Preamble (optional) | Documents API | None | RAG-first, auto-citations |
| Qwen/QwQ | apply_chat_template | ChatML | enable_thinking | Non-greedy required for QwQ |
| Grok 3 | Separate (verbose) | XML or Markdown | reasoning_effort | Real-time X data, cache-stable |
| Nova | Separate | Markdown | None | CAPS encouraged, temp=0 for tools |
For detailed per-model guidance:
references/model-specific-tuning.md
Quick Reference: Prompt Patterns
Agent System Prompt Template
markdown
# Role
[1-3 sentences: who, what domain, what tone]
# Instructions
[Behavioral rules, organized by category]
[Each rule with motivation: "because..."]
# Tools
[For each tool: what it does, when to use, when not to use, limitations]
# Output Format
[Explicit format specification]
# Examples
[2-5 diverse, relevant demonstrations]
# Context
[Reference material, loaded last]
# Reminders
[1-3 critical behavioral anchors]
Common Prompt Recipes
| Scenario | Key Sections | Notes |
|---|
| Simple chatbot | Role, Instructions, Format, Examples | Skip tools and reasoning |
| Coding agent | All 8 sections | Add persistence + planning reminders, think tool |
| Data analyst | Role, Instructions, Tools, Format | Emphasize structured output |
| Content writer | Role, Instructions, Examples, Format | Heavy on examples and tone |
| Customer support | Role, Instructions, Examples, Context | Load FAQ as context |
| Multi-agent orchestrator | Role, Instructions, Tools, Reasoning | Define delegation rules, effort budgets |
| Task prompt (one-shot) | Instructions, Format, Examples | No role needed, be specific |
| Prompt improvement | See references/prompt-improvement.md
| Diagnose → fix → evaluate |
Multimodal Prompting
For models that support vision (Claude, GPT-4o+, Gemini, Llama 4, Grok):
Image placement matters:
- Gemini: image BEFORE text (official recommendation)
- Claude/GPT: image position is flexible, but place relevant images near the text that references them
- Multiple images: reference each by name or position ("In the first image...")
Key patterns:
- Be explicit about what to do with the image: "transcribe", "describe", "extract table data", "compare these two screenshots"
- For OCR/extraction: specify exact output format (JSON, table, list)
- For analysis: provide evaluation criteria, not open-ended "what do you see?"
- For diagrams/charts: ask for specific data points, not general description
Token costs vary significantly:
- Gemini: 258 tokens per image tile (≤384px), video at 263 tokens/sec
- Claude: ~1,600 tokens for a typical image
- Budget accordingly — 10 images can consume 16K+ tokens
Structured Output (JSON Mode)
Each model handles structured output differently:
| Model | How to enable | Key requirement |
|---|
| Claude | Request JSON in instructions + use XML output tags | Specify exact schema in prompt |
| GPT | response_format: {type: "json_object"}
or Structured Outputs with JSON Schema | Must mention "JSON" in prompt |
| Gemini | responseMimeType: "application/json"
+ | Define arrays explicitly (avoids nulls) |
| DeepSeek | response_format: {type: "json_object"}
| Include "json" in prompt + set adequate |
| Mistral | response_format: {type: "json_object"}
| Must ALSO instruct JSON in prompt text (dual requirement) |
| Llama | Instruct in prompt; validate output | No native JSON mode — always validate server-side |
| Cohere | Instruct in prompt | Use document grounding for factual JSON |
| Qwen | Instruct in prompt | Use |
| Grok | Pydantic/JSON Schema via Structured Outputs | Use schema definitions |
| Nova | Instruct in prompt | Set high enough to avoid truncation |
Universal best practice: Always provide the exact JSON schema with field types, not just "return JSON". Show one complete example of the expected output.
Before/After Examples
Vague → Specific (any model)
Before: "Summarize this article"
After: "Summarize this article in 3 bullet points for a product manager. Each bullet: one sentence, starts with an action verb. Focus on decisions needed, not background."
Generic → Model-Tuned (Claude 4.x)
Before: "You MUST ALWAYS follow these CRITICAL rules: NEVER use markdown. ALWAYS respond in JSON."
After: "Respond in JSON format because the output feeds directly into our parser. Avoid markdown formatting since the parser doesn't handle it. Here's the expected schema: {example}"
Overloaded → Decomposed (any model)
Before: "Read this codebase, find all security vulnerabilities, fix them, write tests, update the docs, and create a PR."
After (prompt 1 of 3): "Scan this codebase for security vulnerabilities. For each finding, output: file path, line number, vulnerability type (OWASP category), severity (high/medium/low), and a one-line fix description. Output as JSON array."
Missing Context → Complete (DeepSeek R1)
Before (system prompt): "You are a code reviewer."
After (user message — no system prompt):
"Task: Review this Python function for bugs and performance issues.
Requirements: Focus on correctness first, then performance. Flag any edge cases that would cause exceptions.
Output format: For each issue: {line, issue_type, severity, fix}"
No Examples → With Contrastive Example (GPT-4.1)
Before: "Classify customer feedback as positive, negative, or neutral."
After: "Classify customer feedback as positive, negative, or neutral.
INCORRECT approach: 'The product works but the delivery was late' → positive (wrong: mixed sentiment defaults to negative when a complaint is present)
CORRECT approach: 'The product works but the delivery was late' → negative (delivery complaint outweighs neutral product statement)
Classify: {input}"
Context Engineering (The Current Paradigm)
Prompt engineering is now context engineering — curating the minimal high-signal token set the model needs. Every instruction should earn its place in the context window.
Four strategies (Anthropic):
- Compaction — Summarize and reinitiate near context limits
- Structured note-taking — Agents write persistent notes outside the context window
- Sub-agent architectures — Specialized agents return 1,000-2,000 token summaries
- Hybrid retrieval — Combine upfront loading with just-in-time exploration
Context rot degrades quality in long conversations:
- Poisoning: errors/hallucinations repeatedly referenced
- Distraction: irrelevant content drowning signal
- Confusion: superfluous information degrading quality
- Clash: conflicting instructions from different sources
Prompt caching changes how prompts should be structured:
- Static content first (role, instructions, tool definitions) — cacheable
- Dynamic content last (user query, conversation) — not cached
- Up to 90% cost reduction, 85% latency reduction
- Break-even at 1.4+ cache reads per prefix
Anti-Patterns
| Anti-Pattern | Why It Fails | Fix |
|---|
| ALL-CAPS MUST/NEVER | Claude 4.x overtriggers; GPT-5 wastes tokens (except Nova) | Explain the why; use CAPS only for Nova |
| "Be thorough" | Modern models are already thorough; causes overengineering | Specify exact scope and boundaries |
| Contradictory rules | Models attempt both, degrading quality | Add explicit priority ordering |
| 20+ examples | Burns context; paradoxically degrades performance | Use 2-5 diverse, TF-IDF-selected examples |
| Vague instructions | Not testable, not actionable | Make each rule verifiable |
| Negative-only framing | "Don't X" is weaker than "Do Y" | Reframe as positive instructions |
| Same prompt across models | Each family responds differently (Inversion effect) | Tune per model; start simple for frontier |
| No examples at all | Removes highest-impact technique | Add 2+ demonstrations (except DeepSeek R1) |
| Monolithic wall of text | Hard to parse, sections blur | Use headers, XML tags, or delimiters |
| CoT on reasoning models | 35-600% more latency, marginal gains | Use self-verification instead |
| One-sentence tool descriptions | 97% of tool descriptions have defects | Use 3-4 sentence contracts with limitations |
| Critical info in the middle | Lost-in-the-middle: models attend to start/end | Put critical content at beginning and end |
Security Considerations
For agents with tool access or external data:
- Separate trusted/untrusted content — XML tags or delimiters to mark system vs user input
- Least privilege — only grant tools the agent actually needs
- Input validation guidance — instruct the agent to validate before acting
- Spotlighting — mark untrusted data with delimiters/encoding (reduces attacks from >50% to <2%)
- Graceful refusal — define what the agent should decline and how
See
references/security-patterns.md
for the full defense hierarchy and architectural patterns.
Additional Resources
- — Ready-to-use prompt templates: coding agent, chatbot, data extraction, content writer, orchestrator, per-model skeletons (Claude/GPT/Gemini/DeepSeek/Llama/Cohere/Nova), meta-prompts for rewriting/diagnosing/optimizing, tool description template
references/model-specific-tuning.md
— Per-model guidance for all 10 families: language patterns, structural preferences, tool calling, anti-patterns, migration tips
references/prompt-improvement.md
— Systematic workflow for improving existing prompts: 6-dimension defect taxonomy, diagnostic checklist, rewriting patterns, meta-prompting, evaluation
references/security-patterns.md
— Defense hierarchy, spotlighting, OWASP Top 10 for LLMs, architectural patterns, trust boundaries
references/research-evidence.md
— Key papers with measured improvements: technique rankings, prompting inversion, CoT findings, few-shot dilemma, context engineering, automated optimization
references/production-prompt-anatomy.md
— Structural analysis of Claude Code, Cursor, Manus, Devin, v0: universal patterns, conditional assembly, three eras, caching architecture
Sources
- Anthropic: Prompt Engineering Docs, Context Engineering Blog, Claude 4 Best Practices, Think Tool
- OpenAI: Platform Docs, GPT-4.1/5.x Prompting Guides, Harness Engineering
- Google: Gemini API Prompting Strategies, Gemini 3 Developer Guide
- Meta: Llama 4 Model Cards, Prompting Guide
- DeepSeek: API Docs, Reasoning Model Guide
- Mistral: Prompting Capabilities, Function Calling
- Cohere: Crafting Effective Prompts, RAG Guide
- Qwen: Official Docs, QwQ Guide
- xAI: Grok Docs, Function Calling
- Amazon: Nova Prompting Best Practices
- Research: "Principled Instructions" (2024), "The Prompt Report" (2024), "Chain of Draft" (2025), "Prompting Inversion" (2025), "Few-Shot Dilemma" (2025), "CoT Faithfulness" (Anthropic 2025), "MCP Tool Description Smells" (2026), MASS (ICLR 2026)
- Production: Claude Code, Cursor, Manus AI, Devin, v0, GitHub Copilot system prompt analysis