Prompt Writer

Write and improve production-grade prompts for any LLM, grounded in official documentation, peer-reviewed research, and patterns from production systems like Claude Code, Cursor, Manus, and Devin.

Step 0: Determine Mode

Before starting, identify the task:

User intent	Mode	What to do
"Write a prompt for..."	Write	Follow Steps 1-5 below
"Improve/fix/optimize this prompt"	Improve	See Improve Mode below, then `references/prompt-improvement.md` for full workflow
"Review this prompt"	Review	Run Step 4 (Review and Harden) on their prompt
"Write a tool description"	Tool	Jump to the Tool Descriptions section in Step 3

Current as of March 2026. Model guidance reflects Claude 4.6, GPT-5.4, Gemini 3, Llama 4, DeepSeek R1/V3. Verify against official docs for newer releases.

Improve Mode (Quick Summary)

When the user has an existing prompt that isn't working:

Diagnose — run through the 6-dimension defect checklist (specification, input, structure, context, performance, maintainability)
Quick fixes — apply the principled instructions checklist (remove politeness, add delimiters, specify audience/format, reframe negatives)
Meta-prompt — give an LLM the prompt + 3-5 failure examples and ask it to rewrite
Rewrite — apply patterns: add specificity, add role+audience, add output schema, add contrastive examples
Evaluate — compare before/after on test cases, check for regressions

For the full workflow with detailed patterns and examples, see

references/prompt-improvement.md

Boundaries

This skill covers writing and improving prompts. It does not cover:

Prompt injection red-teaming (use security-specific tools)
Full RAG pipeline architecture (prompt is one piece; retrieval design is separate)
Fine-tuning or training data curation

Step 1: Understand the Context

Before writing, gather:

Target model — Which model family? (See Model Decision Matrix below)
Use case — System prompt, task prompt, agent prompt, tool description?
Tool access — What tools/functions will the agent have?
Audience — Who interacts with this? Technical users, consumers, internal?
Constraints — Length, tone, safety requirements, domain boundaries?
Failure modes — What should the agent refuse or avoid?

If the user hasn't specified these, ask before writing. A prompt without context is a prompt that will fail.

Step 2: Choose the Right Structure

Universal structure adapted from Anthropic, OpenAI, and Google's official guides:

1. Role & Identity        — Who is this agent?
2. Core Instructions      — What are the behavioral rules?
3. Tool/Capability Guide  — How to use available tools
4. Reasoning Guidance     — When and how to think step-by-step
5. Output Format          — What shape should responses take?
6. Examples               — 2-5 demonstrations of desired behavior
7. Context/Reference      — Domain knowledge, loaded last
8. Final Reminders        — Closing behavioral anchors

Not every prompt needs all sections. A simple chatbot needs 1, 2, 5, 6. A complex agent needs all eight. A task prompt may need only 2, 5, 6.

Step 3: Write Each Section

Role & Identity

One to three sentences. Anchors all subsequent behavior.

You are a senior backend engineer specializing in Python and Django.
You help developers debug production issues by analyzing logs,
tracing errors, and suggesting minimal fixes.

Audience framing amplifies this — stating who the output is for improves relevance by up to 100% in benchmarks.

Core Instructions

Structure with markdown headers or XML tags. Key principles:

Explain the why, not just the what. Models generalize better from motivation than bare commands. Instead of "Never use ellipses", write "Avoid ellipses because the TTS engine cannot pronounce them."

Use positive framing. "Write in complete sentences" outperforms "Don't use fragments." Tell the agent what TO do. Research confirms positive instructions actively boost desired token probabilities, while negatives only weakly suppress.

Be specific enough to verify. "Use 2-space indentation" is testable. "Format code properly" is not.

Resolve contradictions. Models waste reasoning tokens reconciling conflicts. Add explicit priority: "If rule A and rule B conflict, rule A takes precedence."

Match language intensity to model — this is critical:

Model Family	Guidance Style
Claude 4.x	Conversational. Explain reasoning. Avoid CAPS/MUST/NEVER — causes overtriggering. Use "prefer", "because".
GPT-5.x	Direct but not aggressive. Naturally thorough — don't over-prompt for thoroughness.
GPT-4.1	More literal than predecessors. A single clarifying sentence corrects behavior.
Gemini 3	Short, direct specs only. Logic over persuasion. No flowery language.
DeepSeek R1	NO system prompt. Put everything in user message. No few-shot. No "think step by step."
Mistral	Hierarchical markdown sections. Explicit step-by-step helps.
Cohere	Minimal preamble. Pass docs via API, not prompt. RAG-first.
Qwen/QwQ	Standard. For QwQ: non-greedy sampling required (temp=0.6, top_p=0.95).
Grok	Verbose, detailed system prompts work best. XML/MD both fine.
Nova	CAPS DO/MUST/DO NOT encouraged (opposite of Claude!). Numbered steps.
Smaller/Mini	Most literal. Critical rules FIRST. Numbered steps. Full execution order.

For full model-specific guidance, see

references/model-specific-tuning.md

The Prompting Inversion (Critical Finding)

Techniques effective for mid-tier models can actively harm frontier models. As capability increases, simpler prompts work better. Constraints that prevent common-sense errors in weaker models induce hyper-literalism in stronger ones.

Rule: always calibrate prompt complexity to model capability. When in doubt, start simple.

Tool Descriptions

Tool descriptions are prompts — 97.1% of tool descriptions in production have quality defects. Each tool needs a 3-4 sentence contract:

What it does and what it returns
When to use it (activation criteria)
When NOT to use it
Key parameter guidance and limitations

search_codebase: Searches the repository for code matching a query.
Returns matching lines with file paths and line numbers. Use when
you need to find function definitions, usage patterns, or specific
snippets. Do not use for reading entire files — use read_file instead.
Query supports regex. Results are limited to 50 matches.

The six essentials (from MCP tool description research, 2026):

Clear purpose with return data specified
Activation criteria (when to use)
Documented limitations and failure modes
Parameter intent, not just data types
3-4+ sentences proportional to complexity
Examples showing both successful and failing cases

Reasoning Guidance

Match reasoning instructions to the model:

Reasoning models (Claude extended thinking, GPT o-series, DeepSeek R1, Gemini with thinking, QwQ): Already reason internally. "Think step by step" provides marginal gains at 35-600% more latency. Use "verify your answer against [criteria]" instead.
Non-reasoning models (GPT-4.1, Claude Haiku, Mistral Small): "Think step by step" provides ~50% improvement on math/symbolic tasks. Worth the cost.
Simple tasks on any model: Skip reasoning entirely. CoT can hurt performance on straightforward tasks.

The Think Tool pattern (+54% on complex policy tasks): Give agents a no-op tool called

think

— a callable scratchpad between tool calls. Different from extended thinking. Include domain-specific examples showing HOW to use it.

Chain-of-Draft (cost-efficient alternative to CoT): When reasoning is needed but cost/latency matters, limit each reasoning step to ~5 words. Matches CoT accuracy at 7.6% of the tokens: "Think step by step, but keep only a minimum draft for each step, 5 words at most."

For agents, three reminders consistently improve performance (+20%):

Persistence: "Continue working until the task is fully resolved."
Tool use: "Use tools to gather information rather than guessing."
Planning: "Plan your approach before taking action, and reflect on results after."

Output Format

Be explicit. Without format guidance, models add prose, markdown fences, or unnecessary structure.

The CTCO pattern (from OpenAI's GPT-5.4 guide — most reliable anti-hallucination structure):

Context — Who is the model? What is the background state?
Task — The single, atomic action required
Constraints — Negative constraints and scope limits
Output — Exact format specification (JSON schema, sections, length)

Respond with a JSON object containing:
- "summary": one-sentence description
- "severity": "low" | "medium" | "high" | "critical"
- "fix": the recommended code change as a unified diff

Examples (Few-Shot)

Examples are the highest-impact technique — up to 90% accuracy improvement.

The 2-5 rule: Major gains after 2; diminishing returns after 5. More can paradoxically degrade performance (the "few-shot dilemma" — Gemma 7B dropped from 77.9% to 39.9%).

Quality over quantity: TF-IDF-based example selection outperforms random sampling by 1%+ while avoiding degradation. Choose examples relevant to the actual use case.

Contrastive examples dramatically improve reasoning: showing both a wrong approach (with why it's wrong) and the correct approach. GSM8K: 35.9% → 88.8% with GPT-4 using this pattern.

Order matters: Alternate positive and negative examples to avoid bias.

Format by model:

Claude:

<examples><example><user>...</user><assistant>...</assistant></example></examples>

GPT/others: Markdown headers or numbered examples
DeepSeek R1: NO examples (degrades reasoning)

Context/Reference

Place long-form reference material AFTER instructions and examples. Queries placed after long context improve quality by up to 30%.

Lost-in-the-middle: Models attend more to the beginning and end of prompts (architectural, not training artifact). Place highest-priority content at the start and end. Never bury critical instructions in the middle.

For Claude, use XML tags:

<context>

<documents>

<reference_material>

Final Reminders

Close with 1-3 critical behavioral anchors. Models weight content at the beginning and end more heavily.

Step 4: Review and Harden

Structural quality:

Sections clearly delineated (XML tags or markdown headers)
No contradictory instructions
Examples match the stated rules
Tool descriptions are complete contracts (what, when, when-not, limitations)

Language quality:

Positive framing (what to do, not what to avoid)
Motivation provided for non-obvious rules (the "why")
Language intensity matches target model
Specificity is testable ("2-space indent" not "format nicely")

Robustness:

Edge cases addressed or delegated ("if unclear, ask the user")
Failure modes handled ("if the tool errors, retry once then report")
No sensitive data in the prompt (credentials, API keys, PII)
Prompt injection considerations (trust boundaries, input validation)

Efficiency:

No redundant sections
Tool descriptions concise (3-4 sentences, not paragraphs)
Examples minimal but sufficient (2-5, not 10+)
Static content first, dynamic last (enables caching — up to 90% cost reduction)

See

references/security-patterns.md

for defense patterns.

Step 5: Test and Iterate

Try adversarial inputs — edge cases, ambiguous requests, rule-breaking attempts
Check failure modes — graceful or catastrophic?
Run 5x on same input — check consistency
Iterate one variable at a time — changing model AND prompt simultaneously makes attribution impossible

For systematic prompt improvement, see

references/prompt-improvement.md

Model Decision Matrix

Quick reference for choosing prompting strategy by model:

Model	System Prompt	Delimiter Style	Thinking	Key Differentiator
Claude 4.x	Separate role	XML tags	Extended thinking	Soft language, no CAPS
GPT-5.x	System + instructions	Markdown	o-series reasoning	CTCO pattern, reasoning_effort
GPT-4.1	System	Markdown	Manual CoT	Ultra-literal, sandwich method
Gemini 3	System instruction	Hierarchy/outline	thinking_level	Temp must stay 1.0, concise
Llama 4	Special tokens	Special tokens	None	Exact template required
DeepSeek R1	NO system prompt	Plain text in user msg	Auto <think>	Zero-shot only, no examples
DeepSeek V3	Yes	Standard	None	Use for speed, R1 for depth
Mistral	Prepended to user	Markdown sections	None	Prefix injection, JSON dual-instruct
Cohere	Preamble (optional)	Documents API	None	RAG-first, auto-citations
Qwen/QwQ	apply_chat_template	ChatML	enable_thinking	Non-greedy required for QwQ
Grok 3	Separate (verbose)	XML or Markdown	reasoning_effort	Real-time X data, cache-stable
Nova	Separate	Markdown	None	CAPS encouraged, temp=0 for tools

For detailed per-model guidance:

references/model-specific-tuning.md

Quick Reference: Prompt Patterns

Agent System Prompt Template

markdown

# Role
[1-3 sentences: who, what domain, what tone]

# Instructions
[Behavioral rules, organized by category]
[Each rule with motivation: "because..."]

# Tools
[For each tool: what it does, when to use, when not to use, limitations]

# Output Format
[Explicit format specification]

# Examples
[2-5 diverse, relevant demonstrations]

# Context
[Reference material, loaded last]

# Reminders
[1-3 critical behavioral anchors]

Common Prompt Recipes

Scenario	Key Sections	Notes
Simple chatbot	Role, Instructions, Format, Examples	Skip tools and reasoning
Coding agent	All 8 sections	Add persistence + planning reminders, think tool
Data analyst	Role, Instructions, Tools, Format	Emphasize structured output
Content writer	Role, Instructions, Examples, Format	Heavy on examples and tone
Customer support	Role, Instructions, Examples, Context	Load FAQ as context
Multi-agent orchestrator	Role, Instructions, Tools, Reasoning	Define delegation rules, effort budgets
Task prompt (one-shot)	Instructions, Format, Examples	No role needed, be specific
Prompt improvement	See `references/prompt-improvement.md`	Diagnose → fix → evaluate

Multimodal Prompting

For models that support vision (Claude, GPT-4o+, Gemini, Llama 4, Grok):

Image placement matters:

Gemini: image BEFORE text (official recommendation)
Claude/GPT: image position is flexible, but place relevant images near the text that references them
Multiple images: reference each by name or position ("In the first image...")

Key patterns:

Be explicit about what to do with the image: "transcribe", "describe", "extract table data", "compare these two screenshots"
For OCR/extraction: specify exact output format (JSON, table, list)
For analysis: provide evaluation criteria, not open-ended "what do you see?"
For diagrams/charts: ask for specific data points, not general description

Token costs vary significantly:

Gemini: 258 tokens per image tile (≤384px), video at 263 tokens/sec
Claude: ~1,600 tokens for a typical image
Budget accordingly — 10 images can consume 16K+ tokens

Structured Output (JSON Mode)

Each model handles structured output differently:

Model	How to enable	Key requirement
Claude	Request JSON in instructions + use XML output tags	Specify exact schema in prompt
GPT	`response_format: {type: "json_object"}` or Structured Outputs with JSON Schema	Must mention "JSON" in prompt
Gemini	`responseMimeType: "application/json"` + `responseSchema`	Define `required` arrays explicitly (avoids nulls)
DeepSeek	`response_format: {type: "json_object"}`	Include "json" in prompt + set adequate `max_tokens`
Mistral	`response_format: {type: "json_object"}`	Must ALSO instruct JSON in prompt text (dual requirement)
Llama	Instruct in prompt; validate output	No native JSON mode — always validate server-side
Cohere	Instruct in prompt	Use document grounding for factual JSON
Qwen	Instruct in prompt	Use `apply_chat_template`
Grok	Pydantic/JSON Schema via Structured Outputs	Use schema definitions
Nova	Instruct in prompt	Set `max_tokens` high enough to avoid truncation

Universal best practice: Always provide the exact JSON schema with field types, not just "return JSON". Show one complete example of the expected output.

Before/After Examples

Vague → Specific (any model)

Before: "Summarize this article" After: "Summarize this article in 3 bullet points for a product manager. Each bullet: one sentence, starts with an action verb. Focus on decisions needed, not background."

Generic → Model-Tuned (Claude 4.x)

Before: "You MUST ALWAYS follow these CRITICAL rules: NEVER use markdown. ALWAYS respond in JSON." After: "Respond in JSON format because the output feeds directly into our parser. Avoid markdown formatting since the parser doesn't handle it. Here's the expected schema: {example}"

Overloaded → Decomposed (any model)

Before: "Read this codebase, find all security vulnerabilities, fix them, write tests, update the docs, and create a PR." After (prompt 1 of 3): "Scan this codebase for security vulnerabilities. For each finding, output: file path, line number, vulnerability type (OWASP category), severity (high/medium/low), and a one-line fix description. Output as JSON array."

Missing Context → Complete (DeepSeek R1)

Before (system prompt): "You are a code reviewer." After (user message — no system prompt): "Task: Review this Python function for bugs and performance issues. Requirements: Focus on correctness first, then performance. Flag any edge cases that would cause exceptions. Output format: For each issue: {line, issue_type, severity, fix}"

No Examples → With Contrastive Example (GPT-4.1)

Before: "Classify customer feedback as positive, negative, or neutral." After: "Classify customer feedback as positive, negative, or neutral.

INCORRECT approach: 'The product works but the delivery was late' → positive (wrong: mixed sentiment defaults to negative when a complaint is present) CORRECT approach: 'The product works but the delivery was late' → negative (delivery complaint outweighs neutral product statement)

Classify: {input}"

Context Engineering (The Current Paradigm)

Prompt engineering is now context engineering — curating the minimal high-signal token set the model needs. Every instruction should earn its place in the context window.

Four strategies (Anthropic):

Compaction — Summarize and reinitiate near context limits
Structured note-taking — Agents write persistent notes outside the context window
Sub-agent architectures — Specialized agents return 1,000-2,000 token summaries
Hybrid retrieval — Combine upfront loading with just-in-time exploration

Context rot degrades quality in long conversations:

Poisoning: errors/hallucinations repeatedly referenced
Distraction: irrelevant content drowning signal
Confusion: superfluous information degrading quality
Clash: conflicting instructions from different sources

Prompt caching changes how prompts should be structured:

Static content first (role, instructions, tool definitions) — cacheable
Dynamic content last (user query, conversation) — not cached
Up to 90% cost reduction, 85% latency reduction
Break-even at 1.4+ cache reads per prefix

Anti-Patterns

Anti-Pattern	Why It Fails	Fix
ALL-CAPS MUST/NEVER	Claude 4.x overtriggers; GPT-5 wastes tokens (except Nova)	Explain the why; use CAPS only for Nova
"Be thorough"	Modern models are already thorough; causes overengineering	Specify exact scope and boundaries
Contradictory rules	Models attempt both, degrading quality	Add explicit priority ordering
20+ examples	Burns context; paradoxically degrades performance	Use 2-5 diverse, TF-IDF-selected examples
Vague instructions	Not testable, not actionable	Make each rule verifiable
Negative-only framing	"Don't X" is weaker than "Do Y"	Reframe as positive instructions
Same prompt across models	Each family responds differently (Inversion effect)	Tune per model; start simple for frontier
No examples at all	Removes highest-impact technique	Add 2+ demonstrations (except DeepSeek R1)
Monolithic wall of text	Hard to parse, sections blur	Use headers, XML tags, or delimiters
CoT on reasoning models	35-600% more latency, marginal gains	Use self-verification instead
One-sentence tool descriptions	97% of tool descriptions have defects	Use 3-4 sentence contracts with limitations
Critical info in the middle	Lost-in-the-middle: models attend to start/end	Put critical content at beginning and end

Security Considerations

For agents with tool access or external data:

Separate trusted/untrusted content — XML tags or delimiters to mark system vs user input
Least privilege — only grant tools the agent actually needs
Input validation guidance — instruct the agent to validate before acting
Spotlighting — mark untrusted data with delimiters/encoding (reduces attacks from >50% to <2%)
Graceful refusal — define what the agent should decline and how

See

references/security-patterns.md

for the full defense hierarchy and architectural patterns.

Additional Resources

references/templates.md
— Ready-to-use prompt templates: coding agent, chatbot, data extraction, content writer, orchestrator, per-model skeletons (Claude/GPT/Gemini/DeepSeek/Llama/Cohere/Nova), meta-prompts for rewriting/diagnosing/optimizing, tool description template
references/model-specific-tuning.md
— Per-model guidance for all 10 families: language patterns, structural preferences, tool calling, anti-patterns, migration tips
references/prompt-improvement.md
— Systematic workflow for improving existing prompts: 6-dimension defect taxonomy, diagnostic checklist, rewriting patterns, meta-prompting, evaluation
references/security-patterns.md
— Defense hierarchy, spotlighting, OWASP Top 10 for LLMs, architectural patterns, trust boundaries
references/research-evidence.md
— Key papers with measured improvements: technique rankings, prompting inversion, CoT findings, few-shot dilemma, context engineering, automated optimization
references/production-prompt-anatomy.md
— Structural analysis of Claude Code, Cursor, Manus, Devin, v0: universal patterns, conditional assembly, three eras, caching architecture

Sources

Anthropic: Prompt Engineering Docs, Context Engineering Blog, Claude 4 Best Practices, Think Tool
OpenAI: Platform Docs, GPT-4.1/5.x Prompting Guides, Harness Engineering
Google: Gemini API Prompting Strategies, Gemini 3 Developer Guide
Meta: Llama 4 Model Cards, Prompting Guide
DeepSeek: API Docs, Reasoning Model Guide
Mistral: Prompting Capabilities, Function Calling
Cohere: Crafting Effective Prompts, RAG Guide
Qwen: Official Docs, QwQ Guide
xAI: Grok Docs, Function Calling
Amazon: Nova Prompting Best Practices
Research: "Principled Instructions" (2024), "The Prompt Report" (2024), "Chain of Draft" (2025), "Prompting Inversion" (2025), "Few-Shot Dilemma" (2025), "CoT Faithfulness" (Anthropic 2025), "MCP Tool Description Smells" (2026), MASS (ICLR 2026)
Production: Claude Code, Cursor, Manus AI, Devin, v0, GitHub Copilot system prompt analysis

prompt-writer

NPX Install

Tags

SKILL.md Content