Skill Sharpen
Born from real-world production usage across multiple projects. Every diagnostic category, every proposal flow, and every guardrail exists because it solved a real problem in a real skill.
Kaizen (改善) for AI agent skills. Observe how a skill performed, find what went wrong or could be better, and propose concrete changes to its SKILL.md.
- Gathers evidence from three sources: conversation friction, file diffs, and your feedback
- Diagnoses root causes and proposes improvements — you decide each one
- Tracks recurrence in LESSONS.md with automatic importance escalation
- Works with Claude Code and any SKILL.md-based agent framework
Process
1. Resolve Target Skill
Determine which skill to sharpen:
- Explicit (skill name passed as an argument): Search for that name across skill directories:
local project skills, installed skills, and plugin skills. Match by directory name.
- Auto-detect (`/skill-sharpen` with no args): Scan conversation history for the most
recently loaded skill (look for SKILL.md content or invocations). Ask the
user to confirm: "Detected {skill-name}. Is that the one?"
If the skill is not found, list the paths searched and ask the user for a correction or
an explicit path.
Once resolved, read the target skill's SKILL.md and LESSONS.md (if it exists). Keep
both in context; they inform what to propose and what to skip.
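The lookup described above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names and the idea of passing search roots in as a list are assumptions, and the real directory locations depend on the agent framework in use.

```python
from __future__ import annotations

from pathlib import Path


def resolve_skill(name: str, search_roots: list[Path]) -> tuple[Path | None, list[Path]]:
    """Return (skill_dir, paths_searched); skill_dir is None when nothing matches."""
    searched: list[Path] = []
    for root in search_roots:
        candidate = root / name  # match by directory name
        searched.append(candidate)
        # Require a SKILL.md so unrelated folders with the same name are ignored.
        if (candidate / "SKILL.md").is_file():
            return candidate, searched
    return None, searched


def load_context(skill_dir: Path) -> dict[str, str | None]:
    """Read SKILL.md (required) and LESSONS.md (optional) so both stay in context."""
    lessons = skill_dir / "LESSONS.md"
    return {
        "skill": (skill_dir / "SKILL.md").read_text(encoding="utf-8"),
        "lessons": lessons.read_text(encoding="utf-8") if lessons.is_file() else None,
    }
```

When `resolve_skill` returns `None`, the second element of the tuple is the list of paths to show the user before asking for a correction.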
2. Determine Execution Mode
Ask the user or detect from arguments:
| Mode | Trigger | Behavior |
|---|---|---|
| Interactive | Default (no flag) | Analyze sources → diagnose root cause → propose one by one → user decides each |
| Observe-only | `--observe` or "just log" | Analyze sources → diagnose → write all to LESSONS.md → done |
| Watch | `--watch` or "run X and observe" | Execute the target skill first, then analyze the results (interactive or with `--observe`) |
| Review | `--review` or "review lessons" | Skip source analysis → walk through existing LESSONS.md entries |
| Audit | audit flag or "audit the skill" | Skip sources → full static diagnostic of the SKILL.md → propose fixes |
If mode is Review, jump directly to Step 6: Review Mode.
Watch mode: Natural-language requests also trigger it. For example, "ejecutá /create-plan y después observemos"
(Spanish for "run /create-plan and then let's observe") triggers watch + interactive.
The skill being watched becomes the target for analysis.
Accumulation workflow: Use `--observe` (or `--watch` combined with `--observe`) repeatedly across
sessions to accumulate lessons in LESSONS.md. Each run adds new findings or increments
Hits on existing ones. When ready to process, run `--review` to walk through everything
and decide what to apply.
Session 1: /skill-sharpen --watch create-plan --observe → runs skill, logs findings
Session 2: /skill-sharpen --observe → logs more findings, Hits grow
Session 3: /skill-sharpen --review → process all accumulated lessons
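As a rough sketch of how the flags above might map to a mode, assuming simple argument parsing: `--observe`, `--watch`, and `--review` appear in the session examples, while `--audit` and the precedence order between flags are assumptions made for this illustration.

```python
from __future__ import annotations


def detect_mode(args: list[str]) -> dict[str, object]:
    """Map command arguments to an execution mode, a watch flag, and a target skill."""
    flags = {arg for arg in args if arg.startswith("--")}
    positional = [arg for arg in args if not arg.startswith("--")]

    # Assumed precedence: review, then audit, then observe-only, else interactive.
    if "--review" in flags:
        mode = "review"
    elif "--audit" in flags:  # hypothetical flag name, not confirmed by the source
        mode = "audit"
    elif "--observe" in flags:
        mode = "observe-only"
    else:
        mode = "interactive"

    return {
        "mode": mode,
        "watch": "--watch" in flags,  # watch runs the target first, then analyzes
        "target": positional[0] if positional else None,  # None means auto-detect
    }
```

Natural-language triggers such as "just log" would need separate handling; this sketch covers flags only.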
3. Gather Evidence
Collect information from three sources. Work with whatever is available — not all sources
will have signal every time.
Source A — Conversation history
Scan the conversation for friction signals:
- Errors or exceptions during skill execution
- User corrections ("no, not that", "I meant...", "undo that")
- Retries or repeated attempts at the same step
- Manual interventions the user had to make
- Confusion about what the skill was supposed to do
- Steps the skill skipped or did in the wrong order
Source B — File diffs
Check the current diff or recently modified files for:
- Files the skill created or modified — do they match what was expected?
- Changes the user had to make after the skill ran (post-corrections)
- Incomplete implementations (TODOs, placeholders, missing pieces)
- Patterns that deviate from what the SKILL.md prescribed
Source C — User feedback
Ask the user directly:
"What worked well? What didn't? Anything specific you want the skill to do differently?"
This is especially valuable when conversation context is compressed or when the issues
are subtle (preferences, style, approach). Keep it open-ended — one question, then follow
up if needed.
4. Analyze and Generate Proposals
Cross-reference the evidence against the target skill's SKILL.md to identify:
| Category | What to look for |
|---|---|
| Missing instructions | Steps the skill should have taken but didn't because the SKILL.md didn't mention them |
| Ambiguous instructions | Places where the SKILL.md was vague and the skill made a wrong choice |
| Wrong defaults | Default behaviors that consistently need overriding |
| Missing guardrails | Errors that a "don't" rule would have prevented |
| Outdated content | References to APIs, tools, or patterns that have changed |
| Missing examples | Cases where an example would have prevented misinterpretation |
| Structural issues | Ordering problems, missing sections, or buried important info |
For each finding, diagnose the root cause in the SKILL.md. Don't just describe what
went wrong — explain why it happened by tracing it back to a specific instruction,
gap, or ambiguity. Use these diagnostic categories:
| Diagnostic | What it means |
|---|---|
| Coherence | Sections don't align — the process says one thing, the guardrails another |
| Coupling | Content that doesn't belong in this skill — leaks from another domain, out-of-scope instructions, or mixed responsibilities that caused the agent to act outside its purpose. If it's not cohesive with the skill's core goal, it shouldn't be there |
| Ambiguity | Instruction open to interpretation — "if needed", "as appropriate" without criteria |
| Contradiction | Two rules directly conflict |
| Specificity gap | No concrete rule exists for this case — the agent had to guess |
| Missing instruction | The SKILL.md simply doesn't cover this scenario |
| Redundancy | Same instruction repeated in different sections or worded differently — causes confusion about which one to follow, wastes context window |
| Error inducer | A specific instruction actively promotes the wrong behavior |
Each proposal must include a short root cause line. Format:
Finding: [what happened]
Root cause: [diagnostic] — [which line/section caused it and why]
Proposed change: [concrete fix]
Audit mode: When invoked in audit mode, run a full static diagnostic of the
SKILL.md without requiring execution evidence. Validate against these baseline rules
(from Agent Skills spec + Anthropic best practices):
- Frontmatter must have `name` and `description` (required)
- Description max 1024 characters, third person, with specific trigger phrases
- Body should be under 500 lines; move overflow into separate resource files
- Name: lowercase letters, numbers, and hyphens only, 1-64 characters
- Progressive disclosure: metadata (~100 tokens) → body (<5k tokens) → resources (as needed)
- Check for: dead content, scope creep, trigger quality, token efficiency, completeness
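The mechanical part of these baseline rules can be checked with a short script. The sketch below is illustrative only: it assumes flat `key: value` frontmatter delimited by `---` lines, treats digits in the name as allowed (an assumption), and leaves the judgment-based checks (third person, trigger quality, scope creep) to the reviewer. The function names are hypothetical.

```python
from __future__ import annotations

import re

# Assumption: lowercase letters, digits, and hyphens, 1-64 characters.
NAME_RE = re.compile(r"^[a-z0-9-]{1,64}$")


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split '---' delimited frontmatter from the body (flat key: value pairs only)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    end = next((i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"), None)
    if end is None:
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


def audit_skill(text: str) -> list[str]:
    """Return a list of baseline-rule violations found in a SKILL.md string."""
    meta, body = split_frontmatter(text)
    problems: list[str] = []
    name = meta.get("name", "")
    description = meta.get("description", "")
    if not name:
        problems.append("frontmatter: missing name")
    elif not NAME_RE.match(name):
        problems.append("name: use lowercase letters, digits, hyphens; 1-64 characters")
    if not description:
        problems.append("frontmatter: missing description")
    elif len(description) > 1024:
        problems.append("description: longer than 1024 characters")
    if len(body.splitlines()) >= 500:
        problems.append("body: 500 lines or more")
    return problems
```

Each returned string can seed one audit finding; the root-cause diagnosis and the proposed fix still come from reading the SKILL.md itself.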
Enrich with context7 (optional): If the context7 MCP server is available, query
the latest Agent Skills specification and Anthropic skill authoring best practices to
ensure rules reflect the most current standards. If not available, use the baseline
rules above; they cover the stable core.
For each finding, formulate a proposal: a concrete, actionable change to the SKILL.md.
Assign importance based on impact:
| Importance | Criteria |
|---|---|
| high | Breaks output, causes errors, or requires user intervention every time |
| medium | Suboptimal results, friction exists but workaround is possible |
| low | Style, preferences, minor improvements |
Recurrence escalation: Before generating a new proposal, check LESSONS.md for an
existing entry describing the same pattern. If found, increment its Hits count instead
of creating a duplicate. When hits reach 3+, escalate importance:
low → medium, medium → high. high stays high.
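A minimal sketch of the escalation rule, under one stated assumption: importance moves up a single level at the moment the hit count reaches 3, and later hits leave it unchanged. The source says "3+", so escalating again on each later hit is another possible reading. The dictionary shape and function name are hypothetical.

```python
from __future__ import annotations

ESCALATION = {"low": "medium", "medium": "high", "high": "high"}


def record_hit(entry: dict) -> dict:
    """Return a copy of a lesson entry with Hits incremented and importance escalated."""
    hits = entry["hits"] + 1
    importance = entry["importance"]
    if hits == 3:  # assumption: escalate once, when the threshold is first reached
        importance = ESCALATION[importance]
    return {**entry, "hits": hits, "importance": importance}
```

For example, a `low` entry with 2 hits becomes `medium` with 3 hits after one more occurrence, and stays `medium` at 4 hits.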
5. Present Proposals (Interactive Mode)
Present proposals one at a time, ordered by importance (high → medium → low).
For each proposal, show:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROPOSAL [N/total] — [importance]
Source: [conversation | diff | user]
Finding: [what was observed]
Root cause: [diagnostic] — [which line/section and why]
Hits: [N — omit if first occurrence]
Proposed change:
[concrete description of what to add/modify/remove in SKILL.md]
Preview:
[show the actual diff or new text that would be added]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
(a)ccept (p)ostpone (r)eject (d)on't (s)kip all
Handle the user's decision:
- Accept: Show the exact edit to be made. Apply it only after the user confirms.
Edit the target SKILL.md directly.
- Postpone: Append to the target skill's LESSONS.md (create if it doesn't exist).
- Reject: Discard and move to the next proposal.
- Don't: The user is saying "this is wrong, the skill should NEVER do this". Confirm
the negative rule with the user, then add it to the SKILL.md as a "Do NOT..." instruction.
- Skip all: Write remaining proposals to LESSONS.md and end.
After all proposals are processed, show a summary:
Done. [N] accepted, [N] postponed, [N] rejected, [N] don'ts added.
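The ordering and the closing summary can be sketched like this. The decision labels (`accept`, `postpone`, `reject`, `dont`) and the dictionary shape of a proposal are assumptions chosen for the example.

```python
from __future__ import annotations

from collections import Counter

IMPORTANCE_ORDER = {"high": 0, "medium": 1, "low": 2}


def order_proposals(proposals: list[dict]) -> list[dict]:
    """Sort proposals high → medium → low; ties keep their original order."""
    return sorted(proposals, key=lambda p: IMPORTANCE_ORDER[p["importance"]])


def summarize(decisions: list[str]) -> str:
    """Build the closing summary line from the recorded decisions."""
    counts = Counter(decisions)
    return (
        f"Done. {counts['accept']} accepted, {counts['postpone']} postponed, "
        f"{counts['reject']} rejected, {counts['dont']} don'ts added."
    )
```

Python's `sorted` is stable, so proposals of equal importance are presented in the order they were generated.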
6. Review Mode
Walk through existing LESSONS.md entries one by one. For each entry, present it in the
same format as Step 5 (but source and finding come from the LESSONS.md row).
The user can:
- Accept → apply to SKILL.md, remove from LESSONS.md
- Reject → remove from LESSONS.md (optionally convert to "don't")
- Keep → leave in LESSONS.md for later
After processing all entries, show the summary. If all entries were processed (none kept),
delete the LESSONS.md file.
LESSONS.md Format
The file lives alongside the target skill's SKILL.md. Format:
```markdown
# Lessons — {skill-name}

### 1 — high | Hits: 1
- **Date**: 2026-03-28
- **Source**: conversation
- **Diagnostic**: ambiguity — line 45 says "if needed" without criteria
- **Proposal**: Replace "if needed" with explicit condition: "when scope is api or both"

### 2 — medium | Hits: 3
- **Date**: 2026-03-27
- **Source**: diff
- **Diagnostic**: missing instruction
- **Proposal**: Add validation step before Phase 3 for skill-scoped plans
```
Fields:
- Heading: entry number + importance + hits count
- Date: when first generated (YYYY-MM-DD), updated to latest occurrence on hit
- Source: conversation, diff, or user
- Diagnostic: root cause category + short explanation
- Proposal: concise description of finding + proposed change
Rules:
- Never create an empty LESSONS.md — only create it when there's at least one entry
- When the same pattern is detected again, increment Hits in the heading instead of
adding a duplicate. Update Date to the latest occurrence
- When hits reach 3+, escalate importance: low → medium, medium → high
- When accepting or rejecting an entry, remove the entire block
- When all entries are removed, delete the file
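The format and the file rules can be sketched as a small read/write helper. This is an illustration under assumptions: entries are plain dictionaries with lowercase keys, entries are renumbered on every save, and the function names are hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path

# Matches headings such as "### 2 — medium | Hits: 3".
HEADING_RE = re.compile(r"^### (\d+) — (high|medium|low) \| Hits: (\d+)$")
FIELDS = ("Date", "Source", "Diagnostic", "Proposal")


def render_lessons(skill_name: str, entries: list[dict]) -> str:
    """Render entries in the LESSONS.md layout, numbering them from 1."""
    blocks = [f"# Lessons — {skill_name}"]
    for number, entry in enumerate(entries, start=1):
        lines = [f"### {number} — {entry['importance']} | Hits: {entry['hits']}"]
        lines += [f"- **{field}**: {entry[field.lower()]}" for field in FIELDS]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def parse_headings(text: str) -> list[tuple[int, str, int]]:
    """Extract (number, importance, hits) from every entry heading."""
    found = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            found.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return found


def save_lessons(path: Path, skill_name: str, entries: list[dict]) -> None:
    """Write LESSONS.md, or delete it when no entries remain (never leave it empty)."""
    if not entries:
        path.unlink(missing_ok=True)
        return
    path.write_text(render_lessons(skill_name, entries), encoding="utf-8")
```

Calling `save_lessons` with an empty list covers both "never create an empty LESSONS.md" and "when all entries are removed, delete the file".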
Guardrails
- Never edit without confirmation. Always show the proposed diff and wait for explicit
user approval before modifying any SKILL.md. This is non-negotiable — no exceptions,
not even in observe-only mode (which writes to LESSONS.md, never to SKILL.md).
Always ask the user what they want to do. The user owns the skill.
- Never expose secrets. When analyzing conversation history, diffs, or files, redact
any sensitive content before displaying it in proposals, previews, or LESSONS.md entries.
This includes: API keys, tokens, passwords, connection strings, private URLs, and any
value that looks like a credential. Replace each one
with a redaction placeholder in all output. Never write secrets to LESSONS.md.
- Read before proposing. Always read the target SKILL.md and LESSONS.md before
generating proposals. Avoids duplicates, contradictions, and already-addressed issues.
- Work with partial context. If the conversation was long and context is compressed,
work with what's available. State what you can see and what might be missing. Never
invent evidence or assume what happened.
- One proposal at a time. Don't dump all proposals at once. Present, decide, move on.
- Respect the SKILL.md structure. When inserting new content, match the existing style,
indentation, and organizational pattern of the target SKILL.md.
- Don'ts need double confirmation. Adding a negative rule to a SKILL.md is impactful.
Always confirm: "Add this as a 'don't' rule to the SKILL.md?"
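The secret-redaction guardrail could be approximated with a pattern-based filter like the one below. Everything specific here is an assumption made for illustration: the `[REDACTED]` placeholder, the key names in the pattern, and the two patterns themselves. A filter like this is a safety net with obvious blind spots, not a guarantee.

```python
from __future__ import annotations

import re

PLACEHOLDER = "[REDACTED]"  # assumed placeholder text

# key=value or key: value pairs whose key suggests a credential (assumed key names).
KEY_VALUE_RE = re.compile(
    r"\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)(\S+)",
    re.IGNORECASE,
)
# URLs that embed user:password@host, as in many connection strings.
URL_CREDENTIALS_RE = re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@\S+")


def redact(text: str) -> str:
    """Replace credential-looking values before text reaches proposals or LESSONS.md."""
    # Keep the key and separator so the finding stays readable; drop only the value.
    text = KEY_VALUE_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}{PLACEHOLDER}", text)
    return URL_CREDENTIALS_RE.sub(PLACEHOLDER, text)
```

Running every finding, preview, and LESSONS.md line through such a function before output keeps the check in one place.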