Loading...
Loading...
Found 9 Skills
Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).
Create an AI Evals Pack (eval PRD, test set, rubric, judge plan, results + iteration loop). Use for LLM evaluation, benchmarks, rubrics, error analysis/open coding, and ship/no-ship quality gates for AI features.
Use when given a Sentry issue URL and you need to fetch exception details, stacktrace, and request context using sentry-cli (and Sentry API fallback when needed).
Build a structured taxonomy of failure modes from open-coded trace annotations. Use this skill whenever the user has freeform annotations from reviewing LLM traces and wants to cluster them into a coherent, non-overlapping set of binary failure categories (axial coding). Also use when the user mentions "failure modes", "error taxonomy", "axial coding", "cluster annotations", "categorize errors", "failure analysis", or wants to go from raw observation notes to structured evaluation criteria. This skill covers the full pipeline: grouping open codes, defining failure modes, re-labeling traces, and quantifying error rates.
Generate a custom trace annotation web app for open coding during LLM error analysis. Use when the user wants to review LLM traces, annotate failures with freeform comments, and do first-pass qualitative labeling (open coding). Also use when the user mentions "annotate traces", "trace review tool", "open coding tool", "label traces", "build an annotation interface", "review LLM outputs", or wants to manually inspect pipeline traces before building a failure taxonomy. This skill produces a tailored Python web application using FastHTML, TailwindCSS, and HTMX.
Analyze Claude Code session logs - extract thinking blocks, tool usage stats, error patterns, debug trajectories. Triggers on: introspect, session logs, trajectory, analyze sessions, what went wrong, tool usage, thinking blocks, session history, my reasoning, past sessions, what did I do.
Fetch and analyze AppSignal error incidents. Use when debugging errors, investigating exceptions, or when the user mentions AppSignal, incidents, or error monitoring.
Use proactively when encountering Rails errors, test failures, build issues, or unexpected behavior. Analyzes errors, reproduces issues, and identifies root causes.
Identifies bugs, analyzes errors, performs root cause analysis, and proposes fixes