Search Results: binary-evaluation

Found 3 Skills

AI & Machine Learning · akillness/oh-my-skills

skill-autoresearch

Autonomously optimize an existing AI skill by running it repeatedly against binary evals, mutating one instruction at a time, and keeping only changes that improve pass rate. Based on Karpathy-style autoresearch, but applied to SKILL.md iteration instead of ML training. Use when optimizing a skill, benchmarking prompt quality, building evals for a skill, or running self-improvement loops on reusable agent instructions. Triggers on: skill-autoresearch, optimize this skill, improve this skill, benchmark this skill, eval my skill, run autoresearch on this skill, self-improve skill.

Code Quality · alpoxdev/hypercore

autoresearch-code

[Hyper] Optimize an existing codebase through baseline-first experiments, binary evaluation, and one-mutation-at-a-time iteration. Use for codebase autoresearch, measured bottleneck reduction, benchmarked code optimization, and evidence-backed refactors.

1 script · Checked

AI & Machine Learning · pedronauck/skills

autoresearch

Autonomously optimize any Claude Code skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on. Outputs: an improved SKILL.md, a results log, and a changelog of every mutation tried.
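
The loop these descriptions share (score against binary evals, mutate one instruction, keep only improvements) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the implementation used by any of the listed skills: `toy_run_skill`, `toggle_one_instruction`, `POOL`, and the two evals are hypothetical stand-ins, and a real run would invoke the agent with the candidate SKILL.md instead of a string-transforming stub.

```python
import random
from typing import Callable, List, Tuple

Eval = Callable[[str], bool]          # a binary eval: output -> pass/fail
LogEntry = Tuple[str, float, bool]    # (mutation description, pass rate, kept?)


def pass_rate(instructions, run_skill, evals, inputs) -> float:
    """Fraction of (input, eval) pairs that pass for this instruction set."""
    results = [check(run_skill(instructions, x)) for x in inputs for check in evals]
    return sum(results) / len(results)


def autoresearch(instructions, run_skill, evals, inputs, mutate, rounds=30, seed=0):
    """Baseline first, then one mutation per round; keep only strict improvements."""
    rng = random.Random(seed)
    best = list(instructions)
    best_score = pass_rate(best, run_skill, evals, inputs)
    log: List[LogEntry] = [("baseline", best_score, True)]
    for _ in range(rounds):
        candidate, description = mutate(best, rng)
        score = pass_rate(candidate, run_skill, evals, inputs)
        kept = score > best_score
        log.append((description, score, kept))   # changelog of every mutation tried
        if kept:
            best, best_score = candidate, score
    return best, best_score, log


# --- Hypothetical toy setup (assumption: stands in for a real skill run) ---
POOL = ["use uppercase", "end with a period", "be polite"]


def toy_run_skill(instructions: List[str], text: str) -> str:
    """Stub 'skill': transforms text according to the instructions present."""
    out = text
    if "use uppercase" in instructions:
        out = out.upper()
    if "end with a period" in instructions:
        out = out + "."
    return out


def toggle_one_instruction(instructions: List[str], rng: random.Random):
    """Mutation operator: add or remove exactly one instruction from POOL."""
    choice = rng.choice(POOL)
    if choice in instructions:
        return [i for i in instructions if i != choice], f"remove: {choice}"
    return instructions + [choice], f"add: {choice}"


EVALS: List[Eval] = [lambda o: o.isupper(), lambda o: o.endswith(".")]
INPUTS = ["hello", "world"]

best, score, log = autoresearch([], toy_run_skill, EVALS, INPUTS, toggle_one_instruction)
for description, rate, kept in log:
    print(f"{'KEEP' if kept else 'drop'}  {rate:.2f}  {description}")
```

Requiring a strict improvement (`score > best_score`) means neutral mutations such as the toy "be polite" instruction are discarded, which keeps the instruction set from growing without evidence. Changing one instruction per round makes each entry in the log attributable to a single change.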
