Search Results: terminal-bench

Found 2 Skills

terminal-bench-loop

Run a single Terminal-Bench problem through Paperclip in a bounded, human-in-the-loop improvement cycle until the smoke test passes, the board rejects the next fix, the iteration budget is exhausted, or a real blocker is named. Each iteration runs a bounded smoke test against an isolated Paperclip App worktree, captures artifacts, diagnoses the exact stop point with `/diagnose-why-work-stopped`, requests board confirmation before any product fix, then reruns against the same worktree. Use whenever an issue asks to "run Terminal-Bench in a loop", "drive Terminal-Bench until it passes", "loop fix-git through Paperclip", or otherwise points at a Terminal-Bench task and asks for bounded iteration with diagnosis.
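The cycle described above has four exits, and a minimal sketch can make their order explicit. This is not the skill's implementation: `run_loop`, `Stop`, `Diagnosis`, and the four callables are hypothetical names introduced for illustration, and treating "no proposed fix" as the signal for a real blocker is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Stop(Enum):
    """The four stop conditions named in the description."""
    SMOKE_PASSED = "smoke passed"
    FIX_REJECTED = "board rejected the next fix"
    BUDGET_EXHAUSTED = "iteration budget exhausted"
    BLOCKER_NAMED = "real blocker named"


@dataclass
class Diagnosis:
    stop_point: str
    # Assumption: a diagnosis with no proposed fix stands for a real blocker.
    proposed_fix: Optional[str] = None


def run_loop(
    run_smoke: Callable[[], bool],
    diagnose: Callable[[], Diagnosis],
    board_approves: Callable[[str], bool],
    apply_fix: Callable[[str], None],
    budget: int,
) -> Tuple[Stop, int]:
    """Return the stop reason and the number of smoke runs performed."""
    for iteration in range(1, budget + 1):
        if run_smoke():                       # bounded smoke run on the worktree
            return Stop.SMOKE_PASSED, iteration
        diagnosis = diagnose()                # find the exact stop point
        if diagnosis.proposed_fix is None:
            return Stop.BLOCKER_NAMED, iteration
        if not board_approves(diagnosis.proposed_fix):
            return Stop.FIX_REJECTED, iteration
        apply_fix(diagnosis.proposed_fix)     # then rerun on the same worktree
    return Stop.BUDGET_EXHAUSTED, budget
```

Passing the four steps in as callables keeps the control flow testable without a real worktree or board; a caller would bind them to whatever actually runs the smoke test, invokes the diagnosis, and asks for confirmation.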

AI & Machine Learning | coder/mux

tbench

Terminal-Bench integration for Mux agent benchmarking and failure analysis
