Loading...
Loading...
Found 6 Skills
Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.
Control terminal TUIs and web/Electron apps for testing, demos, QA, and computer-use tasks. Use when you need to automate a CLI, drive a browser, record a demo, or capture proof artifacts.
Drive a real browser to QA a feature end-to-end as a user would. Loads the right mix of Playwright MCP, Claude-in-Chrome, and computer-use, plus the failure modes to avoid. Use whenever you need to verify a UI feature works in a browser, capture PR screenshots, repro a customer bug visually, or do end-of-task dogfooding before declaring something "done". This is the QA stage of orchestrate mode.
Loads orchestrate mode — a disciplined delivery loop that enforces BDD specs in specs/, real integration tests (no mocks), PR CI and CodeRabbit babysitting, and mandatory end-user QA via computer-use or CLI dogfooding before anything is considered done. Use when starting any non-trivial implementation task, feature build, or delivery where you want the work driven from spec to proven-shipped state rather than stopping at "tests pass".
Build and run Gemini 2.5 Computer Use browser-control agents with Playwright. Use when a user wants to automate web browser tasks via the Gemini Computer Use model, needs an agent loop (screenshot → function_call → action → function_response), or asks to integrate safety confirmation for risky UI actions.
Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on sandboxing, security, and handling the unique challenges of vision-based control. Use when: computer use, desktop automation agent, screen control AI, vision-based agent, GUI automation.