# agent-evals
Build automated evaluation suites for AI agents using golden datasets, rubrics, and regression gates.
## NPX Install

```bash
npx skill4agent add bagelhole/devops-security-agent-skills agent-evals
```
# Agent Evals
Create repeatable checks so agent behavior improves safely over time.
## Evaluation Layers
- Unit evals: prompt-level correctness
- Tool evals: API/tool call decision quality
- End-to-end evals: realistic multi-step tasks
- Safety evals: prompt injection and data leak resistance
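The unit-eval layer above can be sketched as a golden case scored against expected outputs. This is a minimal illustration, not part of any real framework; `EvalCase`, `score_response`, and the keyword-overlap rubric are all assumed names and a deliberately simple scoring rule.

```python
# Minimal sketch of a rubric-based unit eval: score an agent response by
# how many golden expected keywords it contains. All names here are
# illustrative assumptions, not a real eval framework's API.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_keywords: list  # golden expectations for the response


def score_response(case: EvalCase, response: str) -> float:
    """Fraction of expected keywords present in the agent's response."""
    lowered = response.lower()
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in lowered)
    return hits / len(case.expected_keywords)


case = EvalCase(
    prompt="List the steps to rotate an API key.",
    expected_keywords=["generate", "update", "revoke"],
)
print(score_response(case, "Generate a new key, update clients, then revoke the old one."))
# -> 1.0
```

Keyword overlap is the crudest usable rubric; tool evals and end-to-end evals typically replace it with structured checks on tool-call arguments or task completion state, and safety evals invert it (forbidden strings must be absent).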
## CI/CD Integration
```bash
# Example eval pipeline steps
make evals-smoke
make evals-regression
make evals-safety
```

## Best Practices
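Inside CI, targets like these usually wrap a gate script that compares the current run against a stored baseline. Here is one possible sketch, assuming pass rates per suite and a fixed tolerance; none of this is a real pipeline spec.

```python
# Sketch of a regression gate: fail CI when any suite's pass rate drops
# more than `tolerance` below its stored baseline. Suite names, rates,
# and the tolerance value are illustrative assumptions.

def gate(current: dict, baseline: dict, tolerance: float = 0.02) -> bool:
    """Return True when no suite regressed beyond the tolerance."""
    ok = True
    for suite, base_rate in baseline.items():
        cur = current.get(suite, 0.0)
        if cur < base_rate - tolerance:
            print(f"FAIL {suite}: pass rate {cur:.2f} < baseline {base_rate:.2f}")
            ok = False
    return ok


baseline = {"smoke": 0.98, "regression": 0.95, "safety": 1.00}
current = {"smoke": 0.97, "regression": 0.96, "safety": 1.00}
print("gate passed" if gate(current, baseline) else "gate failed")
```

A real pipeline would load the two result sets from the eval runner's output files and call `sys.exit(1)` on failure so the deploy step is actually blocked.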
- Version datasets with expected outputs.
- Track pass rates and score drift over time.
- Block deploys on critical safety regressions.
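To make "track pass rates and score drift over time" concrete, drift can be computed by comparing runs scored against the same dataset version. This is a minimal sketch; the record shape (`dataset_version`, `date`, `mean_score`) is an assumption, not a prescribed storage format.

```python
# Minimal sketch of drift tracking over versioned eval runs. The record
# fields here are illustrative assumptions.
runs = [
    {"dataset_version": "v1", "date": "2024-05-01", "mean_score": 0.91},
    {"dataset_version": "v1", "date": "2024-05-08", "mean_score": 0.88},
]

# Only compare runs scored against the same dataset version; mixing
# versions would conflate dataset changes with real behavior drift.
same_version = [r for r in runs if r["dataset_version"] == "v1"]
drift = same_version[-1]["mean_score"] - same_version[0]["mean_score"]
print(f"score drift on v1: {drift:+.2f}")  # -> score drift on v1: -0.03
```

Versioning the dataset alongside expected outputs is what makes this comparison valid: when the dataset changes, the version bumps and a new baseline starts.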
## Related Skills
- `github-actions`: Eval automation in CI
- `ai-agent-security`: Security-focused eval cases