agent-evals
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseAgent Evals
Agent 评估
Create repeatable checks so agent behavior improves safely over time.
创建可重复的检查机制,确保Agent行为随时间安全改进。
Evaluation Layers
评估层级
- Unit evals: prompt-level correctness
- Tool evals: API/tool call decision quality
- End-to-end evals: realistic multi-step tasks
- Safety evals: prompt injection and data leak resistance
- 单元评估:提示词层面的正确性
- 工具评估:API/工具调用的决策质量
- 端到端评估:贴近真实场景的多步骤任务
- 安全评估:抵御提示词注入和数据泄露的能力
CI/CD Integration
CI/CD 集成
bash
undefinedbash
undefinedExample eval pipeline steps
Example eval pipeline steps
make evals-smoke
make evals-regression
make evals-safety
undefinedmake evals-smoke
make evals-regression
make evals-safety
undefinedBest Practices
最佳实践
- Version datasets with expected outputs.
- Track pass rates and score drift over time.
- Block deploys on critical safety regressions.
- 对带有预期输出的数据集进行版本控制。
- 跟踪通过率和分数随时间的变化趋势。
- 若出现严重安全回归问题,阻止部署。
Related Skills
相关技能
- github-actions - Eval automation in CI
- ai-agent-security - Security-focused eval cases
- github-actions - 在CI中实现评估自动化
- ai-agent-security - 聚焦安全的评估用例