Back to Details

agent-evals

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Agent Evals

Agent 评估

Create repeatable checks so agent behavior improves safely over time.

创建可重复的检查机制，确保Agent行为随时间安全改进。

Evaluation Layers

评估层级

Unit evals: prompt-level correctness
Tool evals: API/tool call decision quality
End-to-end evals: realistic multi-step tasks
Safety evals: prompt injection and data leak resistance

单元评估：提示词层面的正确性
工具评估：API/工具调用的决策质量
端到端评估：贴近真实场景的多步骤任务
安全评估：抵御提示词注入和数据泄露的能力

CI/CD Integration

CI/CD 集成

bash

undefined

bash

undefined

Example eval pipeline steps

Example eval pipeline steps

make evals-smoke make evals-regression make evals-safety

undefined

make evals-smoke make evals-regression make evals-safety

undefined

Best Practices

最佳实践

Version datasets with expected outputs.
Track pass rates and score drift over time.
Block deploys on critical safety regressions.

对带有预期输出的数据集进行版本控制。
跟踪通过率和分数随时间的变化趋势。
若出现严重安全回归问题，阻止部署。

Related Skills

相关技能

github-actions - Eval automation in CI
ai-agent-security - Security-focused eval cases

github-actions - 在CI中实现评估自动化
ai-agent-security - 聚焦安全的评估用例