agent-evals

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Agent Evals

Agent 评估

Create repeatable checks so agent behavior improves safely over time.
创建可重复的检查机制,确保Agent行为随时间安全改进。

Evaluation Layers

评估层级

  • Unit evals: prompt-level correctness
  • Tool evals: API/tool call decision quality
  • End-to-end evals: realistic multi-step tasks
  • Safety evals: prompt injection and data leak resistance
  • 单元评估:提示词层面的正确性
  • 工具评估:API/工具调用的决策质量
  • 端到端评估:贴近真实场景的多步骤任务
  • 安全评估:抵御提示词注入和数据泄露的能力

CI/CD Integration

CI/CD 集成

bash
undefined
bash
undefined

Example eval pipeline steps

Example eval pipeline steps

make evals-smoke make evals-regression make evals-safety
undefined
make evals-smoke make evals-regression make evals-safety
undefined

Best Practices

最佳实践

  • Version datasets with expected outputs.
  • Track pass rates and score drift over time.
  • Block deploys on critical safety regressions.
  • 对带有预期输出的数据集进行版本控制。
  • 跟踪通过率和分数随时间的变化趋势。
  • 若出现严重安全回归问题,阻止部署。

Related Skills

相关技能

  • github-actions - Eval automation in CI
  • ai-agent-security - Security-focused eval cases
  • github-actions - 在CI中实现评估自动化
  • ai-agent-security - 聚焦安全的评估用例