self-improving-agent-builder
Self-Improving Agent Builder
Purpose
Run a closed-loop improvement cycle on any goal-seeking agent implementation:
EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)
Each iteration measures L1-L12 progressive test scores, identifies failures
with `error_analyzer.py`, runs a research step with hypothesis, evidence, and
counter-arguments, applies targeted fixes, and gates promotion through
regression checks.
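The control flow of one cycle can be sketched in a few lines of Python. This is only an illustration of the phase ordering, not the runner's actual code: `run_cycle` and every callable it receives (`evaluate`, `analyze`, `research`, `apply_fixes`, `revert`, `decide`) are hypothetical names supplied by the caller.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # level name -> score (assumed to be a percentage)


def run_cycle(
    evaluate: Callable[[], Scores],
    analyze: Callable[[Scores], List[str]],
    research: Callable[[List[str]], List[str]],
    apply_fixes: Callable[[List[str]], None],
    revert: Callable[[], None],
    decide: Callable[[Scores, Scores], bool],
    iterations: int,
) -> List[str]:
    """Run EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE."""
    log: List[str] = []
    for i in range(1, iterations + 1):
        baseline = evaluate()            # EVAL: score the current agent
        failures = analyze(baseline)     # ANALYZE: classify what failed
        approved = research(failures)    # RESEARCH: keep only justified fixes
        apply_fixes(approved)            # IMPROVE: apply the approved fixes
        after = evaluate()               # RE-EVAL: same suite, new scores
        if decide(baseline, after):      # DECIDE: promotion gate
            log.append(f"iteration {i}: commit")
        else:
            revert()                     # undo the changes from this iteration
            log.append(f"iteration {i}: revert")
    return log
```

Passing the phases in as callables keeps the sketch independent of any particular SDK; a real runner would bind them to concrete eval and analysis code.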
When I Activate
- "improve agent" or "self-improving loop"
- "agent eval loop" or "run improvement cycle"
- "benchmark agents" or "compare SDK implementations"
- "iterate on agent scores" or "fix agent regressions"
Quick Start
User: "Run the self-improving loop on the mini-framework agent for 3 iterations"
Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
Reports per-iteration scores, net improvement, and commits/reverts.
Runner Script
The self-improvement loop is implemented as a Python CLI:
```bash
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3

# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```
**Source:** `src/amplihack/eval/self_improve/runner.py`
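When the runner is launched from another script, it can help to assemble the command line programmatically. The sketch below uses only the flags shown above; `build_runner_command` is a hypothetical helper, not part of amplihack, and it omits any flag the caller does not set rather than guessing the runner's defaults.

```python
import sys
from typing import List, Optional, Sequence


def build_runner_command(
    sdk: str,
    iterations: int,
    improvement_threshold: Optional[float] = None,
    regression_tolerance: Optional[float] = None,
    levels: Optional[Sequence[str]] = None,
    output_dir: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build an argv list for the self-improvement runner CLI."""
    cmd = [
        sys.executable, "-m", "amplihack.eval.self_improve.runner",
        "--sdk", sdk,
        "--iterations", str(iterations),
    ]
    # Optional flags are appended only when explicitly provided.
    if improvement_threshold is not None:
        cmd += ["--improvement-threshold", str(improvement_threshold)]
    if regression_tolerance is not None:
        cmd += ["--regression-tolerance", str(regression_tolerance)]
    if levels:
        cmd += ["--levels", *levels]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; passing a list instead of a shell string avoids quoting problems in paths.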
The Loop (6 Phases per Iteration)
Phase 1: EVAL
Run the L1-L12 progressive test suite on the current agent implementation.
Execution:
```bash
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```
Output: Per-level scores and overall baseline.
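To make "per-level scores and overall baseline" concrete, here is a minimal sketch of the aggregation. It rests on two assumptions that the text does not confirm: each level yields a list of per-question scores between 0 and 1, and the overall baseline is the unweighted mean of the level scores. `summarize_scores` is a hypothetical name.

```python
from statistics import mean
from typing import Dict, List


def summarize_scores(question_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Return per-level percentages plus an 'overall' entry.

    Levels with no questions are skipped. Raises statistics.StatisticsError
    if no level has any scores at all.
    """
    per_level = {
        level: mean(scores) * 100
        for level, scores in question_scores.items()
        if scores
    }
    # Assumed aggregation: every level counts equally toward the baseline.
    per_level["overall"] = mean(list(per_level.values()))
    return per_level
```

A suite that weights levels by question count would produce a different baseline, so the same aggregation has to be used in EVAL and RE-EVAL for the comparison to be meaningful.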
Phase 2: ANALYZE
Classify failures using `error_analyzer.py`. It maps each failed question to a
failure taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and to
the specific code component responsible.
```python
from amplihack.eval.self_improve import analyze_eval_results

# level_results holds the per-level results produced by the Phase 1 eval run.
analyses = analyze_eval_results(level_results, score_threshold=0.6)
```
Each ErrorAnalysis maps to:
```
failure_mode -> affected_component -> prompt_template
```
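The mapping chain can be pictured as a lookup from failure mode to component and template. The sketch below is illustrative only: the two failure modes come from the text above, but `ErrorAnalysisSketch`, `TAXONOMY`, `map_failures`, and all component and template names are hypothetical placeholders, not the real `ErrorAnalysis` class or taxonomy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# failure_mode -> (affected_component, prompt_template).
# Component and template names are made-up placeholders.
TAXONOMY: Dict[str, Tuple[str, str]] = {
    "retrieval_insufficient": ("retriever", "retrieval_prompt"),
    "temporal_ordering_wrong": ("timeline_builder", "temporal_prompt"),
}


@dataclass(frozen=True)
class ErrorAnalysisSketch:
    question_id: str
    failure_mode: str
    affected_component: str
    prompt_template: str


def map_failures(failed: Dict[str, str]) -> List[ErrorAnalysisSketch]:
    """Map question id -> failure mode onto component and template."""
    analyses = []
    for question_id, mode in failed.items():
        # Unrecognized modes are kept but flagged, so they are not lost.
        component, template = TAXONOMY.get(mode, ("unknown", ""))
        analyses.append(
            ErrorAnalysisSketch(question_id, mode, component, template)
        )
    return analyses
```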
Phase 3: RESEARCH (New)
The critical thinking step that prevents blind changes. For each proposed
improvement:
- State hypothesis: What specific change will fix the failure?
- Gather evidence: From eval results, failure patterns, baseline scores
- Consider counter-arguments: What could go wrong? Risk of regression?
- Make decision: Apply, skip, or defer with full reasoning
Decisions are logged in `research_decisions.json` for auditability.
Decision criteria:
- Apply: Clear failure pattern + prompt template available + low score
- Skip: Score above 50% (likely stochastic variation)
- Defer: Ambiguous evidence, needs more data
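The decision criteria above translate into a small rule function plus an audit log. This is a sketch under stated assumptions: scores are fractions between 0 and 1, "clear failure pattern" and "prompt template available" arrive as booleans, and the JSON layout is invented for illustration. `ResearchDecision`, `decide_action`, and `log_decisions` are hypothetical names; only the `research_decisions.json` filename comes from the text.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class ResearchDecision:
    hypothesis: str               # what change should fix the failure
    evidence: List[str]           # eval results, failure patterns, baselines
    counter_arguments: List[str]  # what could go wrong
    decision: str                 # "apply", "skip", or "defer"
    reasoning: str


def decide_action(score: float, clear_pattern: bool, has_template: bool) -> str:
    """Apply the three decision criteria in order."""
    if score > 0.5:
        return "skip"    # above 50%: likely stochastic variation
    if clear_pattern and has_template:
        return "apply"   # clear pattern + template available + low score
    return "defer"       # ambiguous evidence, needs more data


def log_decisions(decisions: List[ResearchDecision], output_dir: str) -> Path:
    """Write all decisions to research_decisions.json and return its path."""
    path = Path(output_dir) / "research_decisions.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = [asdict(d) for d in decisions]
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Storing the hypothesis and counter-arguments alongside the decision is what makes a later revert explainable: the log shows why a change was attempted, not just that it was.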
Phase 4: IMPROVE
Apply the improvements approved by the research step. Priority order:
- Prompt template improvements (safest, highest impact)
- Retrieval strategy adjustments
- Code logic fixes (most risky, needs careful review)
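One way to enforce this priority order is to tag each approved fix with its kind and sort before applying. The sketch below assumes fixes are represented as `(kind, description)` pairs; `FixKind` and `order_fixes` are hypothetical names chosen for illustration.

```python
from enum import IntEnum
from typing import List, Tuple


class FixKind(IntEnum):
    # Lower value = applied first (safest changes before riskiest).
    PROMPT_TEMPLATE = 1
    RETRIEVAL_STRATEGY = 2
    CODE_LOGIC = 3


def order_fixes(fixes: List[Tuple[FixKind, str]]) -> List[Tuple[FixKind, str]]:
    """Sort fixes by priority; ties keep their original relative order."""
    return sorted(fixes, key=lambda fix: fix[0])
```

Because `sorted` is stable, fixes of the same kind stay in the order the research step approved them.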
Phase 5: RE-EVAL
Re-run the same eval suite after applying fixes to measure impact.
Phase 6: DECIDE
Promotion gate:
- Net improvement >= +2% overall score: COMMIT the changes
- Any single level regression > 5%: REVERT all changes
- Otherwise: COMMIT with marginal improvement note
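The gate can be sketched as a single function over the before and after scores. Several points here are assumptions, since the rules above leave them open: percentages are treated as absolute percentage points, the overall score is the unweighted mean of the level scores, and the regression check runs first so that a large single-level drop reverts even when the overall score improved. `promotion_gate` is a hypothetical name.

```python
from statistics import mean
from typing import Dict, Tuple


def promotion_gate(
    before: Dict[str, float],
    after: Dict[str, float],
    improvement_threshold: float = 2.0,
    regression_tolerance: float = 5.0,
) -> Tuple[str, str]:
    """Return (decision, note) for one iteration.

    Both dicts map level name -> score in percent and must share the
    same, non-empty set of levels.
    """
    deltas = {level: after[level] - before[level] for level in before}

    # Regression check first (assumed precedence): any level dropping by
    # more than the tolerance reverts all changes.
    worst = min(deltas, key=deltas.get)
    if -deltas[worst] > regression_tolerance:
        return "REVERT", f"{worst} regressed by {-deltas[worst]:.1f} points"

    net = mean(after[level] for level in before) - mean(before.values())
    if net >= improvement_threshold:
        return "COMMIT", f"net improvement {net:+.1f} points"
    return "COMMIT", f"marginal improvement ({net:+.1f} points)"
```

For example, moving from `{"L1": 60, "L2": 40}` to `{"L1": 80, "L2": 30}` raises the mean by 5 points but is still reverted under this ordering, because L2 dropped by 10.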
Configuration
| Parameter | Default | Description |
|---|---|---|
| `sdk_type` | | Which SDK: mini/claude/copilot/microsoft |
| `max_iterations` | | Maximum improvement iterations |
| `improvement_threshold` | | Minimum % improvement to commit |
| `regression_tolerance` | | Maximum % regression on any level |
| `levels` | | Which levels to evaluate |
| `output_dir` | | Results directory |
| `dry_run` | | Evaluate only, don't apply changes |
Programmatic Usage
```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)
result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```
4-Way Benchmark Mode
Compare all SDK implementations side by side:
User: "Run a 4-way benchmark comparing all SDK implementations"
Skill: Runs eval suite on mini, claude, copilot, microsoft
Generates comparison table with scores, LOC, and coverage.
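The comparison table itself is straightforward to render once each SDK's results are collected. The sketch below assumes one record per SDK with a score, a line count, and a coverage figure; `SdkResult` and `comparison_table` are hypothetical names, and sorting by score is an illustrative choice rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SdkResult:
    sdk: str         # e.g. "mini", "claude", "copilot", "microsoft"
    score: float     # overall eval score in percent
    loc: int         # lines of code in the implementation
    coverage: float  # test coverage in percent


def comparison_table(results: List[SdkResult]) -> str:
    """Render results as a markdown table, best score first."""
    lines = [
        "| SDK | Score (%) | LOC | Coverage (%) |",
        "|---|---|---|---|",
    ]
    for r in sorted(results, key=lambda r: r.score, reverse=True):
        lines.append(f"| {r.sdk} | {r.score:.1f} | {r.loc} | {r.coverage:.1f} |")
    return "\n".join(lines)
```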
Integration Points
- `src/amplihack/eval/self_improve/runner.py`: Self-improvement loop runner
- `src/amplihack/eval/self_improve/error_analyzer.py`: Failure classification
- `src/amplihack/eval/progressive_test_suite.py`: L1-L12 eval runner
- `src/amplihack/agents/goal_seeking/sdk_adapters/`: All 4 SDK implementations
- `src/amplihack/eval/metacognition_grader.py`: Advanced eval dimensions
- `src/amplihack/eval/teaching_session.py`: L7 teaching quality eval