Encodes a continuous improvement loop for goal-seeking agents: EVAL, ANALYZE, RESEARCH (hypothesis + evidence + counter-arguments), IMPROVE, RE-EVAL, DECIDE. Auto-commits improvements (+2% net, no regression >5%) and reverts failures. Works with all 4 SDK implementations. Auto-activates on "improve agent", "self-improving loop", "agent eval loop", "benchmark agents", "run improvement cycle".
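The DECIDE step's commit/revert rule (+2% net improvement, no level regressing by more than 5%) can be sketched as a small gate function. This is an illustrative helper, not the actual amplihack implementation; scores are assumed to be per-level percentages:

```python
def should_commit(before: dict, after: dict,
                  improvement_threshold: float = 2.0,
                  regression_tolerance: float = 5.0) -> bool:
    """Commit only if the mean score improves by at least the threshold
    AND no single level regresses by more than the tolerance."""
    net_before = sum(before.values()) / len(before)
    net_after = sum(after.values()) / len(after)
    if net_after - net_before < improvement_threshold:
        return False  # not enough net improvement: revert
    for level, old_score in before.items():
        if old_score - after.get(level, 0.0) > regression_tolerance:
            return False  # one level regressed too far: revert
    return True  # commit the improvement
```

With defaults matching the description above, a +3.5% net gain with no regressions commits, while a +2.5% net gain that drops one level by 10 points reverts.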
Install:

```shell
npx skill4agent add rysweet/amplihack self-improving-agent-builder
```

The loop: EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)

User: "Run the self-improving loop on the mini-framework agent for 3 iterations"
Skill: Executes 3 iterations of EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE, then reports per-iteration scores, net improvement, and commits/reverts.

```shell
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3
```
```shell
# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```

The runner lives in src/amplihack/eval/self_improve/runner.py. Each iteration's EVAL step invokes the progressive test suite:

```shell
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```

The ANALYZE step (error_analyzer.py) maps failing levels to candidate fixes:

```python
from amplihack.eval.self_improve import analyze_eval_results

analyses = analyze_eval_results(level_results, score_threshold=0.6)
# Each ErrorAnalysis maps to:
#   failure_mode -> affected_component -> prompt_template
```

RESEARCH decisions are recorded in research_decisions.json.

| Parameter | Default | Description |
|---|---|---|
| `--sdk` | — | Which SDK: mini/claude/copilot/microsoft |
| `--iterations` | — | Maximum improvement iterations |
| `--improvement-threshold` | 2.0 | Minimum % net improvement required to commit |
| `--regression-tolerance` | 5.0 | Maximum % regression allowed on any level |
| `--levels` | — | Which levels to evaluate |
| `--output-dir` | — | Results directory |
| `--dry-run` | off | Evaluate only, don't apply changes |
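The per-failure record produced by analyze_eval_results (failure_mode -> affected_component -> prompt_template) can be sketched as a small dataclass. Only those three field names come from the docs above; every other field and the helper below are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class ErrorAnalysis:
    # Hypothetical sketch; only the last three field names are documented.
    level: str                 # e.g. "L3" (assumed field)
    score: float               # level score that fell below the threshold
    failure_mode: str          # what went wrong, e.g. "tool_misuse"
    affected_component: str    # which part of the agent is implicated
    prompt_template: str       # suggested prompt change to try


def below_threshold(level_scores: dict, score_threshold: float = 0.6) -> list:
    """Select levels whose score is under the threshold, mirroring the
    score_threshold filter shown for analyze_eval_results."""
    return [lvl for lvl, s in sorted(level_scores.items()) if s < score_threshold]
```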
The same loop is available as a Python API:

```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)
result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```

User: "Run a 4-way benchmark comparing all SDK implementations"
Skill: Runs eval suite on mini, claude, copilot, microsoft
Skill: Runs the eval suite on mini, claude, copilot, and microsoft, then generates a comparison table with scores, LOC, and coverage.

Key files:

- src/amplihack/eval/self_improve/runner.py
- src/amplihack/eval/self_improve/error_analyzer.py
- src/amplihack/eval/progressive_test_suite.py
- src/amplihack/agents/goal_seeking/sdk_adapters/
- src/amplihack/eval/metacognition_grader.py
- src/amplihack/eval/teaching_session.py
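A 4-way comparison table like the one described above could be assembled with a small helper. This is a sketch, not the skill's actual implementation, and the row values in the usage example are placeholders rather than real benchmark results:

```python
def comparison_table(rows: dict) -> str:
    """Render a markdown table from sdk -> (score, loc, coverage) tuples."""
    lines = ["| SDK | Score | LOC | Coverage |", "|---|---|---|---|"]
    for sdk in ("mini", "claude", "copilot", "microsoft"):
        score, loc, coverage = rows[sdk]
        # coverage is a 0..1 fraction, rendered as a percentage
        lines.append(f"| {sdk} | {score:.1f} | {loc} | {coverage:.0%} |")
    return "\n".join(lines)


# Placeholder inputs for illustration only:
print(comparison_table({
    "mini": (82.0, 1200, 0.91),
    "claude": (88.5, 1800, 0.95),
    "copilot": (79.0, 1500, 0.88),
    "microsoft": (84.2, 2100, 0.90),
}))
```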