Self-directed iterative improvement system for Codex that repeatedly cycles through modify, verify, and keep/discard until the target is reached or you interrupt it

npx skill4agent add aradotso/codex-skills codex-autoresearch-skill

Skill by ara.so, part of the Codex Skills collection.

# In Codex
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch

# Clone to Codex skills directory
cd ~/.config/codex/skills
git clone https://github.com/leo-lilinxiao/codex-autoresearch
# Or create a symlink
ln -s /path/to/codex-autoresearch ~/.config/codex/skills/codex-autoresearch

# Start Codex with Goals, hooks, and Full Access enabled
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox

$codex-autoresearch

You: $codex-autoresearch
I want to improve test coverage to at least 80%
Codex: Current coverage: 62% (measured via pytest --cov)
Target: ≥80%
Verify: pytest --cov --cov-report=term
Guard: pytest passes
Results directory: ./autoresearch-results/
Run mode: foreground or background?
You: Foreground, go.
Codex: Starting iteration loop. Baseline: 62%

You: $codex-autoresearch
Get rid of all the any types in my TypeScript code
Codex: Found 47 `any` occurrences in src/**/*.ts
Metric: `any` count (current: 47), direction: lower
Verify: grep -r "any" src | wc -l
Guard: npm run type-check (tsc --noEmit)
Run mode?
You: Background, run overnight.
Codex: Starting background run. Baseline: 47. Iterating.

tail -f autoresearch-results/results.tsv

You: $codex-autoresearch
Fix the 12 failing tests
Codex: Detected 12 test failures via pytest
Strategy: Repair one failure at a time
Verify: pytest (exit code 0 = success)
Mode?
You: Foreground.
Codex: Iteration 1: test_auth_token_expiry
Root cause: Clock skew in mock time...

You: $codex-autoresearch
Reduce API latency below 100ms
Codex: Current p95 latency: 247ms (from load test logs)
Target: <100ms
Verify: python benchmark.py --p95
Guard: pytest integration tests
Mode?
You: Foreground.

You: $codex-autoresearch
Is this code secure?
Codex: Mode: security audit
Framework: STRIDE + OWASP Top 10
Scope: src/**/*.py (detected Flask app)
Findings will be logged with code evidence.
Run mode?
You: Foreground.
Codex: Finding 1/7: SQL Injection risk in user_search()
Evidence: Line 42, raw string interpolation...

| Mode | Trigger Patterns | What It Does |
|---|---|---|
| loop | "improve X", "reduce Y", "optimize Z" | Iterative improvement until target or interrupt |
| plan | "analyze", "what should I improve?", "suggest metrics" | Scans repo, proposes goals and metrics |
| debug | "why is X happening?", "diagnose", "root cause" | Hypothesis-driven debugging with falsifiable tests |
| fix | "fix the N failing tests", "repair", "make tests pass" | Sequential repair of known failures |
| security | "is this secure?", "audit", "STRIDE", "OWASP" | Security analysis with structured findings |
| ship | "ship it", "ready to release?", "pre-deploy check" | Release readiness verification |
| exec | (CI/CD usage, see below) | Non-interactive automation mode |
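
To make the table concrete, here is a minimal Python sketch of keyword-based mode selection. It only illustrates how the trigger patterns map to modes; it is not how the skill picks a mode, and the names `MODE_KEYWORDS` and `guess_mode` and the keyword lists are hypothetical, loosely derived from the table. The `exec` mode is left out because it is started from CI/CD rather than from a phrase.

```python
# Hypothetical keyword lists, loosely derived from the trigger patterns in the table.
# More specific modes are checked first; "loop" is the fallback.
MODE_KEYWORDS = [
    ("security", ("secure", "audit", "stride", "owasp")),
    ("fix", ("fix", "repair", "make tests pass")),
    ("debug", ("why is", "diagnose", "root cause")),
    ("ship", ("ship it", "ready to release", "pre-deploy")),
    ("plan", ("analyze", "what should i improve", "suggest metrics")),
    ("loop", ("improve", "reduce", "optimize")),
]


def guess_mode(request: str) -> str:
    """Return the first mode whose keywords appear in the request (assumed heuristic)."""
    text = request.lower()
    for mode, keywords in MODE_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return mode
    return "loop"


if __name__ == "__main__":
    for prompt in (
        "Fix the 12 failing tests",
        "Is this code secure?",
        "Reduce API latency below 100ms",
    ):
        print(f"{prompt!r} -> {guess_mode(prompt)}")
```

A plain substring match like this is deliberately crude; it is only meant to show which phrases line up with which row of the table.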
Codex: Inferred configuration:
goal: "eliminate any types"
scope: "src/**/*.ts"
metric: any_count
current_value: 47
direction: lower
verify_cmd: "grep -r 'any' src | wc -l"
guard_cmd: "npm run type-check"
Adjust anything?
You: Change scope to include test files too.
Codex: Updated scope: "{src,test}/**/*.ts"
Current value: 63 (including tests)
Proceed?

{
  "goal": "Reduce bundle size",
  "metric": "bundle_kb",
  "current_value": 487,
  "target_value": 300,
  "direction": "lower",
  "verify_cmd": "npm run build && du -k dist/bundle.js | cut -f1",
  "guard_cmd": "npm test",
  "scope": "src/**/*.{ts,tsx}",
  "max_iterations": 50
}

# Save to file, then:
codex exec -f autoresearch_config.json

autoresearch-results/
├── results.tsv # Full experiment log (audit trail)
├── state.json # Resume state (last consistent checkpoint)
├── lessons.json # Cross-run learning (what worked/failed)
└── sessions/
    └── 2026-05-16_14-23-01/
        ├── experiment_1_keep.diff
        ├── experiment_2_discard.diff
        └── ...

# View all experiments
cat autoresearch-results/results.tsv
# Watch live (during background run)
tail -f autoresearch-results/results.tsv
# Filter successful improvements
grep "keep" autoresearch-results/results.tsv

iteration  commit   metric  delta  status    description
0          a1b2c3d  47      0      baseline  initial any count
1          b2c3d4e  41      -6     keep      replace any in auth module
2          -        49      +8     discard   generic wrapper introduced new anys
3          d4e5f6g  38      -3     keep      type-narrow API response handlers
4          e5f6g7h  38      0      discard   refactor had no effect
5          f6g7h8i  35      -3     keep      infer types from JSON schema

| Trigger | Action |
|---|---|
| 3 consecutive failures | REFINE — Adjust within current strategy |
| 5 consecutive failures | PIVOT — Try fundamentally different approach |
| 2 PIVOTs without progress | Web search — Look for external solutions |
| 3 PIVOTs without progress | STOP — Report that human input needed |
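
The escalation thresholds above can be sketched as a small counter in Python. This is an illustration under stated assumptions, not the skill's actual implementation: the class name `EscalationTracker`, the action strings, and the reset behaviour (a keep resets both counters, a PIVOT starts a fresh failure streak) are all assumptions that the table does not spell out.

```python
from dataclasses import dataclass


@dataclass
class EscalationTracker:
    """Maps a stream of keep/discard results to the actions in the table (assumed semantics)."""

    consecutive_failures: int = 0
    pivots_without_progress: int = 0

    def record(self, status: str) -> str:
        """Record one iteration's status ("keep" or "discard") and return the next action."""
        if status == "keep":
            # Assumption: any kept experiment counts as progress and resets both counters.
            self.consecutive_failures = 0
            self.pivots_without_progress = 0
            return "CONTINUE"

        self.consecutive_failures += 1
        if self.consecutive_failures == 3:
            return "REFINE"
        if self.consecutive_failures >= 5:
            # Assumption: a PIVOT starts a fresh failure streak.
            self.consecutive_failures = 0
            self.pivots_without_progress += 1
            if self.pivots_without_progress >= 3:
                return "STOP"
            if self.pivots_without_progress == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"


if __name__ == "__main__":
    tracker = EscalationTracker()
    actions = [tracker.record("discard") for _ in range(15)]
    # Show only the iterations where the policy escalates.
    print([(i + 1, action) for i, action in enumerate(actions) if action != "CONTINUE"])
```

Under these assumptions, fifteen discards in a row escalate through REFINE, PIVOT, web search, and finally STOP, while a single keep returns the loop to normal iteration.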
Iteration 12: discard (3rd consecutive failure)
→ REFINE: Try smaller type changes, one file at a time
Iteration 17: discard (5th consecutive failure)
→ PIVOT: Switch from manual typing to codegen from OpenAPI spec
Iteration 23: discard (2nd PIVOT without progress)
→ Web search: "TypeScript eliminate any types best practices"
Found: Use strictNullChecks + noImplicitAny...
Iteration 25: keep
→ Counters reset. Back to normal iteration.

You: (interrupt Codex)
Codex: Autoresearch interrupted.
State saved to autoresearch-results/state.json
Resume with: $codex-autoresearch

You: $codex-autoresearch
Codex: Background run detected (PID 12345).
Status: 15 iterations, metric improved 47→35
You: Stop it.
Codex: Stopping background run...
Final state saved. Resume anytime with $codex-autoresearch.

$codex-autoresearch

Codex: Found interrupted session:
Goal: eliminate any types
Progress: 47→35 (12 iterations)
Last commit: f6g7h8i
Resume from iteration 13?
You: Yes, continue.

# Pseudocode of each iteration
git checkout -b experiment_N
modify_code()
git commit -m "experiment N: {hypothesis}"
verify_result = run(verify_cmd)
guard_result = run(guard_cmd)
if verify_result.improved and guard_result.passed:
    git merge experiment_N
    log("keep")
else:
    git reset --hard HEAD~1
    log("discard")

# Non-interactive, JSON output, exit codes
codex exec -f config.json --max-iterations 20 --timeout 3600

{
  "goal": "Reduce lint warnings to zero",
  "metric": "lint_warnings",
  "current_value": 34,
  "target_value": 0,
  "direction": "lower",
  "verify_cmd": "npm run --silent lint -- --format json | jq '[.[].warningCount] | add'",
  "guard_cmd": "npm test",
  "max_iterations": 30
}

{
  "status": "success",
  "iterations": 18,
  "initial_value": 34,
  "final_value": 0,
  "improvements": 12,
  "discards": 6,
  "elapsed_seconds": 1847
}

You: $codex-autoresearch
Try multiple approaches to reduce latency
Codex: Parallel mode enabled (3 worktrees).
Hypothesis A: Database query optimization
Hypothesis B: Response caching
Hypothesis C: Async I/O refactor
Testing in parallel...
Results:
A: -23ms (keep)
B: -67ms (keep) ← Best
C: +12ms (discard)
Merging B, continuing from new baseline.

lessons.json

{
  "successful_patterns": [
    {
      "goal": "reduce any types",
      "approach": "infer from JSON schema",
      "success_rate": 0.83,
      "avg_improvement": 4.2
    }
  ],
  "failed_patterns": [
    {
      "goal": "reduce any types",
      "approach": "generic type wrappers",
      "failure_rate": 0.91,
      "reason": "introduced more anys downstream"
    }
  ]
}

# Auto-installed with skill
~/.config/codex/hooks/post-session.sh

chmod +x ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh
ln -s ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh \
  ~/.config/codex/hooks/post-session.sh

# Before autoresearch
def process_data(data):  # No types
    result = []
    for item in data:
        result.append(item['value'] * 2)
    return result

# After 8 iterations
from typing import List, Dict, Any

def process_data(data: List[Dict[str, Any]]) -> List[float]:
    result: List[float] = []
    for item in data:
        result.append(float(item['value']) * 2)
    return result

$codex-autoresearch
Improve type hints coverage to 90% with mypy strict mode

// Before (487kb)
import _ from 'lodash';
import moment from 'moment';
import * as utils from './utils';
// After 12 iterations (312kb)
import { debounce, throttle } from 'lodash-es'; // Tree-shakeable
import { formatDate } from 'date-fns/formatDate'; // Targeted import
import { parseJSON, validateEmail } from './utils'; // Explicit imports

$codex-autoresearch
Reduce production bundle size below 350kb

// Before (23 clippy warnings)
fn calculate(x: i32, y: i32) -> i32 {
    let mut result = 0;
    for i in 0..x {
        result = result + y; // clippy: use += instead
    }
    result
}

// After 5 iterations (0 warnings)
fn calculate(x: i32, y: i32) -> i32 {
    x * y // Direct multiplication, clippy-clean
}

$codex-autoresearch
Eliminate all clippy warnings with default lints

# Restart Codex with Full Access
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox

You: Test the verify command manually.
Codex: Running: grep -r "any" src | wc -l
Output: " 47" ← Extra spaces
You: Adjust to: grep -r "any" src | wc -l | xargs
Codex: Updated verify_cmd. Baseline: 47 (clean integer).

grep "discard" autoresearch-results/results.tsv

You: Run guard manually: npm test
Codex: Tests fail intermittently (test_cache_timeout).
Recommendation: Fix flaky test or exclude from guard.
You: Exclude that test from guard.
Codex: Updated guard: npm test -- --ignore test_cache_timeout

# Restore from git (if committed)
git restore autoresearch-results/state.json
# Or start fresh (loses resume state, keeps logs)
rm autoresearch-results/state.json
$codex-autoresearch

You: Set max_iterations to 50 and stop.
Codex: Updated config: max_iterations = 50
Will stop after 50 iterations regardless of target.

You: Target is 80% coverage, stop when reached.
Codex: Updated target_value: 80
Will stop when coverage ≥80%.

# Before starting autoresearch, confirm the metric works
pytest --cov --cov-report=term | grep TOTAL
# Should output a parseable percentage

Codex: Proposed verify: npm run test:coverage
You: Change to: npm run test:coverage -- --json
Codex: Updated. Baseline: 62%

tail -20 autoresearch-results/results.tsv

git log --oneline --grep="autoresearch"

git rebase -i HEAD~20 # Interactive rebase last 20 autoresearch commits

# Example: Use specific Python for pytest
export PYTHON=/usr/bin/python3.11
$codex-autoresearch
# Example: Increase test timeout
export PYTEST_TIMEOUT=300
$codex-autoresearch

You: $code-review
Review the autoresearch improvements from last night.
Codex: Reviewing 12 commits in autoresearch-results/sessions/...
Summary:
- 8 type improvements: Good, no regressions detected
- 3 test additions: Coverage gaps filled correctly
- 1 refactor: Extracted helper, maintains behavior
Recommendation: Merge to main.

You: $codex-autoresearch
Optimize performance
Then:
$benchmark
Compare before/after with flamegraphs

cat autoresearch-results/results.tsv
cat autoresearch-results/state.json

$codex-autoresearch
I want to [measurable goal]
→ Codex proposes config
→ You confirm or adjust
→ Choose foreground/background
→ Walk away or watch
→ Review results.tsv
→ Merge improvements
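
For the "Review results.tsv" step, a short Python sketch like the following could summarize a run before you merge. It assumes the log is tab-separated with a header row naming the columns `iteration`, `commit`, `metric`, `delta`, `status`, and `description`, as in the sample log; the function name `summarize` is hypothetical and the exact file format is not confirmed here.

```python
import csv
import io


def summarize(tsv_text: str) -> dict:
    """Summarize a results log (assumed tab-separated, with a header row)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    baseline = next(row for row in rows if row["status"] == "baseline")
    kept = [row for row in rows if row["status"] == "keep"]
    discarded = [row for row in rows if row["status"] == "discard"]
    # The current value is the metric of the last kept row, or the baseline if nothing was kept.
    latest = kept[-1] if kept else baseline
    return {
        "baseline": float(baseline["metric"]),
        "current": float(latest["metric"]),
        "kept": len(kept),
        "discarded": len(discarded),
    }


# Sample rows copied from the log shown earlier, joined with tabs.
SAMPLE = "\n".join([
    "iteration\tcommit\tmetric\tdelta\tstatus\tdescription",
    "0\ta1b2c3d\t47\t0\tbaseline\tinitial any count",
    "1\tb2c3d4e\t41\t-6\tkeep\treplace any in auth module",
    "2\t-\t49\t+8\tdiscard\tgeneric wrapper introduced new anys",
    "3\td4e5f6g\t38\t-3\tkeep\ttype-narrow API response handlers",
])

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it against a real log, read the file first, for example `summarize(open("autoresearch-results/results.tsv").read())`.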