codex-autoresearch-skill
Codex Autoresearch Skill
What is Codex Autoresearch?
Codex Autoresearch is an autonomous goal-driven experimentation system for Codex that continuously cycles through: modify code → verify result → retain (if improved) or discard (if worse) → repeat indefinitely. Tell Codex what you want to improve, walk away, and come back to a log of experiments and a better codebase.
Key capabilities:
- Autonomous iteration loops (foreground or background)
- Git-based experiment tracking with automatic revert on failure
- Dual-gate verification (did it improve? did anything break?)
- Escalating retry strategies (REFINE → PIVOT → Web search → Stop)
- Cross-run learning from past experiments
- Multiple modes: loop, plan, debug, fix, security, ship, exec
Installation
Recommended: Skill Installer
```bash
# In Codex
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch
```
Then restart Codex.
Manual Installation
```bash
# Clone to Codex skills directory
cd ~/.config/codex/skills
git clone https://github.com/leo-lilinxiao/codex-autoresearch

# Or create a symlink
ln -s /path/to/codex-autoresearch ~/.config/codex/skills/codex-autoresearch
```
Prerequisites for Full Functionality
For background runs and unattended execution:
```bash
# Start Codex with Goals, hooks, and Full Access enabled
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox
```
**For foreground-only usage:** Standard Codex workspace permissions are sufficient.
Core Command
```bash
$codex-autoresearch
```
This is the single entry point. Codex will infer the mode and configuration from your natural language goal.
Basic Usage Patterns
1. Improve Test Coverage
```text
You: $codex-autoresearch
     I want to improve test coverage to at least 80%
Codex: Current coverage: 62% (measured via pytest --cov)
       Target: ≥80%
       Verify: pytest --cov --cov-report=term
       Guard: pytest passes
       Results directory: ./autoresearch-results/
       Run mode: foreground or background?
You: Foreground, go.
Codex: Starting iteration loop, baseline: 62%
```
What happens:
- Codex identifies untested files/functions
- Writes new tests or improves existing ones
- Commits each change
- Runs coverage check
- Keeps if coverage increases, reverts if it decreases or tests fail
- Repeats until target reached or interrupted
2. Eliminate TypeScript `any` Types
```text
You: $codex-autoresearch
     Get rid of all the any types in my TypeScript code
Codex: Found 47 `any` occurrences in src/**/*.ts
       Metric: `any` count (current: 47), direction: lower
       Verify: grep -r "any" src | wc -l
       Guard: npm run type-check (tsc --noEmit)
       Run mode?
You: Background, run overnight.
Codex: Starting background run, baseline: 47. Iterating.
```
Background mode:
- Runs in detached process
- Logs to `autoresearch-results/results.tsv`
- Can be stopped/resumed
- Check progress: `tail -f autoresearch-results/results.tsv`
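The `grep -r "any" src | wc -l` verify command in the transcript counts every line that contains the substring `any`, so identifiers such as `company` or a comment mentioning "many" inflate the metric. Below is a minimal sketch of a stricter count, assuming a word-boundary match is close enough for your codebase. It still counts `any` inside comments and strings, and `count_any_tokens` is a hypothetical helper, not part of the skill.

```python
import re

# Matches `any` only as a standalone word, not inside "many" or "company".
ANY_WORD = re.compile(r"\bany\b")


def count_any_tokens(source: str) -> int:
    """Count standalone `any` tokens in a chunk of TypeScript source."""
    return len(ANY_WORD.findall(source))


sample = """
const company: string = "acme";
function parse(input: any): any { return input; }
// handles many cases
"""

# Three lines contain the substring "any", but only two tokens are the type.
print(count_any_tokens(sample))  # 2
```

A script like this could be wired in as the verify command, provided it prints a single number.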
3. Fix Failing Tests
```text
You: $codex-autoresearch
     Fix the 12 failing tests
Codex: Detected 12 test failures via pytest
       Strategy: Repair one failure at a time
       Verify: pytest (exit code 0 = success)
       Mode?
You: Foreground.
Codex: Iteration 1: test_auth_token_expiry
       Root cause: Clock skew in mock time...
```
4. Optimize Performance
```text
You: $codex-autoresearch
     Reduce API latency below 100ms
Codex: Current p95 latency: 247ms (from load test logs)
       Target: <100ms
       Verify: python benchmark.py --p95
       Guard: pytest integration tests
       Mode?
You: Foreground.
```
5. Security Audit
```text
You: $codex-autoresearch
     Is this code secure?
Codex: Mode: security audit
       Framework: STRIDE + OWASP Top 10
       Scope: src/**/*.py (detected Flask app)
       Findings will be logged with code evidence.
       Run mode?
You: Foreground.
Codex: Finding 1/7: SQL Injection risk in user_search()
       Evidence: Line 42, raw string interpolation...
```
Modes Reference
Codex infers the mode from your natural language, but knowing what each mode does helps you craft better prompts.
| Mode | Trigger Patterns | What It Does |
|---|---|---|
| loop | "improve X", "reduce Y", "optimize Z" | Iterative improvement until target or interrupt |
| plan | "analyze", "what should I improve?", "suggest metrics" | Scans repo, proposes goals and metrics |
| debug | "why is X happening?", "diagnose", "root cause" | Hypothesis-driven debugging with falsifiable tests |
| fix | "fix the N failing tests", "repair", "make tests pass" | Sequential repair of known failures |
| security | "is this secure?", "audit", "STRIDE", "OWASP" | Security analysis with structured findings |
| ship | "ship it", "ready to release?", "pre-deploy check" | Release readiness verification |
| exec | (CI/CD usage, see below) | Non-interactive automation mode |
Configuration
You typically don't need to write config; Codex infers it from your repo and goal.
If you want explicit control, Codex will show you the inferred config and let you adjust it before starting:
```text
Codex: Inferred configuration:
       goal: "eliminate any types"
       scope: "src/**/*.ts"
       metric: any_count
       current_value: 47
       direction: lower
       verify_cmd: "grep -r 'any' src | wc -l"
       guard_cmd: "npm run type-check"
       Adjust anything?
You: Change scope to include test files too.
Codex: Updated scope: "{src,test}/**/*.ts"
       Current value: 63 (including tests)
       Proceed?
```
Explicit Config (Advanced)
For CI/CD or scripted use, you can provide a JSON config:
```json
{
  "goal": "Reduce bundle size",
  "metric": "bundle_kb",
  "current_value": 487,
  "target_value": 300,
  "direction": "lower",
  "verify_cmd": "npm run build && du -k dist/bundle.js | cut -f1",
  "guard_cmd": "npm test",
  "scope": "src/**/*.{ts,tsx}",
  "max_iterations": 50
}
```
```bash
# Save to file, then:
codex exec -f autoresearch_config.json
```
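Before handing a config file to a pipeline, it can help to sanity-check it locally. The sketch below is illustrative only: which keys are mandatory, the `"higher"` direction value, and the inclusive target comparison are all assumptions, since the source shows only `"lower"` and does not list required fields. `load_config` and `target_reached` are hypothetical helpers.

```python
import json

# Assumed-required keys; the source does not say which fields are mandatory.
REQUIRED_KEYS = {"goal", "metric", "direction", "verify_cmd", "guard_cmd"}


def load_config(text: str) -> dict:
    """Parse a config and fail early on obviously incomplete input."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if config["direction"] not in ("lower", "higher"):  # "higher" is assumed
        raise ValueError(f"unknown direction: {config['direction']}")
    return config


def target_reached(config: dict, value: float) -> bool:
    """True once the metric meets target_value in the configured direction."""
    target = config.get("target_value")
    if target is None:
        return False  # no target: only max_iterations can end the run
    if config["direction"] == "lower":
        return value <= target
    return value >= target


example = load_config("""
{
  "goal": "Reduce bundle size",
  "metric": "bundle_kb",
  "current_value": 487,
  "target_value": 300,
  "direction": "lower",
  "verify_cmd": "npm run build && du -k dist/bundle.js | cut -f1",
  "guard_cmd": "npm test",
  "max_iterations": 50
}
""")
print(target_reached(example, 487))  # False
print(target_reached(example, 300))  # True
```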
Results and State Files
All runs create an `autoresearch-results/` directory in your workspace root:
```text
autoresearch-results/
├── results.tsv      # Full experiment log (audit trail)
├── state.json       # Resume state (last consistent checkpoint)
├── lessons.json     # Cross-run learning (what worked/failed)
└── sessions/
    └── 2026-05-16_14-23-01/
        ├── experiment_1_keep.diff
        ├── experiment_2_discard.diff
        └── ...
```
Reading the Results Log
```bash
# View all experiments
cat autoresearch-results/results.tsv

# Watch live (during background run)
tail -f autoresearch-results/results.tsv

# Filter successful improvements
grep "keep" autoresearch-results/results.tsv
```
**Example log:**
```text
iteration  commit   metric  delta  status    description
0          a1b2c3d  47      0      baseline  initial any count
1          b2c3d4e  41      -6     keep      replace any in auth module
2          -        49      +8     discard   generic wrapper introduced new anys
3          d4e5f6g  38      -3     keep      type-narrow API response handlers
4          e5f6g7h  38      0      discard   refactor had no effect
5          f6g7h8i  35      -3     keep      infer types from JSON schema
```
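For anything beyond `grep`, the log can be summarized with a few lines of Python. This is a sketch, assuming the file is tab-separated with the header row shown in the example log (`iteration`, `commit`, `metric`, `delta`, `status`, `description`); `summarize` is a hypothetical helper, and the sample rows are copied from the example.

```python
import csv
import io


def summarize(tsv_text: str) -> dict:
    """Count kept/discarded experiments and report baseline vs. current metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    kept = [r for r in rows if r["status"] == "keep"]
    discarded = [r for r in rows if r["status"] == "discard"]
    baseline = next(r for r in rows if r["status"] == "baseline")
    latest = kept[-1] if kept else baseline  # discards never change the metric
    return {
        "kept": len(kept),
        "discarded": len(discarded),
        "baseline": float(baseline["metric"]),
        "current": float(latest["metric"]),
    }


SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["iteration", "commit", "metric", "delta", "status", "description"],
    ["0", "a1b2c3d", "47", "0", "baseline", "initial any count"],
    ["1", "b2c3d4e", "41", "-6", "keep", "replace any in auth module"],
    ["2", "-", "49", "+8", "discard", "generic wrapper introduced new anys"],
    ["3", "d4e5f6g", "38", "-3", "keep", "type-narrow API response handlers"],
])

print(summarize(SAMPLE))
# {'kept': 2, 'discarded': 1, 'baseline': 47.0, 'current': 38.0}
```

To run it on a real log, pass the contents of `autoresearch-results/results.tsv` instead of `SAMPLE`.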
Escalation Strategy (When Stuck)
The loop doesn't blindly retry. It escalates:
| Trigger | Action |
|---|---|
| 3 consecutive failures | REFINE: Adjust within current strategy |
| 5 consecutive failures | PIVOT: Try fundamentally different approach |
| 2 PIVOTs without progress | Web search: Look for external solutions |
| 3 PIVOTs without progress | STOP: Report that human input needed |
One success resets all counters.
Example:
```text
Iteration 12: discard (3rd consecutive failure)
  → REFINE: Try smaller type changes, one file at a time
Iteration 17: discard (5th consecutive failure)
  → PIVOT: Switch from manual typing to codegen from OpenAPI spec
Iteration 23: discard (2nd PIVOT without progress)
  → Web search: "TypeScript eliminate any types best practices"
  Found: Use strictNullChecks + noImplicitAny...
Iteration 25: keep
  → Counters reset. Back to normal iteration.
```
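The escalation table can be read as a small counter-based state machine. The sketch below is one possible interpretation, not the skill's actual implementation: it assumes the failure counter restarts after each PIVOT and that "N PIVOTs without progress" means the N-th time the five-failure threshold is hit since the last keep. `Escalation` and the action names returned by `record` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    failures: int = 0  # consecutive discards since the last keep or PIVOT (assumed)
    pivots: int = 0    # five-failure thresholds hit since the last keep

    def record(self, kept: bool) -> str:
        """Update counters for one iteration and return the next action."""
        if kept:
            # One success resets all counters.
            self.failures = 0
            self.pivots = 0
            return "CONTINUE"
        self.failures += 1
        if self.failures == 3:
            return "REFINE"
        if self.failures >= 5:
            self.failures = 0
            self.pivots += 1
            if self.pivots >= 3:
                return "STOP"
            if self.pivots == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"


tracker = Escalation()
print([tracker.record(kept=False) for _ in range(5)])
# ['CONTINUE', 'CONTINUE', 'REFINE', 'CONTINUE', 'PIVOT']
```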
Stopping and Resuming
Stop a Foreground Run
```text
You: (interrupt Codex)
Codex: Autoresearch interrupted.
       State saved to autoresearch-results/state.json
       Resume with: $codex-autoresearch
```
Stop a Background Run
```text
You: $codex-autoresearch
Codex: Background run detected (PID 12345).
       Status: 15 iterations, metric improved 47→35
You: Stop it.
Codex: Stopping background run...
       Final state saved. Resume anytime with $codex-autoresearch.
```
Resume
```bash
$codex-autoresearch
```
Codex automatically detects `state.json` and offers to resume:
```text
Codex: Found interrupted session:
       Goal: eliminate any types
       Progress: 47→35 (12 iterations)
       Last commit: f6g7h8i
       Resume from iteration 13?
You: Yes, continue.
```
Dual-Gate Verification
Every iteration runs two checks:
- Verify: Did the metric improve?
- Guard: Did anything break?
```python
# Pseudocode of each iteration
git checkout -b experiment_N          # branch off the current baseline
modify_code()
git commit -m "experiment N: {hypothesis}"
verify_result = run(verify_cmd)       # Verify: did the metric improve?
guard_result = run(guard_cmd)         # Guard: did anything break?
git checkout -                        # return to the baseline branch
if verify_result.improved and guard_result.passed:
    git merge experiment_N            # keep: the baseline advances
    log("keep")
else:
    git branch -D experiment_N        # discard: the baseline is untouched
    log("discard")
```
**Example:**
- **Verify:** `pytest --cov` (did coverage increase?)
- **Guard:** `pytest` (did all tests still pass?)

A change that increases coverage but breaks tests is **discarded**.
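The keep/discard decision itself reduces to a small pure function. This is a sketch under stated assumptions: improvement is strict (matching the example log, where a delta of 0 is discarded), the guard passes when its exit code is 0, and `"higher"` is an assumed counterpart to the `"lower"` direction shown in the configs. `decide` is a hypothetical name.

```python
def decide(previous: float, current: float, direction: str, guard_exit_code: int) -> str:
    """Return "keep" only if the metric strictly improved AND the guard passed."""
    if direction == "lower":
        improved = current < previous
    elif direction == "higher":  # assumed value; the source only shows "lower"
        improved = current > previous
    else:
        raise ValueError(f"unknown direction: {direction}")
    guard_passed = guard_exit_code == 0  # assumption: exit code 0 means success
    return "keep" if improved and guard_passed else "discard"


print(decide(47, 41, "lower", 0))   # keep: metric dropped, guard passed
print(decide(38, 38, "lower", 0))   # discard: no improvement
print(decide(62, 70, "higher", 1))  # discard: coverage rose but tests broke
```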
CI/CD Mode (exec)
For automation pipelines, use `exec` mode:
```bash
# Non-interactive, JSON output, exit codes
codex exec -f config.json --max-iterations 20 --timeout 3600
```
**Exit codes:**
- `0`: Target reached
- `1`: Max iterations reached without target
- `2`: Error or guard failure

**Example config.json:**
```json
{
  "goal": "Reduce lint warnings to zero",
  "metric": "lint_warnings",
  "current_value": 34,
  "target_value": 0,
  "direction": "lower",
  "verify_cmd": "npm run lint -- --format json | jq 'length'",
  "guard_cmd": "npm test",
  "max_iterations": 30
}
```
Output (JSON):
```json
{
  "status": "success",
  "iterations": 18,
  "initial_value": 34,
  "final_value": 0,
  "improvements": 12,
  "discards": 6,
  "elapsed_seconds": 1847
}
```
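In a pipeline, the exit code and JSON report are typically consumed together. The following sketch turns both into a one-line status message. It relies only on the exit codes and report fields listed in this section; the assumption that a failed run may print no JSON at all is mine, and `describe_run` is a hypothetical helper.

```python
import json

EXIT_MEANINGS = {
    0: "target reached",
    1: "max iterations reached without target",
    2: "error or guard failure",
}


def describe_run(exit_code: int, stdout: str) -> str:
    """Combine the exit code and the JSON report into a short status line."""
    meaning = EXIT_MEANINGS.get(exit_code, "unknown exit code")
    try:
        report = json.loads(stdout)
    except json.JSONDecodeError:
        # Assumption: a failed run may not emit a report.
        return f"exit {exit_code}: {meaning} (no JSON report)"
    return (
        f"exit {exit_code}: {meaning}; "
        f"{report['initial_value']} -> {report['final_value']} "
        f"in {report['iterations']} iterations"
    )


report_text = '{"status": "success", "iterations": 18, "initial_value": 34, "final_value": 0}'
print(describe_run(0, report_text))
# exit 0: target reached; 34 -> 0 in 18 iterations
```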
Advanced Features
Parallel Experiments (Git Worktrees)
Test up to 3 hypotheses simultaneously:
```text
You: $codex-autoresearch
     Try multiple approaches to reduce latency
Codex: Parallel mode enabled (3 worktrees).
       Hypothesis A: Database query optimization
       Hypothesis B: Response caching
       Hypothesis C: Async I/O refactor
       Testing in parallel...
       Results:
       A: -23ms (keep)
       B: -67ms (keep) ← Best
       C: +12ms (discard)
       Merging B, continuing from new baseline.
```
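The selection step in the transcript (merge the best of the improving hypotheses) can be sketched as follows. This is an illustration of the idea only; how the skill actually ranks parallel results is not documented here, and `pick_best` is a hypothetical helper that takes metric deltas keyed by hypothesis name.

```python
from typing import Optional


def pick_best(deltas: dict, direction: str = "lower") -> Optional[str]:
    """Return the hypothesis with the largest improvement, or None if none improved."""
    sign = -1 if direction == "lower" else 1
    # Keep only hypotheses that moved the metric in the desired direction.
    improved = {name: delta for name, delta in deltas.items() if sign * delta > 0}
    if not improved:
        return None
    return max(improved, key=lambda name: sign * improved[name])


print(pick_best({"A": -23, "B": -67, "C": 12}))  # B
```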
Cross-Run Learning
The `lessons.json` file accumulates knowledge:
```json
{
  "successful_patterns": [
    {
      "goal": "reduce any types",
      "approach": "infer from JSON schema",
      "success_rate": 0.83,
      "avg_improvement": 4.2
    }
  ],
  "failed_patterns": [
    {
      "goal": "reduce any types",
      "approach": "generic type wrappers",
      "failure_rate": 0.91,
      "reason": "introduced more anys downstream"
    }
  ]
}
```
Future runs bias toward proven approaches and away from known failures.
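One way to picture "bias toward proven approaches" is a ranking over the entries in `lessons.json`. The sketch below uses only the fields shown in the example file; the ranking rule and the 0.5 failure-rate cutoff are assumptions for illustration, and `rank_approaches` is a hypothetical helper rather than the skill's own logic.

```python
def rank_approaches(lessons: dict, goal: str, max_failure_rate: float = 0.5) -> list:
    """List approaches for a goal, best success rate first, skipping known failures."""
    blocked = {
        p["approach"]
        for p in lessons.get("failed_patterns", [])
        if p["goal"] == goal and p["failure_rate"] > max_failure_rate
    }
    candidates = [
        p
        for p in lessons.get("successful_patterns", [])
        if p["goal"] == goal and p["approach"] not in blocked
    ]
    candidates.sort(key=lambda p: p["success_rate"], reverse=True)
    return [p["approach"] for p in candidates]


lessons = {
    "successful_patterns": [
        {"goal": "reduce any types", "approach": "infer from JSON schema",
         "success_rate": 0.83, "avg_improvement": 4.2},
    ],
    "failed_patterns": [
        {"goal": "reduce any types", "approach": "generic type wrappers",
         "failure_rate": 0.91, "reason": "introduced more anys downstream"},
    ],
}
print(rank_approaches(lessons, "reduce any types"))  # ['infer from JSON schema']
```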
Session Hooks (Auto-Persistence)
Hooks keep Codex on track across session boundaries:
```bash
# Auto-installed with skill
~/.config/codex/hooks/post-session.sh
```
On every Codex session end:
1. Saves current state to `state.json`
2. Commits results log to git (if repo)
3. Backs up lessons learned

**Manual hook setup (if needed):**
```bash
chmod +x ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh
ln -s ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh \
  ~/.config/codex/hooks/post-session.sh
```
Real-World Examples
Python: Improve Type Coverage
```python
# Before autoresearch
def process_data(data):  # No types
    result = []
    for item in data:
        result.append(item['value'] * 2)
    return result

# After 8 iterations
from typing import List, Dict, Any

def process_data(data: List[Dict[str, Any]]) -> List[float]:
    result: List[float] = []
    for item in data:
        result.append(float(item['value']) * 2)
    return result
```
**Command:**
```text
$codex-autoresearch
Improve type hints coverage to 90% with mypy strict mode
```
JavaScript: Reduce Bundle Size
```javascript
// Before (487kb)
import _ from 'lodash';
import moment from 'moment';
import * as utils from './utils';

// After 12 iterations (312kb)
import { debounce, throttle } from 'lodash-es'; // Tree-shakeable
import { format } from 'date-fns'; // Replaces moment; named import is tree-shakeable
import { parseJSON, validateEmail } from './utils'; // Explicit imports
```
Command:
```text
$codex-autoresearch
Reduce production bundle size below 350kb
```
Rust: Eliminate Clippy Warnings
```rust
// Before (23 clippy warnings)
fn calculate(x: i32, y: i32) -> i32 {
    let mut result = 0;
    for i in 0..x {
        result = result + y; // clippy: use += instead
    }
    result
}

// After 5 iterations (0 warnings)
fn calculate(x: i32, y: i32) -> i32 {
    x * y // Direct multiplication (same result for x >= 0), clippy-clean
}
```
Command:
```text
$codex-autoresearch
Eliminate all clippy warnings with default lints
```
Troubleshooting
"Background run failed to start"
Cause: Codex not started with `--dangerously-bypass-approvals-and-sandbox`
Fix:
```bash
# Restart Codex with Full Access
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox
```
Or use foreground mode instead.
"Verify command failed"
Cause: The verify command isn't executable or returns an unexpected format
Fix:
```text
You: Test the verify command manually.
Codex: Running: grep -r "any" src | wc -l
       Output: "      47" ← Extra spaces
You: Adjust to: grep -r "any" src | wc -l | xargs
Codex: Updated verify_cmd. Baseline: 47 (clean integer).
```
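The whitespace problem in the transcript is a parsing issue: the metric has to come out of the verify command as a single clean number. Below is a sketch of tolerant parsing, assuming the verify command is meant to print exactly one numeric line. `parse_metric` is a hypothetical helper, not how the skill is known to read output.

```python
def parse_metric(output: str) -> float:
    """Turn raw verify-command output into a number, tolerating padding."""
    text = output.strip()
    if not text:
        raise ValueError("verify command printed nothing")
    lines = text.splitlines()
    if len(lines) != 1:
        raise ValueError(f"expected one line of output, got {len(lines)}")
    return float(lines[0])  # raises ValueError for non-numeric text


print(parse_metric("      47\n"))  # 47.0
```

If your own verify command fails a check like this, piping it through `xargs` (as in the transcript) or `tr -d ' '` is usually enough.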
"All experiments discarded, no progress"
Cause: Guard too strict (e.g., flaky tests) or goal unachievable
Check results log:
```bash
grep "discard" autoresearch-results/results.tsv
```
Common reasons:
- Guard fails even on unchanged code (flaky tests)
- Metric can't be improved with current tooling
- Scope too broad (try narrowing)

Fix:
```text
You: Run guard manually: npm test
Codex: Tests fail intermittently (test_cache_timeout).
       Recommendation: Fix flaky test or exclude from guard.
You: Exclude that test from guard.
Codex: Updated guard: npm test -- --ignore test_cache_timeout
```
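A quick way to confirm a flaky guard is to run it several times on unchanged code and see whether the pass/fail outcome varies. The sketch below does that with `subprocess`; the helper names are hypothetical, and the demo uses a trivial Python command as a stand-in for a real guard such as `["npm", "test"]`.

```python
import subprocess
import sys


def guard_exit_codes(command: list, runs: int = 5) -> list:
    """Run the guard command repeatedly and collect its exit codes."""
    return [subprocess.run(command, capture_output=True).returncode for _ in range(runs)]


def is_flaky(exit_codes: list) -> bool:
    """A guard is flaky if it both passed and failed on the same code."""
    outcomes = {code == 0 for code in exit_codes}
    return len(outcomes) > 1


# Stand-in guard that always succeeds; replace with your real guard command.
stable = guard_exit_codes([sys.executable, "-c", "pass"], runs=3)
print(stable, is_flaky(stable))  # [0, 0, 0] False
```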
"State file corrupted"
Cause: Interrupted during JSON write
Fix:
```bash
# Restore from git (if committed)
git restore autoresearch-results/state.json

# Or start fresh (loses resume state, keeps logs)
rm autoresearch-results/state.json
$codex-autoresearch
```
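Before restoring or deleting the state file, it is worth confirming that it really is unparseable. A minimal check, assuming `state.json` is plain UTF-8 JSON as its name suggests (`state_is_readable` is a hypothetical helper; the demo writes throwaway files to a temporary directory):

```python
import json
import tempfile
from pathlib import Path


def state_is_readable(path: str) -> bool:
    """True if the file exists and parses as JSON."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or truncated JSON
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text('{"iteration": 12}', encoding="utf-8")
    bad = Path(tmp) / "bad.json"
    bad.write_text('{"iteration": 1', encoding="utf-8")  # simulates a cut-off write
    results = (state_is_readable(str(good)), state_is_readable(str(bad)))

print(results)  # (True, False)
```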
"Infinite loop, no termination"
Cause: No target value set, or metric unstable
Fix:
```text
You: Set max_iterations to 50 and stop.
Codex: Updated config: max_iterations = 50
       Will stop after 50 iterations regardless of target.
```
Or add explicit target:
```text
You: Target is 80% coverage, stop when reached.
Codex: Updated target_value: 80
       Will stop when coverage ≥80%.
```
Best Practices
1. Start with Small, Measurable Goals
❌ "Make the code better"
✅ "Reduce ESLint warnings from 42 to 0"
2. Verify Your Verify Command First
```bash
# Before starting autoresearch, confirm the metric works
pytest --cov --cov-report=term | grep TOTAL
# Should output a parseable percentage
```
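"Parseable" here means a number can be extracted reliably from the `TOTAL` line. The sketch below shows one way to do that, assuming the usual terminal report layout where the `TOTAL` row ends with a percentage. The sample report is made up for illustration, and `total_coverage` is a hypothetical helper.

```python
import re


def total_coverage(report: str) -> float:
    """Extract the percentage from the TOTAL row of a terminal coverage report."""
    for line in report.splitlines():
        if line.startswith("TOTAL"):
            match = re.search(r"(\d+(?:\.\d+)?)%", line)
            if match:
                return float(match.group(1))
    raise ValueError("no TOTAL line with a percentage found")


# Illustrative report text, not real project output.
sample_report = """\
Name          Stmts   Miss  Cover
---------------------------------
app/auth.py      40     10    75%
app/api.py       60     28    53%
---------------------------------
TOTAL           100     38    62%
"""
print(total_coverage(sample_report))  # 62.0
```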
3. Use Foreground for New Goals
Run in the foreground first to watch the loop and verify its behavior. Switch to background once you are confident.
4. Let Codex Infer, Then Adjust
Don't write config upfront. Let Codex propose, then refine:
```text
Codex: Proposed verify: npm run test:coverage
You: Change to: npm run test:coverage -- --json
Codex: Updated. Baseline: 62%
```
5. Check Results Log After Each Run
```bash
tail -20 autoresearch-results/results.tsv
```
Understand what worked and what didn't. This informs your next goal.
6. Use Git Strategically
Autoresearch commits every experiment. Your git log becomes the audit trail:
```bash
git log --oneline --grep="autoresearch"
```
To squash experiments into clean commits after the run:
```bash
git rebase -i HEAD~20  # Interactive rebase of the last 20 autoresearch commits
```
Environment Variables
Autoresearch respects standard tool configs via environment:
```bash
# Example: Use specific Python for pytest
export PYTHON=/usr/bin/python3.11
$codex-autoresearch

# Example: Increase test timeout
export PYTEST_TIMEOUT=300
$codex-autoresearch
```
No secrets needed: autoresearch runs local tools, no external API calls.
Integration with Other Codex Skills
Combine autoresearch with other skills:
```text
You: $code-review
     Review the autoresearch improvements from last night.
Codex: Reviewing 12 commits in autoresearch-results/sessions/...
       Summary:
       - 8 type improvements: Good, no regressions detected
       - 3 test additions: Coverage gaps filled correctly
       - 1 refactor: Extracted helper, maintains behavior
       Recommendation: Merge to main.
```
```text
You: $codex-autoresearch
     Optimize performance
Then:
     $benchmark
     Compare before/after with flamegraphs
```
Limitations
- Requires git — All experiments are git-based (commit/revert cycle)
- Local tools only — Verify and guard must be executable commands in your environment
- No multi-repo (yet) — Operates within a single workspace root
- Deterministic metrics work best — Flaky metrics lead to false discards
Getting Help
If autoresearch behaves unexpectedly:
- Check the results log: `cat autoresearch-results/results.tsv`
- Review state file: `cat autoresearch-results/state.json`
- Run verify manually: Test your verify command outside autoresearch
- Ask Codex to explain: `$codex-autoresearch` then "explain the last 5 iterations"
Summary
Codex Autoresearch is a single-command autonomous improvement loop:
```bash
$codex-autoresearch
```
You describe the goal in natural language. Codex infers the config, confirms with you, then iterates:
modify → commit → verify → keep or discard → repeat
Foreground for interactive runs. Background for overnight. Results logged, state resumable, lessons learned across runs.
Most common workflow:
```text
$codex-autoresearch
I want to [measurable goal]
→ Codex proposes config
→ You confirm or adjust
→ Choose foreground/background
→ Walk away or watch
→ Review results.tsv
→ Merge improvements
```
That's it. Autonomous iteration with human-in-the-loop goal-setting.