codex-autoresearch-skill

Codex Autoresearch Skill

Skill by ara.so — Codex Skills collection.

What is Codex Autoresearch?

Codex Autoresearch is an autonomous goal-driven experimentation system for Codex that continuously cycles through: modify code → verify result → retain (if improved) or discard (if worse) → repeat indefinitely. Tell Codex what you want to improve, walk away, and come back to a log of experiments and a better codebase.
Key capabilities:
  • Autonomous iteration loops (foreground or background)
  • Git-based experiment tracking with automatic revert on failure
  • Dual-gate verification (did it improve? did anything break?)
  • Escalating retry strategies (REFINE → PIVOT → Web search → Stop)
  • Cross-run learning from past experiments
  • Multiple modes: loop, plan, debug, fix, security, ship, exec

Installation

Recommended: Skill Installer

In Codex


Then restart Codex.

Manual Installation

bash
# Clone to the Codex skills directory
cd ~/.config/codex/skills
git clone https://github.com/leo-lilinxiao/codex-autoresearch

# Or create a symlink
ln -s /path/to/codex-autoresearch ~/.config/codex/skills/codex-autoresearch

Prerequisites for Full Functionality

For background runs and unattended execution:
bash
# Start Codex with Goals, hooks, and Full Access enabled
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox

**For foreground-only usage:** Standard Codex workspace permissions are sufficient.

Core Command

bash
$codex-autoresearch
This is the single entry point. Codex will infer the mode and configuration from your natural language goal.

Basic Usage Patterns

1. Improve Test Coverage

text
You:   $codex-autoresearch
       I want to improve test coverage to at least 80%

Codex: Current coverage: 62% (measured via pytest --cov)
       Target: ≥80%
       Verify: pytest --cov --cov-report=term
       Guard: pytest passes
       Results directory: ./autoresearch-results/
       Run mode: foreground or background?

You:   Foreground, go.

Codex: Starting iteration loop — baseline: 62%
What happens:
  • Codex identifies untested files/functions
  • Writes new tests or improves existing ones
  • Commits each change
  • Runs coverage check
  • Keeps if coverage increases, reverts if it decreases or tests fail
  • Repeats until target reached or interrupted

2. Eliminate TypeScript `any` Types

text
You:   $codex-autoresearch
       Get rid of all the any types in my TypeScript code

Codex: Found 47 `any` occurrences in src/**/*.ts
       Metric: `any` count (current: 47), direction: lower
       Verify: grep -r "any" src | wc -l
       Guard: npm run type-check (tsc --noEmit)
       Run mode?

You:   Background, run overnight.

Codex: Starting background run — baseline: 47. Iterating.
Background mode:
  • Runs in a detached process
  • Logs to `autoresearch-results/results.tsv`
  • Can be stopped and resumed
  • Check progress: `tail -f autoresearch-results/results.tsv`
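
One caveat on the metric in this transcript: `grep -r "any" src | wc -l` counts every line containing the substring `any`, so identifiers such as `company` or comments containing "many" inflate the count. A stricter counter may give a steadier signal. The sketch below is illustrative only: the helper name `count_any_annotations` and the regular expression are assumptions, not part of the skill, and the pattern covers `: any` annotations and `as any` casts but not generics such as `Array<any>`.

```python
import re

# Matches ": any" type annotations and "as any" casts.
# Generic arguments such as Array<any> are not covered by this simple pattern.
ANY_PATTERN = re.compile(r":\s*any\b|\bas\s+any\b")


def count_any_annotations(source: str) -> int:
    """Count explicit `any` annotations and casts in TypeScript source text."""
    return len(ANY_PATTERN.findall(source))


sample = "const company: any = load() as any; // many callers"
print(count_any_annotations(sample))  # 2
```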

3. Fix Failing Tests

text
You:   $codex-autoresearch
       Fix the 12 failing tests

Codex: Detected 12 test failures via pytest
       Strategy: Repair one failure at a time
       Verify: pytest (exit code 0 = success)
       Mode?

You:   Foreground.

Codex: Iteration 1: test_auth_token_expiry
       Root cause: Clock skew in mock time...

4. Optimize Performance

text
You:   $codex-autoresearch
       Reduce API latency below 100ms

Codex: Current p95 latency: 247ms (from load test logs)
       Target: <100ms
       Verify: python benchmark.py --p95
       Guard: pytest integration tests
       Mode?

You:   Foreground.

5. Security Audit

text
You:   $codex-autoresearch
       Is this code secure?

Codex: Mode: security audit
       Framework: STRIDE + OWASP Top 10
       Scope: src/**/*.py (detected Flask app)
       Findings will be logged with code evidence.
       Run mode?

You:   Foreground.

Codex: Finding 1/7: SQL Injection risk in user_search()
       Evidence: Line 42, raw string interpolation...

Modes Reference

Codex infers the mode from your natural language, but understanding them helps you craft better prompts.
| Mode | Trigger Patterns | What It Does |
|------|------------------|--------------|
| loop | "improve X", "reduce Y", "optimize Z" | Iterative improvement until target or interrupt |
| plan | "analyze", "what should I improve?", "suggest metrics" | Scans repo, proposes goals and metrics |
| debug | "why is X happening?", "diagnose", "root cause" | Hypothesis-driven debugging with falsifiable tests |
| fix | "fix the N failing tests", "repair", "make tests pass" | Sequential repair of known failures |
| security | "is this secure?", "audit", "STRIDE", "OWASP" | Security analysis with structured findings |
| ship | "ship it", "ready to release?", "pre-deploy check" | Release readiness verification |
| exec | (CI/CD usage, see below) | Non-interactive automation mode |

Configuration

You typically don't need to write config. Codex infers from your repo and goal.
But if you want explicit control, Codex will show you the inferred config and let you adjust before starting:
text
Codex: Inferred configuration:
       goal: "eliminate any types"
       scope: "src/**/*.ts"
       metric: any_count
       current_value: 47
       direction: lower
       verify_cmd: "grep -r 'any' src | wc -l"
       guard_cmd: "npm run type-check"
       
       Adjust anything?

You:   Change scope to include test files too.

Codex: Updated scope: "{src,test}/**/*.ts"
       Current value: 63 (including tests)
       Proceed?

Explicit Config (Advanced)

For CI/CD or scripted use, you can provide a JSON config:
json
{
  "goal": "Reduce bundle size",
  "metric": "bundle_kb",
  "current_value": 487,
  "target_value": 300,
  "direction": "lower",
  "verify_cmd": "npm run build && du -k dist/bundle.js | cut -f1",
  "guard_cmd": "npm test",
  "scope": "src/**/*.{ts,tsx}",
  "max_iterations": 50
}
bash
# Save to file, then:
codex exec -f autoresearch_config.json
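
Before handing a config file to a pipeline, it can help to check it for obvious gaps. The following is a minimal sketch under stated assumptions: the set of keys treated as required and the `"higher"` direction value are guesses based on the example config, and `load_config` is a hypothetical helper, not something the skill provides.

```python
import json

# Keys treated as mandatory here are an assumption based on the example config.
REQUIRED_KEYS = {"goal", "metric", "direction", "verify_cmd", "guard_cmd"}


def load_config(text: str) -> dict:
    """Parse a JSON config document and reject obviously incomplete ones."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # "lower" appears in the examples; "higher" is assumed as its counterpart.
    if config["direction"] not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {config['direction']!r}")
    return config


example = json.dumps({
    "goal": "Reduce bundle size",
    "metric": "bundle_kb",
    "direction": "lower",
    "verify_cmd": "npm run build && du -k dist/bundle.js | cut -f1",
    "guard_cmd": "npm test",
})
print(load_config(example)["metric"])  # bundle_kb
```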

Results and State Files

All runs create an `autoresearch-results/` directory in your workspace root:
autoresearch-results/
├── results.tsv          # Full experiment log (audit trail)
├── state.json           # Resume state (last consistent checkpoint)
├── lessons.json         # Cross-run learning (what worked/failed)
└── sessions/
    └── 2026-05-16_14-23-01/
        ├── experiment_1_keep.diff
        ├── experiment_2_discard.diff
        └── ...

Reading the Results Log

bash
# View all experiments
cat autoresearch-results/results.tsv

# Watch live (during a background run)
tail -f autoresearch-results/results.tsv

# Filter successful improvements
grep "keep" autoresearch-results/results.tsv

**Example log:**
iteration  commit   metric  delta  status    description
0          a1b2c3d  47      0      baseline  initial any count
1          b2c3d4e  41      -6     keep      replace any in auth module
2          -        49      +8     discard   generic wrapper introduced new anys
3          d4e5f6g  38      -3     keep      type-narrow API response handlers
4          e5f6g7h  38      0      discard   refactor had no effect
5          f6g7h8i  35      -3     keep      infer types from JSON schema
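
For anything beyond `grep`, the log can be read with a few lines of Python. This is a sketch that assumes the file is tab-separated with the header row shown in the example log; the function name `summarize_results` and the shape of its return value are illustrative choices.

```python
import csv
import io

# Rows copied from the example log; joined with tabs to mimic results.tsv.
ROWS = [
    ("iteration", "commit", "metric", "delta", "status", "description"),
    ("0", "a1b2c3d", "47", "0", "baseline", "initial any count"),
    ("1", "b2c3d4e", "41", "-6", "keep", "replace any in auth module"),
    ("2", "-", "49", "+8", "discard", "generic wrapper introduced new anys"),
    ("3", "d4e5f6g", "38", "-3", "keep", "type-narrow API response handlers"),
    ("4", "e5f6g7h", "38", "0", "discard", "refactor had no effect"),
    ("5", "f6g7h8i", "35", "-3", "keep", "infer types from JSON schema"),
]
sample_tsv = "\n".join("\t".join(row) for row in ROWS)


def summarize_results(tsv_text: str) -> dict:
    """Count kept and discarded experiments and report the last kept metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    kept = [row for row in rows if row["status"] == "keep"]
    discarded = [row for row in rows if row["status"] == "discard"]
    return {
        "kept": len(kept),
        "discarded": len(discarded),
        "final_metric": int(kept[-1]["metric"]) if kept else None,
    }


print(summarize_results(sample_tsv))
```

To run it against a real log, pass the file contents instead of `sample_tsv`, for example `summarize_results(open("autoresearch-results/results.tsv").read())`.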

Escalation Strategy (When Stuck)

The loop doesn't blindly retry. It escalates:
| Trigger | Action |
|---------|--------|
| 3 consecutive failures | REFINE: adjust within the current strategy |
| 5 consecutive failures | PIVOT: try a fundamentally different approach |
| 2 PIVOTs without progress | Web search: look for external solutions |
| 3 PIVOTs without progress | STOP: report that human input is needed |
One success resets all counters.
Example:
text
Iteration 12: discard (3rd consecutive failure)
→ REFINE: Try smaller type changes, one file at a time

Iteration 17: discard (5th consecutive failure)
→ PIVOT: Switch from manual typing to codegen from OpenAPI spec

Iteration 23: discard (2nd PIVOT without progress)
→ Web search: "TypeScript eliminate any types best practices"
  Found: Use strictNullChecks + noImplicitAny...

Iteration 25: keep
→ Counters reset. Back to normal iteration.
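
The escalation thresholds can be sketched as a small state machine. This is a minimal illustration, not the skill's implementation: the class name `Escalation`, the action strings, and the choice to reset the failure counter after each PIVOT are all assumptions, since the source does not say how the counters interact.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    """Tracks consecutive failures and pivots; any success resets both."""

    failures: int = 0
    pivots: int = 0

    def record(self, kept: bool) -> str:
        """Record one iteration outcome and return the next action."""
        if kept:
            # One success resets all counters.
            self.failures = 0
            self.pivots = 0
            return "CONTINUE"
        self.failures += 1
        if self.failures == 3:
            return "REFINE"
        if self.failures >= 5:
            # Assumption: a pivot starts a fresh failure count.
            self.failures = 0
            self.pivots += 1
            if self.pivots >= 3:
                return "STOP"
            if self.pivots == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"


tracker = Escalation()
actions = [tracker.record(kept=False) for _ in range(15)]
print(actions[2], actions[4], actions[9], actions[14])
```

Under these assumptions, fifteen straight discards walk through REFINE, PIVOT, a web search on the second pivot, and STOP on the third.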

Stopping and Resuming

Stop a Foreground Run

text
You:   (interrupt Codex)

Codex: Autoresearch interrupted.
       State saved to autoresearch-results/state.json
       Resume with: $codex-autoresearch

Stop a Background Run

text
You:   $codex-autoresearch

Codex: Background run detected (PID 12345).
       Status: 15 iterations, metric improved 47→35
       
You:   Stop it.

Codex: Stopping background run...
       Final state saved. Resume anytime with $codex-autoresearch.

Resume

bash
$codex-autoresearch
Codex automatically detects `state.json` and offers to resume:
text
Codex: Found interrupted session:
       Goal: eliminate any types
       Progress: 47→35 (12 iterations)
       Last commit: f6g7h8i
       Resume from iteration 13?

You:   Yes, continue.

Dual-Gate Verification

Every iteration runs two checks:
  1. Verify — Did the metric improve?
  2. Guard — Did anything break?
python
# Pseudocode of each iteration
git checkout -b experiment_N
modify_code()
git commit -m "experiment N: {hypothesis}"

verify_result = run(verify_cmd)
guard_result = run(guard_cmd)

if verify_result.improved and guard_result.passed:
    git merge experiment_N
    log("keep")
else:
    git reset --hard HEAD~1
    log("discard")

**Example:**

- **Verify:** `pytest --cov` (did coverage increase?)
- **Guard:** `pytest` (did all tests still pass?)

A change that increases coverage but breaks tests is **discarded**.
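
The keep-or-discard rule itself reduces to a few lines. The sketch below is an illustration under assumptions: `decide` is a hypothetical function, and the `"higher"` direction is assumed as the counterpart of the `"lower"` value that appears in the example configs.

```python
def decide(baseline: float, candidate: float, direction: str, guard_passed: bool) -> str:
    """Return "keep" only if the metric strictly improved and the guard passed."""
    if direction == "lower":
        improved = candidate < baseline
    elif direction == "higher":  # assumed counterpart of "lower"
        improved = candidate > baseline
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return "keep" if improved and guard_passed else "discard"


print(decide(47, 41, "lower", guard_passed=True))    # keep
print(decide(62, 70, "higher", guard_passed=False))  # discard: guard failed
print(decide(38, 38, "lower", guard_passed=True))    # discard: no improvement
```

Treating an unchanged metric as a discard matches the example log, where a refactor with a delta of 0 is discarded.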
CI/CD Mode (exec)

For automation pipelines, use `exec` mode:
bash
# Non-interactive, JSON output, exit codes
codex exec -f config.json --max-iterations 20 --timeout 3600

**Exit codes:**
- `0` — Target reached
- `1` — Max iterations reached without target
- `2` — Error or guard failure

**Example config.json:**

json
{
  "goal": "Reduce lint warnings to zero",
  "metric": "lint_warnings",
  "current_value": 34,
  "target_value": 0,
  "direction": "lower",
  "verify_cmd": "npm run lint -- --format json | jq '.length'",
  "guard_cmd": "npm test",
  "max_iterations": 30
}
Output (JSON):
json
{
  "status": "success",
  "iterations": 18,
  "initial_value": 34,
  "final_value": 0,
  "improvements": 12,
  "discards": 6,
  "elapsed_seconds": 1847
}
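
A pipeline step might combine the exit code with the JSON report to produce a one-line summary. This sketch assumes the report has exactly the fields shown in the sample output; `describe_run` and the wording of the messages are illustrative, not part of the tool.

```python
import json

# Meanings taken from the exit-code list in this section.
EXIT_MEANINGS = {
    0: "target reached",
    1: "max iterations reached without target",
    2: "error or guard failure",
}


def describe_run(exit_code: int, stdout: str) -> str:
    """Combine the exit code and the JSON report into a short summary."""
    report = json.loads(stdout)
    meaning = EXIT_MEANINGS.get(exit_code, "unknown exit code")
    return (
        f"{meaning}: {report['initial_value']} -> {report['final_value']} "
        f"in {report['iterations']} iterations "
        f"({report['improvements']} kept, {report['discards']} discarded)"
    )


sample_output = json.dumps({
    "status": "success",
    "iterations": 18,
    "initial_value": 34,
    "final_value": 0,
    "improvements": 12,
    "discards": 6,
    "elapsed_seconds": 1847,
})
print(describe_run(0, sample_output))
```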

Advanced Features

Parallel Experiments (Git Worktrees)

Test up to 3 hypotheses simultaneously:
text
You:   $codex-autoresearch
       Try multiple approaches to reduce latency

Codex: Parallel mode enabled (3 worktrees).
       Hypothesis A: Database query optimization
       Hypothesis B: Response caching
       Hypothesis C: Async I/O refactor
       
       Testing in parallel...
       
       Results:
       A: -23ms (keep)
       B: -67ms (keep)  ← Best
       C: +12ms (discard)
       
       Merging B, continuing from new baseline.
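
The selection step in this transcript, choosing the hypothesis with the largest improvement, can be sketched as follows. The helper `pick_best` is hypothetical and assumes a lower-is-better metric such as latency, where a negative delta is an improvement.

```python
from typing import Dict, Optional


def pick_best(deltas: Dict[str, float]) -> Optional[str]:
    """Return the hypothesis with the largest reduction, or None if none improved."""
    improving = {name: delta for name, delta in deltas.items() if delta < 0}
    if not improving:
        return None
    return min(improving, key=lambda name: improving[name])


print(pick_best({"A": -23, "B": -67, "C": 12}))  # B
```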

Cross-Run Learning

The `lessons.json` file accumulates knowledge:
json
{
  "successful_patterns": [
    {
      "goal": "reduce any types",
      "approach": "infer from JSON schema",
      "success_rate": 0.83,
      "avg_improvement": 4.2
    }
  ],
  "failed_patterns": [
    {
      "goal": "reduce any types",
      "approach": "generic type wrappers",
      "failure_rate": 0.91,
      "reason": "introduced more anys downstream"
    }
  ]
}
Future runs bias toward proven approaches and away from known failures.
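
How the skill weighs these entries is not described here, but one plausible reading is a simple ranking by success rate. The sketch below assumes the `lessons.json` layout shown above; `preferred_approaches` is a hypothetical helper.

```python
from typing import List

lessons = {
    "successful_patterns": [
        {"goal": "reduce any types", "approach": "infer from JSON schema",
         "success_rate": 0.83, "avg_improvement": 4.2},
    ],
    "failed_patterns": [
        {"goal": "reduce any types", "approach": "generic type wrappers",
         "failure_rate": 0.91, "reason": "introduced more anys downstream"},
    ],
}


def preferred_approaches(data: dict, goal: str) -> List[str]:
    """List successful approaches for a goal, highest success rate first."""
    wins = [p for p in data.get("successful_patterns", []) if p["goal"] == goal]
    wins.sort(key=lambda p: p["success_rate"], reverse=True)
    return [p["approach"] for p in wins]


print(preferred_approaches(lessons, "reduce any types"))
```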

Session Hooks (Auto-Persistence)

Hooks keep Codex on track across session boundaries:
bash
# Auto-installed with skill
~/.config/codex/hooks/post-session.sh

On every Codex session end:
1. Saves current state to `state.json`
2. Commits results log to git (if repo)
3. Backs up lessons learned

**Manual hook setup (if needed):**

bash
chmod +x ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh
ln -s ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh \
      ~/.config/codex/hooks/post-session.sh

Real-World Examples

Python: Improve Type Coverage

python
# Before autoresearch
def process_data(data):  # No types
    result = []
    for item in data:
        result.append(item['value'] * 2)
    return result

# After 8 iterations
from typing import List, Dict, Any

def process_data(data: List[Dict[str, Any]]) -> List[float]:
    result: List[float] = []
    for item in data:
        result.append(float(item['value']) * 2)
    return result

**Command:**
text
$codex-autoresearch
Improve type hints coverage to 90% with mypy strict mode

JavaScript: Reduce Bundle Size

javascript
// Before (487kb)
import _ from 'lodash';
import moment from 'moment';
import * as utils from './utils';

// After 12 iterations (312kb)
import { debounce, throttle } from 'lodash-es';  // Tree-shakeable
import { formatDate } from 'date-fns/formatDate';  // Targeted import
import { parseJSON, validateEmail } from './utils';  // Explicit imports
Command:
text
$codex-autoresearch
Reduce production bundle size below 350kb

Rust: Eliminate Clippy Warnings

rust
// Before (23 clippy warnings)
fn calculate(x: i32, y: i32) -> i32 {
    let mut result = 0;
    for i in 0..x {
        result = result + y;  // clippy: use += instead
    }
    result
}

// After 5 iterations (0 warnings)
fn calculate(x: i32, y: i32) -> i32 {
    x * y  // Direct multiplication, clippy-clean
}
Command:
text
$codex-autoresearch
Eliminate all clippy warnings with default lints

Troubleshooting

"Background run failed to start"

Cause: Codex was not started with `--dangerously-bypass-approvals-and-sandbox`
Fix:
bash
# Restart Codex with Full Access
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox

Or use foreground mode instead.

"Verify command failed"

Cause: The verify command isn't executable or returns output in an unexpected format
Fix:
text
You:   Test the verify command manually.

Codex: Running: grep -r "any" src | wc -l
       Output: "      47"  ← Extra spaces

You:   Adjust to: grep -r "any" src | wc -l | xargs

Codex: Updated verify_cmd. Baseline: 47 (clean integer).

"All experiments discarded, no progress"

Cause: Guard too strict (e.g., flaky tests) or goal unachievable
Check results log:
bash
grep "discard" autoresearch-results/results.tsv
Common reasons:
  • Guard fails even on unchanged code (flaky tests)
  • Metric can't be improved with current tooling
  • Scope too broad (try narrowing)
Fix:
text
You:   Run guard manually: npm test

Codex: Tests fail intermittently (test_cache_timeout).
       Recommendation: Fix flaky test or exclude from guard.

You:   Exclude that test from guard.

Codex: Updated guard: npm test -- --ignore test_cache_timeout

"State file corrupted"

Cause: Interrupted during JSON write
Fix:
bash
# Restore from git (if committed)
git restore autoresearch-results/state.json

# Or start fresh (loses resume state, keeps logs)
rm autoresearch-results/state.json
$codex-autoresearch

"Infinite loop, no termination"

Cause: No target value set, or metric unstable
Fix:
text
You:   Set max_iterations to 50 and stop.

Codex: Updated config: max_iterations = 50
       Will stop after 50 iterations regardless of target.
Or add explicit target:
text
You:   Target is 80% coverage, stop when reached.

Codex: Updated target_value: 80
       Will stop when coverage ≥80%.

Best Practices

1. Start with Small, Measurable Goals

❌ "Make the code better"
✅ "Reduce ESLint warnings from 42 to 0"

2. Verify Your Verify Command First

bash
# Before starting autoresearch, confirm the metric works
pytest --cov --cov-report=term | grep TOTAL
# Should output a parseable percentage
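
To check that the percentage really is machine-readable, the `TOTAL` line can be parsed directly. This is a sketch that assumes the usual terminal report layout, where the `TOTAL` row ends in a value such as `62%`; `parse_total_coverage` is a hypothetical helper and the sample report is made up for illustration.

```python
import re
from typing import Optional

# Captures the first percentage on a line that starts with TOTAL.
TOTAL_LINE = re.compile(r"^TOTAL\b.*?(\d+(?:\.\d+)?)%", re.MULTILINE)


def parse_total_coverage(report: str) -> Optional[float]:
    """Return the coverage percentage from the TOTAL row, or None if absent."""
    match = TOTAL_LINE.search(report)
    return float(match.group(1)) if match else None


sample_report = (
    "Name         Stmts   Miss  Cover\n"
    "src/app.py     120     45    62%\n"
    "TOTAL          120     45    62%\n"
)
print(parse_total_coverage(sample_report))  # 62.0
```

If the function returns `None` for your real output, the verify command likely needs adjusting before a run starts.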

3. Use Foreground for New Goals

Run foreground first to watch the loop and verify behavior. Switch to background once confident.

4. Let Codex Infer, Then Adjust

Don't write config upfront. Let Codex propose, then refine:
text
Codex: Proposed verify: npm run test:coverage
You:   Change to: npm run test:coverage -- --json
Codex: Updated. Baseline: 62%

5. Check Results Log After Each Run

bash
tail -20 autoresearch-results/results.tsv
Understand what worked and what didn't. This informs your next goal.

6. Use Git Strategically

Autoresearch commits every experiment. Your git log becomes the audit trail:
bash
git log --oneline --grep="autoresearch"
To squash experiments into clean commits after the run:
bash
git rebase -i HEAD~20  # Interactive rebase last 20 autoresearch commits

Environment Variables

Autoresearch respects standard tool configs via environment:
bash
# Example: Use specific Python for pytest
export PYTHON=/usr/bin/python3.11
$codex-autoresearch

# Example: Increase test timeout
export PYTEST_TIMEOUT=300
$codex-autoresearch

No secrets needed: autoresearch runs local tools, no external API calls.

Integration with Other Codex Skills

Combine autoresearch with other skills:
text
You:   $code-review
       Review the autoresearch improvements from last night.

Codex: Reviewing 12 commits in autoresearch-results/sessions/...
       
       Summary:
       - 8 type improvements: Good, no regressions detected
       - 3 test additions: Coverage gaps filled correctly
       - 1 refactor: Extracted helper, maintains behavior
       
       Recommendation: Merge to main.
text
You:   $codex-autoresearch
       Optimize performance
       
       Then:
       $benchmark
       Compare before/after with flamegraphs

Limitations

  1. Requires git — All experiments are git-based (commit/revert cycle)
  2. Local tools only — Verify and guard must be executable commands in your environment
  3. No multi-repo (yet) — Operates within a single workspace root
  4. Deterministic metrics work best — Flaky metrics lead to false discards

Getting Help

If autoresearch behaves unexpectedly:
  1. Check the results log: `cat autoresearch-results/results.tsv`
  2. Review the state file: `cat autoresearch-results/state.json`
  3. Run verify manually: test your verify command outside autoresearch
  4. Ask Codex to explain: run `$codex-autoresearch`, then ask "explain the last 5 iterations"

Summary

Codex Autoresearch is a single-command autonomous improvement loop:
bash
$codex-autoresearch
You describe the goal in natural language. Codex infers the config, confirms with you, then iterates:
modify → commit → verify → keep or discard → repeat
Foreground for interactive runs. Background for overnight. Results logged, state resumable, lessons learned across runs.
Most common workflow:
text
$codex-autoresearch
I want to [measurable goal]
→ Codex proposes config
→ You confirm or adjust
→ Choose foreground/background
→ Walk away or watch
→ Review results.tsv
→ Merge improvements
That's it. Autonomous iteration with human-in-the-loop goal-setting.