Codex Autoresearch Skill

Skill by ara.so — Codex Skills collection.

What is Codex Autoresearch?

Codex Autoresearch is an autonomous goal-driven experimentation system for Codex that continuously cycles through: modify code → verify result → retain (if improved) or discard (if worse) → repeat indefinitely. Tell Codex what you want to improve, walk away, and come back to a log of experiments and a better codebase.

Key capabilities:

- Autonomous iteration loops (foreground or background)
- Git-based experiment tracking with automatic revert on failure
- Dual-gate verification (did it improve? did anything break?)
- Escalating retry strategies (REFINE → PIVOT → Web search → Stop)
- Cross-run learning from past experiments
- Multiple modes: loop, plan, debug, fix, security, ship, exec

Installation

Recommended: Skill Installer

bash

# In Codex
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch

Then restart Codex.

Manual Installation

bash

# Clone to Codex skills directory
cd ~/.config/codex/skills
git clone https://github.com/leo-lilinxiao/codex-autoresearch

# Or create a symlink
ln -s /path/to/codex-autoresearch ~/.config/codex/skills/codex-autoresearch

Prerequisites for Full Functionality

For background runs and unattended execution:

bash

# Start Codex with Goals, hooks, and Full Access enabled
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox

**For foreground-only usage:** Standard Codex workspace permissions are sufficient.

Core Command

bash

$codex-autoresearch

This is the single entry point. Codex will infer the mode and configuration from your natural language goal.

Basic Usage Patterns

1. Improve Test Coverage

text

You:   $codex-autoresearch
       I want to improve test coverage to at least 80%

Codex: Current coverage: 62% (measured via pytest --cov)
       Target: ≥80%
       Verify: pytest --cov --cov-report=term
       Guard: pytest passes
       Results directory: ./autoresearch-results/
       Run mode: foreground or background?

You:   Foreground, go.

Codex: Starting iteration loop — baseline: 62%

What happens:

1. Codex identifies untested files/functions
2. Writes new tests or improves existing ones
3. Commits each change
4. Runs coverage check
5. Keeps the change if coverage increases, reverts it if coverage decreases or tests fail
6. Repeats until the target is reached or the run is interrupted

2. Eliminate TypeScript `any` Types

text

You:   $codex-autoresearch
       Get rid of all the any types in my TypeScript code

Codex: Found 47 `any` occurrences in src/**/*.ts
       Metric: `any` count (current: 47), direction: lower
       Verify: grep -r "any" src | wc -l
       Guard: npm run type-check (tsc --noEmit)
       Run mode?

You:   Background, run overnight.

Codex: Starting background run — baseline: 47. Iterating.

Background mode:

- Runs in a detached process
- Logs to `autoresearch-results/results.tsv`
- Can be stopped and resumed

Check progress:

bash

tail -f autoresearch-results/results.tsv
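
The verify command in the exchange above counts every line that contains the substring `any`, so words such as `many` or `company` in comments and strings would inflate the metric. A stricter counter might look like the sketch below. It is not part of the skill: the helper name `count_any_annotations` and the regular expression are assumptions chosen for illustration, and the pattern only covers the common `: any`, `as any`, and `<any>` forms (it can still match prose such as "note: any value").

```python
import re

# Hypothetical helper, not part of codex-autoresearch.
# Matches `any` in a type position: after a colon, after `as`, or after `<`.
ANY_TYPE = re.compile(r"(?::\s*|\bas\s+|<\s*)any\b")

def count_any_annotations(source: str) -> int:
    """Count occurrences of `any` in type positions within TypeScript source text."""
    return len(ANY_TYPE.findall(source))

sample = """
// handles many companies
function load(id: any): Promise<any> {
  return fetch(id) as any;
}
"""
print(count_any_annotations(sample))  # prints 3
```

A counter like this could be wrapped in a small script and used as the verify command, provided it prints a single integer.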

3. Fix Failing Tests

text

You:   $codex-autoresearch
       Fix the 12 failing tests

Codex: Detected 12 test failures via pytest
       Strategy: Repair one failure at a time
       Verify: pytest (exit code 0 = success)
       Mode?

You:   Foreground.

Codex: Iteration 1: test_auth_token_expiry
       Root cause: Clock skew in mock time...

4. Optimize Performance

text

You:   $codex-autoresearch
       Reduce API latency below 100ms

Codex: Current p95 latency: 247ms (from load test logs)
       Target: <100ms
       Verify: python benchmark.py --p95
       Guard: pytest integration tests
       Mode?

You:   Foreground.

5. Security Audit

text

You:   $codex-autoresearch
       Is this code secure?

Codex: Mode: security audit
       Framework: STRIDE + OWASP Top 10
       Scope: src/**/*.py (detected Flask app)
       Findings will be logged with code evidence.
       Run mode?

You:   Foreground.

Codex: Finding 1/7: SQL Injection risk in user_search()
       Evidence: Line 42, raw string interpolation...

Modes Reference

Codex infers the mode from your natural language, but understanding the modes helps you craft better prompts.

| Mode | Trigger Patterns | What It Does |
| --- | --- | --- |
| loop | "improve X", "reduce Y", "optimize Z" | Iterative improvement until target or interrupt |
| plan | "analyze", "what should I improve?", "suggest metrics" | Scans repo, proposes goals and metrics |
| debug | "why is X happening?", "diagnose", "root cause" | Hypothesis-driven debugging with falsifiable tests |
| fix | "fix the N failing tests", "repair", "make tests pass" | Sequential repair of known failures |
| security | "is this secure?", "audit", "STRIDE", "OWASP" | Security analysis with structured findings |
| ship | "ship it", "ready to release?", "pre-deploy check" | Release readiness verification |
| exec | (CI/CD usage, see below) | Non-interactive automation mode |

Configuration

You typically don't need to write config. Codex infers from your repo and goal.

But if you want explicit control, Codex will show you the inferred config and let you adjust before starting:

text

Codex: Inferred configuration:
       goal: "eliminate any types"
       scope: "src/**/*.ts"
       metric: any_count
       current_value: 47
       direction: lower
       verify_cmd: "grep -r 'any' src | wc -l"
       guard_cmd: "npm run type-check"
       
       Adjust anything?

You:   Change scope to include test files too.

Codex: Updated scope: "{src,test}/**/*.ts"
       Current value: 63 (including tests)
       Proceed?

Explicit Config (Advanced)

For CI/CD or scripted use, you can provide a JSON config:

json

{
  "goal": "Reduce bundle size",
  "metric": "bundle_kb",
  "current_value": 487,
  "target_value": 300,
  "direction": "lower",
  "verify_cmd": "npm run build && du -k dist/bundle.js | cut -f1",
  "guard_cmd": "npm test",
  "scope": "src/**/*.{ts,tsx}",
  "max_iterations": 50
}

bash

# Save to file, then:
codex exec -f autoresearch_config.json
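
To make the role of `direction` and `target_value` concrete, here is a minimal sketch of a stop check over a config like the one above. It is an illustration, not the skill's own code: the function name `target_reached` is hypothetical, `"higher"` is assumed to be the counterpart of `"lower"`, and treating the target as inclusive is also an assumption.

```python
import json

def target_reached(config: dict, value: float) -> bool:
    """Return True when `value` satisfies the configured target for its direction."""
    target = config["target_value"]
    direction = config["direction"]
    if direction == "lower":
        return value <= target  # assumption: hitting the target exactly counts
    if direction == "higher":   # assumption: the opposite of "lower" is spelled "higher"
        return value >= target
    raise ValueError(f"unknown direction: {direction!r}")

config = json.loads("""
{
  "goal": "Reduce bundle size",
  "metric": "bundle_kb",
  "current_value": 487,
  "target_value": 300,
  "direction": "lower"
}
""")
print(target_reached(config, config["current_value"]))  # prints False
print(target_reached(config, 295))                      # prints True
```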

Results and State Files

All runs create an `autoresearch-results/` directory in your workspace root:

autoresearch-results/
├── results.tsv          # Full experiment log (audit trail)
├── state.json           # Resume state (last consistent checkpoint)
├── lessons.json         # Cross-run learning (what worked/failed)
└── sessions/
    └── 2026-05-16_14-23-01/
        ├── experiment_1_keep.diff
        ├── experiment_2_discard.diff
        └── ...

Reading the Results Log

bash

# View all experiments
cat autoresearch-results/results.tsv

# Watch live (during background run)
tail -f autoresearch-results/results.tsv

# Filter successful improvements
grep "keep" autoresearch-results/results.tsv

**Example log:**

| iteration | commit | metric | delta | status | description |
| --- | --- | --- | --- | --- | --- |
| 0 | a1b2c3d | 47 | 0 | baseline | initial any count |
| 1 | b2c3d4e | 41 | -6 | keep | replace any in auth module |
| 2 | - | 49 | +8 | discard | generic wrapper introduced new anys |
| 3 | d4e5f6g | 38 | -3 | keep | type-narrow API response handlers |
| 4 | e5f6g7h | 38 | 0 | discard | refactor had no effect |
| 5 | f6g7h8i | 35 | -3 | keep | infer types from JSON schema |
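
For anything beyond `grep`, the log can be summarized with a few lines of Python. The sketch below assumes `results.tsv` is tab-separated with a header row using the column names from the example log; the function name `summarize` is hypothetical.

```python
import csv
import io

def summarize(tsv_text: str) -> dict:
    """Count keep/discard rows and report the first and last retained metric values."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Only baseline and kept rows reflect the state of the codebase.
    retained = [r for r in rows if r["status"] in ("baseline", "keep")]
    return {
        "keep": sum(r["status"] == "keep" for r in rows),
        "discard": sum(r["status"] == "discard" for r in rows),
        "start": int(retained[0]["metric"]),
        "end": int(retained[-1]["metric"]),
    }

log = (
    "iteration\tcommit\tmetric\tdelta\tstatus\tdescription\n"
    "0\ta1b2c3d\t47\t0\tbaseline\tinitial any count\n"
    "1\tb2c3d4e\t41\t-6\tkeep\treplace any in auth module\n"
    "2\t-\t49\t+8\tdiscard\tgeneric wrapper introduced new anys\n"
    "3\td4e5f6g\t38\t-3\tkeep\ttype-narrow API response handlers\n"
    "4\te5f6g7h\t38\t0\tdiscard\trefactor had no effect\n"
    "5\tf6g7h8i\t35\t-3\tkeep\tinfer types from JSON schema\n"
)
print(summarize(log))  # prints {'keep': 3, 'discard': 2, 'start': 47, 'end': 35}
```

To run it against a real log, read the file with `open("autoresearch-results/results.tsv", encoding="utf-8").read()` and pass the text to `summarize`.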

Escalation Strategy (When Stuck)

The loop doesn't blindly retry. It escalates:

| Trigger | Action |
| --- | --- |
| 3 consecutive failures | REFINE: Adjust within current strategy |
| 5 consecutive failures | PIVOT: Try fundamentally different approach |
| 2 PIVOTs without progress | Web search: Look for external solutions |
| 3 PIVOTs without progress | STOP: Report that human input needed |

One success resets all counters.

Example:

text

Iteration 12: discard (3rd consecutive failure)
→ REFINE: Try smaller type changes, one file at a time

Iteration 17: discard (5th consecutive failure)
→ PIVOT: Switch from manual typing to codegen from OpenAPI spec

Iteration 23: discard (2nd PIVOT without progress)
→ Web search: "TypeScript eliminate any types best practices"
  Found: Use strictNullChecks + noImplicitAny...

Iteration 25: keep
→ Counters reset. Back to normal iteration.
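
One way to read the escalation table is as a small state machine over two counters. The sketch below is an interpretation, not the skill's implementation: the class name `Escalation`, the action strings, and the choice to restart the consecutive-failure count after each pivot are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Escalation:
    """Tracks consecutive failures and pivots using the thresholds from the table."""
    failures: int = 0
    pivots: int = 0

    def record(self, kept: bool) -> str:
        """Register one iteration outcome and return the next action."""
        if kept:
            # One success resets all counters.
            self.failures = 0
            self.pivots = 0
            return "CONTINUE"
        self.failures += 1
        if self.failures == 3:
            return "REFINE"
        if self.failures == 5:
            # Assumption: each pivot restarts the consecutive-failure count.
            self.failures = 0
            self.pivots += 1
            if self.pivots >= 3:
                return "STOP"
            if self.pivots == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"

tracker = Escalation()
actions = [tracker.record(kept=False) for _ in range(15)]
print(actions[2], actions[4], actions[9], actions[14])
# prints REFINE PIVOT WEB_SEARCH STOP
```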

Stopping and Resuming

Stop a Foreground Run

text

You:   (interrupt Codex)

Codex: Autoresearch interrupted.
       State saved to autoresearch-results/state.json
       Resume with: $codex-autoresearch

Stop a Background Run

text

You:   $codex-autoresearch

Codex: Background run detected (PID 12345).
       Status: 15 iterations, metric improved 47→35
       
You:   Stop it.

Codex: Stopping background run...
       Final state saved. Resume anytime with $codex-autoresearch.

Resume

bash

$codex-autoresearch

Codex automatically detects `state.json` and offers to resume:

text

Codex: Found interrupted session:
       Goal: eliminate any types
       Progress: 47→35 (12 iterations)
       Last commit: f6g7h8i
       Resume from iteration 13?

You:   Yes, continue.

Dual-Gate Verification

Every iteration runs two checks:

Verify — Did the metric improve?
Guard — Did anything break?

python

# Pseudocode of each iteration
git checkout -b experiment_N
modify_code()
git commit -m "experiment N: {hypothesis}"

verify_result = run(verify_cmd)
guard_result = run(guard_cmd)

if verify_result.improved and guard_result.passed:
    git merge experiment_N
    log("keep")
else:
    git reset --hard HEAD~1
    log("discard")


**Example:**

- **Verify:** `pytest --cov` (did coverage increase?)
- **Guard:** `pytest` (did all tests still pass?)

A change that increases coverage but breaks tests is **discarded**.
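
The keep-or-discard rule can be written as a pure function, which makes the two gates easy to test in isolation. This is a minimal sketch under stated assumptions: the name `decide` is hypothetical, `"higher"` is assumed to be the counterpart of `"lower"`, and an unchanged metric is treated as a discard (in line with the "refactor had no effect" row of the example log).

```python
def decide(previous: float, current: float, direction: str, guard_passed: bool) -> str:
    """Return "keep" only if the metric strictly improved AND the guard passed."""
    if direction == "lower":
        improved = current < previous
    elif direction == "higher":  # assumption: spelling of the opposite direction
        improved = current > previous
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return "keep" if improved and guard_passed else "discard"

print(decide(62.0, 65.5, "higher", guard_passed=True))   # prints keep
print(decide(62.0, 70.0, "higher", guard_passed=False))  # prints discard
print(decide(41, 41, "lower", guard_passed=True))        # prints discard
```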

CI/CD Mode (exec)

For automation pipelines, use `exec` mode:

bash

# Non-interactive, JSON output, exit codes
codex exec -f config.json --max-iterations 20 --timeout 3600


**Exit codes:**
- `0` — Target reached
- `1` — Max iterations reached without target
- `2` — Error or guard failure
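
A pipeline step could translate these exit codes into a pass/fail verdict along the following lines. The helper `ci_verdict` is hypothetical, and treating only exit code `0` as a pass is a policy assumption; some teams may prefer to accept partial progress (code `1`) as a warning instead.

```python
from typing import Tuple

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "target reached",
    1: "max iterations reached without target",
    2: "error or guard failure",
}

def ci_verdict(returncode: int) -> Tuple[bool, str]:
    """Map an exec-mode exit code to (pipeline_should_pass, message)."""
    message = EXIT_MEANINGS.get(returncode, f"unexpected exit code {returncode}")
    return returncode == 0, message

print(ci_verdict(0))  # prints (True, 'target reached')
print(ci_verdict(1))  # prints (False, 'max iterations reached without target')
```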

**Example config.json:**

```json
{
  "goal": "Reduce lint warnings to zero",
  "metric": "lint_warnings",
  "current_value": 34,
  "target_value": 0,
  "direction": "lower",
  "verify_cmd": "npm run lint -- --format json | jq '.length'",
  "guard_cmd": "npm test",
  "max_iterations": 30
}
```

Output (JSON):

json

{
  "status": "success",
  "iterations": 18,
  "initial_value": 34,
  "final_value": 0,
  "improvements": 12,
  "discards": 6,
  "elapsed_seconds": 1847
}

Advanced Features

Parallel Experiments (Git Worktrees)

Test up to 3 hypotheses simultaneously:

text

You:   $codex-autoresearch
       Try multiple approaches to reduce latency

Codex: Parallel mode enabled (3 worktrees).
       Hypothesis A: Database query optimization
       Hypothesis B: Response caching
       Hypothesis C: Async I/O refactor
       
       Testing in parallel...
       
       Results:
       A: -23ms (keep)
       B: -67ms (keep)  ← Best
       C: +12ms (discard)
       
       Merging B, continuing from new baseline.
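
The selection step in the exchange above (keep the improving hypotheses, merge the best one) can be sketched as follows. The function name `pick_best` and the representation of results as a name-to-delta mapping are assumptions for illustration.

```python
from typing import Dict, Optional

def pick_best(deltas: Dict[str, float], direction: str = "lower") -> Optional[str]:
    """Return the hypothesis with the largest improvement, or None if none improved."""
    # For a "lower is better" metric, a negative delta is an improvement.
    sign = -1 if direction == "lower" else 1
    improving = {name: d for name, d in deltas.items() if sign * d > 0}
    if not improving:
        return None
    return max(improving, key=lambda name: sign * improving[name])

# Latency deltas in milliseconds from the example above.
print(pick_best({"A": -23, "B": -67, "C": 12}))  # prints B
```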

Cross-Run Learning

The `lessons.json` file accumulates knowledge:

json

{
  "successful_patterns": [
    {
      "goal": "reduce any types",
      "approach": "infer from JSON schema",
      "success_rate": 0.83,
      "avg_improvement": 4.2
    }
  ],
  "failed_patterns": [
    {
      "goal": "reduce any types",
      "approach": "generic type wrappers",
      "failure_rate": 0.91,
      "reason": "introduced more anys downstream"
    }
  ]
}

Future runs bias toward proven approaches and away from known failures.
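
How that bias might work can be illustrated with a simple ranking over the fields shown in `lessons.json`. This is a sketch, not the skill's actual logic: the function `rank_approaches` is hypothetical, and scoring unknown approaches as neutral (between proven and failed ones) is an assumption.

```python
def rank_approaches(lessons: dict, goal: str, candidates: list) -> list:
    """Order candidates: proven approaches first, unknown next, known failures last."""
    score = {}
    for pattern in lessons.get("successful_patterns", []):
        if pattern["goal"] == goal:
            score[pattern["approach"]] = pattern["success_rate"]
    for pattern in lessons.get("failed_patterns", []):
        if pattern["goal"] == goal:
            score[pattern["approach"]] = -pattern["failure_rate"]
    # Approaches with no history get a neutral score of 0.0.
    return sorted(candidates, key=lambda a: score.get(a, 0.0), reverse=True)

lessons = {
    "successful_patterns": [
        {"goal": "reduce any types", "approach": "infer from JSON schema",
         "success_rate": 0.83, "avg_improvement": 4.2}
    ],
    "failed_patterns": [
        {"goal": "reduce any types", "approach": "generic type wrappers",
         "failure_rate": 0.91, "reason": "introduced more anys downstream"}
    ],
}
candidates = ["generic type wrappers", "manual annotation", "infer from JSON schema"]
print(rank_approaches(lessons, "reduce any types", candidates))
# prints ['infer from JSON schema', 'manual annotation', 'generic type wrappers']
```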

Session Hooks (Auto-Persistence)

Hooks keep Codex on track across session boundaries:

bash

# Auto-installed with skill
~/.config/codex/hooks/post-session.sh

On every Codex session end:
1. Saves current state to `state.json`
2. Commits results log to git (if repo)
3. Backs up lessons learned

**Manual hook setup (if needed):**

```bash
chmod +x ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh
ln -s ~/.config/codex/skills/codex-autoresearch/hooks/post-session.sh \
      ~/.config/codex/hooks/post-session.sh
```

Real-World Examples

Python: Improve Type Coverage

python

# Before autoresearch
def process_data(data):  # No types
    result = []
    for item in data:
        result.append(item['value'] * 2)
    return result

# After 8 iterations
from typing import List, Dict, Any

def process_data(data: List[Dict[str, Any]]) -> List[float]:
    result: List[float] = []
    for item in data:
        result.append(float(item['value']) * 2)
    return result

**Command:**
```text
$codex-autoresearch
Improve type hints coverage to 90% with mypy strict mode
```

JavaScript: Reduce Bundle Size

javascript

// Before (487kb)
import _ from 'lodash';
import moment from 'moment';
import * as utils from './utils';

// After 12 iterations (312kb)
import { debounce, throttle } from 'lodash-es';  // Tree-shakeable
import { format } from 'date-fns';  // Named import, tree-shakeable
import { parseJSON, validateEmail } from './utils';  // Explicit imports

Command:

text

$codex-autoresearch
Reduce production bundle size below 350kb

Rust: Eliminate Clippy Warnings

rust

// Before (23 clippy warnings)
fn calculate(x: i32, y: i32) -> i32 {
    let mut result = 0;
    for i in 0..x {
        result = result + y;  // clippy: use += instead
    }
    result
}

// After 5 iterations (0 warnings)
fn calculate(x: i32, y: i32) -> i32 {
    x * y  // Direct multiplication, clippy-clean (matches the loop only when x >= 0)
}

Command:

text

$codex-autoresearch
Eliminate all clippy warnings with default lints

Troubleshooting

"Background run failed to start"

Cause: Codex not started with `--dangerously-bypass-approvals-and-sandbox`

Fix:

bash

# Restart Codex with Full Access
codex --enable goals --enable hooks --dangerously-bypass-approvals-and-sandbox

Or use foreground mode instead.

"Verify command failed"

Cause: The verify command isn't executable or returns unexpected format

Fix:

text

You:   Test the verify command manually.

Codex: Running: grep -r "any" src | wc -l
       Output: "      47"  ← Extra spaces

You:   Adjust to: grep -r "any" src | wc -l | xargs

Codex: Updated verify_cmd. Baseline: 47 (clean integer).
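
The underlying requirement is that the verify command prints exactly one number. A strict parser makes that contract explicit and tolerates the padding that `wc -l` adds on some platforms. The helper `parse_metric` below is a hypothetical illustration, not the skill's own parser.

```python
def parse_metric(output: str) -> float:
    """Parse verify-command output that should contain exactly one number."""
    tokens = output.split()  # split() discards leading/trailing whitespace and newlines
    if len(tokens) != 1:
        raise ValueError(f"expected a single number, got {output!r}")
    return float(tokens[0])  # raises ValueError for values such as "62%"

print(parse_metric("      47\n"))  # prints 47.0
```

Running a candidate verify command by hand and checking that its output would pass a parser like this is a quick way to catch format problems before a long run.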

"All experiments discarded, no progress"

Cause: Guard too strict (e.g., flaky tests) or goal unachievable

Check results log:

bash

grep "discard" autoresearch-results/results.tsv

Common reasons:

- Guard fails even on unchanged code (flaky tests)
- Metric can't be improved with current tooling
- Scope too broad (try narrowing)

Fix:

text

You:   Run guard manually: npm test

Codex: Tests fail intermittently (test_cache_timeout).
       Recommendation: Fix flaky test or exclude from guard.

You:   Exclude that test from guard.

Codex: Updated guard: npm test -- --ignore test_cache_timeout

"State file corrupted"

Cause: Interrupted during JSON write

Fix:

bash

# Restore from git (if committed)
git restore autoresearch-results/state.json

# Or start fresh (loses resume state, keeps logs)
rm autoresearch-results/state.json
$codex-autoresearch
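
For context on why an interrupted write corrupts the file: writing JSON directly to its final path leaves a truncated file if the process dies mid-write. A common remedy in custom tooling is to write to a temporary file and rename it into place. The sketch below shows that general technique; it is not a description of how the skill writes `state.json`, and `write_json_atomic` is a hypothetical helper.

```python
import json
import os
import tempfile

def write_json_atomic(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)     # the rename is atomic on POSIX filesystems
    except BaseException:
        # Clean up the partial temp file; the previous `path` is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "state.json")
    write_json_atomic(state_path, {"iteration": 12, "metric": 35})
    with open(state_path, encoding="utf-8") as handle:
        print(json.load(handle))  # prints {'iteration': 12, 'metric': 35}
```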

"Infinite loop, no termination"

Cause: No target value set, or metric unstable

Fix:

text

You:   Set max_iterations to 50 and stop.

Codex: Updated config: max_iterations = 50
       Will stop after 50 iterations regardless of target.

Or add explicit target:

text

You:   Target is 80% coverage, stop when reached.

Codex: Updated target_value: 80
       Will stop when coverage ≥80%.

Best Practices

1. Start with Small, Measurable Goals

❌ "Make the code better"

✅ "Reduce ESLint warnings from 42 to 0"

2. Verify Your Verify Command First

bash

# Before starting autoresearch, confirm the metric works
pytest --cov --cov-report=term | grep TOTAL
# Should output a parseable percentage
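
"Parseable" here means a script can pull one number out of the `TOTAL` line. A minimal sketch of such a check is shown below; the function name `coverage_percent` is hypothetical and the sample report is made-up data in the usual terminal layout, so real output may have extra columns.

```python
import re

def coverage_percent(report: str) -> float:
    """Extract the percentage from the TOTAL line of a coverage terminal report."""
    for line in report.splitlines():
        if line.startswith("TOTAL"):
            match = re.search(r"(\d+(?:\.\d+)?)%", line)
            if match:
                return float(match.group(1))
    raise ValueError("no TOTAL line with a percentage found")

# Illustrative report text, not output from a real project.
report = """Name        Stmts   Miss  Cover
-------------------------------
app.py        120     45    62%
-------------------------------
TOTAL         120     45    62%
"""
print(coverage_percent(report))  # prints 62.0
```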

3. Use Foreground for New Goals

Run foreground first to watch the loop and verify behavior. Switch to background once confident.

4. Let Codex Infer, Then Adjust

Don't write config upfront. Let Codex propose, then refine:

text

Codex: Proposed verify: npm run test:coverage
You:   Change to: npm run test:coverage -- --json
Codex: Updated. Baseline: 62%

5. Check Results Log After Each Run

bash

tail -20 autoresearch-results/results.tsv

Understand what worked and what didn't. This informs your next goal.

6. Use Git Strategically

Autoresearch commits every experiment. Your git log becomes the audit trail:

bash

git log --oneline --grep="autoresearch"

To squash experiments into clean commits after the run:

bash

git rebase -i HEAD~20  # Interactive rebase last 20 autoresearch commits

Environment Variables

Autoresearch respects standard tool configs via environment:

bash

# Example: Use specific Python for pytest
export PYTHON=/usr/bin/python3.11
$codex-autoresearch

# Example: Increase test timeout
export PYTEST_TIMEOUT=300
$codex-autoresearch

No secrets needed: autoresearch runs local tools, with no external API calls.

Integration with Other Codex Skills

Combine autoresearch with other skills:

text

You:   $code-review
       Review the autoresearch improvements from last night.

Codex: Reviewing 12 commits in autoresearch-results/sessions/...
       
       Summary:
       - 8 type improvements: Good, no regressions detected
       - 3 test additions: Coverage gaps filled correctly
       - 1 refactor: Extracted helper, maintains behavior
       
       Recommendation: Merge to main.

text

You:   $codex-autoresearch
       Optimize performance
       
       Then:
       $benchmark
       Compare before/after with flamegraphs

Limitations

- Requires git: All experiments are git-based (commit/revert cycle)
- Local tools only: Verify and guard must be executable commands in your environment
- No multi-repo (yet): Operates within a single workspace root
- Deterministic metrics work best: Flaky metrics lead to false discards

Getting Help

If autoresearch behaves unexpectedly:

1. Check the results log: `cat autoresearch-results/results.tsv`
2. Review state file: `cat autoresearch-results/state.json`
3. Run verify manually: Test your verify command outside autoresearch
4. Ask Codex to explain: run `$codex-autoresearch`, then ask "explain the last 5 iterations"

Summary

Codex Autoresearch is a single-command autonomous improvement loop:

bash

$codex-autoresearch

You describe the goal in natural language. Codex infers the config, confirms with you, then iterates:

modify → commit → verify → keep or discard → repeat

Foreground for interactive runs. Background for overnight. Results logged, state resumable, lessons learned across runs.

Most common workflow:

text

$codex-autoresearch
I want to [measurable goal]
→ Codex proposes config
→ You confirm or adjust
→ Choose foreground/background
→ Walk away or watch
→ Review results.tsv
→ Merge improvements

That's it. Autonomous iteration with human-in-the-loop goal-setting.
