meta-optimize


Meta-Optimize: Outer-Loop Harness Optimization for ARIS


Analyze accumulated usage logs and propose optimizations for: $ARGUMENTS

Context


ARIS is a research harness — a system of skills, bridges, workflows, and artifact contracts that wraps around LLMs to orchestrate research. This skill implements a prototype outer loop that observes how the harness is used and proposes improvements to the harness itself (not to the research artifacts it produces).
Inspired by Meta-Harness (Lee et al., 2026): the key insight is that harness design matters as much as model weights, and harness engineering can be partially automated by logging execution traces and using them to guide improvements.

What This Skill Optimizes (Harness Components)


| Component | Example | Optimizable? |
|---|---|---|
| SKILL.md prompts | Reviewer instructions, quality gates, step descriptions | Yes |
| Default parameters | `difficulty: medium`, `MAX_ROUNDS: 4`, `threshold: 6/10` | Yes |
| Convergence rules | When to stop the review loop, retry counts | Yes |
| Workflow ordering | Skill chain sequence within a workflow | Yes |
| Artifact schemas | What fields go in EXPERIMENT_LOG.md, idea-stage/IDEA_REPORT.md | Cautious |
| MCP bridge config | Which reviewer model, routing rules | No (infra) |
Not optimized: The research artifacts themselves (papers, code, experiments). That's what the regular workflows do.

Prerequisites


  1. Logging must be active. Copy `templates/claude-hooks/meta_logging.json` into your project's `.claude/settings.json` (or merge the hooks section).
  2. Sufficient data. At least 5 complete workflow runs logged in `.aris/meta/events.jsonl`. The skill will check and warn if insufficient.

Workflow


Step 0: Check Data Availability


```bash
EVENTS_FILE=".aris/meta/events.jsonl"
if [ ! -f "$EVENTS_FILE" ]; then
    echo "ERROR: No event log found at $EVENTS_FILE"
    echo "Enable logging first: copy templates/claude-hooks/meta_logging.json into .claude/settings.json"
    exit 1
fi

EVENT_COUNT=$(wc -l < "$EVENTS_FILE")
# grep -c prints 0 itself when nothing matches (but exits non-zero),
# so suppress the failure status rather than echoing a second "0".
SKILL_INVOCATIONS=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || true)
SESSIONS=$(grep -c '"session_start"' "$EVENTS_FILE" || true)

echo "📊 Event log: $EVENT_COUNT events, $SKILL_INVOCATIONS skill invocations, $SESSIONS sessions"

if [ "$SKILL_INVOCATIONS" -lt 5 ]; then
    echo "⚠️  Insufficient data (<5 skill invocations). Continue using ARIS normally and re-run later."
    exit 0
fi
```

Step 1: Analyze Usage Patterns


Read `.aris/meta/events.jsonl` and compute:
Frequency analysis:
  • Which skills are invoked most often?
  • Which slash commands do users type most?
  • What parameter overrides are most common? (These suggest bad defaults.)
Failure analysis:
  • Which tools fail most often? In which skills?
  • What error patterns repeat? (OOM, import, compilation, timeout)
  • How many auto-debug retries per workflow run?
Convergence analysis (for auto-review-loop):
  • Average rounds to reach threshold
  • Score trajectory shape (fast improvement? plateau? oscillation?)
  • Which review round catches the most critical issues?
  • Do users override difficulty mid-run?
Human intervention analysis:
  • Where do users interrupt with manual prompts during workflows?
  • What manual corrections do users make most? (These indicate skill gaps.)
Present findings as a structured summary table.
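The tallies above can be sketched with a short script. This is a minimal sketch in Python, using the field names from the Event Schema Reference later in this document; malformed lines are skipped rather than crashing the analysis:

```python
import json
from collections import Counter

def analyze_events(lines):
    """Tally skill invocations, slash commands, and tool failures
    from raw JSONL event lines (unparseable lines are skipped)."""
    skills, commands, failures = Counter(), Counter(), Counter()
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        kind = ev.get("event")
        if kind == "skill_invoke":
            skills[ev.get("skill", "?")] += 1
        elif kind == "slash_command":
            commands[ev.get("command", "?")] += 1
        elif kind == "tool_failure":
            failures[ev.get("tool", "?")] += 1
    return skills, commands, failures
```

`Counter.most_common()` on each of the returned tallies gives the "most often" rankings directly.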

Step 2: Identify Optimization Targets


Based on Step 1, rank optimization opportunities by expected impact:

Optimization Opportunities (ranked)


| # | Target | Signal | Proposed Change | Expected Impact |
|---|---|---|---|---|
| 1 | auto-review-loop default threshold | Users override to 7/10 in 60% of runs | Change default from 6/10 to 7/10 | Fewer manual overrides |
| 2 | experiment-bridge retry count | 40% of runs hit max retries on OOM | Add OOM-specific recovery (reduce batch size) | Fewer failed experiments |
| 3 | paper-write de-AI patterns | Users manually fix "delve" in 80% of runs | Add "delve" to default watchword list | Fewer manual edits |

If `$ARGUMENTS` specifies a target skill, focus analysis on that skill only.
If `$ARGUMENTS` is empty or "all", analyze all skills with sufficient data.
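One concrete ranking signal is the override rate: how often users pass explicit arguments instead of accepting a skill's defaults. A hedged sketch (field names follow the event schema in this document; ranking by expected impact still requires judgment on top of this number):

```python
import json
from collections import Counter

def override_rates(lines):
    """For each skill, the fraction of invocations that carried
    explicit argument overrides. A high rate suggests a bad default."""
    total, overridden = Counter(), Counter()
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        if ev.get("event") != "skill_invoke":
            continue
        skill = ev.get("skill", "?")
        total[skill] += 1
        if ev.get("args"):  # non-empty args string means the user overrode something
            overridden[skill] += 1
    return {s: overridden[s] / total[s] for s in total}
```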

Step 3: Generate Patch Proposals


For each optimization target, generate a concrete diff:
```diff
--- a/skills/auto-review-loop/SKILL.md
+++ b/skills/auto-review-loop/SKILL.md
@@ -15,7 +15,7 @@
 ## Constants
 
-- **SCORE_THRESHOLD = 6** — Minimum review score to accept.
+- **SCORE_THRESHOLD = 7** — Minimum review score to accept. (Raised based on usage data: 60% of users overrode to 7+.)
```
Rules for patch generation:
  • One patch per optimization target
  • Each patch must include a comment explaining WHY (with data from the log)
  • Patches must be minimal — change only what the data supports
  • Never change artifact schemas or MCP bridge config in v1
  • Never change behavior that would break existing user workflows
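Once the patched text is drafted, the diff itself can be generated mechanically. A minimal sketch using Python's standard `difflib` (the `make_patch` helper is illustrative, not part of the skill):

```python
import difflib

def make_patch(original, updated, path):
    """Render a unified diff for a single-file change, in the same
    a/ b/ prefix style as the example patch above."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))
```

Keeping patches this small also makes the "minimal, one change per target" rule easy to verify by eye.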

Step 4: Cross-Model Review of Patches


Send each patch to GPT-5.4 xhigh for adversarial review:
```yaml
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are reviewing a proposed optimization to an ARIS SKILL.md file.

    ## Original Skill (relevant section)
    [paste original]

    ## Proposed Patch
    [paste diff]

    ## Evidence from Usage Log
    [paste summary stats]

    Review this patch:
    1. Does the evidence support the change?
    2. Could this change hurt other use cases?
    3. Is the change minimal and safe?
    4. Score 1-10: should this be applied?

    If score < 7, explain what additional evidence would be needed.
```
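To make the gate mechanical, the reviewer's verdict can be parsed and compared against the acceptance threshold. A minimal sketch, assuming the reviewer states its verdict as `Score: N/10` — that phrasing is an assumption about the reply, not a guaranteed format, so an unparseable reply falls back to human review:

```python
import re

def gate_patch(review_text, min_score=7):
    """Extract the 1-10 score from a reviewer reply and decide whether
    to recommend the patch. Returns True/False, or None when no score
    can be parsed (assumed 'Score: N/10' phrasing absent)."""
    m = re.search(r"[Ss]core[:\s]+(\d+)\s*/\s*10", review_text)
    if m is None:
        return None  # no parseable verdict -> route to human review
    return int(m.group(1)) >= min_score
```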

Step 5: Present Results


Output a structured report:

ARIS Meta-Optimization Report


Date: [today]
Data: [N] events, [M] skill invocations, [K] sessions
Target: [skill name or "all"]

Proposed Changes


Change 1: [title]


  • Target: [skill/file:line]
  • Signal: [what the data shows]
  • Patch: [diff]
  • Reviewer Score: [X/10]
  • Reviewer Notes: [summary]
  • Status: ✅ Recommended / ⚠️ Needs more data / ❌ Rejected

Change 2: ...


Changes NOT Made (insufficient evidence)


  • [pattern observed but too few samples]

Recommendations


  • Apply Change 1 (reviewer approved)
  • Collect more data for Change 3 (need N more runs)
  • Consider manual review of Change 2

Next Steps


Run `/meta-optimize apply 1` to apply a specific change, or `/meta-optimize apply all` to apply all recommended changes.

Step 6: Apply Changes (if user approves)


If user runs `/meta-optimize apply [N]`:
  1. Back up original SKILL.md to `.aris/meta/backups/`
  2. Apply the patch
  3. Log the change to `.aris/meta/optimizations.jsonl`
  4. Remind user to test the changed skill on their next run
Never auto-apply without user approval.
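Steps 1-3 above can be sketched as a small helper. This is an illustrative sketch, not part of the skill: `apply_change` is a hypothetical name, it assumes the patched text has already been produced, and only the `.aris/meta` paths come from this document:

```python
import json
import shutil
import time
from pathlib import Path

def apply_change(skill_md, patched_text, change_id):
    """Back up the original, write the patched file, and append an
    audit record to .aris/meta/optimizations.jsonl."""
    skill_md = Path(skill_md)
    backups = Path(".aris/meta/backups")
    backups.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup = backups / f"{skill_md.name}.{stamp}.bak"
    shutil.copy2(skill_md, backup)          # 1. back up the original
    skill_md.write_text(patched_text)       # 2. apply the patch
    record = {"ts": stamp, "change": change_id,
              "file": str(skill_md), "backup": str(backup)}
    log = Path(".aris/meta/optimizations.jsonl")
    with log.open("a") as f:                # 3. log the change
        f.write(json.dumps(record) + "\n")
    return backup
```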

Key Rules


  • Log-driven, not speculative. Every proposed change must cite specific data from the event log. No "I think this would be better."
  • Minimal patches. Change one thing at a time. Don't rewrite entire skills.
  • Reviewer-gated. Every patch goes through cross-model review before recommendation.
  • Reversible. Always back up before applying. Always log what changed.
  • User-approved. Never auto-apply. Present, explain, let the user decide.
  • Honest about uncertainty. If the data is insufficient, say so. Don't optimize on noise.
  • Portable. Optimizations should improve the skill for all users, not just one user's style. If a change seems user-specific, flag it.

Event Schema Reference


The log at `.aris/meta/events.jsonl` contains JSONL records with these shapes:

```jsonl
{"ts":"...","session":"...","event":"skill_invoke","skill":"auto-review-loop","args":"difficulty: hard"}
{"ts":"...","session":"...","event":"PostToolUse","tool":"Bash","input_summary":"pdflatex main.tex"}
{"ts":"...","session":"...","event":"codex_call","tool":"mcp__codex__codex","input_summary":"review..."}
{"ts":"...","session":"...","event":"tool_failure","tool":"Bash","input_summary":"python train.py"}
{"ts":"...","session":"...","event":"slash_command","command":"/auto-review-loop","args":""}
{"ts":"...","session":"...","event":"user_prompt","prompt_preview":"change difficulty to hard"}
{"ts":"...","session":"...","event":"session_start","source":"startup","model":"claude-opus-4-6"}
{"ts":"...","session":"...","event":"session_end"}
```

Triggering


This skill is NOT part of the standard W1→W1.5→W2→W3→W4 pipeline. It is a maintenance workflow with three trigger mechanisms:
  1. Passive logging (always on): Claude Code hooks record events to `.aris/meta/events.jsonl` automatically during normal usage. Zero user effort.
  2. Automatic readiness check (SessionEnd hook): When a Claude Code session ends, `check_ready.sh` counts skill invocations since the last `/meta-optimize` run. If ≥5 new invocations have accumulated, it prints a reminder: "📊 ARIS has logged 8 skill runs since last optimization. Run /meta-optimize to check for improvement opportunities." This is a suggestion only — it does not auto-run optimization.
  3. Manual trigger: User runs `/meta-optimize` when they see the reminder or whenever they want.

After each `/meta-optimize` run, the skill writes the current timestamp to `.aris/meta/.last_optimize` so the readiness check only counts new invocations.
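The readiness check can be sketched as follows. This is a Python rendering of the logic attributed to `check_ready.sh`, under the assumption that `.aris/meta/.last_optimize` stores an ISO-8601 timestamp, so plain string comparison orders events correctly:

```python
import json
from pathlib import Path

def new_invocations(events_path, marker_path, threshold=5):
    """Count skill_invoke events newer than the last-optimize marker
    and report whether the reminder should fire. A missing marker
    means everything counts."""
    marker = Path(marker_path)
    last = marker.read_text().strip() if marker.exists() else ""
    count = 0
    for line in Path(events_path).read_text().splitlines():
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        # ISO timestamps sort lexicographically, so > works on strings
        if ev.get("event") == "skill_invoke" and ev.get("ts", "") > last:
            count += 1
    return count, count >= threshold
```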

Acknowledgements


Inspired by Meta-Harness (Lee et al., 2026) — end-to-end optimization of model harnesses via filesystem-based experience access and agentic code search.

Output Protocols


Follow these shared protocols for all output files:
  • Output Versioning Protocol — write timestamped file first, then copy to fixed name
  • Output Manifest Protocol — log every output to MANIFEST.md
  • Output Language Protocol — respect the project's language setting
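The versioning protocol's write-then-copy step can be sketched as follows (`write_versioned` is a hypothetical helper name; the authoritative protocol lives in the shared references):

```python
import shutil
import time
from pathlib import Path

def write_versioned(directory, stem, suffix, text):
    """Write the timestamped file first, then copy it to the fixed
    name, so the fixed name always points at a complete version."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    versioned = directory / f"{stem}_{stamp}{suffix}"
    versioned.write_text(text)
    fixed = directory / f"{stem}{suffix}"
    shutil.copy2(versioned, fixed)
    return versioned, fixed
```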

Review Tracing


After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).