meta-optimize


Meta-Optimize: Outer-Loop Harness Optimization for ARIS


Analyze accumulated usage logs and propose optimizations for: $ARGUMENTS

Context


ARIS is a research harness — a system of skills, bridges, workflows, and artifact contracts that wraps around LLMs to orchestrate research. This skill implements a prototype outer loop that observes how the harness is used and proposes improvements to the harness itself (not to the research artifacts it produces).
Inspired by Meta-Harness (Lee et al., 2026): the key insight is that harness design matters as much as model weights, and harness engineering can be partially automated by logging execution traces and using them to guide improvements.

What This Skill Optimizes (Harness Components)


| Component | Example | Optimizable? |
|---|---|---|
| SKILL.md prompts | Reviewer instructions, quality gates, step descriptions | Yes |
| Default parameters | `difficulty: medium`, `MAX_ROUNDS: 4`, `threshold: 6/10` | Yes |
| Convergence rules | When to stop the review loop, retry counts | Yes |
| Workflow ordering | Skill chain sequence within a workflow | Yes |
| Artifact schemas | What fields go in EXPERIMENT_LOG.md, idea-stage/IDEA_REPORT.md | Cautious |
| MCP bridge config | Which reviewer model, routing rules | No (infra) |
Not optimized: The research artifacts themselves (papers, code, experiments). That's what the regular workflows do.

Prerequisites


  1. Logging must be active. Copy `templates/claude-hooks/meta_logging.json` into your project's `.claude/settings.json` (or merge the hooks section).
  2. Sufficient data. At least 5 complete workflow runs logged in `.aris/meta/events.jsonl`. The skill will check and warn if insufficient.

Workflow


Step 0: Check Data Availability


```bash
EVENTS_FILE=".aris/meta/events.jsonl"
if [ ! -f "$EVENTS_FILE" ]; then
    echo "ERROR: No event log found at $EVENTS_FILE"
    echo "Enable logging first: copy templates/claude-hooks/meta_logging.json into .claude/settings.json"
    exit 1
fi

EVENT_COUNT=$(wc -l < "$EVENTS_FILE")
# grep -c prints 0 itself when nothing matches (but exits non-zero),
# so suppress the failure status rather than echoing a second "0".
SKILL_INVOCATIONS=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || true)
SESSIONS=$(grep -c '"session_start"' "$EVENTS_FILE" || true)

echo "📊 Event log: $EVENT_COUNT events, $SKILL_INVOCATIONS skill invocations, $SESSIONS sessions"

if [ "$SKILL_INVOCATIONS" -lt 5 ]; then
    echo "⚠️  Insufficient data (<5 skill invocations). Continue using ARIS normally and re-run later."
    exit 0
fi
```

Step 1: Analyze Usage Patterns


Read `.aris/meta/events.jsonl` and compute:
Frequency analysis:
  • Which skills are invoked most often?
  • Which slash commands do users type most?
  • What parameter overrides are most common? (These suggest bad defaults.)
Failure analysis:
  • Which tools fail most often? In which skills?
  • What error patterns repeat? (OOM, import, compilation, timeout)
  • How many auto-debug retries per workflow run?
Convergence analysis (for auto-review-loop):
  • Average rounds to reach threshold
  • Score trajectory shape (fast improvement? plateau? oscillation?)
  • Which review round catches the most critical issues?
  • Do users override difficulty mid-run?
Human intervention analysis:
  • Where do users interrupt with manual prompts during workflows?
  • What manual corrections do users make most? (These indicate skill gaps.)
Present findings as a structured summary table.
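The tallies above can be sketched with a short script. This is a minimal sketch in Python, using the field names from the Event Schema Reference later in this document; malformed lines are skipped rather than crashing the analysis:

```python
import json
from collections import Counter

def analyze_events(lines):
    """Tally skill invocations, slash commands, and tool failures
    from raw JSONL event lines (unparseable lines are skipped)."""
    skills, commands, failures = Counter(), Counter(), Counter()
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        kind = ev.get("event")
        if kind == "skill_invoke":
            skills[ev.get("skill", "?")] += 1
        elif kind == "slash_command":
            commands[ev.get("command", "?")] += 1
        elif kind == "tool_failure":
            failures[ev.get("tool", "?")] += 1
    return skills, commands, failures
```

`Counter.most_common()` on each of the returned tallies gives the "most often" rankings directly.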

Step 2: Identify Optimization Targets


Based on Step 1, rank optimization opportunities by expected impact:

Optimization Opportunities (ranked)


| # | Target | Signal | Proposed Change | Expected Impact |
|---|---|---|---|---|
| 1 | auto-review-loop default threshold | Users override to 7/10 in 60% of runs | Change default from 6/10 to 7/10 | Fewer manual overrides |
| 2 | experiment-bridge retry count | 40% of runs hit max retries on OOM | Add OOM-specific recovery (reduce batch size) | Fewer failed experiments |
| 3 | paper-write de-AI patterns | Users manually fix "delve" in 80% of runs | Add "delve" to default watchword list | Fewer manual edits |

If `$ARGUMENTS` specifies a target skill, focus analysis on that skill only.
If `$ARGUMENTS` is empty or "all", analyze all skills with sufficient data.
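One concrete ranking signal is the override rate: how often users pass explicit arguments instead of accepting a skill's defaults. A hedged sketch (field names follow the event schema in this document; ranking by expected impact still requires judgment on top of this number):

```python
import json
from collections import Counter

def override_rates(lines):
    """For each skill, the fraction of invocations that carried
    explicit argument overrides. A high rate suggests a bad default."""
    total, overridden = Counter(), Counter()
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        if ev.get("event") != "skill_invoke":
            continue
        skill = ev.get("skill", "?")
        total[skill] += 1
        if ev.get("args"):  # non-empty args string means the user overrode something
            overridden[skill] += 1
    return {s: overridden[s] / total[s] for s in total}
```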

Step 3: Generate Patch Proposals


For each optimization target, generate a concrete diff:
```diff
--- a/skills/auto-review-loop/SKILL.md
+++ b/skills/auto-review-loop/SKILL.md
@@ -15,7 +15,7 @@
 ## Constants
 
-- **SCORE_THRESHOLD = 6** — Minimum review score to accept.
+- **SCORE_THRESHOLD = 7** — Minimum review score to accept. (Raised based on usage data: 60% of users overrode to 7+.)
```
Rules for patch generation:
  • One patch per optimization target
  • Each patch must include a comment explaining WHY (with data from the log)
  • Patches must be minimal — change only what the data supports
  • Never change artifact schemas or MCP bridge config in v1
  • Never change behavior that would break existing user workflows
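Once the patched text is drafted, the diff itself can be generated mechanically. A minimal sketch using Python's standard `difflib` (the `make_patch` helper is illustrative, not part of the skill):

```python
import difflib

def make_patch(original, updated, path):
    """Render a unified diff for a single-file change, in the same
    a/ b/ prefix style as the example patch above."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))
```

Keeping patches this small also makes the "minimal, one change per target" rule easy to verify by eye.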

Step 4: Cross-Model Review of Patches


Send each patch to GPT-5.4 xhigh for adversarial review:
```yaml
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are reviewing a proposed optimization to an ARIS SKILL.md file.

    ## Original Skill (relevant section)
    [paste original]

    ## Proposed Patch
    [paste diff]

    ## Evidence from Usage Log
    [paste summary stats]

    Review this patch:
    1. Does the evidence support the change?
    2. Could this change hurt other use cases?
    3. Is the change minimal and safe?
    4. Score 1-10: should this be applied?

    If score < 7, explain what additional evidence would be needed.
```
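To make the gate mechanical, the reviewer's verdict can be parsed and compared against the acceptance threshold. A minimal sketch, assuming the reviewer states its verdict as `Score: N/10` — that phrasing is an assumption about the reply, not a guaranteed format, so an unparseable reply falls back to human review:

```python
import re

def gate_patch(review_text, min_score=7):
    """Extract the 1-10 score from a reviewer reply and decide whether
    to recommend the patch. Returns True/False, or None when no score
    can be parsed (assumed 'Score: N/10' phrasing absent)."""
    m = re.search(r"[Ss]core[:\s]+(\d+)\s*/\s*10", review_text)
    if m is None:
        return None  # no parseable verdict -> route to human review
    return int(m.group(1)) >= min_score
```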

Step 5: Present Results


Output a structured report:

ARIS Meta-Optimization Report


Date: [today]
Data: [N] events, [M] skill invocations, [K] sessions
Target: [skill name or "all"]

Proposed Changes


Change 1: [title]


  • Target: [skill/file:line]
  • Signal: [what the data shows]
  • Patch: [diff]
  • Reviewer Score: [X/10]
  • Reviewer Notes: [summary]
  • Status: ✅ Recommended / ⚠️ Needs more data / ❌ Rejected

Change 2: ...


Changes NOT Made (insufficient evidence)


  • [pattern observed but too few samples]

Recommendations


  • Apply Change 1 (reviewer approved)
  • Collect more data for Change 3 (need N more runs)
  • Consider manual review of Change 2

Next Steps


Run `/meta-optimize apply 1` to apply a specific change, or `/meta-optimize apply all` to apply all recommended changes.

Step 6: Apply Changes (if user approves)


If user runs `/meta-optimize apply [N]`:
  1. Back up original SKILL.md to `.aris/meta/backups/`
  2. Apply the patch
  3. Log the change to `.aris/meta/optimizations.jsonl`
  4. Remind user to test the changed skill on their next run
Never auto-apply without user approval.
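Steps 1-3 above can be sketched as a small helper. This is an illustrative sketch, not part of the skill: `apply_change` is a hypothetical name, it assumes the patched text has already been produced, and only the `.aris/meta` paths come from this document:

```python
import json
import shutil
import time
from pathlib import Path

def apply_change(skill_md, patched_text, change_id):
    """Back up the original, write the patched file, and append an
    audit record to .aris/meta/optimizations.jsonl."""
    skill_md = Path(skill_md)
    backups = Path(".aris/meta/backups")
    backups.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup = backups / f"{skill_md.name}.{stamp}.bak"
    shutil.copy2(skill_md, backup)          # 1. back up the original
    skill_md.write_text(patched_text)       # 2. apply the patch
    record = {"ts": stamp, "change": change_id,
              "file": str(skill_md), "backup": str(backup)}
    log = Path(".aris/meta/optimizations.jsonl")
    with log.open("a") as f:                # 3. log the change
        f.write(json.dumps(record) + "\n")
    return backup
```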

Key Rules


  • Log-driven, not speculative. Every proposed change must cite specific data from the event log. No "I think this would be better."
  • Minimal patches. Change one thing at a time. Don't rewrite entire skills.
  • Reviewer-gated. Every patch goes through cross-model review before recommendation.
  • Reversible. Always back up before applying. Always log what changed.
  • User-approved. Never auto-apply. Present, explain, let the user decide.
  • Honest about uncertainty. If the data is insufficient, say so. Don't optimize on noise.
  • Portable. Optimizations should improve the skill for all users, not just one user's style. If a change seems user-specific, flag it.

Event Schema Reference


The log at `.aris/meta/events.jsonl` contains JSONL records with these shapes:

```jsonl
{"ts":"...","session":"...","event":"skill_invoke","skill":"auto-review-loop","args":"difficulty: hard"}
{"ts":"...","session":"...","event":"PostToolUse","tool":"Bash","input_summary":"pdflatex main.tex"}
{"ts":"...","session":"...","event":"codex_call","tool":"mcp__codex__codex","input_summary":"review..."}
{"ts":"...","session":"...","event":"tool_failure","tool":"Bash","input_summary":"python train.py"}
{"ts":"...","session":"...","event":"slash_command","command":"/auto-review-loop","args":""}
{"ts":"...","session":"...","event":"user_prompt","prompt_preview":"change difficulty to hard"}
{"ts":"...","session":"...","event":"session_start","source":"startup","model":"claude-opus-4-6"}
{"ts":"...","session":"...","event":"session_end"}
```

Triggering


This skill is NOT part of the standard W1→W1.5→W2→W3→W4 pipeline. It is a maintenance workflow with three trigger mechanisms:
  1. Passive logging (always on): Claude Code hooks record events to `.aris/meta/events.jsonl` automatically during normal usage. Zero user effort.
  2. Automatic readiness check (SessionEnd hook): When a Claude Code session ends, `check_ready.sh` counts skill invocations since the last `/meta-optimize` run. If ≥5 new invocations have accumulated, it prints a reminder: "📊 ARIS has logged 8 skill runs since last optimization. Run /meta-optimize to check for improvement opportunities." This is a suggestion only — it does not auto-run optimization.
  3. Manual trigger: User runs `/meta-optimize` when they see the reminder or whenever they want.

After each `/meta-optimize` run, the skill writes the current timestamp to `.aris/meta/.last_optimize` so the readiness check only counts new invocations.
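The readiness check can be sketched as follows. This is a Python rendering of the logic attributed to `check_ready.sh`, under the assumption that `.aris/meta/.last_optimize` stores an ISO-8601 timestamp, so plain string comparison orders events correctly:

```python
import json
from pathlib import Path

def new_invocations(events_path, marker_path, threshold=5):
    """Count skill_invoke events newer than the last-optimize marker
    and report whether the reminder should fire. A missing marker
    means everything counts."""
    marker = Path(marker_path)
    last = marker.read_text().strip() if marker.exists() else ""
    count = 0
    for line in Path(events_path).read_text().splitlines():
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        # ISO timestamps sort lexicographically, so > works on strings
        if ev.get("event") == "skill_invoke" and ev.get("ts", "") > last:
            count += 1
    return count, count >= threshold
```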

Acknowledgements


Inspired by Meta-Harness (Lee et al., 2026) — end-to-end optimization of model harnesses via filesystem-based experience access and agentic code search.

Output Protocols


Follow these shared protocols for all output files:
  • Output Versioning Protocol — write timestamped file first, then copy to fixed name
  • Output Manifest Protocol — log every output to MANIFEST.md
  • Output Language Protocol — respect the project's language setting
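The versioning protocol's write-then-copy step can be sketched as follows (`write_versioned` is a hypothetical helper name; the authoritative protocol lives in the shared references):

```python
import shutil
import time
from pathlib import Path

def write_versioned(directory, stem, suffix, text):
    """Write the timestamped file first, then copy it to the fixed
    name, so the fixed name always points at a complete version."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    versioned = directory / f"{stem}_{stamp}{suffix}"
    versioned.write_text(text)
    fixed = directory / f"{stem}{suffix}"
    shutil.copy2(versioned, fixed)
    return versioned, fixed
```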

Review Tracing


After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).