# Meta-Optimize: Outer-Loop Harness Optimization for ARIS
Analyze accumulated usage logs and propose optimizations for: $ARGUMENTS
## Context
ARIS is a research harness — a system of skills, bridges, workflows, and artifact contracts that wraps around LLMs to orchestrate research. This skill implements a prototype outer loop that observes how the harness is used and proposes improvements to the harness itself (not to the research artifacts it produces).
Inspired by Meta-Harness (Lee et al., 2026): the key insight is that harness design matters as much as model weights, and harness engineering can be partially automated by logging execution traces and using them to guide improvements.
## What This Skill Optimizes (Harness Components)
| Component | Example | Optimizable? |
|---|---|---|
| SKILL.md prompts | Reviewer instructions, quality gates, step descriptions | Yes |
| Default parameters | | Yes |
| Convergence rules | When to stop the review loop, retry counts | Yes |
| Workflow ordering | Skill chain sequence within a workflow | Yes |
| Artifact schemas | What fields go in EXPERIMENT_LOG.md, idea-stage/IDEA_REPORT.md | Cautious |
| MCP bridge config | Which reviewer model, routing rules | No (infra) |
Not optimized: The research artifacts themselves (papers, code, experiments). That's what the regular workflows do.
## Prerequisites

- Logging must be active. Copy `templates/claude-hooks/meta_logging.json` into your project's `.claude/settings.json` (or merge the hooks section).
- Sufficient data. At least 5 complete workflow runs logged in `.aris/meta/events.jsonl`. The skill will check and warn if insufficient.
## Workflow

### Step 0: Check Data Availability
```bash
EVENTS_FILE=".aris/meta/events.jsonl"
if [ ! -f "$EVENTS_FILE" ]; then
  echo "ERROR: No event log found at $EVENTS_FILE"
  echo "Enable logging first: copy templates/claude-hooks/meta_logging.json into .claude/settings.json"
  exit 1
fi
EVENT_COUNT=$(wc -l < "$EVENTS_FILE")
# grep -c already prints 0 when nothing matches; "|| true" just ignores its
# nonzero exit status. (A fallback "echo 0" would append a second line to the
# captured value and break the integer comparison below.)
SKILL_INVOCATIONS=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || true)
SESSIONS=$(grep -c '"session_start"' "$EVENTS_FILE" || true)
echo "📊 Event log: $EVENT_COUNT events, $SKILL_INVOCATIONS skill invocations, $SESSIONS sessions"
if [ "$SKILL_INVOCATIONS" -lt 5 ]; then
  echo "⚠️ Insufficient data (<5 skill invocations). Continue using ARIS normally and re-run later."
  exit 0
fi
```
### Step 1: Analyze Usage Patterns
Read `.aris/meta/events.jsonl` and compute:

Frequency analysis:
- Which skills are invoked most often?
- Which slash commands do users type most?
- What parameter overrides are most common? (These suggest bad defaults.)
Failure analysis:
- Which tools fail most often? In which skills?
- What error patterns repeat? (OOM, import, compilation, timeout)
- How many auto-debug retries per workflow run?
Convergence analysis (for auto-review-loop):
- Average rounds to reach threshold
- Score trajectory shape (fast improvement? plateau? oscillation?)
- Which review round catches the most critical issues?
- Do users override difficulty mid-run?
Human intervention analysis:
- Where do users interrupt with manual prompts during workflows?
- What manual corrections do users make most? (These indicate skill gaps.)
Present findings as a structured summary table.
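As a sketch, the frequency and failure counts described above can be computed with standard shell tools. This assumes the JSONL field names shown in the Event Schema Reference; `EVENTS_FILE` is overridable so the snippet can be pointed at any log copy.

```shell
# Sketch of the Step 1 frequency and failure analyses (assumes the event
# shapes documented in the Event Schema Reference).
EVENTS_FILE="${EVENTS_FILE:-.aris/meta/events.jsonl}"
[ -f "$EVENTS_FILE" ] || { echo "no event log yet"; exit 0; }

echo "Skill invocation frequency:"
grep '"event":"skill_invoke"' "$EVENTS_FILE" \
  | sed 's/.*"skill":"\([^"]*\)".*/\1/' | sort | uniq -c | sort -rn

echo "Tool failure frequency:"
grep '"event":"tool_failure"' "$EVENTS_FILE" \
  | sed 's/.*"tool":"\([^"]*\)".*/\1/' | sort | uniq -c | sort -rn

echo "Most common parameter overrides:"
grep '"event":"skill_invoke"' "$EVENTS_FILE" \
  | sed 's/.*"args":"\([^"]*\)".*/\1/' | sort | uniq -c | sort -rn
```

Each `sort | uniq -c | sort -rn` pipeline yields a count-ranked list, which maps directly onto the summary table this step asks for.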
### Step 2: Identify Optimization Targets
Based on Step 1, rank optimization opportunities by expected impact:

```markdown
## Optimization Opportunities (ranked)

| # | Target | Signal | Proposed Change | Expected Impact |
|---|---|---|---|---|
| 1 | auto-review-loop default threshold | Users override to 7/10 in 60% of runs | Change default from 6/10 to 7/10 | Fewer manual overrides |
| 2 | experiment-bridge retry count | 40% of runs hit max retries on OOM | Add OOM-specific recovery (reduce batch size) | Fewer failed experiments |
| 3 | paper-write de-AI patterns | Users manually fix "delve" in 80% of runs | Add "delve" to default watchword list | Fewer manual edits |
```

If `$ARGUMENTS` specifies a target skill, focus analysis on that skill only.
If `$ARGUMENTS` is empty or "all", analyze all skills with sufficient data.

### Step 3: Generate Patch Proposals
For each optimization target, generate a concrete diff:

```diff
--- a/skills/auto-review-loop/SKILL.md
+++ b/skills/auto-review-loop/SKILL.md
@@ -15,7 +15,7 @@
 ## Constants
-- **SCORE_THRESHOLD = 6** — Minimum review score to accept.
+- **SCORE_THRESHOLD = 7** — Minimum review score to accept. (Raised based on usage data: 60% of users overrode to 7+.)
```

Rules for patch generation:
- One patch per optimization target
- Each patch must include a comment explaining WHY (with data from the log)
- Patches must be minimal — change only what the data supports
- Never change artifact schemas or MCP bridge config in v1
- Never change behavior that would break existing user workflows
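The "minimal patches" rule can also be enforced mechanically. The following sketch rejects a proposal that touches more than one file or changes too many lines; the file name `proposal.diff` and the 10-line limit are illustrative assumptions, not part of the skill.

```shell
# Illustrative guardrail for the minimal-patch rule (hypothetical file name
# and thresholds). Counts files and changed lines in a unified diff.
PATCH="proposal.diff"
[ -f "$PATCH" ] || { echo "no patch to check"; exit 0; }

FILES=$(grep -c '^--- ' "$PATCH")
# Changed lines start with + or -, excluding the ---/+++ file headers.
CHANGED=$(grep -vE '^(\+\+\+|---)' "$PATCH" | grep -c '^[+-]')
if [ "$FILES" -gt 1 ] || [ "$CHANGED" -gt 10 ]; then
  echo "REJECT: patch touches $FILES file(s), $CHANGED changed line(s)"
else
  echo "OK: minimal patch ($CHANGED changed line(s) in $FILES file)"
fi
```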
### Step 4: Cross-Model Review of Patches
Send each patch to GPT-5.4 xhigh for adversarial review:

```yaml
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are reviewing a proposed optimization to an ARIS SKILL.md file.

    ## Original Skill (relevant section)
    [paste original]

    ## Proposed Patch
    [paste diff]

    ## Evidence from Usage Log
    [paste summary stats]

    Review this patch:
    1. Does the evidence support the change?
    2. Could this change hurt other use cases?
    3. Is the change minimal and safe?
    4. Score 1-10: should this be applied?

    If score < 7, explain what additional evidence would be needed.
```
### Step 5: Present Results
Output a structured report:

```markdown
# ARIS Meta-Optimization Report

Date: [today]
Data: [N] events, [M] skill invocations, [K] sessions
Target: [skill name or "all"]

## Proposed Changes

### Change 1: [title]

- Target: [skill/file:line]
- Signal: [what the data shows]
- Patch: [diff]
- Reviewer Score: [X/10]
- Reviewer Notes: [summary]
- Status: ✅ Recommended / ⚠️ Needs more data / ❌ Rejected

### Change 2: ...

## Changes NOT Made (insufficient evidence)

- [pattern observed but too few samples]

## Recommendations

- Apply Change 1 (reviewer approved)
- Collect more data for Change 3 (need N more runs)
- Consider manual review of Change 2

## Next Steps

Run `/meta-optimize apply 1` to apply a specific change, or `/meta-optimize apply all` to apply all recommended changes.
```

### Step 6: Apply Changes (if user approves)
If the user runs `/meta-optimize apply [N]`:

- Back up the original SKILL.md to `.aris/meta/backups/`
- Apply the patch
- Log the change to `.aris/meta/optimizations.jsonl`
- Remind the user to test the changed skill on their next run

Never auto-apply without user approval.
## Key Rules
- Log-driven, not speculative. Every proposed change must cite specific data from the event log. No "I think this would be better."
- Minimal patches. Change one thing at a time. Don't rewrite entire skills.
- Reviewer-gated. Every patch goes through cross-model review before recommendation.
- Reversible. Always back up before applying. Always log what changed.
- User-approved. Never auto-apply. Present, explain, let the user decide.
- Honest about uncertainty. If the data is insufficient, say so. Don't optimize on noise.
- Portable. Optimizations should improve the skill for all users, not just one user's style. If a change seems user-specific, flag it.
## Event Schema Reference
The log at `.aris/meta/events.jsonl` contains JSONL records with these shapes:

```jsonl
{"ts":"...","session":"...","event":"skill_invoke","skill":"auto-review-loop","args":"difficulty: hard"}
{"ts":"...","session":"...","event":"PostToolUse","tool":"Bash","input_summary":"pdflatex main.tex"}
{"ts":"...","session":"...","event":"codex_call","tool":"mcp__codex__codex","input_summary":"review..."}
{"ts":"...","session":"...","event":"tool_failure","tool":"Bash","input_summary":"python train.py"}
{"ts":"...","session":"...","event":"slash_command","command":"/auto-review-loop","args":""}
{"ts":"...","session":"...","event":"user_prompt","prompt_preview":"change difficulty to hard"}
{"ts":"...","session":"...","event":"session_start","source":"startup","model":"claude-opus-4-6"}
{"ts":"...","session":"...","event":"session_end"}
```
{"ts":"...","session":"...","event":"skill_invoke","skill":"auto-review-loop","args":"difficulty: hard"}
{"ts":"...","session":"...","event":"PostToolUse","tool":"Bash","input_summary":"pdflatex main.tex"}
{"ts":"...","session":"...","event":"codex_call","tool":"mcp__codex__codex","input_summary":"review..."}
{"ts":"...","session":"...","event":"tool_failure","tool":"Bash","input_summary":"python train.py"}
{"ts":"...","session":"...","event":"slash_command","command":"/auto-review-loop","args":""}
{"ts":"...","session":"...","event":"user_prompt","prompt_preview":"change difficulty to hard"}
{"ts":"...","session":"...","event":"session_start","source":"startup","model":"claude-opus-4-6"}
{"ts":"...","session":"...","event":"session_end"}Triggering
触发方式
This skill is NOT part of the standard W1→W1.5→W2→W3→W4 pipeline. It is a maintenance workflow with three trigger mechanisms:
- Passive logging (always on): Claude Code hooks record events to `.aris/meta/events.jsonl` automatically during normal usage. Zero user effort.
- Automatic readiness check (SessionEnd hook): When a Claude Code session ends, `check_ready.sh` counts skill invocations since the last `/meta-optimize` run. If ≥5 new invocations have accumulated, it prints a reminder: "📊 ARIS has logged 8 skill runs since last optimization. Run /meta-optimize to check for improvement opportunities." This is a suggestion only — it does not auto-run optimization.
- Manual trigger: The user runs `/meta-optimize` when they see the reminder or whenever they want.

After each `/meta-optimize` run, the skill writes the current timestamp to `.aris/meta/.last_optimize` so the readiness check only counts new invocations.

A minimal sketch of the readiness check follows. It is not the real `check_ready.sh`: for simplicity it assumes a variant that stores the last-seen invocation count in `.aris/meta/.last_optimize` (the skill stores a timestamp), and the threshold of 5 comes from the prerequisites above.

```shell
# Illustrative readiness check (NOT the real check_ready.sh): compare the
# total skill_invoke count against the count recorded at the last optimize.
EVENTS_FILE=".aris/meta/events.jsonl"
LAST=".aris/meta/.last_optimize"
[ -f "$EVENTS_FILE" ] || exit 0

TOTAL=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || true)
SEEN=0
[ -f "$LAST" ] && SEEN=$(cat "$LAST")
NEW=$((TOTAL - SEEN))
if [ "$NEW" -ge 5 ]; then
  echo "📊 ARIS has logged $NEW skill runs since last optimization. Run /meta-optimize to check for improvement opportunities."
fi
```
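A minimal sketch of the readiness check follows. It is not the real `check_ready.sh`: for simplicity it assumes a variant that stores the last-seen invocation count in `.aris/meta/.last_optimize` (the skill stores a timestamp), and the threshold of 5 comes from the prerequisites above.

```shell
# Illustrative readiness check (NOT the real check_ready.sh): compare the
# total skill_invoke count against the count recorded at the last optimize.
EVENTS_FILE=".aris/meta/events.jsonl"
LAST=".aris/meta/.last_optimize"
[ -f "$EVENTS_FILE" ] || exit 0

TOTAL=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || true)
SEEN=0
[ -f "$LAST" ] && SEEN=$(cat "$LAST")
NEW=$((TOTAL - SEEN))
if [ "$NEW" -ge 5 ]; then
  echo "📊 ARIS has logged $NEW skill runs since last optimization. Run /meta-optimize to check for improvement opportunities."
fi
```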
## Acknowledgements
Inspired by Meta-Harness (Lee et al., 2026) — end-to-end optimization of model harnesses via filesystem-based experience access and agentic code search.
## Output Protocols
Follow these shared protocols for all output files:
- Output Versioning Protocol — write timestamped file first, then copy to fixed name
- Output Manifest Protocol — log every output to MANIFEST.md
- Output Language Protocol — respect the project's language setting
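The first two protocols combine into a short write-then-copy sequence; the directory and file names below are illustrative, since the shared protocol documents define the exact paths.

```shell
# Illustrative write-then-copy for the versioning and manifest protocols
# (hypothetical output directory and file names).
OUT_DIR="outputs"
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT_DIR"

REPORT="$OUT_DIR/META_REPORT_$STAMP.md"
printf '# ARIS Meta-Optimization Report\n' > "$REPORT"    # timestamped file first
cp "$REPORT" "$OUT_DIR/META_REPORT.md"                    # then copy to fixed name
printf -- "- %s META_REPORT.md\n" "$STAMP" >> "$OUT_DIR/MANIFEST.md"  # log output
```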
## Review Tracing
After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write trace files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `trace` parameter (default: `full`).