Meta-Harness Optimization

Skill by ara.so — Daily 2026 Skills collection.
Meta-Harness is a framework for automated end-to-end search over model harnesses — the scaffolding code around a fixed base model that controls what the model stores, retrieves, and sees while working on a task. Rather than hand-crafting prompts and memory systems, Meta-Harness proposes, evaluates, and evolves harness implementations automatically.
Core Concepts
| Term | Meaning |
|---|---|
| Harness | All code around the base model: memory, retrieval, prompt construction, tool use |
| Proposer Agent | LLM (e.g. Claude Code) that proposes new harness variants |
| Evaluator | Runs proposed harnesses on a benchmark, returns a score |
| Meta-Loop | Iterative propose → evaluate → feedback cycle |
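The propose → evaluate → feedback cycle can be made concrete with a toy loop. Everything below (the one-parameter search space, the scoring function, the mutation rule) is illustrative only, not part of the framework:

```python
import random

random.seed(0)  # deterministic toy run

def propose(history):
    """Toy proposer: mutate the best k seen so far, or guess on the first round."""
    if history:
        best = max(history, key=lambda h: h["score"])
        return max(1, best["k"] + random.choice([-1, 0, 1]))
    return random.randint(1, 8)

def evaluate(k):
    """Toy evaluator: pretend the hidden optimum is k = 5."""
    return 1.0 - abs(k - 5) / 10

history = []
for i in range(20):
    k = propose(history)                            # 1. propose
    history.append({"k": k, "score": evaluate(k)})  # 2. evaluate + 3. record feedback

best = max(history, key=lambda h: h["score"])
print(best)
```

The real meta-loop (shown later in `meta_harness.py`) has the same shape, with an LLM proposer and a benchmark evaluator in place of these stand-ins.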
Installation
Meta-Harness uses `uv` for dependency management. Each reference experiment is self-contained:

```bash
# Text classification experiment
cd reference_examples/text_classification
uv sync

# Terminal-Bench 2 experiment
cd reference_examples/terminal_bench_2
uv sync
```

No global pip install is needed. All dependencies are managed per-experiment via `pyproject.toml`.
Quick Start
Text Classification (Memory System Search)
```bash
cd reference_examples/text_classification

# Run 1 iteration of meta-harness optimization
uv run python meta_harness.py --iterations 1

# Run more iterations for better optimization
uv run python meta_harness.py --iterations 10
```
Terminal-Bench 2 (Scaffold Evolution)
```bash
cd reference_examples/terminal_bench_2

# Smoke test with a single task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
```

General eval format:

```
run_eval.sh <agent_module:AgentClass> <split> <num_tasks> <num_workers> [flags]
```
Applying Meta-Harness to a New Domain
The recommended workflow uses the onboarding document with your AI coding assistant:
1. Open `ONBOARDING.md` in your coding assistant (Claude Code, Cursor, etc.) and have a conversation about your domain. This produces `domain_spec.md`.
2. `domain_spec.md` will contain:
   - What the harness controls in your domain
   - How to evaluate harness quality (benchmark / metric)
   - What the proposer agent should modify
   - Constraints and budget considerations
Minimum Required Components for a New Domain
```
my_domain/
├── pyproject.toml       # uv-managed dependencies
├── domain_spec.md       # generated via ONBOARDING.md conversation
├── meta_harness.py      # main optimization loop
├── harness.py           # base harness implementation
├── evaluator.py         # benchmark runner → numeric score
└── claude_wrapper.py    # proposer agent wrapper
```
Implementing a Harness
A harness wraps a base model and manages context/memory/tools:
```python
# harness.py — minimal harness structure
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."

class AgentHarness:
    def __init__(self, config: HarnessConfig):
        self.config = config
        self.memory: list[dict] = []

    def reset(self):
        self.memory = []

    def _build_context(self, new_input: str) -> list[dict]:
        """Core harness logic: what does the model see?"""
        if self.config.memory_strategy == "last_k":
            recent = self.memory[-self.config.k:]
        elif self.config.memory_strategy == "all":
            recent = self.memory[:]
        else:
            recent = []
        return recent + [{"role": "user", "content": new_input}]

    def step(self, user_input: str) -> str:
        messages = self._build_context(user_input)
        # Call base model with constructed context
        # (call_model is provider-specific and implemented elsewhere)
        response = call_model(
            model=self.config.model,
            system=self.config.system_prompt,
            messages=messages,
        )
        # Update memory
        self.memory.append({"role": "user", "content": user_input})
        self.memory.append({"role": "assistant", "content": response})
        return response
```
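A quick way to sanity-check the context-building behavior is to run the harness against a stubbed `call_model`. The stand-in classes below are trimmed mirrors of the sketch above, not the shipped code:

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:  # trimmed stand-in for the config above
    memory_strategy: str = "last_k"
    k: int = 2

def call_model(messages):
    # Stub: report how many messages the model would see.
    return f"saw {len(messages)} messages"

class AgentHarness:  # trimmed stand-in for the harness above
    def __init__(self, config):
        self.config = config
        self.memory = []

    def _build_context(self, new_input):
        if self.config.memory_strategy == "last_k":
            recent = self.memory[-self.config.k:]
        else:
            recent = self.memory[:]
        return recent + [{"role": "user", "content": new_input}]

    def step(self, user_input):
        response = call_model(self._build_context(user_input))
        self.memory.append({"role": "user", "content": user_input})
        self.memory.append({"role": "assistant", "content": response})
        return response

h = AgentHarness(HarnessConfig())
print(h.step("first"))   # memory empty: model sees 1 message
print(h.step("second"))  # k=2 remembered turns + the new message: sees 3
```

Swapping `memory_strategy` or `k` in the config changes what the model sees without touching the harness code, which is exactly the surface the proposer agent searches over.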
Implementing the Evaluator
```python
# evaluator.py — runs harness on benchmark, returns score
from harness import AgentHarness, HarnessConfig

def grade(prediction: str, label: str) -> bool:
    """Task-specific grading logic."""
    return label.lower().strip() in prediction.lower()

def evaluate_harness(config: HarnessConfig, dataset: list[dict]) -> float:
    """
    Evaluate a harness configuration on a dataset.
    Returns a scalar score (higher is better).
    """
    harness = AgentHarness(config)
    correct = 0
    for example in dataset:
        harness.reset()
        prediction = harness.step(example["input"])
        if grade(prediction, example["label"]):
            correct += 1
    return correct / len(dataset)
```
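The substring check in `grade` is deliberately loose; a few toy cases show its behavior (and its main failure mode: label strings that appear inside unrelated words):

```python
def grade(prediction: str, label: str) -> bool:
    """Same logic as above: the normalized label must appear in the prediction."""
    return label.lower().strip() in prediction.lower()

print(grade("Category: Sports", "sports"))      # True: case-insensitive match
print(grade("Category: Sports", " business "))  # False: label absent
print(grade("worldwide news", "world"))         # True: substring false positive
```

For real benchmarks you would replace this with exact-match or task-appropriate scoring; the meta-loop only requires that the evaluator return a scalar.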
The Meta-Harness Loop
```python
# meta_harness.py — the optimization loop
from pathlib import Path

from evaluator import evaluate_harness
from claude_wrapper import run_proposer
# parse_proposal (task-specific, defined elsewhere) turns a proposer
# reply into a HarnessConfig.

def meta_harness_loop(
    iterations: int = 10,
    train_dataset: list = None,
    val_dataset: list = None,
):
    history: list[dict] = []
    best_score = 0.0
    best_config = None

    for i in range(iterations):
        print(f"\n=== Iteration {i+1}/{iterations} ===")

        # 1. Propose: ask the proposer agent for a new harness variant
        proposal = run_proposer(
            history=history,
            task_description="Optimize the memory system for text classification.",
            code_context=Path("harness.py").read_text(),
        )

        # 2. Evaluate: run the proposed harness
        try:
            new_config = parse_proposal(proposal)
            score = evaluate_harness(new_config, train_dataset)
        except Exception as e:
            score = 0.0
            print(f"Evaluation failed: {e}")

        # 3. Record: log result for proposer feedback
        record = {
            "iteration": i + 1,
            "proposal": proposal,
            "score": score,
        }
        history.append(record)
        print(f"Score: {score:.4f}")

        if score > best_score:
            best_score = score
            best_config = new_config
            print(f"New best: {best_score:.4f}")

    # Final validation on held-out set
    if best_config and val_dataset:
        val_score = evaluate_harness(best_config, val_dataset)
        print(f"\nFinal val score: {val_score:.4f}")

    return best_config, history
```
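The loop calls a `parse_proposal` helper that this document does not define. One minimal sketch (an assumption, not the shipped parser) extracts the first JSON object from the proposer's reply and feeds it to `HarnessConfig`:

```python
import json
import re
from dataclasses import dataclass

@dataclass
class HarnessConfig:  # trimmed stand-in mirroring harness.py
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5

def parse_proposal(proposal: str) -> HarnessConfig:
    """Extract the first flat JSON object from the proposer's text reply."""
    match = re.search(r"\{.*?\}", proposal, re.DOTALL)
    if match is None:
        raise ValueError("no JSON config found in proposal")
    return HarnessConfig(**json.loads(match.group(0)))

cfg = parse_proposal('Use full memory: {"memory_strategy": "all", "k": 10}')
print(cfg.memory_strategy, cfg.k)  # all 10
```

Note the non-greedy regex only handles flat objects; a production parser would also need to cope with nested braces and malformed JSON, which is why the loop wraps parsing and evaluation in a `try`/`except`.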
Proposer Agent Wrapper (Claude Code)
The shipped examples use Claude Code as the proposer. Adapt `claude_wrapper.py`:
undefinedclaude_wrapper.py — wraps proposer agent calls
claude_wrapper.py — 封装提出Agent的调用
import subprocess
import json
from pathlib import Path
def run_proposer(
history: list[dict],
task_description: str,
code_context: str,
) -> str:
"""
Call Claude Code (or another proposer) to suggest harness modifications.
Logs all interactions for reproducibility.
"""
prompt = build_proposer_prompt(history, task_description, code_context)
# Example: call Claude via API
import anthropic
client = anthropic.Anthropic() # uses ANTHROPIC_API_KEY env var
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
)
result = response.content[0].text
# Log for reproducibility
log_entry = {"prompt": prompt, "response": result}
with open("proposer_log.jsonl", "a") as f:
f.write(json.dumps(log_entry) + "\n")
return resultdef build_proposer_prompt(
history: list[dict],
task_description: str,
code_context: str,
) -> str:
history_str = "\n".join(
f"Iteration {h['iteration']}: score={h['score']:.4f}\nProposal:\n{h['proposal']}"
for h in history[-5:] # last 5 for context window
)
return f"""You are optimizing a model harness for: {task_description}
Current harness code:
python
{code_context}Optimization history (recent):
{history_str if history_str else "No history yet — this is the first iteration."}
Propose a modified HarnessConfig or changes to the harness code that may improve performance.
Output your proposal as a JSON config dict, followed by any code changes.
"""
undefinedimport subprocess
import json
from pathlib import Path
def run_proposer(
history: list[dict],
task_description: str,
code_context: str,
) -> str:
"""
调用Claude Code(或其他提出Agent)建议封装框架修改方案。
记录所有交互以保证可复现性。
"""
prompt = build_proposer_prompt(history, task_description, code_context)
# 示例:通过API调用Claude
import anthropic
client = anthropic.Anthropic() # 使用ANTHROPIC_API_KEY环境变量
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
)
result = response.content[0].text
# 记录以保证可复现性
log_entry = {"prompt": prompt, "response": result}
with open("proposer_log.jsonl", "a") as f:
f.write(json.dumps(log_entry) + "\n")
return resultdef build_proposer_prompt(
history: list[dict],
task_description: str,
code_context: str,
) -> str:
history_str = "\n".join(
f"Iteration {h['iteration']}: score={h['score']:.4f}\nProposal:\n{h['proposal']}"
for h in history[-5:] # 取最近5轮作为上下文窗口
)
return f"""You are optimizing a model harness for: {task_description}
Current harness code:
python
{code_context}Optimization history (recent):
{history_str if history_str else "No history yet — this is the first iteration."}
Propose a modified HarnessConfig or changes to the harness code that may improve performance.
Output your proposal as a JSON config dict, followed by any code changes.
"""
Environment Variables
```bash
# Required for Claude-based proposer
export ANTHROPIC_API_KEY=your_key_here

# Optional: control model used
export PROPOSER_MODEL=claude-opus-4-5
export EVALUATOR_MODEL=claude-3-5-sonnet-20241022
```
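A script that honors these variables can read them with `os.environ.get`, falling back to the defaults named above (illustrative; the shipped scripts may wire configuration differently):

```python
import os

# Fall back to the documented defaults when the variables are unset.
PROPOSER_MODEL = os.environ.get("PROPOSER_MODEL", "claude-opus-4-5")
EVALUATOR_MODEL = os.environ.get("EVALUATOR_MODEL", "claude-3-5-sonnet-20241022")

print(PROPOSER_MODEL, EVALUATOR_MODEL)
```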
Reference Experiment Structure
Text Classification (`reference_examples/text_classification/`)

Searches over memory system configurations for a classification task:
- Proposer modifies memory strategy, retrieval settings, prompt templates
- Evaluator scores on held-out classification benchmark
- Optimized config is saved for reuse

```bash
uv run python meta_harness.py --iterations 20 --dataset ag_news
```

Terminal-Bench 2 (`reference_examples/terminal_bench_2/`)

Evolves agent scaffolding for computer-use / terminal tasks:
```bash
# Run baseline agent on a specific task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf

# Arguments: module:Class <split> <num_tasks> <num_workers> [task_filter]
```

Optimized artifact: `stanford-iris-lab/meta-harness-tbench2-artifact`
Common Patterns
Saving and Loading Optimized Configs
```python
import json
from dataclasses import asdict

# Save
with open("best_config.json", "w") as f:
    json.dump(asdict(best_config), f, indent=2)

# Load
with open("best_config.json") as f:
    data = json.load(f)
config = HarnessConfig(**data)
```
Adding Early Stopping
```python
PATIENCE = 3
no_improve = 0

for i in range(iterations):
    score = evaluate_harness(config, dataset)
    if score > best_score + 1e-4:
        best_score = score
        no_improve = 0
    else:
        no_improve += 1
    if no_improve >= PATIENCE:
        print(f"Early stop at iteration {i+1}")
        break
```
Parallel Evaluation
```python
from concurrent.futures import ProcessPoolExecutor

def batch_evaluate(configs, dataset, num_workers=4):
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(evaluate_harness, c, dataset) for c in configs]
        return [f.result() for f in futures]
```
Troubleshooting
| Problem | Likely Cause | Fix |
|---|---|---|
| Missing Python version | | Install Python 3.11+ via |
| Proposer returns unparseable JSON | Prompt too vague | Add explicit JSON schema to proposer prompt |
| Scores don't improve | Too few iterations or search space too large | Increase |
| API rate limits | Too many evaluator calls | Add |
| Claude Code not found | CLI not installed | |
Citation
```bibtex
@misc{lee2026metaharnessendtoendoptimizationmodel,
  title={Meta-Harness: End-to-End Optimization of Model Harnesses},
  author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},
  year={2026},
  eprint={2603.28052},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.28052},
}
```