# CAW Eval
End-to-end evaluation of CAW Agent quality, producing scoring data and analysis reports.
## Step 0: Environment Identification (Mandatory)
Before executing any subsequent steps, confirm the current environment:

```bash
[[ "$(hostname)" == *openclaw* ]] && echo "env=openclaw" || echo "env=local"
```

- `env=openclaw`: this SKILL (caw-eval) is the local CC version and cannot run on the openclaw server. Tell the user: "You are currently on the openclaw server; the local CC evaluation SKILL does not apply. Please use caw-eval-openclaw (trigger it by saying 'weak model evaluation' or 'openclaw evaluation')." Then stop.
- `env=local`: continue with the process routing below.
## Process Routing
Select the evaluation method from the user's intent, then read the corresponding reference file and follow its steps:

| User says | Evaluation method | Read and execute |
|---|---|---|
| "run eval", "evaluate CAW", "eval", "score", "claude code eval" | Claude Code evaluation (default dataset: `caw-agent-eval-seth-v2`) | → read run-eval-cc.md and follow its steps |
| "recipe eval", "recipe evaluation" | Recipe evaluation (transaction-construction mode) | → read run-eval-recipe.md and follow its steps |
| "recipe comparison eval", "recipe comparison" | Recipe three-mode comparison evaluation | → read run-eval-recipe.md (run three times and compare) |
| "weak model verification", "openclaw eval", "model compatibility" | Openclaw weak-model verification | → read run-eval-openclaw.md and follow its steps (SSH to the server to run the evaluation → download sessions → score locally and produce the report) |

Default to the Claude Code evaluation when the user does not explicitly say "weak model" or "openclaw".
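The routing above can be sketched as a small shell dispatcher. This is a hypothetical helper, not one of this SKILL's scripts, and the phrase patterns are illustrative only:

```shell
# Hypothetical dispatcher: map a user request to the reference file to read.
# Pattern order matters: "recipe comparison" must be checked before the
# broader "recipe" pattern, otherwise comparison requests fall through.
route() {
  case "$1" in
    *"recipe comparison"*)                             echo "run-eval-recipe.md (three-mode comparison)" ;;
    *recipe*)                                          echo "run-eval-recipe.md" ;;
    *"weak model"*|*openclaw*|*"model compatibility"*) echo "run-eval-openclaw.md" ;;
    *)                                                 echo "run-eval-cc.md" ;;  # default: Claude Code evaluation
  esac
}

route "recipe eval"          # → run-eval-recipe.md
route "model compatibility"  # → run-eval-openclaw.md
route "run eval"             # → run-eval-cc.md
```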
## Overview
### Claude Code Evaluation (Main Method)
Run Sonnet subagents in local Claude Code for parallel execution and scoring, then use an Opus subagent to generate the analysis report.

```
Check environment → fetch case list → Sonnet subagents run 14 cases in parallel
→ collect sessions → upload to Langfuse → LLM Judge scoring → apply scores
→ Opus subagent generates the report
```

- Time: about 40 minutes (14 cases, 4 in parallel)
- Model division of labor:
  - Main session / Steps 1-8: Sonnet (orchestration and script scheduling; consumes no Opus quota)
  - Step 3 evaluation subagents: Sonnet (independent weekly quota)
  - Step 7 judge subagent: Sonnet (or call the API directly to save CC quota)
  - Step 9 report subagent: Opus (in-depth analysis; isolated context saves tokens)
- Detailed steps: run-eval-cc.md
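The "14 cases, 4 in parallel" bound can be reproduced with plain `xargs -P`. The sketch below is a stand-in: `echo` replaces launching an evaluation subagent per case.

```shell
# Generate 14 case IDs and process at most 4 concurrently with xargs -P 4.
# `echo "running {}"` stands in for spawning one evaluation subagent per case.
printf 'case-%02d\n' $(seq 1 14) | xargs -P 4 -I{} echo "running {}"
```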
### Openclaw Weak-Model Verification
Execute with weak models on the Openclaw server, then score in local Claude Code. Three-layer separated architecture.

```
Server: script generates prompts → weak model executes tasks → script collects sessions → pack
        ↓ gcp scp
Local:  import sessions → LLM Judge scoring → upload to Langfuse → generate report
```

- Use case: verifying a Skill's compatibility with weak models before launch
- Detailed steps: run-eval-openclaw.md
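The pack → transfer → unpack leg can be exercised locally. In this sketch a plain `cp` stands in for the real `gcp scp` hop, and all paths are throwaway temp directories:

```shell
# Local stand-in for "server packs sessions → copy down → local unpacks".
work=$(mktemp -d)
mkdir -p "$work/server/sessions" "$work/local"
echo '{"case":"transfer-01"}' > "$work/server/sessions/s1.json"

tar czf "$work/sessions.tgz" -C "$work/server" sessions   # server side: pack
cp "$work/sessions.tgz" "$work/local/"                    # stand-in for the gcp scp hop
tar xzf "$work/local/sessions.tgz" -C "$work/local"       # local side: unpack

ls "$work/local/sessions"   # → s1.json
```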
## Scoring System
### Standard Mode

```
comprehensive = task_completion × 0.3 + process_quality × 0.7
process_quality = S1(intent) × 0.15 + S2(Pact) × 0.45 + S3(execution) × 0.4
```

### Recipe Mode (Transaction Construction Evaluation)

```
comprehensive = S1(intent) × 0.20 + S2(Pact) × 0.45 + S3(tx construction) × 0.35
S3 = tx_construction_correctness × 0.5 + recipe_adherence × 0.3 + tx_submission_success × 0.2
```

There is no task_completion score: only whether the transaction was correctly constructed and submitted is evaluated, not the on-chain execution result.

All scores range from 0 to 1. See scoring.md for details.
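As a sanity check, the standard-mode formula can be evaluated with awk; the sub-scores below are made-up example values, not outputs of the real pipeline:

```shell
# Example values only; real scores come from the assertions + judge pipeline.
awk 'BEGIN {
  task_completion = 0.9
  s1 = 0.8; s2 = 0.7; s3 = 0.85                  # intent / Pact / execution
  pq = s1*0.15 + s2*0.45 + s3*0.4                # process_quality
  printf "process_quality=%.4f comprehensive=%.4f\n", pq, task_completion*0.3 + pq*0.7
}'
# → process_quality=0.7750 comprehensive=0.8125
```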
## Datasets
| Dataset | Cases | Scenario types | Notes |
|---|---|---|---|
| `caw-agent-eval-seth-v2` | 14 | transfer/swap/lend/dca/... | Default; Ethereum Sepolia testnet |
| `caw-recipe-eval-seth-v1` | - | recipe | Recipe multi-step scenarios; Sepolia testnet |

- Default to `caw-agent-eval-seth-v2`; switch to `caw-recipe-eval-seth-v1` when the user explicitly says "recipe evaluation"
- Another dataset can be specified via `--dataset-name`
- For existing datasets and creating new ones, see dataset-management.md
## Server Environment Setup
When setting up a new openclaw server (installing openclaw / caw / langfuse, wallet onboarding, funding):
→ server-setup.md
## Scripts
| Script | Purpose |
|---|---|
| | Claude Code evaluation orchestration (prepare/collect/upload/score/import-sessions) |
| | Openclaw evaluation orchestration (prepare/collect/upload/pack) |
| | Shared utilities (Langfuse client/datasets/upload) |
| | LLM-as-Judge (prompt construction + API calls) |
| | Structured assertions + threshold checks |
| | Scoring pipeline (assertions + judge → comprehensive score → Langfuse) |
| | session → Langfuse trace |
| | Dataset generation |