Evaluate the quality of the CAW (Cobo Agentic Wallet) Agent in local Claude Code, and generate scoring data and analysis reports. Use when users want to run a CAW evaluation, conduct an evaluation, test the Skill, assess Agent quality, or generate evaluation reports, or when they say "run evaluation", "evaluate CAW", "eval", or "score". For weak-model / openclaw evaluation, use caw-eval-openclaw instead (only installed on openclaw servers).
## Installation

```shell
npx skill4agent add cobosteven/cobo-agent-wallet-manual caw-eval
```

## Environment Check

```shell
[[ "$(hostname)" == *openclaw* ]] && echo "env=openclaw" || echo "env=local"
```

If the output is `env=openclaw`, say: "You are currently on the openclaw server, and the local CC evaluation SKILL is not applicable. Please use caw-eval-openclaw (trigger by saying 'weak model evaluation' or 'openclaw evaluation')." Then stop.

If the output is `env=local`, pick the evaluation method from the table below:

## Evaluation Modes

| User Says | Evaluation Method | Read and Execute |
|---|---|---|
| "run evaluation", "evaluate CAW", "eval", "score", "claude code evaluation" | Claude Code Evaluation (default dataset: caw-agent-eval-seth-v2) | → Read run-eval-cc.md and execute step by step |
| "recipe evaluation", "recipe eval" | Recipe Evaluation (transaction construction mode) | → Read run-eval-recipe.md and execute step by step |
| "recipe comparison evaluation", "recipe comparison" | Recipe Three-Mode Comparison Evaluation | → Read run-eval-recipe.md (run three times for comparison) |
| "weak model verification", "openclaw evaluation", "model compatibility" | Openclaw Weak Model Verification | → Read run-eval-openclaw.md and execute step by step (SSH to the server to run evaluation → download session → local scoring and report generation) |
## Evaluation Flow

**Claude Code evaluation:**

```
Check environment → Obtain case list → Sonnet subagent executes 14 cases in parallel
→ Collect sessions → Upload to Langfuse → LLM Judge scoring → Apply scores
→ Opus subagent generates report
```

**Openclaw weak-model evaluation:**

```
Server: Script generates prompt → Weak model executes task → Script collects sessions → Package
        ↓ gcp scp
Local:  Import sessions → LLM Judge scoring → Upload to Langfuse → Generate report
```

## Scoring

**Claude Code mode:**

```
Comprehensive Score = task_completion × 0.3 + process_quality × 0.7
process_quality = S1(Intention) × 0.15 + S2(Pact) × 0.45 + S3(Execution) × 0.4
```

**Recipe mode:**

```
Comprehensive Score = S1(Intention) × 0.20 + S2(Pact) × 0.45 + S3(Transaction Construction) × 0.35
```
```
S3 = tx_construction_correctness × 0.5 + recipe_adherence × 0.3 + tx_submission_success × 0.2
```

## Datasets

| Dataset | Number of Cases | Scenario Type | Description |
|---|---|---|---|
| caw-agent-eval-seth-v2 | 14 | transfer/swap/lend/dca/... | Default, Ethereum Sepolia testnet |
| caw-recipe-eval-seth-v1 | - | recipe | Recipe multi-step scenarios, Sepolia testnet |

Select a dataset with the `--dataset-name` option.

## Scripts

| Script | Purpose |
|---|---|
| | Claude Code evaluation orchestration (prepare/collect/upload/score/import-sessions) |
| | Openclaw evaluation orchestration (prepare/collect/upload/pack) |
| | Common utilities (Langfuse client/dataset/upload) |
| | LLM-as-Judge (prompt construction + API call) |
| | Structured assertions + threshold checks |
| | Scoring pipeline (assertions + judge → comprehensive score → Langfuse) |
| | Session → Langfuse trace |
| | Dataset generation |
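The formulas in the Scoring section boil down to two weighted sums. A minimal sketch of that arithmetic, with the weights taken from the formulas; the function and parameter names are illustrative, not those of the actual scoring script:

```python
# Illustrative re-implementation of the comprehensive-score arithmetic.
# Only the weights come from the documented formulas; all names are assumed.

def cc_score(task_completion: float, s1: float, s2: float, s3: float) -> float:
    """Claude Code mode: task_completion x 0.3 + process_quality x 0.7."""
    process_quality = s1 * 0.15 + s2 * 0.45 + s3 * 0.4
    return task_completion * 0.3 + process_quality * 0.7


def recipe_score(s1: float, s2: float, tx_correctness: float,
                 recipe_adherence: float, tx_submission: float) -> float:
    """Recipe mode: S1 x 0.20 + S2 x 0.45 + S3 x 0.35, with S3 decomposed."""
    s3 = tx_correctness * 0.5 + recipe_adherence * 0.3 + tx_submission * 0.2
    return s1 * 0.20 + s2 * 0.45 + s3 * 0.35
```

Since the weights in each mode sum to 1, perfect scores in every dimension yield a comprehensive score of 1.0, and process quality dominates task completion in Claude Code mode (0.7 vs. 0.3).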