caw-eval

8 scripts

Evaluate the quality of the CAW (Cobo Agentic Wallet) Agent in local Claude Code, and generate scoring data and analysis reports. Use when the user wants to run a CAW evaluation, test the Skill, assess Agent quality, or generate evaluation reports, or says "run evaluation", "evaluate CAW", "eval", or "score". For weak model / openclaw evaluation, use caw-eval-openclaw instead (installed only on openclaw servers).


NPX Install

npx skill4agent add cobosteven/cobo-agent-wallet-manual caw-eval


SKILL.md Content (Chinese)


CAW Eval

End-to-end evaluation of CAW Agent quality, generating scoring data and analysis reports.

Step 0: Environment Identification (Mandatory)

Before executing any subsequent steps, first confirm the current environment:
bash
[[ "$(hostname)" == *openclaw* ]] && echo "env=openclaw" || echo "env=local"
  • env=openclaw: This SKILL (caw-eval) is the local CC version and cannot run on the openclaw server. Tell the user: "You are currently on the openclaw server, and the local CC evaluation SKILL is not applicable. Please use caw-eval-openclaw (trigger by saying 'weak model evaluation' or 'openclaw evaluation')." Then stop.
  • env=local: Proceed to the following process routing.
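For scripts that need the same check outside the shell, the hostname test above can be sketched in Python. This is only an illustration: `detect_env` is a hypothetical helper name, not something the Skill's scripts are known to define, and it mirrors the case-sensitive substring match of the bash one-liner.

```python
import socket


def detect_env(hostname: str) -> str:
    """Return "openclaw" if the hostname contains "openclaw", else "local".

    Hypothetical helper mirroring the bash check; the match is case-sensitive.
    """
    return "openclaw" if "openclaw" in hostname else "local"


if __name__ == "__main__":
    # Prints env=openclaw or env=local for the current machine.
    print(f"env={detect_env(socket.gethostname())}")
```

If the result is `openclaw`, the caller should stop and point the user to caw-eval-openclaw, as described above.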

Process Routing

Select the evaluation method based on the user's intent, then read the corresponding reference file and execute according to the steps:
| User Says | Evaluation Method | Read and Execute |
|---|---|---|
| "run evaluation", "evaluate CAW", "eval", "score", "claude code evaluation" | Claude Code Evaluation (default dataset: caw-agent-eval-seth-v2) | Read run-eval-cc.md and execute step by step |
| "recipe evaluation", "recipe eval" | Recipe Evaluation (transaction construction mode) | Read run-eval-recipe.md and execute step by step |
| "recipe comparison evaluation", "recipe comparison" | Recipe Three-Mode Comparison Evaluation | Read run-eval-recipe.md (run three times for comparison) |
| "weak model verification", "openclaw evaluation", "model compatibility" | Openclaw Weak Model Verification | Read run-eval-openclaw.md and execute step by step (SSH to the server to run the evaluation → download sessions → score and generate the report locally) |
Default to Claude Code Evaluation (if the user does not explicitly mention "weak model" or "openclaw").
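The routing rules can be approximated with simple keyword matching. The sketch below is an assumption-laden illustration, not the Skill's actual mechanism (in practice the agent reads the routing rules and decides): `Route` and `route` are hypothetical names, and the order of the checks is chosen so that "recipe eval" does not fall through to the generic "eval" default.

```python
from typing import NamedTuple


class Route(NamedTuple):
    method: str     # evaluation method name from the routing rules
    reference: str  # reference file to read and execute
    runs: int       # how many times to run it


def route(user_text: str) -> Route:
    """Map a user request to an evaluation method (hypothetical keyword matcher)."""
    text = user_text.lower()
    # Openclaw is chosen only when the user mentions it explicitly.
    if any(k in text for k in ("weak model", "openclaw", "model compatibility")):
        return Route("Openclaw Weak Model Verification", "run-eval-openclaw.md", 1)
    # Check the comparison phrase before the plain recipe phrase.
    if "recipe comparison" in text:
        return Route("Recipe Three-Mode Comparison Evaluation", "run-eval-recipe.md", 3)
    if "recipe" in text:
        return Route("Recipe Evaluation", "run-eval-recipe.md", 1)
    # Everything else defaults to the Claude Code evaluation.
    return Route("Claude Code Evaluation", "run-eval-cc.md", 1)
```

For example, `route("recipe comparison")` selects run-eval-recipe.md with three runs, while `route("score")` falls back to run-eval-cc.md.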

Overview

Claude Code Evaluation (Main Method)

Use Sonnet subagents in local Claude Code to execute and score the cases in parallel, then use an Opus subagent to generate the analysis report.
Check environment → Obtain case list → Sonnet subagent executes 14 cases in parallel
→ Collect sessions → Upload to Langfuse → LLM Judge scoring → Apply scores
→ Opus subagent generates report
  • Time: Approximately 40 minutes (14 cases, run 4 at a time in parallel)
  • Model Division:
    • Main Session / Step 1-8: Sonnet (orchestration and script scheduling, no Opus quota consumption)
    • Step 3 Evaluation Subagent: Sonnet (independent weekly quota)
    • Step 7 Judge Subagent: Sonnet (or call API directly to save CC quota)
    • Step 9 Report Subagent: Opus (in-depth analysis, isolate context to save tokens)
  • Detailed Steps: run-eval-cc.md
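The "14 cases, 4 in parallel" execution pattern can be sketched with a bounded worker pool. This is a generic illustration only: in the Skill the work is done by Sonnet subagents, not Python threads, and `run_cases` and `run_case` are hypothetical stand-ins for whatever dispatches a single case.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_cases(
    case_ids: List[str],
    run_case: Callable[[str], dict],
    max_workers: int = 4,
) -> Dict[str, dict]:
    """Run every case with at most `max_workers` in flight at once.

    `run_case` is a hypothetical callable that executes one case and
    returns its session data as a dict.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with case_ids.
        results = pool.map(run_case, case_ids)
        return dict(zip(case_ids, results))
```

Bounding the pool at 4 keeps the number of concurrent sessions predictable, at the cost of a longer wall-clock time than running all 14 at once.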

Openclaw Weak Model Verification

Run the tasks with weak models on the Openclaw server, and score them with local Claude Code. This uses a three-layer separated architecture.
Server: Script generates prompt → Weak model executes task → Script collects sessions → Package
  ↓ gcp scp
Local: Import sessions → LLM Judge scoring → Upload to Langfuse → Generate report
  • Applicable: Verify Skill compatibility with weak models before launch
  • Detailed Steps: run-eval-openclaw.md

Scoring System

Standard Mode

Comprehensive Score = task_completion × 0.3 + process_quality × 0.7
process_quality = S1(Intention) × 0.15 + S2(Pact) × 0.45 + S3(Execution) × 0.4
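As a worked example of the standard-mode formulas, the following sketch computes the comprehensive score from the four inputs. The function names are illustrative assumptions; only the weights come from the formulas above.

```python
def process_quality(s1: float, s2: float, s3: float) -> float:
    """S1 (Intention), S2 (Pact), S3 (Execution) weighted 0.15 / 0.45 / 0.4."""
    return 0.15 * s1 + 0.45 * s2 + 0.4 * s3


def comprehensive_score(task_completion: float, s1: float, s2: float, s3: float) -> float:
    """Standard mode: task_completion weighted 0.3, process_quality weighted 0.7."""
    return 0.3 * task_completion + 0.7 * process_quality(s1, s2, s3)
```

With task_completion = 1.0, S1 = 1.0, S2 = 0.8, and S3 = 0.5, process_quality is 0.15 + 0.36 + 0.20 = 0.71, and the comprehensive score is 0.3 + 0.7 × 0.71 = 0.797. Because process_quality carries 70% of the weight, a case that completes the task but handles the Pact stage poorly still scores noticeably below 1.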

Recipe Mode (Transaction Construction Evaluation)

Comprehensive Score = S1(Intention) × 0.20 + S2(Pact) × 0.45 + S3(Transaction Construction) × 0.35
S3 = tx_construction_correctness × 0.5 + recipe_adherence × 0.3 + tx_submission_success × 0.2
Recipe mode has no task_completion score: it evaluates only whether the transaction is correctly constructed and submitted, not the on-chain execution result.
All scores range from 0 to 1. For details, see scoring.md.
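The recipe-mode formulas can be sketched the same way. The function names below are illustrative assumptions, and the range check simply enforces the stated 0 to 1 bound on inputs.

```python
def recipe_s3(
    tx_construction_correctness: float,
    recipe_adherence: float,
    tx_submission_success: float,
) -> float:
    """S3 (Transaction Construction) sub-scores weighted 0.5 / 0.3 / 0.2."""
    return (
        0.5 * tx_construction_correctness
        + 0.3 * recipe_adherence
        + 0.2 * tx_submission_success
    )


def recipe_score(s1: float, s2: float, s3: float) -> float:
    """Recipe mode: S1, S2, S3 weighted 0.20 / 0.45 / 0.35, no task_completion term."""
    for value in (s1, s2, s3):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must be in the range 0 to 1")
    return 0.20 * s1 + 0.45 * s2 + 0.35 * s3
```

For instance, a correctly constructed and submitted transaction that only half follows the recipe gives S3 = 0.5 + 0.15 + 0.2 = 0.85; with S1 = S2 = 1.0 the comprehensive score is 0.20 + 0.45 + 0.35 × 0.85 = 0.9475.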

Datasets

| Dataset | Number of Cases | Scenario Type | Description |
|---|---|---|---|
| caw-agent-eval-seth-v2 | 14 | transfer/swap/lend/dca/... | Default, Ethereum Sepolia testnet |
| caw-recipe-eval-seth-v1 | - | recipe | Recipe multi-step scenarios, Sepolia testnet |
  • Default to caw-agent-eval-seth-v2; switch to caw-recipe-eval-seth-v1 when the user explicitly says "recipe evaluation"
  • Other datasets can be specified via --dataset-name
  • For existing datasets and creating new datasets, see dataset-management.md
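The dataset selection rules above amount to a small precedence order: an explicit name wins, then the recipe trigger, then the default. A minimal sketch, assuming a hypothetical helper `resolve_dataset` whose `dataset_name` argument stands in for the value passed via --dataset-name:

```python
from typing import Optional

DEFAULT_DATASET = "caw-agent-eval-seth-v2"
RECIPE_DATASET = "caw-recipe-eval-seth-v1"


def resolve_dataset(user_text: str, dataset_name: Optional[str] = None) -> str:
    """Pick the dataset: explicit override, then recipe trigger, then default."""
    if dataset_name:
        # An explicitly specified dataset always takes precedence.
        return dataset_name
    # Assumption: match "recipe eval" so both "recipe eval" and
    # "recipe evaluation" switch to the recipe dataset.
    if "recipe eval" in user_text.lower():
        return RECIPE_DATASET
    return DEFAULT_DATASET
```

Treating "recipe eval" as equivalent to "recipe evaluation" is an assumption; a stricter reading of the rule would match only the full phrase.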

Server Environment Setup

When setting up a new openclaw server (installing openclaw, caw, and langfuse, wallet onboarding, and topping up the wallet), see server-setup.md.

Scripts

| Script | Purpose |
|---|---|
| run_eval_cc.py | Claude Code evaluation orchestration (prepare/collect/upload/score/import-sessions) |
| run_eval_openclaw.py | Openclaw evaluation orchestration (prepare/collect/upload/pack) |
| eval_utils.py | Common utilities (Langfuse client/dataset/upload) |
| judge_cc.py | LLM-as-Judge (prompt construction + API call) |
| assertions.py | Structured assertions + threshold checks |
| score_traces.py | Scoring pipeline (assertions + judge → comprehensive score → Langfuse) |
| upload_session.py | Session → Langfuse trace |
| generate_dataset.py | Dataset generation |