caw-eval
Evaluate the quality of the CAW (Cobo Agentic Wallet) Agent in local Claude Code, and generate scoring data and analysis reports. Use when the user wants to run a CAW evaluation, test the Skill, assess Agent quality, or generate evaluation reports, or says "run evaluation", "evaluate CAW", "eval", or "score". For weak model / openclaw evaluation, use caw-eval-openclaw (installed only on openclaw servers).
NPX Install
npx skill4agent add cobosteven/cobo-agent-wallet-manual caw-eval
CAW Eval
End-to-end evaluation of CAW Agent quality, generating scoring data and analysis reports.
Step 0: Environment Identification (Mandatory)
Before executing any subsequent steps, first confirm the current environment:
bash
[[ "$(hostname)" == *openclaw* ]] && echo "env=openclaw" || echo "env=local"
- `env=openclaw`: This SKILL (caw-eval) is the local CC version and cannot run on the openclaw server. Tell the user: "You are currently on the openclaw server, and the local CC evaluation SKILL is not applicable. Please use caw-eval-openclaw (trigger by saying 'weak model evaluation' or 'openclaw evaluation')." Then stop.
- `env=local`: Proceed to the following process routing.
Process Routing
Select the evaluation method based on the user's intent, then read the corresponding reference file and execute according to the steps:
| User Says | Evaluation Method | Read and Execute |
|---|---|---|
| "run evaluation", "evaluate CAW", "eval", "score", "claude code evaluation" | Claude Code Evaluation (default dataset: `caw-agent-eval-seth-v2`) | → Read run-eval-cc.md and execute step by step |
| "recipe evaluation", "recipe eval" | Recipe Evaluation (transaction construction mode) | → Read run-eval-recipe.md and execute step by step |
| "recipe comparison evaluation", "recipe comparison" | Recipe Three-Mode Comparison Evaluation | → Read run-eval-recipe.md (run three times for comparison) |
| "weak model verification", "openclaw evaluation", "model compatibility" | Openclaw Weak Model Verification | → Read run-eval-openclaw.md and execute step by step (SSH to the server to run evaluation → download session → local scoring and report generation) |
Default to Claude Code Evaluation (if the user does not explicitly mention "weak model" or "openclaw").
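The routing table can be read as a first-match lookup with a default. The sketch below assumes simple case-insensitive substring matching on the user's phrase; `route_eval` is a hypothetical helper name, not one of the skill's scripts, and only the reference file names are taken from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a user phrase to the reference file from the routing table.
route_eval() {
  # Lowercase the phrase so matching is case-insensitive.
  local phrase
  phrase="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$phrase" in
    *"weak model"*|*openclaw*|*"model compatibility"*)
      echo "run-eval-openclaw.md" ;;
    *recipe*)
      # Covers both "recipe evaluation" and "recipe comparison".
      echo "run-eval-recipe.md" ;;
    *)
      # Default: Claude Code evaluation.
      echo "run-eval-cc.md" ;;
  esac
}

route_eval "run evaluation"       # prints run-eval-cc.md
route_eval "recipe comparison"    # prints run-eval-recipe.md
route_eval "Openclaw evaluation"  # prints run-eval-openclaw.md
```

Checking the openclaw patterns first keeps the default branch aligned with the rule that Claude Code Evaluation applies unless "weak model" or "openclaw" is mentioned explicitly.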
Overview
Claude Code Evaluation (Main Method)
Use Sonnet subagents in local Claude Code for parallel execution and scoring, then use an Opus subagent to generate the analysis report.
Check environment → Obtain case list → Sonnet subagent executes 14 cases in parallel
→ Collect sessions → Upload to Langfuse → LLM Judge scoring → Apply scores
→ Opus subagent generates report
- Time: Approximately 40 minutes (14 cases in parallel with 4 threads)
- Model Division:
  - Main Session / Steps 1-8: Sonnet (orchestration and script scheduling, no Opus quota consumption)
  - Step 3 Evaluation Subagent: Sonnet (independent weekly quota)
  - Step 7 Judge Subagent: Sonnet (or call the API directly to save CC quota)
  - Step 9 Report Subagent: Opus (in-depth analysis, isolated context to save tokens)
- Detailed Steps: run-eval-cc.md
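The "14 cases in parallel with 4 threads" pattern can be illustrated with a bounded worker pool. This is only a sketch of the concurrency shape: in the actual flow each case is handled by a Sonnet subagent inside Claude Code, whereas here `run_cases_parallel` is a hypothetical function and the per-case work is replaced by an `echo` placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run one placeholder job per case, at most $1 at a time.
run_cases_parallel() {
  local workers="$1"
  shift
  # One case ID per line; xargs keeps up to $workers jobs running concurrently.
  printf '%s\n' "$@" |
    xargs -P "$workers" -I{} sh -c 'echo "case $1 done"' _ {}
}

# 14 cases, 4 concurrent workers; output order is not deterministic.
run_cases_parallel 4 $(seq 1 14)
```

Capping the pool at 4 rather than launching all 14 at once is what keeps the run within the roughly 40-minute estimate without exhausting quota in a single burst.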
Openclaw Weak Model Verification
Run the tasks with weak models on the openclaw server and score them with local Claude Code, using a three-layer separated architecture.
Server: Script generates prompt → Weak model executes task → Script collects sessions → Package
↓ gcp scp
Local: Import sessions → LLM Judge scoring → Upload to Langfuse → Generate report
- Applicable: Verify Skill compatibility with weak models before launch
- Detailed Steps: run-eval-openclaw.md
Scoring System
Standard Mode
Comprehensive Score = task_completion × 0.3 + process_quality × 0.7
process_quality = S1(Intention) × 0.15 + S2(Pact) × 0.45 + S3(Execution) × 0.4
Recipe Mode (Transaction Construction Evaluation)
Comprehensive Score = S1(Intention) × 0.20 + S2(Pact) × 0.45 + S3(Transaction Construction) × 0.35
S3 = tx_construction_correctness × 0.5 + recipe_adherence × 0.3 + tx_submission_success × 0.2
No Task Completion score. Only evaluate whether the transaction is correctly constructed and submitted, not the on-chain execution result.
All scores range from 0 to 1. For details, see scoring.md.
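As a worked illustration of the two formulas, the sketch below computes both comprehensive scores with `awk` (bash itself has no floating-point arithmetic). The function names `standard_score` and `recipe_score` and the argument order are assumptions made for this example; the weights are the ones stated above, and the skill's real scoring pipeline may be structured differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; all inputs are expected to be in the range 0..1.

# Args: task_completion S1 S2 S3
standard_score() {
  awk -v tc="$1" -v s1="$2" -v s2="$3" -v s3="$4" 'BEGIN {
    pq = s1 * 0.15 + s2 * 0.45 + s3 * 0.4      # process_quality
    printf "%.4f\n", tc * 0.3 + pq * 0.7
  }'
}

# Args: S1 S2 tx_construction_correctness recipe_adherence tx_submission_success
recipe_score() {
  awk -v s1="$1" -v s2="$2" -v tcc="$3" -v adh="$4" -v ok="$5" 'BEGIN {
    s3 = tcc * 0.5 + adh * 0.3 + ok * 0.2      # transaction construction
    printf "%.4f\n", s1 * 0.20 + s2 * 0.45 + s3 * 0.35
  }'
}

standard_score 1 0.8 0.6 0.5   # prints 0.7130
recipe_score 0.8 0.6 1 0.5 0   # prints 0.6575
```

In the first call, process_quality is 0.12 + 0.27 + 0.20 = 0.59, so the result is 0.3 + 0.59 × 0.7 = 0.713. In the second, S3 is 0.5 + 0.15 + 0 = 0.65, giving 0.16 + 0.27 + 0.2275 = 0.6575.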
Datasets
| Dataset | Number of Cases | Scenario Type | Description |
|---|---|---|---|
| `caw-agent-eval-seth-v2` | 14 | transfer/swap/lend/dca/... | Default, Ethereum Sepolia testnet |
| `caw-recipe-eval-seth-v1` | - | recipe | Recipe multi-step scenarios, Sepolia testnet |
- Default to `caw-agent-eval-seth-v2`; switch to `caw-recipe-eval-seth-v1` when the user explicitly says "recipe evaluation"
- Other datasets can be specified via `--dataset-name`
- For existing datasets and creating new datasets, see dataset-management.md
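The dataset selection rules amount to a three-level precedence: an explicit `--dataset-name` value wins, then a recipe request, then the default. A minimal sketch follows, assuming substring matching on the phrase; `select_dataset` is a hypothetical helper, and how the real scripts parse `--dataset-name` is not shown in this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper.
# Args: user phrase, optional explicit dataset (the value given to --dataset-name).
select_dataset() {
  local phrase="$1" override="${2:-}"
  if [ -n "$override" ]; then
    # An explicitly named dataset always takes precedence.
    echo "$override"
  elif printf '%s' "$phrase" | grep -qi 'recipe eval'; then
    # Matches "recipe evaluation" and "recipe eval".
    echo "caw-recipe-eval-seth-v1"
  else
    echo "caw-agent-eval-seth-v2"
  fi
}

select_dataset "run evaluation"              # prints caw-agent-eval-seth-v2
select_dataset "recipe evaluation"           # prints caw-recipe-eval-seth-v1
select_dataset "run evaluation" "my-dataset" # prints my-dataset
```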
Server Environment Setup
When setting up a new openclaw server (installing openclaw, caw, and langfuse, then wallet onboarding and funding):
→ server-setup.md
Scripts
| Script | Purpose |
|---|---|
|  | Claude Code evaluation orchestration (prepare/collect/upload/score/import-sessions) |
|  | Openclaw evaluation orchestration (prepare/collect/upload/pack) |
|  | Common utilities (Langfuse client/dataset/upload) |
|  | LLM-as-Judge (prompt construction + API call) |
|  | Structured assertions + threshold checks |
|  | Scoring pipeline (assertions + judge → comprehensive score → Langfuse) |
|  | Session → Langfuse trace |
|  | Dataset generation |