Evaluate the quality of the CAW (Cobo Agentic Wallet) Agent in local Claude Code, and generate scoring data and analysis reports. Use when users want to run a CAW evaluation, conduct an evaluation, test the Skill, assess Agent quality, or generate evaluation reports, or when they say "run evaluation", "evaluate CAW", "eval", or "score". For weak-model / openclaw evaluation, use caw-eval-openclaw instead (only installed on openclaw servers).
## Installation

```shell
npx skill4agent add cobosteven/cobo-agent-wallet-manual caw-eval
```

## Environment Check

```shell
[[ "$(hostname)" == *openclaw* ]] && echo "env=openclaw" || echo "env=local"
```

If the output is `env=openclaw`, say: "You are currently on the openclaw server, and the local CC evaluation SKILL is not applicable. Please use caw-eval-openclaw (trigger by saying 'weak model evaluation' or 'openclaw evaluation')." Then stop.

If the output is `env=local`, pick the evaluation method from the table below:

## Evaluation Modes

| User Says | Evaluation Method | Read and Execute |
|---|---|---|
| "run evaluation", "evaluate CAW", "eval", "score", "claude code evaluation" | Claude Code Evaluation (default dataset: caw-agent-eval-seth-v2) | → Read run-eval-cc.md and execute step by step |
| "recipe evaluation", "recipe eval" | Recipe Evaluation (transaction construction mode) | → Read run-eval-recipe.md and execute step by step |
| "recipe comparison evaluation", "recipe comparison" | Recipe Three-Mode Comparison Evaluation | → Read run-eval-recipe.md (run three times for comparison) |
| "weak model verification", "openclaw evaluation", "model compatibility" | Openclaw Weak Model Verification | → Read run-eval-openclaw.md and execute step by step (SSH to the server to run evaluation → download session → local scoring and report generation) |
## Evaluation Flow

**Claude Code evaluation:**

```
Check environment → Obtain case list → Sonnet subagent executes 14 cases in parallel
→ Collect sessions → Upload to Langfuse → LLM Judge scoring → Apply scores
→ Opus subagent generates report
```

**Openclaw weak-model evaluation:**

```
Server: Script generates prompt → Weak model executes task → Script collects sessions → Package
        ↓ gcp scp
Local:  Import sessions → LLM Judge scoring → Upload to Langfuse → Generate report
```

## Scoring

**Claude Code mode:**

```
Comprehensive Score = task_completion × 0.3 + process_quality × 0.7
process_quality = S1(Intention) × 0.15 + S2(Pact) × 0.45 + S3(Execution) × 0.4
```

**Recipe mode:**

```
Comprehensive Score = S1(Intention) × 0.20 + S2(Pact) × 0.45 + S3(Transaction Construction) × 0.35
```
```
S3 = tx_construction_correctness × 0.5 + recipe_adherence × 0.3 + tx_submission_success × 0.2
```

## Datasets

| Dataset | Number of Cases | Scenario Type | Description |
|---|---|---|---|
| caw-agent-eval-seth-v2 | 14 | transfer/swap/lend/dca/... | Default, Ethereum Sepolia testnet |
| caw-recipe-eval-seth-v1 | - | recipe | Recipe multi-step scenarios, Sepolia testnet |

Select a dataset with the `--dataset-name` option.

## Scripts

| Script | Purpose |
|---|---|
| | Claude Code evaluation orchestration (prepare/collect/upload/score/import-sessions) |
| | Openclaw evaluation orchestration (prepare/collect/upload/pack) |
| | Common utilities (Langfuse client/dataset/upload) |
| | LLM-as-Judge (prompt construction + API call) |
| | Structured assertions + threshold checks |
| | Scoring pipeline (assertions + judge → comprehensive score → Langfuse) |
| | Session → Langfuse trace |
| | Dataset generation |
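The formulas in the Scoring section boil down to two weighted sums. A minimal sketch of that arithmetic, with the weights taken from the formulas; the function and parameter names are illustrative, not those of the actual scoring script:

```python
# Illustrative re-implementation of the comprehensive-score arithmetic.
# Only the weights come from the documented formulas; all names are assumed.

def cc_score(task_completion: float, s1: float, s2: float, s3: float) -> float:
    """Claude Code mode: task_completion x 0.3 + process_quality x 0.7."""
    process_quality = s1 * 0.15 + s2 * 0.45 + s3 * 0.4
    return task_completion * 0.3 + process_quality * 0.7


def recipe_score(s1: float, s2: float, tx_correctness: float,
                 recipe_adherence: float, tx_submission: float) -> float:
    """Recipe mode: S1 x 0.20 + S2 x 0.45 + S3 x 0.35, with S3 decomposed."""
    s3 = tx_correctness * 0.5 + recipe_adherence * 0.3 + tx_submission * 0.2
    return s1 * 0.20 + s2 * 0.45 + s3 * 0.35
```

Since the weights in each mode sum to 1, perfect scores in every dimension yield a comprehensive score of 1.0, and process quality dominates task completion in Claude Code mode (0.7 vs. 0.3).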