
# OpenClaw QA Testing


Use this skill for `qa-lab` / `qa-channel` work. Repo-local QA only.

## Read first


- `docs/concepts/qa-e2e-automation.md`
- `docs/help/testing.md`
- `docs/channels/qa-channel.md`
- `qa/README.md`
- `qa/scenarios/index.md`
- `extensions/qa-lab/src/suite.ts`
- `extensions/qa-lab/src/character-eval.ts`

## Model policy


- Live OpenAI lane: `openai/gpt-5.4`
- Fast mode: on
- Do not use: `openai/gpt-5.4-pro`, `openai/gpt-5.4-mini`
- Only change model policy if the user explicitly asks.
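The "do not use" list can be checked mechanically before a run. A minimal sketch; the `check_model_policy` wrapper is illustrative and not part of the real CLI, only the model refs come from the policy above:

```shell
# Illustrative guard: reject the model refs this policy forbids.
# check_model_policy is a made-up helper, not an openclaw command.
check_model_policy() {
  case "$*" in
    *openai/gpt-5.4-pro*|*openai/gpt-5.4-mini*)
      echo "blocked: disallowed model ref in: $*" >&2
      return 1 ;;
  esac
  echo "model policy ok"
}

check_model_policy pnpm openclaw qa suite --model openai/gpt-5.4
# prints: model policy ok
```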

## Default workflow


1. Read the scenario pack and the current suite implementation.
2. Decide the lane:
   - mock/dev: `mock-openai`
   - real validation: `live-frontier`
3. For live OpenAI, use:

   ```bash
   OPENCLAW_LIVE_OPENAI_KEY="${OPENAI_API_KEY}" \
   pnpm openclaw qa suite \
     --provider-mode live-frontier \
     --model openai/gpt-5.4 \
     --alt-model openai/gpt-5.4 \
     --output-dir .artifacts/qa-e2e/run-all-live-frontier-<tag>
   ```

4. Watch the outputs:
   - summary: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json`
   - report: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-report.md`
5. If the user wants to watch the live UI, find the current `openclaw-qa` listen port and report `http://127.0.0.1:<port>`.
6. If a scenario fails, fix the root cause in the product or harness, then rerun the full lane.
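The summary file can gate follow-up automation. A sketch of a post-run check; the `failed` key is an assumption about the `qa-suite-summary.json` schema, so verify it against a real run first (a stand-in file is created here so the snippet is self-contained):

```shell
# Hypothetical post-run gate. The "failed" key is assumed, not confirmed
# against the real qa-suite-summary.json schema; a stand-in summary is
# written first so this sketch runs on its own.
dir=".artifacts/qa-e2e/run-all-live-frontier-demo"
mkdir -p "$dir"
printf '{"passed": 12, "failed": 0}\n' > "$dir/qa-suite-summary.json"

failed=$(sed -n 's/.*"failed":[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$dir/qa-suite-summary.json")
if [ "${failed:-1}" -eq 0 ]; then
  echo "lane green"
else
  echo "lane red: $failed scenario(s) failed" >&2
fi
# prints: lane green
```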

## Character evals


Use `qa character-eval` for style/persona/vibe checks across multiple live models.

```bash
pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=xhigh \
  --model openai/gpt-5.2,thinking=xhigh \
  --model openai/gpt-5,thinking=xhigh \
  --model anthropic/claude-opus-4-6,thinking=high \
  --model anthropic/claude-sonnet-4-6,thinking=high \
  --model zai/glm-5.1,thinking=high \
  --model moonshot/kimi-k2.5,thinking=high \
  --model google/gemini-3.1-pro-preview,thinking=high \
  --judge-model openai/gpt-5.4,thinking=xhigh,fast \
  --judge-model anthropic/claude-opus-4-6,thinking=high \
  --concurrency 16 \
  --judge-concurrency 16 \
  --output-dir .artifacts/qa-e2e/character-eval-<tag>
```
- Runs local QA gateway child processes, not Docker.
- The preferred model spec syntax is `provider/model,thinking=<level>[,fast|,no-fast|,fast=<bool>]` for both `--model` and `--judge-model`.
- Do not add new examples with a separate `--model-thinking`; keep that flag as legacy compatibility only.
- When no `--model` is passed, the candidates default to `openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`, `anthropic/claude-sonnet-4-6`, `zai/glm-5.1`, `moonshot/kimi-k2.5`, and `google/gemini-3.1-pro-preview`.
- Candidate thinking defaults to `high`, with `xhigh` for OpenAI models that support it. Prefer inline `--model provider/model,thinking=<level>`; `--thinking <level>` and `--model-thinking <provider/model=level>` remain compatibility shims.
- OpenAI candidate refs default to fast mode so priority processing is used where supported. Use inline `,fast`, `,no-fast`, or `,fast=false` for one model; use `--fast` only to force fast mode for every candidate.
- Judges default to `openai/gpt-5.4,thinking=xhigh,fast` and `anthropic/claude-opus-4-6,thinking=high`.
- The report includes judge ranking, run stats, durations, and full transcripts; do not include raw judge replies. Duration is benchmark context, not a grading signal.
- Candidate and judge concurrency both default to 16. Use `--concurrency <n>` and `--judge-concurrency <n>` to override when local gateways or provider limits need a gentler lane.
- Scenario sources should stay markdown-driven under `qa/scenarios/`.
- For isolated character/persona evals, write the persona into `SOUL.md` and blank `IDENTITY.md` in the scenario flow. Use `SOUL.md` + `IDENTITY.md` together only when intentionally testing how the normal OpenClaw identity combines with the character.
- Keep prompts natural and task-shaped. The candidate model should receive its character setup through `SOUL.md`, then normal user turns such as chat, workspace help, and small file tasks; do not ask "how would you react?" or tell the model it is in an eval.
- Prefer at least one real task, such as creating or editing a tiny workspace artifact, so the transcript captures character under normal tool use instead of pure roleplay.
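The isolated-persona setup above can be sketched as a seeding step. Only the `SOUL.md` / blank `IDENTITY.md` convention comes from this doc; the persona text and workspace path are invented for illustration:

```shell
# Seed an isolated character eval: persona in SOUL.md, IDENTITY.md blank.
# The persona wording and the temp workspace are made up for this sketch.
ws=$(mktemp -d)
cat > "$ws/SOUL.md" <<'EOF'
You are Ember, a terse, dry-witted workspace assistant.
Prefer doing over explaining; keep replies short.
EOF
: > "$ws/IDENTITY.md"   # blank on purpose: keep the stock identity out of the eval

echo "persona bytes: $(wc -c < "$ws/SOUL.md")"
```

From here the scenario flow would continue with normal user turns (chat, workspace help, a small file task) rather than asking the model how it would react.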

## Codex CLI model lane


Use model refs shaped like `codex-cli/<codex-model>` whenever QA should exercise Codex as a model backend.

Examples:

```bash
pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --model codex-cli/<codex-model> \
  --alt-model codex-cli/<codex-model> \
  --scenario <scenario-id> \
  --output-dir .artifacts/qa-e2e/codex-<tag>
```

```bash
pnpm openclaw qa manual \
  --model codex-cli/<codex-model> \
  --message "Reply exactly: CODEX_OK"
```
- Treat the concrete Codex model name as user/config input; do not hardcode it in source, docs examples, or scenarios.
- Live QA preserves `CODEX_HOME` so Codex CLI auth/config works while keeping `HOME` and `OPENCLAW_HOME` sandboxed.
- Mock QA should scrub `CODEX_HOME`.
- If Codex returns fallback/auth text every turn, first check `CODEX_HOME`, `~/.profile`, and the gateway child logs before changing scenario assertions.
- For model comparison, include `codex-cli/<codex-model>` as another candidate in `qa character-eval`; the report should label it as an opaque model name.

## Repo facts


- Seed scenarios live in `qa/`.
- Main live runner: `extensions/qa-lab/src/suite.ts`
- QA lab server: `extensions/qa-lab/src/lab-server.ts`
- Child gateway harness: `extensions/qa-lab/src/gateway-child.ts`
- Synthetic channel: `extensions/qa-channel/`

## What “done” looks like


- Full suite green for the requested lane.
- The user gets:
  - the watch URL, if applicable
  - pass/fail counts
  - artifact paths
  - a concise note on what was fixed

## Common failure patterns


- Live timeout too short: widen the live waits in `extensions/qa-lab/src/suite.ts`.
- Discovery cannot find repo files: point prompts at `repo/...` inside the seeded workspace.
- Subagent proof too brittle: prefer stable final-reply evidence over transient child-session listings.
- Harness “rebuild” delay: a dirty tree can trigger a pre-run build; expect that before ports appear.
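Because a pre-run rebuild can delay the listen port, polling beats a fixed sleep. A sketch; the probe uses `curl` and the port/timeout values are examples, not harness defaults:

```shell
# Poll until the openclaw-qa UI answers instead of assuming it is already up.
# A dirty-tree rebuild can delay the port; the timeout and 2s interval are
# arbitrary examples.
wait_for_port() {
  port=$1; deadline=$(( $(date +%s) + ${2:-120} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sf -o /dev/null "http://127.0.0.1:$port"; then
      echo "up: http://127.0.0.1:$port"
      return 0
    fi
    sleep 2
  done
  echo "timed out waiting on port $port" >&2
  return 1
}
```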

## When adding scenarios


- Add or update scenario markdown under `qa/scenarios/`.
- Keep kickoff expectations in `qa/scenarios/index.md` aligned.
- Add executable coverage in `extensions/qa-lab/src/suite.ts`.
- Prefer end-to-end assertions over mock-only checks.
- Save outputs under `.artifacts/qa-e2e/`.
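A new scenario starts as a markdown file in the right place. The skeleton below is purely illustrative; mirror the real format from the existing files under `qa/scenarios/` rather than this guess (the scenario name, section text, and temp directory are all invented):

```shell
# Purely illustrative: scaffold a scenario file in the expected location.
# Copy the real structure from existing qa/scenarios/ files; the content
# here is a placeholder, not the actual scenario schema.
dir=$(mktemp -d)/qa/scenarios
mkdir -p "$dir"
cat > "$dir/reply-smoke.md" <<'EOF'
# reply-smoke

User asks for a one-line summary of repo/README.md; the reply must mention
the project name and stay under three sentences.
EOF
ls "$dir"
# prints: reply-smoke.md
```

Pair the new file with an `index.md` entry and an executable check in `extensions/qa-lab/src/suite.ts`, per the list above.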