
# OpenClaw QA Testing


Use this skill for `qa-lab` / `qa-channel` work. Repo-local QA only.

## Read first


- `docs/concepts/qa-e2e-automation.md`
- `docs/help/testing.md`
- `docs/channels/qa-channel.md`
- `qa/README.md`
- `qa/scenarios/index.md`
- `extensions/qa-lab/src/suite.ts`
- `extensions/qa-lab/src/character-eval.ts`

## Model policy


- Live OpenAI lane: `openai/gpt-5.4`
- Fast mode: on
- Do not use: `openai/gpt-5.4-pro`, `openai/gpt-5.4-mini`
- Only change model policy if the user explicitly asks.
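The "do not use" list can be checked mechanically before a run. A minimal sketch; the `check_model_policy` wrapper is illustrative and not part of the real CLI, only the model refs come from the policy above:

```shell
# Illustrative guard: reject the model refs this policy forbids.
# check_model_policy is a made-up helper, not an openclaw command.
check_model_policy() {
  case "$*" in
    *openai/gpt-5.4-pro*|*openai/gpt-5.4-mini*)
      echo "blocked: disallowed model ref in: $*" >&2
      return 1 ;;
  esac
  echo "model policy ok"
}

check_model_policy pnpm openclaw qa suite --model openai/gpt-5.4
# prints: model policy ok
```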

## Default workflow


1. Read the scenario pack and the current suite implementation.
2. Decide the lane:
   - mock/dev: `mock-openai`
   - real validation: `live-frontier`
3. For live OpenAI, use:

   ```bash
   OPENCLAW_LIVE_OPENAI_KEY="${OPENAI_API_KEY}" \
   pnpm openclaw qa suite \
     --provider-mode live-frontier \
     --model openai/gpt-5.4 \
     --alt-model openai/gpt-5.4 \
     --output-dir .artifacts/qa-e2e/run-all-live-frontier-<tag>
   ```

4. Watch the outputs:
   - summary: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json`
   - report: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-report.md`
5. If the user wants to watch the live UI, find the current `openclaw-qa` listen port and report `http://127.0.0.1:<port>`.
6. If a scenario fails, fix the root cause in the product or harness, then rerun the full lane.
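The summary file can gate follow-up automation. A sketch of a post-run check; the `failed` key is an assumption about the `qa-suite-summary.json` schema, so verify it against a real run first (a stand-in file is created here so the snippet is self-contained):

```shell
# Hypothetical post-run gate. The "failed" key is assumed, not confirmed
# against the real qa-suite-summary.json schema; a stand-in summary is
# written first so this sketch runs on its own.
dir=".artifacts/qa-e2e/run-all-live-frontier-demo"
mkdir -p "$dir"
printf '{"passed": 12, "failed": 0}\n' > "$dir/qa-suite-summary.json"

failed=$(sed -n 's/.*"failed":[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$dir/qa-suite-summary.json")
if [ "${failed:-1}" -eq 0 ]; then
  echo "lane green"
else
  echo "lane red: $failed scenario(s) failed" >&2
fi
# prints: lane green
```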

## Character evals


Use `qa character-eval` for style/persona/vibe checks across multiple live models.

```bash
pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=xhigh \
  --model openai/gpt-5.2,thinking=xhigh \
  --model openai/gpt-5,thinking=xhigh \
  --model anthropic/claude-opus-4-6,thinking=high \
  --model anthropic/claude-sonnet-4-6,thinking=high \
  --model zai/glm-5.1,thinking=high \
  --model moonshot/kimi-k2.5,thinking=high \
  --model google/gemini-3.1-pro-preview,thinking=high \
  --judge-model openai/gpt-5.4,thinking=xhigh,fast \
  --judge-model anthropic/claude-opus-4-6,thinking=high \
  --concurrency 16 \
  --judge-concurrency 16 \
  --output-dir .artifacts/qa-e2e/character-eval-<tag>
```
- Runs local QA gateway child processes, not Docker.
- The preferred model spec syntax is `provider/model,thinking=<level>[,fast|,no-fast|,fast=<bool>]` for both `--model` and `--judge-model`.
- Do not add new examples with a separate `--model-thinking`; keep that flag as legacy compatibility only.
- When no `--model` is passed, the candidates default to `openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`, `anthropic/claude-sonnet-4-6`, `zai/glm-5.1`, `moonshot/kimi-k2.5`, and `google/gemini-3.1-pro-preview`.
- Candidate thinking defaults to `high`, with `xhigh` for OpenAI models that support it. Prefer inline `--model provider/model,thinking=<level>`; `--thinking <level>` and `--model-thinking <provider/model=level>` remain compatibility shims.
- OpenAI candidate refs default to fast mode so priority processing is used where supported. Use inline `,fast`, `,no-fast`, or `,fast=false` for one model; use `--fast` only to force fast mode for every candidate.
- Judges default to `openai/gpt-5.4,thinking=xhigh,fast` and `anthropic/claude-opus-4-6,thinking=high`.
- The report includes judge ranking, run stats, durations, and full transcripts; do not include raw judge replies. Duration is benchmark context, not a grading signal.
- Candidate and judge concurrency both default to 16. Use `--concurrency <n>` and `--judge-concurrency <n>` to override when local gateways or provider limits need a gentler lane.
- Scenario sources should stay markdown-driven under `qa/scenarios/`.
- For isolated character/persona evals, write the persona into `SOUL.md` and blank `IDENTITY.md` in the scenario flow. Use `SOUL.md` + `IDENTITY.md` together only when intentionally testing how the normal OpenClaw identity combines with the character.
- Keep prompts natural and task-shaped. The candidate model should receive its character setup through `SOUL.md`, then normal user turns such as chat, workspace help, and small file tasks; do not ask "how would you react?" or tell the model it is in an eval.
- Prefer at least one real task, such as creating or editing a tiny workspace artifact, so the transcript captures character under normal tool use instead of pure roleplay.
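The isolated-persona setup above can be sketched as a seeding step. Only the `SOUL.md` / blank `IDENTITY.md` convention comes from this doc; the persona text and workspace path are invented for illustration:

```shell
# Seed an isolated character eval: persona in SOUL.md, IDENTITY.md blank.
# The persona wording and the temp workspace are made up for this sketch.
ws=$(mktemp -d)
cat > "$ws/SOUL.md" <<'EOF'
You are Ember, a terse, dry-witted workspace assistant.
Prefer doing over explaining; keep replies short.
EOF
: > "$ws/IDENTITY.md"   # blank on purpose: keep the stock identity out of the eval

echo "persona bytes: $(wc -c < "$ws/SOUL.md")"
```

From here the scenario flow would continue with normal user turns (chat, workspace help, a small file task) rather than asking the model how it would react.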

## Codex CLI model lane


Use model refs shaped like `codex-cli/<codex-model>` whenever QA should exercise Codex as a model backend.

Examples:

```bash
pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --model codex-cli/<codex-model> \
  --alt-model codex-cli/<codex-model> \
  --scenario <scenario-id> \
  --output-dir .artifacts/qa-e2e/codex-<tag>
```

```bash
pnpm openclaw qa manual \
  --model codex-cli/<codex-model> \
  --message "Reply exactly: CODEX_OK"
```
- Treat the concrete Codex model name as user/config input; do not hardcode it in source, docs examples, or scenarios.
- Live QA preserves `CODEX_HOME` so Codex CLI auth/config works while keeping `HOME` and `OPENCLAW_HOME` sandboxed.
- Mock QA should scrub `CODEX_HOME`.
- If Codex returns fallback/auth text every turn, first check `CODEX_HOME`, `~/.profile`, and the gateway child logs before changing scenario assertions.
- For model comparison, include `codex-cli/<codex-model>` as another candidate in `qa character-eval`; the report should label it as an opaque model name.

## Repo facts


- Seed scenarios live in `qa/`.
- Main live runner: `extensions/qa-lab/src/suite.ts`
- QA lab server: `extensions/qa-lab/src/lab-server.ts`
- Child gateway harness: `extensions/qa-lab/src/gateway-child.ts`
- Synthetic channel: `extensions/qa-channel/`

## What “done” looks like


- Full suite green for the requested lane.
- The user gets:
  - the watch URL, if applicable
  - pass/fail counts
  - artifact paths
  - a concise note on what was fixed

## Common failure patterns


- Live timeout too short: widen the live waits in `extensions/qa-lab/src/suite.ts`.
- Discovery cannot find repo files: point prompts at `repo/...` inside the seeded workspace.
- Subagent proof too brittle: prefer stable final-reply evidence over transient child-session listings.
- Harness “rebuild” delay: a dirty tree can trigger a pre-run build; expect that before ports appear.
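Because a pre-run rebuild can delay the listen port, polling beats a fixed sleep. A sketch; the probe uses `curl` and the port/timeout values are examples, not harness defaults:

```shell
# Poll until the openclaw-qa UI answers instead of assuming it is already up.
# A dirty-tree rebuild can delay the port; the timeout and 2s interval are
# arbitrary examples.
wait_for_port() {
  port=$1; deadline=$(( $(date +%s) + ${2:-120} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sf -o /dev/null "http://127.0.0.1:$port"; then
      echo "up: http://127.0.0.1:$port"
      return 0
    fi
    sleep 2
  done
  echo "timed out waiting on port $port" >&2
  return 1
}
```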

## When adding scenarios


- Add or update scenario markdown under `qa/scenarios/`.
- Keep kickoff expectations in `qa/scenarios/index.md` aligned.
- Add executable coverage in `extensions/qa-lab/src/suite.ts`.
- Prefer end-to-end assertions over mock-only checks.
- Save outputs under `.artifacts/qa-e2e/`.
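A new scenario starts as a markdown file in the right place. The skeleton below is purely illustrative; mirror the real format from the existing files under `qa/scenarios/` rather than this guess (the scenario name, section text, and temp directory are all invented):

```shell
# Purely illustrative: scaffold a scenario file in the expected location.
# Copy the real structure from existing qa/scenarios/ files; the content
# here is a placeholder, not the actual scenario schema.
dir=$(mktemp -d)/qa/scenarios
mkdir -p "$dir"
cat > "$dir/reply-smoke.md" <<'EOF'
# reply-smoke

User asks for a one-line summary of repo/README.md; the reply must mention
the project name and stay under three sentences.
EOF
ls "$dir"
# prints: reply-smoke.md
```

Pair the new file with an `index.md` entry and an executable check in `extensions/qa-lab/src/suite.ts`, per the list above.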