openclaw-qa-testing
OpenClaw QA Testing
Use this skill for `qa-lab` / `qa-channel` work. Repo-local QA only.

Read first

- `docs/concepts/qa-e2e-automation.md`
- `docs/help/testing.md`
- `docs/channels/qa-channel.md`
- `qa/README.md`
- `qa/scenarios/index.md`
- `extensions/qa-lab/src/suite.ts`
- `extensions/qa-lab/src/character-eval.ts`

Model policy
- Live OpenAI lane: `openai/gpt-5.4`
- Fast mode: on
- Do not use: `openai/gpt-5.4-pro`, `openai/gpt-5.4-mini`
- Only change model policy if the user explicitly asks.

Default workflow
- Read the scenario pack and current suite implementation.
- Decide lane:
  - mock/dev: `mock-openai`
  - real validation: `live-frontier`
- For live OpenAI, use:

```bash
OPENCLAW_LIVE_OPENAI_KEY="${OPENAI_API_KEY}" \
pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --model openai/gpt-5.4 \
  --alt-model openai/gpt-5.4 \
  --output-dir .artifacts/qa-e2e/run-all-live-frontier-<tag>
```

- Watch outputs:
  - summary: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json`
  - report: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-report.md`
- If the user wants to watch the live UI, find the current `openclaw-qa` listen port and report `http://127.0.0.1:<port>`.
- If a scenario fails, fix the product or harness root cause, then rerun the full lane.
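
After a run, the pass/fail counts can be pulled straight out of the summary JSON. A minimal sketch: the field names `passed` and `failed` are assumptions rather than the documented schema, so check a real `qa-suite-summary.json` first. The demo runs against a stand-in file so it works anywhere:

```bash
# Stand-in for .artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json;
# substitute the real path. The "passed"/"failed" keys are assumed, not documented.
SUMMARY="$(mktemp)"
printf '{"passed": 12, "failed": 1}' > "$SUMMARY"

passed=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["passed"])' "$SUMMARY")
failed=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["failed"])' "$SUMMARY")
echo "passed=$passed failed=$failed"
```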

Character evals
Use `qa character-eval` for style/persona/vibe checks across multiple live models.

```bash
pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=xhigh \
  --model openai/gpt-5.2,thinking=xhigh \
  --model openai/gpt-5,thinking=xhigh \
  --model anthropic/claude-opus-4-6,thinking=high \
  --model anthropic/claude-sonnet-4-6,thinking=high \
  --model zai/glm-5.1,thinking=high \
  --model moonshot/kimi-k2.5,thinking=high \
  --model google/gemini-3.1-pro-preview,thinking=high \
  --judge-model openai/gpt-5.4,thinking=xhigh,fast \
  --judge-model anthropic/claude-opus-4-6,thinking=high \
  --concurrency 16 \
  --judge-concurrency 16 \
  --output-dir .artifacts/qa-e2e/character-eval-<tag>
```

- Runs local QA gateway child processes, not Docker.
- The preferred model spec syntax for both `--model` and `--judge-model` is `provider/model,thinking=<level>[,fast|,no-fast|,fast=<bool>]`.
- Do not add new examples with a separate `--model-thinking`; keep that flag as legacy compatibility only.
- Candidate models default to `openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`, `anthropic/claude-sonnet-4-6`, `zai/glm-5.1`, `moonshot/kimi-k2.5`, and `google/gemini-3.1-pro-preview` when no `--model` is passed.
- Candidate thinking defaults to `high`, with `xhigh` for OpenAI models that support it. Prefer inline `--model provider/model,thinking=<level>`; `--thinking <level>` and `--model-thinking <provider/model=level>` remain compatibility shims.
- OpenAI candidate refs default to fast mode so priority processing is used where supported. Use inline `,fast`, `,no-fast`, or `,fast=false` for one model; use `--fast` only to force fast mode for every candidate.
- Judges default to `openai/gpt-5.4,thinking=xhigh,fast` and `anthropic/claude-opus-4-6,thinking=high`.
- The report includes judge ranking, run stats, durations, and full transcripts; do not include raw judge replies. Duration is benchmark context, not a grading signal.
- Candidate and judge concurrency default to 16. Use `--concurrency <n>` and `--judge-concurrency <n>` to override when local gateways or provider limits need a gentler lane.
- Scenario source should stay markdown-driven under `qa/scenarios/`.
- For isolated character/persona evals, write the persona into `SOUL.md` and blank `IDENTITY.md` in the scenario flow. Use `SOUL.md + IDENTITY.md` only when intentionally testing how the normal OpenClaw identity combines with the character.
- Keep prompts natural and task-shaped. The candidate model should receive character setup through `SOUL.md`, then normal user turns such as chat, workspace help, and small file tasks; do not ask "how would you react?" or tell the model it is in an eval.
- Prefer at least one real task, such as creating or editing a tiny workspace artifact, so the transcript captures character under normal tool use instead of pure roleplay.

Codex CLI model lane
Use model refs shaped like `codex-cli/<codex-model>` whenever QA should exercise Codex as a model backend. Examples:

```bash
pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --model codex-cli/<codex-model> \
  --alt-model codex-cli/<codex-model> \
  --scenario <scenario-id> \
  --output-dir .artifacts/qa-e2e/codex-<tag>
```

```bash
pnpm openclaw qa manual \
  --model codex-cli/<codex-model> \
  --message "Reply exactly: CODEX_OK"
```

- Treat the concrete Codex model name as user/config input; do not hardcode it in source, docs examples, or scenarios.
- Live QA preserves `CODEX_HOME` and `HOME` so Codex CLI auth/config works while keeping `OPENCLAW_HOME` sandboxed.
- Mock QA should scrub `CODEX_HOME`.
- If Codex returns fallback/auth text every turn, first check `CODEX_HOME`, `~/.profile`, and gateway child logs before changing scenario assertions.
- For model comparison, include `codex-cli/<codex-model>` as another candidate in `qa character-eval`; the report should label it as an opaque model name.

Repo facts
- Seed scenarios live in `qa/`.
- Main live runner: `extensions/qa-lab/src/suite.ts`
- QA lab server: `extensions/qa-lab/src/lab-server.ts`
- Child gateway harness: `extensions/qa-lab/src/gateway-child.ts`
- Synthetic channel: `extensions/qa-channel/`

What “done” looks like
- Full suite green for the requested lane.
- User gets:
  - watch URL if applicable
  - pass/fail counts
  - artifact paths
  - concise note on what was fixed

Common failure patterns
- Live timeout too short:
  - widen live waits in `extensions/qa-lab/src/suite.ts`
- Discovery cannot find repo files:
  - point prompts at `repo/...` inside seeded workspace
- Subagent proof too brittle:
  - prefer stable final reply evidence over transient child-session listing
- Harness “rebuild” delay:
  - dirty tree can trigger a pre-run build; expect that before ports appear

When adding scenarios
- Add or update scenario markdown under `qa/scenarios/`
- Keep kickoff expectations in `qa/scenarios/index.md` aligned
- Add executable coverage in `extensions/qa-lab/src/suite.ts`
- Prefer end-to-end assertions over mock-only checks
- Save outputs under `.artifacts/qa-e2e/`
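
The checklist above can start from a scaffold like this. Everything about the file body below is hypothetical; mirror an existing file under `qa/scenarios/` for the actual expected structure:

```bash
# Hypothetical scaffold; the section names inside the file are illustrative only.
SCENARIO_ID="demo-scenario"
DEST="qa/scenarios"
mkdir -p "$DEST"
cat > "${DEST}/${SCENARIO_ID}.md" <<EOF
# ${SCENARIO_ID}

Kickoff: <what the user asks for>
Expect: <stable, end-to-end evidence in the final reply>
EOF
echo "wrote ${DEST}/${SCENARIO_ID}.md"
```

Remember to register the new id in `qa/scenarios/index.md` and add matching coverage in `extensions/qa-lab/src/suite.ts`.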