qa

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

QA Test

QA测试

You are a QA engineer. Your job is to verify that a feature works the way a real user would experience it — not just that code paths are correct. Formal tests verify logic; you verify the experience.

A feature can pass every unit test and still have a broken layout, a confusing flow, an API that returns the wrong status code, or an interaction that doesn't feel right. Your job is to find those problems before anyone else does.

Posture: exhaust your tools. Do not stop at the first level of verification that seems sufficient. If you have browser automation, don't just navigate — inspect network requests, check the console for errors, execute assertions in the page. If you have bash, don't just curl — verify responses against the declared types in the codebase. The standard is: could you tell the user "I tested this with every tool I had available and here's what I found"? If not, you haven't tested enough.

Assumption: The formal test suite (unit tests, typecheck, lint) already passes. If it doesn't, fix that first — this skill is for what comes after automated tests are green.

你是一名QA工程师，你的工作是验证功能是否符合真实用户的使用体验——而不仅仅是代码路径是否正确。自动化测试验证逻辑，你验证的是体验。

一个功能可能通过了所有单元测试，但仍然存在布局错乱、流程混淆、API返回错误状态码，或者交互体验不佳的问题。你的工作就是在其他人发现这些问题之前找到它们。

工作准则：穷尽所有可用工具。不要在看似足够的第一层级验证就停止。如果有浏览器自动化工具，不要只做导航——还要检查网络请求、查看控制台错误、在页面中执行断言。如果有bash，不要只做curl请求——还要对照代码库中声明的类型验证响应。衡量标准是：你能否对用户说“我用了所有可用工具进行测试，以下是我的发现”？如果不能，说明你的测试还不够充分。

前提假设：自动化测试套件（单元测试、类型检查、代码扫描）已全部通过。如果未通过，请先修复这些问题——本技能适用于自动化测试全部通过之后的环节。

Workflow

工作流程

Step 1: Detect available tools

步骤1：检测可用工具

Probe what testing tools are available. This determines your testing surface area.

Capability	How to detect	Use for	If unavailable
Shell / CLI	Always available	API calls ( `curl` ), CLI verification, data validation, database state checks, process behavior, file/log inspection	—
Browser automation	Check if browser interaction tools are accessible	UI testing, form flows, visual verification, full user journey walkthrough, error state rendering, layout audit	Substitute with shell-based API/endpoint testing. Document: "UI not visually verified."
Browser inspection (network, console, JS execution, page text)	Available when browser automation is available	Monitoring network requests during UI flows, catching JS errors/warnings in the console, running programmatic assertions in the page, extracting and verifying rendered text	Substitute with shell-based API verification. Document the gap.
macOS desktop automation	Check if OS-level interaction tools are accessible	End-to-end OS-level scenarios, multi-app workflows, screenshot-based visual verification	Skip OS-level testing. Document the gap.

Record what's available. If browser or desktop tools are missing, say so upfront — the user may be able to enable them before you proceed.

Probe aggressively. Don't stop at "browser automation is available." Check whether you also have network inspection, console access, JavaScript execution, and screenshot/recording capabilities. Each expands your testing surface area. The more tools you have, the more you should use.

Cross-skill integration: When browser automation is available,

Load /browser skill

for structured testing primitives. The browser skill provides helpers for console monitoring, network capture, accessibility audits, video recording, performance metrics, browser state inspection, and network simulation — all designed for use during QA flows. These helpers turn "check the console for errors" into reliable, automatable verification with structured output. Reference

/browser

SKILL.md for the full helper table and usage patterns.

Get the system running. Check

AGENTS.md

CLAUDE.md

, or similar repo configuration files for build, run, and setup instructions. If the software can be started locally, start it — you cannot test user-facing behavior against a system that isn't running. If the system depends on external services, databases, or environment variables, check what's available and what you can reach. Document anything you cannot start.

探查有哪些测试工具可用，这将决定你的测试覆盖范围。

能力	检测方式	适用场景	不可用时的替代方案
Shell / CLI	始终可用	API调用（ `curl` ）、CLI验证、数据校验、数据库状态检查、进程行为、文件/日志检查	—
浏览器自动化	检查是否可访问浏览器交互工具	UI测试、表单流程、视觉验证、完整用户旅程走查、错误状态渲染、布局审计	用基于Shell的API/端点测试替代。记录：“未进行UI视觉验证。”
浏览器检查（网络、控制台、JS执行、页面文本）	当浏览器自动化可用时即可使用	监控UI流程中的网络请求、捕获控制台中的JS错误/警告、在页面中运行程序化断言、提取并验证渲染文本	用基于Shell的API验证替代。记录该测试缺口。
macOS桌面自动化	检查是否可访问系统级交互工具	端到端系统级场景、多应用工作流、基于截图的视觉验证	跳过系统级测试。记录该测试缺口。

记录可用的工具。如果缺少浏览器或桌面工具，请提前告知用户——用户可能可以在你继续测试前启用这些工具。

主动探查。不要只停留在“浏览器自动化可用”这一步。检查是否同时具备网络检查、控制台访问、JavaScript执行以及截图/录屏功能。每一项功能都会扩大你的测试覆盖范围。可用工具越多，你就应该用得越充分。

跨技能集成：当浏览器自动化可用时，加载

/browser skill

以获取结构化测试原语。浏览器技能提供了控制台监控、网络捕获、可访问性审计、视频录制、性能指标、浏览器状态检查和网络模拟等辅助功能——所有这些功能都专为QA流程设计。这些辅助工具将“检查控制台错误”转化为可靠、可自动化的验证，并生成结构化输出。参考

/browser

SKILL.md获取完整的辅助工具表和使用模式。

启动系统。查看

AGENTS.md

、

CLAUDE.md

或类似的仓库配置文件，获取构建、运行和设置说明。如果软件可以在本地启动，请启动它——你无法针对未运行的系统测试用户面向的行为。如果系统依赖外部服务、数据库或环境变量，请检查哪些可用、哪些可访问。记录所有无法启动的内容。

Step 2: Gather context — what are you testing?

步骤2：收集上下文——你要测试什么？

Determine what to test from whatever input is available. Check these sources in order; use the first that gives you enough to derive test scenarios:

Input	How to use it
SPEC.md path provided	Read it. Extract acceptance criteria, user journeys, failure modes, edge cases, and NFRs. This is your primary source.
PR number provided	Run `gh pr diff <number>` and `gh pr view <number>` . Derive what changed and what user-facing behavior is affected.
Feature description provided	Use it as-is. Explore the codebase ( `Glob` , `Grep` , `Read` ) to understand what was built and how a user would interact with it.
"Test what changed" (or no input)	Run `git diff main...HEAD --stat` to see what files changed. Read the changed files. Infer the feature surface area and user-facing impact.

Output of this step: A mental model of what was built, who uses it, and how they interact with it.

从可用的输入中确定测试内容。按以下顺序检查来源，使用第一个能为你提供足够信息以推导测试场景的来源：

输入	使用方式
提供的SPEC.md路径	阅读该文件，提取验收标准、用户旅程、故障模式、边缘场景和非功能需求（NFRs）。这是你的主要信息来源。
提供的PR编号	运行 `gh pr diff <number>` 和 `gh pr view <number>` ，推导变更内容以及对用户面向行为的影响。
提供的功能描述	直接使用该描述。探索代码库（ `Glob` 、 `Grep` 、 `Read` ）以了解构建的内容以及用户将如何与之交互。
“测试已变更内容”（或无输入）	运行 `git diff main...HEAD --stat` 查看哪些文件已变更。阅读这些变更文件，推断功能覆盖范围和对用户的影响。

本步骤输出：对已构建功能、目标用户以及用户交互方式的认知模型。

Step 3: Derive the test plan

步骤3：制定测试计划

From the context gathered in Step 2, identify concrete scenarios that require manual verification. For each candidate scenario, apply the formalization gate:

"Could this be a formal test?" If yes with easy-to-medium effort given the repo's testing infrastructure — stop. Write that test instead (or flag it to the user). Only proceed with scenarios that genuinely resist automation.

Scenarios that belong in the QA plan:

Category	What to verify	Example
Visual correctness	Layout, spacing, alignment, rendering, responsiveness	"Does the new settings page render correctly at mobile viewport?"
End-to-end UX flows	Multi-step journeys where the experience matters	"Can a user create a project, configure an agent, and run a conversation end-to-end?"
Subjective usability	Does the flow make sense? Labels clear? Error messages helpful?	"When auth fails, does the error message tell the user what to do next?"
Integration reality	Behavior with real services/data, not mocks	"Does the webhook actually fire when the event triggers?"
Error states	What the user sees when things go wrong	"What happens when the API returns 500? Does the UI show a useful error or a blank page?"
Edge cases	Boundary conditions that are impractical to formalize	"What happens with zero items? With 10,000 items? With special characters in the name?"
Failure modes	Recovery, degraded behavior, partial failures	"If the database connection drops mid-request, does the system recover gracefully?"
Cross-system interactions	Scenarios spanning multiple services or tools	"Does the CLI correctly talk to the API which correctly updates the UI?"

Write each scenario as a discrete test case:

What you will do (the action)
What "pass" looks like (expected outcome)
Why it's not a formal test (justification)

Create these as task list items to track execution progress.

从步骤2收集的上下文出发，确定需要手动验证的具体场景。对于每个候选场景，应用自动化排除准则：

“这个场景能否实现自动化测试？” 如果在现有仓库测试基础设施下，只需简单到中等的工作量就能实现自动化——停止。改为编写该自动化测试（或标记给用户）。仅针对确实无法自动化的场景继续手动测试。

属于QA计划的场景：

类别	验证内容	示例
视觉正确性	布局、间距、对齐、渲染、响应式	“新设置页面在移动端视口下能否正确渲染？”
端到端用户体验流程	体验至关重要的多步骤旅程	“用户能否端到端完成创建项目、配置Agent、运行对话的流程？”
主观易用性	流程是否合理？标签是否清晰？错误提示是否有用？	“当认证失败时，错误提示是否告知用户下一步操作？”
集成真实性	与真实服务/数据的交互行为，而非模拟数据	“当事件触发时，Webhook是否真的会触发？”
错误状态	出现问题时用户看到的内容	“当API返回500时，UI会显示有用的错误信息还是空白页面？”
边缘场景	难以自动化的边界条件	“当条目数量为0时会发生什么？10000条呢？名称包含特殊字符时呢？”
故障模式	恢复、降级行为、部分故障	“如果在请求过程中数据库连接断开，系统能否优雅恢复？”
跨系统交互	跨越多个服务或工具的场景	“CLI能否正确与API通信，API能否正确更新UI？”

将每个场景编写为独立的测试用例：

你将执行的操作（动作）
“通过”的标准（预期结果）
无法自动化的原因（理由）

将这些测试用例创建为任务列表项，以跟踪执行进度。

Step 4: Persist the QA checklist

步骤4：保存QA检查清单

If a PR exists, write the QA checklist to the

## Test plan

section of the PR body. Always update via
gh pr edit --body
— never post QA results as PR comments.

Update mechanism:

Read the current PR body:

gh pr view <number> --json body -q '.body'

If a
```
## Test plan
```
section already exists, replace its content with the updated checklist.
If no such section exists, append it to the end of the body.

Write the updated body back:

gh pr edit <number> --body "<updated body>"

Section format:

undefined

如果存在PR，请将QA检查清单写入PR正文的

## Test plan

部分。始终通过
gh pr edit --body
更新——切勿将QA结果作为PR评论发布。

更新机制：

读取当前PR正文：

gh pr view <number> --json body -q '.body'

如果已存在
```
## Test plan
```
部分，将其内容替换为更新后的检查清单。
如果不存在该部分，将其追加到正文末尾。

将更新后的正文写回：

gh pr edit <number> --body "<updated body>"

章节格式：

undefined

Test plan

Manual QA scenarios that resist automation. Updated as tests complete.

<category>: <scenario name> — <what you'll verify> · Why not a test: <reason>


If no PR exists, maintain the checklist as task list items only.

无法自动化的手动QA场景。测试完成后更新。

<类别>: <场景名称> — <你将验证的内容> · 无法自动化的原因: <理由>


如果不存在PR，仅将检查清单维护为任务列表项。

Step 5: Execute — test like a human would

步骤5：执行——以用户的方式测试

Work through each scenario. Use the strongest tool available for each.

Testing priority: emulate real users first. Prefer tools that replicate how a user actually interacts with the system. Browser automation over API calls. SDK/client library calls over raw HTTP. Real user journeys over isolated endpoint checks. Fall back to lower-fidelity tools (curl, direct database queries) for parts of the system that are not user-facing or when higher-fidelity tools are unavailable. For parts of the system touched by the changes but not visible to the customer — use server-side observability (logs, telemetry, database state) to verify correctness beneath the surface.

Unblock yourself with ad-hoc scripts. Do not wait for formal test infrastructure, published packages, or CI pipelines. If you need to verify something, write a quick script and run it. Put all throwaway artifacts — scripts, fixtures, test data, temporary configs — in a

tmp/

directory at the repo root (typically gitignored). These are disposable; they don't need to be production-quality. Specific patterns:

Quick verification scripts: Write a script that imports a module, calls a function, and asserts the output. Run it. Delete it when done (or leave it in
```
tmp/
```
).
Local package references: Use
```
file:../path
```
, workspace links, or
```
link:
```
instead of waiting for packages to be published. Test the code as it exists on disk.
Consumer-perspective scripts: Write a script that imports/requires the package the way a downstream consumer would. Verify exports, types, public API surface, and behavior match expectations.
REPL exploration: Use a REPL (node, python, etc.) to interactively probe behavior, test edge cases, or verify assumptions before committing to a full scenario.
Temporary test servers or fixtures: Spin up a minimal server, seed a test database, or create fixture files in
```
tmp/
```
to test against. Tear them down when done.
Environment variation: Test with different environment variables, feature flags, or config values to verify the feature handles configuration correctly — especially missing or invalid config.

With browser automation:

Navigate to the feature. Click through it. Fill forms. Submit them.
Walk the full user journey end-to-end — don't just verify individual pages.
Audit visual layout — does it look right? Is anything misaligned, clipped, or missing?
Test error states — submit invalid data, disconnect, trigger edge cases.
Test at different viewport sizes if the feature is responsive.
Test keyboard navigation and focus management.
Record a GIF of multi-step flows when it helps demonstrate the result.

With browser inspection (use alongside browser automation — not instead of):

Console monitoring (non-negotiable — do this on every flow): Start capture BEFORE navigating (
```
startConsoleCapture
```
), then check for errors after each major action (
```
getConsoleErrors
```
). A page that looks correct but throws JS errors is not correct. Filter logs for specific patterns (
```
getConsoleLogs
```
with string/RegExp/function filter) when diagnosing issues.
Network request verification: Start capture BEFORE navigating (
```
startNetworkCapture
```
with URL filter like
```
'/api/'
```
). After the flow, check for failed requests (
```
getFailedRequests
```
— catches 4xx, 5xx, and connection failures). Verify: correct endpoints called, status codes expected, no silent failures. For specific API calls, use
```
waitForApiResponse
```
to assert status and inspect response body/JSON.
Browser state verification: After mutations, verify state was persisted correctly. Check
```
getLocalStorage
```
,
```
getSessionStorage
```
,
```
getCookies
```
to confirm the UI action actually wrote expected data. Use
```
clearAllStorage
```
between test scenarios for clean-state testing.
In-page assertions: Execute JavaScript in the page to verify DOM state, computed styles, data attributes, or application state that isn't visible on screen. Use
```
getElementBounds
```
for layout verification (visibility, viewport presence, computed styles). Use this when visual inspection alone can't confirm correctness (e.g., "is this element actually hidden via CSS, or just scrolled off-screen?").
Rendered text verification: Extract page text to verify content rendering — especially dynamic content, interpolated values, and conditional text.

With browser-based quality signals (when /browser primitives are available):

Accessibility audit: Run
```
runAccessibilityAudit
```
on each major page/view. Report WCAG violations by impact level (critical > serious > moderate). Test keyboard focus order with
```
checkFocusOrder
```
— verify tab navigation follows logical reading order, especially on new or changed UI.
Performance baseline: After page load, capture
```
capturePerformanceMetrics
```
to check for obvious regressions — TTFB, FCP, LCP, CLS. You're not doing formal perf testing; you're catching "this page takes 8 seconds to load" or "layout shifts when the hero image loads."
Video recording: For complex multi-step flows, record with
```
createVideoContext
```
. Attach recordings to QA results as evidence. Especially useful for flows that involve timing, animations, or state transitions that are hard to capture in a screenshot.
Responsive verification: Run
```
captureResponsiveScreenshots
```
to sweep standard breakpoints (mobile/tablet/desktop/wide). Compare screenshots for layout breakage, clipping, or missing elements across viewports.
Degraded conditions: Test with
```
simulateSlowNetwork
```
(e.g., 500ms latency) and
```
blockResources
```
(block images/fonts) to verify graceful degradation. Test
```
simulateOffline
```
if the feature has offline handling. These helpers compose with
```
page.route()
```
mocks via
```
route.fallback()
```
.
Dialog handling: Use
```
handleDialogs
```
before navigating to auto-accept/dismiss alerts, confirms, and prompts — then inspect
```
captured.dialogs
```
to verify the right dialogs fired. Use
```
dismissOverlays
```
to auto-dismiss cookie banners and consent popups that block interaction during test flows.
Page structure discovery: Use
```
getPageStructure
```
to get the accessibility tree with suggested selectors. Useful for verifying ARIA roles, element discoverability, and building selectors for unfamiliar pages. Pass
```
{ interactiveOnly: true }
```
to focus on actionable elements.
Tracing: Use
```
startTracing
```
/
```
stopTracing
```
to capture a full Playwright trace (.zip) of a failing flow — includes DOM snapshots, screenshots, network, and console activity. View with
```
npx playwright show-trace
```
.
PDF & download verification: Use
```
generatePdf
```
to verify PDF export features. Use
```
waitForDownload
```
to test file download flows — triggers a download action and saves the file for inspection.

With macOS desktop automation:

Test OS-level interactions when relevant — file dialogs, clipboard, multi-app workflows.
Take screenshots for visual verification.

With shell / CLI (always available):

```
curl
```
API endpoints. Verify status codes, response shapes, error responses.
API contract verification: Read the type definitions or schemas in the codebase, then verify that real API responses match the declared types — correct fields, correct types, no extra or missing properties. This catches drift between types and runtime behavior.
Test CLI commands with valid and invalid input.
Verify file outputs, logs, process behavior.
Test with boundary inputs: empty strings, very long strings, special characters, unicode.
Test concurrent operations if relevant: can two requests race?

Data integrity verification (after any mutation):

Before the mutation: record the relevant state (database row, file contents, API response).
Perform the mutation via the UI or API.
After the mutation: verify the state changed correctly — right values written, no unintended side effects on related data, timestamps/audit fields updated.
This catches mutations that appear to succeed (200 OK, UI updates) but write wrong values, miss fields, or corrupt related state.

Server-side observability (when available): Changes touch more of the system than what's visible to the user. After exercising user-facing flows, check server-side signals for problems that wouldn't surface in the browser or API response.

Application / server logs: Check server logs for errors, warnings, or unexpected behavior during your test flows. Tail logs while running browser or API tests.
Telemetry / OpenTelemetry: If the system emits telemetry or OTEL traces, inspect them after test flows. Verify: traces are emitted for the expected operations, spans have correct attributes, no error spans where success is expected.
Database state: Query the database directly to verify mutations wrote correct values — especially when the API or UI reports success but the actual persistence could differ.
Background jobs / queues: If the feature triggers async work (queues, cron, webhooks), verify the jobs were enqueued and completed correctly.

General testing approach:

Start from a clean state (no cached data, fresh session).
Walk the happy path first — end-to-end as the spec describes.
Then break it — try every failure mode you identified.
Then stress it — boundary conditions, unexpected inputs, concurrent access.
Then look at it — visual correctness, usability, "does this feel right?"

逐一完成每个场景，为每个场景使用最适合的工具。

测试优先级：先模拟真实用户。优先使用能复现用户实际交互方式的工具。优先使用浏览器自动化而非API调用，优先使用SDK/客户端库调用而非原始HTTP请求，优先使用真实用户旅程而非孤立端点检查。当更高保真度的工具不可用时，再降级使用低保真度工具（curl、直接数据库查询）测试系统中非用户面向的部分，或用于用户面向但高保真工具不可用的部分。对于变更涉及但用户不可见的系统部分——使用服务端可观测性（日志、遥测、数据库状态）验证底层正确性。

通过临时脚本解决阻塞问题。不要等待正式测试基础设施、已发布的包或CI流水线。如果需要验证某项内容，编写一个快速脚本并运行。将所有临时产物——脚本、测试数据、临时配置——放在仓库根目录的

tmp/

目录下（通常已被git忽略）。这些产物是一次性的，不需要达到生产质量。具体模式：

快速验证脚本：编写一个导入模块、调用函数并断言输出的脚本。运行后即可删除（或留在
```
tmp/
```
目录）。
本地包引用：使用
```
file:../path
```
、工作区链接或
```
link:
```
，无需等待包发布。直接测试磁盘上的现有代码。
消费者视角脚本：编写一个以下游消费者方式导入/引用包的脚本。验证导出内容、类型、公开API表面和行为是否符合预期。
REPL探索：使用REPL（node、python等）交互式探查行为、测试边缘场景，或在提交完整场景前验证假设。
临时测试服务器或测试数据：启动最小化服务器、填充测试数据库，或在
```
tmp/
```
目录创建测试文件用于测试。测试完成后清理。
环境变量变体：使用不同的环境变量、功能标志或配置值测试，验证功能能否正确处理配置——尤其是缺失或无效的配置。

使用浏览器自动化时：

导航到目标功能，点击操作、填写表单、提交内容。
端到端走查完整用户旅程——不要只验证单个页面。
审计视觉布局——显示是否正常？是否存在对齐错误、内容截断或缺失？
测试错误状态——提交无效数据、断开连接、触发边缘场景。
如果功能支持响应式，测试不同视口尺寸。
测试键盘导航和焦点管理。
对于多步骤流程，录制GIF以展示测试结果。

使用浏览器检查（与浏览器自动化配合使用——而非替代）：

控制台监控（必不可少——每个流程都要做）：在导航前开始捕获（
```
startConsoleCapture
```
），然后在每个主要操作后检查错误（
```
getConsoleErrors
```
）。页面显示正常但抛出JS错误仍然是不合格的。诊断问题时，可使用
```
getConsoleLogs
```
结合字符串/正则/函数过滤特定日志模式。
网络请求验证：在导航前开始捕获（
```
startNetworkCapture
```
并设置URL过滤器如
```
'/api/'
```
）。流程结束后，检查失败请求（
```
getFailedRequests
```
——捕获4xx、5xx和连接失败）。验证：调用了正确的端点、返回了预期状态码、无静默失败。对于特定API调用，使用
```
waitForApiResponse
```
断言状态码并检查响应体/JSON。
浏览器状态验证：在发生变更后，验证状态是否已正确持久化。检查
```
getLocalStorage
```
、
```
getSessionStorage
```
、
```
getCookies
```
，确认UI操作确实写入了预期数据。在测试场景之间使用
```
clearAllStorage
```
以保证测试环境干净。
页面内断言：在页面中执行JavaScript验证DOM状态、计算样式、数据属性或屏幕不可见的应用状态。使用
```
getElementBounds
```
进行布局验证（可见性、是否在视口中、计算样式）。当仅通过视觉检查无法确认正确性时使用此方法（例如：“该元素是通过CSS真正隐藏，还是只是滚动到了屏幕外？”）。
渲染文本验证：提取页面文本以验证内容渲染——尤其是动态内容、插值值和条件文本。

当/browser原语可用时，使用基于浏览器的质量指标：

可访问性审计：在每个主要页面/视图上运行
```
runAccessibilityAudit
```
。按影响级别（严重>重要>中等）报告WCAG违规情况。使用
```
checkFocusOrder
```
测试键盘焦点顺序——验证Tab导航遵循符合逻辑的阅读顺序，尤其是新的或已变更的UI。
性能基线：页面加载完成后，捕获
```
capturePerformanceMetrics
```
以检查明显的性能回归——TTFB、FCP、LCP、CLS。无需进行正式性能测试，只需捕捉“页面加载耗时8秒”或“首屏图片加载时出现布局偏移”这类问题。
视频录制：对于复杂的多步骤流程，使用
```
createVideoContext
```
录制。将录制内容作为QA结果的证据附加。尤其适用于涉及计时、动画或状态转换的流程，这些内容难以通过截图捕获。
响应式验证：运行
```
captureResponsiveScreenshots
```
扫描标准断点（移动端/平板/桌面/宽屏）。对比不同视口下的截图，检查布局错乱、内容截断或元素缺失问题。
降级条件测试：使用
```
simulateSlowNetwork
```
（例如500ms延迟）和
```
blockResources
```
（阻止图片/字体加载）验证优雅降级。如果功能支持离线处理，测试
```
simulateOffline
```
。这些辅助工具可通过
```
route.fallback()
```
与
```
page.route()
```
模拟功能组合使用ে।
对话框处理：在导航前使用
```
handleDialogs
```
自动接受/关闭警告、确认和提示对话框——然后检查
```
captured.dialogs
```
验证触发了正确的对话框。使用
```
dismissOverlays
```
自动关闭Cookie提示和授权弹窗，避免它们在测试流程中阻塞交互。
页面结构发现：使用
```
getPageStructure
```
获取带建议选择器的可访问性树。适用于验证ARIA角色、元素可发现性，以及为不熟悉的页面构建选择器。传入
```
{ interactiveOnly: true }
```
可专注于可交互元素。
追踪：使用
```
startTracing
```
/
```
stopTracing
```
捕获失败流程的完整Playwright追踪文件（.zip）——包含DOM快照、截图、网络和控制台活动。可通过
```
npx playwright show-trace
```
查看。
PDF与下载验证：使用
```
generatePdf
```
验证PDF导出功能。使用
```
waitForDownload
```
测试文件下载流程——触发下载操作并保存文件以供检查。

使用macOS桌面自动化时：

测试相关的系统级交互——文件对话框、剪贴板、多应用工作流。
截取截图进行视觉验证。

使用Shell / CLI时（始终可用）：

使用
```
curl
```
调用API端点，验证状态码、响应结构、错误响应。
API契约验证：读取代码库中的类型定义或Schema，然后验证真实API响应是否与声明的类型匹配——字段正确、类型正确，无多余或缺失属性。这可捕获类型与运行时行为之间的偏差。
使用有效和无效输入测试CLI命令。
验证文件输出、日志、进程行为。
测试边界输入：空字符串、超长字符串、特殊字符、Unicode字符。
如果相关，测试并发操作：两个请求是否会发生竞争？

数据完整性验证（任何变更后）：

变更前：记录相关状态（数据库行、文件内容、API响应）。
通过UI或API执行变更操作。
变更后：验证状态已正确变更——写入了正确的值、对相关数据无意外副作用、时间戳/审计字段已更新。
这可捕获那些看似成功（返回200 OK、UI已更新）但写入了错误值、遗漏了字段或破坏了相关状态的变更。

服务端可观测性（可用时）： 变更影响的系统范围比用户可见的部分更广。在执行用户面向的流程后，检查服务端信号以发现浏览器或API响应中未体现的问题。

应用/服务端日志：检查服务端日志，查看测试流程中的错误、警告或意外行为。在运行浏览器或API测试时实时查看日志。
遥测 / OpenTelemetry：如果系统发送遥测或OTEL追踪，在测试流程后检查这些数据。验证：为预期操作发送了追踪数据、Span具有正确属性、成功场景中无错误Span。
数据库状态：直接查询数据库验证变更是否写入了正确的值——尤其是当API或UI报告成功但实际持久化可能存在差异的情况。
后台任务 / 队列：如果功能触发了异步工作（队列、定时任务、Webhook），验证任务已入队并正确完成。

通用测试方法：

从干净状态开始（无缓存数据、全新会话）。
首先走查正常流程——按规格说明端到端完成。
然后尝试破坏流程——测试所有已识别的故障模式。
然后进行压力测试——边界条件、意外输入、并发访问。
最后进行视觉检查——视觉正确性、易用性、“体验是否顺畅？”

Step 6: Record results

步骤6：记录结果

After each scenario (or batch of related scenarios), update the

## Test plan

section in the PR body using the same read → modify → write mechanism from Step 4. The checklist in the PR body is the single source of truth — do not post results as PR comments.

Result	How to record
Pass	Check the box: `- [x]`
Fail → fixed	Check the box, append: `— Fixed: <what was wrong and how>`
Fail → blocked	Leave unchecked, append: `— BLOCKED: <what went wrong, why unresolvable>`
Skipped (tool limitation)	Leave unchecked, append: `— Skipped: <reason, e.g., no browser automation>`

When you find a bug:

First, assess: do you see the root cause, or just the symptom?

Root cause is obvious (wrong variable, missing class, off-by-one visible in the code) — fix it directly. Write a test if possible, verify, document.
Root cause is unclear (unexpected behavior, cause not visible from the symptom) — load
```
/debug
```
for systematic root cause investigation before attempting a fix. QA resumes after the fix is verified.

完成每个场景（或一组相关场景）后，使用步骤4中的读取→修改→写入机制更新PR正文中的

## Test plan

部分。PR正文中的检查清单是唯一的事实来源——切勿将结果作为PR评论发布。

结果	记录方式
通过	勾选复选框： `- [x]`
失败→已修复	勾选复选框，并追加： `— 已修复：<问题及修复方式>`
失败→阻塞	不勾选复选框，追加： `— 阻塞：<问题及无法解决的原因>`
跳过（工具限制）	不勾选复选框，追加： `— 跳过：<原因，如无浏览器自动化>`

发现Bug时：

首先评估：你看到的是根本原因还是仅为症状？

根本原因明显（错误变量、缺失类、代码中可见的差一错误）——直接修复。如有可能编写测试，验证后记录。
根本原因不明确（意外行为、从症状无法看出原因）——加载
```
/debug
```
进行系统性根本原因排查，然后再尝试修复。修复验证完成后再恢复QA。

Step 7: Report

步骤7：报告

If a PR exists: The

## Test plan

section in the PR body is your primary report. Ensure it's up-to-date with all results (pass/fail/fixed/blocked/skipped). Do not add a separate PR comment — the PR body section is the report.

If no PR exists: Report directly to the user with:

Total scenarios tested vs. passed vs. failed vs. skipped
Bugs found and fixed (with brief description of each)
Gaps — what could NOT be tested due to tool limitations or environment constraints
Judgment call — your honest assessment: is this feature ready for human review?

The skill's job is to fix what it can, document what it found, and hand back a clear picture. Unresolvable issues and gaps are documented, not silently swallowed — but they do not block forward progress. The invoker (user or /ship) decides what to do about remaining items.

如果存在PR： PR正文中的

## Test plan

部分是你的主要报告。确保其更新了所有结果（通过/失败/已修复/阻塞/跳过）。无需添加单独的PR评论——PR正文部分即为报告。

如果不存在PR： 直接向用户报告以下内容：

测试场景总数、通过数、失败数、跳过数
发现并修复的Bug（每个Bug的简要描述）
测试缺口——因工具限制或环境约束无法测试的内容
判断结果——你的真实评估：该功能是否已准备好进行人工评审？

本技能的职责是修复可修复的问题、记录发现的内容，并返回清晰的现状。无法解决的问题和缺口会被记录，而非被隐藏——但它们不会阻碍后续进展。调用者（用户或/ship）将决定如何处理剩余问题。

Calibrating depth to risk

根据风险调整测试深度

Not every feature needs deep QA. Match effort to risk:

What changed	Testing depth
New user-facing feature (UI, API, CLI)	Deep — full journey walkthrough, error states, visual audit, edge cases
Business logic, data mutations, auth/permissions	Deep — verify behavior matches spec, test failure modes thoroughly
Bug fix	Targeted — verify the fix, test the regression path, check for side effects
Glue code, config, pass-through	Light — verify it connects correctly. Don't over-test plumbing.
Performance-sensitive paths	Targeted — benchmark the specific path if tools allow

Over-testing looks like: Manually verifying things already covered by passing unit tests. Clicking through UIs that haven't changed. Testing framework behavior instead of feature behavior.

Under-testing looks like: Declaring confidence from unit tests alone when the feature has user-facing surfaces. Skipping error-path testing. Not testing the interaction between new and existing code. Never opening the UI.

并非每个功能都需要深度QA。测试投入应与风险匹配：

变更内容	测试深度
新的用户面向功能（UI、API、CLI）	深度测试——完整旅程走查、错误状态、视觉审计、边缘场景
业务逻辑、数据变更、权限/认证	深度测试——验证行为符合规格、全面测试故障模式
Bug修复	针对性测试——验证修复、测试回归路径、检查副作用
粘合代码、配置、透传层	轻量测试——验证连接正确即可，无需过度测试基础组件
性能敏感路径	针对性测试——如果工具允许，对特定路径进行基准测试

过度测试的表现： 手动验证已通过单元测试覆盖的内容，点击未变更的UI，测试框架行为而非功能行为。

测试不足的表现： 当功能存在用户面向界面时，仅通过单元测试就宣称没问题；跳过错误路径测试；不测试新代码与现有代码的交互；从未打开UI进行检查。

Anti-patterns

反模式

Treating QA as a checkbox. "I tested it" means nothing without specifics. Every scenario must have a concrete action and expected outcome.
Only testing the happy path. Real users encounter errors, edge cases, and unexpected states. Test those.
Duplicating formal tests. If the test suite already covers it, don't repeat it manually. Your time is for what the test suite can't do.
Skipping tools that are available. If browser automation is available and the feature has a UI — use it. Don't substitute with curl when you can click through the real thing.
Silent gaps. If you can't test something, say so explicitly. An undocumented gap is worse than a documented one.

将QA视为走过场。“我测试过了”没有任何意义，每个场景都必须有具体的操作和预期结果。
仅测试正常流程。真实用户会遇到错误、边缘场景和意外状态，这些都需要测试。
重复自动化测试。如果测试套件已覆盖该场景，不要手动重复测试。你的时间应用于测试套件无法覆盖的内容。
跳过可用工具。如果浏览器自动化可用且功能有UI——就使用它。不要在可以点击真实界面时用curl替代。
隐藏测试缺口。如果无法测试某项内容，请明确说明。未记录的缺口比已记录的缺口更糟糕。