Testing Guide
What to test, how to test it, and what NOT to test — for a plugin made of prompt files, Python glue, and configuration.
Philosophy: GenAI-First Testing
Traditional unit tests work for deterministic logic. But most bugs in this project are drift — docs diverge from code, agents contradict commands, component counts go stale. GenAI congruence tests catch these. Unit tests don't.
Decision rule: Can you write `assert x == y` and trust that it won't break next week? → Unit test. Otherwise → GenAI test or structural test.
Three Test Patterns
1. Judge Pattern (single artifact evaluation)
An LLM evaluates one artifact against criteria. Use for: doc completeness, security posture, architectural intent.
```python
from pathlib import Path

import pytest

pytestmark = [pytest.mark.genai]


# Shown as a test-class method; `genai` is the judge fixture from tests/genai/conftest.py.
def test_agents_documented_in_claude_md(self, genai):
    agents_on_disk = list_agents()  # project helper, not shown here
    claude_md = Path("CLAUDE.md").read_text()
    result = genai.judge(
        question="Does CLAUDE.md document all active agents?",
        context=f"Agents on disk: {agents_on_disk}\nCLAUDE.md:\n{claude_md[:3000]}",
        criteria="All active agents should be referenced. Score by coverage %.",
    )
    assert result["score"] >= 5, f"Gap: {result['reasoning']}"
```
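The judge example calls a `list_agents()` helper that this guide does not define. A minimal sketch of what such a helper might look like, assuming each active agent is a single `.md` file directly inside an `agents/` directory (the default path and the non-recursive lookup are assumptions, not the project's confirmed layout):

```python
from pathlib import Path


def list_agents(agents_dir: Path = Path("agents")) -> list[str]:
    """Return sorted file names of agent definitions found in agents_dir.

    Hypothetical helper: assumes one .md file per active agent. The glob is
    non-recursive, so files in subfolders such as archived/ are ignored.
    """
    if not agents_dir.is_dir():
        return []
    return sorted(p.name for p in agents_dir.glob("*.md") if p.is_file())
```

Because the list is discovered from disk at test time, adding or removing an agent changes the judge's input without any edit to the test itself.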
2. Congruence Pattern (two-source cross-reference)
The most valuable pattern. An LLM checks two files that should agree. Use for: command↔agent alignment, FORBIDDEN lists, config↔reality.
```python
from pathlib import Path


# Shown as a test-class method; `genai` is the judge fixture.
def test_implement_and_implementer_share_forbidden_list(self, genai):
    implement = Path("commands/implement.md").read_text()
    implementer = Path("agents/implementer.md").read_text()
    result = genai.judge(
        question="Do these files have matching FORBIDDEN behavior lists?",
        context=f"implement.md:\n{implement[:5000]}\nimplementer.md:\n{implementer[:5000]}",
        criteria="Both should define same enforcement gates. Score 10=identical, 0=contradictory.",
    )
    # Surface the judge's reasoning so a failure explains itself.
    assert result["score"] >= 5, f"Mismatch: {result['reasoning']}"
```
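The congruence test builds its context by hand, labeling each file and slicing it to 5000 characters. One possible way to factor that out is sketched here. `build_context` is a hypothetical helper, not part of the project. It also notes when a source was cut off, on the assumption that the judge should know it is seeing a partial file.

```python
def build_context(sources: dict[str, str], limit: int = 5000) -> str:
    """Join labeled sources into one context string for the judge.

    Hypothetical helper: each source is truncated to `limit` characters,
    and the label records the truncation so it is visible in the prompt.
    """
    parts = []
    for label, text in sources.items():
        note = f" (truncated from {len(text)} chars)" if len(text) > limit else ""
        parts.append(f"{label}{note}:\n{text[:limit]}")
    return "\n".join(parts)
```

A call such as `build_context({"implement.md": implement, "implementer.md": implementer})` would then replace the inline f-string.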
3. Structural Pattern (dynamic filesystem discovery)
No LLM needed. Discover components dynamically and assert structural properties. Use for: component existence, manifest sync, skill loading.
```python
from pathlib import Path


# Shown as a test-class method.
def test_all_active_skills_have_content(self):
    skills_dir = Path("plugins/autonomous-dev/skills")
    for skill in skills_dir.iterdir():
        # Skip the archive folder and any stray files.
        if skill.name == "archived" or not skill.is_dir():
            continue
        skill_md = skill / "SKILL.md"
        assert skill_md.exists(), f"Skill {skill.name} missing SKILL.md"
        assert len(skill_md.read_text()) > 100, f"Skill {skill.name} is a hollow shell"
```
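The structural test stops at the first failing skill. A variant that collects every problem before failing might look like the sketch below. `find_skill_problems` and its `min_chars` parameter are illustrative names, and the 100-character threshold is carried over from the example.

```python
from pathlib import Path


def find_skill_problems(skills_dir: Path, min_chars: int = 100) -> list[str]:
    """Return one message per active skill that is missing or near-empty.

    Hypothetical helper mirroring the structural test: skips archived/ and
    plain files, then checks that SKILL.md exists and has real content.
    """
    problems = []
    for skill in sorted(skills_dir.iterdir()):
        if skill.name == "archived" or not skill.is_dir():
            continue
        skill_md = skill / "SKILL.md"
        if not skill_md.exists():
            problems.append(f"{skill.name}: missing SKILL.md")
        elif len(skill_md.read_text()) <= min_chars:
            problems.append(f"{skill.name}: SKILL.md is a hollow shell")
    return problems
```

A test could then assert `not problems` and print the full list in the failure message, so one run reports every hollow skill.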
Anti-Patterns (NEVER do these)
Hardcoded counts
```python
# BAD: breaks every time a component is added/removed
assert len(agents) == 14
assert hook_count == 17

# GOOD: minimum thresholds + structural checks
assert len(agents) >= 8, "Pipeline needs at least 8 agents"
assert "implementer.md" in agent_names, "Core agent missing"
```
Testing config values
```python
import re

# BAD: breaks on every config update
assert settings["version"] == "3.51.0"

# GOOD: test structure, not values
assert "version" in settings
assert re.match(r"\d+\.\d+\.\d+", settings["version"])
```
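One detail matters when checking version structure with a regex: an unescaped `.` matches any character, so a pattern like `\d+.\d+.\d+` would also accept a string such as `3x51y0`. A small sketch of a stricter check follows. `has_valid_version` is a hypothetical helper, and its use of `fullmatch` (which rejects suffixes like `-beta`) is an assumption about how strict the project wants to be.

```python
import re

# Escaped dots: three dot-separated integers and nothing else.
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def has_valid_version(settings: dict) -> bool:
    """Check that settings carries a MAJOR.MINOR.PATCH version string."""
    version = settings.get("version")
    return isinstance(version, str) and VERSION_PATTERN.fullmatch(version) is not None
```

This keeps the test tied to the shape of the value, so a version bump never breaks it.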
Testing file paths that move
```python
from pathlib import Path

# BAD: breaks on renames/moves
assert Path("plugins/autonomous-dev/lib/old_name.py").exists()

# GOOD: use glob discovery
assert any(Path("plugins/autonomous-dev/lib").glob("*skill*"))
```
**Rule**: If the test itself is the thing that needs updating most often, delete it.

---
Test Tiers (auto-categorized by directory)
No manual `@pytest.mark` needed: directory location determines tier.
```
tests/
├── regression/
│   ├── smoke/          # Tier 0: Critical path (<5s), CI GATE
│   ├── regression/     # Tier 1: Feature protection (<30s)
│   ├── extended/       # Tier 2: Deep validation (<5min)
│   └── progression/    # Tier 3: TDD red phase (not yet implemented)
├── unit/               # Isolated functions (<1s each)
├── integration/        # Multi-component workflows (<30s)
├── genai/              # LLM-as-judge (opt-in via --genai flag)
└── archived/           # Excluded from runs
```
Where to put a new test:
- Protecting a released critical path? → `regression/smoke/`
- Protecting a released feature? → `regression/regression/`
- Testing a pure function? → `unit/`
- Testing component interaction? → `integration/`
- Checking doc↔code drift? → `genai/`
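The guide does not show how directory location becomes a marker. One common way to do it is a `pytest_collection_modifyitems` hook in the top-level `conftest.py`. The sketch below is an assumption about the mechanism, not the project's actual code. `marker_for` and the two name sets are illustrative, and `item.path` requires pytest 7 or newer.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping, taken from the directory tree in this section.
REGRESSION_TIERS = {"smoke", "regression", "extended", "progression"}
TOP_LEVEL_KINDS = {"unit", "integration", "genai"}


def marker_for(test_path: Path, tests_root: Path) -> Optional[str]:
    """Map a test file's location under tests_root to a marker name, or None."""
    try:
        parts = test_path.relative_to(tests_root).parts
    except ValueError:  # file is not under tests_root
        return None
    # tests/regression/<tier>/... takes its marker from the tier directory.
    if len(parts) >= 3 and parts[0] == "regression" and parts[1] in REGRESSION_TIERS:
        return parts[1]
    # tests/<kind>/... takes its marker from the top-level directory.
    if len(parts) >= 2 and parts[0] in TOP_LEVEL_KINDS:
        return parts[0]
    return None  # e.g. tests/archived/ gets no marker


def pytest_collection_modifyitems(config, items):
    """pytest hook: tag each collected test with its directory-derived marker."""
    tests_root = Path(str(config.rootpath)) / "tests"
    for item in items:
        marker = marker_for(Path(str(item.path)), tests_root)
        if marker:
            item.add_marker(marker)
```

With a hook like this in place, `pytest -m smoke` selects everything under `tests/regression/smoke/` without any decorator in the test files. The marker names would still need to be registered in the pytest configuration to avoid unknown-marker warnings.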
Run commands:
```bash
pytest -m smoke                    # CI gate
pytest -m "smoke or regression"    # Feature protection
pytest tests/genai/ --genai        # GenAI validation (opt-in)
```
GenAI Test Infrastructure
```python
# tests/genai/conftest.py provides two fixtures:
# - genai: Gemini Flash via OpenRouter (cheap, fast)
# - genai_smart: Haiku 4.5 via OpenRouter (complex reasoning)
# Requires: OPENROUTER_API_KEY env var + --genai pytest flag
# Cost: ~$0.02 per full run with 24h response caching
```
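The opt-in requirement (the `--genai` flag plus `OPENROUTER_API_KEY`) can be enforced at collection time. The sketch below shows one way a `conftest.py` might do this. It follows the standard pytest pattern for command-line-controlled skipping, but `genai_skip_reason` is a hypothetical helper and the project's real conftest may differ.

```python
import os
from typing import Mapping, Optional


def genai_skip_reason(flag_enabled: bool, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return why GenAI tests should be skipped, or None if they may run."""
    if not flag_enabled:
        return "GenAI tests are opt-in: pass --genai to run them"
    if not env.get("OPENROUTER_API_KEY"):
        return "OPENROUTER_API_KEY is not set"
    return None


def pytest_addoption(parser):
    """Register the --genai command-line flag."""
    parser.addoption("--genai", action="store_true", default=False,
                     help="run GenAI (LLM-as-judge) tests")


def pytest_collection_modifyitems(config, items):
    """Skip every test marked `genai` unless the flag and API key are present."""
    import pytest  # imported lazily so the helper above works without pytest

    reason = genai_skip_reason(config.getoption("--genai"))
    if reason is None:
        return
    skip = pytest.mark.skip(reason=reason)
    for item in items:
        if "genai" in item.keywords:
            item.add_marker(skip)
```

Skipping rather than failing keeps a plain `pytest` run free of API calls, which matches the "no surprise API costs" rule.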
**Scaffold for any repo**: `/scaffold-genai-uat` generates the full `tests/genai/` setup with portable client, universal tests, and project-specific congruence tests auto-discovered by GenAI.
---
What to Test vs What Not To
| Test This | With This | Not This |
|---|---|---|
| Pure Python functions | Unit tests | — |
| Component interactions | Integration tests | — |
| Doc ↔ code alignment | GenAI congruence | Hardcoded string matching |
| Component existence | Structural (glob) | Hardcoded counts |
| FORBIDDEN list sync | GenAI congruence | Manual comparison |
| Security posture | GenAI judge | Regex scanning |
| Config structure | Structural | Config values |
| Agent output quality | GenAI judge | Output string matching |
Hard Rules
- 100% pass rate required: ALL tests must pass, 0 failures. Coverage targets are separate.
- Tests before implementation: write failing tests, then implement.
- Regression test for every bug fix: named `test_regression_issue_NNN_description`.
- No test is better than a flaky test: if it fails randomly, fix or delete it.
- GenAI tests are opt-in: `--genai` flag required, no surprise API costs.
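The regression naming rule can itself be checked mechanically. This is a minimal sketch that assumes `NNN` stands for a numeric issue ID and that the description is lowercase snake case. Both are readings of the convention and are not stated in the guide, and `parse_regression_name` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Assumed shape: test_regression_issue_<digits>_<snake_case_description>
REGRESSION_NAME = re.compile(r"test_regression_issue_(\d+)_([a-z0-9_]+)")


def parse_regression_name(name: str) -> Optional[Tuple[int, str]]:
    """Return (issue_number, description) if name follows the convention, else None."""
    match = REGRESSION_NAME.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

A structural test could run this over every test function in `tests/regression/` and flag names that return `None`.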