Testing Guide

What to test, how to test it, and what NOT to test — for a plugin made of prompt files, Python glue, and configuration.

Philosophy: GenAI-First Testing

Traditional unit tests work for deterministic logic. But most bugs in this project are drift — docs diverge from code, agents contradict commands, component counts go stale. GenAI congruence tests catch these. Unit tests don't.

Decision rule: Can you write `assert x == y` and it won't break next week? → Unit test. Otherwise → GenAI test or structural test.

Three Test Patterns

1. Judge Pattern (single artifact evaluation)

An LLM evaluates one artifact against criteria. Use for: doc completeness, security posture, architectural intent.

```python
from pathlib import Path

import pytest

pytestmark = [pytest.mark.genai]

# Shown as a method of a test class, hence `self`.
# `list_agents()` is a project helper that is not shown here.
def test_agents_documented_in_claude_md(self, genai):
    agents_on_disk = list_agents()
    claude_md = Path("CLAUDE.md").read_text()
    result = genai.judge(
        question="Does CLAUDE.md document all active agents?",
        context=f"Agents on disk: {agents_on_disk}\nCLAUDE.md:\n{claude_md[:3000]}",
        criteria="All active agents should be referenced. Score by coverage %."
    )
    assert result["score"] >= 5, f"Gap: {result['reasoning']}"
```
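For orientation, here is a minimal sketch of what a `judge` helper along these lines could do internally. It is not the project's fixture: the function name, the prompt wording, the 0 to 10 scale, and the injected `ask` callable (which stands in for the actual LLM call) are all assumptions made for illustration.

```python
import json
from typing import Callable


def judge(ask: Callable[[str], str], question: str, context: str, criteria: str) -> dict:
    """Ask an LLM to score one artifact and return {"score", "reasoning"}.

    `ask` is a hypothetical callable that sends a prompt to a model and
    returns its raw text reply.
    """
    prompt = (
        f"Question: {question}\n\nCriteria: {criteria}\n\nContext:\n{context}\n\n"
        'Reply with JSON only: {"score": <integer 0-10>, "reasoning": "<text>"}'
    )
    raw = ask(prompt)
    try:
        data = json.loads(raw)
        score = int(data["score"])
        reasoning = str(data.get("reasoning", ""))
    except (ValueError, KeyError, TypeError):
        # Fail closed: a reply that cannot be parsed counts as the lowest score.
        return {"score": 0, "reasoning": f"Unparseable judge reply: {raw[:200]}"}
    # Clamp to the assumed 0-10 range so thresholds stay meaningful.
    return {"score": max(0, min(10, score)), "reasoning": reasoning}
```

Returning a score of 0 on an unparseable reply makes the calling test fail rather than pass by accident.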

2. Congruence Pattern (two-source cross-reference)

The most valuable pattern. An LLM checks two files that should agree. Use for: command↔agent alignment, FORBIDDEN lists, config↔reality.

```python
from pathlib import Path

# Shown as a method of a test class, hence `self`.
def test_implement_and_implementer_share_forbidden_list(self, genai):
    implement = Path("commands/implement.md").read_text()
    implementer = Path("agents/implementer.md").read_text()
    result = genai.judge(
        question="Do these files have matching FORBIDDEN behavior lists?",
        context=f"implement.md:\n{implement[:5000]}\nimplementer.md:\n{implementer[:5000]}",
        criteria="Both should define same enforcement gates. Score 10=identical, 0=contradictory."
    )
    assert result["score"] >= 5
```

3. Structural Pattern (dynamic filesystem discovery)

No LLM needed. Discover components dynamically and assert structural properties. Use for: component existence, manifest sync, skill loading.

```python
from pathlib import Path

# Shown as a method of a test class, hence `self`.
def test_all_active_skills_have_content(self):
    skills_dir = Path("plugins/autonomous-dev/skills")
    for skill in skills_dir.iterdir():
        # Skip the archive folder and any stray files.
        if skill.name == "archived" or not skill.is_dir():
            continue
        skill_md = skill / "SKILL.md"
        assert skill_md.exists(), f"Skill {skill.name} missing SKILL.md"
        assert len(skill_md.read_text()) > 100, f"Skill {skill.name} is a hollow shell"
```

Anti-Patterns (NEVER do these)

Hardcoded counts

```python
# BAD: breaks every time a component is added or removed
assert len(agents) == 14
assert hook_count == 17

# GOOD: minimum thresholds + structural checks
assert len(agents) >= 8, "Pipeline needs at least 8 agents"
assert "implementer.md" in agent_names, "Core agent missing"
```

Testing config values

```python
import re

# BAD: breaks on every config update
assert settings["version"] == "3.51.0"

# GOOD: test structure, not values
assert "version" in settings
assert re.match(r"\d+\.\d+\.\d+", settings["version"])
```

Testing file paths that move

```python
from pathlib import Path

# BAD: breaks on renames/moves
assert Path("plugins/autonomous-dev/lib/old_name.py").exists()

# GOOD: use glob discovery
assert any(Path("plugins/autonomous-dev/lib").glob("*skill*"))
```


**Rule**: If the test itself is the thing that needs updating most often, delete it.

---

Test Tiers (auto-categorized by directory)

No manual `@pytest.mark` needed: directory location determines the tier.

tests/
├── regression/
│   ├── smoke/           # Tier 0: Critical path (<5s) — CI GATE
│   ├── regression/      # Tier 1: Feature protection (<30s)
│   ├── extended/        # Tier 2: Deep validation (<5min)
│   └── progression/     # Tier 3: TDD red phase (not yet implemented)
├── unit/                # Isolated functions (<1s each)
├── integration/         # Multi-component workflows (<30s)
├── genai/               # LLM-as-judge (opt-in via --genai flag)
└── archived/            # Excluded from runs

Where to put a new test:

- Protecting a released critical path? → `regression/smoke/`
- Protecting a released feature? → `regression/regression/`
- Testing a pure function? → `unit/`
- Testing component interaction? → `integration/`
- Checking doc↔code drift? → `genai/`

Run commands:

```bash
pytest -m smoke                    # CI gate
pytest -m "smoke or regression"    # Feature protection
pytest tests/genai/ --genai        # GenAI validation (opt-in)
```
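One way the directory-to-tier mapping could be wired up is a collection hook in `tests/conftest.py`. The sketch below is an assumption about how such a hook might look, not the project's actual conftest: the `TIERS` set and the `tier_for` helper are hypothetical, and `item.path` requires pytest 7 or later.

```python
from pathlib import Path
from typing import Optional

# Hypothetical set of directory names that double as marker names.
TIERS = {"smoke", "regression", "extended", "progression",
         "unit", "integration", "genai"}


def tier_for(path: Path) -> Optional[str]:
    """Return the tier of the deepest matching directory below tests/."""
    for part in reversed(path.parent.parts):
        if part == "tests":
            break  # do not look above the tests root
        if part in TIERS:
            return part
    return None


def pytest_collection_modifyitems(config, items):
    """pytest hook: tag each collected test with the marker for its directory."""
    for item in items:
        tier = tier_for(Path(item.path))
        if tier is not None:
            item.add_marker(tier)  # add_marker accepts a marker name as a string
```

With a hook like this, `tests/regression/smoke/test_x.py` is selected by `-m smoke`, and a file under `tests/archived/` receives no marker. Registering the marker names in the pytest configuration avoids unknown-marker warnings.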

GenAI Test Infrastructure

`tests/genai/conftest.py` provides two fixtures:

- `genai`: Gemini Flash via OpenRouter (cheap, fast)
- `genai_smart`: Haiku 4.5 via OpenRouter (complex reasoning)

Requires: `OPENROUTER_API_KEY` env var + `--genai` pytest flag

Cost: ~$0.02 per full run with 24h response caching
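The 24h response cache could work roughly like this. The sketch assumes a file-per-prompt cache keyed by a SHA-256 hash; `cached_call`, the cache directory layout, and the injected `call` function (standing in for the OpenRouter request) are illustrative, not the project's actual client.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours


def cached_call(prompt: str, call: Callable[[str], dict], cache_dir: Path,
                now: Optional[float] = None) -> dict:
    """Return a cached response for `prompt` if it is under 24h old, else call the model."""
    now = time.time() if now is None else now
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    entry = cache_dir / f"{key}.json"
    if entry.exists():
        saved = json.loads(entry.read_text(encoding="utf-8"))
        if now - saved["stored_at"] < CACHE_TTL_SECONDS:
            return saved["response"]
    # Cache miss or stale entry: make the (paid) call and store the result.
    response = call(prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps({"stored_at": now, "response": response}),
                     encoding="utf-8")
    return response
```

Because the key is derived from the full prompt, any change to the files under test changes the prompt and bypasses the stale entry.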


**Scaffold for any repo**: `/scaffold-genai-uat` generates the full `tests/genai/` setup with portable client, universal tests, and project-specific congruence tests auto-discovered by GenAI.

---


What to Test vs What Not To

| Test This | With This | Not This |
|---|---|---|
| Pure Python functions | Unit tests | - |
| Component interactions | Integration tests | - |
| Doc ↔ code alignment | GenAI congruence | Hardcoded string matching |
| Component existence | Structural (glob) | Hardcoded counts |
| FORBIDDEN list sync | GenAI congruence | Manual comparison |
| Security posture | GenAI judge | Regex scanning |
| Config structure | Structural | Config values |
| Agent output quality | GenAI judge | Output string matching |

Hard Rules

- 100% pass rate required: ALL tests must pass, 0 failures. Coverage targets are separate.
- Tests before implementation: write failing tests, then implement.
- Regression test for every bug fix: named `test_regression_issue_NNN_description`.
- Having no test is better than having a flaky test: if it fails randomly, fix or delete it.
- GenAI tests are opt-in: `--genai` flag required, no surprise API costs.
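The opt-in behaviour in the last rule can be sketched with two standard pytest hooks. This is an assumed implementation, not the project's conftest; note that pytest only picks up `pytest_addoption` from plugins or from a `conftest.py` at the tests root directory.

```python
def pytest_addoption(parser):
    """Register the --genai command-line flag (off by default)."""
    parser.addoption(
        "--genai",
        action="store_true",
        default=False,
        help="run LLM-as-judge tests (makes paid API calls)",
    )


def pytest_collection_modifyitems(config, items):
    """Skip every test marked `genai` unless --genai was passed."""
    if config.getoption("--genai"):
        return
    for item in items:
        if item.get_closest_marker("genai") is not None:
            # add_marker accepts a marker name; "skip" is pytest's built-in skip marker.
            item.add_marker("skip")
```

A plain `pytest` run then reports the GenAI tests as skipped, and no API request is made unless the flag is given explicitly.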
