multi-ai-verification
Multi-AI Verification
Overview
multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.
Purpose: Multi-layer independent verification ensuring production-ready quality
Pattern: Task-based (5 independent verification operations, one per layer)
Key Innovation: 5-layer pyramid (95% automated at base → 0% at apex) with independent verification preventing bias and test gaming
Core Principles (validated by tri-AI research):
- Multi-Layer Defense - 5 layers catch different types of issues
- Independent Verification - Separate agent from implementation/testing
- Progressive Automation - Automate what can be automated (95% → 0%)
- Quality Scoring - Objective 0-100 scoring with ≥90 threshold
- Actionable Feedback - 100% of feedback is specific and actionable (What/Where/Why/How/Priority)
Quality Gates: All 5 layers must pass for production approval
When to Use
Use multi-ai-verification when:
- Final quality check before commit/deployment
- Independent code review (preventing bias)
- Security verification (OWASP, vulnerabilities)
- Comprehensive QA (all layers)
- Test quality verification (prevent gaming)
- Production readiness validation
Prerequisites
Required
- Code to verify (implementation complete)
- Tests available (for functional verification)
- Quality standards defined
Recommended
- multi-ai-testing - For generating/running tests
- multi-ai-implementation - For implementing fixes
Tools Available
- Linters (ESLint, Pylint)
- Type checkers (TypeScript, mypy)
- Coverage tools (c8, pytest-cov)
- Security scanners (Semgrep, Bandit)
- Test frameworks (Jest, pytest)
The 5-Layer Verification Pyramid
```
            /\
           /  \      Layer 5: Quality Scoring (LLM-as-Judge, 0-20% automated)
          /----\
         /      \    Layer 4: Integration (E2E, System, 20-30% automated)
        /--------\
       /          \  Layer 3: Visual (UI, Screenshots, 30-50% automated)
      /------------\
     /              \  Layer 2: Functional (Tests, Coverage, 60-80% automated)
    /----------------\
   /                  \  Layer 1: Rules-Based (Linting, Types, Schema, 95% automated)
  /--------------------\
```

Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
Verification Operations
Operation 1: Rules-Based Verification (Layer 1)
Purpose: Automated validation of code structure, formatting, types
Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)
Process:

- Schema Validation (if applicable):

```bash
# Validate JSON/YAML against schemas
ajv validate -s plan.schema.json -d plan.json
ajv validate -s task.schema.json -d tasks/*.json
```

- Linting:

```bash
# JavaScript/TypeScript
npx eslint src/**/*.{ts,tsx,js,jsx}

# Python
pylint src/**/*.py

# Expected: Zero linting errors
```

- Type Checking:

```bash
# TypeScript
npx tsc --noEmit

# Python
mypy src/

# Expected: Zero type errors
```

- Format Validation:

```bash
# Check formatting
npx prettier --check src/**/*.{ts,tsx}

# Or auto-fix
npx prettier --write src/**/*.{ts,tsx}
```

- Security Scanning (SAST):

```bash
# Static security analysis
npx semgrep --config=auto src/

# Or for Python
bandit -r src/

# Check for:
# - Hardcoded secrets
# - SQL injection risks
# - XSS vulnerabilities
# - Insecure dependencies
```

- Generate Layer 1 Report:

```markdown
# Layer 1: Rules-Based Verification

## Schema Validation
✅ plan.json validates
✅ All task files validate

## Linting
✅ 0 linting errors
⚠️ 3 warnings (non-blocking)

## Type Checking
✅ 0 type errors

## Formatting
✅ All files formatted correctly

## Security Scan (SAST)
✅ No critical vulnerabilities
⚠️ 1 medium: Weak password hashing rounds (bcrypt)

**Layer 1 Status**: ✅ PASS (0 critical issues)
**Issues to Address**: 1 medium security issue
```
Outputs:
- Lint report (errors/warnings)
- Type check results
- Schema validation results
- Security scan findings
- Layer 1 status (PASS/FAIL)
Validation:
- All automated checks run
- Results documented
- Critical issues = 0 for PASS
- Actionable feedback for warnings
Time Estimate: 15-30 minutes (mostly automated)
Gate 1: ✅ PASS if no critical issues (warnings acceptable)
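Gate 1's decision rule (critical issues block, warnings don't) can be sketched as a small helper. The findings shape below is a hypothetical aggregate of the lint, type, schema, and SAST results, not part of the skill's API:

```typescript
// Hypothetical severity counts aggregated from linting, type checking,
// schema validation, and the SAST scan.
type Layer1Findings = { critical: number; medium: number; warnings: number };

// Gate 1: PASS iff there are zero critical issues; warnings are non-blocking.
function gate1Status(f: Layer1Findings): "PASS" | "FAIL" {
  return f.critical === 0 ? "PASS" : "FAIL";
}

console.log(gate1Status({ critical: 0, medium: 1, warnings: 3 })); // PASS
console.log(gate1Status({ critical: 2, medium: 0, warnings: 0 })); // FAIL
```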
Operation 2: Functional Verification (Layer 2)
Purpose: Validate functionality through test execution and coverage
Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)
Process:

- Execute Complete Test Suite:

```bash
# Run all tests with coverage
npm test -- --coverage --verbose

# Capture results:
# - Tests passed/failed
# - Coverage metrics
# - Execution time
```

- Validate Example Code (from documentation):

```bash
# Extract examples from SKILL.md
# Execute each example automatically
# Verify outputs match expected
# Target: ≥90% examples work
```

- Check Coverage:

```markdown
# Coverage Report

**Line Coverage**: 87% ✅ (gate: ≥80%)
**Branch Coverage**: 82% ✅
**Function Coverage**: 92% ✅
**Path Coverage**: 74% ⚠️ (informational, not gated)

**Gate Status**: PASS ✅ (line/branch/function all ≥80%)

**Uncovered Code**:
- src/admin/legacy.ts: 23% (low priority)
- src/utils/deprecated.ts: 15% (deprecated, ok)
```

- Regression Testing (for updates):

```bash
# Compare before/after
git diff main...feature --stat

# Run all tests
npm test

# Verify: No new failures (regression prevention)
```

- Performance Validation:

```bash
# Run performance tests
npm run test:performance

# Check response times
# Verify: Within acceptable ranges
```

- Generate Layer 2 Report:

```markdown
# Layer 2: Functional Verification

## Test Execution
✅ 245/245 tests passing (100%)
⏱️ Execution time: 8.3 seconds

## Coverage
✅ Line: 87% (gate: ≥80%)
✅ Branch: 82%
✅ Function: 92%

## Example Validation
✅ 18/20 examples work (90%)
❌ 2 examples fail (outdated)

## Regression
✅ All existing tests still pass

## Performance
✅ All endpoints <200ms

**Layer 2 Status**: ✅ PASS
**Issues**: 2 outdated examples (update docs)
```
Outputs:
- Test execution results
- Coverage report
- Example validation results
- Regression check
- Performance metrics
- Layer 2 status
Validation:
- All tests executed
- Coverage meets gate (≥80%)
- Examples validated (≥90%)
- No regressions
- Performance acceptable
Time Estimate: 30-60 minutes
Gate 2: ✅ PASS if tests pass + coverage ≥80%
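Gate 2's rule can be expressed directly. The metric names mirror the coverage report above; the ≥80% threshold is the one the skill defines:

```typescript
// Coverage metrics as percentages, mirroring the Layer 2 report.
interface Coverage { line: number; branch: number; function: number }

// Gate 2 passes when every gated metric meets the ≥80% threshold
// and the whole test suite is green.
function gate2Status(allTestsPass: boolean, c: Coverage): "PASS" | "FAIL" {
  const covered = [c.line, c.branch, c.function].every(pct => pct >= 80);
  return allTestsPass && covered ? "PASS" : "FAIL";
}

console.log(gate2Status(true, { line: 87, branch: 82, function: 92 })); // PASS
console.log(gate2Status(true, { line: 87, branch: 76, function: 92 })); // FAIL
```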
Operation 3: Visual Verification (Layer 3)
Purpose: Validate UI appearance, layout, accessibility (for UI features)
Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)
Process:

- Screenshot Generation:

```bash
# Generate screenshots of UI
npx playwright test --screenshot=on

# Or manually:
# Open application
# Capture screenshots of key views
```

- Visual Comparison (if previous version exists):

```bash
# Compare against baseline
npx playwright test --update-snapshots=missing

# Or use Percy/Chromatic for visual regression
npx percy snapshot screenshots/
```

- Layout Validation:

```markdown
# Visual Checklist

## Layout
- [ ] Components positioned correctly
- [ ] Spacing/margins match mockup
- [ ] Alignment proper
- [ ] No overlapping elements

## Styling
- [ ] Colors match design system
- [ ] Typography correct (fonts, sizes)
- [ ] Icons/images display properly

## Responsiveness
- [ ] Mobile view (320px-480px): ✅
- [ ] Tablet view (768px-1024px): ✅
- [ ] Desktop view (>1024px): ✅
```

- Accessibility Testing:

```bash
# Automated accessibility scan
npx @axe-core/cli http://localhost:3000

# Check WCAG compliance
npx pa11y http://localhost:3000

# Manual checks:
# - Keyboard navigation
# - Screen reader compatibility
# - Color contrast ratios
```

- Generate Layer 3 Report:

```markdown
# Layer 3: Visual Verification

## Screenshot Comparison
✅ Login page matches mockup
✅ Dashboard layout correct
⚠️ Profile page: Avatar alignment off by 5px

## Responsiveness
✅ Mobile: All components visible
✅ Tablet: Layout adapts correctly
✅ Desktop: Full functionality

## Accessibility
✅ WCAG 2.1 AA compliance
✅ Keyboard navigation works
⚠️ 2 color contrast warnings (non-critical)

**Layer 3 Status**: ✅ PASS (minor issues acceptable)
**Issues**: Avatar alignment (cosmetic), contrast warnings
```
Outputs:
- Screenshots of UI
- Visual comparison results
- Responsiveness validation
- Accessibility report
- Layer 3 status
Validation:
- Screenshots captured
- Visual comparison done (if applicable)
- Layout validated
- Responsiveness tested
- Accessibility checked
- No critical visual issues
Time Estimate: 30-90 minutes (skip if no UI)
Gate 3: ✅ PASS if no critical visual/a11y issues
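The color-contrast portion of the manual checks is mechanical enough to automate without any scanner. This sketch implements the WCAG 2.1 relative-luminance and contrast-ratio formulas (AA requires ≥4.5:1 for normal text):

```typescript
// WCAG 2.1 relative luminance of an sRGB color (channels 0-255).
function luminance([r, g, b]: number[]): number {
  const lin = (v: number) => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Contrast ratio between two colors; WCAG AA needs ≥4.5 for normal text.
function contrastRatio(a: number[], b: number[]): number {
  const [hi, lo] = [luminance(a), luminance(b)].sort((x, y) => y - x);
  return (hi + 0.05) / (lo + 0.05);
}

console.log(contrastRatio([0, 0, 0], [255, 255, 255])); // ≈ 21 (maximum contrast)
```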
Operation 4: Integration Verification (Layer 4)
Purpose: Validate system-level integration, data flow, API compatibility
Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High
Process:

- Component Integration Tests:

```bash
# Run integration test suite
npm test -- tests/integration/

# Verify components work together:
# - Database ←→ API
# - API ←→ Frontend
# - Frontend ←→ User
```

- Data Flow Validation:

```markdown
# Data Flow Verification

**Flow 1: User Registration**
Frontend form → API endpoint → Validation → Database → Email service
✅ Data flows correctly
✅ No data loss
✅ Transactions atomic

**Flow 2: Authentication**
Login request → API → Database lookup → Token generation → Response
✅ Token generated correctly
✅ Session stored
✅ Response includes token
```

- API Integration Tests:

```bash
# Test all API endpoints
npm run test:api

# Verify:
# - All endpoints respond
# - Status codes correct
# - Response formats match spec
# - Error handling works
```

- End-to-End Workflow Tests:

```typescript
// Complete user journeys
test('Complete registration and login flow', async () => {
  // 1. Register new user
  const registerResponse = await api.post('/register', userData);
  expect(registerResponse.status).toBe(201);

  // 2. Confirm email
  const confirmResponse = await api.get(confirmLink);
  expect(confirmResponse.status).toBe(200);

  // 3. Login
  const loginResponse = await api.post('/login', credentials);
  expect(loginResponse.status).toBe(200);
  expect(loginResponse.data.token).toBeDefined();

  // 4. Access protected resource
  const profileResponse = await api.get('/profile', {
    headers: { Authorization: `Bearer ${loginResponse.data.token}` }
  });
  expect(profileResponse.status).toBe(200);
});
```

- Dependency Compatibility:

```bash
# Check external dependencies work
npm audit

# Check for breaking changes
npm outdated

# Verify integration with services:
# - Database connection
# - Redis/cache
# - External APIs
```

- Generate Layer 4 Report:

```markdown
# Layer 4: Integration Verification

## Component Integration
✅ 12/12 integration tests passing
✅ All components integrate correctly

## Data Flow
✅ All 5 data flows validated
✅ No data loss or corruption

## API Integration
✅ All 15 endpoints functional
✅ Response formats correct
✅ Error handling works

## E2E Workflows
✅ 8/8 user journeys complete successfully
✅ No workflow breaks

## Dependencies
✅ 0 critical vulnerabilities
⚠️ 2 moderate (non-blocking)

**Layer 4 Status**: ✅ PASS
```
Outputs:
- Integration test results
- Data flow validation
- API compatibility report
- E2E workflow results
- Dependency audit
- Layer 4 status
Validation:
- Integration tests pass
- Data flows validated
- APIs integrate correctly
- E2E workflows function
- Dependencies secure
Time Estimate: 45-90 minutes
Gate 4: ✅ PASS if all integration tests pass, no critical dependencies
Operation 5: Quality Scoring (Layer 5)
Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns
Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)
Process:

- Spawn Independent Quality Assessor (Agent-as-a-Judge):

Key: Use a different model family if possible (prevents self-preference bias).

```typescript
const qualityAssessment = await task({
  description: "Assess code quality holistically",
  prompt: `Evaluate code quality in src/ and tests/.
DO NOT read implementation conversation history.

You have access to tools:
- Read files
- Execute tests
- Run linters
- Query database (if needed)

Assess 5 dimensions (score each /20):

1. CORRECTNESS (/20):
   - Logic correctness
   - Edge case handling
   - Error handling completeness
   - Security considerations

2. FUNCTIONALITY (/20):
   - Meets all requirements
   - User workflows work
   - Performance acceptable
   - No regressions

3. QUALITY (/20):
   - Code maintainability
   - Best practices followed
   - Anti-patterns avoided
   - Documentation complete

4. INTEGRATION (/20):
   - Components integrate smoothly
   - API contracts correct
   - Data flow works
   - Backward compatible

5. SECURITY (/20):
   - No vulnerabilities
   - Input validation
   - Authentication/authorization
   - Data protection

TOTAL: /100 (sum of 5 dimensions)

For each dimension, provide:
- Score (/20)
- Strengths (what's good)
- Weaknesses (what needs improvement)
- Evidence (file:line references)
- Recommendations (specific, actionable)

Write comprehensive report to: quality-assessment.md`
});
```

- Multi-Agent Ensemble (for critical features):

3-5 Agent Voting Committee:

```typescript
// Spawn 3 independent quality assessors
const [judge1, judge2, judge3] = await Promise.all([
  task({ description: "Quality Judge 1", prompt: assessmentPrompt }),
  task({ description: "Quality Judge 2", prompt: assessmentPrompt }),
  task({ description: "Quality Judge 3", prompt: assessmentPrompt })
]);

// Aggregate per-dimension scores (median is a numeric helper)
const scores = {
  correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
  functionality: median([judge1.functionality, judge2.functionality, judge3.functionality]),
  quality: median([judge1.quality, judge2.quality, judge3.quality]),
  integration: median([judge1.integration, judge2.integration, judge3.integration]),
  security: median([judge1.security, judge2.security, judge3.security])
};
const totalScore = Object.values(scores).reduce((a, b) => a + b, 0); // Total /100

// Check variance across judges
const totalScores = [judge1.total, judge2.total, judge3.total];
const variance = Math.max(...totalScores) - Math.min(...totalScores);
if (variance > 15) {
  // High disagreement → spawn 2 more judges (total 5)
  // Use 5-agent ensemble for final score
}
// Final score: median of 3 or 5
```

- Calibration Against Rubric:

```markdown
# Scoring Calibration

## Correctness: 18/20 (Excellent)
**20**: Zero errors, all edge cases handled perfectly
**18**: Minor edge case missing, otherwise excellent ✅ (achieved)
**15**: 1-2 significant edge cases missing
**10**: Some logic errors present
**0**: Major functionality broken

**Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)

## Functionality: 19/20 (Excellent)
[Similar rubric with evidence]

## Quality: 17/20 (Good)
[Similar rubric with evidence]

## Integration: 18/20 (Excellent)
[Similar rubric with evidence]

## Security: 16/20 (Good)
[Similar rubric with evidence]

**Total**: 88/100 ⚠️ (Below ≥90 gate)
```

- Gap Analysis (if <90):

```markdown
# Quality Gap Analysis

**Current Score**: 88/100
**Target**: ≥90/100
**Gap**: 2 points

## Critical Gaps (Blocking Approval)
None

## High Priority (Should Fix for ≥90)
1. **Security: Weak bcrypt rounds**
   - **What**: bcrypt using 10 rounds (outdated)
   - **Where**: src/auth/hash.ts:15
   - **Why**: Current standard is 12-14 rounds
   - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
   - **Priority**: High
   - **Impact**: +2 points → 90/100

## Medium Priority
1. **Quality: Missing JSDoc for 3 functions**
   - Impact: +1 point → 91/100

**Recommendation**: Fix high priority issue to reach ≥90 threshold
**Estimated Effort**: 15 minutes
```

- Generate Comprehensive Quality Report:

```markdown
# Layer 5: Quality Scoring Report

## Executive Summary
**Total Score**: 88/100 ⚠️ (Below ≥90 gate)
**Status**: NEEDS MINOR REVISION

## Dimension Scores
- Correctness: 18/20 ⭐⭐⭐⭐⭐
- Functionality: 19/20 ⭐⭐⭐⭐⭐
- Quality: 17/20 ⭐⭐⭐⭐
- Integration: 18/20 ⭐⭐⭐⭐⭐
- Security: 16/20 ⭐⭐⭐⭐

## Strengths
1. Comprehensive test coverage (87%)
2. All functionality working correctly
3. Clean integration with all components
4. Good error handling

## Weaknesses
1. Bcrypt rounds below current standard (security)
2. Missing documentation for helper functions (quality)
3. One timezone edge case not handled (correctness)

## Recommendations (Prioritized)

### Priority 1 (High - Needed for ≥90)
1. Increase bcrypt rounds: 10 → 12
   - File: src/auth/hash.ts:15
   - Effort: 5 min
   - Impact: +2 points

### Priority 2 (Medium - Nice to Have)
1. Add JSDoc to helper functions
   - Files: src/utils/validation.ts
   - Effort: 30 min
   - Impact: +1 point
2. Handle timezone DST edge case
   - File: src/auth/tokens.ts:78
   - Effort: 20 min
   - Impact: +1 point

**Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
```
Outputs:
- Quality score (0-100) with dimension breakdown
- Calibrated against rubric
- Gap analysis
- Prioritized recommendations (Critical/High/Medium/Low)
- Evidence-based feedback (file:line references)
- Action plan to reach ≥90
Validation:
- All 5 dimensions scored
- Scores calibrated against rubric
- Evidence provided for each score
- Gap analysis if <90
- Recommendations actionable
- Ensemble used for critical features (optional)
Time Estimate: 60-120 minutes (ensemble adds 30-60 min)
Gate 5: ✅ PASS if total score ≥90/100
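The `median` helper and the variance-based escalation rule used in the ensemble step can be made concrete. This is a minimal sketch; the >15-point disagreement threshold is the one defined in the process above, and totals are assumed to already be on the 0-100 scale:

```typescript
// Median of a score list (handles odd and even lengths).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Decide whether a 3-judge panel must grow to 5:
// escalate when the max-min spread of total scores exceeds 15 points.
function needsEscalation(totals: number[]): boolean {
  return Math.max(...totals) - Math.min(...totals) > 15;
}

console.log(median([88, 91, 90]));          // 90
console.log(needsEscalation([88, 91, 90])); // false (spread 3)
console.log(needsEscalation([72, 91, 90])); // true (spread 19)
```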
Quality Gates Summary
All 5 Gates Must Pass for production approval:
Gate 1: Rules Pass ✅
↓ (Linting, types, schema, security)
Gate 2: Tests Pass ✅
↓ (All tests, coverage ≥80%)
Gate 3: Visual OK ✅
↓ (UI validated, a11y checked)
Gate 4: Integration OK ✅
↓ (E2E works, APIs integrate)
Gate 5: Quality ≥90 ✅
↓ (LLM-as-judge score ≥90/100)
✅ PRODUCTION APPROVED

If Any Gate Fails:
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
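The fail-fast gate sequence can be sketched as a small driver; the `Gate` shape and runner signature here are illustrative assumptions, not part of the skill:

```typescript
// A gate runs its layer's checks and reports pass/fail plus feedback.
type GateResult = { pass: boolean; feedback: string[] };
type Gate = { name: string; run: () => Promise<GateResult> };

// Run gates 1→5 in order, stopping at the first failure (fail fast:
// cheap automated layers run before the expensive LLM-as-judge layer).
async function verify(
  gates: Gate[]
): Promise<{ approved: boolean; failedGate?: string; feedback: string[] }> {
  for (const gate of gates) {
    const result = await gate.run();
    if (!result.pass) {
      // Failed gate → caller performs gap analysis, applies fixes, re-verifies
      return { approved: false, failedGate: gate.name, feedback: result.feedback };
    }
  }
  return { approved: true, feedback: [] }; // PRODUCTION APPROVED
}
```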
Appendix A: Independence Protocol
How Verification Independence is Maintained
Verification Agent Spawning:

```typescript
// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.
DO NOT read prior conversation history.

Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md

Verify against specifications ONLY (not implementation decisions).

Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks

Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});
```

Bias Prevention Checklist:
- Specifications written BEFORE implementation
- Verification agent prompt has no implementation context
- Agent evaluates against specs, not what code does
- Fresh context (via Task tool)
- Different model family used (if possible)
Validation of Independence:
markdown
undefined验证Agent生成方式:
typescript
// After implementation and testing complete
const verification = await task({
description: "Independent quality verification",
prompt: `Verify code quality independently.
DO NOT read prior conversation history.
Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md
Verify against specifications ONLY (not implementation decisions).
Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks
Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});
偏见预防检查清单:
- 需求规格书在实现前已编写
- 验证Agent的提示语不含实现上下文
- Agent仅基于规格书评估,而非代码实际实现
- 使用全新上下文(通过Task工具)
- 尽可能使用不同模型家族
独立性验证:
markdown
Independence Audit
Independence Audit
Expected Behavior:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
Warning Signs:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
If Warning: Re-verify with stronger independence prompt
---
Expected Behavior:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
Warning Signs:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
If Warning: Re-verify with stronger independence prompt
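The audit signals above lend themselves to a simple automated check. A minimal sketch, assuming a hypothetical `VerificationReport` shape (the field names are illustrative, not part of this skill):

```typescript
// Illustrative independence audit: flags likely rubber-stamp verification
// reports using the warning signs listed above. The report shape is an
// assumption for this sketch.
interface VerificationReport {
  issuesFound: number;        // issues the verifier reported
  citesSpecifications: boolean; // does the report reference the specs?
  toolInvocations: number;    // lint runs, test runs, file reads
}

function auditIndependence(report: VerificationReport): string[] {
  const warnings: string[] = [];
  if (report.issuesFound === 0) {
    warnings.push("0 issues found: possible rubber stamp");
  }
  if (report.toolInvocations === 0) {
    warnings.push("no tool use: claims were not verified");
  }
  if (!report.citesSpecifications) {
    warnings.push("no spec references: may be parroting implementation");
  }
  return warnings; // non-empty → re-verify with a stronger independence prompt
}
```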
---
Appendix B: Operational Scoring Rubrics
附录B:操作评分标准
Complete Rubrics for All 5 Dimensions
所有5个维度的完整评分标准
Correctness (/20)
Correctness (/20)
20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken
20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken
Functionality (/20)
Functionality (/20)
20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing
20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing
Quality (/20)
Quality (/20)
20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code
20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code
Integration (/20)
Integration (/20)
20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate
20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate
Security (/20)
Security (/20)
20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities
20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities
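Taken together, the five /20 dimensions sum to the 0-100 quality score. A minimal sketch of the aggregation and verdict mapping (type and function names are illustrative assumptions; the thresholds follow this skill's quality bands):

```typescript
// Aggregates the five /20 dimension scores from Appendix B into a
// 0-100 total, then maps it to the skill's quality-threshold verdicts.
interface DimensionScores {
  correctness: number;   // /20
  functionality: number; // /20
  quality: number;       // /20
  integration: number;   // /20
  security: number;      // /20
}

function totalScore(s: DimensionScores): number {
  return s.correctness + s.functionality + s.quality + s.integration + s.security;
}

function verdict(total: number): string {
  if (total >= 90) return "Excellent (production-ready)";
  if (total >= 80) return "Good (needs minor improvements)";
  if (total >= 70) return "Acceptable (needs work before production)";
  return "Poor (significant rework required)";
}
```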
Appendix C: Technical Foundation
附录C:技术基础
Verification Tools
验证工具
Linting:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)
Type Checking:
- TypeScript compiler (tsc)
- mypy (Python)
Security (SAST):
- Semgrep (multi-language)
- Bandit (Python)
- npm audit (JavaScript)
Visual Testing:
- Playwright (screenshot, visual regression)
- Percy/Chromatic (visual diff)
- axe-core (accessibility)
Coverage:
- c8/nyc (JavaScript)
- pytest-cov (Python)
代码检查:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)
类型检查:
- TypeScript compiler (tsc)
- mypy (Python)
安全扫描(SAST):
- Semgrep (多语言)
- Bandit (Python)
- npm audit (JavaScript)
视觉测试:
- Playwright (截图、视觉回归)
- Percy/Chromatic (视觉对比)
- axe-core (可访问性)
覆盖率工具:
- c8/nyc (JavaScript)
- pytest-cov (Python)
Cost Controls
成本控制
Budget Caps:
- LLM-as-judge: $50/month
- Ensemble verification: $20/month
- Total verification: $70/month
Optimization:
- Cache quality scores for 24h (same code → same score)
- Skip Layer 5 for changes <50 lines
- Use ensemble (3-5 agents) only for critical features
- Use cheaper models for pre-filtering (Haiku for Layer 1-2)
预算上限:
- LLM-as-judge: $50/月
- 集成验证: $20/月
- 总验证成本: $70/月
优化措施:
- 质量评分缓存24小时(相同代码→相同评分)
- 代码变更<50行时跳过第5层验证
- 仅对核心功能使用集成评估(3-5个Agent)
- 预过滤使用低成本模型(Layer1-2使用Haiku)
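The 24h score cache and the Layer-5 skip rule above can be sketched as follows (the hash-based cache key and helper names are assumptions for illustration, not a prescribed implementation):

```typescript
// Cost-control sketch: cache quality scores for 24h keyed by a hash of
// the code (same code → same score), and skip Layer 5 (LLM-as-judge)
// for changes under 50 lines.
import { createHash } from "node:crypto";

const CACHE_TTL_MS = 24 * 60 * 60 * 1000;  // cache quality scores for 24h
const LAYER5_MIN_CHANGED_LINES = 50;       // skip LLM-as-judge below this

const scoreCache = new Map<string, { score: number; at: number }>();

function codeKey(code: string): string {
  return createHash("sha256").update(code).digest("hex");
}

function shouldRunLayer5(changedLines: number): boolean {
  return changedLines >= LAYER5_MIN_CHANGED_LINES;
}

function cacheScore(code: string, score: number, now = Date.now()): void {
  scoreCache.set(codeKey(code), { score, at: now });
}

function getCachedScore(code: string, now = Date.now()): number | undefined {
  const entry = scoreCache.get(codeKey(code));
  if (entry && now - entry.at < CACHE_TTL_MS) return entry.score;
  return undefined; // expired or never scored → re-run the judge
}
```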
Quick Reference
快速参考
The 5 Layers
5层验证体系
| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
Total: 3-6 hours for complete 5-layer verification
| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
Total: 3-6 hours for complete 5-layer verification
Quality Thresholds
质量阈值
- ≥90: ✅ Excellent (production-ready)
- 80-89: ⚠️ Good (needs minor improvements)
- 70-79: ❌ Acceptable (needs work before production)
- <70: ❌ Poor (significant rework required)
- ≥90: ✅ Excellent (production-ready)
- 80-89: ⚠️ Good (needs minor improvements)
- 70-79: ❌ Acceptable (needs work before production)
- <70: ❌ Poor (significant rework required)
Gates
门禁要求
All 5 Must Pass:
- Rules pass (no critical lint/type/security)
- Tests pass + coverage ≥80%
- Visual OK (no critical UI issues)
- Integration OK (E2E works)
- Quality ≥90/100
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.
All 5 Must Pass:
- Rules pass (no critical lint/type/security)
- Tests pass + coverage ≥80%
- Visual OK (no critical UI issues)
- Integration OK (E2E works)
- Quality ≥90/100
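The all-five-must-pass rule amounts to a short-circuiting sequential check; a minimal sketch (the gate representation is an assumption, and the per-layer checks are placeholders for the real verifications):

```typescript
// Sequential gate evaluation: all five gates must pass for production
// approval; the first failure triggers gap analysis and re-verification.
type Gate = { name: string; passed: boolean };

function evaluateGates(gates: Gate[]): { approved: boolean; failedAt?: string } {
  for (const gate of gates) {
    if (!gate.passed) {
      // Failed gate → gap analysis → apply fixes → re-verify
      return { approved: false, failedAt: gate.name };
    }
  }
  return { approved: true }; // ✅ PRODUCTION APPROVED
}
```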
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.