multi-ai-verification
Multi-AI Verification
Overview
multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.
Purpose: Multi-layer independent verification ensuring production-ready quality
Pattern: Task-based (5 independent verification operations, one per layer)
Key Innovation: 5-layer pyramid (95% automated at base → 0% at apex) with independent verification preventing bias and test gaming
Core Principles (validated by tri-AI research):
- Multi-Layer Defense - 5 layers catch different types of issues
- Independent Verification - Separate agent from implementation/testing
- Progressive Automation - Automate what can be automated (95% → 0%)
- Quality Scoring - Objective 0-100 scoring with ≥90 threshold
- Actionable Feedback - 100% of feedback is specific and actionable (What/Where/Why/How/Priority)
Quality Gates: All 5 layers must pass for production approval
When to Use
Use multi-ai-verification when:
- Final quality check before commit/deployment
- Independent code review (preventing bias)
- Security verification (OWASP, vulnerabilities)
- Comprehensive QA (all layers)
- Test quality verification (prevent gaming)
- Production readiness validation
Prerequisites
Required
- Code to verify (implementation complete)
- Tests available (for functional verification)
- Quality standards defined
Recommended
- multi-ai-testing - For generating/running tests
- multi-ai-implementation - For implementing fixes
Tools Available
- Linters (ESLint, Pylint)
- Type checkers (TypeScript, mypy)
- Coverage tools (c8, pytest-cov)
- Security scanners (Semgrep, Bandit)
- Test frameworks (Jest, pytest)
The 5-Layer Verification Pyramid
```
            /\
           /  \      Layer 5: Quality Scoring (LLM-as-Judge, 0-20% automated)
          /----\
         /      \    Layer 4: Integration (E2E, System, 20-30% automated)
        /--------\
       /          \  Layer 3: Visual (UI, Screenshots, 30-50% automated)
      /------------\
     /              \  Layer 2: Functional (Tests, Coverage, 60-80% automated)
    /----------------\
   /                  \  Layer 1: Rules-Based (Linting, Types, Schema, 95% automated)
  /--------------------\
```

Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
Verification Operations
Operation 1: Rules-Based Verification (Layer 1)
Purpose: Automated validation of code structure, formatting, types
Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)
Process:

- Schema Validation (if applicable):

```bash
# Validate JSON/YAML against schemas
ajv validate -s plan.schema.json -d plan.json
ajv validate -s task.schema.json -d tasks/*.json
```

- Linting:

```bash
# JavaScript/TypeScript
npx eslint src/**/*.{ts,tsx,js,jsx}

# Python
pylint src/**/*.py

# Expected: Zero linting errors
```

- Type Checking:

```bash
# TypeScript
npx tsc --noEmit

# Python
mypy src/

# Expected: Zero type errors
```

- Format Validation:

```bash
# Check formatting
npx prettier --check src/**/*.{ts,tsx}

# Or auto-fix
npx prettier --write src/**/*.{ts,tsx}
```

- Security Scanning (SAST):

```bash
# Static security analysis
npx semgrep --config=auto src/

# Or for Python
bandit -r src/

# Check for:
# - Hardcoded secrets
# - SQL injection risks
# - XSS vulnerabilities
# - Insecure dependencies
```

- Generate Layer 1 Report:

```markdown
# Layer 1: Rules-Based Verification

## Schema Validation
✅ plan.json validates
✅ All task files validate

## Linting
✅ 0 linting errors
⚠️ 3 warnings (non-blocking)

## Type Checking
✅ 0 type errors

## Formatting
✅ All files formatted correctly

## Security Scan (SAST)
✅ No critical vulnerabilities
⚠️ 1 medium: Weak password hashing rounds (bcrypt)

**Layer 1 Status**: ✅ PASS (0 critical issues)
**Issues to Address**: 1 medium security issue
```
Outputs:
- Lint report (errors/warnings)
- Type check results
- Schema validation results
- Security scan findings
- Layer 1 status (PASS/FAIL)
Validation:
- All automated checks run
- Results documented
- Critical issues = 0 for PASS
- Actionable feedback for warnings
Time Estimate: 15-30 minutes (mostly automated)
Gate 1: ✅ PASS if no critical issues (warnings acceptable)
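Gate 1's decision rule (critical issues block, warnings don't) can be sketched as a small helper. The findings shape below is a hypothetical aggregate of the lint, type, schema, and SAST results, not part of the skill's API:

```typescript
// Hypothetical severity counts aggregated from linting, type checking,
// schema validation, and the SAST scan.
type Layer1Findings = { critical: number; medium: number; warnings: number };

// Gate 1: PASS iff there are zero critical issues; warnings are non-blocking.
function gate1Status(f: Layer1Findings): "PASS" | "FAIL" {
  return f.critical === 0 ? "PASS" : "FAIL";
}

console.log(gate1Status({ critical: 0, medium: 1, warnings: 3 })); // PASS
console.log(gate1Status({ critical: 2, medium: 0, warnings: 0 })); // FAIL
```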
Operation 2: Functional Verification (Layer 2)
Purpose: Validate functionality through test execution and coverage
Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)
Process:

- Execute Complete Test Suite:

```bash
# Run all tests with coverage
npm test -- --coverage --verbose

# Capture results:
# - Tests passed/failed
# - Coverage metrics
# - Execution time
```

- Validate Example Code (from documentation):

```bash
# Extract examples from SKILL.md
# Execute each example automatically
# Verify outputs match expected
# Target: ≥90% examples work
```

- Check Coverage:

```markdown
# Coverage Report

**Line Coverage**: 87% ✅ (gate: ≥80%)
**Branch Coverage**: 82% ✅
**Function Coverage**: 92% ✅
**Path Coverage**: 74% ⚠️ (informational, not gated)

**Gate Status**: PASS ✅ (line/branch/function all ≥80%)

**Uncovered Code**:
- src/admin/legacy.ts: 23% (low priority)
- src/utils/deprecated.ts: 15% (deprecated, ok)
```

- Regression Testing (for updates):

```bash
# Compare before/after
git diff main...feature --stat

# Run all tests
npm test

# Verify: No new failures (regression prevention)
```

- Performance Validation:

```bash
# Run performance tests
npm run test:performance

# Check response times
# Verify: Within acceptable ranges
```

- Generate Layer 2 Report:

```markdown
# Layer 2: Functional Verification

## Test Execution
✅ 245/245 tests passing (100%)
⏱️ Execution time: 8.3 seconds

## Coverage
✅ Line: 87% (gate: ≥80%)
✅ Branch: 82%
✅ Function: 92%

## Example Validation
✅ 18/20 examples work (90%)
❌ 2 examples fail (outdated)

## Regression
✅ All existing tests still pass

## Performance
✅ All endpoints <200ms

**Layer 2 Status**: ✅ PASS
**Issues**: 2 outdated examples (update docs)
```
Outputs:
- Test execution results
- Coverage report
- Example validation results
- Regression check
- Performance metrics
- Layer 2 status
Validation:
- All tests executed
- Coverage meets gate (≥80%)
- Examples validated (≥90%)
- No regressions
- Performance acceptable
Time Estimate: 30-60 minutes
Gate 2: ✅ PASS if tests pass + coverage ≥80%
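Gate 2's rule can be expressed directly. The metric names mirror the coverage report above; the ≥80% threshold is the one the skill defines:

```typescript
// Coverage metrics as percentages, mirroring the Layer 2 report.
interface Coverage { line: number; branch: number; function: number }

// Gate 2 passes when every gated metric meets the ≥80% threshold
// and the whole test suite is green.
function gate2Status(allTestsPass: boolean, c: Coverage): "PASS" | "FAIL" {
  const covered = [c.line, c.branch, c.function].every(pct => pct >= 80);
  return allTestsPass && covered ? "PASS" : "FAIL";
}

console.log(gate2Status(true, { line: 87, branch: 82, function: 92 })); // PASS
console.log(gate2Status(true, { line: 87, branch: 76, function: 92 })); // FAIL
```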
Operation 3: Visual Verification (Layer 3)
Purpose: Validate UI appearance, layout, accessibility (for UI features)
Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)
Process:

- Screenshot Generation:

```bash
# Generate screenshots of UI
npx playwright test --screenshot=on

# Or manually:
# Open application
# Capture screenshots of key views
```

- Visual Comparison (if previous version exists):

```bash
# Compare against baseline
npx playwright test --update-snapshots=missing

# Or use Percy/Chromatic for visual regression
npx percy snapshot screenshots/
```

- Layout Validation:

```markdown
# Visual Checklist

## Layout
- [ ] Components positioned correctly
- [ ] Spacing/margins match mockup
- [ ] Alignment proper
- [ ] No overlapping elements

## Styling
- [ ] Colors match design system
- [ ] Typography correct (fonts, sizes)
- [ ] Icons/images display properly

## Responsiveness
- [ ] Mobile view (320px-480px): ✅
- [ ] Tablet view (768px-1024px): ✅
- [ ] Desktop view (>1024px): ✅
```

- Accessibility Testing:

```bash
# Automated accessibility scan
npx @axe-core/cli http://localhost:3000

# Check WCAG compliance
npx pa11y http://localhost:3000

# Manual checks:
# - Keyboard navigation
# - Screen reader compatibility
# - Color contrast ratios
```

- Generate Layer 3 Report:

```markdown
# Layer 3: Visual Verification

## Screenshot Comparison
✅ Login page matches mockup
✅ Dashboard layout correct
⚠️ Profile page: Avatar alignment off by 5px

## Responsiveness
✅ Mobile: All components visible
✅ Tablet: Layout adapts correctly
✅ Desktop: Full functionality

## Accessibility
✅ WCAG 2.1 AA compliance
✅ Keyboard navigation works
⚠️ 2 color contrast warnings (non-critical)

**Layer 3 Status**: ✅ PASS (minor issues acceptable)
**Issues**: Avatar alignment (cosmetic), contrast warnings
```
Outputs:
- Screenshots of UI
- Visual comparison results
- Responsiveness validation
- Accessibility report
- Layer 3 status
Validation:
- Screenshots captured
- Visual comparison done (if applicable)
- Layout validated
- Responsiveness tested
- Accessibility checked
- No critical visual issues
Time Estimate: 30-90 minutes (skip if no UI)
Gate 3: ✅ PASS if no critical visual/a11y issues
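The color-contrast portion of the manual checks is mechanical enough to automate without any scanner. This sketch implements the WCAG 2.1 relative-luminance and contrast-ratio formulas (AA requires ≥4.5:1 for normal text):

```typescript
// WCAG 2.1 relative luminance of an sRGB color (channels 0-255).
function luminance([r, g, b]: number[]): number {
  const lin = (v: number) => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Contrast ratio between two colors; WCAG AA needs ≥4.5 for normal text.
function contrastRatio(a: number[], b: number[]): number {
  const [hi, lo] = [luminance(a), luminance(b)].sort((x, y) => y - x);
  return (hi + 0.05) / (lo + 0.05);
}

console.log(contrastRatio([0, 0, 0], [255, 255, 255])); // ≈ 21 (maximum contrast)
```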
Operation 4: Integration Verification (Layer 4)
Purpose: Validate system-level integration, data flow, API compatibility
Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High
Process:

- Component Integration Tests:

```bash
# Run integration test suite
npm test -- tests/integration/

# Verify components work together:
# - Database ←→ API
# - API ←→ Frontend
# - Frontend ←→ User
```

- Data Flow Validation:

```markdown
# Data Flow Verification

**Flow 1: User Registration**
Frontend form → API endpoint → Validation → Database → Email service
✅ Data flows correctly
✅ No data loss
✅ Transactions atomic

**Flow 2: Authentication**
Login request → API → Database lookup → Token generation → Response
✅ Token generated correctly
✅ Session stored
✅ Response includes token
```

- API Integration Tests:

```bash
# Test all API endpoints
npm run test:api

# Verify:
# - All endpoints respond
# - Status codes correct
# - Response formats match spec
# - Error handling works
```

- End-to-End Workflow Tests:

```typescript
// Complete user journeys
test('Complete registration and login flow', async () => {
  // 1. Register new user
  const registerResponse = await api.post('/register', userData);
  expect(registerResponse.status).toBe(201);

  // 2. Confirm email
  const confirmResponse = await api.get(confirmLink);
  expect(confirmResponse.status).toBe(200);

  // 3. Login
  const loginResponse = await api.post('/login', credentials);
  expect(loginResponse.status).toBe(200);
  expect(loginResponse.data.token).toBeDefined();

  // 4. Access protected resource
  const profileResponse = await api.get('/profile', {
    headers: { Authorization: `Bearer ${loginResponse.data.token}` }
  });
  expect(profileResponse.status).toBe(200);
});
```

- Dependency Compatibility:

```bash
# Check external dependencies work
npm audit

# Check for breaking changes
npm outdated

# Verify integration with services:
# - Database connection
# - Redis/cache
# - External APIs
```

- Generate Layer 4 Report:

```markdown
# Layer 4: Integration Verification

## Component Integration
✅ 12/12 integration tests passing
✅ All components integrate correctly

## Data Flow
✅ All 5 data flows validated
✅ No data loss or corruption

## API Integration
✅ All 15 endpoints functional
✅ Response formats correct
✅ Error handling works

## E2E Workflows
✅ 8/8 user journeys complete successfully
✅ No workflow breaks

## Dependencies
✅ 0 critical vulnerabilities
⚠️ 2 moderate (non-blocking)

**Layer 4 Status**: ✅ PASS
```
Outputs:
- Integration test results
- Data flow validation
- API compatibility report
- E2E workflow results
- Dependency audit
- Layer 4 status
Validation:
- Integration tests pass
- Data flows validated
- APIs integrate correctly
- E2E workflows function
- Dependencies secure
Time Estimate: 45-90 minutes
Gate 4: ✅ PASS if all integration tests pass, no critical dependencies
Operation 5: Quality Scoring (Layer 5)
Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns
Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)
Process:

- Spawn Independent Quality Assessor (Agent-as-a-Judge):

Key: Use a different model family if possible (prevents self-preference bias).

```typescript
const qualityAssessment = await task({
  description: "Assess code quality holistically",
  prompt: `Evaluate code quality in src/ and tests/.
DO NOT read implementation conversation history.

You have access to tools:
- Read files
- Execute tests
- Run linters
- Query database (if needed)

Assess 5 dimensions (score each /20):

1. CORRECTNESS (/20):
   - Logic correctness
   - Edge case handling
   - Error handling completeness
   - Security considerations

2. FUNCTIONALITY (/20):
   - Meets all requirements
   - User workflows work
   - Performance acceptable
   - No regressions

3. QUALITY (/20):
   - Code maintainability
   - Best practices followed
   - Anti-patterns avoided
   - Documentation complete

4. INTEGRATION (/20):
   - Components integrate smoothly
   - API contracts correct
   - Data flow works
   - Backward compatible

5. SECURITY (/20):
   - No vulnerabilities
   - Input validation
   - Authentication/authorization
   - Data protection

TOTAL: /100 (sum of 5 dimensions)

For each dimension, provide:
- Score (/20)
- Strengths (what's good)
- Weaknesses (what needs improvement)
- Evidence (file:line references)
- Recommendations (specific, actionable)

Write comprehensive report to: quality-assessment.md`
});
```

- Multi-Agent Ensemble (for critical features):

3-5 Agent Voting Committee:

```typescript
// Spawn 3 independent quality assessors
const [judge1, judge2, judge3] = await Promise.all([
  task({ description: "Quality Judge 1", prompt: assessmentPrompt }),
  task({ description: "Quality Judge 2", prompt: assessmentPrompt }),
  task({ description: "Quality Judge 3", prompt: assessmentPrompt })
]);

// Aggregate per-dimension scores (median is a numeric helper)
const scores = {
  correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
  functionality: median([judge1.functionality, judge2.functionality, judge3.functionality]),
  quality: median([judge1.quality, judge2.quality, judge3.quality]),
  integration: median([judge1.integration, judge2.integration, judge3.integration]),
  security: median([judge1.security, judge2.security, judge3.security])
};
const totalScore = Object.values(scores).reduce((a, b) => a + b, 0); // Total /100

// Check variance across judges
const totalScores = [judge1.total, judge2.total, judge3.total];
const variance = Math.max(...totalScores) - Math.min(...totalScores);
if (variance > 15) {
  // High disagreement → spawn 2 more judges (total 5)
  // Use 5-agent ensemble for final score
}
// Final score: median of 3 or 5
```

- Calibration Against Rubric:

```markdown
# Scoring Calibration

## Correctness: 18/20 (Excellent)
**20**: Zero errors, all edge cases handled perfectly
**18**: Minor edge case missing, otherwise excellent ✅ (achieved)
**15**: 1-2 significant edge cases missing
**10**: Some logic errors present
**0**: Major functionality broken

**Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)

## Functionality: 19/20 (Excellent)
[Similar rubric with evidence]

## Quality: 17/20 (Good)
[Similar rubric with evidence]

## Integration: 18/20 (Excellent)
[Similar rubric with evidence]

## Security: 16/20 (Good)
[Similar rubric with evidence]

**Total**: 88/100 ⚠️ (Below ≥90 gate)
```

- Gap Analysis (if <90):

```markdown
# Quality Gap Analysis

**Current Score**: 88/100
**Target**: ≥90/100
**Gap**: 2 points

## Critical Gaps (Blocking Approval)
None

## High Priority (Should Fix for ≥90)
1. **Security: Weak bcrypt rounds**
   - **What**: bcrypt using 10 rounds (outdated)
   - **Where**: src/auth/hash.ts:15
   - **Why**: Current standard is 12-14 rounds
   - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
   - **Priority**: High
   - **Impact**: +2 points → 90/100

## Medium Priority
1. **Quality: Missing JSDoc for 3 functions**
   - Impact: +1 point → 91/100

**Recommendation**: Fix high priority issue to reach ≥90 threshold
**Estimated Effort**: 15 minutes
```

- Generate Comprehensive Quality Report:

```markdown
# Layer 5: Quality Scoring Report

## Executive Summary
**Total Score**: 88/100 ⚠️ (Below ≥90 gate)
**Status**: NEEDS MINOR REVISION

## Dimension Scores
- Correctness: 18/20 ⭐⭐⭐⭐⭐
- Functionality: 19/20 ⭐⭐⭐⭐⭐
- Quality: 17/20 ⭐⭐⭐⭐
- Integration: 18/20 ⭐⭐⭐⭐⭐
- Security: 16/20 ⭐⭐⭐⭐

## Strengths
1. Comprehensive test coverage (87%)
2. All functionality working correctly
3. Clean integration with all components
4. Good error handling

## Weaknesses
1. Bcrypt rounds below current standard (security)
2. Missing documentation for helper functions (quality)
3. One timezone edge case not handled (correctness)

## Recommendations (Prioritized)

### Priority 1 (High - Needed for ≥90)
1. Increase bcrypt rounds: 10 → 12
   - File: src/auth/hash.ts:15
   - Effort: 5 min
   - Impact: +2 points

### Priority 2 (Medium - Nice to Have)
1. Add JSDoc to helper functions
   - Files: src/utils/validation.ts
   - Effort: 30 min
   - Impact: +1 point
2. Handle timezone DST edge case
   - File: src/auth/tokens.ts:78
   - Effort: 20 min
   - Impact: +1 point

**Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
```
Outputs:
- Quality score (0-100) with dimension breakdown
- Calibrated against rubric
- Gap analysis
- Prioritized recommendations (Critical/High/Medium/Low)
- Evidence-based feedback (file:line references)
- Action plan to reach ≥90
Validation:
- All 5 dimensions scored
- Scores calibrated against rubric
- Evidence provided for each score
- Gap analysis if <90
- Recommendations actionable
- Ensemble used for critical features (optional)
Time Estimate: 60-120 minutes (ensemble adds 30-60 min)
Gate 5: ✅ PASS if total score ≥90/100
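The `median` helper and the variance-based escalation rule used in the ensemble step can be made concrete. This is a minimal sketch; the >15-point disagreement threshold is the one defined in the process above, and totals are assumed to already be on the 0-100 scale:

```typescript
// Median of a score list (handles odd and even lengths).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Decide whether a 3-judge panel must grow to 5:
// escalate when the max-min spread of total scores exceeds 15 points.
function needsEscalation(totals: number[]): boolean {
  return Math.max(...totals) - Math.min(...totals) > 15;
}

console.log(median([88, 91, 90]));          // 90
console.log(needsEscalation([88, 91, 90])); // false (spread 3)
console.log(needsEscalation([72, 91, 90])); // true (spread 19)
```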
Quality Gates Summary
All 5 Gates Must Pass for production approval:
Gate 1: Rules Pass ✅
↓ (Linting, types, schema, security)
Gate 2: Tests Pass ✅
↓ (All tests, coverage ≥80%)
Gate 3: Visual OK ✅
↓ (UI validated, a11y checked)
Gate 4: Integration OK ✅
↓ (E2E works, APIs integrate)
Gate 5: Quality ≥90 ✅
↓ (LLM-as-judge score ≥90/100)
✅ PRODUCTION APPROVED

If Any Gate Fails:
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
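The fail-fast gate sequence can be sketched as a small driver; the `Gate` shape and runner signature here are illustrative assumptions, not part of the skill:

```typescript
// A gate runs its layer's checks and reports pass/fail plus feedback.
type GateResult = { pass: boolean; feedback: string[] };
type Gate = { name: string; run: () => Promise<GateResult> };

// Run gates 1→5 in order, stopping at the first failure (fail fast:
// cheap automated layers run before the expensive LLM-as-judge layer).
async function verify(
  gates: Gate[]
): Promise<{ approved: boolean; failedGate?: string; feedback: string[] }> {
  for (const gate of gates) {
    const result = await gate.run();
    if (!result.pass) {
      // Failed gate → caller performs gap analysis, applies fixes, re-verifies
      return { approved: false, failedGate: gate.name, feedback: result.feedback };
    }
  }
  return { approved: true, feedback: [] }; // PRODUCTION APPROVED
}
```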
Appendix A: Independence Protocol
How Verification Independence is Maintained
Verification Agent Spawning:

```typescript
// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.
DO NOT read prior conversation history.

Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md

Verify against specifications ONLY (not implementation decisions).

Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks

Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});
```

Bias Prevention Checklist:
- Specifications written BEFORE implementation
- Verification agent prompt has no implementation context
- Agent evaluates against specs, not what code does
- Fresh context (via Task tool)
- Different model family used (if possible)
Validation of Independence:
markdown
undefined验证Agent生成方式:
typescript
// After implementation and testing complete
const verification = await task({
description: "Independent quality verification",
prompt: `Verify code quality independently.
DO NOT read prior conversation history.
Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md
Verify against specifications ONLY (not implementation decisions).
Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks
Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});
偏见预防检查清单:
- 需求规格书在实现前已编写
- 验证Agent的提示语不含实现上下文
- Agent仅基于规格书评估,而非代码实际实现
- 使用全新上下文(通过Task工具)
- 尽可能使用不同模型家族
独立性验证:
markdown
Independence Audit
Independence Audit
Expected Behavior:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
Warning Signs:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
If Warning: Re-verify with stronger independence prompt
---
Expected Behavior:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
Warning Signs:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
If Warning: Re-verify with stronger independence prompt
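The audit signals above lend themselves to a simple automated check. A minimal sketch, assuming a hypothetical `VerificationReport` shape (the field names are illustrative, not part of this skill):

```typescript
// Illustrative independence audit: flags likely rubber-stamp verification
// reports using the warning signs listed above. The report shape is an
// assumption for this sketch.
interface VerificationReport {
  issuesFound: number;        // issues the verifier reported
  citesSpecifications: boolean; // does the report reference the specs?
  toolInvocations: number;    // lint runs, test runs, file reads
}

function auditIndependence(report: VerificationReport): string[] {
  const warnings: string[] = [];
  if (report.issuesFound === 0) {
    warnings.push("0 issues found: possible rubber stamp");
  }
  if (report.toolInvocations === 0) {
    warnings.push("no tool use: claims were not verified");
  }
  if (!report.citesSpecifications) {
    warnings.push("no spec references: may be parroting implementation");
  }
  return warnings; // non-empty → re-verify with a stronger independence prompt
}
```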
---
Appendix B: Operational Scoring Rubrics
附录B:操作评分标准
Complete Rubrics for All 5 Dimensions
所有5个维度的完整评分标准
Correctness (/20)
Correctness (/20)
20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken
20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken
Functionality (/20)
Functionality (/20)
20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing
20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing
Quality (/20)
Quality (/20)
20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code
20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code
Integration (/20)
Integration (/20)
20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate
20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate
Security (/20)
Security (/20)
20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities
20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities
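Taken together, the five /20 dimensions sum to the 0-100 quality score. A minimal sketch of the aggregation and verdict mapping (type and function names are illustrative assumptions; the thresholds follow this skill's quality bands):

```typescript
// Aggregates the five /20 dimension scores from Appendix B into a
// 0-100 total, then maps it to the skill's quality-threshold verdicts.
interface DimensionScores {
  correctness: number;   // /20
  functionality: number; // /20
  quality: number;       // /20
  integration: number;   // /20
  security: number;      // /20
}

function totalScore(s: DimensionScores): number {
  return s.correctness + s.functionality + s.quality + s.integration + s.security;
}

function verdict(total: number): string {
  if (total >= 90) return "Excellent (production-ready)";
  if (total >= 80) return "Good (needs minor improvements)";
  if (total >= 70) return "Acceptable (needs work before production)";
  return "Poor (significant rework required)";
}
```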
Appendix C: Technical Foundation
附录C:技术基础
Verification Tools
验证工具
Linting:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)
Type Checking:
- TypeScript compiler (tsc)
- mypy (Python)
Security (SAST):
- Semgrep (multi-language)
- Bandit (Python)
- npm audit (JavaScript)
Visual Testing:
- Playwright (screenshot, visual regression)
- Percy/Chromatic (visual diff)
- axe-core (accessibility)
Coverage:
- c8/nyc (JavaScript)
- pytest-cov (Python)
代码检查:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)
类型检查:
- TypeScript compiler (tsc)
- mypy (Python)
安全扫描(SAST):
- Semgrep (多语言)
- Bandit (Python)
- npm audit (JavaScript)
视觉测试:
- Playwright (截图、视觉回归)
- Percy/Chromatic (视觉对比)
- axe-core (可访问性)
覆盖率工具:
- c8/nyc (JavaScript)
- pytest-cov (Python)
Cost Controls
成本控制
Budget Caps:
- LLM-as-judge: $50/month
- Ensemble verification: $20/month
- Total verification: $70/month
Optimization:
- Cache quality scores for 24h (same code → same score)
- Skip Layer 5 for changes <50 lines
- Use ensemble (3-5 agents) only for critical features
- Use cheaper models for pre-filtering (Haiku for Layer 1-2)
预算上限:
- LLM-as-judge: $50/月
- 集成验证: $20/月
- 总验证成本: $70/月
优化措施:
- 质量评分缓存24小时(相同代码→相同评分)
- 代码变更<50行时跳过第5层验证
- 仅对核心功能使用集成评估(3-5个Agent)
- 预过滤使用低成本模型(Layer1-2使用Haiku)
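The 24h score cache and the Layer-5 skip rule above can be sketched as follows (the hash-based cache key and helper names are assumptions for illustration, not a prescribed implementation):

```typescript
// Cost-control sketch: cache quality scores for 24h keyed by a hash of
// the code (same code → same score), and skip Layer 5 (LLM-as-judge)
// for changes under 50 lines.
import { createHash } from "node:crypto";

const CACHE_TTL_MS = 24 * 60 * 60 * 1000;  // cache quality scores for 24h
const LAYER5_MIN_CHANGED_LINES = 50;       // skip LLM-as-judge below this

const scoreCache = new Map<string, { score: number; at: number }>();

function codeKey(code: string): string {
  return createHash("sha256").update(code).digest("hex");
}

function shouldRunLayer5(changedLines: number): boolean {
  return changedLines >= LAYER5_MIN_CHANGED_LINES;
}

function cacheScore(code: string, score: number, now = Date.now()): void {
  scoreCache.set(codeKey(code), { score, at: now });
}

function getCachedScore(code: string, now = Date.now()): number | undefined {
  const entry = scoreCache.get(codeKey(code));
  if (entry && now - entry.at < CACHE_TTL_MS) return entry.score;
  return undefined; // expired or never scored → re-run the judge
}
```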
Quick Reference
快速参考
The 5 Layers
5层验证体系
| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
Total: 3-6 hours for complete 5-layer verification
| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
Total: 3-6 hours for complete 5-layer verification
Quality Thresholds
质量阈值
- ≥90: ✅ Excellent (production-ready)
- 80-89: ⚠️ Good (needs minor improvements)
- 70-79: ❌ Acceptable (needs work before production)
- <70: ❌ Poor (significant rework required)
- ≥90: ✅ Excellent (production-ready)
- 80-89: ⚠️ Good (needs minor improvements)
- 70-79: ❌ Acceptable (needs work before production)
- <70: ❌ Poor (significant rework required)
Gates
门禁要求
All 5 Must Pass:
- Rules pass (no critical lint/type/security)
- Tests pass + coverage ≥80%
- Visual OK (no critical UI issues)
- Integration OK (E2E works)
- Quality ≥90/100
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.
All 5 Must Pass:
- Rules pass (no critical lint/type/security)
- Tests pass + coverage ≥80%
- Visual OK (no critical UI issues)
- Integration OK (E2E works)
- Quality ≥90/100
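The all-five-must-pass rule amounts to a short-circuiting sequential check; a minimal sketch (the gate representation is an assumption, and the per-layer checks are placeholders for the real verifications):

```typescript
// Sequential gate evaluation: all five gates must pass for production
// approval; the first failure triggers gap analysis and re-verification.
type Gate = { name: string; passed: boolean };

function evaluateGates(gates: Gate[]): { approved: boolean; failedAt?: string } {
  for (const gate of gates) {
    if (!gate.passed) {
      // Failed gate → gap analysis → apply fixes → re-verify
      return { approved: false, failedAt: gate.name };
    }
  }
  return { approved: true }; // ✅ PRODUCTION APPROVED
}
```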
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.