Multi-AI Verification

Overview

multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.
Purpose: Multi-layer independent verification ensuring production-ready quality
Pattern: Task-based (5 independent verification operations, one per layer)
Key Innovation: 5-layer pyramid (95% automated at the base → 0-20% at the apex) with independent verification to prevent bias and test gaming
Core Principles (validated by tri-AI research):
  1. Multi-Layer Defense - 5 layers catch different types of issues
  2. Independent Verification - The verifying agent is separate from the agents that implemented and tested the code
  3. Progressive Automation - Automate what can be automated (95% at Layer 1 → 0-20% at Layer 5)
  4. Quality Scoring - Objective 0-100 scoring with ≥90 threshold
  5. Actionable Feedback - All feedback is specific and actionable (What/Where/Why/How/Priority)
Quality Gates: All 5 layers must pass for production approval
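The What/Where/Why/How/Priority structure in principle 5 can be represented as a small record type. The sketch below is illustrative only: the `FeedbackItem` type, its field names, and the `isActionable` helper are assumptions, not an interface defined by this skill.

```typescript
// Hypothetical shape for one piece of verification feedback.
// Field names mirror What/Where/Why/How/Priority and are assumptions.
type Priority = "critical" | "high" | "medium" | "low";

interface FeedbackItem {
  what: string;  // the problem
  where: string; // file:line reference
  why: string;   // why it matters
  how: string;   // concrete fix
  priority: Priority;
}

// Treat feedback as actionable only when every text field is filled in
// and `where` ends in a line number (file:line).
function isActionable(item: FeedbackItem): boolean {
  const filled = [item.what, item.where, item.why, item.how].every(
    (field) => field.trim().length > 0
  );
  return filled && /^.+:\d+$/.test(item.where);
}

const example: FeedbackItem = {
  what: "bcrypt using 10 rounds",
  where: "src/auth/hash.ts:15",
  why: "Work factor is below the project's chosen standard",
  how: "Raise the cost parameter to 12",
  priority: "high",
};

console.log(isActionable(example)); // true
```

A check like this could run over a generated report so that vague items ("improve security") are rejected before the report is delivered.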

When to Use

Use multi-ai-verification when:
  • Final quality check before commit/deployment
  • Independent code review (preventing bias)
  • Security verification (OWASP, vulnerabilities)
  • Comprehensive QA (all layers)
  • Test quality verification (prevent gaming)
  • Production readiness validation

Prerequisites

Required

  • Code to verify (implementation complete)
  • Tests available (for functional verification)
  • Quality standards defined

Recommended

  • multi-ai-testing - For generating/running tests
  • multi-ai-implementation - For implementing fixes

Tools Available

  • Linters (ESLint, Pylint)
  • Type checkers (TypeScript, mypy)
  • Coverage tools (c8, pytest-cov)
  • Security scanners (Semgrep, Bandit)
  • Test frameworks (Jest, pytest)

The 5-Layer Verification Pyramid

         Layer 5: Quality Scoring
         (LLM-as-Judge, 0-20% automated)
              /\
             /  \
        Layer 4: Integration
        (E2E, System, 20-30% automated)
          /      \
         /        \
    Layer 3: Visual
    (UI, Screenshots, 30-50% automated)
      /          \
     /            \
Layer 2: Functional
(Tests, Coverage, 60-80% automated)
  /              \
 /                \
Layer 1: Rules-Based
(Linting, Types, Schema, 95% automated)
Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
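The fail-fast principle can be sketched as a loop that runs the layers in order and stops at the first failure, so the expensive upper layers never run on code that fails a cheap check. The `LayerResult` and `LayerCheck` types and the `runPyramid` and `approved` functions are hypothetical names introduced for illustration.

```typescript
// Hypothetical result of one verification layer.
interface LayerResult {
  layer: number;
  label: string;
  passed: boolean;
}

// A layer is modeled as a function that returns its result (assumed interface).
type LayerCheck = () => LayerResult;

// Run layers cheapest-first and stop at the first failure.
function runPyramid(checks: LayerCheck[]): LayerResult[] {
  const results: LayerResult[] = [];
  for (const check of checks) {
    const result = check();
    results.push(result);
    if (!result.passed) break;
  }
  return results;
}

// Production approval requires that every layer ran and passed.
function approved(results: LayerResult[], totalLayers: number): boolean {
  return results.length === totalLayers && results.every((r) => r.passed);
}

const pyramidRun = runPyramid([
  () => ({ layer: 1, label: "Rules-Based", passed: true }),
  () => ({ layer: 2, label: "Functional", passed: false }),
  () => ({ layer: 3, label: "Visual", passed: true }), // never reached
]);

console.log(pyramidRun.length, approved(pyramidRun, 3)); // 2 false
```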

Verification Operations

Operation 1: Rules-Based Verification (Layer 1)

Purpose: Automated validation of code structure, formatting, types
Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)
Process:
  1. Schema Validation (if applicable):
    bash
    # Validate JSON/YAML against schemas
    ajv validate -s plan.schema.json -d plan.json
    ajv validate -s task.schema.json -d tasks/*.json
  2. Linting:
    bash
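    # JavaScript/TypeScript (quote the glob so ESLint expands it, not the shell)
    npx eslint "src/**/*.{ts,tsx,js,jsx}"
    
    # Python (--recursive=y requires Pylint 2.13 or later)
    pylint --recursive=y src/
    
    # Expected: Zero linting errors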
  3. Type Checking:
    bash
    # TypeScript
    npx tsc --noEmit
    
    # Python
    mypy src/
    
    # Expected: Zero type errors
  4. Format Validation:
    bash
    # Check formatting (quote the glob so Prettier expands it, not the shell)
    npx prettier --check "src/**/*.{ts,tsx}"
    
    # Or auto-fix
    npx prettier --write "src/**/*.{ts,tsx}"
  5. Security Scanning (SAST):
    bash
    # Static security analysis (Semgrep is installed via pip or Homebrew, not npm)
    semgrep scan --config=auto src/
    
    # Or for Python
    bandit -r src/
    
    # Check for:
    # - Hardcoded secrets
    # - SQL injection risks
    # - XSS vulnerabilities
    # - Insecure dependencies
  6. Generate Layer 1 Report:
    markdown
    # Layer 1: Rules-Based Verification
    
    ## Schema Validation
    ✅ plan.json validates
    ✅ All task files validate
    
    ## Linting
    ✅ 0 linting errors
    ⚠️ 3 warnings (non-blocking)
    
    ## Type Checking
    ✅ 0 type errors
    
    ## Formatting
    ✅ All files formatted correctly
    
    ## Security Scan (SAST)
    ✅ No critical vulnerabilities
    ⚠️ 1 medium: Weak password hashing rounds (bcrypt)
    
    **Layer 1 Status**: ✅ PASS (0 critical issues)
    **Issues to Address**: 1 medium security issue
Outputs:
  • Lint report (errors/warnings)
  • Type check results
  • Schema validation results
  • Security scan findings
  • Layer 1 status (PASS/FAIL)
Validation:
  • All automated checks run
  • Results documented
  • Critical issues = 0 for PASS
  • Actionable feedback for warnings
Time Estimate: 15-30 minutes (mostly automated)
Gate 1: ✅ PASS if no critical issues (warnings acceptable)
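Gate 1 reduces to a severity filter over the findings collected from each tool. The following is a minimal sketch under stated assumptions: the `Finding` shape and the three severity levels are hypothetical, and real tool output would need to be mapped onto them.

```typescript
// Severity levels are an assumption; map each tool's own levels onto them.
type Severity = "critical" | "medium" | "warning";

interface Finding {
  tool: string; // e.g. "eslint", "tsc", "semgrep"
  severity: Severity;
  message: string;
}

// Gate 1: PASS when there are no critical findings. Everything else is
// reported as an issue to address but does not block.
function evaluateLayer1(findings: Finding[]): { pass: boolean; toAddress: Finding[] } {
  const critical = findings.filter((f) => f.severity === "critical");
  const toAddress = findings.filter((f) => f.severity !== "critical");
  return { pass: critical.length === 0, toAddress };
}

const layer1 = evaluateLayer1([
  { tool: "eslint", severity: "warning", message: "unused variable" },
  { tool: "semgrep", severity: "medium", message: "weak bcrypt rounds" },
]);

console.log(layer1.pass, layer1.toAddress.length); // true 2
```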

Operation 2: Functional Verification (Layer 2)

Purpose: Validate functionality through test execution and coverage
Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)
Process:
  1. Execute Complete Test Suite:
    bash
    # Run all tests with coverage
    npm test -- --coverage --verbose
    
    # Capture results
    # - Tests passed/failed
    # - Coverage metrics
    # - Execution time
  2. Validate Example Code (from documentation):
    bash
    # Extract examples from SKILL.md
    # Execute each example automatically
    # Verify outputs match expected
    
    # Target: ≥90% examples work
  3. Check Coverage:
    markdown
    # Coverage Report
    
    **Line Coverage**: 87% ✅ (gate: ≥80%)
    **Branch Coverage**: 82% ✅
    **Function Coverage**: 92% ✅
    **Path Coverage**: 74% ⚠️ (below 80%, not gated)
    
    **Gate Status**: PASS ✅ (line, branch, and function coverage ≥80%)
    
    **Uncovered Code**:
    - src/admin/legacy.ts: 23% (low priority)
    - src/utils/deprecated.ts: 15% (deprecated, ok)
  4. Regression Testing (for updates):
    bash
    # Compare before/after
    git diff main...feature --stat
    
    # Run all tests
    npm test
    
    # Verify: No new failures (regression prevention)
  5. Performance Validation:
    bash
    # Run performance tests
    npm run test:performance
    
    # Check response times
    # Verify: Within acceptable ranges
  6. Generate Layer 2 Report:
    markdown
    # Layer 2: Functional Verification
    
    ## Test Execution
    ✅ 245/245 tests passing (100%)
    ⏱️ Execution time: 8.3 seconds
    
    ## Coverage
    ✅ Line: 87% (gate: ≥80%)
    ✅ Branch: 82%
    ✅ Function: 92%
    
    ## Example Validation
    ✅ 18/20 examples work (90%)
    ❌ 2 examples fail (outdated)
    
    ## Regression
    ✅ All existing tests still pass
    
    ## Performance
    ✅ All endpoints <200ms
    
    **Layer 2 Status**: ✅ PASS
    **Issues**: 2 outdated examples (update docs)
Outputs:
  • Test execution results
  • Coverage report
  • Example validation results
  • Regression check
  • Performance metrics
  • Layer 2 status
Validation:
  • All tests executed
  • Coverage meets gate (≥80%)
  • Examples validated (≥90%)
  • No regressions
  • Performance acceptable
Time Estimate: 30-60 minutes
Gate 2: ✅ PASS if tests pass + coverage ≥80%
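The Gate 2 rule (all tests pass and coverage ≥80%) can be expressed as a small check over a test-run summary. This is a sketch, not the skill's implementation: the `TestRun` and `CoverageSummary` shapes are hypothetical, and applying the threshold to line, branch, and function coverage alike is an assumption.

```typescript
// Hypothetical summary of a test run; percentages are 0-100.
interface CoverageSummary {
  line: number;
  branch: number;
  functions: number;
}

interface TestRun {
  passed: number;
  total: number;
  coverage: CoverageSummary;
}

const COVERAGE_GATE = 80;

// Gate 2: every test passes and each gated metric meets the threshold.
function evaluateLayer2(run: TestRun): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (run.passed !== run.total) {
    reasons.push(`${run.total - run.passed} test(s) failing`);
  }
  const metrics: Array<[string, number]> = [
    ["line", run.coverage.line],
    ["branch", run.coverage.branch],
    ["function", run.coverage.functions],
  ];
  for (const [metric, value] of metrics) {
    if (value < COVERAGE_GATE) {
      reasons.push(`${metric} coverage ${value}% is below ${COVERAGE_GATE}%`);
    }
  }
  return { pass: reasons.length === 0, reasons };
}

const layer2 = evaluateLayer2({
  passed: 245,
  total: 245,
  coverage: { line: 87, branch: 82, functions: 92 },
});

console.log(layer2.pass); // true
```

Returning the list of reasons alongside the verdict keeps the feedback actionable when the gate fails.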

Operation 3: Visual Verification (Layer 3)

Purpose: Validate UI appearance, layout, accessibility (for UI features)
Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)
Process:
  1. Screenshot Generation:
    bash
    # Generate screenshots of UI
    npx playwright test --screenshot=on
    
    # Or manually:
    # Open application
    # Capture screenshots of key views
  2. Visual Comparison (if previous version exists):
    bash
    # Compare against baseline
    npx playwright test --update-snapshots=missing
    
    # Or use Percy/Chromatic for visual regression
    # (percy upload sends a directory of images; percy snapshot expects static pages or URLs)
    npx percy upload screenshots/
  3. Layout Validation:
    markdown
    # Visual Checklist
    
    ## Layout
    - [ ] Components positioned correctly
    - [ ] Spacing/margins match mockup
    - [ ] Alignment proper
    - [ ] No overlapping elements
    
    ## Styling
    - [ ] Colors match design system
    - [ ] Typography correct (fonts, sizes)
    - [ ] Icons/images display properly
    
    ## Responsiveness
    - [ ] Mobile view (320px-480px): ✅
    - [ ] Tablet view (768px-1024px): ✅
    - [ ] Desktop view (>1024px): ✅
  4. Accessibility Testing:
    bash
    # Automated accessibility scan (the axe CLI is published as @axe-core/cli and scans a running page)
    npx @axe-core/cli http://localhost:3000
    
    # Check WCAG compliance
    npx pa11y http://localhost:3000
    
    # Manual checks:
    # - Keyboard navigation
    # - Screen reader compatibility
    # - Color contrast ratios
  5. Generate Layer 3 Report:
    markdown
    # Layer 3: Visual Verification
    
    ## Screenshot Comparison
    ✅ Login page matches mockup
    ✅ Dashboard layout correct
    ⚠️ Profile page: Avatar alignment off by 5px
    
    ## Responsiveness
    ✅ Mobile: All components visible
    ✅ Tablet: Layout adapts correctly
    ✅ Desktop: Full functionality
    
    ## Accessibility
    ✅ WCAG 2.1 AA compliance
    ✅ Keyboard navigation works
    ⚠️ 2 color contrast warnings (non-critical)
    
    **Layer 3 Status**: ✅ PASS (minor issues acceptable)
    **Issues**: Avatar alignment (cosmetic), contrast warnings
Outputs:
  • Screenshots of UI
  • Visual comparison results
  • Responsiveness validation
  • Accessibility report
  • Layer 3 status
Validation:
  • Screenshots captured
  • Visual comparison done (if applicable)
  • Layout validated
  • Responsiveness tested
  • Accessibility checked
  • No critical visual issues
Time Estimate: 30-90 minutes (skip if no UI)
Gate 3: ✅ PASS if no critical visual/a11y issues
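The color contrast check listed under manual accessibility checks can be partly automated. The sketch below computes the WCAG 2.x contrast ratio between two sRGB colors and compares it with the 4.5:1 minimum that level AA sets for normal-size text. The function names and the `RGB` tuple type are illustrative choices.

```typescript
type RGB = [number, number, number]; // 0-255 per channel

// Convert one sRGB channel to its linear value (WCAG 2.x definition).
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance: weighted sum of the linearized channels.
function relativeLuminance(color: RGB): number {
  return (
    0.2126 * linearize(color[0]) +
    0.7152 * linearize(color[1]) +
    0.0722 * linearize(color[2])
  );
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), from 1 to 21.
function contrastRatio(a: RGB, b: RGB): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// WCAG 2.1 AA requires at least 4.5:1 for normal-size text.
function meetsAA(foreground: RGB, background: RGB): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}

console.log(meetsAA([0, 0, 0], [255, 255, 255])); // true
```

Tools such as pa11y and axe report the same ratio, so a helper like this is mainly useful for checking design-system color pairs before they reach a page.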

Operation 4: Integration Verification (Layer 4)

Purpose: Validate system-level integration, data flow, API compatibility
Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High
Process:
  1. Component Integration Tests:
    bash
    # Run integration test suite
    npm test -- tests/integration/
    
    # Verify components work together
    # - Database ← → API
    # - API ← → Frontend
    # - Frontend ← → User
  2. Data Flow Validation:
    markdown
    # Data Flow Verification
    
    **Flow 1: User Registration**
    Frontend form → API endpoint → Validation → Database → Email service
    ✅ Data flows correctly
    ✅ No data loss
    ✅ Transactions atomic
    
    **Flow 2: Authentication**
    Login request → API → Database lookup → Token generation → Response
    ✅ Token generated correctly
    ✅ Session stored
    ✅ Response includes token
  3. API Integration Tests:
    bash
    # Test all API endpoints
    npm run test:api
    
    # Verify:
    # - All endpoints respond
    # - Status codes correct
    # - Response formats match spec
    # - Error handling works
  4. End-to-End Workflow Tests:
    typescript
    // Complete user journeys
    test('Complete registration and login flow', async () => {
      // 1. Register new user
      const registerResponse = await api.post('/register', userData);
      expect(registerResponse.status).toBe(201);
    
      // 2. Confirm email
      const confirmResponse = await api.get(confirmLink);
      expect(confirmResponse.status).toBe(200);
    
      // 3. Login
      const loginResponse = await api.post('/login', credentials);
      expect(loginResponse.status).toBe(200);
      expect(loginResponse.data.token).toBeDefined();
    
      // 4. Access protected resource
      const profileResponse = await api.get('/profile', {
        headers: { Authorization: `Bearer ${loginResponse.data.token}` }
      });
      expect(profileResponse.status).toBe(200);
    });
  5. Dependency Compatibility:
    bash
    # Check dependencies for known vulnerabilities
    npm audit
    
    # List outdated packages (review major-version jumps for breaking changes)
    npm outdated
    
    # Verify integration with services
    # - Database connection
    # - Redis/cache
    # - External APIs
  6. Generate Layer 4 Report:
    markdown
    # Layer 4: Integration Verification
    
    ## Component Integration
    ✅ 12/12 integration tests passing
    ✅ All components integrate correctly
    
    ## Data Flow
    ✅ All 5 data flows validated
    ✅ No data loss or corruption
    
    ## API Integration
    ✅ All 15 endpoints functional
    ✅ Response formats correct
    ✅ Error handling works
    
    ## E2E Workflows
    ✅ 8/8 user journeys complete successfully
    ✅ No workflow breaks
    
    ## Dependencies
    ✅ 0 critical vulnerabilities
    ⚠️ 2 moderate (non-blocking)
    
    **Layer 4 Status**: ✅ PASS
Outputs:
  • Integration test results
  • Data flow validation
  • API compatibility report
  • E2E workflow results
  • Dependency audit
  • Layer 4 status
Validation:
  • Integration tests pass
  • Data flows validated
  • APIs integrate correctly
  • E2E workflows function
  • Dependencies secure
Time Estimate: 45-90 minutes
Gate 4: ✅ PASS if all integration tests pass and dependencies have no critical vulnerabilities
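Gate 4 can be evaluated from the integration test counts plus the output of `npm audit --json`. The sketch below assumes the npm 7+ report layout, where severity counts sit under `metadata.vulnerabilities`; the `evaluateLayer4` function and its parameters are hypothetical.

```typescript
// Subset of the `npm audit --json` report used here (npm 7+ layout assumed).
interface AuditCounts {
  low: number;
  moderate: number;
  high: number;
  critical: number;
}

interface AuditReport {
  metadata: { vulnerabilities: AuditCounts };
}

// Gate 4: all integration tests pass and no critical vulnerabilities.
function evaluateLayer4(
  testsPassed: number,
  testsTotal: number,
  auditJson: string
): { pass: boolean; summary: string } {
  const audit = JSON.parse(auditJson) as AuditReport;
  const counts = audit.metadata.vulnerabilities;
  const pass = testsPassed === testsTotal && counts.critical === 0;
  const summary =
    `${testsPassed}/${testsTotal} integration tests, ` +
    `${counts.critical} critical, ${counts.moderate} moderate`;
  return { pass, summary };
}

// Sample input standing in for real `npm audit --json` output.
const sampleAudit = JSON.stringify({
  metadata: {
    vulnerabilities: { info: 0, low: 0, moderate: 2, high: 0, critical: 0, total: 2 },
  },
});

const layer4 = evaluateLayer4(12, 12, sampleAudit);
console.log(layer4.pass, layer4.summary);
// true 12/12 integration tests, 0 critical, 2 moderate
```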

Operation 5: Quality Scoring (Layer 5)

Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns
Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)
Process:
  1. Spawn Independent Quality Assessor (Agent-as-a-Judge):
    Key: Use different model family if possible (prevent self-preference bias)
    typescript
    const qualityAssessment = await task({
      description: "Assess code quality holistically",
      prompt: `Evaluate code quality in src/ and tests/.
    
      DO NOT read implementation conversation history.
    
      You have access to tools:
      - Read files
      - Execute tests
      - Run linters
      - Query database (if needed)
    
      Assess 5 dimensions (score each /20):
    
      1. CORRECTNESS (/20):
         - Logic correctness
         - Edge case handling
         - Error handling completeness
         - Security considerations
    
      2. FUNCTIONALITY (/20):
         - Meets all requirements
         - User workflows work
         - Performance acceptable
         - No regressions
    
      3. QUALITY (/20):
         - Code maintainability
         - Best practices followed
         - Anti-patterns avoided
         - Documentation complete
    
      4. INTEGRATION (/20):
         - Components integrate smoothly
         - API contracts correct
         - Data flow works
         - Backward compatible
    
      5. SECURITY (/20):
         - No vulnerabilities
         - Input validation
         - Authentication/authorization
         - Data protection
    
      TOTAL: /100 (sum of 5 dimensions)
    
      For each dimension, provide:
      - Score (/20)
      - Strengths (what's good)
      - Weaknesses (what needs improvement)
      - Evidence (file:line references)
      - Recommendations (specific, actionable)
    
      Write comprehensive report to: quality-assessment.md`
    });
  2. Multi-Agent Ensemble (for critical features):
    3-5 Agent Voting Committee:
    typescript
    // Spawn 3 independent quality assessors
    const [judge1, judge2, judge3] = await Promise.all([
      task({description: "Quality Judge 1", prompt: assessmentPrompt}),
      task({description: "Quality Judge 2", prompt: assessmentPrompt}),
      task({description: "Quality Judge 3", prompt: assessmentPrompt})
    ]);
    
    // Aggregate each dimension with the median, which tolerates one outlier judge
    // (median() and sum() are assumed helper functions)
    const judges = [judge1, judge2, judge3];
    const medianOf = (dimension) => median(judges.map((judge) => judge[dimension]));
    const scores = {
      correctness: medianOf("correctness"),
      functionality: medianOf("functionality"),
      quality: medianOf("quality"),
      integration: medianOf("integration"),
      security: medianOf("security")
    };
    
    const totalScore = sum(Object.values(scores)); // Total /100
    
    // Check disagreement: range (max - min) of the judges' totals
    const totalScores = judges.map((judge) => judge.total);
    const spread = Math.max(...totalScores) - Math.min(...totalScores);
    
    if (spread > 15) {
      // High disagreement → spawn 2 more judges (total 5)
      // Use 5-agent ensemble for final score
    }
    
    // Final score: median of 3 or 5
  3. Calibration Against Rubric:
    markdown
    # Scoring Calibration
    
    ## Correctness: 18/20 (Excellent)
    **20**: Zero errors, all edge cases handled perfectly
    **18**: Minor edge case missing, otherwise excellent ✅ (achieved)
    **15**: 1-2 significant edge cases missing
    **10**: Some logic errors present
    **0**: Major functionality broken
    
    **Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)
    
    ## Functionality: 19/20 (Excellent)
    [Similar rubric with evidence]
    
    ## Quality: 17/20 (Good)
    [Similar rubric with evidence]
    
    ## Integration: 18/20 (Excellent)
    [Similar rubric with evidence]
    
    ## Security: 16/20 (Good)
    [Similar rubric with evidence]
    
    **Total**: 88/100 ⚠️ (Below ≥90 gate)
  4. Gap Analysis (if <90):
    markdown
    # Quality Gap Analysis
    
    **Current Score**: 88/100
    **Target**: ≥90/100
    **Gap**: 2 points
    
    ## Critical Gaps (Blocking Approval)
    None
    
    ## High Priority (Should Fix for ≥90)
    1. **Security: Weak bcrypt rounds**
       - **What**: bcrypt using 10 rounds (outdated)
       - **Where**: src/auth/hash.ts:15
       - **Why**: Current standard is 12-14 rounds
       - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
       - **Priority**: High
       - **Impact**: +2 points → 90/100
    
    ## Medium Priority
    1. **Quality: Missing JSDoc for 3 functions**
       - Impact: +1 point → 91/100
    
    **Recommendation**: Fix high priority issue to reach ≥90 threshold
    **Estimated Effort**: 5 minutes
  5. Generate Comprehensive Quality Report:
    markdown
    # Layer 5: Quality Scoring Report
    
    ## Executive Summary
    **Total Score**: 88/100 ⚠️ (Below ≥90 gate)
    **Status**: NEEDS MINOR REVISION
    
    ## Dimension Scores
    - Correctness: 18/20 ⭐⭐⭐⭐⭐
    - Functionality: 19/20 ⭐⭐⭐⭐⭐
    - Quality: 17/20 ⭐⭐⭐⭐
    - Integration: 18/20 ⭐⭐⭐⭐⭐
    - Security: 16/20 ⭐⭐⭐⭐
    
    ## Strengths
    1. Comprehensive test coverage (87%)
    2. All functionality working correctly
    3. Clean integration with all components
    4. Good error handling
    
    ## Weaknesses
    1. Bcrypt rounds below current standard (security)
    2. Missing documentation for helper functions (quality)
    3. One timezone edge case not handled (correctness)
    
    ## Recommendations (Prioritized)
    
    ### Priority 1 (High - Needed for ≥90)
    1. Increase bcrypt rounds: 10 → 12
       - File: src/auth/hash.ts:15
       - Effort: 5 min
       - Impact: +2 points
    
    ### Priority 2 (Medium - Nice to Have)
    1. Add JSDoc to helper functions
       - Files: src/utils/validation.ts
       - Effort: 30 min
       - Impact: +1 point
    
    2. Handle timezone DST edge case
       - File: src/auth/tokens.ts:78
       - Effort: 20 min
       - Impact: +1 point
    
    **Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
Outputs:
  • Quality score (0-100) with dimension breakdown
  • Calibrated against rubric
  • Gap analysis
  • Prioritized recommendations (Critical/High/Medium/Low)
  • Evidence-based feedback (file:line references)
  • Action plan to reach ≥90
Validation:
  • All 5 dimensions scored
  • Scores calibrated against rubric
  • Evidence provided for each score
  • Gap analysis if <90
  • Recommendations actionable
  • Ensemble used for critical features (optional)
Time Estimate: 60-120 minutes (ensemble adds 30-60 min)
Gate 5: ✅ PASS if total score ≥90/100
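The ensemble aggregation and the ≥90 gate can be combined into one self-contained calculation. The sketch below follows the approach shown in the ensemble step (median per dimension, summed to a total, with the spread of judge totals as the disagreement signal). The type and function names are hypothetical, and the three sample judges are invented data chosen to reproduce the 88/100 example.

```typescript
type Dimension = "correctness" | "functionality" | "quality" | "integration" | "security";
type JudgeScores = Record<Dimension, number>; // each dimension scored 0-20

const DIMENSIONS: Dimension[] = [
  "correctness", "functionality", "quality", "integration", "security",
];
const PASS_THRESHOLD = 90;
const MAX_SPREAD = 15;

function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function totalOf(scores: JudgeScores): number {
  return DIMENSIONS.reduce((acc, d) => acc + scores[d], 0);
}

function aggregate(judges: JudgeScores[]) {
  // Median per dimension, then sum the medians for the total out of 100.
  const medians = {} as JudgeScores;
  for (const d of DIMENSIONS) {
    medians[d] = median(judges.map((j) => j[d]));
  }
  const score = totalOf(medians);

  // Disagreement: range of the individual judges' totals.
  const totals = judges.map(totalOf);
  const spread = Math.max(...totals) - Math.min(...totals);

  return {
    medians,
    score,
    spread,
    needsMoreJudges: spread > MAX_SPREAD,
    pass: score >= PASS_THRESHOLD,
    gap: Math.max(0, PASS_THRESHOLD - score),
  };
}

// Invented sample scores from three judges.
const verdict = aggregate([
  { correctness: 18, functionality: 19, quality: 17, integration: 18, security: 16 },
  { correctness: 17, functionality: 19, quality: 18, integration: 18, security: 17 },
  { correctness: 18, functionality: 18, quality: 17, integration: 19, security: 16 },
]);

console.log(verdict.score, verdict.spread, verdict.pass, verdict.gap); // 88 1 false 2
```

The `gap` value feeds directly into the gap analysis: here two points are missing, which matches the single high-priority fix in the example report.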

目标:采用LLM-as-judge和Agent-as-a-Judge模式进行整体质量评估
自动化程度:0-20%自动化 速度:小时级(成本较高) 可信度:中等(需主观判断)
流程
  1. 生成独立质量评估Agent(Agent-as-a-Judge):
    关键:尽可能使用不同模型家族(避免自我偏好偏见)
    typescript
    const qualityAssessment = await task({
      description: "Assess code quality holistically",
      prompt: `Evaluate code quality in src/ and tests/.
    
      DO NOT read implementation conversation history.
    
      You have access to tools:
      - Read files
      - Execute tests
      - Run linters
      - Query database (if needed)
    
      Assess 5 dimensions (score each /20):
    
      1. CORRECTNESS (/20):
         - Logic correctness
         - Edge case handling
         - Error handling completeness
         - Security considerations
    
      2. FUNCTIONALITY (/20):
         - Meets all requirements
         - User workflows work
         - Performance acceptable
         - No regressions
    
      3. QUALITY (/20):
         - Code maintainability
         - Best practices followed
         - Anti-patterns avoided
         - Documentation complete
    
      4. INTEGRATION (/20):
         - Components integrate smoothly
         - API contracts correct
         - Data flow works
         - Backward compatible
    
      5. SECURITY (/20):
         - No vulnerabilities
         - Input validation
         - Authentication/authorization
         - Data protection
    
      TOTAL: /100 (sum of 5 dimensions)
    
      For each dimension, provide:
      - Score (/20)
      - Strengths (what's good)
      - Weaknesses (what needs improvement)
      - Evidence (file:line references)
      - Recommendations (specific, actionable)
    
      Write comprehensive report to: quality-assessment.md`
    });
  2. 多Agent集成评估(针对核心功能):
    3-5个Agent投票委员会
    typescript
    // Spawn 3 independent quality assessors
    const [judge1, judge2, judge3] = await Promise.all([
      task({description: "Quality Judge 1", prompt: assessmentPrompt}),
      task({description: "Quality Judge 2", prompt: assessmentPrompt}),
      task({description: "Quality Judge 3", prompt: assessmentPrompt})
    ]);
    
    // Aggregate scores
    const scores = {
      correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
      functionality: median([...]),
      quality: median([...]),
      integration: median([...]),
      security: median([...])
    };
    
    const totalScore = sum(Object.values(scores)); // Total /100
    
    // Check variance
    const totalScores = [judge1.total, judge2.total, judge3.total];
    const variance = max(totalScores) - min(totalScores);
    
    if (variance > 15) {
      // High disagreement → spawn 2 more judges (total 5)
      // Use 5-agent ensemble for final score
    }
    
    // Final score: median of 3 or 5
  3. 基于评分标准的校准
    markdown
    # Scoring Calibration
    
    ## Correctness: 18/20 (Excellent)
    **20**: Zero errors, all edge cases handled perfectly
    **18**: Minor edge case missing, otherwise excellent ✅ (achieved)
    **15**: 1-2 significant edge cases missing
    **10**: Some logic errors present
    **0**: Major functionality broken
    
    **Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)
    
    ## Functionality: 19/20 (Excellent)
    [Similar rubric with evidence]
    
    ## Quality: 17/20 (Good)
    [Similar rubric with evidence]
    
    ## Integration: 18/20 (Excellent)
    [Similar rubric with evidence]
    
    ## Security: 16/20 (Good)
    [Similar rubric with evidence]
    
    **Total**: 88/100 ⚠️ (Below ≥90 gate)
  4. Gap Analysis (if score <90):
    markdown
    # Quality Gap Analysis
    
    **Current Score**: 88/100
    **Target**: ≥90/100
    **Gap**: 2 points
    
    ## Critical Gaps (Blocking Approval)
    None
    
    ## High Priority (Should Fix for ≥90)
    1. **Security: Weak bcrypt rounds**
       - **What**: bcrypt using 10 rounds (outdated)
       - **Where**: src/auth/hash.ts:15
       - **Why**: Current standard is 12-14 rounds
       - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
       - **Priority**: High
       - **Impact**: +2 points → 90/100
    
    ## Medium Priority
    1. **Quality: Missing JSDoc for 3 functions**
       - Impact: +1 point → 91/100
    
    **Recommendation**: Fix high priority issue to reach ≥90 threshold
    **Estimated Effort**: 15 minutes
  5. Generate Comprehensive Quality Report:
    markdown
    # Layer 5: Quality Scoring Report
    
    ## Executive Summary
    **Total Score**: 88/100 ⚠️ (Below ≥90 gate)
    **Status**: NEEDS MINOR REVISION
    
    ## Dimension Scores
    - Correctness: 18/20 ⭐⭐⭐⭐⭐
    - Functionality: 19/20 ⭐⭐⭐⭐⭐
    - Quality: 17/20 ⭐⭐⭐⭐
    - Integration: 18/20 ⭐⭐⭐⭐⭐
    - Security: 16/20 ⭐⭐⭐⭐
    
    ## Strengths
    1. Comprehensive test coverage (87%)
    2. All functionality working correctly
    3. Clean integration with all components
    4. Good error handling
    
    ## Weaknesses
    1. Bcrypt rounds below current standard (security)
    2. Missing documentation for helper functions (quality)
    3. One timezone edge case not handled (correctness)
    
    ## Recommendations (Prioritized)
    
    ### Priority 1 (High - Needed for ≥90)
    1. Increase bcrypt rounds: 10 → 12
       - File: src/auth/hash.ts:15
       - Effort: 5 min
       - Impact: +2 points
    
    ### Priority 2 (Medium - Nice to Have)
    1. Add JSDoc to helper functions
       - Files: src/utils/validation.ts
       - Effort: 30 min
       - Impact: +1 point
    
    2. Handle timezone DST edge case
       - File: src/auth/tokens.ts:78
       - Effort: 20 min
       - Impact: +1 point
    
    **Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
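The gap analysis and prioritized recommendations above can be sketched as a small planning function. This is a minimal illustration, not part of the skill: the `Fix` type, the `planFixes` function, and the greedy "highest priority first" selection are all assumptions, and the sample figures are taken from the example report.

```typescript
// Hypothetical types for illustration; none of these names come from the skill itself.
type Priority = "critical" | "high" | "medium" | "low";

interface Fix {
  title: string;
  priority: Priority;
  impact: number;        // estimated points gained if the fix is applied
  effortMinutes: number; // estimated effort
}

const PRIORITY_ORDER: Priority[] = ["critical", "high", "medium", "low"];

// Pick fixes in priority order until the projected score reaches the target.
function planFixes(current: number, target: number, fixes: Fix[]) {
  const ordered = [...fixes].sort(
    (a, b) => PRIORITY_ORDER.indexOf(a.priority) - PRIORITY_ORDER.indexOf(b.priority)
  );
  const selected: Fix[] = [];
  let projected = current;
  for (const fix of ordered) {
    if (projected >= target) break;
    selected.push(fix);
    projected += fix.impact;
  }
  return {
    gap: Math.max(0, target - current),
    selected,
    projected,
    effortMinutes: selected.reduce((sum, f) => sum + f.effortMinutes, 0),
    reachesTarget: projected >= target,
  };
}

// Figures from the example report: 88/100 with one high and two medium fixes.
const examplePlan = planFixes(88, 90, [
  { title: "Add JSDoc to helper functions", priority: "medium", impact: 1, effortMinutes: 30 },
  { title: "Increase bcrypt rounds: 10 -> 12", priority: "high", impact: 2, effortMinutes: 5 },
  { title: "Handle timezone DST edge case", priority: "medium", impact: 1, effortMinutes: 20 },
]);
console.log(examplePlan.selected.map((f) => f.title), examplePlan.projected);
```

With the example figures, only the high-priority fix is selected, which matches the report's "Priority 1" recommendation. The impact values are estimates, so the score still has to be confirmed by re-verification.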
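Outputs:
  • Quality score (0-100) with a per-dimension breakdown
  • Calibration against the rubric
  • Gap analysis report
  • Prioritized recommendations (Critical/High/Medium/Low)
  • Evidence-based feedback (file:line references)
  • Action plan to reach ≥90
Validation:
  • All 5 dimensions scored
  • Scores calibrated against the rubric
  • Evidence provided for each score
  • Gap analysis completed if score <90
  • Recommendations are actionable
  • Ensemble used for critical features (optional)
Estimated Time: 60-120 minutes (ensemble adds 30-60 minutes)
Gate 5: ✅ PASS if total score ≥90/100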
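The ensemble step in this layer (per-dimension medians plus a disagreement check) can be made concrete as follows. This is a sketch under assumptions: the `JudgeScore` shape and the `aggregate` function are hypothetical, and the only values taken from the text are the five dimensions, the 15-point disagreement threshold, and the escalation from 3 to 5 judges.

```typescript
// Hypothetical shape of one judge's result; field names mirror the five
// rubric dimensions but are an assumption, not a documented API.
interface JudgeScore {
  correctness: number;
  functionality: number;
  quality: number;
  integration: number;
  security: number;
}

const DIMENSIONS = ["correctness", "functionality", "quality", "integration", "security"] as const;

// Median of a non-empty list (mean of the two middle values for even lengths).
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Sum of one judge's five dimension scores (/100).
function judgeTotal(judge: JudgeScore): number {
  return DIMENSIONS.reduce((sum, d) => sum + judge[d], 0);
}

function aggregate(judges: JudgeScore[]) {
  const totals = judges.map(judgeTotal);
  // Spread of total scores: the text calls this "variance", but it is max - min.
  const spread = Math.max(...totals) - Math.min(...totals);
  const medians = DIMENSIONS.map((d) => median(judges.map((j) => j[d])));
  return {
    totalScore: medians.reduce((a, b) => a + b, 0),
    spread,
    // Escalate from 3 to 5 judges only when disagreement exceeds 15 points.
    needsMoreJudges: judges.length < 5 && spread > 15,
  };
}
```

Taking the median per dimension keeps a single outlier judge from moving the final score, while the spread check still flags that the outlier exists.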

Quality Gates Summary

All 5 Gates Must Pass for production approval:
Gate 1: Rules Pass ✅
   ↓ (Linting, types, schema, security)

Gate 2: Tests Pass ✅
   ↓ (All tests, coverage ≥80%)

Gate 3: Visual OK ✅
   ↓ (UI validated, a11y checked)

Gate 4: Integration OK ✅
   ↓ (E2E works, APIs integrate)

Gate 5: Quality ≥90 ✅
   ↓ (LLM-as-judge score ≥90/100)

✅ PRODUCTION APPROVED
If Any Gate Fails:
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
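The gate sequence above amounts to a short-circuiting check: the first failing gate sends the work back through gap analysis and re-verification. The sketch below illustrates that logic; the `VerificationInputs` fields and function names are hypothetical, and only the thresholds (coverage ≥80%, quality ≥90) come from the text.

```typescript
// Hypothetical summary of the five layers' results.
interface VerificationInputs {
  rulesPass: boolean;       // Gate 1: linting, types, schema, security
  testsPass: boolean;       // Gate 2: all tests green
  coveragePercent: number;  // Gate 2: coverage threshold
  visualOk: boolean;        // Gate 3
  integrationOk: boolean;   // Gate 4
  qualityScore: number;     // Gate 5: LLM-as-judge score (/100)
}

interface GateResult {
  gate: number;
  label: string;
  passed: boolean;
}

function evaluateGates(v: VerificationInputs): GateResult[] {
  return [
    { gate: 1, label: "Rules Pass", passed: v.rulesPass },
    { gate: 2, label: "Tests Pass", passed: v.testsPass && v.coveragePercent >= 80 },
    { gate: 3, label: "Visual OK", passed: v.visualOk },
    { gate: 4, label: "Integration OK", passed: v.integrationOk },
    { gate: 5, label: "Quality >=90", passed: v.qualityScore >= 90 },
  ];
}

// Report the first failed gate, or approval when all five pass.
function decide(v: VerificationInputs): string {
  const failed = evaluateGates(v).find((g) => !g.passed);
  if (!failed) return "PRODUCTION APPROVED";
  return `Gate ${failed.gate} (${failed.label}) failed: gap analysis, apply fixes, re-verify`;
}
```

For instance, the example report's numbers (87% coverage, quality score 88) pass Gates 1-4 but stop at Gate 5.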

Appendix A: Independence Protocol

How Verification Independence is Maintained

Verification Agent Spawning:
typescript
// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.

  DO NOT read prior conversation history.

  Review:
  - Code: src/**/*.ts
  - Tests: tests/**/*.test.ts
  - Specs: specs/requirements.md

  Verify against specifications ONLY (not implementation decisions).

  Use tools:
  - Read files to inspect code
  - Run tests to verify functionality
  - Execute linters for quality checks

  Score quality (0-100) with evidence.
  Write report to: independent-verification.md`
});
Bias Prevention Checklist:
  • Specifications written BEFORE implementation
  • Verification agent prompt has no implementation context
  • Agent evaluates against specs, not what code does
  • Fresh context (via Task tool)
  • Different model family used (if possible)
Validation of Independence:

Independence Audit

Expected Behavior:
  • ✅ Verifier finds 1-3 issues (healthy skepticism)
  • ✅ Verifier references specifications
  • ✅ Verifier uses tools to verify claims
Warning Signs:
  • ⚠️ Verifier finds 0 issues (possible rubber stamp)
  • ⚠️ Verifier doesn't use tools
  • ⚠️ Verifier parrots implementation justifications
If Warning: Re-verify with stronger independence prompt
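The audit above can be partly automated when the verifier's run leaves structured metadata. The following is a sketch only: the `VerifierReport` fields are hypothetical, and the "parrots implementation justifications" sign is not covered because it needs a human or model judgment rather than a counter.

```typescript
// Hypothetical metadata about one verification run.
interface VerifierReport {
  issuesFound: number;
  toolCallCount: number;        // reads, test runs, linter executions
  citesSpecifications: boolean; // report references the spec documents
}

// Collect warning signs; an empty list means the run looks independent.
function auditIndependence(report: VerifierReport): string[] {
  const warnings: string[] = [];
  if (report.issuesFound === 0) {
    warnings.push("found 0 issues (possible rubber stamp)");
  }
  if (report.toolCallCount === 0) {
    warnings.push("did not use tools");
  }
  if (!report.citesSpecifications) {
    warnings.push("does not reference specifications");
  }
  return warnings;
}

// Any warning triggers re-verification with a stronger independence prompt.
function needsReverification(report: VerifierReport): boolean {
  return auditIndependence(report).length > 0;
}
```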

Appendix B: Operational Scoring Rubrics

Complete Rubrics for All 5 Dimensions

Correctness (/20)

20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken

Functionality (/20)

20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing

Quality (/20)

20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code

Integration (/20)

20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate

Security (/20)

20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities
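Each rubric uses the same seven anchor scores (20, 18, 15, 12, 10, 5, 0). One way to attach a label to an in-between score, such as 17 or 19, is to take the highest anchor that the score reaches. The sketch below assumes that rule and reuses the Correctness labels for every dimension; both choices are illustrative assumptions, not something the rubrics state.

```typescript
// Anchor scores shared by all five rubrics, with the Correctness labels.
// Applying these labels to the other dimensions is an assumption.
const RUBRIC_ANCHORS: Array<[number, string]> = [
  [20, "Perfect"],
  [18, "Excellent"],
  [15, "Good"],
  [12, "Acceptable"],
  [10, "Needs Work"],
  [5, "Poor"],
  [0, "Broken"],
];

// Label a dimension score (/20) by the highest anchor it reaches.
function labelFor(score: number): string {
  if (score < 0 || score > 20) {
    throw new RangeError(`score must be between 0 and 20, got ${score}`);
  }
  for (const [anchor, label] of RUBRIC_ANCHORS) {
    if (score >= anchor) return label;
  }
  return "Broken"; // not reachable for valid input; keeps the return type total
}
```

Under this rule 19 and 18 both read as "Excellent" and 17 and 16 as "Good", which matches the labels used in the calibration example for Layer 5.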

Appendix C: Technical Foundation

Verification Tools

Linting:
  • ESLint (JavaScript/TypeScript)
  • Pylint/Ruff (Python)
Type Checking:
  • TypeScript compiler (tsc)
  • mypy (Python)
Security (SAST and dependency audit):
  • Semgrep (multi-language)
  • Bandit (Python)
  • npm audit (JavaScript)
Visual Testing:
  • Playwright (screenshot, visual regression)
  • Percy/Chromatic (visual diff)
  • axe-core (accessibility)
Coverage:
  • c8/nyc (JavaScript)
  • pytest-cov (Python)

Cost Controls

Budget Caps:
  • LLM-as-judge: $50/month
  • Ensemble verification: $20/month
  • Total verification: $70/month
Optimization:
  • Cache quality scores for 24h (same code → same score)
  • Skip Layer 5 for changes <50 lines
  • Use ensemble (3-5 agents) only for critical features
  • Use cheaper models for pre-filtering (Haiku for Layer 1-2)
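Two of the optimizations above, the 24-hour score cache and the small-change skip, are simple enough to sketch. The class and function names below are hypothetical. The cache is keyed by the code text itself to keep the example dependency-free; a real setup would more likely key by a content hash. The current time is passed in explicitly so the expiry logic is easy to test.

```typescript
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24h, per the optimization list
const MIN_CHANGED_LINES_FOR_LAYER5 = 50;  // changes under 50 lines skip Layer 5

interface CachedScore {
  score: number;
  storedAt: number; // epoch milliseconds
}

// In-memory cache: the same code within 24h returns the same score.
class QualityScoreCache {
  private entries = new Map<string, CachedScore>();

  get(code: string, now: number): number | undefined {
    const entry = this.entries.get(code);
    if (entry === undefined || now - entry.storedAt >= CACHE_TTL_MS) {
      return undefined; // missing or expired
    }
    return entry.score;
  }

  set(code: string, score: number, now: number): void {
    this.entries.set(code, { score, storedAt: now });
  }
}

// Layer 5 runs only for changes of 50 lines or more.
function shouldRunLayer5(changedLines: number): boolean {
  return changedLines >= MIN_CHANGED_LINES_FOR_LAYER5;
}
```

In use, a caller would check `shouldRunLayer5` first, then `get(code, Date.now())`, and only invoke the judge on a cache miss.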

Quick Reference

The 5 Layers

| Layer | Purpose | Automation | Time | Tools |
|-------|---------|------------|------|-------|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |

Total: 3-6 hours for complete 5-layer verification

Quality Thresholds

  • ≥90: ✅ Excellent (production-ready)
  • 80-89: ⚠️ Good (needs minor improvements)
  • 70-79: ❌ Acceptable (needs work before production)
  • <70: ❌ Poor (significant rework required)
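The four bands above map directly onto a lookup. The sketch below is illustrative only; the `QualityBand` type and `classifyScore` function are hypothetical names, while the band boundaries come from the list.

```typescript
type QualityBand = "Excellent" | "Good" | "Acceptable" | "Poor";

interface Classification {
  band: QualityBand;
  productionReady: boolean; // only the >=90 band passes Gate 5
}

// Map a total quality score (/100) to its band.
function classifyScore(score: number): Classification {
  if (score >= 90) return { band: "Excellent", productionReady: true };
  if (score >= 80) return { band: "Good", productionReady: false };
  if (score >= 70) return { band: "Acceptable", productionReady: false };
  return { band: "Poor", productionReady: false };
}
```

A score of 88, as in the Layer 5 example, lands in "Good" and is therefore not yet production-ready.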

Gates

All 5 Must Pass:
  1. Rules pass (no critical lint/type/security)
  2. Tests pass + coverage ≥80%
  3. Visual OK (no critical UI issues)
  4. Integration OK (E2E works)
  5. Quality ≥90/100

multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.