Multi-layer quality assurance with 5-layer verification pyramid (Rules → Functional → Visual → Integration → Quality Scoring). Independent verification with LLM-as-judge and Agent-as-a-Judge patterns. Score 0-100 with ≥90 threshold. Use when verifying code quality, security scanning, preventing test gaming, comprehensive QA, or ensuring production readiness through multi-layer validation.
npx skill4agent add adaptationio/skrillz multi-ai-verification

                 Layer 5: Quality Scoring
              (LLM-as-Judge, 0-20% automated)
                           /\
                          /  \
                   Layer 4: Integration
              (E2E, System, 20-30% automated)
                        /      \
                       /        \
                     Layer 3: Visual
            (UI, Screenshots, 30-50% automated)
                     /            \
                    /              \
                  Layer 2: Functional
           (Tests, Coverage, 60-80% automated)
                  /                  \
                 /                    \
                 Layer 1: Rules-Based
         (Linting, Types, Schema, 95% automated)

# Validate JSON/YAML against schemas
ajv validate -s plan.schema.json -d plan.json
ajv validate -s task.schema.json -d "tasks/*.json"

# JavaScript/TypeScript (quote the glob so the shell does not expand it)
npx eslint "src/**/*.{ts,tsx,js,jsx}"
# Python
pylint --recursive=y src/
# Expected: Zero linting errors

# TypeScript
npx tsc --noEmit
# Python
mypy src/
# Expected: Zero type errors

# Check formatting
npx prettier --check "src/**/*.{ts,tsx}"
# Or auto-fix
npx prettier --write "src/**/*.{ts,tsx}"

# Static security analysis
# Semgrep is installed via pip, Homebrew, or Docker rather than npm
semgrep --config=auto src/
# Or for Python
bandit -r src/
# Check for:
# - Hardcoded secrets
# - SQL injection risks
# - XSS vulnerabilities
# - Insecure dependencies

# Layer 1: Rules-Based Verification
## Schema Validation
✅ plan.json validates
✅ All task files validate
## Linting
✅ 0 linting errors
⚠️ 3 warnings (non-blocking)
## Type Checking
✅ 0 type errors
## Formatting
✅ All files formatted correctly
## Security Scan (SAST)
✅ No critical vulnerabilities
⚠️ 1 medium: Weak password hashing rounds (bcrypt)
**Layer 1 Status**: ✅ PASS (0 critical issues)
**Issues to Address**: 1 medium security issue

# Run all tests with coverage
npm test -- --coverage --verbose
# Capture results
# - Tests passed/failed
# - Coverage metrics
# - Execution time

# Extract examples from SKILL.md
# Execute each example automatically
# Verify outputs match expected
# Target: ≥90% examples work

# Coverage Report
**Line Coverage**: 87% ✅ (gate: ≥80%)
**Branch Coverage**: 82% ✅
**Function Coverage**: 92% ✅
**Path Coverage**: 74% ⚠️ (below 80%)
**Gate Status**: PASS ✅ (line, branch, and function coverage ≥80%)
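
A coverage gate like the one reported here could be evaluated as in the minimal sketch below. The helper `checkCoverageGate` and its option names are hypothetical, and which metrics count toward the gate is an assumption: the report only states an explicit gate for line coverage.

```javascript
// Hypothetical helper: compares coverage percentages against a gate.
// `gated` lists the metrics that must meet the threshold; others are report-only.
function checkCoverageGate(coverage, { threshold = 80, gated = ["line"] } = {}) {
  const failures = gated
    .filter((metric) => !(coverage[metric] >= threshold)) // missing metrics also fail
    .map((metric) => `${metric}: ${coverage[metric]}% < ${threshold}%`);
  return { pass: failures.length === 0, failures };
}

// Figures taken from the coverage report.
const coverage = { line: 87, branch: 82, function: 92, path: 74 };

const lineOnly = checkCoverageGate(coverage);
const allMetrics = checkCoverageGate(coverage, {
  gated: ["line", "branch", "function", "path"],
});

console.log(lineOnly.pass);       // true
console.log(allMetrics.failures); // [ 'path: 74% < 80%' ]
```

Making the gated metrics explicit avoids the ambiguity of a blanket "all ≥80%" claim when one metric is tracked but not enforced.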
**Uncovered Code**:
- src/admin/legacy.ts: 23% (low priority)
- src/utils/deprecated.ts: 15% (deprecated, ok)

# Compare before/after
git diff main...feature --stat
# Run all tests
npm test
# Verify: No new failures (regression prevention)

# Run performance tests
npm run test:performance
# Check response times
# Verify: Within acceptable ranges

# Layer 2: Functional Verification
## Test Execution
✅ 245/245 tests passing (100%)
⏱️ Execution time: 8.3 seconds
## Coverage
✅ Line: 87% (gate: ≥80%)
✅ Branch: 82%
✅ Function: 92%
## Example Validation
✅ 18/20 examples work (90%)
❌ 2 examples fail (outdated)
## Regression
✅ All existing tests still pass
## Performance
✅ All endpoints <200ms
**Layer 2 Status**: ✅ PASS
**Issues**: 2 outdated examples (update docs)

# Generate screenshots of UI
npx playwright test --screenshot=on
# Or manually:
# Open application
# Capture screenshots of key views

# Compare against baselines (creates any that are missing)
npx playwright test --update-snapshots=missing
# Or use Percy/Chromatic for visual regression
npx percy upload screenshots/

# Visual Checklist
## Layout
- [ ] Components positioned correctly
- [ ] Spacing/margins match mockup
- [ ] Alignment proper
- [ ] No overlapping elements
## Styling
- [ ] Colors match design system
- [ ] Typography correct (fonts, sizes)
- [ ] Icons/images display properly
## Responsiveness
- [ ] Mobile view (320px-480px)
- [ ] Tablet view (768px-1024px)
- [ ] Desktop view (>1024px)

# Automated accessibility scan
npx @axe-core/cli http://localhost:3000
# Check WCAG compliance
npx pa11y http://localhost:3000
# Manual checks:
# - Keyboard navigation
# - Screen reader compatibility
# - Color contrast ratios

# Layer 3: Visual Verification
## Screenshot Comparison
✅ Login page matches mockup
✅ Dashboard layout correct
⚠️ Profile page: Avatar alignment off by 5px
## Responsiveness
✅ Mobile: All components visible
✅ Tablet: Layout adapts correctly
✅ Desktop: Full functionality
## Accessibility
✅ WCAG 2.1 AA compliance
✅ Keyboard navigation works
⚠️ 2 color contrast warnings (non-critical)
**Layer 3 Status**: ✅ PASS (minor issues acceptable)
**Issues**: Avatar alignment (cosmetic), contrast warnings

# Run integration test suite
npm test -- tests/integration/
# Verify components work together
# - Database ↔ API
# - API ↔ Frontend
# - Frontend ↔ User

# Data Flow Verification
**Flow 1: User Registration**
Frontend form → API endpoint → Validation → Database → Email service
✅ Data flows correctly
✅ No data loss
✅ Transactions atomic
**Flow 2: Authentication**
Login request → API → Database lookup → Token generation → Response
✅ Token generated correctly
✅ Session stored
✅ Response includes token

# Test all API endpoints
npm run test:api
# Verify:
# - All endpoints respond
# - Status codes correct
# - Response formats match spec
# - Error handling works

// Complete user journeys
// Assumes `api`, `userData`, `confirmLink`, and `credentials` are provided by the test setup.
test('Complete registration and login flow', async () => {
  // 1. Register new user
  const registerResponse = await api.post('/register', userData);
  expect(registerResponse.status).toBe(201);

  // 2. Confirm email
  const confirmResponse = await api.get(confirmLink);
  expect(confirmResponse.status).toBe(200);

  // 3. Login
  const loginResponse = await api.post('/login', credentials);
  expect(loginResponse.status).toBe(200);
  expect(loginResponse.data.token).toBeDefined();

  // 4. Access protected resource with the token from step 3
  const profileResponse = await api.get('/profile', {
    headers: { Authorization: `Bearer ${loginResponse.data.token}` }
  });
  expect(profileResponse.status).toBe(200);
});

# Check dependencies for known vulnerabilities
npm audit
# List outdated packages (major version bumps may include breaking changes)
npm outdated
# Verify integration with services
# - Database connection
# - Redis/cache
# - External APIs

# Layer 4: Integration Verification
## Component Integration
✅ 12/12 integration tests passing
✅ All components integrate correctly
## Data Flow
✅ All 5 data flows validated
✅ No data loss or corruption
## API Integration
✅ All 15 endpoints functional
✅ Response formats correct
✅ Error handling works
## E2E Workflows
✅ 8/8 user journeys complete successfully
✅ No workflow breaks
## Dependencies
✅ 0 critical vulnerabilities
⚠️ 2 moderate (non-blocking)
**Layer 4 Status**: ✅ PASS

const qualityAssessment = await task({
description: "Assess code quality holistically",
prompt: `Evaluate code quality in src/ and tests/.
DO NOT read implementation conversation history.
You have access to tools:
- Read files
- Execute tests
- Run linters
- Query database (if needed)
Assess 5 dimensions (score each /20):
1. CORRECTNESS (/20):
- Logic correctness
- Edge case handling
- Error handling completeness
- Security considerations
2. FUNCTIONALITY (/20):
- Meets all requirements
- User workflows work
- Performance acceptable
- No regressions
3. QUALITY (/20):
- Code maintainability
- Best practices followed
- Anti-patterns avoided
- Documentation complete
4. INTEGRATION (/20):
- Components integrate smoothly
- API contracts correct
- Data flow works
- Backward compatible
5. SECURITY (/20):
- No vulnerabilities
- Input validation
- Authentication/authorization
- Data protection
TOTAL: /100 (sum of 5 dimensions)
For each dimension, provide:
- Score (/20)
- Strengths (what's good)
- Weaknesses (what needs improvement)
- Evidence (file:line references)
- Recommendations (specific, actionable)
Write comprehensive report to: quality-assessment.md`
});

// Spawn 3 independent quality assessors
const judges = await Promise.all([
  task({description: "Quality Judge 1", prompt: assessmentPrompt}),
  task({description: "Quality Judge 2", prompt: assessmentPrompt}),
  task({description: "Quality Judge 3", prompt: assessmentPrompt})
]);

// Helpers. Assumes each judge result exposes numeric per-dimension scores and a `total`.
const median = (xs) => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};
const sum = (xs) => xs.reduce((a, b) => a + b, 0);

// Aggregate scores: per-dimension median across judges
const dimensions = ["correctness", "functionality", "quality", "integration", "security"];
const scores = Object.fromEntries(
  dimensions.map((d) => [d, median(judges.map((j) => j[d]))])
);
const totalScore = sum(Object.values(scores)); // Total /100

// Check disagreement (range of the judges' totals, not statistical variance)
const totalScores = judges.map((j) => j.total);
const spread = Math.max(...totalScores) - Math.min(...totalScores);
if (spread > 15) {
  // High disagreement → spawn 2 more judges (total 5)
  // Use 5-agent ensemble for final score
}
// Final score: median of 3 or 5

# Scoring Calibration
## Correctness: 18/20 (Excellent)
**20**: Zero errors, all edge cases handled perfectly
**18**: Minor edge case missing, otherwise excellent ✅ (achieved)
**15**: 1-2 significant edge cases missing
**10**: Some logic errors present
**0**: Major functionality broken
**Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)
## Functionality: 19/20 (Excellent)
[Similar rubric with evidence]
## Quality: 17/20 (Good)
[Similar rubric with evidence]
## Integration: 18/20 (Excellent)
[Similar rubric with evidence]
## Security: 16/20 (Good)
[Similar rubric with evidence]
**Total**: 88/100 ⚠️ (Below ≥90 gate)

# Quality Gap Analysis
**Current Score**: 88/100
**Target**: ≥90/100
**Gap**: 2 points
## Critical Gaps (Blocking Approval)
None
## High Priority (Should Fix for ≥90)
1. **Security: Weak bcrypt rounds**
- **What**: bcrypt uses a cost factor of 10 (low for current hardware)
- **Where**: src/auth/hash.ts:15
- **Why**: A cost factor of 12 or higher is commonly recommended
- **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
- **Priority**: High
- **Impact**: +2 points → 90/100
## Medium Priority
1. **Quality: Missing JSDoc for 3 functions**
- Impact: +1 point → 91/100
**Recommendation**: Fix high priority issue to reach ≥90 threshold
**Estimated Effort**: 15 minutes

# Layer 5: Quality Scoring Report
## Executive Summary
**Total Score**: 88/100 ⚠️ (Below ≥90 gate)
**Status**: NEEDS MINOR REVISION
## Dimension Scores
- Correctness: 18/20 ⭐⭐⭐⭐⭐
- Functionality: 19/20 ⭐⭐⭐⭐⭐
- Quality: 17/20 ⭐⭐⭐⭐
- Integration: 18/20 ⭐⭐⭐⭐⭐
- Security: 16/20 ⭐⭐⭐⭐
## Strengths
1. Comprehensive test coverage (87%)
2. All functionality working correctly
3. Clean integration with all components
4. Good error handling
## Weaknesses
1. Bcrypt rounds below current standard (security)
2. Missing documentation for helper functions (quality)
3. One timezone edge case not handled (correctness)
## Recommendations (Prioritized)
### Priority 1 (High - Needed for ≥90)
1. Increase bcrypt rounds: 10 → 12
- File: src/auth/hash.ts:15
- Effort: 5 min
- Impact: +2 points
### Priority 2 (Medium - Nice to Have)
1. Add JSDoc to helper functions
- Files: src/utils/validation.ts
- Effort: 30 min
- Impact: +1 point
2. Handle timezone DST edge case
- File: src/auth/tokens.ts:78
- Effort: 20 min
- Impact: +1 point
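
The prioritization in this report (apply the highest-priority fixes until the ≥90 gate is reached) can be sketched as follows. The helper `planFixes` and the shape of the fix objects are hypothetical, and the sketch assumes each fix's point impact is an estimate that simply adds to the current score.

```javascript
// Hypothetical helper: select fixes in priority order until the target is reached.
// Assumes estimated impacts are additive, which a real re-verification may not confirm.
function planFixes(currentScore, target, fixes) {
  const order = { high: 0, medium: 1, low: 2 };
  const sorted = [...fixes].sort((a, b) => order[a.priority] - order[b.priority]);

  const selected = [];
  let projected = currentScore;
  for (const fix of sorted) {
    if (projected >= target) break; // stop once the gate is projected to pass
    selected.push(fix.name);
    projected += fix.impact;
  }
  return {
    gap: Math.max(0, target - currentScore),
    selected,
    projected,
    reachesTarget: projected >= target,
  };
}

// Figures taken from the report: 88/100 against a ≥90 gate.
const plan = planFixes(88, 90, [
  { name: "Add JSDoc to helper functions", priority: "medium", impact: 1 },
  { name: "Increase bcrypt rounds: 10 → 12", priority: "high", impact: 2 },
  { name: "Handle timezone DST edge case", priority: "medium", impact: 1 },
]);

console.log(plan.gap);       // 2
console.log(plan.selected);  // [ 'Increase bcrypt rounds: 10 → 12' ]
console.log(plan.projected); // 90
```

The projected score is only a planning aid; the actual score comes from re-running the judges after the fixes are applied.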
**Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90

Gate 1: Rules Pass ✅
↓ (Linting, types, schema, security)
Gate 2: Tests Pass ✅
↓ (All tests, coverage ≥80%)
Gate 3: Visual OK ✅
↓ (UI validated, a11y checked)
Gate 4: Integration OK ✅
↓ (E2E works, APIs integrate)
Gate 5: Quality ≥90 ✅
↓ (LLM-as-judge score ≥90/100)
✅ PRODUCTION APPROVED

Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass

// After implementation and testing complete
const verification = await task({
description: "Independent quality verification",
prompt: `Verify code quality independently.
DO NOT read prior conversation history.
Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md
Verify against specifications ONLY (not implementation decisions).
Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks
Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});

## Independence Audit
**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
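
The expected behaviors and warning signs above could be checked mechanically, as in this minimal sketch. The function `auditIndependence` and every field on the `report` object are hypothetical; in particular, `echoesImplementerRationale` would have to be set by a reviewer or a separate check, since it is not something a simple counter can detect.

```javascript
// Hypothetical audit of a verifier run. All field names are assumptions for illustration.
function auditIndependence(report) {
  const warnings = [];
  if (report.issuesFound === 0) {
    warnings.push("found 0 issues (possible rubber stamp)");
  }
  if (report.toolCalls === 0) {
    warnings.push("did not use tools");
  }
  if (!report.citesSpecifications) {
    warnings.push("did not reference specifications");
  }
  if (report.echoesImplementerRationale) {
    warnings.push("parrots implementation justifications");
  }
  return { independent: warnings.length === 0, warnings };
}

const healthy = auditIndependence({
  issuesFound: 2,
  toolCalls: 14,
  citesSpecifications: true,
  echoesImplementerRationale: false,
});
const suspicious = auditIndependence({
  issuesFound: 0,
  toolCalls: 0,
  citesSpecifications: true,
  echoesImplementerRationale: false,
});

console.log(healthy.independent);        // true
console.log(suspicious.warnings.length); // 2
```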
**If Warning**: Re-verify with stronger independence prompt

| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
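
The five gates and the fail, fix, re-verify loop described earlier can be sketched as a small runner. This is an illustration only: `runGates`, the gate object shape, `maxRounds`, and `applyFixes` are all hypothetical names, and real checks would invoke the linters, test runners, and judges from each layer.

```javascript
// Hypothetical gate runner. Each gate is { name, check }, where check() resolves to
// { pass, issues }. Gates run in order; the first failure triggers fixes and a full re-run.
async function runGates(gates, { maxRounds = 3, applyFixes = async () => {} } = {}) {
  for (let round = 1; round <= maxRounds; round++) {
    let failed = null;
    for (const gate of gates) {
      const result = await gate.check();
      if (!result.pass) {
        failed = { gate: gate.name, issues: result.issues };
        break; // later gates are skipped until this one passes
      }
    }
    if (!failed) return { approved: true, rounds: round };
    await applyFixes(failed); // gap analysis and fixes would happen here
  }
  return { approved: false, rounds: maxRounds };
}

// Simulated run: the first four gates pass, the quality score starts at 88.
let score = 88;
const ok = async () => ({ pass: true, issues: [] });
const gates = [
  { name: "Rules", check: ok },
  { name: "Tests", check: ok },
  { name: "Visual", check: ok },
  { name: "Integration", check: ok },
  {
    name: "Quality",
    check: async () => ({
      pass: score >= 90,
      issues: score >= 90 ? [] : [`score ${score} < 90`],
    }),
  },
];

runGates(gates, { applyFixes: async () => { score += 2; } })
  .then((result) => console.log(result)); // { approved: true, rounds: 2 }
```

Re-running every gate after a fix, rather than only the one that failed, guards against a fix at a higher layer introducing a regression at a lower one. The `maxRounds` cap is an added assumption to keep the loop from running indefinitely.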