Loading...
Loading...
GAN-inspired Generator-Evaluator agent harness for building high-quality applications autonomously. Based on Anthropic's March 2026 harness design paper.
npx skill4agent add affaan-m/everything-claude-code gan-style-harnessInspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)
When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.
claude -p ┌─────────────┐
│ PLANNER │
│ (Opus 4.6) │
└──────┬──────┘
│ Product Spec
│ (features, sprints, design direction)
▼
┌────────────────────────┐
│ │
│ GENERATOR-EVALUATOR │
│ FEEDBACK LOOP │
│ │
│ ┌──────────┐ │
│ │GENERATOR │──build──▶│──┐
│ │(Opus 4.6)│ │ │
│ └────▲─────┘ │ │
│ │ │ │ live app
│ feedback │ │
│ │ │ │
│ ┌────┴─────┐ │ │
│ │EVALUATOR │◀─test───│──┘
│ │(Opus 4.6)│ │
│ │+Playwright│ │
│ └──────────┘ │
│ │
│ 5-15 iterations │
└────────────────────────┘## Evaluation Rubric
### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work
### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no personality
- 4-6: Some custom choices, mostly standard patterns
- 7-8: Clear creative vision, unique approach
- 9-10: Surprising, delightful, genuinely novel
### Craft (weight: 0.3)
- 1-3: Broken layouts, missing states, no animations
- 4-6: Works but feels rough, inconsistent spacing
- 7-8: Polished, smooth transitions, responsive
- 9-10: Pixel-perfect, delightful micro-interactions
### Functionality (weight: 0.2)
- 1-3: Core features broken or missing
- 4-6: Happy path works, edge cases fail
- 7-8: All features work, good error handling
- 9-10: Bulletproof, handles every edge case# Full three-agent harness
/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"
# With custom config
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5
# Frontend design mode (generator + evaluator only, no planner)
/project:gan-design "Create a landing page for a crypto portfolio tracker"# Basic usage
./scripts/gan-harness.sh "Build a music streaming dashboard"
# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"# Step 1: Plan
claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"
# Step 2: Generate (iteration 1)
claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."
# Step 3: Evaluate (iteration 1)
claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"
# Step 4: Generate (iteration 2 — reads feedback)
claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."
# Repeat steps 3-4 until pass threshold metKey principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.
| Variable | Default | Description |
|---|---|---|
| | Maximum generator-evaluator cycles |
| | Weighted score to pass (1-10) |
| | Model for planning agent |
| | Model for generator agent |
| | Model for evaluator agent |
| | Comma-separated criteria |
| | Port for the live app |
| | Command to start dev server |
| | Project working directory |
| | Skip planner, use spec directly |
| | |
| Mode | Tools | Best For |
|---|---|---|
| Browser MCP + live interaction | Full-stack apps with UI |
| Screenshot + visual analysis | Static sites, design-only |
| Tests + linting + build | APIs, libraries, CLI tools |
feedback-NNN.mdGAN_MAX_ITERATIONS| Metric | Solo Agent | GAN Harness | Improvement |
|---|---|---|---|
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |