gan-style-harness

GAN-Style Harness Skill

A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.
Inspired by Anthropic's harness design for long-running application development (published March 24, 2026).

Core Insight

When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.
This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.

When to Use

  • Building complete applications from a one-line prompt
  • Frontend design tasks requiring high visual quality
  • Full-stack projects that need working features, not just code
  • Any task where "AI slop" aesthetics are unacceptable
  • Projects where you want to invest $50-200 for production-quality output

When NOT to Use

  • Quick single-file fixes (use standard `claude -p`)
  • Tasks with tight budget constraints (<$10)
  • Simple refactoring (use de-sloppify pattern instead)
  • Tasks that are already well-specified with tests (use TDD workflow)

Architecture

                    ┌─────────────┐
                    │   PLANNER   │
                    │  (Opus 4.6) │
                    └──────┬──────┘
                           │ Product Spec
                           │ (features, sprints, design direction)
              ┌────────────────────────┐
              │                        │
              │   GENERATOR-EVALUATOR  │
              │      FEEDBACK LOOP     │
              │                        │
              │  ┌──────────┐          │
              │  │GENERATOR │──build──▶│──┐
              │  │(Opus 4.6)│          │  │
              │  └────▲─────┘          │  │
              │       │                │  │ live app
              │    feedback            │  │
              │       │                │  │
              │  ┌────┴─────┐          │  │
              │  │EVALUATOR │◀──test───│──┘
              │  │(Opus 4.6)│          │
              │  │+Playwright│         │
              │  └──────────┘          │
              │                        │
              │   5-15 iterations      │
              └────────────────────────┘

The Three Agents

1. Planner Agent

Role: Product manager — expands a brief prompt into a full product specification.
Key behaviors:
  • Takes a one-line prompt and produces a 16-feature, multi-sprint specification
  • Defines user stories, technical requirements, and visual design direction
  • Is deliberately ambitious — conservative planning leads to underwhelming results
  • Produces evaluation criteria that the Evaluator will use later
Model: Opus 4.6 (needs deep reasoning for spec expansion)

2. Generator Agent

Role: Developer — implements features according to the spec.
Key behaviors:
  • Works in structured sprints (or continuous mode with newer models)
  • Negotiates a "sprint contract" with the Evaluator before writing code
  • Uses full-stack tooling: React, FastAPI/Express, databases, CSS
  • Manages git for version control between iterations
  • Reads Evaluator feedback and incorporates it in next iteration
Model: Opus 4.6 (needs strong coding capability)

3. Evaluator Agent

Role: QA engineer — tests the live running application, not just code.
Key behaviors:
  • Uses Playwright MCP to interact with the live application
  • Clicks through features, fills forms, tests API endpoints
  • Scores against four criteria (configurable):
    1. Design Quality — Does it feel like a coherent whole?
    2. Originality — Custom decisions vs. template/AI patterns?
    3. Craft — Typography, spacing, animations, micro-interactions?
    4. Functionality — Do all features actually work?
  • Returns structured feedback with scores and specific issues
  • Is engineered to be ruthlessly strict — never praises mediocre work
Model: Opus 4.6 (needs strong judgment + tool use)

Evaluation Criteria

The default four criteria, each scored 1-10:

Evaluation Rubric

Design Quality (weight: 0.3)

  • 1-3: Generic, template-like, "AI slop" aesthetics
  • 4-6: Competent but unremarkable, follows conventions
  • 7-8: Distinctive, cohesive visual identity
  • 9-10: Could pass for a professional designer's work

Originality (weight: 0.2)

  • 1-3: Default colors, stock layouts, no personality
  • 4-6: Some custom choices, mostly standard patterns
  • 7-8: Clear creative vision, unique approach
  • 9-10: Surprising, delightful, genuinely novel

Craft (weight: 0.3)

  • 1-3: Broken layouts, missing states, no animations
  • 4-6: Works but feels rough, inconsistent spacing
  • 7-8: Polished, smooth transitions, responsive
  • 9-10: Pixel-perfect, delightful micro-interactions

Functionality (weight: 0.2)

  • 1-3: Core features broken or missing
  • 4-6: Happy path works, edge cases fail
  • 7-8: All features work, good error handling
  • 9-10: Bulletproof, handles every edge case

Scoring

  • Weighted score = sum of (criterion_score * weight)
  • Pass threshold = 7.0 (configurable)
  • Max iterations = 15 (configurable; 5-15 is typically sufficient)
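
The scoring rule above can be sketched in shell. This is a minimal illustration, not the harness's actual implementation: the function names `weighted_score` and `passes_threshold` are hypothetical, the weights are the rubric defaults (0.3, 0.2, 0.3, 0.2), and `awk` is used only because plain shell arithmetic has no floating point.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the scoring rule; not part of the harness itself.

# weighted_score DESIGN ORIGINALITY CRAFT FUNCTIONALITY
# Prints the weighted score using the default rubric weights.
weighted_score() {
  awk -v d="$1" -v o="$2" -v c="$3" -v f="$4" \
    'BEGIN { printf "%.2f\n", d * 0.3 + o * 0.2 + c * 0.3 + f * 0.2 }'
}

# passes_threshold SCORE [THRESHOLD]
# Succeeds (exit 0) when SCORE >= THRESHOLD; the threshold defaults to 7.0.
passes_threshold() {
  awk -v s="$1" -v t="${2:-7.0}" 'BEGIN { exit !(s >= t) }'
}
```

For example, `weighted_score 8 6 7 9` prints `7.50`, which passes the default threshold, while `weighted_score 7 7 7 6` prints `6.80`, which does not.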

Usage

Via Command

Full three-agent harness

/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

With custom config

/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

Frontend design mode (generator + evaluator only, no planner)

/project:gan-design "Create a landing page for a crypto portfolio tracker"

Via Shell Script

Basic usage

./scripts/gan-harness.sh "Build a music streaming dashboard"

With options

GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"

Via Claude Code (Manual)

Step 1: Plan

claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

Step 2: Generate (iteration 1)

claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

Step 3: Evaluate (iteration 1)

claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

Step 4: Generate (iteration 2 — reads feedback)

claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

Repeat steps 3-4 until pass threshold met
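
Steps 3 and 4 can be repeated by a small driver loop. The sketch below is an illustration under stated assumptions rather than the contents of `scripts/gan-harness.sh`: the function names are hypothetical, and it assumes the Evaluator writes a line of the form `Weighted score: 6.4` into each feedback file, which the original does not specify. The `claude -p` flags are the ones used in the manual steps above.

```shell
#!/usr/bin/env bash
# Hypothetical driver for the generate/evaluate loop (a sketch, not gan-harness.sh).

# Ask the Generator to build, or to address the previous iteration's feedback.
run_generator() {
  local iter="$1" prev
  prev="$(printf 'feedback-%03d.md' "$((iter - 1))")"
  if [ -f "$prev" ]; then
    claude -p --model opus "You are a Generator. Read spec.md and $prev. Address all issues. Improve the scores."
  else
    claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."
  fi
}

# Ask the Evaluator to test the live app, then print the score it recorded.
# ASSUMPTION: the feedback file contains a line like "Weighted score: 6.4".
run_evaluator() {
  local iter="$1" out
  out="$(printf 'feedback-%03d.md' "$iter")"
  claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" \
    "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to $out, including a line 'Weighted score: N'." >/dev/null
  sed -n 's/^Weighted score: *\([0-9.]*\).*/\1/p' "$out" | head -n 1
}

# Repeat until the score reaches the threshold or the iteration cap is hit.
run_loop() {
  local max_iter="${GAN_MAX_ITERATIONS:-15}"
  local threshold="${GAN_PASS_THRESHOLD:-7.0}"
  local i=1 score
  while [ "$i" -le "$max_iter" ]; do
    run_generator "$i"
    score="$(run_evaluator "$i")"
    echo "iteration $i: score ${score:-unknown}"
    # awk handles the floating-point comparison; a missing score counts as 0.
    if awk -v s="${score:-0}" -v t="$threshold" 'BEGIN { exit !(s >= t) }'; then
      echo "passed"
      return 0
    fi
    i=$((i + 1))
  done
  echo "max iterations reached"
  return 1
}
```

Calling `run_loop` after the planning step runs the cycle; a non-zero exit status means the iteration cap was reached without passing, which is a natural point to hand off to a human.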

Evolution Across Model Capabilities

The harness should simplify as models improve. Following Anthropic's evolution:

Stage 1 — Weaker Models (Sonnet-class)

  • Full sprint decomposition required
  • Context resets between sprints (avoid context anxiety)
  • 2-agent minimum: Initializer + Coding Agent
  • Heavy scaffolding compensates for model limitations

Stage 2 — Capable Models (Opus 4.5-class)

  • Full 3-agent harness: Planner + Generator + Evaluator
  • Sprint contracts before each implementation phase
  • 10-sprint decomposition for complex apps
  • Context resets still useful but less critical

Stage 3 — Frontier Models (Opus 4.6-class)

  • Simplified harness: single planning pass, continuous generation
  • Evaluation reduced to single end-pass (model is smarter)
  • No sprint structure needed
  • Automatic compaction handles context growth
Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `GAN_MAX_ITERATIONS` | `15` | Maximum generator-evaluator cycles |
| `GAN_PASS_THRESHOLD` | `7.0` | Weighted score to pass (1-10) |
| `GAN_PLANNER_MODEL` | `opus` | Model for planning agent |
| `GAN_GENERATOR_MODEL` | `opus` | Model for generator agent |
| `GAN_EVALUATOR_MODEL` | `opus` | Model for evaluator agent |
| `GAN_EVAL_CRITERIA` | `design,originality,craft,functionality` | Comma-separated criteria |
| `GAN_DEV_SERVER_PORT` | `3000` | Port for the live app |
| `GAN_DEV_SERVER_CMD` | `npm run dev` | Command to start dev server |
| `GAN_PROJECT_DIR` | `.` | Project working directory |
| `GAN_SKIP_PLANNER` | `false` | Skip planner, use spec directly |
| `GAN_EVAL_MODE` | `playwright` | `playwright`, `screenshot`, or `code-only` |
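
A script could pick up these variables with shell default expansion. The following is a sketch under that assumption, not the actual contents of `scripts/gan-harness.sh`: the function names are hypothetical, while the variable names, defaults, and allowed modes come from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical config loader (a sketch; only the GAN_* names come from the document).

# Fill in the documented default for every variable the caller left unset or empty.
load_gan_defaults() {
  : "${GAN_MAX_ITERATIONS:=15}"
  : "${GAN_PASS_THRESHOLD:=7.0}"
  : "${GAN_PLANNER_MODEL:=opus}"
  : "${GAN_GENERATOR_MODEL:=opus}"
  : "${GAN_EVALUATOR_MODEL:=opus}"
  : "${GAN_EVAL_CRITERIA:=design,originality,craft,functionality}"
  : "${GAN_DEV_SERVER_PORT:=3000}"
  : "${GAN_DEV_SERVER_CMD:=npm run dev}"
  : "${GAN_PROJECT_DIR:=.}"
  : "${GAN_SKIP_PLANNER:=false}"
  : "${GAN_EVAL_MODE:=playwright}"
}

# Reject evaluation modes that the table does not list.
validate_eval_mode() {
  case "$1" in
    playwright | screenshot | code-only) return 0 ;;
    *) echo "unknown GAN_EVAL_MODE: $1" >&2; return 1 ;;
  esac
}
```

Because `:=` only assigns when a variable is unset or empty, values supplied by the caller (for example `GAN_PASS_THRESHOLD=7.5`) survive `load_gan_defaults` unchanged.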

Evaluation Modes

| Mode | Tools | Best For |
|---|---|---|
| `playwright` | Browser MCP + live interaction | Full-stack apps with UI |
| `screenshot` | Screenshot + visual analysis | Static sites, design-only |
| `code-only` | Tests + linting + build | APIs, libraries, CLI tools |

Anti-Patterns

  1. Evaluator too lenient: If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
  2. Generator ignoring feedback: Ensure feedback is passed as a file, not inline. The generator should read `feedback-NNN.md` at the start of each iteration.
  3. Infinite loops: Always set `GAN_MAX_ITERATIONS`. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
  4. Evaluator testing superficially: The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
  5. Evaluator praising its own fixes: Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
  6. Context exhaustion: For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
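
The plateau rule from item 3 can be checked mechanically. The sketch below is one possible reading, not the harness's own logic: the function name `plateaued` is hypothetical, and a plateau is interpreted as none of the last 3 scores beating the best score seen before them.

```shell
#!/usr/bin/env bash
# Hypothetical plateau check. ASSUMPTION: "plateau" means none of the last
# 3 scores beats the best score recorded before them.

# plateaued SCORE... -> exit 0 when the run has stalled, 1 otherwise.
plateaued() {
  local window=3
  # With 3 or fewer scores there is no earlier baseline to compare against.
  [ "$#" -gt "$window" ] || return 1
  printf '%s\n' "$@" | awk -v w="$window" '
    { s[NR] = $1 + 0 }   # force numeric values
    END {
      best = s[1]
      for (i = 2; i <= NR - w; i++) if (s[i] > best) best = s[i]
      for (i = NR - w + 1; i <= NR; i++) if (s[i] > best) exit 1
      exit 0
    }'
}
```

Passing the score history after each evaluation, for example `plateaued 5.0 6.0 6.0 5.8 6.0`, succeeds when the run has stalled, at which point the loop can stop and flag the project for human review.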

Results: What to Expect

Based on Anthropic's published results:
| Metric | Solo Agent | GAN Harness | Improvement |
|---|---|---|---|
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |
The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. This is for projects where quality matters.

References
