gan-style-harness
GAN-Style Harness Skill
Inspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)
A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.
Core Insight
When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.
This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.
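The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_loop`, `toy_generate`, and `toy_evaluate` are hypothetical names, the toy functions stand in for real model calls, and the defaults (threshold 7.0, at most 15 iterations) are the ones this document gives under Scoring.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: the generator maps (spec, feedback) to a build,
# the evaluator maps a build to (score, feedback).
Generator = Callable[[str, Optional[str]], str]
Evaluator = Callable[[str], Tuple[float, str]]


def run_loop(spec: str, generate: Generator, evaluate: Evaluator,
             pass_threshold: float = 7.0,
             max_iterations: int = 15) -> Tuple[str, float, int]:
    """Alternate generation and evaluation until the score passes or the budget runs out."""
    feedback: Optional[str] = None
    build, score = "", 0.0
    for iteration in range(1, max_iterations + 1):
        build = generate(spec, feedback)   # the generator never grades itself
        score, feedback = evaluate(build)  # the evaluator never edits the build
        if score >= pass_threshold:
            return build, score, iteration
    return build, score, max_iterations


# Toy stand-ins so the sketch runs without any model calls.
def toy_generate(spec: str, feedback: Optional[str]) -> str:
    return spec if feedback is None else feedback + "+fix"


def toy_evaluate(build: str) -> Tuple[float, str]:
    return 5.0 + build.count("+fix"), build


result = run_loop("app", toy_generate, toy_evaluate)
print(result)  # ('app+fix+fix', 7.0, 3)
```

The point of the shape is that the two roles only communicate through the build and the feedback, so the evaluator can be made strict without touching the generator.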
When to Use
- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output
When NOT to Use
- Quick single-file fixes (use standard `claude -p`)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use TDD workflow)
Architecture
              ┌─────────────┐
              │   PLANNER   │
              │  (Opus 4.6) │
              └──────┬──────┘
                     │ Product Spec
                     │ (features, sprints, design direction)
                     ▼
        ┌────────────────────────┐
        │                        │
        │  GENERATOR-EVALUATOR   │
        │     FEEDBACK LOOP      │
        │                        │
        │  ┌──────────┐          │
        │  │GENERATOR │──build──▶│──┐
        │  │(Opus 4.6)│          │  │
        │  └────▲─────┘          │  │
        │       │                │  │ live app
        │    feedback            │  │
        │       │                │  │
        │  ┌────┴─────┐          │  │
        │  │EVALUATOR │◀─test────│──┘
        │  │(Opus 4.6)│          │
        │  │+Playwright│         │
        │  └──────────┘          │
        │                        │
        │    5-15 iterations     │
        └────────────────────────┘

The Three Agents
1. Planner Agent
Role: Product manager — expands a brief prompt into a full product specification.
Key behaviors:
- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately ambitious — conservative planning leads to underwhelming results
- Produces evaluation criteria that the Evaluator will use later
Model: Opus 4.6 (needs deep reasoning for spec expansion)
2. Generator Agent
Role: Developer — implements features according to the spec.
Key behaviors:
- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in next iteration
Model: Opus 4.6 (needs strong coding capability)
3. Evaluator Agent
Role: QA engineer — tests the live running application, not just code.
Key behaviors:
- Uses Playwright MCP to interact with the live application
- Clicks through features, fills forms, tests API endpoints
- Scores against four criteria (configurable):
  - Design Quality: Does it feel like a coherent whole?
  - Originality: Custom decisions vs. template/AI patterns?
  - Craft: Typography, spacing, animations, micro-interactions?
  - Functionality: Do all features actually work?
- Returns structured feedback with scores and specific issues
- Is engineered to be ruthlessly strict — never praises mediocre work
Model: Opus 4.6 (needs strong judgment + tool use)
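One way to picture the "structured feedback with scores and specific issues" that the Evaluator returns is a small record that renders to a markdown file for the Generator to read. The `Feedback` class and its layout are hypothetical; the document does not specify a feedback schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feedback:
    """Hypothetical container for one evaluation pass."""
    iteration: int
    scores: Dict[str, int]                      # criterion name -> score on the 1-10 scale
    issues: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render scores and issues as plain markdown lists."""
        lines = [f"Feedback (iteration {self.iteration})", "", "Scores:"]
        lines += [f"- {name}: {score}/10" for name, score in self.scores.items()]
        lines += ["", "Issues:"]
        lines += [f"- {issue}" for issue in self.issues] or ["- none recorded"]
        return "\n".join(lines) + "\n"


fb = Feedback(iteration=2, scores={"Craft": 5}, issues=["Modal cannot be closed"])
print(fb.to_markdown())
```

Keeping issues as concrete, testable statements ("Modal cannot be closed") rather than general impressions gives the Generator something it can act on in the next iteration.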
Evaluation Criteria
The default four criteria, each scored 1-10:
```markdown
## Evaluation Rubric

### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work

### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no personality
- 4-6: Some custom choices, mostly standard patterns
- 7-8: Clear creative vision, unique approach
- 9-10: Surprising, delightful, genuinely novel

### Craft (weight: 0.3)
- 1-3: Broken layouts, missing states, no animations
- 4-6: Works but feels rough, inconsistent spacing
- 7-8: Polished, smooth transitions, responsive
- 9-10: Pixel-perfect, delightful micro-interactions

### Functionality (weight: 0.2)
- 1-3: Core features broken or missing
- 4-6: Happy path works, edge cases fail
- 7-8: All features work, good error handling
- 9-10: Bulletproof, handles every edge case
```

Scoring
- Weighted score = sum of (criterion_score * weight)
- Pass threshold = 7.0 (configurable)
- Max iterations = 15 (configurable, typically 5-15 sufficient)
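The scoring rule is a plain weighted sum, which can be sketched as follows. The weights and the 7.0 threshold come from the rubric and scoring rules in this document; the function names, dictionary keys, and the rounding step are illustrative assumptions.

```python
# Default weights from the rubric; they sum to 1.0.
WEIGHTS = {"design_quality": 0.3, "originality": 0.2, "craft": 0.3, "functionality": 0.2}


def weighted_score(scores, weights=WEIGHTS):
    """Sum of criterion_score * weight over every weighted criterion."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


def passes(scores, threshold=7.0):
    # Round before comparing so floating-point noise cannot fail a score
    # that sits exactly on the threshold (an assumption, not a documented rule).
    return round(weighted_score(scores), 2) >= threshold


example = {"design_quality": 8, "originality": 6, "craft": 7, "functionality": 9}
print(round(weighted_score(example), 2))  # 7.5
print(passes(example))                    # True
```

Here 8 × 0.3 + 6 × 0.2 + 7 × 0.3 + 9 × 0.2 = 2.4 + 1.2 + 2.1 + 1.8 = 7.5, which clears the default threshold even though Originality alone would not.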
Usage
Via Command
```bash
# Full three-agent harness
/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

# With custom config
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

# Frontend design mode (generator + evaluator only, no planner)
/project:gan-design "Create a landing page for a crypto portfolio tracker"
```

Via Shell Script
```bash
# Basic usage
./scripts/gan-harness.sh "Build a music streaming dashboard"

# With options (the trailing backslashes keep the variables on the same
# command, so they are passed into the script's environment)
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"
```

Via Claude Code (Manual)
```bash
# Step 1: Plan
claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

# Step 2: Generate (iteration 1)
claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

# Step 3: Evaluate (iteration 1)
claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

# Step 4: Generate (iteration 2, reads feedback)
claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

# Repeat steps 3-4 until the pass threshold is met
```
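Repeating steps 3 and 4 by hand gets tedious, so the per-iteration commands can be generated instead. The sketch below only builds argument lists and does not run anything; the prompts and flags are copied from the manual steps, while the helper names and the zero-padded numbering of later feedback files are assumptions extrapolated from `feedback-001.md`.

```python
from typing import List


def feedback_file(iteration: int) -> str:
    """Zero-padded file name, following the feedback-001.md pattern."""
    return f"feedback-{iteration:03d}.md"


def evaluator_command(iteration: int, url: str = "http://localhost:3000") -> List[str]:
    """Argument list for one evaluation pass (step 3)."""
    prompt = (
        "You are an Evaluator. Read EVALUATOR_PROMPT.md. "
        f"Test the live app at {url}. Score against the rubric. "
        f"Write feedback to {feedback_file(iteration)}"
    )
    return ["claude", "-p", "--model", "opus",
            "--allowedTools", "Read,Bash,mcp__playwright__*", prompt]


def generator_command(iteration: int) -> List[str]:
    """Argument list for one generation pass (step 2 or step 4)."""
    if iteration == 1:
        prompt = ("You are a Generator. Read spec.md. Implement Sprint 1. "
                  "Start the dev server on port 3000.")
    else:
        # Later iterations read the feedback written by the previous evaluation.
        prompt = (f"You are a Generator. Read spec.md and {feedback_file(iteration - 1)}. "
                  "Address all issues. Improve the scores.")
    return ["claude", "-p", "--model", "opus", prompt]


print(generator_command(2)[-1])
```

Each list can be handed to `subprocess.run(cmd, check=True)`; passing arguments as a list avoids shell quoting problems with the long prompts.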
Evolution Across Model Capabilities
The harness should simplify as models improve. Following Anthropic's evolution:
Stage 1 — Weaker Models (Sonnet-class)
- Full sprint decomposition required
- Context resets between sprints (avoid context anxiety)
- 2-agent minimum: Initializer + Coding Agent
- Heavy scaffolding compensates for model limitations
Stage 2 — Capable Models (Opus 4.5-class)
- Full 3-agent harness: Planner + Generator + Evaluator
- Sprint contracts before each implementation phase
- 10-sprint decomposition for complex apps
- Context resets still useful but less critical
Stage 3 — Frontier Models (Opus 4.6-class)
- Simplified harness: single planning pass, continuous generation
- Evaluation reduced to single end-pass (model is smarter)
- No sprint structure needed
- Automatic compaction handles context growth
Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `GAN_MAX_ITERATIONS` | `15` | Maximum generator-evaluator cycles |
| `GAN_PASS_THRESHOLD` | `7.0` | Weighted score to pass (1-10) |
| | | Model for planning agent |
| | | Model for generator agent |
| | | Model for evaluator agent |
| `GAN_EVAL_CRITERIA` | | Comma-separated criteria |
| | | Port for the live app |
| | | Command to start dev server |
| | | Project working directory |
| | | Skip planner, use spec directly |
| | | |
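A harness script would typically read these variables once at startup. The sketch below covers only the three variable names that appear elsewhere in this document (`GAN_MAX_ITERATIONS`, `GAN_PASS_THRESHOLD`, `GAN_EVAL_CRITERIA`). The defaults of 15 and 7.0 come from the scoring rules; the default criteria string, `HarnessConfig`, and `load_config` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class HarnessConfig:
    max_iterations: int
    pass_threshold: float
    eval_criteria: List[str]


def load_config(env: Mapping[str, str] = os.environ) -> HarnessConfig:
    """Read harness settings from a mapping of environment variables."""
    # Assumed default: the four rubric criteria, as lowercase identifiers.
    criteria = env.get("GAN_EVAL_CRITERIA", "design_quality,originality,craft,functionality")
    return HarnessConfig(
        max_iterations=int(env.get("GAN_MAX_ITERATIONS", "15")),
        pass_threshold=float(env.get("GAN_PASS_THRESHOLD", "7.0")),
        eval_criteria=[c.strip() for c in criteria.split(",") if c.strip()],
    )


custom = load_config({
    "GAN_MAX_ITERATIONS": "10",
    "GAN_PASS_THRESHOLD": "7.5",
    "GAN_EVAL_CRITERIA": "functionality,performance,security",
})
print(custom)
```

Taking the environment as a parameter keeps the function easy to test without mutating `os.environ`.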
Evaluation Modes
| Mode | Tools | Best For |
|---|---|---|
| | Browser MCP + live interaction | Full-stack apps with UI |
| | Screenshot + visual analysis | Static sites, design-only |
| | Tests + linting + build | APIs, libraries, CLI tools |
Anti-Patterns
- Evaluator too lenient: If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
- Generator ignoring feedback: Ensure feedback is passed as a file, not inline. The generator should read `feedback-NNN.md` at the start of each iteration.
- Infinite loops: Always set `GAN_MAX_ITERATIONS`. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
- Evaluator testing superficially: The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
- Evaluator praising its own fixes: Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
- Context exhaustion: For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
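The plateau rule from the infinite-loops item can be checked mechanically. In this sketch the three-iteration window comes from the text above, while `has_plateaued` and the `min_gain` tolerance of 0.1 are hypothetical choices.

```python
from typing import Sequence


def has_plateaued(scores: Sequence[float], patience: int = 3, min_gain: float = 0.1) -> bool:
    """True when the last `patience` scores failed to beat the earlier best by `min_gain`."""
    if len(scores) <= patience:
        return False  # not enough history to compare against
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_gain


print(has_plateaued([5.0, 6.2, 6.2, 6.1, 6.2]))  # True: three rounds without progress
print(has_plateaued([5.0, 5.5, 6.0, 6.5, 7.0]))  # False: still improving
```

When it returns true, the harness would stop and flag the run for human review rather than spend the remaining iterations.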
Results: What to Expect
Based on Anthropic's published results:
| Metric | Solo Agent | GAN Harness | Improvement |
|---|---|---|---|
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |
The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. This harness is for projects where quality justifies that spend.
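The multipliers in the table follow directly from its Time and Cost rows, as this short check shows. The figures are the ones in the table; the variable names are illustrative.

```python
# Figures from the results table.
solo_minutes, solo_cost = 20, 9
harness_minutes = (4 * 60, 6 * 60)   # 4-6 hours
harness_cost = (125, 200)            # dollars

time_ratio = tuple(m / solo_minutes for m in harness_minutes)
cost_ratio = tuple(round(c / solo_cost, 1) for c in harness_cost)

print(time_ratio)  # (12.0, 18.0)
print(cost_ratio)  # (13.9, 22.2)
```

A figure of about 20x therefore describes the upper end of the range; the low end is closer to 12-14x.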
References
- Anthropic: Harness Design for Long-Running Apps — Original paper by Prithvi Rajasekaran
- Epsilla: The GAN-Style Agent Loop — Architecture deconstruction
- Martin Fowler: Harness Engineering — Broader industry context
- OpenAI: Harness Engineering — OpenAI's parallel work