gan-style-harness

GAN-Style Harness Skill

Inspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)

A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.

Core Insight

When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.

This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.

When to Use

Building complete applications from a one-line prompt
Frontend design tasks requiring high visual quality
Full-stack projects that need working features, not just code
Any task where "AI slop" aesthetics are unacceptable
Projects where you want to invest $50-200 for production-quality output

When NOT to Use

Quick single-file fixes (use standard `claude -p`)
Tasks with tight budget constraints (<$10)
Simple refactoring (use de-sloppify pattern instead)
Tasks that are already well-specified with tests (use TDD workflow)

Architecture

                    ┌─────────────┐
                    │   PLANNER   │
                    │  (Opus 4.6) │
                    └──────┬──────┘
                           │ Product Spec
                           │ (features, sprints, design direction)
                           ▼
              ┌────────────────────────┐
              │                        │
              │   GENERATOR-EVALUATOR  │
              │      FEEDBACK LOOP     │
              │                        │
              │  ┌──────────┐          │
              │  │GENERATOR │──build──▶│──┐
              │  │(Opus 4.6)│          │  │
              │  └────▲─────┘          │  │
              │       │                │  │ live app
              │    feedback            │  │
              │       │                │  │
              │  ┌────┴─────┐          │  │
              │  │EVALUATOR │◀──test───│──┘
              │  │(Opus 4.6)│          │
              │  │+Playwright│         │
              │  └──────────┘          │
              │                        │
              │   5-15 iterations      │
              └────────────────────────┘
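
The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness's actual implementation: `run_feedback_loop`, `generate`, and `evaluate` are hypothetical names, and the two callables stand in for whatever launches the Generator and Evaluator agents.

```python
from typing import Callable, Optional, Tuple


def run_feedback_loop(
    generate: Callable[[Optional[str]], None],
    evaluate: Callable[[], Tuple[float, str]],
    pass_threshold: float = 7.0,
    max_iterations: int = 15,
) -> Tuple[bool, int, float]:
    """Alternate generator and evaluator until the score passes or the budget runs out.

    Returns (passed, iterations_used, last_score).
    """
    feedback: Optional[str] = None  # no feedback exists before the first build
    score = 0.0
    for iteration in range(1, max_iterations + 1):
        generate(feedback)            # generator builds or revises the live app
        score, feedback = evaluate()  # evaluator tests the app, returns score and critique
        if score >= pass_threshold:
            return True, iteration, score
    return False, max_iterations, score
```

Passing the two agents in as callables keeps the control flow testable without starting a model or a browser; the generator only ever sees the evaluator's most recent critique.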

The Three Agents

1. Planner Agent

Role: Product manager — expands a brief prompt into a full product specification.

Key behaviors:

Takes a one-line prompt and produces a 16-feature, multi-sprint specification
Defines user stories, technical requirements, and visual design direction
Is deliberately ambitious — conservative planning leads to underwhelming results
Produces evaluation criteria that the Evaluator will use later

Model: Opus 4.6 (needs deep reasoning for spec expansion)

2. Generator Agent

Role: Developer — implements features according to the spec.

Key behaviors:

Works in structured sprints (or continuous mode with newer models)
Negotiates a "sprint contract" with the Evaluator before writing code
Uses full-stack tooling: React, FastAPI/Express, databases, CSS
Manages git for version control between iterations
Reads Evaluator feedback and incorporates it in the next iteration

Model: Opus 4.6 (needs strong coding capability)

3. Evaluator Agent

Role: QA engineer — tests the live running application, not just code.

Key behaviors:

Uses Playwright MCP to interact with the live application
Clicks through features, fills forms, tests API endpoints
Scores against four criteria (configurable):
1. Design Quality — Does it feel like a coherent whole?
2. Originality — Custom decisions vs. template/AI patterns?
3. Craft — Typography, spacing, animations, micro-interactions?
4. Functionality — Do all features actually work?
Returns structured feedback with scores and specific issues
Is engineered to be ruthlessly strict — never praises mediocre work

Model: Opus 4.6 (needs strong judgment + tool use)
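
The "structured feedback with scores and specific issues" could take a shape like the one below. The source does not specify a feedback format, so the `Feedback` class, its field names, and the layout produced by `to_markdown` are all assumptions made for illustration; only the `feedback-NNN.md` file naming comes from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feedback:
    iteration: int
    scores: Dict[str, int]  # criterion name -> score on the 1-10 scale
    issues: List[str] = field(default_factory=list)

    def filename(self) -> str:
        # Zero-padded to three digits, matching the feedback-NNN.md pattern.
        return f"feedback-{self.iteration:03d}.md"

    def to_markdown(self) -> str:
        # Hypothetical layout: a score list followed by an issue list.
        lines = [f"Feedback for iteration {self.iteration}", "", "Scores:"]
        lines += [f"- {name}: {score}/10" for name, score in self.scores.items()]
        lines += ["", "Issues:"]
        lines += [f"- {issue}" for issue in self.issues] or ["- (none reported)"]
        return "\n".join(lines) + "\n"
```

Writing the result to a file, rather than passing it inline, gives the Generator a stable artifact to read at the start of the next iteration.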

Evaluation Criteria

The default four criteria, each scored 1-10:

Evaluation Rubric

Design Quality (weight: 0.3)

1-3: Generic, template-like, "AI slop" aesthetics
4-6: Competent but unremarkable, follows conventions
7-8: Distinctive, cohesive visual identity
9-10: Could pass for a professional designer's work

Originality (weight: 0.2)

1-3: Default colors, stock layouts, no personality
4-6: Some custom choices, mostly standard patterns
7-8: Clear creative vision, unique approach
9-10: Surprising, delightful, genuinely novel

Craft (weight: 0.3)

1-3: Broken layouts, missing states, no animations
4-6: Works but feels rough, inconsistent spacing
7-8: Polished, smooth transitions, responsive
9-10: Pixel-perfect, delightful micro-interactions

Functionality (weight: 0.2)

1-3: Core features broken or missing
4-6: Happy path works, edge cases fail
7-8: All features work, good error handling
9-10: Bulletproof, handles every edge case

Scoring

Weighted score = sum of (criterion_score * weight)
Pass threshold = 7.0 (configurable)
Max iterations = 15 (configurable, typically 5-15 sufficient)
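
As a worked example of the scoring rule, the sketch below applies the default weights from the rubric above. The function names are hypothetical, and rounding to two decimals before comparing against the threshold is an added assumption: without it, floating-point error can push a run of all 7s slightly below 7.0.

```python
from typing import Dict

# Default weights taken from the rubric; they sum to 1.0.
DEFAULT_WEIGHTS = {"design": 0.3, "originality": 0.2, "craft": 0.3, "functionality": 0.2}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Sum of criterion_score * weight, rounded to two decimals."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(scores[name] * weight for name, weight in weights.items())
    return round(total, 2)  # assumption: absorb floating-point noise before comparing


def passes(scores: Dict[str, float], threshold: float = 7.0) -> bool:
    return weighted_score(scores) >= threshold


example = {"design": 8, "originality": 6, "craft": 7, "functionality": 9}
# 8*0.3 + 6*0.2 + 7*0.3 + 9*0.2 = 2.4 + 1.2 + 2.1 + 1.8 = 7.5
print(weighted_score(example), passes(example))
```

With these scores the weighted total is 7.5, which clears the default threshold of 7.0 and exactly meets a stricter 7.5 threshold.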

Usage

Via Command

Full three-agent harness

/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

With custom config

/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

Frontend design mode (generator + evaluator only, no planner)

/project:gan-design "Create a landing page for a crypto portfolio tracker"

Via Shell Script

Basic usage

./scripts/gan-harness.sh "Build a music streaming dashboard"

With options

GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"

Via Claude Code (Manual)

Step 1: Plan

claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

Step 2: Generate (iteration 1)

claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

Step 3: Evaluate (iteration 1)

claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

Step 4: Generate (iteration 2, reads feedback)

claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

Repeat steps 3-4 until pass threshold met
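
To avoid retyping the commands for each round, the manual steps above could be generated by a small helper. This sketch only builds and prints the command lines; it does not run them. The helper names are hypothetical, the prompts are copied from the steps above, and the only flags used are the ones shown there (`-p`, `--model`, `--allowedTools`).

```python
import shlex
from typing import List


def generator_command(iteration: int, model: str = "opus") -> List[str]:
    """Command for the Generator; from iteration 2 on it reads the previous feedback file."""
    if iteration == 1:
        prompt = ("You are a Generator. Read spec.md. Implement Sprint 1. "
                  "Start the dev server on port 3000.")
    else:
        previous = f"feedback-{iteration - 1:03d}.md"
        prompt = (f"You are a Generator. Read spec.md and {previous}. "
                  "Address all issues. Improve the scores.")
    return ["claude", "-p", "--model", model, prompt]


def evaluator_command(iteration: int, model: str = "opus", port: int = 3000) -> List[str]:
    """Command for the Evaluator; it writes feedback-NNN.md for the current iteration."""
    prompt = ("You are an Evaluator. Read EVALUATOR_PROMPT.md. "
              f"Test the live app at http://localhost:{port}. Score against the rubric. "
              f"Write feedback to feedback-{iteration:03d}.md")
    return ["claude", "-p", "--model", model,
            "--allowedTools", "Read,Bash,mcp__playwright__*", prompt]


if __name__ == "__main__":
    for i in (1, 2):
        print(shlex.join(generator_command(i)))  # shell-quoted, ready to paste
        print(shlex.join(evaluator_command(i)))
```

Keeping the iteration number in one place ensures the Generator in round N reads exactly the file the Evaluator wrote in round N-1.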

Evolution Across Model Capabilities

The harness should simplify as models improve. Following Anthropic's evolution:

Stage 1 — Weaker Models (Sonnet-class)

Full sprint decomposition required
Context resets between sprints (avoid context anxiety)
2-agent minimum: Initializer + Coding Agent
Heavy scaffolding compensates for model limitations

Stage 2 — Capable Models (Opus 4.5-class)

Full 3-agent harness: Planner + Generator + Evaluator
Sprint contracts before each implementation phase
10-sprint decomposition for complex apps
Context resets still useful but less critical

Stage 3 — Frontier Models (Opus 4.6-class)

Simplified harness: single planning pass, continuous generation
Evaluation reduced to single end-pass (model is smarter)
No sprint structure needed
Automatic compaction handles context growth

Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.

Configuration

Environment Variables

Variable	Default	Description
`GAN_MAX_ITERATIONS`	`15`	Maximum generator-evaluator cycles
`GAN_PASS_THRESHOLD`	`7.0`	Weighted score to pass (1-10)
`GAN_PLANNER_MODEL`	`opus`	Model for planning agent
`GAN_GENERATOR_MODEL`	`opus`	Model for generator agent
`GAN_EVALUATOR_MODEL`	`opus`	Model for evaluator agent
`GAN_EVAL_CRITERIA`	`design,originality,craft,functionality`	Comma-separated criteria
`GAN_DEV_SERVER_PORT`	`3000`	Port for the live app
`GAN_DEV_SERVER_CMD`	`npm run dev`	Command to start dev server
`GAN_PROJECT_DIR`	`.`	Project working directory
`GAN_SKIP_PLANNER`	`false`	Skip planner, use spec directly
`GAN_EVAL_MODE`	`playwright`	`playwright` , `screenshot` , or `code-only`

Evaluation Modes

Mode	Tools	Best For
`playwright`	Browser MCP + live interaction	Full-stack apps with UI
`screenshot`	Screenshot + visual analysis	Static sites, design-only
`code-only`	Tests + linting + build	APIs, libraries, CLI tools
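
One way a wrapper might read these settings is sketched below. It covers only a subset of the variables from the table above, uses the defaults listed there, and rejects an unknown `GAN_EVAL_MODE`. `HarnessConfig` and `load_config` are hypothetical names; the actual script may parse its environment differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

EVAL_MODES = ("playwright", "screenshot", "code-only")


@dataclass(frozen=True)
class HarnessConfig:
    max_iterations: int = 15
    pass_threshold: float = 7.0
    eval_criteria: Tuple[str, ...] = ("design", "originality", "craft", "functionality")
    dev_server_port: int = 3000
    skip_planner: bool = False
    eval_mode: str = "playwright"


def load_config(env: Optional[Mapping[str, str]] = None) -> HarnessConfig:
    """Read GAN_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    defaults = HarnessConfig()

    mode = env.get("GAN_EVAL_MODE", defaults.eval_mode)
    if mode not in EVAL_MODES:
        raise ValueError(f"GAN_EVAL_MODE must be one of {EVAL_MODES}, got {mode!r}")

    # Comma-separated list; surrounding whitespace and empty entries are dropped.
    raw_criteria = env.get("GAN_EVAL_CRITERIA", ",".join(defaults.eval_criteria))
    criteria = tuple(c.strip() for c in raw_criteria.split(",") if c.strip())

    return HarnessConfig(
        max_iterations=int(env.get("GAN_MAX_ITERATIONS", defaults.max_iterations)),
        pass_threshold=float(env.get("GAN_PASS_THRESHOLD", defaults.pass_threshold)),
        eval_criteria=criteria,
        dev_server_port=int(env.get("GAN_DEV_SERVER_PORT", defaults.dev_server_port)),
        skip_planner=env.get("GAN_SKIP_PLANNER", "false").strip().lower() == "true",
        eval_mode=mode,
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the parsing easy to exercise with a plain dictionary.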

Anti-Patterns

Evaluator too lenient — If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
Generator ignoring feedback — Ensure feedback is passed as a file, not inline. The generator should read `feedback-NNN.md` at the start of each iteration.
Infinite loops — Always set `GAN_MAX_ITERATIONS`. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
Evaluator testing superficially — The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
Evaluator praising its own fixes — Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
Context exhaustion — For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
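
The plateau rule from the "Infinite loops" item can be checked mechanically. The sketch below is one possible reading of it, not the harness's own logic: `has_plateaued` is a hypothetical helper that reports a plateau when none of the last three weighted scores beats the best score seen before them. The `min_gain` tolerance is an added assumption.

```python
from typing import Sequence


def has_plateaued(scores: Sequence[float], window: int = 3, min_gain: float = 0.0) -> bool:
    """True when the last `window` scores fail to beat the best earlier score.

    `scores` holds one weighted score per completed iteration, oldest first.
    """
    if len(scores) <= window:
        return False  # not enough history to compare against
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent <= best_before + min_gain


history = [5.0, 6.0, 6.0, 5.8, 6.0]
if has_plateaued(history):
    print("Score plateau: stop the loop and flag for human review")
```

In this example the best score before the last three rounds is 6.0 and none of the last three exceeds it, so the check fires.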

Results: What to Expect

Based on Anthropic's published results:

Metric	Solo Agent	GAN Harness	Improvement
Time	20 min	4-6 hours	12-18x longer
Cost	$9	$125-200	14-22x more
Quality	Barely functional	Production-ready	Phase change
Core features	Broken	All working	N/A
Design	Generic AI slop	Distinctive, polished	N/A

The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. This is for projects where quality matters.

References

Anthropic: Harness Design for Long-Running Apps — Original paper by Prithvi Rajasekaran
Epsilla: The GAN-Style Agent Loop — Architecture deconstruction
Martin Fowler: Harness Engineering — Broader industry context
OpenAI: Harness Engineering — OpenAI's parallel work
