Execute tasks through competitive multi-agent generation, meta-judge evaluation specification, multi-judge evaluation, and evidence-based synthesis
npx skill4agent add neolabhq/context-engineering-kit do-competitively

Phase 1: Competitive Generation with Self-Critique + Meta-Judge (IN PARALLEL)
┌─ Meta-Judge → Evaluation Specification YAML ───────────┐
Task ────┼─ Agent 2 → Draft → Critique → Revise → Solution B ───┐ │
├─ Agent 3 → Draft → Critique → Revise → Solution C ───┼─┤
└─ Agent 1 → Draft → Critique → Revise → Solution A ───┘ │
│
Phase 2: Multi-Judge Evaluation with Verification │
┌─ Judge 1 → Evaluate → Verify → Revise → Report A ─┐ │
├─ Judge 2 → Evaluate → Verify → Revise → Report B ─┼────┤
└─ Judge 3 → Evaluate → Verify → Revise → Report C ─┘ │
│
Phase 2.5: Adaptive Strategy Selection │
Analyze Consensus ───────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (return Phase 1) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 3: Evidence-Based Synthesis │ │
(Only if FULL_SYNTHESIS) │ │
Synthesizer ─────────────────────┴───────────────────────┴─→ Final Solution

Setup: create the reports directory with `mkdir -p .specs/reports`.

Judge reports follow the naming pattern `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`, where `{solution-name}` is derived from the output path (e.g. `users-api` for `specs/api/users.md`), `{YYYY-MM-DD}` is the run date, and `[1|2|3]` is the judge number. All reports are stored under `.specs/reports/`.

**Meta-judge prompt template:**

## Task
Generate an evaluation specification yaml for the following task. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementation artifacts.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task description from user}
## Context
{Any relevant codebase context, file paths, constraints}
## Artifact Type
{code | documentation | configuration | etc.}
## Number of Solutions
3 (competitive implementations to be compared)
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.

Use Task tool:
- description: "Meta-judge: {brief task summary}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"

Each generator writes its candidate to `{solution-file}.[a|b|c].[ext]`. `{solution-file}` is derived from the task (e.g. `users` for "create users.ts"), `[a|b|c]` is a unique identifier per sub-agent, and `[ext]` is the artifact's extension (e.g. `md`, `ts`).

**Generator prompt template:**

<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{define the expected output following the pattern {solution-file}.[a|b|c].[ext], based on the task description and context. Each [a|b|c] is a unique identifier per sub-agent. You MUST include the filename!!!}
</output>
Instructions:
Let's approach this systematically to produce the best possible solution.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Consider multiple approaches - what are the different ways to solve this?
3. Think through the tradeoffs step by step and choose the approach you believe is best
4. Implement it completely
5. Generate 5 verification questions about critical aspects
6. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
7. Revise solution:
- Fix identified issues
8. Explain what was changed and why

**Dispatch:** Message with 4 tool calls:
Tool call 1 (meta-judge):
- description: "Meta-judge: {brief task summary}"
- model: opus
- subagent_type: "sadd:meta-judge"
Tool call 2 (generator A):
- description: "Generate solution A: {brief task summary}"
- model: opus
Tool call 3 (generator B):
- description: "Generate solution B: {brief task summary}"
- model: opus
Tool call 4 (generator C):
- description: "Generate solution C: {brief task summary}"
- model: opus

Each judge writes its report to `.specs/reports/{solution-name}-{date}.[1|2|3].md`.

**Judge prompt template:**

You are evaluating {number} competitive solutions against an evaluation specification produced by the meta-judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Task
{task_description}
## Solutions
{list of paths to all candidate solutions}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

CRITICAL: NEVER provide the score threshold to judges. A judge MUST NOT know the scoring threshold, so that its evaluation is not biased!!!
**Dispatch:** Message with 3 tool calls (one per judge), model: opus.
### Phase 2.5: Adaptive Strategy Selection (Early Return)
**The orchestrator** (not a subagent) analyzes judge outputs to determine the optimal strategy.
#### Decision Logic
**Step 1: Parse structured headers from the judge replies**

Parse the VOTE and SCORES headers from each judge's reply (a minimal parsing-and-selection sketch follows the decision logic below).

CRITICAL: Do not read the report files themselves; they can overflow your context.
**Step 2: Check for unanimous winner**
Compare all three VOTE values:
- If Judge 1 VOTE = Judge 2 VOTE = Judge 3 VOTE (same solution):
- **Strategy: SELECT_AND_POLISH**
- **Reason:** Clear consensus - all three judges prefer the same solution
**Step 3: Check if all solutions are fundamentally flawed**
If no unanimous vote, calculate average scores:
1. Average Solution A scores: (Judge1_A + Judge2_A + Judge3_A) / 3
2. Average Solution B scores: (Judge1_B + Judge2_B + Judge3_B) / 3
3. Average Solution C scores: (Judge1_C + Judge2_C + Judge3_C) / 3
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0):
- **Strategy: REDESIGN**
- **Reason:** All solutions below quality threshold, fundamental approach issues
**Step 4: Default to full synthesis**
If none of the above conditions met:
- **Strategy: FULL_SYNTHESIS**
- **Reason:** Split decision with merit, synthesis needed to combine best elements
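For reference, the decision logic above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the command: the header format (`VOTE: Solution X`, `SCORES: A=.../5.0, ...`) is taken from the example reports at the end of this document, and the function and variable names are hypothetical.

```python
import re

# Illustrative sketch of Phase 2.5 strategy selection (not part of the command).
# Assumes each judge reply contains headers like:
#   VOTE: Solution A
#   SCORES: A=4.5/5.0, B=3.2/5.0, C=2.8/5.0
def select_strategy(judge_replies: list[str]) -> str:
    votes, scores = [], {"A": [], "B": [], "C": []}
    for reply in judge_replies:
        votes.append(re.search(r"VOTE:\s*Solution\s+([ABC])", reply).group(1))
        for name, value in re.findall(r"([ABC])=([\d.]+)/5\.0", reply):
            scores[name].append(float(value))

    # Step 2: unanimous winner -> SELECT_AND_POLISH
    if len(set(votes)) == 1:
        return "SELECT_AND_POLISH"

    # Step 3: every solution averages below 3.0 -> REDESIGN (return to Phase 1)
    averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    if all(avg < 3.0 for avg in averages.values()):
        return "REDESIGN"

    # Step 4: split decision with merit -> FULL_SYNTHESIS
    return "FULL_SYNTHESIS"
```

Applied to the caching example later in this document, the votes split as B, A, C and every average stays above 3.0, so the sketch returns FULL_SYNTHESIS.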
#### Strategy 1: SELECT_AND_POLISH
**When:** Clear winner (unanimous votes)
**Process:**
1. Select the winning solution as the base
2. Launch subagent to apply specific improvements from judge feedback
3. Cherry-pick 1-2 best elements from runner-up solutions
4. Document what was added and why
**Benefits:**
- Saves synthesis cost (simpler than full synthesis)
- Preserves proven quality of winning solution
- Focused improvements rather than full reconstruction
**Prompt template:**
```markdown
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's work through this step by step to polish the winning solution effectively.
1. Take the winning solution as your base (do NOT rewrite it)
2. First, carefully review all judge feedback to understand what needs improvement
3. Apply improvements based on judge feedback:
- Fix identified weaknesses
- Add missing elements judges noted
4. Next, examine the runner-up solutions for standout elements
5. Cherry-pick 1-2 specific elements from runners-up if judges praised them
6. Document changes made:
- What was changed and why
- What was added from other solutions
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
```

#### Strategy 2: REDESIGN

**When:** All solutions are fundamentally flawed (every average score below the quality threshold)

**Prompt template:**

You are analyzing why all solutions failed to meet quality standards and designing a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all candidate solutions}
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports with low scores}
</evaluation_reports>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
#### Strategy 3: FULL_SYNTHESIS

**When:** Split decision with merit (no unanimous winner, and not all solutions are below the quality threshold)

**Prompt template:**

You are synthesizing the best solution from competitive implementations and evaluations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all candidate solutions}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<output>
{define the expected output following the pattern solution.md, based on the task description and context. The result should be a complete solution to the task.}
</output>
Instructions:
Let's think through this synthesis step by step to create the best possible combined solution.
1. First, read all solutions and evaluation reports carefully
2. Map out the consensus:
- What strengths did multiple judges praise in each solution?
- What weaknesses did multiple judges criticize in each solution?
3. For each major component or section, think through:
- Which solution handles this best and why?
- Could a hybrid approach work better?
4. Create the best possible solution by:
- Copying text directly when one solution is clearly superior
- Combining approaches when a hybrid would be better
- Fixing all identified issues
- Preserving the best elements from each
5. Explain your synthesis decisions:
- What you took from each solution
- Why you made those choices
- How you addressed identified weaknesses
CRITICAL: Do not create something entirely new. Synthesize the best from what exists.

A run produces the following files:
- Candidate solutions: {solution-file}.[a|b|c].[ext]
- Evaluation reports: .specs/reports/{solution-name}-{date}.[1|2|3].md
- Final solution: {output_path}

## Execution Summary
Original Task: {task_description}
Strategy Used: {strategy} ({reason})
### Results
| Phase | Agents | Models | Status |
|-------------------------|--------|----------|-------------|
| Phase 1: Competitive Generation + Meta-Judge | 4 (3 generators + 1 meta-judge) | opus x 4 | [Complete / Failed] |
| Phase 2: Multi-Judge Evaluation | 3 | opus x 3 | [Complete / Failed] |
| Phase 2.5: Adaptive Strategy Selection | orchestrator | - | {strategy} |
| Phase 3: [Synthesis/Polish/Redesign] | [N] | [model] | [Complete / Failed] |
### Files Created

Final Solution:
- {output_path} - Synthesized production-ready solution
Candidate Solutions:
- {solution-file}.[a|b|c].[ext] (Score: [X.X]/5.0)
Evaluation Reports:
- .specs/reports/{solution-name}-{date}.[1|2|3].md (Vote: [Solution A/B/C])

### Synthesis Decisions
| Element | Source | Rationale |
|----------------------|------------------|-------------|
| [element] | Solution [B/A/C] | [rationale] |
/do-competitively "Design REST API for user management (CRUD + auth)" \
--output "specs/api/users.md" \
--criteria "RESTfulness,security,scalability,developer-experience"specs/api/users.a.mdspecs/api/users.b.mdspecs/api/users.c.md.specs/reports/users-api-2025-01-15.1.mdVOTE: Solution A
SCORES: A=4.5/5.0, B=3.2/5.0, C=2.8/5.0.specs/reports/users-api-2025-01-15.2.mdVOTE: Solution A
SCORES: A=4.3/5.0, B=3.5/5.0, C=2.6/5.0.specs/reports/users-api-2025-01-15.3.mdVOTE: Solution A
SCORES: A=4.6/5.0, B=3.0/5.0, C=2.9/5.0specs/api/users.md/do-competitively "Design caching strategy for high-traffic API" \
--output "specs/caching.md" \
--criteria "performance,memory-efficiency,simplicity,reliability"specs/caching.a.mdspecs/caching.b.mdspecs/caching.c.md.specs/reports/caching-2025-01-15.1.mdVOTE: Solution B
SCORES: A=3.8/5.0, B=4.2/5.0, C=3.9/5.0.specs/reports/caching-2025-01-15.2.mdVOTE: Solution A
SCORES: A=4.0/5.0, B=3.9/5.0, C=3.7/5.0.specs/reports/caching-2025-01-15.3.mdVOTE: Solution C
SCORES: A=3.6/5.0, B=4.0/5.0, C=4.1/5.0specs/caching.md/do-competitively "Design authentication system with social login" \
--output "specs/auth.md" \
--criteria "security,user-experience,maintainability"specs/auth.a.mdspecs/auth.b.mdspecs/auth.c.md.specs/reports/auth-2025-01-15.1.mdVOTE: Solution A
SCORES: A=2.5/5.0, B=2.2/5.0, C=2.3/5.0.specs/reports/auth-2025-01-15.2.mdVOTE: Solution B
SCORES: A=2.4/5.0, B=2.8/5.0, C=2.1/5.0.specs/reports/auth-2025-01-15.3.mdVOTE: Solution C
SCORES: A=2.6/5.0, B=2.5/5.0, C=2.3/5.0
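As a quick sanity check of the Step 3 averaging rule, the auth example's averages can be computed directly. This is an illustrative snippet only; the scores are copied from the three reports above.

```python
# Average each solution's three judge scores from the auth example (illustration only).
scores = {
    "A": [2.5, 2.4, 2.6],
    "B": [2.2, 2.8, 2.5],
    "C": [2.3, 2.1, 2.3],
}
averages = {name: round(sum(vals) / 3, 2) for name, vals in scores.items()}
print(averages)                                     # {'A': 2.5, 'B': 2.5, 'C': 2.23}
print(all(avg < 3.0 for avg in averages.values()))  # True -> REDESIGN
```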