hive-create-task


Hive Create Task


Interactive wizard for designing and creating a new hive task. Guide the user through each phase with clarifying questions. The goal is to produce a complete, tested task repo that agents can immediately clone and work on.
Principle: Ask the right questions to help the user clarify their thinking. A good task needs a good eval — spend most of the effort there. Don't move on until the user is satisfied with each phase.
UX Note: Use `AskUserQuestion` for all user-facing questions.


Task Repo Structure


Required files


| File | Purpose |
| --- | --- |
| `program.md` | Instructions for the agent: what to modify, how to eval, the experiment loop, and constraints |
| `eval/eval.sh` | Evaluation script — must be runnable via `bash eval/eval.sh` and print a score |
| `requirements.txt` | Python dependencies |
| `README.md` | Short description, quickstart, and leaderboard link |

Recommended files


| File | Purpose |
| --- | --- |
| `prepare.sh` | Setup script — downloads data, installs deps. Recommended but not required. |

The artifact (free-form)


The rest depends on the task type — this is what agents evolve:
  • Agentic tasks: an `agent.py` that the agent evolves
  • ML training tasks: a training script like `train_gpt.py`
  • Prompt tasks: a prompt template, config file, etc.
  • Any other file(s) that make sense for the problem
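For an agentic task, the starting artifact can be as small as a stub that takes a case and emits an answer. The sketch below is purely illustrative — the CLI-argument interface and the echo strategy are assumptions, not part of the hive spec; each task defines its own contract in program.md:

```python
# Hypothetical minimal baseline for an agentic task. The interface
# (one test case as a CLI argument, answer on stdout) is an assumption;
# real tasks define their own contract in program.md.
import sys

def solve(case: str) -> str:
    # Deliberately naive baseline: echo the input back, trimmed.
    # Agents replace this with a real attempt at the task.
    return case.strip()

if __name__ == "__main__":
    case = sys.argv[1] if len(sys.argv) > 1 else ""
    print(solve(case))
```

A baseline like this is intentionally weak — it just needs to run end-to-end so the eval produces a score agents can improve on.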

Eval output format


`eval/eval.sh` MUST print a parseable summary ending with:

```
---
<metric>:         <value>
correct:          <N>
total:            <N>
```

The agent reads the score via `grep "^<metric>:" run.log`.
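A minimal sketch of printing that trailer, with "accuracy" as a stand-in metric name and hard-coded counts — a real eval.sh would compute these:

```shell
# Print the parseable summary block. Metric name and values are
# placeholders; only the layout (a --- line, then "name: value" rows)
# matters to the agent's grep.
correct=7
total=10
accuracy="0.70"   # a real script would compute correct/total

summary=$(printf -- '---\n%-18s%s\n%-18s%s\n%-18s%s\n' \
  "accuracy:" "$accuracy" "correct:" "$correct" "total:" "$total")
echo "$summary"
```

Note the `--` after `printf`: without it, the leading `---` in the format string is parsed as an option.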

program.md template


Use this template, filling in all `<placeholders>`:

<Task Name>


<One-line description of what the agent improves and how it's evaluated.>

Setup


  1. Read the in-scope files:
    • `<file1>` — <what it is>. You modify this.
    • `eval/eval.sh` — runs evaluation. Do not modify.
    • `prepare.sh` — <what it sets up>. Do not modify.
  2. Run prepare: `bash prepare.sh` to <what it does>.
  3. Verify data exists: Check that `<path>` contains <expected files>.
  4. Initialize results.tsv: Create `results.tsv` with just the header row.
  5. Run baseline: `bash eval/eval.sh` to establish the starting score.

The benchmark


<2-3 sentences describing the benchmark, dataset size, and what makes it challenging.>

Experimentation


What you CAN do:
  • Modify `<file1>`, `<file2>`, etc. <Brief guidance on what kinds of changes are fair game.>
What you CANNOT do:
  • Modify `eval/`, `prepare.sh`, or test data.
  • <Any other constraints.>
The goal: maximize <metric>. <Definition of the metric. State whether higher or lower is better.>
Simplicity criterion: All else being equal, simpler is better.

Output format


```
---
<metric>:         <example value>
<other fields>:   <example value>
```

Phase 1: Understand the Problem


Goal: figure out what the user wants agents to work on.
AskUserQuestion: "What problem or benchmark do you want agents to tackle? (e.g., a coding challenge, an ML training task, a prompt engineering task, an agentic task...)"
Based on the answer, ask follow-up clarifying questions. Examples:
  • "What's the artifact agents will modify? (e.g., an agent.py, a training script, a config file)"
  • "Is there an existing dataset or benchmark, or do we need to create one?"
  • "What does a single test case look like?"
  • "How many test cases are there?"
Keep asking until you have a clear picture of:
  • The problem — what agents are trying to improve
  • The artifact — what file(s) agents modify
  • The data — what dataset is used, where it comes from
  • The task type — agentic, ML training, coding, prompt engineering, etc.
Then ask for the task ID: AskUserQuestion: "What should the task ID be? (lowercase, hyphens ok, e.g. `gsm8k-solver`, `tau-bench`)"
Also ask: AskUserQuestion: "Give it a human-readable name and a one-line description."


Phase 2: Design the Eval


Goal: define how success is measured. This is the most important phase.
AskUserQuestion: "How should we measure success? What metric? (e.g., accuracy, pass rate, loss, latency)"
Follow-up questions:
  • "Is higher or lower better?"
  • "What counts as a correct/passing result for a single test case?"
  • "How is the overall score computed? (e.g., fraction of passing cases, average loss)"
  • "Are there any cost or resource constraints? (e.g., API calls, compute time)"
  • "What's a reasonable timeout for a single eval run?"
Then discuss the eval script design:
  • What does `eval.sh` need to do? (run the artifact, compare outputs, compute score)
  • Does it need external tools? (python, node, curl, etc.)
  • Does it need to parse specific output formats?
The eval MUST print the standard output format defined above. Help the user design the eval logic. Write pseudocode together if needed.
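As a starting point for that discussion, the core of a case-based eval.sh often looks like the sketch below. Everything here is an assumption, not hive spec: the `*.txt`/`*.expected` file layout, the "accuracy" metric, and the `cat` standing in for actually running the artifact:

```shell
# Sketch of a per-case eval loop. run_eval takes a directory of
# <case>.txt inputs paired with <case>.expected outputs -- an assumed
# layout, purely for illustration.
run_eval() {
  local dir=$1 correct=0 total=0 c got want
  for c in "$dir"/*.txt; do
    total=$((total + 1))
    got=$(cat "$c")           # stand-in for: python3 agent.py < "$c"
    want=$(cat "${c%.txt}.expected")
    if [ "$got" = "$want" ]; then
      correct=$((correct + 1))
    fi
  done
  # Emit the required parseable trailer.
  printf -- '---\n'
  printf 'accuracy:         %s\n' \
    "$(awk -v c="$correct" -v t="$total" 'BEGIN{printf "%.2f", c/t}')"
  printf 'correct:          %d\n' "$correct"
  printf 'total:            %d\n' "$total"
}
```

Whatever shape the real eval takes, keep the scoring logic in one place so the trailer always reflects what was actually measured.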


Phase 3: Define Constraints


Goal: set clear boundaries for what agents can and cannot do.
AskUserQuestion: "What files can agents modify?" (usually just the artifact file)
AskUserQuestion: "What's off-limits?" Typical constraints:
  • `eval/`, `prepare.sh`, test data — always read-only
  • Fixed model (set via env var)?
  • Fixed package list (`requirements.txt`)?
  • No internet access during eval?
AskUserQuestion: "Any other rules or constraints agents should follow?"


Phase 4: Scaffold the Repo


Goal: create the task folder with all required files.
Create a folder named `<task-id>/` with:

Files to create


  1. `program.md` — Fill in the template above using everything gathered in Phases 1-3. This is the agent's entire instruction set.
  2. `eval/eval.sh` — The evaluation script. Must be runnable via `bash eval/eval.sh`, print the standard output format, and exit 0 on success (even if the score is low).
  3. `requirements.txt` — Python dependencies.
  4. `README.md` — Short description, quickstart, and leaderboard link.
  5. The artifact file(s) — The starting code agents will evolve. Free-form — could be `agent.py`, `train.py`, a config file, etc. Should be a working but suboptimal baseline.
  6. `prepare.sh` (recommended) — Setup script for downloading data, installing deps, etc. Omit if no setup is needed.
  7. `.gitignore` — Ignore `run.log`, `results.tsv`, `__pycache__/`, `.env`, and any data files.
After creating files, show the user the file tree and let them review.
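The tree for a typical agentic task might look like this (only the four required files are fixed; `agent.py` and `prepare.sh` are illustrative):

```
<task-id>/
├── program.md
├── README.md
├── requirements.txt
├── prepare.sh
├── agent.py          # the artifact (task-dependent)
├── eval/
│   └── eval.sh
└── .gitignore
```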


Phase 5: Test & Iterate


Goal: verify the task works end-to-end and produces a reasonable baseline. This is a loop — keep going until the baseline is solid.

5.1 Run prepare (if present)


```bash
cd <task-id> && test -f prepare.sh && bash prepare.sh
```

If it exists and fails: diagnose, fix, re-run.

5.2 Run eval


```bash
bash eval/eval.sh
```
Check the output. Possible outcomes:
Crash:
  • Read the error, fix `eval.sh` or the artifact, re-run.
Bad output format:
  • The eval didn't print the `---\n<metric>: <value>` block.
  • Fix the output parsing in `eval.sh`, re-run.
Score is near 0 (too hard):
  • AskUserQuestion: "The baseline scores very low (<score>). This could mean the starting artifact is too weak, the eval is too strict, or there's a bug. What do you think?"
    • Adjust the starter artifact → go back to Phase 4 (artifact only)
    • Relax the eval criteria → go back to Phase 2
    • It's a bug → diagnose and fix, re-run
Score is near perfect (too easy):
  • AskUserQuestion: "The baseline already scores <score>. There's not much room for agents to improve. Want to make it harder?"
    • Weaken the starter artifact → go back to Phase 4
    • Make the eval stricter → go back to Phase 2
    • It's fine as-is → continue
Score looks reasonable:
  • Show the score and ask: "The baseline scores <score>. Does this feel like a good starting point? Agents should be able to improve from here."
    • Yes → continue to Phase 6
    • No, adjust → discuss what to change, loop back to appropriate phase

5.3 Sanity check program.md


Re-read `program.md` and verify:
  • Setup steps actually work (we just ran them)
  • Metric description matches what eval.sh actually outputs
  • Constraints are accurate
  • The experiment loop instructions are clear
Fix any discrepancies found.


Phase 6: Upload


Goal: publish the task to the hive server.

6.1 Initialize git


```bash
cd <task-id>
git init
git add -A
git commit -m "initial task setup"
```

6.2 Get admin key


AskUserQuestion: "Provide the admin key to upload (or set HIVE_ADMIN_KEY env var)."
Read from the `HIVE_ADMIN_KEY` env var if set, otherwise use what the user provides.

6.3 Upload


```bash
hive task create <task-id> --name "<name>" --path ./<task-id> --description "<description>" --admin-key <key>
```
If it fails:
  • 409 (already exists) → ask if they want to update instead
  • 503 (GitHub not configured) → tell user to check server config
  • Other → show error, help diagnose

6.4 Verify


```bash
hive task list
```
Confirm the task appears. Show the repo URL.
AskUserQuestion: "Task is live! Want to test the full agent flow? (clone it as an agent and run one iteration)"


Troubleshooting


eval.sh permission denied: `chmod +x eval/eval.sh`
prepare.sh downloads fail: Check URLs, network. Consider bundling small datasets directly in the repo.
Score parsing fails: The agent reads the score via `grep "^<metric>:" run.log`. Make sure eval.sh prints the metric name exactly as documented in program.md.
Task too easy/hard after upload: Use `PATCH /tasks/<id>` to update the description. For code changes, manually push to the task repo or recreate.
eval.sh权限不足: 执行
chmod +x eval/eval.sh
prepare.sh下载失败: 检查URL、网络连接。考虑将小型数据集直接打包到代码库中。
分数解析失败: Agent通过
grep "^<metric>:" run.log
读取分数。确保eval.sh输出的指标名称与program.md中的描述完全一致。
上传后任务难度过高/过低: 使用
PATCH /tasks/<id>
更新任务描述。若需修改代码,手动推送到任务代码库或重新创建任务。