multi-agent-system-engineer
Multi-Agent System Engineer
When to Use
- Designing multi-agent topology: supervisor, hierarchical, peer-to-peer, or blackboard
- Decomposing work across specialized agents with routing, delegation, and merge rules
- Defining inter-agent message schemas, handoff payloads, and protocol boundaries
- Partitioning vs sharing state, scratchpads, artifacts, and consensus across agents
- Building fan-out/fan-in, DAG workflows, and synchronization barriers in agent graphs
- Resolving conflicts when agents disagree or duplicate work
- Engineering fault tolerance: retries, partial failure, compensation, and saga-style recovery
- Setting system-level budgets: tokens, latency, parallelism, and cost per workflow run
- Observability across agent traces, correlation IDs, and multi-step workflow debugging
- Testing and simulating multi-agent flows before production
- Deploying multi-agent runtimes on queues, durable workflows, and scaled workers
When NOT to Use
- Single-agent loop, tools, MCP, checkpoints, and one runtime only → `agentic-ai-developer`
- Foundation model training, fine-tuning, classical ML pipelines → `ai-engineer`, `ai-researcher`
- AI ops cadence, vendor contracts, rollout governance without system design → `ai-lead-ops`
- Internal developer platform, golden paths, portals, with no agent orchestration → `platform-engineer`
- Cross-team milestones, RAID, program status without agent architecture → `technical-program-manager`
- Corporate AI policy, risk tiering, model cards without system build → `ai-risk-governance`
- Pre-flight go/no-go or architecture review without implementing topology → `build-validator`
- Enterprise strategy, portfolio, and org design whiteboard only → `enterprise-strategist`
Related skills
| Need | Skill |
|---|---|
| Implement single-agent loop, tools, MCP, HITL, eval harness | |
| LLM apps, RAG, model routing, embedding strategy | |
| AI production ops, incidents, release gates | |
| Platform golden paths, IDP, developer portals | |
| Program delivery, dependencies, launch readiness | |
| Governance, risk tiers, policy mapping | |
| Independent architecture or build go/no-go | |
| Persistent memory stores and retrieval design | |
| Context packing and token budgeting per call | |
| Prompt templates and judge rubrics | |
Core Workflows
1. Frame the multi-agent system
- Define the end-to-end job, success metric, and SLA (latency, cost, quality)
- List agents by role (planner, executor, critic, specialist)—not by model name
- Choose topology and justify: supervisor, hierarchical, P2P, blackboard, or hybrid
- Map trust boundaries: which agent may call which tools and external systems
- Set system budgets: max parallel agents, tokens per run, wall time, dollars per task
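As a rough illustration of the budget and trust-boundary items above, the sketch below models system budgets as a frozen dataclass and trust boundaries as a deny-by-default allowlist. All class names, field names, role names, and tool names here are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemBudget:
    """Per-run limits; field names and any numbers used are illustrative assumptions."""
    max_parallel_agents: int
    max_tokens_per_run: int
    max_wall_time_s: float
    max_usd_per_task: float


@dataclass
class RunUsage:
    """What a single workflow run has consumed so far."""
    tokens: int = 0
    wall_time_s: float = 0.0
    usd: float = 0.0

    def violations(self, budget: SystemBudget) -> list[str]:
        """Return the name of every budget dimension this run has exceeded."""
        over = []
        if self.tokens > budget.max_tokens_per_run:
            over.append("tokens")
        if self.wall_time_s > budget.max_wall_time_s:
            over.append("wall_time")
        if self.usd > budget.max_usd_per_task:
            over.append("usd")
        return over


# Trust boundary map: role -> tools that role may call (hypothetical names).
TOOL_ALLOWLIST: dict[str, frozenset[str]] = {
    "planner": frozenset(),
    "executor": frozenset({"search", "code_runner"}),
    "critic": frozenset({"search"}),
}


def may_call(role: str, tool: str) -> bool:
    """Deny by default: unknown roles and unlisted tools are rejected."""
    return tool in TOOL_ALLOWLIST.get(role, frozenset())
```

Keying the allowlist by role rather than by model name keeps the boundary stable when the model behind a role changes.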
See `references/multi_agent_system_engineer_scope.md` for scope, deliverables, and boundaries vs `agentic-ai-developer`.
2. Topology, roles, and routing
ingress → router/supervisor → {workers} → reducer/merger → egress
Checklist:
- Each agent has one primary responsibility and explicit inputs/outputs
- Routing rules are deterministic where safety matters; LLM routing elsewhere is logged
- Fan-out has a matching fan-in with merge semantics (vote, concat, structured reduce)
- Dangerous tools are centralized or gated—not duplicated on every worker
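The routing and fan-in items in the checklist above can be sketched roughly as follows. This is a minimal illustration, assuming workers are plain callables and that majority vote is the chosen merge semantics; the route table, intents, and worker names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Worker = Callable[[str], str]

# Deterministic routing table: intent -> worker names (all names hypothetical).
ROUTES: dict[str, list[str]] = {
    "classify": ["worker_a", "worker_b", "worker_c"],
    "refund": ["payments_worker"],  # safety-sensitive: one fixed target
}


def route(intent: str) -> list[str]:
    """Look up workers for an intent; unknown intents fail loudly."""
    if intent not in ROUTES:
        raise KeyError(f"no route for intent {intent!r}")
    return ROUTES[intent]


def fan_out_fan_in(task: str, workers: dict[str, Worker], names: list[str]) -> str:
    """Run the named workers in parallel, then merge their answers by vote."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map preserves the order of `names` in its results.
        results = list(pool.map(lambda name: workers[name](task), names))
    # Fan-in: the most common answer wins; on a tie, the answer seen
    # first in worker order is returned.
    winner, _votes = Counter(results).most_common(1)[0]
    return winner
```

A structured reduce or concatenation could replace the vote in `fan_out_fan_in` without changing the fan-out half.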
See `references/agent_roles_topology_and_routing.md` for topology patterns and routing tables.
3. Protocols and messaging
- Define message envelope: `correlation_id`, `from`, `to`, `intent`, `payload`, `artifacts`, `constraints`
- Version schemas; reject unknown versions at boundaries
- Prefer structured payloads over free-text handoffs for machine agents
- Document idempotency keys for retried messages
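One possible shape for the envelope, version check, and idempotency handling listed above is sketched below. The `schema_version` and `idempotency_key` fields, the `parse_envelope` and `Inbox` names, and the string version scheme are assumptions made for this example, not part of the field list above.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumption: versions are plain strings and only "1" is supported here.
SUPPORTED_VERSIONS = frozenset({"1"})


@dataclass(frozen=True)
class Envelope:
    correlation_id: str
    sender: str  # the "from" field; renamed because `from` is a Python keyword
    to: str
    intent: str
    payload: dict[str, Any]
    artifacts: tuple[str, ...] = ()
    constraints: dict[str, Any] = field(default_factory=dict)
    schema_version: str = "1"  # hypothetical field
    idempotency_key: str = ""  # hypothetical field


def parse_envelope(raw: dict[str, Any]) -> Envelope:
    """Validate a raw message at the boundary and build a typed envelope."""
    version = raw.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return Envelope(
        correlation_id=raw["correlation_id"],
        sender=raw["from"],
        to=raw["to"],
        intent=raw["intent"],
        payload=raw["payload"],
        artifacts=tuple(raw.get("artifacts", ())),
        constraints=dict(raw.get("constraints", {})),
        schema_version=version,
        idempotency_key=raw["idempotency_key"],
    )


class Inbox:
    """Accepts each idempotency key once so retried messages are not reprocessed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, envelope: Envelope) -> bool:
        if envelope.idempotency_key in self._seen:
            return False  # duplicate delivery of a retried message
        self._seen.add(envelope.idempotency_key)
        return True
```

In a real deployment the seen-key set would live in a durable store rather than in process memory; the in-memory set here only keeps the sketch self-contained.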
See `references/inter_agent_protocols_and_messaging.md` for handoff contracts and A2A-style patterns.
4. State, coordination, and consensus
- Classify state: ephemeral scratchpad, workflow state, shared blackboard, durable store
- Partition tenant and thread keys on every read/write
- Use barriers or quorum when parallel agents must align before the next phase
- Resolve conflicts with explicit policy: supervisor wins, vote, or escalate to human
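The partitioning and conflict-policy items above might look roughly like this in code. The key format, the function names, and the order in which the policies are tried (supervisor, then quorum vote, then human escalation) are assumptions for illustration only.

```python
from collections import Counter
from typing import Optional


def state_key(tenant_id: str, thread_id: str, name: str) -> str:
    """Build a storage key that always carries tenant and thread partitions."""
    return f"{tenant_id}:{thread_id}:{name}"


def resolve(
    proposals: dict[str, str],
    quorum: int,
    supervisor: Optional[str] = None,
) -> tuple[str, Optional[str]]:
    """Return (policy_used, value) for a set of per-agent proposals.

    Assumed policy order: the supervisor wins if it proposed a value, then a
    quorum vote, otherwise escalate to a human with no value chosen.
    """
    if supervisor is not None and supervisor in proposals:
        return ("supervisor", proposals[supervisor])
    if proposals:
        value, votes = Counter(proposals.values()).most_common(1)[0]
        if votes >= quorum:
            return ("vote", value)
    return ("escalate", None)
```

Returning the policy name alongside the value makes it easy to log which rule settled each conflict.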
See `references/shared_state_coordination_and_consensus.md` for blackboard vs partitioned models.
5. Fault tolerance, observability, and testing
- Retry at message and workflow level with caps; distinguish transient vs terminal errors
- Compensate or mark partial success; never leave workflows stuck without timeout
- Trace: one `workflow_run_id` spanning all agent spans; redact secrets in cross-agent logs
- Test: unit agents, pairwise handoffs, full DAG golden paths, chaos on one worker
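A capped retry that separates transient from terminal errors, as described in the list above, could be sketched like this. The exception classes, the exponential backoff schedule, and the default values are assumptions for the example; the `sleep` parameter is injectable so tests do not need to wait.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or rate limit (hypothetical class)."""


class TerminalError(Exception):
    """Not worth retrying, e.g. invalid input (hypothetical class)."""


def call_with_retry(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `step` on TransientError up to `max_attempts` total attempts.

    TerminalError and any other exception propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: surface the failure to the workflow
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The same wrapper can be applied at message level and again at workflow level, each with its own cap.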
See `references/fault_tolerance_observability_and_testing.md` for test matrices and SLOs.
6. Deployment, cost, and governance
- Short synchronous graphs for interactive UX; queue or durable engine for long DAGs
- Scale workers horizontally; pin graph version and agent config per deployment
- Attribute cost per agent step; alert on budget burn rate
- Gate releases on multi-agent regression suite and policy checks
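Per-agent cost attribution and a burn-rate alert, as listed above, can be approximated with a small ledger. The `CostLedger` class, the even-burn baseline, and the 1.5x default threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostLedger:
    """Tracks spend per agent against a budget spread over a time window."""
    budget_usd: float
    budget_window_s: float
    by_agent: dict[str, float] = field(default_factory=lambda: defaultdict(float))

    def record(self, agent: str, usd: float) -> None:
        """Attribute the cost of one agent step to that agent."""
        self.by_agent[agent] += usd

    @property
    def total_usd(self) -> float:
        return sum(self.by_agent.values())

    def burn_alert(self, elapsed_s: float, threshold: float = 1.5) -> bool:
        """True when spend exceeds `threshold` times the even-burn pace."""
        if elapsed_s <= 0:
            return False
        # Even-burn pace: the share of budget expected by this point in the window.
        expected = self.budget_usd * min(elapsed_s / self.budget_window_s, 1.0)
        return self.total_usd > threshold * expected
```

For example, with a 10 USD budget over 100 seconds, spending 4 USD in the first 10 seconds is four times the even pace and would trip the alert.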
See `references/deployment_cost_and_governance.md` for queue vs durable workflow tradeoffs.
When to load references
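| Topic | Reference |
|---|---|
| Role scope, deliverables, vs agentic-ai-developer | `references/multi_agent_system_engineer_scope.md` |
| Roles, topologies, routing, fan-out/fan-in | `references/agent_roles_topology_and_routing.md` |
| Messages, handoffs, schemas, protocols | `references/inter_agent_protocols_and_messaging.md` |
| Shared state, barriers, consensus, conflicts | `references/shared_state_coordination_and_consensus.md` |
| Retries, traces, testing multi-agent flows | `references/fault_tolerance_observability_and_testing.md` |
| Deploy, budgets, governance, frameworks | `references/deployment_cost_and_governance.md` |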
Framework pointers (optional)
Use framework docs for API specifics; this skill stays pattern-first:
| Pattern | Typical home |
|---|---|
| Stateful graph, Send/fan-in, subgraph checkpointers | LangGraph-style graphs |
| Subagents, task middleware, filesystem routing | Deep Agents-style harness |
| DAG orchestration, merge nodes, status boards | agenthub-style workflows |
Do not duplicate full framework tutorials—encode the system contracts (topology, messages, state, failure) in the stack the team chose.
Routing vs agentic-ai-developer
| Question | Use |
|---|---|
| One agent, tool loop, MCP, checkpoint resume | `agentic-ai-developer` |
| Multiple agents, topology, routing, system-level failure and observability | this skill |
| Both: implement loops in agentic-ai-developer; design the fleet here | Load both; start here for topology |