multi-agent-system-engineer

Multi-Agent System Engineer

When to Use

  • Designing multi-agent topology: supervisor, hierarchical, peer-to-peer, or blackboard
  • Decomposing work across specialized agents with routing, delegation, and merge rules
  • Defining inter-agent message schemas, handoff payloads, and protocol boundaries
  • Partitioning vs sharing state, scratchpads, artifacts, and consensus across agents
  • Building fan-out/fan-in, DAG workflows, and synchronization barriers in agent graphs
  • Resolving conflicts when agents disagree or duplicate work
  • Engineering fault tolerance: retries, partial failure, compensation, and saga-style recovery
  • Setting system-level budgets: tokens, latency, parallelism, and cost per workflow run
  • Observability across agent traces, correlation IDs, and multi-step workflow debugging
  • Testing and simulating multi-agent flows before production
  • Deploying multi-agent runtimes on queues, durable workflows, and scaled workers

When NOT to Use

  • Single-agent loop, tools, MCP, checkpoints, and one runtime only → agentic-ai-developer
  • Foundation model training, fine-tuning, classical ML pipelines → ai-engineer, ai-researcher
  • AI ops cadence, vendor contracts, rollout governance without system design → ai-lead-ops
  • Internal developer platform, golden paths, portals (no agent orchestration) → platform-engineer
  • Cross-team milestones, RAID, program status without agent architecture → technical-program-manager
  • Corporate AI policy, risk tiering, model cards without system build → ai-risk-governance
  • Pre-flight go/no-go or architecture review without implementing topology → build-validator
  • Enterprise strategy, portfolio, and org design whiteboard only → enterprise-strategist

Related skills

Need → Skill
  • Implement single-agent loop, tools, MCP, HITL, eval harness → agentic-ai-developer
  • LLM apps, RAG, model routing, embedding strategy → ai-engineer
  • AI production ops, incidents, release gates → ai-lead-ops
  • Platform golden paths, IDP, developer portals → platform-engineer
  • Program delivery, dependencies, launch readiness → technical-program-manager
  • Governance, risk tiers, policy mapping → ai-risk-governance
  • Independent architecture or build go/no-go → build-validator
  • Persistent memory stores and retrieval design → ai-memory-developer
  • Context packing and token budgeting per call → ai-context-engineer
  • Prompt templates and judge rubrics → prompt-engineer

Core Workflows

1. Frame the multi-agent system

  1. Define the end-to-end job, success metric, and SLA (latency, cost, quality)
  2. List agents by role (planner, executor, critic, specialist)—not by model name
  3. Choose topology and justify: supervisor, hierarchical, P2P, blackboard, or hybrid
  4. Map trust boundaries: which agent may call which tools and external systems
  5. Set system budgets: max parallel agents, tokens per run, wall time, dollars per task
See references/multi_agent_system_engineer_scope.md for scope, deliverables, and boundaries vs agentic-ai-developer.
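
Step 5 can be made concrete with a small budget check. The following is a minimal sketch, not a prescribed implementation: the class names `SystemBudget` and `RunUsage`, the field names, and the function `budget_violations` are all hypothetical, chosen only to mirror the four limits listed in step 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemBudget:
    """Hypothetical per-run limits; field names are illustrative only."""
    max_parallel_agents: int
    max_tokens_per_run: int
    max_wall_time_s: float
    max_dollars_per_task: float


@dataclass
class RunUsage:
    """Running totals an orchestrator would update after each agent step."""
    parallel_agents: int = 0
    tokens: int = 0
    wall_time_s: float = 0.0
    dollars: float = 0.0


def budget_violations(budget: SystemBudget, usage: RunUsage) -> list:
    """Return the name of every limit the run has exceeded (empty if none)."""
    checks = [
        ("max_parallel_agents", usage.parallel_agents, budget.max_parallel_agents),
        ("max_tokens_per_run", usage.tokens, budget.max_tokens_per_run),
        ("max_wall_time_s", usage.wall_time_s, budget.max_wall_time_s),
        ("max_dollars_per_task", usage.dollars, budget.max_dollars_per_task),
    ]
    return [name for name, used, limit in checks if used > limit]
```

An orchestrator could call `budget_violations` before dispatching each new agent step and stop or degrade the run when the list is non-empty. What to do on a violation is a design decision this sketch leaves open.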

2. Topology, roles, and routing

ingress → router/supervisor → {workers} → reducer/merger → egress
Checklist:
  • Each agent has one primary responsibility and explicit inputs/outputs
  • Routing rules are deterministic where safety matters; LLM routing elsewhere is logged
  • Fan-out has a matching fan-in with merge semantics (vote, concat, structured reduce)
  • Dangerous tools are centralized or gated—not duplicated on every worker
See references/agent_roles_topology_and_routing.md for topology patterns and routing tables.
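
The router → {workers} → reducer shape and two of the checklist items (deterministic routing, fan-out with a matching fan-in) can be sketched in a few lines. This is an illustration under stated assumptions: the `ROUTES` table, the intent names, the worker names, and the choice of majority vote as the merge rule are all hypothetical, and workers are plain callables standing in for agents.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Worker = Callable[[str], str]

# Deterministic routing table: intent -> worker names (hypothetical values).
ROUTES: Dict[str, List[str]] = {
    "classify": ["classifier_a", "classifier_b", "classifier_c"],
    "summarize": ["summarizer"],
}


def route(intent: str) -> List[str]:
    """Look up workers for an intent; unknown intents fail closed."""
    if intent not in ROUTES:
        raise ValueError(f"no route for intent {intent!r}")
    return ROUTES[intent]


def fan_out_fan_in(intent: str, task: str, workers: Dict[str, Worker]) -> str:
    """Run every routed worker in parallel, then merge by majority vote."""
    names = route(intent)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map preserves the order of `names` in its results.
        results = list(pool.map(lambda name: workers[name](task), names))
    # Vote merge: the most common answer wins; on a tie, the answer
    # that appears first in routing order wins.
    winner, _votes = Counter(results).most_common(1)[0]
    return winner
```

Vote is only one of the merge semantics named in the checklist. A concat or structured reduce would replace the last two lines of `fan_out_fan_in` while keeping the same fan-out.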

3. Protocols and messaging

  • Define message envelope: correlation_id, from, to, intent, payload, artifacts, constraints
  • Version schemas; reject unknown versions at boundaries
  • Prefer structured payloads over free-text handoffs for machine agents
  • Document idempotency keys for retried messages
See references/inter_agent_protocols_and_messaging.md for handoff contracts and A2A-style patterns.
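
One possible shape for the envelope, version check, and idempotency handling described above is sketched below. The seven envelope fields come from the list above. Everything else is an assumption made for illustration: the `schema_version` and `idempotency_key` field names, the `"1.0"` version string, and the `parse_envelope` and `Inbox` helpers. The wire fields `from` and `to` are mapped to `sender` and `recipient` because `from` is a reserved word in Python.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

SUPPORTED_VERSIONS = {"1.0"}  # assumed version strings


@dataclass(frozen=True)
class Envelope:
    correlation_id: str
    sender: str       # wire field "from"
    recipient: str    # wire field "to"
    intent: str
    payload: Dict[str, Any]
    artifacts: List[str] = field(default_factory=list)
    constraints: Dict[str, Any] = field(default_factory=dict)
    schema_version: str = "1.0"   # assumed field name
    idempotency_key: str = ""     # assumed field name


def parse_envelope(raw: Dict[str, Any]) -> Envelope:
    """Validate at the boundary: unknown schema versions are rejected."""
    version = raw.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return Envelope(
        correlation_id=raw["correlation_id"],
        sender=raw["from"],
        recipient=raw["to"],
        intent=raw["intent"],
        payload=raw["payload"],
        artifacts=list(raw.get("artifacts", [])),
        constraints=dict(raw.get("constraints", {})),
        schema_version=version,
        idempotency_key=raw.get("idempotency_key", ""),
    )


class Inbox:
    """Accepts each idempotency key once, so retried messages are dropped."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.accepted: List[Envelope] = []

    def deliver(self, env: Envelope) -> bool:
        """Return True if the message was accepted, False if it was a repeat."""
        if env.idempotency_key:
            if env.idempotency_key in self._seen:
                return False
            self._seen.add(env.idempotency_key)
        self.accepted.append(env)
        return True
```

In a real deployment the seen-key set would live in a durable store with an expiry. The in-memory set here only demonstrates the contract.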

4. State, coordination, and consensus

  • Classify state: ephemeral scratchpad, workflow state, shared blackboard, durable store
  • Partition tenant and thread keys on every read/write
  • Use barriers or quorum when parallel agents must align before the next phase
  • Resolve conflicts with explicit policy: supervisor wins, vote, or escalate to human
See references/shared_state_coordination_and_consensus.md for blackboard vs partitioned models.
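
Two of the points above, key partitioning on every read/write and an explicit conflict policy, can be illustrated as follows. This is a sketch under assumptions: `PartitionedStore` is an in-memory stand-in for whatever shared store the team uses, and `resolve` chains the three policies (quorum vote first, then supervisor, then escalation) in one particular order that the source does not mandate. The sentinel string `ESCALATE_TO_HUMAN` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Optional, Tuple


class PartitionedStore:
    """In-memory stand-in for shared state keyed by (tenant, thread, key)."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str, str], Any] = {}

    def write(self, tenant: str, thread: str, key: str, value: Any) -> None:
        self._data[(tenant, thread, key)] = value

    def read(self, tenant: str, thread: str, key: str) -> Any:
        # A missing partition returns None; it never falls back to another tenant.
        return self._data.get((tenant, thread, key))


def resolve(
    proposals: Dict[str, str],
    quorum: int,
    supervisor: Optional[str] = None,
) -> str:
    """Assumed policy order: quorum vote, then supervisor wins, else escalate."""
    if proposals:
        value, votes = Counter(proposals.values()).most_common(1)[0]
        if votes >= quorum:
            return value
    if supervisor is not None and supervisor in proposals:
        return proposals[supervisor]
    return "ESCALATE_TO_HUMAN"
```

Because the tenant and thread are part of every key, a read for one tenant cannot see another tenant's write even when the logical key is the same.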

5. Fault tolerance, observability, and testing

  • Retry at message and workflow level with caps; distinguish transient vs terminal errors
  • Compensate or mark partial success; never leave workflows stuck without timeout
  • Trace: one workflow_run_id spanning all agent spans; redact secrets in cross-agent logs
  • Test: unit agents, pairwise handoffs, full DAG golden paths, chaos on one worker
See references/fault_tolerance_observability_and_testing.md for test matrices and SLOs.
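
The first bullet above (capped retries that distinguish transient from terminal errors) might look like the sketch below. The exception classes `TransientError` and `TerminalError`, the function `call_with_retry`, and the exponential backoff schedule are assumptions for illustration. A real system would map its own error types onto these two categories.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying, e.g. timeouts or rate limits."""


class TerminalError(Exception):
    """Not worth retrying, e.g. invalid input or a policy denial."""


def call_with_retry(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Retry transient failures up to a cap; terminal errors propagate at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: surface the failure instead of looping
            # Exponential backoff: base, 2x base, 4x base, ...
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

`TerminalError` is deliberately not caught, so the caller sees it after a single attempt and can compensate or mark partial success.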

6. Deployment, cost, and governance

  • Short synchronous graphs for interactive UX; queue or durable engine for long DAGs
  • Scale workers horizontally; pin graph version and agent config per deployment
  • Attribute cost per agent step; alert on budget burn rate
  • Gate releases on multi-agent regression suite and policy checks
See references/deployment_cost_and_governance.md for queue vs durable workflow tradeoffs.
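
Per-step cost attribution and a burn-rate alert could be prototyped as below. This is a rough sketch: `CostLedger`, its method names, the linear spend schedule, and the `tolerance` multiplier are all assumptions, not a recommended alerting policy.

```python
from collections import defaultdict
from typing import Dict


class CostLedger:
    """Attributes spend to agents and flags runs that burn budget too fast."""

    def __init__(self, budget_dollars: float, expected_duration_s: float) -> None:
        self.budget_dollars = budget_dollars
        self.expected_duration_s = expected_duration_s
        self.by_agent: Dict[str, float] = defaultdict(float)

    def record(self, agent: str, dollars: float) -> None:
        """Add the cost of one agent step to that agent's running total."""
        self.by_agent[agent] += dollars

    def total(self) -> float:
        return sum(self.by_agent.values())

    def burning_too_fast(self, elapsed_s: float, tolerance: float = 1.5) -> bool:
        """True when spend exceeds `tolerance` times a linear budget schedule."""
        fraction = min(elapsed_s / self.expected_duration_s, 1.0)
        expected_so_far = self.budget_dollars * fraction
        return self.total() > tolerance * expected_so_far
```

The linear schedule is the simplest baseline. Workflows that front-load expensive planning steps would need a different expected-spend curve.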

When to load references

Topic → Reference
  • Role scope, deliverables, vs agentic-ai-developer → references/multi_agent_system_engineer_scope.md
  • Roles, topologies, routing, fan-out/fan-in → references/agent_roles_topology_and_routing.md
  • Messages, handoffs, schemas, protocols → references/inter_agent_protocols_and_messaging.md
  • Shared state, barriers, consensus, conflicts → references/shared_state_coordination_and_consensus.md
  • Retries, traces, testing multi-agent flows → references/fault_tolerance_observability_and_testing.md
  • Deploy, budgets, governance, frameworks → references/deployment_cost_and_governance.md

Framework pointers (optional)

Use framework docs for API specifics; this skill stays pattern-first:
Pattern → Typical home
  • Stateful graph, Send/fan-in, subgraph checkpointers → LangGraph-style graphs
  • Subagents, task middleware, filesystem routing → Deep Agents-style harness
  • DAG orchestration, merge nodes, status boards → agenthub-style workflows
Do not duplicate full framework tutorials—encode the system contracts (topology, messages, state, failure) in the stack the team chose.

Routing vs agentic-ai-developer

Question → Use
  • One agent, tool loop, MCP, checkpoint resume → agentic-ai-developer
  • Multiple agents, topology, routing, system-level failure and observability → this skill
  • Both: implement loops in agentic-ai-developer; design the fleet here → Load both; start here for topology