multi-agent-system-engineer

Multi-Agent System Engineer

When to Use

Designing multi-agent topology: supervisor, hierarchical, peer-to-peer, or blackboard
Decomposing work across specialized agents with routing, delegation, and merge rules
Defining inter-agent message schemas, handoff payloads, and protocol boundaries
Partitioning vs sharing state, scratchpads, artifacts, and consensus across agents
Building fan-out/fan-in, DAG workflows, and synchronization barriers in agent graphs
Resolving conflicts when agents disagree or duplicate work
Engineering fault tolerance: retries, partial failure, compensation, and saga-style recovery
Setting system-level budgets: tokens, latency, parallelism, and cost per workflow run
Observability across agent traces, correlation IDs, and multi-step workflow debugging
Testing and simulating multi-agent flows before production
Deploying multi-agent runtimes on queues, durable workflows, and scaled workers

When NOT to Use

Single-agent loop, tools, MCP, checkpoints, and one runtime only → `agentic-ai-developer`
Foundation model training, fine-tuning, classical ML pipelines → `ai-engineer`, `ai-researcher`
AI ops cadence, vendor contracts, rollout governance without system design → `ai-lead-ops`
Internal developer platform, golden paths, portals (no agent orchestration) → `platform-engineer`
Cross-team milestones, RAID, program status without agent architecture → `technical-program-manager`
Corporate AI policy, risk tiering, model cards without system build → `ai-risk-governance`
Pre-flight go/no-go or architecture review without implementing topology → `build-validator`
Enterprise strategy, portfolio, and org design whiteboard only → `enterprise-strategist`

Related skills

Need	Skill
Implement single-agent loop, tools, MCP, HITL, eval harness	`agentic-ai-developer`
LLM apps, RAG, model routing, embedding strategy	`ai-engineer`
AI production ops, incidents, release gates	`ai-lead-ops`
Platform golden paths, IDP, developer portals	`platform-engineer`
Program delivery, dependencies, launch readiness	`technical-program-manager`
Governance, risk tiers, policy mapping	`ai-risk-governance`
Independent architecture or build go/no-go	`build-validator`
Persistent memory stores and retrieval design	`ai-memory-developer`
Context packing and token budgeting per call	`ai-context-engineer`
Prompt templates and judge rubrics	`prompt-engineer`

Core Workflows

1. Frame the multi-agent system

Define the end-to-end job, success metric, and SLA (latency, cost, quality)
List agents by role (planner, executor, critic, specialist)—not by model name
Choose topology and justify: supervisor, hierarchical, P2P, blackboard, or hybrid
Map trust boundaries: which agent may call which tools and external systems
Set system budgets: max parallel agents, tokens per run, wall time, dollars per task

See `references/multi_agent_system_engineer_scope.md` for scope, deliverables, and boundaries vs `agentic-ai-developer`.
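
The system budgets listed in this step can be written down as data and checked after (or during) a run. The following is a minimal sketch, not a prescribed implementation: the class names `SystemBudget` and `RunUsage`, the field names, and the helper `budget_violations` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemBudget:
    """Hypothetical system-level limits for one workflow run."""
    max_parallel_agents: int
    max_tokens_per_run: int
    max_wall_time_s: float
    max_dollars_per_task: float


@dataclass
class RunUsage:
    """Hypothetical usage counters collected while a run executes."""
    parallel_agents: int = 0
    tokens: int = 0
    wall_time_s: float = 0.0
    dollars: float = 0.0


def budget_violations(budget: SystemBudget, usage: RunUsage) -> list[str]:
    """Return the names of every budget dimension the run has exceeded."""
    checks = [
        ("parallel_agents", usage.parallel_agents, budget.max_parallel_agents),
        ("tokens", usage.tokens, budget.max_tokens_per_run),
        ("wall_time_s", usage.wall_time_s, budget.max_wall_time_s),
        ("dollars", usage.dollars, budget.max_dollars_per_task),
    ]
    return [name for name, used, limit in checks if used > limit]
```

Returning the list of violated dimensions, rather than a single boolean, lets a supervisor decide whether to stop the run, shed parallelism, or only raise an alert.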

2. Topology, roles, and routing

ingress → router/supervisor → {workers} → reducer/merger → egress

Checklist:

Each agent has one primary responsibility and explicit inputs/outputs
Routing rules are deterministic where safety matters; LLM routing elsewhere is logged
Fan-out has a matching fan-in with merge semantics (vote, concat, structured reduce)
Dangerous tools are centralized or gated—not duplicated on every worker

See `references/agent_roles_topology_and_routing.md` for topology patterns and routing tables.
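
Two items from the checklist above, deterministic routing where safety matters and a fan-out with a matching fan-in, can be sketched as follows. This is an illustration under stated assumptions: the routing table `SAFE_ROUTES`, the agent names in it, and the functions `route` and `fan_out_fan_in` are hypothetical, and majority vote is only one of the merge semantics the checklist names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Worker = Callable[[str], str]

# Hypothetical deterministic routing table for safety-relevant intents.
SAFE_ROUTES = {"refund": "billing_agent", "delete_account": "account_agent"}


def route(intent: str, llm_router: Callable[[str], str], log: list) -> str:
    """Use the fixed table when the intent is safety-relevant; otherwise
    fall back to an LLM-style router and log the decision."""
    if intent in SAFE_ROUTES:
        return SAFE_ROUTES[intent]
    choice = llm_router(intent)
    log.append({"intent": intent, "routed_to": choice, "router": "llm"})
    return choice


def fan_out_fan_in(task: str, workers: list[Worker]) -> str:
    """Fan the task out to every worker in parallel, then fan in by majority vote."""
    if not workers:
        raise ValueError("fan-out needs at least one worker")
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        results = list(pool.map(lambda worker: worker(task), workers))
    winner, _count = Counter(results).most_common(1)[0]
    return winner
```

A vote only makes sense when workers return comparable answers; for heterogeneous outputs a concat or structured reduce would replace the `Counter` step.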

3. Protocols and messaging

Define message envelope: `correlation_id`, `from`, `to`, `intent`, `payload`, `artifacts`, `constraints`

Version schemas; reject unknown versions at boundaries
Prefer structured payloads over free-text handoffs for machine agents
Document idempotency keys for retried messages

See `references/inter_agent_protocols_and_messaging.md` for handoff contracts and A2A-style patterns.
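
The envelope, version check, and idempotency handling described above could look roughly like this. It is a sketch under assumptions: the field names `schema_version` and `idempotency_key`, the supported version set, and the `Inbox` class are hypothetical (the source names the concepts but not these identifiers), and `from` is stored as `sender` because `from` is a reserved word in Python.

```python
from dataclasses import dataclass, field
from typing import Any

SUPPORTED_VERSIONS = {"1.0"}  # assumption: a single supported schema version


class UnknownSchemaVersion(ValueError):
    """Raised at the boundary when a message carries an unsupported version."""


@dataclass(frozen=True)
class Envelope:
    correlation_id: str
    sender: str                  # the "from" field of the envelope
    to: str
    intent: str
    payload: dict[str, Any]
    schema_version: str          # hypothetical field name
    idempotency_key: str         # hypothetical field name
    artifacts: list[str] = field(default_factory=list)
    constraints: dict[str, Any] = field(default_factory=dict)


class Inbox:
    """Boundary of a receiving agent: validates versions and drops retried duplicates."""

    def __init__(self) -> None:
        self._seen_keys: set[str] = set()
        self.delivered: list[Envelope] = []

    def receive(self, raw: dict[str, Any]) -> bool:
        """Return True if the message was delivered, False if it was a duplicate."""
        if raw.get("schema_version") not in SUPPORTED_VERSIONS:
            raise UnknownSchemaVersion(repr(raw.get("schema_version")))
        fields = dict(raw)                      # copy so the caller's dict is untouched
        fields["sender"] = fields.pop("from")
        envelope = Envelope(**fields)
        if envelope.idempotency_key in self._seen_keys:
            return False
        self._seen_keys.add(envelope.idempotency_key)
        self.delivered.append(envelope)
        return True
```

Deduplication keys on `idempotency_key` rather than `correlation_id`, since one correlation ID normally spans many distinct messages within the same workflow run.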

4. State, coordination, and consensus

Classify state: ephemeral scratchpad, workflow state, shared blackboard, durable store
Scope every read and write by tenant and thread key
Use barriers or quorum when parallel agents must align before the next phase
Resolve conflicts with explicit policy: supervisor wins, vote, or escalate to human

See `references/shared_state_coordination_and_consensus.md` for blackboard vs partitioned models.
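
Tenant and thread partitioning, plus an explicit conflict policy, can be sketched in a few lines. The names `PartitionedStore`, `resolve`, and the `ESCALATE` sentinel are hypothetical, and the policy order shown (supervisor wins, then quorum vote, then escalate to a human) is one possible arrangement of the options listed above, not the only valid one.

```python
from collections import Counter
from typing import Any, Optional

ESCALATE = "ESCALATE_TO_HUMAN"  # hypothetical sentinel for human escalation


class PartitionedStore:
    """In-memory store whose every read and write is scoped by tenant and thread."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str, str], Any] = {}

    def write(self, tenant: str, thread: str, key: str, value: Any) -> None:
        self._data[(tenant, thread, key)] = value

    def read(self, tenant: str, thread: str, key: str, default: Any = None) -> Any:
        return self._data.get((tenant, thread, key), default)


def resolve(proposals: dict[str, str], quorum: int,
            supervisor: Optional[str] = None) -> str:
    """Resolve disagreeing agent proposals (agent name -> proposed value).

    Policy order: the supervisor's proposal wins if present; otherwise the
    most common value wins if it reaches the quorum; otherwise escalate.
    """
    if not proposals:
        return ESCALATE
    if supervisor is not None and supervisor in proposals:
        return proposals[supervisor]
    value, votes = Counter(proposals.values()).most_common(1)[0]
    return value if votes >= quorum else ESCALATE
```

Because the tenant and thread are part of the key itself, a caller cannot read another tenant's data by omission; it would have to pass the wrong tenant explicitly.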

5. Fault tolerance, observability, and testing

Retry at message and workflow level with caps; distinguish transient vs terminal errors
Compensate or mark partial success; never leave a workflow stuck without a timeout
Trace: one `workflow_run_id` spanning all agent spans; redact secrets in cross-agent logs
Test: unit agents, pairwise handoffs, full DAG golden paths, chaos on one worker

See `references/fault_tolerance_observability_and_testing.md` for test matrices and SLOs.
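
A capped retry that distinguishes transient from terminal errors might be sketched as below. The exception classes, the `call_with_retry` helper, its parameters, and the shape of the returned status dictionary are assumptions made for illustration; the exponential backoff is one common choice and is not specified by the text above.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical error class for failures worth retrying (timeouts, rate limits)."""


class TerminalError(Exception):
    """Hypothetical error class for failures that retrying cannot fix."""


def call_with_retry(step: Callable[[], Any], max_attempts: int = 3,
                    base_delay_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Run one agent step with a retry cap.

    Transient errors are retried with exponential backoff; terminal errors
    stop immediately. The result always carries an explicit status so the
    workflow never waits on a step that silently gave up.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": step(), "attempts": attempt}
        except TerminalError as exc:
            return {"status": "failed", "error": str(exc), "attempts": attempt}
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    return {"status": "exhausted", "error": str(last_error), "attempts": max_attempts}
```

The `sleep` parameter is injectable so tests can skip real delays; a caller that sees `"exhausted"` or `"failed"` can then compensate or mark the workflow as a partial success.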

6. Deployment, cost, and governance

Short synchronous graphs for interactive UX; queue or durable engine for long DAGs
Scale workers horizontally; pin graph version and agent config per deployment
Attribute cost per agent step; alert on budget burn rate
Gate releases on multi-agent regression suite and policy checks

See `references/deployment_cost_and_governance.md` for queue vs durable workflow tradeoffs.
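
Per-step cost attribution and a burn-rate alert can be illustrated with a small ledger. The `CostLedger` class, its method names, and the linear "expected spend so far" model are hypothetical; a production system would more likely emit these numbers to its metrics stack than keep them in memory.

```python
from collections import defaultdict


class CostLedger:
    """Hypothetical ledger that attributes spend to agents and checks burn rate."""

    def __init__(self, budget_dollars: float, window_s: float) -> None:
        self.budget_dollars = budget_dollars  # total budget for the window
        self.window_s = window_s              # length of the budget window in seconds
        self.by_agent: dict[str, float] = defaultdict(float)
        self.total = 0.0

    def record(self, agent: str, dollars: float) -> None:
        """Attribute the cost of one agent step."""
        self.by_agent[agent] += dollars
        self.total += dollars

    def burn_rate_alert(self, elapsed_s: float, threshold: float = 1.0) -> bool:
        """True when spend so far exceeds `threshold` times the spend expected
        at this point if the budget were consumed evenly over the window."""
        if elapsed_s <= 0:
            return False
        expected = self.budget_dollars * (elapsed_s / self.window_s)
        return self.total > threshold * expected
```

For example, with a 10 dollar budget over a 100 second window, 3.50 dollars spent after 10 seconds would trigger the alert, while the same spend after 50 seconds would not.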

When to load references

Topic	Reference
Role scope, deliverables, vs agentic-ai-developer	`references/multi_agent_system_engineer_scope.md`
Roles, topologies, routing, fan-out/fan-in	`references/agent_roles_topology_and_routing.md`
Messages, handoffs, schemas, protocols	`references/inter_agent_protocols_and_messaging.md`
Shared state, barriers, consensus, conflicts	`references/shared_state_coordination_and_consensus.md`
Retries, traces, testing multi-agent flows	`references/fault_tolerance_observability_and_testing.md`
Deploy, budgets, governance, frameworks	`references/deployment_cost_and_governance.md`

Framework pointers (optional)

Use framework docs for API specifics; this skill stays pattern-first:

Pattern	Typical home
Stateful graph, Send/fan-in, subgraph checkpointers	LangGraph-style graphs
Subagents, task middleware, filesystem routing	Deep Agents-style harness
DAG orchestration, merge nodes, status boards	agenthub-style workflows

Do not duplicate full framework tutorials—encode the system contracts (topology, messages, state, failure) in the stack the team chose.

Routing vs agentic-ai-developer

Question	Use
One agent, tool loop, MCP, checkpoint resume	`agentic-ai-developer`
Multiple agents, topology, routing, system-level failure and observability	this skill
Both: implement loops in agentic-ai-developer; design the fleet here	Load both; start here for topology
