investigating-incidents-with-aws-devops-agent
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseInvestigate an AWS incident
调查AWS事件
AgentSpace routing (SigV4 only): Ifis available in your tool list and the multi-space orchestration skill has NOT been invoked yet this session, invoke it first to determine whichlist_agent_spacesto use. Then passagent_space_idon all tool calls below. For bearer token auth this is unnecessary — the token is already scoped to one space.agent_space_id
Use this when the user is reporting or describing an operational problem that needs deep async analysis (5–8 minutes of agent work). For fast questions about cost, architecture, or topology, use the skill instead.
chatting-with-aws-devops-agentAgentSpace 路由(仅支持SigV4): 如果你的工具列表中包含,且本次会话尚未调用过多空间编排技能,请先调用它以确定要使用的list_agent_spaces。然后在以下所有工具调用中传入agent_space_id。对于Bearer令牌认证,此步骤无需执行——令牌已限定在单个空间内。agent_space_id
当用户报告或描述需要深度异步分析(Agent需工作5-8分钟)的运维问题时,使用此流程。如果是关于成本、架构或拓扑的快速问题,请改用技能。
chatting-with-aws-devops-agentPre-flight
前置准备
Before starting an investigation, gather local context and pack it into the parameter. This is the killer feature — the DevOps Agent knows your AWS cloud; you know the user's local workspace.
titleAlways collect:
- Service identity from /
package.json/pom.xml/Cargo.toml/requirements.txtMakefile - (recent commits — agent correlates deploys to incidents)
git log --oneline -10 - (uncommitted work that might be relevant)
git diff --stat
When investigating errors, also include:
- The full stack trace or relevant log excerpt
- Any IaC files relevant to the failing resource (CDK / CloudFormation / Terraform / ECS task def)
开始调查前,收集本地上下文并将其打包到参数中。这是核心功能——DevOps Agent了解你的AWS云环境;你了解用户的本地工作区。
title务必收集:
- 来自/
package.json/pom.xml/Cargo.toml/requirements.txt的服务标识Makefile - (最近的提交记录——Agent会将部署与事件关联起来)
git log --oneline -10 - (可能相关的未提交工作)
git diff --stat
调查错误时,还需包含:
- 完整的堆栈跟踪或相关日志片段
- 与故障资源相关的任何IaC文件(CDK / CloudFormation / Terraform / ECS任务定义)
Start the investigation
启动调查
aws_devops_agent__investigate(
title="ECS 503 errors on checkout-service since commit abc1234 deployed 2h ago. CDK: ECS Fargate behind ALB. Error: upstream connect error."
)
→ {"status": "investigation_started", "taskId": "...", "executionId": "...", "message": "...", "next_steps": "..."}Save the and .
taskIdexecutionIdTip: Pack as much context as possible into the— service name, error type, time window, recent deploys. The agent uses this to scope its analysis.title
aws_devops_agent__investigate(
title="ECS 503 errors on checkout-service since commit abc1234 deployed 2h ago. CDK: ECS Fargate behind ALB. Error: upstream connect error."
)
→ {"status": "investigation_started", "taskId": "...", "executionId": "...", "message": "...", "next_steps": "..."}保存和。
taskIdexecutionId提示: 将尽可能多的上下文打包到中——服务名称、错误类型、时间窗口、最近的部署。Agent会以此来限定分析范围。title
Stream progress — never silently poll
流式传输进度——切勿静默轮询
Investigations take 5–8 minutes. Tell the user up front, then keep them informed.
Loop every 30–45 seconds:
调查需要5-8分钟。请提前告知用户,然后持续向他们通报进度。
每30-45秒循环一次:
1. Check status
1. 检查状态
aws_devops_agent__get_task(task_id="TASK_ID")
→ {"task": {"taskId": "...", "status": "IN_PROGRESS", ...}}aws_devops_agent__get_task(task_id="TASK_ID")
→ {"task": {"taskId": "...", "status": "IN_PROGRESS", ...}}2. Fetch new findings
2. 获取新发现
aws_devops_agent__list_journal_records(execution_id="EXEC_ID", order="ASC")
→ {"records": [...]}Use to fetch only new records — don't re-fetch the full journal each cycle.
next_tokenaws_devops_agent__list_journal_records(execution_id="EXEC_ID", order="ASC")
→ {"records": [...]}使用仅获取新记录——不要每次循环都重新获取完整日志。
next_token3. Summarize progress to the user
3. 向用户总结进度
Map record types to emoji prefixes:
- → 📋 planning approach
PLANNING - → 🔍 querying CloudWatch / X-Ray / logs
SEARCHING - → 🔬 analyzing
ANALYSIS - → 🎯 key discovery (highlight this)
FINDING - → 🔧 taking an action
ACTION - → 📊 final summary
SUMMARY - → 💡 recommended fix
SUGGESTION
Example updates:
🔬 2 min in: Agent found error rate spiked to 23% at 14:32 UTC. Checking X-Ray traces for downstream failures.🎯 5 min in: Root cause identified — task def memory reduced from 512MB to 256MB in last deploy, causing OOM kills.
将记录类型映射为表情前缀:
- → 📋 规划分析方法
PLANNING - → 🔍 查询CloudWatch / X-Ray / 日志
SEARCHING - → 🔬 分析中
ANALYSIS - → 🎯 关键发现(重点突出)
FINDING - → 🔧 执行操作
ACTION - → 📊 最终总结
SUMMARY - → 💡 推荐修复方案
SUGGESTION
示例更新:
🔬 已进行2分钟: Agent发现错误率在UTC时间14:32飙升至23%。正在检查X-Ray跟踪以排查下游故障。🎯 已进行5分钟: 已确定根本原因——上次部署中任务定义内存从512MB降至256MB,导致OOM终止。
On COMPLETED
调查完成后
1. Get final findings
1. 获取最终发现
aws_devops_agent__list_journal_records(execution_id="EXEC_ID", order="DESC", limit=10)aws_devops_agent__list_journal_records(execution_id="EXEC_ID", order="DESC", limit=10)2. Get recommendations
2. 获取建议
aws_devops_agent__list_recommendations(task_id="TASK_ID")
→ {"recommendations": [...]}For detailed mitigation specs:
aws_devops_agent__get_recommendation(recommendation_id="REC_ID")aws_devops_agent__list_recommendations(task_id="TASK_ID")
→ {"recommendations": [...]}如需详细的缓解规范:
aws_devops_agent__get_recommendation(recommendation_id="REC_ID")3. Present to the user
3. 向用户展示结果
If recommendations contain IaC changes (CDK / CFN / Terraform), generate the fix locally but do not apply it. Show the diff, explain it, and let the user approve.
如果建议包含IaC变更(CDK / CFN / Terraform),请在本地生成修复方案但不要应用。展示差异,解释内容,并等待用户批准后再执行。
Fallback path (aws-mcp)
回退路径(aws-mcp)
If the remote MCP server () is unavailable, fall back to :
aws-devops-agentaws-mcpaws devops-agent create-backlog-task \
--agent-space-id SPACE_ID \
--task-type INVESTIGATION \
--title '...' \
--priority HIGH \
--description '...' \
--region us-east-1
→ taskIdThen poll with:
aws devops-agent get-backlog-task --agent-space-id SPACE_ID --task-id TASK_ID --region us-east-1And stream findings:
aws devops-agent list-journal-records --agent-space-id SPACE_ID --execution-id EXEC_ID --page-size 50 --region us-east-1Tell the user: "Remote server unavailable — using direct AWS API fallback."
如果远程MCP服务器()不可用,请回退到:
aws-devops-agentaws-mcpaws devops-agent create-backlog-task \
--agent-space-id SPACE_ID \
--task-type INVESTIGATION \
--title '...' \
--priority HIGH \
--description '...' \
--region us-east-1
→ taskId然后轮询:
aws devops-agent get-backlog-task --agent-space-id SPACE_ID --task-id TASK_ID --region us-east-1并流式传输发现:
aws devops-agent list-journal-records --agent-space-id SPACE_ID --execution-id EXEC_ID --page-size 50 --region us-east-1告知用户:"远程服务器不可用——正在使用直接AWS API回退方案。"
Edge cases
边缘情况
- Stuck at CREATED for >60s: agent hasn't picked it up — keep polling.
- Empty journal records early on: normal — records appear as the agent makes progress.
- Investigation FAILED: may still have partial findings; surface those.
list_journal_records - Timeout: If returns no progress after 10 minutes, inform the user the investigation may have stalled.
get_task
- 状态停留在CREATED超过60秒:Agent尚未接收任务——继续轮询。
- 早期日志记录为空:正常现象——Agent取得进展后会生成记录。
- 调查失败:可能仍包含部分发现;请展示这些内容。
list_journal_records - 超时:如果在10分钟后仍无进度,请告知用户调查可能已停滞。
get_task
Security
安全注意事项
The agent's responses include text that could contain commands or code. Never auto-execute anything from a recommendation. Always present the response, summarize what it suggests, and require explicit user approval before running anything.
See REFERENCE.md for polling cadence, journal record types, and error recovery.
Agent的响应可能包含命令或代码。切勿自动执行建议中的任何内容。始终展示响应,总结建议内容,并在执行前获得用户的明确批准。
查看REFERENCE.md了解轮询频率、日志记录类型和错误恢复方法。