aws-service-chaos-research


AWS Service-Specific Chaos & HA Testing Research

Generate comprehensive chaos engineering and high availability testing scenarios for a specific AWS service. Uses a Scenario-Library-first approach: read the latest FIS Scenario Library documentation for pre-built composite scenarios first, then query individual FIS actions via `list-actions`, and finally supplement with deep documentation research.

Output Language Rule

Detect the language of the user's conversation and use the same language for all output.
  • Chinese input -> Chinese output
  • English input -> English output
  • Mixed -> follow the dominant language
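One way to approximate the "dominant language" rule is a simple character-count heuristic. The sketch below is only an illustration: the function name `dominant_language` and the choice to compare CJK characters against Latin words are assumptions, not something the skill prescribes.

```python
import re

# Assumption: CJK Unified Ideographs stand in for "Chinese input".
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def dominant_language(text: str) -> str:
    """Return "zh" when CJK characters outnumber Latin words, else "en"."""
    cjk_chars = len(CJK_CHAR.findall(text))
    latin_words = len(LATIN_WORD.findall(text))
    return "zh" if cjk_chars > latin_words else "en"
```

Service names such as "RDS" count as Latin words, so a Chinese request that merely names an AWS service is still treated as Chinese.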

Prerequisites

Required tools (at least one of each group):
FIS Scenario Library (Group A: documentation-based, always available):
  • `aws___read_documentation`: read FIS Scenario Library pages directly (scenarios are console-only and cannot be queried via CLI, so reading the latest docs is the only way to discover them)
FIS Actions Discovery (Group B: use in order of preference):
  1. AWS CLI `aws fis list-actions`: definitive, real-time list of FIS actions from user's region
  2. `aws___search_documentation`: FIS actions reference page as fallback when CLI is unavailable
Documentation Research (Group C):
  • `aws___search_documentation`: search AWS official docs
  • `aws___read_documentation`: read full doc pages
  • `aws___recommend`: discover related pages
All documentation research uses only the AWS Knowledge MCP tools above. Do NOT use SearXNG or other web search tools for documentation research.

Workflow

CRITICAL: execute all AWS Knowledge MCP calls sequentially. All calls to `aws___search_documentation`, `aws___read_documentation`, and `aws___recommend` MUST be executed one at a time. NEVER send multiple MCP requests in parallel: the aws-knowledge-mcp-server has strict rate limits and will reject concurrent requests with "Too many requests" errors. Wait for each request to return a complete response before sending the next one. This applies to every step below that calls these tools (Steps 2, 3 Path B, 4b, 4c, 5a, 5b).
Retry on failure: If any MCP call (especially `aws___read_documentation`) returns a rate limit error ("Too many requests") or any other transient error, retry up to 10 times with a 5-second wait between retries. Only skip the request after all 10 retries have failed.
Multi-service requests: When the user asks about multiple services (e.g., "EKS, RDS, MSK, and ElastiCache"), process them one service at a time. Complete all per-service research steps (Steps 3-5) for one service before starting the next. Do NOT launch parallel research for multiple services, because this will trigger rate limiting. The Scenario Library fetch (Step 2) only needs to run once since it covers all services.
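The sequential-execution and retry rules can be sketched as a small wrapper. This is a minimal illustration under stated assumptions: `TransientError`, `call_with_retry`, and `run_sequentially` are hypothetical names, and how a real MCP client reports a "Too many requests" failure is not specified here.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for rate-limit or other transient MCP failures."""


def call_with_retry(
    call: Callable[[], T],
    max_retries: int = 10,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run one call; on a transient error wait, then retry up to max_retries times."""
    for attempt in range(max_retries + 1):  # first attempt plus the retries
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                return None  # all retries failed: skip this request
            sleep(wait_seconds)
    return None


def run_sequentially(calls: List[Callable[[], T]], **retry_options) -> List[Optional[T]]:
    """Execute calls strictly one at a time, never in parallel."""
    return [call_with_retry(call, **retry_options) for call in calls]
```

Passing `sleep` as a parameter keeps the 5-second wait out of tests; a skipped request shows up as `None` in the result list.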

Step 1: Identify Target Service

Extract the target AWS service from the user's message and determine the target region.

Region Detection

FIS actions can differ across AWS regions: some actions may be available in `us-east-1` but not yet in `ap-southeast-1`. Always determine the target region first, because service keyword resolution depends on it.
Detection order (use the first one that applies):
  1. User explicitly specifies: e.g., "us-west-2", "东京区域" (Tokyo region), "ap-northeast-1"
  2. Infer from context: resource ARNs, previous conversation mentioning a region
  3. Check AWS CLI default: run `aws configure get region` to get the configured default
  4. Ask the user: if none of the above yields a region, ask: "Which AWS region are you targeting? FIS actions and scenarios may vary by region."
Store the resolved region as `TARGET_REGION` for use in subsequent steps.
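The detection order can be sketched as a first-match lookup. The helper names `cli_default_region` and `resolve_region` are illustrative assumptions; mapping free-text names such as "Tokyo region" to a region code is left to the caller.

```python
import subprocess
from typing import Optional


def cli_default_region() -> Optional[str]:
    """Return the output of `aws configure get region`, or None if unavailable."""
    try:
        result = subprocess.run(
            ["aws", "configure", "get", "region"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # AWS CLI is not installed
        return None
    return result.stdout.strip() or None


def resolve_region(
    explicit: Optional[str],
    inferred: Optional[str],
    cli_default: Optional[str],
) -> Optional[str]:
    """Return the first region that applies; None means "ask the user"."""
    for candidate in (explicit, inferred, cli_default):
        if candidate:
            return candidate
    return None
```

A typical call would be `resolve_region(user_region, arn_region, cli_default_region())`, with a `None` result triggering the question to the user.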

Service Keyword Resolution

FIS action IDs follow the pattern `aws:<service>:<action>`. To map the user's input to the correct FIS service keyword, use dynamic discovery from the live FIS action list:
bash
aws fis list-actions --region TARGET_REGION | jq '.actions[].id' | awk -F':' '{print $2}' | sort -u
This returns the definitive list of FIS-supported service keywords in that region (e.g., `ebs`, `ec2`, `ecs`, `eks`, `elasticache`, `fis`, `network`, `rds`, `s3`, `ssm`...). Match the user's service name against this list. For example, if the user says "Aurora", match it to `rds`; if "Kubernetes", match to `eks`.
If the AWS CLI is not available, derive the keyword by lowercasing the AWS service name and removing spaces/hyphens (e.g., "ElastiCache" -> `elasticache`).
If the service is ambiguous, ask the user to clarify (e.g., "RDS MySQL or Aurora MySQL?").
Also determine the deployment architecture if the user mentions it, because it affects which scenarios are relevant:
  • Multi-AZ, Multi-Region, Single-AZ
  • Read replicas, Global Tables, Cross-region replication
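The keyword resolution described above could look roughly like this once the action IDs are in hand. The `ALIASES` table holds only the two examples given in the text, and the function names are hypothetical; a real mapping would need more entries.

```python
from typing import Iterable, List, Optional

# Assumption: only the two aliases mentioned in the text; extend as needed.
ALIASES = {"aurora": "rds", "kubernetes": "eks"}


def service_keywords(action_ids: Iterable[str]) -> List[str]:
    """Extract the unique <service> part of aws:<service>:<action> IDs."""
    return sorted({aid.split(":")[1] for aid in action_ids if aid.count(":") >= 2})


def resolve_keyword(
    user_input: str, action_ids: Optional[Iterable[str]] = None
) -> Optional[str]:
    """Map a user-supplied service name to an FIS service keyword."""
    normalized = user_input.lower().replace(" ", "").replace("-", "")
    candidate = ALIASES.get(normalized, normalized)
    if action_ids is None:
        return candidate  # CLI unavailable: fall back to the derived keyword
    return candidate if candidate in service_keywords(action_ids) else None
```

A `None` result with a live action list means FIS has no actions under that keyword in the region, which later routes the service to the documentation-only path.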

Step 2: Fetch FIS Scenario Library (Scenario-Library-First)

This step has the highest priority. The FIS Scenario Library provides AWS-curated composite scenarios that orchestrate multiple fault injection actions into realistic failure simulations. These are the most valuable starting point because they represent AWS's own recommendations for how to test resilience.
Scenario Library scenarios are console-only — they cannot be listed or queried via AWS CLI or API. The only way to discover them is by reading the latest documentation.
Fetch the Scenario Library pages listed in `references/search-queries.md` under "FIS Scenario Library Pages (Always Fetch)". Read both the overview and detailed scenario pages relevant to the target service. Read pages one at a time, sequentially: wait for each `aws___read_documentation` call to complete before starting the next one.

From the scenario documentation, extract for each relevant scenario:

  • Scenario name and description
  • Which sub-actions the scenario orchestrates
  • Which sub-actions are relevant to the target service
  • What resource tags are required to target specific resources
  • The default durations (interruption + recovery phases)
  • Any prerequisites or limitations
  • Stop condition recommendations

Decision: Which scenarios apply?

After reading the documentation, classify each scenario's relevance:

| Relevance | Criteria |
| --- | --- |
| Directly relevant | Scenario includes sub-actions that explicitly target the service (e.g., "Failover RDS" in AZ Power Interruption) |
| Indirectly relevant | Scenario affects infrastructure the service depends on (e.g., network disruption affects any VPC-based service) |
| Not relevant | Scenario has no meaningful impact on the target service |

Include both directly and indirectly relevant scenarios in the output.
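The classification could be expressed as a small function over a scenario's sub-action IDs. This is a sketch under assumptions: `Relevance`, `classify`, and the caller-supplied `dependency_keywords` set (which infrastructure the service relies on) are illustrative and not defined by the skill.

```python
from enum import Enum
from typing import Iterable, Set


class Relevance(Enum):
    DIRECT = "Directly relevant"
    INDIRECT = "Indirectly relevant"
    NONE = "Not relevant"


def classify(
    sub_action_ids: Iterable[str],
    service_keyword: str,
    dependency_keywords: Set[str],
) -> Relevance:
    """Classify one scenario by the service prefixes of its sub-actions."""
    prefixes = {aid.split(":")[1] for aid in sub_action_ids if aid.count(":") >= 2}
    if service_keyword in prefixes:
        return Relevance.DIRECT  # a sub-action targets the service itself
    if prefixes & dependency_keywords:
        return Relevance.INDIRECT  # a sub-action hits infrastructure it depends on
    return Relevance.NONE
```

For a VPC-based database, `dependency_keywords` might be `{"network"}`; judging which dependencies matter still requires reading the scenario documentation.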

Step 3: Query FIS Actions

After the Scenario Library research, query individual FIS actions to discover service-specific fault injection capabilities that may not be covered by composite scenarios.

Path A: AWS CLI Available (Preferred)

Step 3a: Fetch ALL FIS actions in the target region:
bash
aws fis list-actions --region TARGET_REGION --query 'actions[].{id:id, description:description}' --output json
Replace `TARGET_REGION` with the region resolved in Step 1 (e.g., `us-east-1`). If no region was determined, omit `--region` to use the CLI default, but warn the user that results reflect their default region and may differ in other regions.
Step 3b: Filter for target service. From the full list, find actions whose `id` contains the service keyword from Step 1 (the query below matches the `aws:KEYWORD:` prefix):
bash
aws fis list-actions --region TARGET_REGION --query 'actions[?starts_with(id, `aws:KEYWORD:`)].{id:id, description:description}' --output json
Also scan the description field for the service name, because some actions may reference a service in their description even if the action prefix is different.
Step 3c (Optional): Collect cross-cutting actions — these affect services indirectly. Include them if the user's service would benefit from network, API, or infrastructure-level fault injection testing:
bash
aws fis list-actions --region TARGET_REGION --query 'actions[?starts_with(id, `aws:network:`) || starts_with(id, `aws:fis:inject`) || starts_with(id, `aws:ssm:`) || starts_with(id, `aws:ec2:stop`) || starts_with(id, `aws:ec2:terminate`)].{id:id, description:description}' --output json
Cross-cutting actions and when they're useful:
  • `aws:network:disrupt-connectivity`: useful for any VPC-based service
  • `aws:network:disrupt-vpc-endpoint`: useful for services accessed via PrivateLink
  • `aws:fis:inject-api-internal-error`: useful to test app handling of AWS API failures
  • `aws:fis:inject-api-throttle-error`: useful to test backoff/retry logic
  • `aws:fis:inject-api-unavailable-error`: useful to test graceful degradation
  • `aws:ec2:stop-instances` / `terminate-instances`: useful for services running on EC2
  • `aws:ssm:send-command` / `start-automation-execution`: useful for custom fault scripts
Whether to include cross-cutting actions depends on context:
  • Include when the service runs on EC2, uses VPC networking, or the user is interested in infrastructure-level failure testing
  • Skip when the user is focused only on service-native failures, or the service is fully managed with no user-accessible infrastructure layer
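When the `list-actions` JSON output has already been loaded, the same three filters (Steps 3b and 3c) can be applied offline. The sketch assumes each action is a dict with `id` and `description` keys, as produced by the `--query` expressions above; `split_actions` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

Action = Dict[str, str]

# Prefixes taken from the Step 3c query.
CROSS_CUTTING_PREFIXES = (
    "aws:network:", "aws:fis:inject", "aws:ssm:",
    "aws:ec2:stop", "aws:ec2:terminate",
)


def split_actions(
    actions: List[Action], keyword: str
) -> Tuple[List[Action], List[Action], List[Action]]:
    """Return (service-specific, description-only mentions, cross-cutting)."""
    prefix = f"aws:{keyword}:"
    service = [a for a in actions if a["id"].startswith(prefix)]
    mentioned = [
        a for a in actions
        if not a["id"].startswith(prefix)
        and keyword in a.get("description", "").lower()
    ]
    cross = [
        a for a in actions
        if a["id"].startswith(CROSS_CUTTING_PREFIXES)
        and not a["id"].startswith(prefix)  # e.g. keyword "ec2" stays service-specific
    ]
    return service, mentioned, cross
```

Description matches are substring-based and can be noisy, so they are best treated as candidates for manual review rather than confirmed service-specific actions.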

Path B: AWS CLI Not Available

Search the FIS actions reference documentation:
aws___search_documentation(
  search_phrase="AWS FIS actions [SERVICE_NAME] fault injection",
  topics=["reference_documentation"],
  limit=10
)
Then read the FIS actions reference page:
aws___read_documentation(
  url="https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html",
  max_length=10000
)

Decision Point: FIS Actions Found?

Count the number of service-specific actions found (exclude cross-cutting actions).
  • YES (1+ service-specific actions found) -> Continue to Step 4 (FIS-Enriched Path)
  • NO (zero service-specific actions) -> Jump to Step 5 (Documentation-Only Path)
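The decision point reduces to counting actions under the service's own prefix. A minimal sketch, with `next_step` as a hypothetical name:

```python
from typing import Iterable


def next_step(action_ids: Iterable[str], keyword: str) -> str:
    """Pick the research path from the count of service-specific actions."""
    prefix = f"aws:{keyword}:"
    service_specific = sum(1 for aid in action_ids if aid.startswith(prefix))
    if service_specific >= 1:
        return "Step 4 (FIS-Enriched Path)"
    return "Step 5 (Documentation-Only Path)"
```

Cross-cutting actions such as `aws:network:disrupt-connectivity` never match the service prefix, so they do not influence the choice.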

Step 4: FIS-Enriched Path

When FIS has native actions for the target service, combine Scenario Library findings with FIS-action-specific details.

4a: Organize FIS Actions into Testing Scenarios

Map each FIS action to a testing scenario. Use the "FIS Native Fault Injection Scenarios" table format from `references/output-template.md`.
IMPORTANT: Scenario Library deduplication (must apply before building the table). Before listing any FIS action in the per-service table, check whether that exact action ID appeared as a sub-action in any Scenario Library composite scenario discovered in Step 2. Common examples of overlap:
  • `aws:rds:failover-db-cluster`: sub-action of AZ Power Interruption
  • `aws:elasticache:replicationgroup-interrupt-az-power`: sub-action of AZ Power Interruption
  • `aws:eks:pod-network-latency`: sub-action of AZ Application Slowdown
  • `aws:eks:pod-network-packet-loss`: sub-action of Cross-AZ Traffic Slowdown
  • `aws:ec2:stop-instances`: sub-action of AZ Power Interruption
Rules:
  1. If an action is a Scenario Library sub-action, still list it in the per-service table but append to the "HA Verification Purpose" column: "(Also sub-action of {Scenario Name} — see Scenario Library section)".
  2. If all service-specific FIS actions are Scenario Library sub-actions (e.g., ElastiCache has only `replicationgroup-interrupt-az-power`, which is covered by AZ Power Interruption), omit the "FIS Native Fault Injection Scenarios" sub-section entirely and replace it with:
    All FIS native actions for {SERVICE} are covered by Scenario Library composite scenarios. See the Scenario Library and Cross-Cutting section for details.
Group scenarios by failure domain:
  1. Instance/Task Level — individual resource failure
  2. Storage Level — disk/volume failure or degradation
  3. Network Level — connectivity disruption
  4. AZ Level — availability zone failure simulation
  5. Region Level — cross-region failover
  6. API/Control Plane — AWS API errors
Scenario Library cross-reference: apply the deduplication rules above to every FIS action before finalizing the per-service table.
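The deduplication rules can be sketched as one pass over the service's actions. The data shapes are assumptions for illustration: each action is a dict with `id` and `purpose`, and `scenario_sub_actions` maps a scenario name to the set of action IDs found in Step 2. The note text mirrors the wording given in rule 1.

```python
from typing import Dict, List, Optional, Set

NOTE_TEMPLATE = "(Also sub-action of {scenario} \u2014 see Scenario Library section)"


def annotate_actions(
    actions: List[Dict[str, str]],
    scenario_sub_actions: Dict[str, Set[str]],
) -> Optional[List[Dict[str, str]]]:
    """Append Scenario Library notes; return None when every action is covered."""
    rows = []
    covered = 0
    for action in actions:
        scenarios = sorted(
            name for name, subs in scenario_sub_actions.items() if action["id"] in subs
        )
        purpose = action["purpose"]
        if scenarios:
            covered += 1
            notes = " ".join(NOTE_TEMPLATE.format(scenario=s) for s in scenarios)
            purpose = f"{purpose} {notes}"
        rows.append({"id": action["id"], "purpose": purpose})
    if actions and covered == len(actions):
        return None  # rule 2: omit the sub-section and emit the replacement note
    return rows
```

A `None` result signals that the report should carry the "All FIS native actions for {SERVICE} are covered..." sentence instead of a table.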

4b: Enrich with Service-Specific Capabilities

Some services have built-in fault injection beyond FIS. Search for these (sequentially — wait for the search to complete before reading any result pages):
aws___search_documentation(
  search_phrase="[SERVICE_NAME] fault injection testing failover simulation",
  topics=["general", "reference_documentation"],
  limit=10
)
If found, add a "Service Built-in Fault Injection" section using the table format from `references/output-template.md`.

4c: Deep Documentation Research

Use the search queries from `references/search-queries.md` under "FIS-Enriched Path". Run all 5 queries sequentially (one at a time). After the searches, read the top 3-5 most relevant pages one at a time and use `aws___recommend` on the most relevant page for discovery. Never send multiple read or recommend requests in parallel.

Step 5: Documentation-Only Path (No FIS Actions)

When FIS has no native actions for the target service, fall back to comprehensive documentation research. Note that Scenario Library findings from Step 2 still apply.

5a: Deep Documentation Search

Use the search queries from `references/search-queries.md` under "Documentation-Only Path". Run all 6 queries sequentially (one at a time, wait for each to complete).

5b: Read Key Pages and Discover Related Content

From the combined search results, read the top 5 most relevant pages following the priority order in `references/search-queries.md`. Read pages one at a time: wait for each `aws___read_documentation` call to complete before the next. Then use `aws___recommend` on the service's main documentation page to discover related content.
Extract from all pages:
  • Failure modes the service can experience
  • Built-in HA mechanisms (automatic failover, replication, etc.)
  • Testing approaches documented in official guides
  • Monitoring/metrics to watch during tests

5c: Compile Alternative Testing Approaches

Use the "Testing Methods (No Native FIS Actions)" section format from `references/output-template.md`, including both indirect FIS actions and AWS API/Console methods.

Step 6: Compile Output and Save to Local File

Write the report directly to a local markdown file instead of outputting the full content to the terminal. Use the following file naming convention:
bash
TIMESTAMP=$(TZ=Asia/Shanghai date +%Y-%m-%d-%H-%M-%S)
SERVICE_SLUG=$(echo "{SERVICE_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-')

File name: ${TIMESTAMP}-${SERVICE_SLUG}-chaos-research.md


For multi-service requests, generate **one file per service**:
- `${TIMESTAMP}-rds-chaos-research.md`
- `${TIMESTAMP}-eks-chaos-research.md`
- etc.
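For reference, the same naming convention can be reproduced outside the shell. This is a sketch, not part of the skill: `report_filename` and `report_filenames` are hypothetical names, and a fixed UTC+8 offset stands in for `TZ=Asia/Shanghai` (that zone currently observes no daylight saving time).

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Assumption: fixed UTC+8 offset in place of the Asia/Shanghai zone database entry.
SHANGHAI = timezone(timedelta(hours=8))


def report_filename(service_name: str, now: Optional[datetime] = None) -> str:
    """Build <timestamp>-<service-slug>-chaos-research.md for one service."""
    now = now or datetime.now(SHANGHAI)
    timestamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    # Lowercase, then map space, colon, and slash to hyphens (like the tr calls).
    slug = service_name.lower().translate(str.maketrans(" :/", "---"))
    return f"{timestamp}-{slug}-chaos-research.md"


def report_filenames(services: List[str], now: Optional[datetime] = None) -> List[str]:
    """One file per service, all sharing a single timestamp."""
    now = now or datetime.now(SHANGHAI)
    return [report_filename(service, now) for service in services]
```

Sharing one `now` value across services keeps the files from a multi-service request grouped under the same timestamp prefix.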

Compile the report content using the exact format defined in `references/output-template.md`
and save it to the file. The report must include all sections in this order:

1. **Executive Summary** — overview with region, FIS support status, key recommendation
2. **Scenario Library and Cross-Cutting** — Scenario Library composite scenarios (highest priority), cross-cutting actions as optional supplement. **This section comes BEFORE per-service sections.**
3. **Per-service sections** — each with: FIS scenarios (using `{SVC}-#` test IDs, e.g., `EKS-1`, `Redis-1`), built-in methods, recommended testing scenario matrix, environment observations, and stop conditions
4. **Recommended Test Priority (Consolidated)** — references test IDs from per-service sections; do NOT duplicate full descriptions; do NOT list a FIS action separately if already covered by a Scenario Library scenario in the same table
5. **Implementation Best Practices** — steady state, DNS/connection, blast radius
6. **Reference Materials** — only URLs from actual search results or pages read
7. **Next Steps** — 3-4 actionable next steps

After saving, print a brief summary to the terminal listing only:
- The file path(s) of the generated report(s)
- Target service(s) and region
- Number of FIS actions found (service-specific + cross-cutting)
- Number of Scenario Library scenarios identified
- Top 3 recommended test priorities

Important Guidelines

  • Scenario Library first, always. The FIS Scenario Library represents AWS's own curated resilience testing scenarios. Always read the latest Scenario Library documentation before anything else. These are documentation-based (console-only), not CLI-queryable.
  • Region matters. Always resolve the target region before querying FIS actions. FIS action availability varies by region. Always pass `--region` to the AWS CLI and clearly state the region in the output.
  • Don't fabricate FIS actions. If an action doesn't exist, say so clearly. The fallback path exists precisely for services FIS doesn't cover.
  • Don't fabricate links. Only include URLs from actual search results or known documentation pages you've read.
  • Be specific about the service. Every recommendation should reference the specific service, its HA mechanisms, and its specific metrics.
  • Cross-cutting actions are optional context. Include them when they add value, but focus on service-specific actions and Scenario Library scenarios first.
  • AWS Knowledge MCP only for docs research. Do NOT use SearXNG or other web search. Use `aws___search_documentation`, `aws___read_documentation`, and `aws___recommend`.
  • Search across multiple topics. Use different `topics` values (`general`, `reference_documentation`, `troubleshooting`) sequentially.
  • Use `aws___recommend` for discovery. After reading a key page, call `aws___recommend` to find related content that keyword search may miss.
  • NEVER send MCP requests in parallel. All calls to `aws___search_documentation`, `aws___read_documentation`, and `aws___recommend` MUST be executed one at a time. Wait for each response before sending the next request. Parallel calls will trigger "Too many requests" errors from the aws-knowledge-mcp-server. This is the single most common cause of failures, so enforce it strictly in every step.
  • Retry on failure — up to 10 times. If any MCP call fails with a rate limit or transient error, wait 5 seconds and retry. Repeat up to 10 times before skipping.
  • Respect language. Output in the same language as the user's conversation.