aws-fis-experiment-execute

AWS FIS Experiment Execute

Deploy infrastructure, run an AWS FIS experiment, monitor its progress, and generate a results report. Reads configuration files from a prepared experiment directory.

Output Language Rule

Detect the language of the user's conversation and use the same language for all output.

Chinese input -> Chinese output
English input -> English output

Prerequisites

Required tools:

AWS CLI, with the aws fis, aws iam, aws cloudwatch, and aws cloudformation commands
A prepared experiment directory (from the aws-fis-experiment-prepare skill)

Workflow

dot

digraph execute_flow {
    "Load experiment directory" [shape=box];
    "Validate files" [shape=box];
    "Choose deployment method" [shape=diamond];
    "CLI deployment" [shape=box];
    "CFN deployment" [shape=box];
    "User confirms deployment" [shape=diamond];
    "Deploy resources" [shape=box];
    "User confirms experiment start" [shape=diamond, style=bold, color=red];
    "Start experiment" [shape=box];
    "Monitor experiment" [shape=box];
    "Experiment complete?" [shape=diamond];
    "Generate results report" [shape=box];

    "Load experiment directory" -> "Validate files";
    "Validate files" -> "Choose deployment method";
    "Choose deployment method" -> "CLI deployment" [label="CLI"];
    "Choose deployment method" -> "CFN deployment" [label="CFN"];
    "CLI deployment" -> "User confirms deployment";
    "CFN deployment" -> "User confirms deployment";
    "User confirms deployment" -> "Deploy resources" [label="Yes"];
    "User confirms deployment" -> "Load experiment directory" [label="No, abort"];
    "Deploy resources" -> "User confirms experiment start";
    "User confirms experiment start" -> "Start experiment" [label="Yes, I confirm"];
    "User confirms experiment start" -> "Generate results report" [label="No, abort"];
    "Start experiment" -> "Monitor experiment";
    "Monitor experiment" -> "Experiment complete?" ;
    "Experiment complete?" -> "Monitor experiment" [label="No, poll again"];
    "Experiment complete?" -> "Generate results report" [label="Yes"];
}

Step 1: Load and Validate Experiment Directory

The user provides the path to the experiment directory. Verify it contains the required files:

bash

EXPERIMENT_DIR="{USER_PROVIDED_PATH}"

# Required files
ls "${EXPERIMENT_DIR}/experiment-template.json"
ls "${EXPERIMENT_DIR}/iam-policy.json"
ls "${EXPERIMENT_DIR}/cfn-template.yaml"
ls "${EXPERIMENT_DIR}/README.md"
ls "${EXPERIMENT_DIR}/expected-behavior.md"

# Optional files
ls "${EXPERIMENT_DIR}/alarms/stop-condition-alarms.json" 2>/dev/null
ls "${EXPERIMENT_DIR}/alarms/dashboard.json" 2>/dev/null


Read `README.md` to understand the experiment and present a summary to the user:
- Scenario name
- Target region and AZ
- Affected resources
- Estimated duration
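
The file checks in this step can be wrapped in one function so that a missing required file stops the workflow early. This is a minimal sketch, not part of the skill itself; the function name `validate_experiment_dir` and its output wording are assumptions made for illustration.

```bash
# Check an experiment directory for the required and optional files.
# Prints one line per missing file and returns non-zero if a required file is missing.
validate_experiment_dir() {
  local dir="$1"
  local missing=0
  local f

  for f in experiment-template.json iam-policy.json cfn-template.yaml README.md expected-behavior.md; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "MISSING required: ${f}"
      missing=1
    fi
  done

  # Optional files only produce a note; they do not fail validation.
  for f in alarms/stop-condition-alarms.json alarms/dashboard.json; do
    [ -f "${dir}/${f}" ] || echo "absent optional: ${f}"
  done

  return "${missing}"
}
```

A caller would run `validate_experiment_dir "${EXPERIMENT_DIR}" || exit 1` before presenting the README summary.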

Step 2: Choose Deployment Method

Ask the user:

How would you like to deploy the experiment resources?

AWS CLI — Step-by-step deployment with individual commands

CloudFormation — All-in-one stack deployment

Step 3: Deploy Resources

Path A: AWS CLI Deployment

Execute commands sequentially, showing each command before running it. See references/cli-commands.md for the exact command sequence.

3a. Create IAM Role

bash

# Show command to user, wait for confirmation
aws iam create-role \
  --role-name "FISExperimentRole-{SCENARIO}" \
  --assume-role-policy-document '{...}' \
  --region {REGION}

aws iam put-role-policy \
  --role-name "FISExperimentRole-{SCENARIO}" \
  --policy-name FISExperimentPolicy \
  --policy-document "file://${EXPERIMENT_DIR}/iam-policy.json"

3b. Create CloudWatch Alarms (Stop Conditions)

Read `alarms/stop-condition-alarms.json` and create each alarm:

bash

aws cloudwatch put-metric-alarm --cli-input-json '{...}' --region {REGION}

3c. Create CloudWatch Dashboard (Optional)

bash

aws cloudwatch put-dashboard \
  --dashboard-name "FIS-{SCENARIO}" \
  --dashboard-body "file://${EXPERIMENT_DIR}/alarms/dashboard.json" \
  --region {REGION}

3d. Update experiment-template.json with real ARNs

After creating IAM role and alarms, update the experiment template with:

Actual IAM role ARN
Actual alarm ARNs for stop conditions
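
One way to obtain these ARNs is to query them back from IAM and CloudWatch after creation. The sketch below assumes the role and alarm names used in the earlier commands; the helper names `role_arn` and `alarm_arn` are illustrative only. The returned values would go into the template's `roleArn` field and the `value` of each `stopConditions` entry.

```bash
# Print the ARN of an IAM role, given its name.
role_arn() {
  aws iam get-role --role-name "$1" --query 'Role.Arn' --output text
}

# Print the ARN of a CloudWatch alarm, given its name and region.
alarm_arn() {
  aws cloudwatch describe-alarms --alarm-names "$1" \
    --query 'MetricAlarms[0].AlarmArn' --output text --region "$2"
}

# Usage sketch (placeholders as in the rest of this document):
#   ROLE_ARN=$(role_arn "FISExperimentRole-{SCENARIO}")
#   ALARM_ARN=$(alarm_arn "FIS-StopCondition-{SCENARIO}-{SERVICE}" "{REGION}")
```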

3e. Create FIS Experiment Template

bash

aws fis create-experiment-template \
  --cli-input-json "file://${EXPERIMENT_DIR}/experiment-template.json" \
  --region {REGION}

Save the returned experimentTemplate.id for the next step.
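
To keep the ID without copying it by hand, the create call can be wrapped so that it prints only that field. This is a sketch under the assumption that the template file lives at the path used above; `create_template` is a hypothetical helper name.

```bash
# Create the FIS experiment template and print only its ID.
# $1 = experiment directory, $2 = region
create_template() {
  aws fis create-experiment-template \
    --cli-input-json "file://$1/experiment-template.json" \
    --region "$2" \
    --query 'experimentTemplate.id' --output text
}

# Usage sketch:
#   TEMPLATE_ID=$(create_template "${EXPERIMENT_DIR}" "{REGION}")
```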

Path B: CloudFormation Deployment

bash

aws cloudformation deploy \
  --template-file "${EXPERIMENT_DIR}/cfn-template.yaml" \
  --stack-name "fis-{SCENARIO}-{TIMESTAMP}" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region {REGION}

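Wait for stack creation to complete (aws cloudformation deploy normally blocks until the stack operation finishes, so this acts as an extra check):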

bash

aws cloudformation wait stack-create-complete \
  --stack-name "fis-{SCENARIO}-{TIMESTAMP}" \
  --region {REGION}

Extract the experiment template ID from stack outputs:

bash

TEMPLATE_ID=$(aws cloudformation describe-stacks \
  --stack-name "fis-{SCENARIO}-{TIMESTAMP}" \
  --query 'Stacks[0].Outputs[?OutputKey==`ExperimentTemplateId`].OutputValue' \
  --output text --region {REGION})

Step 4: Start Experiment (CRITICAL CONFIRMATION)

This is the most dangerous step. The experiment WILL affect real resources.

Before starting, present a clear warning:

⚠️  WARNING: Starting this FIS experiment will cause REAL impact:

Scenario:    {SCENARIO_NAME}
Region:      {REGION}
Target AZ:   {AZ_ID}
Duration:    {DURATION}

Resources that WILL be affected:
  - {list each affected resource type and count}

Stop Conditions:
  - {list each alarm that will stop the experiment}

Type "Yes, start experiment" to proceed, or "No" to abort.

Only proceed if the user explicitly confirms.

bash

aws fis start-experiment \
  --experiment-template-id "{TEMPLATE_ID}" \
  --region {REGION}

Save the returned experiment.id.
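
The confirmation gate and the start call can be combined so that the experiment only starts on an exact match of the confirmation phrase. The following is a minimal sketch; the helper names `confirmed` and `start_experiment` are assumptions, and the phrase is the one shown in the warning above.

```bash
# Return 0 only when the reply matches the confirmation phrase exactly.
confirmed() {
  [ "$1" = "Yes, start experiment" ]
}

# Start the experiment and print only its ID.
# $1 = experiment template ID, $2 = region
start_experiment() {
  aws fis start-experiment \
    --experiment-template-id "$1" \
    --region "$2" \
    --query 'experiment.id' --output text
}

# Usage sketch:
#   printf 'Type "Yes, start experiment" to proceed, or "No" to abort: '
#   IFS= read -r reply
#   if confirmed "$reply"; then
#     EXPERIMENT_ID=$(start_experiment "$TEMPLATE_ID" "$REGION")
#   else
#     echo "Aborted; no experiment was started."
#   fi
```

Matching the full phrase rather than a bare "y" makes an accidental start less likely.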

Step 5: Monitor Experiment

Poll the experiment status and display progress:

bash

aws fis get-experiment \
  --id "{EXPERIMENT_ID}" \
  --region {REGION} \
  --query '{
    State: experiment.state.status,
    Reason: experiment.state.reason,
    StartTime: experiment.startTime,
    EndTime: experiment.endTime,
    Actions: experiment.actions
  }'

Polling strategy:

Poll every 30 seconds for the first 5 minutes
Poll every 60 seconds after that
Show current status after each poll
Record timestamps for each status change and action state transition — these feed into the per-service timeline in the final report
Track per-service events: for each service in `expected-behavior.md`, note when it was impacted (action started), when it recovered, and any intermediate states. Query service-specific status (e.g., RDS instance status, ElastiCache replication group status, EKS node status) during monitoring to capture detailed observations.

Status values:

`initiating`: Experiment is starting
`running`: Experiment is in progress
`completed`: Experiment finished successfully
`stopping`: Experiment is being stopped (by user or stop condition)
`stopped`: Experiment was stopped before completion
`failed`: Experiment failed
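
The polling strategy and status values above can be sketched as a small loop. This is an illustration under assumptions, not the skill's actual implementation: the function names are hypothetical, and only `completed`, `stopped`, and `failed` from the list above are treated as terminal.

```bash
# Seconds to wait before the next poll, given seconds elapsed since the start.
poll_interval() {
  if [ "$1" -lt 300 ]; then echo 30; else echo 60; fi
}

# Return 0 when the status is terminal (no further polling needed).
is_terminal() {
  case "$1" in
    completed|stopped|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until a terminal state, printing a UTC timestamp and status after each poll.
# $1 = experiment ID, $2 = region
monitor_experiment() {
  local id="$1"
  local region="$2"
  local elapsed=0
  local status
  local delay
  while :; do
    status=$(aws fis get-experiment --id "$id" --region "$region" \
      --query 'experiment.state.status' --output text)
    echo "$(date -u +%H:%M:%S) status: ${status}"
    if is_terminal "$status"; then
      break
    fi
    delay=$(poll_interval "$elapsed")
    sleep "$delay"
    elapsed=$((elapsed + delay))
  done
}
```

The printed timestamps can be collected as raw material for the per-service timeline in the final report.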

During monitoring, remind the user:

Check the CloudWatch dashboard for real-time metrics
Read `expected-behavior.md` to compare actual vs expected behavior

The experiment can be stopped at any time:

bash

aws fis stop-experiment --id "{EXPERIMENT_ID}" --region {REGION}

Step 6: Save Results Report to Local File

After the experiment completes (any terminal state), generate a results report and write it directly to a local markdown file instead of outputting the full content to the terminal. Use the following file naming convention:

bash

TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
SCENARIO_SLUG=$(echo "{SCENARIO_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-')

File name: ${TIMESTAMP}-${SCENARIO_SLUG}-experiment-results.md

Save the file in the experiment directory (${EXPERIMENT_DIR})
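
The naming convention can be folded into two helpers that return the full report path. This is a sketch with hypothetical function names. It also adds `-s` to the second `tr` call, which is not in the command above, so that a name such as "EKS: Node/Pod Failure" does not produce doubled hyphens.

```bash
# Lowercase the name, map spaces, colons, and slashes to "-", and squeeze repeats.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' :/' '-'
}

# Print the results file path for a scenario.
# $1 = experiment directory, $2 = scenario name
results_path() {
  local dir="$1"
  local name="$2"
  local ts
  ts=$(date +%Y-%m-%d-%H-%M-%S)
  echo "${dir}/${ts}-$(slugify "$name")-experiment-results.md"
}

# Usage sketch:
#   REPORT_FILE=$(results_path "${EXPERIMENT_DIR}" "{SCENARIO_NAME}")
```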


**Timeline emphasis:** Timestamps in the report header (Start Time, End Time) use full
ISO 8601 with timezone (e.g., `2025-03-30T14:05:32+08:00`). In timeline tables and
action results, use **time-only format in UTC** (e.g., `06:05:32` for the timestamp
above); the report date is already in the header, so repeating it on every row adds
clutter. Label the column header "Time (UTC)" so the timezone is clear. Do not show
milliseconds anywhere. Embed timeline events directly in each service's impact analysis
section; do NOT create a separate standalone timeline section. This lets readers see the
full picture (timeline + impact + findings) for each service without jumping between sections.
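
Converting a header-style timestamp to the time-only UTC form can be done with `date`. The sketch below assumes GNU `date` (the `-d` option is not available in BSD/macOS `date`), and the function name `to_utc_time` is illustrative.

```bash
# Convert an ISO 8601 timestamp with offset to HH:MM:SS in UTC (GNU date only).
to_utc_time() {
  date -u -d "$1" +%H:%M:%S
}

# Usage sketch: 14:05:32 at +08:00 is eight hours ahead of UTC.
#   to_utc_time "2025-03-30T14:05:32+08:00"
```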

**Per-service analysis:** Read `expected-behavior.md` from the experiment directory to
identify all services under test. For each service, create a sub-section under
"Per-Service Impact Analysis" that includes: (1) the timeline events relevant to that
service, (2) observed behavior from monitoring, (3) key findings. Also check for
indirectly affected services (e.g., MSK affected by network disruption) and include
them.

The results report file must include:

```markdown
# FIS Experiment Results

- Experiment ID: {EXPERIMENT_ID}
- Template ID: {TEMPLATE_ID}
- Status: {FINAL_STATUS}
- Start Time: {START_TIME}
- End Time: {END_TIME}
- Duration: {ACTUAL_DURATION}

## Action Results

| Action | Action ID | Status | Start (UTC) | End (UTC) | Duration |
|---|---|---|---|---|---|
| {action_name} | {action_id} | {status} | {HH:MM:SS} | {HH:MM:SS} | {duration} |

## Stop Condition Alarms

| Alarm | Final Status |
|---|---|
| {alarm_name} | {OK/ALARM} |

## Per-Service Impact Analysis

For EACH service listed in expected-behavior.md, create a sub-section below. Also include indirectly affected services (e.g., services impacted by network disruption even without a dedicated FIS action).

### {Service Name} ({resource_identifier})

| Time (UTC) | Event | Observation |
|---|---|---|
| {HH:MM:SS} | {event} | {what was observed at this point} |
| {HH:MM:SS} | {event} | {observed result / status change} |
| ... | ... | ... |

Key Findings:

- {finding_1: what happened and why}
- {finding_2: recovery behavior}

(Repeat for each service)

## Recovery Status Summary

| Resource | Recovery Status | Notes |
|---|---|---|
| {service} | {Recovered / Partially Recovered / Recovering} | {details} |

## Issues Requiring Attention

### 1. {Issue title}

- Problem: {description}
- Recommendation: {action to take, with CLI command if applicable}

## Cleanup

{cleanup instructions with CLI commands}
```

After saving the file, print a brief summary to the terminal listing only:
- The file path of the saved results report
- Experiment ID and final status
- Start time, end time, and duration (all timestamps in ISO 8601 with timezone)
- Per-action status (one line each)
- Per-service recovery status (one line each)
- Issues requiring attention (if any)
- Cleanup instructions

Safety Rules

Never auto-start experiments. Always require explicit user confirmation.
Show every CLI command before executing it.
Display impact warning before experiment start with specific resource list.
Provide abort instructions at every step.
Never delete resources without user confirmation.
Recommend dry-run first — suggest the user review all files before deploying.

Cleanup Guide

After the experiment, offer cleanup:

CLI Cleanup

bash

# Delete experiment template
aws fis delete-experiment-template --id "{TEMPLATE_ID}" --region {REGION}

# Delete CloudWatch alarms
aws cloudwatch delete-alarms --alarm-names "FIS-StopCondition-{SCENARIO}-{SERVICE}" --region {REGION}

# Delete CloudWatch dashboard
aws cloudwatch delete-dashboards --dashboard-names "FIS-{SCENARIO}" --region {REGION}

# Delete IAM role
aws iam delete-role-policy --role-name "FISExperimentRole-{SCENARIO}" --policy-name FISExperimentPolicy
aws iam delete-role --role-name "FISExperimentRole-{SCENARIO}"

CFN Cleanup

bash

aws cloudformation delete-stack --stack-name "fis-{SCENARIO}-{TIMESTAMP}" --region {REGION}

Error Handling

| Error | Cause | Resolution |
|---|---|---|
| `AccessDeniedException` | Insufficient permissions | Check the IAM policy in iam-policy.json |
| `ValidationException` on template | Invalid template JSON | Compare the file against the skeleton printed by `aws fis create-experiment-template --generate-cli-skeleton` |
| `ResourceNotFoundException` on targets | Tagged resources not found | Verify that resource tags match the template |
| Alarm creation fails | Metric/namespace mismatch | Check that the metric name and namespace exist |
| Stack creation fails | CFN template validation error | Run `aws cloudformation validate-template` first |
| Experiment stuck in `initiating` | IAM role propagation delay | Wait 30 seconds and check again |
