# eks-app-log-analysis

EKS App Log Analysis

Analyze EKS application logs during FIS fault injection experiments to understand how applications respond to infrastructure failures. Supports real-time monitoring and post-hoc analysis modes.
## Output Language Rule

Detect the language of the user's conversation and use the same language for all output:

- Chinese input → Chinese output
- English input → English output
## Prerequisites

Required tools:

- kubectl — configured with access to the target EKS cluster
- AWS CLI — for querying FIS experiment status
- A prepared/executed FIS experiment directory (from aws-fis-experiment-prepare or aws-fis-experiment-execute)
## Workflow

```dot
digraph log_analysis_flow {
  "Receive input path" [shape=box];
  "Detect mode" [shape=diamond];
  "Real-time mode" [shape=box];
  "Post-hoc mode" [shape=box];
  "Read service list" [shape=box];
  "Ask user for app dependencies" [shape=box];
  "Start background log collection" [shape=box];
  "Batch fetch historical logs" [shape=box];
  "Frontend polling + insight display" [shape=box];
  "Experiment complete?" [shape=diamond];
  "Generate analysis report" [shape=box];

  "Receive input path" -> "Detect mode";
  "Detect mode" -> "Real-time mode" [label="directory with README"];
  "Detect mode" -> "Post-hoc mode" [label="*-experiment-results.md"];
  "Real-time mode" -> "Read service list";
  "Post-hoc mode" -> "Read service list";
  "Read service list" -> "Ask user for app dependencies";
  "Ask user for app dependencies" -> "Start background log collection" [label="real-time"];
  "Ask user for app dependencies" -> "Batch fetch historical logs" [label="post-hoc"];
  "Start background log collection" -> "Frontend polling + insight display";
  "Frontend polling + insight display" -> "Experiment complete?";
  "Experiment complete?" -> "Frontend polling + insight display" [label="No, continue"];
  "Experiment complete?" -> "Generate analysis report" [label="Yes"];
  "Batch fetch historical logs" -> "Generate analysis report";
}
```
## Step 1: Detect Mode and Load Context
The user provides either:

- A directory path (e.g., `./az-power-interruption-2026-03-31-14-30-22/`) → Real-time mode
- A report file path (e.g., `./2026-03-31-...-experiment-results.md`) → Post-hoc mode
### Real-time Mode Detection

```bash
# Check if input is a directory with README.md
if [ -d "${INPUT_PATH}" ] && [ -f "${INPUT_PATH}/README.md" ]; then
  MODE="realtime"
  # Extract template ID from README (two possible key formats)
  TEMPLATE_ID=$(grep -oP 'experiment-template-id["\s:]+\K[A-Za-z0-9-]+' "${INPUT_PATH}/README.md" ||
    grep -oP 'ExperimentTemplateId["\s:]+\K[A-Za-z0-9-]+' "${INPUT_PATH}/README.md")
  REGION=$(grep -oP 'Region:["\s]+\K[a-z0-9-]+' "${INPUT_PATH}/README.md")
fi
```
### Post-hoc Mode Detection
```bash
# Check if input is an experiment results file
if [ -f "${INPUT_PATH}" ] && grep -q "FIS Experiment Results" "${INPUT_PATH}"; then
  MODE="posthoc"
  # Extract time range from report
  START_TIME=$(grep -oP 'Start Time:["\s]+\K[0-9T:+-]+' "${INPUT_PATH}")
  END_TIME=$(grep -oP 'End Time:["\s]+\K[0-9T:+-]+' "${INPUT_PATH}")
  EXPERIMENT_ID=$(grep -oP 'Experiment ID:["\s]+\K[A-Za-z0-9-]+' "${INPUT_PATH}")
fi
```
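The two checks above can be folded into a single dispatcher so callers get one mode value; a minimal sketch (the `detect_mode` helper name is ours, not from the source):

```shell
# Sketch: classify an input path as realtime, posthoc, or unknown,
# using the same two checks described above.
detect_mode() {
  local input_path="$1"
  if [ -d "${input_path}" ] && [ -f "${input_path}/README.md" ]; then
    echo "realtime"
  elif [ -f "${input_path}" ] && grep -q "FIS Experiment Results" "${input_path}"; then
    echo "posthoc"
  else
    echo "unknown"
    return 1
  fi
}
```

The nonzero return on `unknown` lets a caller fail fast and ask the user for a valid path instead of silently proceeding.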
## Step 2: Read Service List
Extract affected services from `expected-behavior.md` (real-time) or the experiment report (post-hoc).

```bash
# From expected-behavior.md - look for "### {Service Name}" sections
grep -oP '### \K[A-Za-z0-9 ]+(?= \()' "${EXPERIMENT_DIR}/expected-behavior.md"

# From the experiment report - look for "#### {Service Name}" in Per-Service Impact Analysis
grep -oP '#### \K[A-Za-z0-9 ]+(?= \()' "${REPORT_PATH}"
```

Present the service list to the user:

```
Detected affected services from the experiment:
- RDS (cluster-xxx)
- ElastiCache (redis-xxx)
- EC2 (instances in ap-northeast-1a)

For each service, please provide the EKS applications that depend on it.
```
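The grep output can be loaded straight into the `SERVICES` array used later for per-service grouping; a sketch (requires bash 4+ for `mapfile`, and the `load_services` name is illustrative):

```shell
# Sketch: extract service names from "### Name (...)" headings into an array,
# one element per heading line.
load_services() {
  mapfile -t SERVICES < <(grep -oP '### \K[A-Za-z0-9 ]+(?= \()' "$1")
}
```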
## Step 3: Collect Application Dependencies
For each service, ask the user to provide the dependent applications:

```
Which EKS applications depend on RDS (cluster-xxx)?
Please provide in format: namespace/deployment or namespace/pod-label-selector
Example: default/app-backend, production/api-server
```

Store the mapping:

```yaml
SERVICE_APP_MAP:
  rds-cluster-xxx:
    - namespace: default
      deployment: app-backend
    - namespace: production
      deployment: api-server
  elasticache-redis-xxx:
    - namespace: default
      deployment: cache-layer
```
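The user's free-form reply is worth validating before use; a sketch that splits a comma-separated answer into the `APPS` array consumed in Step 4 (the `parse_app_list` name is illustrative, not from the source):

```shell
# Sketch: split "ns/deploy, ns/deploy" replies into APPS, dropping entries
# that do not match the namespace/deployment shape.
parse_app_list() {
  local raw="$1" entry
  APPS=()
  IFS=',' read -ra _entries <<< "$raw"
  for entry in "${_entries[@]}"; do
    entry="${entry#"${entry%%[![:space:]]*}"}"   # trim leading whitespace
    entry="${entry%"${entry##*[![:space:]]}"}"   # trim trailing whitespace
    if [[ "$entry" == */* ]]; then
      APPS+=("$entry")
    else
      echo "Skipping '${entry}': expected namespace/deployment" >&2
    fi
  done
}
```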
## Step 4: Log Collection
### Real-time Mode: Background Collection

For each application, start a background `kubectl logs -f` process:

```bash
# Create output directory structure
LOG_DIR="${EXPERIMENT_DIR}/app-logs"
mkdir -p "${LOG_DIR}/rds-cluster-xxx"
mkdir -p "${LOG_DIR}/elasticache-redis-xxx"

# Start background log collection for each app
for app in "${APPS[@]}"; do
  NAMESPACE=$(echo "$app" | cut -d'/' -f1)
  DEPLOYMENT=$(echo "$app" | cut -d'/' -f2)
  SERVICE_DIR="${LOG_DIR}/${SERVICE_NAME}"

  kubectl logs -f "deployment/${DEPLOYMENT}" -n "${NAMESPACE}" \
    --timestamps --all-containers=true \
    >> "${SERVICE_DIR}/${DEPLOYMENT}.log" 2>&1 &

  # Store PID for cleanup
  echo $! >> "${LOG_DIR}/.pids"
done
```
### Post-hoc Mode: Batch Fetch
```bash
# Fetch logs for the experiment time window
kubectl logs "deployment/${DEPLOYMENT}" -n "${NAMESPACE}" \
  --timestamps --all-containers=true \
  --since-time="${START_TIME}" \
  > "${SERVICE_DIR}/${DEPLOYMENT}.log" 2>&1
```
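`kubectl logs` accepts `--since-time` but has no end-of-window flag, so the fetched file can include lines from after the experiment ended. A hedged trim step using the `--timestamps` prefix (RFC3339 timestamps in the same time zone compare correctly as plain strings; the `trim_to_window` name is ours):

```shell
# Sketch: keep only lines whose leading timestamp falls inside [start, end].
# Relies on --timestamps putting an RFC3339 timestamp in field 1.
trim_to_window() {
  local start="$1" end="$2"
  awk -v s="$start" -v e="$end" '$1 >= s && $1 <= e'
}
```

Usage: `trim_to_window "$START_TIME" "$END_TIME" < "${SERVICE_DIR}/${DEPLOYMENT}.log" > trimmed.log`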
## Step 5: Real-time Monitoring Display
Poll every 30 seconds and display insights per service group:

```bash
while experiment_is_running; do
  clear_screen_section

  for SERVICE in "${SERVICES[@]}"; do
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo "[$(date +%H:%M:%S)] ${SERVICE} Impact Analysis"
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

    for APP in ${SERVICE_APPS[$SERVICE]}; do
      LOG_FILE="${LOG_DIR}/${SERVICE}/${APP}.log"

      # Get the last 30 seconds of logs (UTC, matching kubectl --timestamps)
      RECENT_LOGS=$(tail -1000 "$LOG_FILE" | awk -v cutoff="$(date -u -d '30 seconds ago' +%Y-%m-%dT%H:%M:%S)" '$1 >= cutoff')

      # Count errors and warnings
      ERROR_COUNT=$(echo "$RECENT_LOGS" | grep -ciE 'error|exception|fail|refused|timeout')
      WARN_COUNT=$(echo "$RECENT_LOGS" | grep -ciE 'warn|retry')

      echo ""
      echo "▶ ${APP} (last 30s: ${ERROR_COUNT} errors, ${WARN_COUNT} warnings)"
      echo "┌─────────────────────────────────────────────────────────────┐"
      # Show the 5 most recent error-related lines
      echo "$RECENT_LOGS" | grep -iE 'error|exception|fail|refused|timeout' | tail -5
      echo "└─────────────────────────────────────────────────────────────┘"

      # Generate insight
      if [ "$ERROR_COUNT" -gt 0 ]; then
        FIRST_ERROR=$(echo "$RECENT_LOGS" | grep -iE 'error|exception' | head -1 | cut -d' ' -f1)
        LAST_ERROR=$(echo "$RECENT_LOGS" | grep -iE 'error|exception' | tail -1 | cut -d' ' -f1)
        echo "💡 Insight: ${ERROR_COUNT} errors between ${FIRST_ERROR} - ${LAST_ERROR}"

        # Detect recovery
        if echo "$RECENT_LOGS" | tail -5 | grep -qiE 'connected|restored|success|recovered'; then
          echo "✅ Recovery signal detected in recent logs"
        fi
      else
        echo "✅ No errors detected"
      fi
    done
  done

  sleep 30
done
```
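One portability caveat: `date -d '30 seconds ago'` is GNU coreutils syntax and fails on BSD/macOS `date`. A hedged helper that tries the GNU form first and falls back to the BSD flag:

```shell
# Sketch: UTC cutoff 30 seconds in the past, GNU date first, BSD fallback.
cutoff_30s_ago() {
  date -u -d '30 seconds ago' +%Y-%m-%dT%H:%M:%S 2>/dev/null \
    || date -u -v-30S +%Y-%m-%dT%H:%M:%S
}
```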
## Step 6: Check Experiment Status (Real-time Mode)
```bash
check_experiment_status() {
  # Query running experiments for this template
  RUNNING=$(aws fis list-experiments \
    --query "experiments[?experimentTemplateId=='${TEMPLATE_ID}' && state.status=='running']" \
    --region "${REGION}" --output json)

  if [ "$(echo "$RUNNING" | jq length)" -gt 0 ]; then
    return 0  # Still running
  else
    return 1  # Completed or not started
  fi
}
```
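Step 5's `experiment_is_running` guard can wrap this check, with a deadline so monitoring cannot spin forever; a sketch (the 2-hour default cap is an assumption, not from the source):

```shell
# Sketch: poll until the experiment leaves the "running" state, or give up
# after a deadline. Takes an optional cap in seconds (default 7200 = 2h).
wait_for_experiment() {
  local deadline=$(( $(date +%s) + ${1:-7200} ))
  while check_experiment_status; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out waiting for the FIS experiment to finish" >&2
      return 1
    fi
    sleep 30
  done
}
```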
## Step 7: Generate Analysis Report
After the experiment completes (or immediately in post-hoc mode), generate the report:

```bash
TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
REPORT_FILE="${EXPERIMENT_DIR}/${TIMESTAMP}-app-log-analysis.md"
```

Report structure:

```markdown
# Application Log Analysis Report

Experiment ID: {EXPERIMENT_ID}
Analysis Time: {TIMESTAMP}
Time Range: {START_TIME} - {END_TIME}
Duration: {DURATION}

## Summary

| Service | Application | Total Errors | Peak Error Rate | Recovery Time |
|---|---|---|---|---|
| {service} | {app} | {count} | {rate}/min | {time} |

## Per-Service Application Analysis

### {Service Name} ({resource_id})

#### {Application Name} ({namespace}/{deployment})

Error Timeline:

| Time (UTC) | Level | Message |
|---|---|---|
| {HH:MM:SS} | ERROR | {truncated message} |
| ... | ... | ... |

Key Error Patterns:

| Pattern | Count | First Occurrence | Last Occurrence |
|---|---|---|---|
| Connection refused | {n} | {time} | {time} |
| Timeout | {n} | {time} | {time} |

Log Sample (Critical Errors):

{5-10 lines of actual error logs}

Insights:
- {insight_1}: Error spike at {time}, correlates with {service} failover
- {insight_2}: Recovery detected at {time}, {duration} after fault injection ended
- {insight_3}: Application retry mechanism worked/failed because...

(Repeat for each application)

## Cross-Service Correlation

| Time | Event | RDS Impact | ElastiCache Impact | Application Response |
|---|---|---|---|---|
| {time} | Fault injection start | - | - | First errors appear |
| {time} | {service} failover | Connection errors | - | Retrying... |
| {time} | Recovery | Connections restored | - | Normal operation |

## Recommendations

- {Issue}: {description}
  - Impact: {what happened}
  - Recommendation: {what to improve}

## Appendix: Log File Locations

| Application | Log File |
|---|---|
| {app} | |
```
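The Summary table's "Peak Error Rate" column can be computed from the collected log files; a minimal sketch that buckets error-pattern lines per minute (assumes the RFC3339 `--timestamps` prefix from Step 4; the `peak_error_rate` name is ours):

```shell
# Sketch: count error-pattern lines per minute and print the busiest bucket
# as "<count> <YYYY-MM-DDTHH:MM>". cut -c1-16 keeps minute granularity.
peak_error_rate() {
  grep -iE 'error|exception|fail|refused|timeout' "$1" \
    | cut -c1-16 \
    | sort | uniq -c | sort -rn | head -1 \
    | awk '{print $1, $2}'
}
```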
## Step 8: Cleanup (Real-time Mode)
Stop all background log collection processes:

```bash
cleanup_log_collectors() {
  if [ -f "${LOG_DIR}/.pids" ]; then
    while read -r pid; do
      kill "$pid" 2>/dev/null
    done < "${LOG_DIR}/.pids"
    rm "${LOG_DIR}/.pids"
  fi
}

# Register cleanup on exit
trap cleanup_log_collectors EXIT
```
## Error Handling
| Error | Cause | Resolution |
|---|---|---|
| kubectl not installed | - | Install kubectl and configure kubeconfig |
| kubeconfig not configured | - | Run the kubeconfig setup command for the cluster |
| Deployment/pod doesn't exist | - | Verify deployment name and namespace |
| Pod not running or restarted | - | Check pod status; may need to fetch from CloudWatch Logs |
| Template ID not found | README format changed | Manually provide template ID |
## Output Files

```
{experiment-dir}/
├── app-logs/
│   ├── rds-cluster-xxx/
│   │   ├── app-backend.log
│   │   └── api-server.log
│   ├── elasticache-redis-xxx/
│   │   └── cache-layer.log
│   └── .pids (temporary, cleaned up)
└── {timestamp}-app-log-analysis.md
```
## Usage Examples
### Real-time monitoring (during experiment)

- "Analyze app logs for ./az-power-interruption-2026-03-31-14-30-22/"
- "Monitor application behavior in the experiment directory"
- "Monitor application logs in real time"

### Post-hoc analysis (after experiment)

- "Analyze app logs using ./2026-03-31-14-35-00-az-power-interruption-experiment-results.md"
- "Analyze application behavior from the experiment report"
- "Check what happened to applications during the experiment"
## Integration with Other Skills

- aws-fis-experiment-prepare — reads `README.md` and `expected-behavior.md` for context
- aws-fis-experiment-execute — reads `*-experiment-results.md` for the time range and service list
- Does NOT modify any files from other skills