azure-resource-health-diagnose

Azure Resource Health & Issue Diagnosis

This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.

Prerequisites

- Azure MCP server configured and authenticated
- Target Azure resource identified (name and optionally resource group/subscription)
- Resource must be deployed and running to generate logs/telemetry
- Prefer Azure MCP tools (`azmcp-*`) over direct Azure CLI when available

Workflow Steps

Step 1: Get Azure Best Practices

Action: Retrieve diagnostic and troubleshooting best practices
Tools: Azure MCP best practices tool
Process:

Load Best Practices:
- Execute Azure best practices tool to get diagnostic guidelines
- Focus on health monitoring, log analysis, and issue resolution patterns
- Use these practices to inform diagnostic approach and remediation recommendations

Step 2: Resource Discovery & Identification

Action: Locate and identify the target Azure resource
Tools: Azure MCP tools + Azure CLI fallback
Process:

Resource Lookup:
- If only resource name provided: Search across subscriptions using `azmcp-subscription-list`
- Use `az resource list --name <resource-name>` to find matching resources
- If multiple matches found, prompt user to specify subscription/resource group
- Gather detailed resource information:
  - Resource type and current status
  - Location, tags, and configuration
  - Associated services and dependencies
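
The lookup rules above (no match, exactly one match, several matches) can be sketched as a small helper. This is a minimal sketch, assuming `matches` is the parsed JSON array from `az resource list`, whose entries carry `name` and `resourceGroup` fields; the function name `resolve_target` is hypothetical.

```python
from typing import Any


def resolve_target(matches: list[dict[str, Any]]) -> dict[str, Any]:
    """Return the single matching resource or raise with a disambiguation hint."""
    if not matches:
        raise LookupError("No resource found; check the name or specify a subscription.")
    if len(matches) > 1:
        # List every candidate so the user can pick a resource group.
        options = ", ".join(
            sorted(f"{m['resourceGroup']}/{m['name']}" for m in matches)
        )
        raise LookupError(
            f"Multiple matches; specify subscription/resource group: {options}"
        )
    return matches[0]
```

Raising on ambiguity, rather than picking the first match, keeps the workflow from diagnosing the wrong resource.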
Resource Type Detection:
- Identify resource type to determine appropriate diagnostic approach:
  - Web Apps/Function Apps: Application logs, performance metrics, dependency tracking
  - Virtual Machines: System logs, performance counters, boot diagnostics
  - Cosmos DB: Request metrics, throttling, partition statistics
  - Storage Accounts: Access logs, performance metrics, availability
  - SQL Database: Query performance, connection logs, resource utilization
  - Application Insights: Application telemetry, exceptions, dependencies
  - Key Vault: Access logs, certificate status, secret usage
  - Service Bus: Message metrics, dead letter queues, throughput
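
One way to encode the list above is a lookup table keyed by ARM resource type. The sketch below is illustrative only: the ARM type strings are the standard provider names, while `diagnostic_focus` and the `GENERIC_FOCUS` fallback are assumptions added for the example.

```python
# ARM resource type (lowercased) -> diagnostic focus areas from the list above.
DIAGNOSTIC_FOCUS = {
    "microsoft.web/sites": ["application logs", "performance metrics", "dependency tracking"],
    "microsoft.compute/virtualmachines": ["system logs", "performance counters", "boot diagnostics"],
    "microsoft.documentdb/databaseaccounts": ["request metrics", "throttling", "partition statistics"],
    "microsoft.storage/storageaccounts": ["access logs", "performance metrics", "availability"],
    "microsoft.sql/servers/databases": ["query performance", "connection logs", "resource utilization"],
    "microsoft.insights/components": ["application telemetry", "exceptions", "dependencies"],
    "microsoft.keyvault/vaults": ["access logs", "certificate status", "secret usage"],
    "microsoft.servicebus/namespaces": ["message metrics", "dead letter queues", "throughput"],
}

# Assumed fallback for resource types the list does not cover.
GENERIC_FOCUS = ["provisioning state", "activity log", "platform metrics"]


def diagnostic_focus(resource_type: str) -> list[str]:
    """Return the diagnostic focus areas for an ARM resource type (case-insensitive)."""
    return DIAGNOSTIC_FOCUS.get(resource_type.strip().lower(), GENERIC_FOCUS)
```

Web Apps and Function Apps share the `Microsoft.Web/sites` type, so they map to the same entry here.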

Step 3: Health Status Assessment

Action: Evaluate current resource health and availability
Tools: Azure MCP monitoring tools + Azure CLI
Process:

Basic Health Check:
- Check resource provisioning state and operational status
- Verify service availability and responsiveness
- Review recent deployment or configuration changes
- Assess current resource utilization (CPU, memory, storage, etc.)
Service-Specific Health Indicators:
- Web Apps: HTTP response codes, response times, uptime
- Databases: Connection success rate, query performance, deadlocks
- Storage: Availability percentage, request success rate, latency
- VMs: Boot diagnostics, guest OS metrics, network connectivity
- Functions: Execution success rate, duration, error frequency
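
Several of the indicators above reduce to a success rate that is then mapped to a status. The following is a rough sketch of that mapping. The 99% and 95% thresholds are placeholder assumptions, not values from this workflow, and should be tuned per service.

```python
def success_rate(total: int, failed: int) -> float:
    """Percentage of successful operations; refuses to guess when there is no traffic."""
    if total <= 0:
        raise ValueError("no requests observed")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return 100.0 * (total - failed) / total


def health_status(rate: float, warn_below: float = 99.0, critical_below: float = 95.0) -> str:
    """Map a success rate (0-100) to Healthy/Warning/Critical using assumed thresholds."""
    if not 0.0 <= rate <= 100.0:
        raise ValueError("rate must be a percentage between 0 and 100")
    if rate < critical_below:
        return "Critical"
    if rate < warn_below:
        return "Warning"
    return "Healthy"
```

For example, 30 failures out of 1000 requests gives a 97% success rate, which these thresholds report as `Warning`.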

Step 4: Log & Telemetry Analysis

Action: Analyze logs and telemetry to identify issues and patterns
Tools: Azure MCP monitoring tools for Log Analytics queries
Process:

Find Monitoring Sources:
- Use `azmcp-monitor-workspace-list` to identify Log Analytics workspaces
- Locate Application Insights instances associated with the resource
- Identify relevant log tables using `azmcp-monitor-table-list`

Execute Diagnostic Queries: Use `azmcp-monitor-log-query` with targeted KQL queries based on resource type:

General Error Analysis:

```kql
// Recent errors and exceptions
union isfuzzy=true
    AzureDiagnostics,
    AppServiceHTTPLogs,
    AppServiceAppLogs,
    AzureActivity
| where TimeGenerated > ago(24h)
// Require a non-empty ResultType so tables that lack the column are not matched wholesale
| where Level == "Error" or (isnotempty(ResultType) and ResultType != "Success")
| summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```

Performance Analysis:

```kql
// Performance degradation patterns
Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| where avg_CounterValue > 80
```

Application-Specific Queries:

```kql
// Application Insights - Failed requests
requests
| where timestamp > ago(24h)
| where success == false
| summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
| order by timestamp desc

// Database - Connection failures
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SQL"
| where Category == "SQLSecurityAuditEvents"
| where action_name_s == "CONNECTION_FAILED"
| summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
```
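
When the time window needs to vary, the failed-requests query above can be generated instead of hand-edited. This is a minimal sketch; `failed_requests_query` is a hypothetical helper, and how the resulting string is passed to `azmcp-monitor-log-query` is left open because that tool's parameters are not described here.

```python
def failed_requests_query(lookback_hours: int = 24, bin_minutes: int = 60) -> str:
    """Build the Application Insights failed-requests query for a given window."""
    for name, value in (("lookback_hours", lookback_hours), ("bin_minutes", bin_minutes)):
        # Accept only positive integers so nothing unexpected is spliced into the KQL text.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    return "\n".join(
        [
            "requests",
            f"| where timestamp > ago({lookback_hours}h)",
            "| where success == false",
            f"| summarize FailureCount=count() by resultCode, bin(timestamp, {bin_minutes}m)",
            "| order by timestamp desc",
        ]
    )
```

Calling `failed_requests_query(6, 15)` narrows the analysis to the last six hours in 15-minute buckets.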

Pattern Recognition:
- Identify recurring error patterns or anomalies
- Correlate errors with deployment times or configuration changes
- Analyze performance trends and degradation patterns
- Look for dependency failures or external service issues
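
Correlating errors with deployment times can be done on the hourly buckets the queries above produce. The sketch below assumes the error counts have already been loaded into a dict keyed by bucket start time. The two-hour window and the threshold of 10 errors are illustrative assumptions, and `errors_after_deployments` is a hypothetical name.

```python
from datetime import datetime, timedelta

BUCKET = timedelta(hours=1)  # matches bin(TimeGenerated, 1h) in the queries


def errors_after_deployments(
    error_counts: dict[datetime, int],
    deployments: list[datetime],
    window: timedelta = timedelta(hours=2),
    threshold: int = 10,
) -> list[tuple[datetime, datetime, int]]:
    """Return (deployment, bucket_start, count) for error spikes following a deployment."""
    hits = []
    for deployed_at in sorted(deployments):
        for bucket, count in sorted(error_counts.items()):
            # The bucket must end after the deployment and start within the window.
            overlaps = bucket + BUCKET > deployed_at and bucket <= deployed_at + window
            if overlaps and count >= threshold:
                hits.append((deployed_at, bucket, count))
    return hits
```

An empty result does not rule out a deployment as the cause; it only means no bucket in the window crossed the chosen threshold.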

Step 5: Issue Classification & Root Cause Analysis

Action: Categorize identified issues and determine root causes
Process:

Issue Classification:
- Critical: Service unavailable, data loss, security breaches
- High: Performance degradation, intermittent failures, high error rates
- Medium: Warnings, suboptimal configuration, minor performance issues
- Low: Informational alerts, optimization opportunities
Root Cause Analysis:
- Configuration Issues: Incorrect settings, missing dependencies
- Resource Constraints: CPU/memory/disk limitations, throttling
- Network Issues: Connectivity problems, DNS resolution, firewall rules
- Application Issues: Code bugs, memory leaks, inefficient queries
- External Dependencies: Third-party service failures, API limits
- Security Issues: Authentication failures, certificate expiration
Impact Assessment:
- Determine business impact and affected users/systems
- Evaluate data integrity and security implications
- Assess recovery time objectives and priorities
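
The four severity levels above imply an ordering that later steps rely on. A small sketch of that ordering follows, assuming each issue is a dict with a `severity` key; the helper names are hypothetical.

```python
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]


def prioritize(issues: list[dict]) -> list[dict]:
    """Sort issues from Critical to Low; ties keep their original order."""
    rank = {name: position for position, name in enumerate(SEVERITY_ORDER)}
    return sorted(issues, key=lambda issue: rank[issue["severity"]])


def count_by_severity(issues: list[dict]) -> dict[str, int]:
    """Count issues per severity, including levels with zero issues."""
    counts = {name: 0 for name in SEVERITY_ORDER}
    for issue in issues:
        counts[issue["severity"]] += 1
    return counts
```

An unknown severity raises `KeyError`, which surfaces classification mistakes early instead of silently dropping the issue.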

Step 6: Generate Remediation Plan

Action: Create a comprehensive plan to address identified issues
Process:

Immediate Actions (Critical issues):
- Emergency fixes to restore service availability
- Temporary workarounds to mitigate impact
- Escalation procedures for complex issues
Short-term Fixes (High/Medium issues):
- Configuration adjustments and resource scaling
- Application updates and patches
- Monitoring and alerting improvements
Long-term Improvements (All issues):
- Architectural changes for better resilience
- Preventive measures and monitoring enhancements
- Documentation and process improvements
Implementation Steps:
- Prioritized action items with specific Azure CLI commands
- Testing and validation procedures
- Rollback plans for each change
- Monitoring to verify issue resolution
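
The plan structure above can be modeled as action items that must carry a rollback and a validation step, grouped into phases by severity. This is one possible sketch: the `ActionItem` type and `group_into_phases` are hypothetical, and sending Low-severity items to the long-term phase is a simplifying assumption.

```python
from dataclasses import dataclass

PHASES = ("Immediate Actions", "Short-term Fixes", "Long-term Improvements")


@dataclass(frozen=True)
class ActionItem:
    description: str
    severity: str    # "Critical", "High", "Medium", or "Low"
    command: str     # Azure CLI command to run
    rollback: str    # how to undo the change
    validation: str  # how to confirm the fix worked

    def __post_init__(self) -> None:
        # Enforce the rule that every change has a rollback plan and a validation step.
        if not self.rollback.strip() or not self.validation.strip():
            raise ValueError("each change needs a rollback plan and a validation step")


def group_into_phases(items: list[ActionItem]) -> dict[str, list[ActionItem]]:
    """Group action items into the three remediation phases by severity."""
    phase_for = {"Critical": PHASES[0], "High": PHASES[1], "Medium": PHASES[1]}
    grouped: dict[str, list[ActionItem]] = {phase: [] for phase in PHASES}
    for item in items:
        grouped[phase_for.get(item.severity, PHASES[2])].append(item)
    return grouped
```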

Step 7: User Confirmation & Report Generation

Action: Present findings and get approval for remediation actions
Process:

Display Health Assessment Summary:

🏥 Azure Resource Health Assessment

📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Location: [Region]
• Last Analyzed: [Timestamp]

🚨 Issues Identified:
• Critical: X issues requiring immediate attention
• High: Y issues affecting performance/reliability  
• Medium: Z issues for optimization
• Low: N informational items

🔍 Top Issues:
1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
3. [Issue Type]: [Description] - Impact: [High/Medium/Low]

🛠️ Remediation Plan:
• Immediate Actions: X items
• Short-term Fixes: Y items  
• Long-term Improvements: Z items
• Estimated Resolution Time: [Timeline]

❓ Proceed with detailed remediation plan? (y/n)

Generate Detailed Report:

# Azure Resource Health Report: [Resource Name]

**Generated**: [Timestamp]  
**Resource**: [Full Resource ID]  
**Overall Health**: [Status with color indicator]

## 🔍 Executive Summary
[Brief overview of health status and key findings]

## 📊 Health Metrics
- **Availability**: X% over last 24h
- **Performance**: [Average response time/throughput]
- **Error Rate**: X% over last 24h
- **Resource Utilization**: [CPU/Memory/Storage percentages]

## 🚨 Issues Identified

### Critical Issues
- **[Issue 1]**: [Description]
  - **Root Cause**: [Analysis]
  - **Impact**: [Business impact]
  - **Immediate Action**: [Required steps]

### High Priority Issues  
- **[Issue 2]**: [Description]
  - **Root Cause**: [Analysis]
  - **Impact**: [Performance/reliability impact]
  - **Recommended Fix**: [Solution steps]

## 🛠️ Remediation Plan

### Phase 1: Immediate Actions (0-2 hours)
```bash
# Critical fixes to restore service
[Azure CLI commands with explanations]
```

### Phase 2: Short-term Fixes (2-24 hours)
```bash
# Performance and reliability improvements
[Azure CLI commands with explanations]
```

### Phase 3: Long-term Improvements (1-4 weeks)
```bash
# Architectural and preventive measures
[Azure CLI commands and configuration changes]
```

## 📈 Monitoring Recommendations
- **Alerts to Configure**: [List of recommended alerts]
- **Dashboards to Create**: [Monitoring dashboard suggestions]
- **Regular Health Checks**: [Recommended frequency and scope]

## ✅ Validation Steps
- Verify issue resolution through logs
- Confirm performance improvements
- Test application functionality
- Update monitoring and alerting
- Document lessons learned

## 📝 Prevention Measures
- [Recommendations to prevent similar issues]
- [Process improvements]
- [Monitoring enhancements]


Error Handling

- Resource Not Found: Provide guidance on resource name/location specification
- Authentication Issues: Guide user through Azure authentication setup
- Insufficient Permissions: List required RBAC roles for resource access
- No Logs Available: Suggest enabling diagnostic settings and waiting for data
- Query Timeouts: Break down analysis into smaller time windows
- Service-Specific Issues: Provide generic health assessment with limitations noted
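
For the query-timeout case, breaking the analysis into smaller time windows might look like the sketch below. `split_window` is a hypothetical helper. Each resulting pair would bound one query, for example with a `TimeGenerated between (start .. end)` filter.

```python
from datetime import datetime, timedelta


def split_window(
    start: datetime, end: datetime, step: timedelta
) -> list[tuple[datetime, datetime]]:
    """Split [start, end) into consecutive windows of at most `step` each."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the final window may be shorter
        windows.append((cursor, upper))
        cursor = upper
    return windows
```

Splitting 24 hours with a 6-hour step yields four windows that can be queried one at a time and merged afterwards.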

Success Criteria

✅ Resource health status accurately assessed
✅ All significant issues identified and categorized
✅ Root cause analysis completed for major problems
✅ Actionable remediation plan with specific steps provided
✅ Monitoring and prevention recommendations included
✅ Clear prioritization of issues by business impact
✅ Implementation steps include validation and rollback procedures
