azure-resource-health-diagnose

Azure Resource Health & Issue Diagnosis

This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.

Prerequisites

  • Azure MCP server configured and authenticated
  • Target Azure resource identified (name and optionally resource group/subscription)
  • Resource must be deployed and running to generate logs/telemetry
  • Prefer Azure MCP tools (`azmcp-*`) over direct Azure CLI when available

Workflow Steps

Step 1: Get Azure Best Practices

Action: Retrieve diagnostic and troubleshooting best practices
Tools: Azure MCP best practices tool
Process:
  1. Load Best Practices:
    • Execute Azure best practices tool to get diagnostic guidelines
    • Focus on health monitoring, log analysis, and issue resolution patterns
    • Use these practices to inform diagnostic approach and remediation recommendations

Step 2: Resource Discovery & Identification

Action: Locate and identify the target Azure resource
Tools: Azure MCP tools + Azure CLI fallback
Process:
  1. Resource Lookup:
    • If only the resource name is provided: enumerate subscriptions with `azmcp-subscription-list`, then search each one
    • Use `az resource list --name <resource-name>` to find matching resources
    • If multiple matches found, prompt user to specify subscription/resource group
    • Gather detailed resource information:
      • Resource type and current status
      • Location, tags, and configuration
      • Associated services and dependencies
  2. Resource Type Detection:
    • Identify resource type to determine appropriate diagnostic approach:
      • Web Apps/Function Apps: Application logs, performance metrics, dependency tracking
      • Virtual Machines: System logs, performance counters, boot diagnostics
      • Cosmos DB: Request metrics, throttling, partition statistics
      • Storage Accounts: Access logs, performance metrics, availability
      • SQL Database: Query performance, connection logs, resource utilization
      • Application Insights: Application telemetry, exceptions, dependencies
      • Key Vault: Access logs, certificate status, secret usage
      • Service Bus: Message metrics, dead letter queues, throughput
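
The resource type detection above amounts to a lookup from resource type to diagnostic focus areas. A minimal sketch, assuming the type is read from the `type` field of the resource record; the ARM type strings are the standard provider names, while `DIAGNOSTIC_FOCUS` and `diagnostic_focus` are hypothetical names used only for illustration:

```python
# Focus areas per ARM resource type, copied from the list above.
# Keys are lowercased because ARM type names are case-insensitive.
# Web Apps and Function Apps share the Microsoft.Web/sites type.
DIAGNOSTIC_FOCUS = {
    "microsoft.web/sites": ["application logs", "performance metrics", "dependency tracking"],
    "microsoft.compute/virtualmachines": ["system logs", "performance counters", "boot diagnostics"],
    "microsoft.documentdb/databaseaccounts": ["request metrics", "throttling", "partition statistics"],
    "microsoft.storage/storageaccounts": ["access logs", "performance metrics", "availability"],
    "microsoft.sql/servers/databases": ["query performance", "connection logs", "resource utilization"],
    "microsoft.insights/components": ["application telemetry", "exceptions", "dependencies"],
    "microsoft.keyvault/vaults": ["access logs", "certificate status", "secret usage"],
    "microsoft.servicebus/namespaces": ["message metrics", "dead letter queues", "throughput"],
}


def diagnostic_focus(resource_type: str) -> list[str]:
    """Return the focus areas for a resource type, or a generic fallback."""
    # Assumption: unknown types fall back to a generic assessment.
    return DIAGNOSTIC_FOCUS.get(resource_type.lower(), ["generic health assessment"])
```

Types missing from the table get the generic fallback rather than an error, which keeps the workflow moving for services it does not cover.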

Step 3: Health Status Assessment

Action: Evaluate current resource health and availability
Tools: Azure MCP monitoring tools + Azure CLI
Process:
  1. Basic Health Check:
    • Check resource provisioning state and operational status
    • Verify service availability and responsiveness
    • Review recent deployment or configuration changes
    • Assess current resource utilization (CPU, memory, storage, etc.)
  2. Service-Specific Health Indicators:
    • Web Apps: HTTP response codes, response times, uptime
    • Databases: Connection success rate, query performance, deadlocks
    • Storage: Availability percentage, request success rate, latency
    • VMs: Boot diagnostics, guest OS metrics, network connectivity
    • Functions: Execution success rate, duration, error frequency

Step 4: Log & Telemetry Analysis

Action: Analyze logs and telemetry to identify issues and patterns
Tools: Azure MCP monitoring tools for Log Analytics queries
Process:
  1. Find Monitoring Sources:
    • Use `azmcp-monitor-workspace-list` to identify Log Analytics workspaces
    • Locate Application Insights instances associated with the resource
    • Identify relevant log tables using `azmcp-monitor-table-list`
  2. Execute Diagnostic Queries: Use `azmcp-monitor-log-query` with targeted KQL queries based on resource type:
    General Error Analysis:
    kql
    // Recent errors and exceptions
    union isfuzzy=true 
        AzureDiagnostics,
        AppServiceHTTPLogs,
        AppServiceAppLogs,
        AzureActivity
    | where TimeGenerated > ago(24h)
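    // isnotempty() stops rows from tables without a ResultType column from matching
    | where Level == "Error" or (isnotempty(ResultType) and ResultType != "Success")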
    | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
    | order by TimeGenerated desc
    Performance Analysis:
    kql
    // Performance degradation patterns
    Perf
    | where TimeGenerated > ago(7d)
    | where ObjectName == "Processor" and CounterName == "% Processor Time"
    | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
    | where avg_CounterValue > 80
    Application-Specific Queries:
    kql
    // Application Insights - Failed requests
    requests
    | where timestamp > ago(24h)
    | where success == false
    | summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
    | order by timestamp desc
    
    // Database - Connection failures
    AzureDiagnostics
    | where ResourceProvider == "MICROSOFT.SQL"
    | where Category == "SQLSecurityAuditEvents"
    | where action_name_s == "CONNECTION_FAILED"
    | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
  3. Pattern Recognition:
    • Identify recurring error patterns or anomalies
    • Correlate errors with deployment times or configuration changes
    • Analyze performance trends and degradation patterns
    • Look for dependency failures or external service issues
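
Correlating errors with deployment times can be done on the hourly buckets that the queries above return. A minimal sketch, assuming the error counts have been loaded as `(bucket_start, count)` pairs and that deployment timestamps come from elsewhere (for example the activity log); the function name, the two-hour window, and the threshold of 10 are illustrative assumptions, not values from this workflow:

```python
from datetime import datetime, timedelta


def errors_after_deployments(hourly_errors, deployments,
                             window=timedelta(hours=2), threshold=10):
    """Return (deployment_time, total_errors) for suspicious deployments.

    A deployment is flagged when the buckets starting in
    [deployment_time, deployment_time + window) sum to at least `threshold`.
    """
    flagged = []
    for deployed_at in deployments:
        total = sum(
            count
            for bucket_start, count in hourly_errors
            if deployed_at <= bucket_start < deployed_at + window
        )
        if total >= threshold:
            flagged.append((deployed_at, total))
    return flagged
```

A flagged deployment is only a lead for root cause analysis: a bucket that starts before the deployment but overlaps it is not counted, and unrelated incidents can land in the same window.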

Step 5: Issue Classification & Root Cause Analysis

Action: Categorize identified issues and determine root causes
Process:
  1. Issue Classification:
    • Critical: Service unavailable, data loss, security breaches
    • High: Performance degradation, intermittent failures, high error rates
    • Medium: Warnings, suboptimal configuration, minor performance issues
    • Low: Informational alerts, optimization opportunities
  2. Root Cause Analysis:
    • Configuration Issues: Incorrect settings, missing dependencies
    • Resource Constraints: CPU/memory/disk limitations, throttling
    • Network Issues: Connectivity problems, DNS resolution, firewall rules
    • Application Issues: Code bugs, memory leaks, inefficient queries
    • External Dependencies: Third-party service failures, API limits
    • Security Issues: Authentication failures, certificate expiration
  3. Impact Assessment:
    • Determine business impact and affected users/systems
    • Evaluate data integrity and security implications
    • Assess recovery time objectives and priorities
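
The four-level classification above can be expressed as an ordered enum so that issues sort by priority. A minimal sketch; the symptom strings mirror the list in this step, while `Severity`, `classify`, `prioritize`, and the Medium default for unrecognized symptoms are assumptions made for illustration:

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means higher priority, so sorting ascending puts Critical first.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


SEVERITY_BY_SYMPTOM = {
    "service unavailable": Severity.CRITICAL,
    "data loss": Severity.CRITICAL,
    "security breach": Severity.CRITICAL,
    "performance degradation": Severity.HIGH,
    "intermittent failures": Severity.HIGH,
    "high error rate": Severity.HIGH,
    "warning": Severity.MEDIUM,
    "suboptimal configuration": Severity.MEDIUM,
    "minor performance issue": Severity.MEDIUM,
    "informational alert": Severity.LOW,
    "optimization opportunity": Severity.LOW,
}


def classify(symptom: str) -> Severity:
    """Map a symptom label to a severity, case-insensitively."""
    # Assumption: unknown symptoms default to Medium so they still get reviewed.
    return SEVERITY_BY_SYMPTOM.get(symptom.lower(), Severity.MEDIUM)


def prioritize(symptoms):
    """Return symptoms ordered from most to least severe (stable sort)."""
    return sorted(symptoms, key=classify)
```

In practice the severity would also depend on the impact assessment, so a fixed lookup like this is best treated as a starting point.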

Step 6: Generate Remediation Plan

Action: Create a comprehensive plan to address identified issues
Process:
  1. Immediate Actions (Critical issues):
    • Emergency fixes to restore service availability
    • Temporary workarounds to mitigate impact
    • Escalation procedures for complex issues
  2. Short-term Fixes (High/Medium issues):
    • Configuration adjustments and resource scaling
    • Application updates and patches
    • Monitoring and alerting improvements
  3. Long-term Improvements (All issues):
    • Architectural changes for better resilience
    • Preventive measures and monitoring enhancements
    • Documentation and process improvements
  4. Implementation Steps:
    • Prioritized action items with specific Azure CLI commands
    • Testing and validation procedures
    • Rollback plans for each change
    • Monitoring to verify issue resolution
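
The mapping from severity to remediation phase described above can be sketched as a small grouping function. This assumes each issue arrives as a `(title, severity)` pair with severity spelled as in Step 5; `build_plan` and the dictionary keys are hypothetical names:

```python
def build_plan(issues):
    """Group issues into the three remediation phases.

    issues: iterable of (title, severity) pairs, where severity is one of
    "Critical", "High", "Medium", "Low".
    """
    plan = {"immediate": [], "short_term": [], "long_term": []}
    for title, severity in issues:
        if severity == "Critical":
            plan["immediate"].append(title)
        elif severity in ("High", "Medium"):
            plan["short_term"].append(title)
        # Long-term improvements cover every issue, whatever its severity.
        plan["long_term"].append(title)
    return plan
```

Low-severity issues appear only in the long-term phase, which matches the "All issues" scope given for long-term improvements.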

Step 7: User Confirmation & Report Generation

Action: Present findings and get approval for remediation actions
Process:
  1. Display Health Assessment Summary:
    🏥 Azure Resource Health Assessment
    
    📊 Resource Overview:
    • Resource: [Name] ([Type])
    • Status: [Healthy/Warning/Critical]
    • Location: [Region]
    • Last Analyzed: [Timestamp]
    
    🚨 Issues Identified:
    • Critical: X issues requiring immediate attention
    • High: Y issues affecting performance/reliability  
    • Medium: Z issues for optimization
    • Low: N informational items
    
    🔍 Top Issues:
    1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
    2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
    3. [Issue Type]: [Description] - Impact: [High/Medium/Low]
    
    🛠️ Remediation Plan:
    • Immediate Actions: X items
    • Short-term Fixes: Y items  
    • Long-term Improvements: Z items
    • Estimated Resolution Time: [Timeline]
    
    ❓ Proceed with detailed remediation plan? (y/n)
  2. Generate Detailed Report:
    markdown
    # Azure Resource Health Report: [Resource Name]
    
    **Generated**: [Timestamp]  
    **Resource**: [Full Resource ID]  
    **Overall Health**: [Status with color indicator]
    
    ## 🔍 Executive Summary
    [Brief overview of health status and key findings]
    
    ## 📊 Health Metrics
    - **Availability**: X% over last 24h
    - **Performance**: [Average response time/throughput]
    - **Error Rate**: X% over last 24h
    - **Resource Utilization**: [CPU/Memory/Storage percentages]
    
    ## 🚨 Issues Identified
    
    ### Critical Issues
    - **[Issue 1]**: [Description]
      - **Root Cause**: [Analysis]
      - **Impact**: [Business impact]
      - **Immediate Action**: [Required steps]
    
    ### High Priority Issues  
    - **[Issue 2]**: [Description]
      - **Root Cause**: [Analysis]
      - **Impact**: [Performance/reliability impact]
      - **Recommended Fix**: [Solution steps]
    
    ## 🛠️ Remediation Plan
    
    ### Phase 1: Immediate Actions (0-2 hours)
    ```bash
    # Critical fixes to restore service
    [Azure CLI commands with explanations]
    ```
    
    ### Phase 2: Short-term Fixes (2-24 hours)
    ```bash
    # Performance and reliability improvements
    [Azure CLI commands with explanations]
    ```
    
    ### Phase 3: Long-term Improvements (1-4 weeks)
    ```bash
    # Architectural and preventive measures
    [Azure CLI commands and configuration changes]
    ```
    
    ## 📈 Monitoring Recommendations
    - **Alerts to Configure**: [List of recommended alerts]
    - **Dashboards to Create**: [Monitoring dashboard suggestions]
    - **Regular Health Checks**: [Recommended frequency and scope]
    
    ## ✅ Validation Steps
    - Verify issue resolution through logs
    - Confirm performance improvements
    - Test application functionality
    - Update monitoring and alerting
    - Document lessons learned
    
    ## 📝 Prevention Measures
    - [Recommendations to prevent similar issues]
    - [Process improvements]
    - [Monitoring enhancements]

Error Handling

  • Resource Not Found: Provide guidance on resource name/location specification
  • Authentication Issues: Guide user through Azure authentication setup
  • Insufficient Permissions: List required RBAC roles for resource access
  • No Logs Available: Suggest enabling diagnostic settings and waiting for data
  • Query Timeouts: Break down analysis into smaller time windows
  • Service-Specific Issues: Provide generic health assessment with limitations noted
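
For the query timeout case, breaking the analysis into smaller time windows could look like the sketch below. It assumes each window is then run as a separate log query; `split_time_range` and `kql_time_filter` are hypothetical helpers, and `TimeGenerated` is the timestamp column used by the Log Analytics queries in Step 4:

```python
from datetime import datetime, timedelta


def split_time_range(start, end, window):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    cursor = start
    while cursor < end:
        # The last window is shortened so it never extends past `end`.
        window_end = min(cursor + window, end)
        yield cursor, window_end
        cursor = window_end


def kql_time_filter(window_start, window_end):
    """Render a half-open TimeGenerated filter for one window."""
    return (
        f"| where TimeGenerated >= datetime({window_start.isoformat()}) "
        f"and TimeGenerated < datetime({window_end.isoformat()})"
    )
```

Half-open windows keep a row on a boundary from being counted twice when the per-window results are combined.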

Success Criteria

  • ✅ Resource health status accurately assessed
  • ✅ All significant issues identified and categorized
  • ✅ Root cause analysis completed for major problems
  • ✅ Actionable remediation plan with specific steps provided
  • ✅ Monitoring and prevention recommendations included
  • ✅ Clear prioritization of issues by business impact
  • ✅ Implementation steps include validation and rollback procedures