azure-resource-health-diagnose
Azure Resource Health & Issue Diagnosis
This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.
Prerequisites
- Azure MCP server configured and authenticated
- Target Azure resource identified (name and optionally resource group/subscription)
- Resource must be deployed and running to generate logs/telemetry
- Prefer Azure MCP tools (azmcp-*) over direct Azure CLI when available
Workflow Steps
Step 1: Get Azure Best Practices
Action: Retrieve diagnostic and troubleshooting best practices
Tools: Azure MCP best practices tool
Process:
- Load Best Practices:
  - Execute Azure best practices tool to get diagnostic guidelines
  - Focus on health monitoring, log analysis, and issue resolution patterns
  - Use these practices to inform diagnostic approach and remediation recommendations
Step 2: Resource Discovery & Identification
Action: Locate and identify the target Azure resource
Tools: Azure MCP tools + Azure CLI fallback
Process:
- Resource Lookup:
  - If only resource name provided: Search across subscriptions using azmcp-subscription-list
  - Use az resource list --name <resource-name> to find matching resources
  - If multiple matches found, prompt user to specify subscription/resource group
  - Gather detailed resource information:
    - Resource type and current status
    - Location, tags, and configuration
    - Associated services and dependencies
- Resource Type Detection:
  - Identify resource type to determine appropriate diagnostic approach:
    - Web Apps/Function Apps: Application logs, performance metrics, dependency tracking
    - Virtual Machines: System logs, performance counters, boot diagnostics
    - Cosmos DB: Request metrics, throttling, partition statistics
    - Storage Accounts: Access logs, performance metrics, availability
    - SQL Database: Query performance, connection logs, resource utilization
    - Application Insights: Application telemetry, exceptions, dependencies
    - Key Vault: Access logs, certificate status, secret usage
    - Service Bus: Message metrics, dead letter queues, throughput
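The lookup and type-detection rules above can be sketched as a small table lookup. This is a minimal illustration, not part of the workflow itself: the helper names (`diagnostic_focus`, `pick_resource`) and the generic fallback list are assumptions, and the input is assumed to be the parsed JSON list that `az resource list --name <resource-name>` returns.

```python
from typing import Dict, List, Optional

# Diagnostic focus areas per ARM resource type, copied from the list above.
# Keys are lower-cased ARM type strings; Function Apps share the
# Microsoft.Web/sites type with Web Apps.
DIAGNOSTIC_FOCUS: Dict[str, List[str]] = {
    "microsoft.web/sites": ["application logs", "performance metrics", "dependency tracking"],
    "microsoft.compute/virtualmachines": ["system logs", "performance counters", "boot diagnostics"],
    "microsoft.documentdb/databaseaccounts": ["request metrics", "throttling", "partition statistics"],
    "microsoft.storage/storageaccounts": ["access logs", "performance metrics", "availability"],
    "microsoft.sql/servers/databases": ["query performance", "connection logs", "resource utilization"],
    "microsoft.insights/components": ["application telemetry", "exceptions", "dependencies"],
    "microsoft.keyvault/vaults": ["access logs", "certificate status", "secret usage"],
    "microsoft.servicebus/namespaces": ["message metrics", "dead letter queues", "throughput"],
}

# Hypothetical fallback for types the workflow does not list.
GENERIC_FOCUS: List[str] = ["provisioning state", "activity log", "platform metrics"]


def diagnostic_focus(resource_type: str) -> List[str]:
    """Return the focus areas for a resource type (case-insensitive)."""
    return DIAGNOSTIC_FOCUS.get(resource_type.strip().lower(), GENERIC_FOCUS)


def pick_resource(matches: List[dict]) -> Optional[dict]:
    """Return the only match, or None when the user has to disambiguate.

    None covers both "nothing found" and "multiple matches found"; the
    caller decides which prompt to show.
    """
    return matches[0] if len(matches) == 1 else None
```

When `pick_resource` returns `None` for a non-empty list, the workflow would prompt for the subscription or resource group before continuing.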
Step 3: Health Status Assessment
Action: Evaluate current resource health and availability
Tools: Azure MCP monitoring tools + Azure CLI
Process:
- Basic Health Check:
  - Check resource provisioning state and operational status
  - Verify service availability and responsiveness
  - Review recent deployment or configuration changes
  - Assess current resource utilization (CPU, memory, storage, etc.)
- Service-Specific Health Indicators:
  - Web Apps: HTTP response codes, response times, uptime
  - Databases: Connection success rate, query performance, deadlocks
  - Storage: Availability percentage, request success rate, latency
  - VMs: Boot diagnostics, guest OS metrics, network connectivity
  - Functions: Execution success rate, duration, error frequency
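One way to turn the indicators above into a single status is to compare each value against a warning and a critical threshold and report the worst result. The sketch below assumes this approach; the `Indicator` type and every threshold value shown are illustrative assumptions, not Azure defaults.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Indicator:
    """One health measurement with hypothetical thresholds."""
    name: str
    value: float
    warn_at: float
    critical_at: float
    higher_is_worse: bool = True  # False for metrics such as availability


def status(ind: Indicator) -> str:
    """Classify a single indicator as Healthy, Warning, or Critical."""
    if ind.higher_is_worse:
        if ind.value >= ind.critical_at:
            return "Critical"
        if ind.value >= ind.warn_at:
            return "Warning"
    else:
        if ind.value <= ind.critical_at:
            return "Critical"
        if ind.value <= ind.warn_at:
            return "Warning"
    return "Healthy"


_RANK = {"Healthy": 0, "Warning": 1, "Critical": 2}


def overall_status(indicators: Iterable[Indicator]) -> str:
    """The resource is as unhealthy as its worst indicator."""
    return max((status(i) for i in indicators), key=_RANK.__getitem__, default="Healthy")
```

For example, a CPU reading of 85% with thresholds of 80 and 95 yields Warning, and that result dominates a healthy availability reading.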
Step 4: Log & Telemetry Analysis
Action: Analyze logs and telemetry to identify issues and patterns
Tools: Azure MCP monitoring tools for Log Analytics queries
Process:
- Find Monitoring Sources:
  - Use azmcp-monitor-workspace-list to identify Log Analytics workspaces
  - Locate Application Insights instances associated with the resource
  - Identify relevant log tables using azmcp-monitor-table-list
- Execute Diagnostic Queries: Use azmcp-monitor-log-query with targeted KQL queries based on resource type:

  General Error Analysis:
  ```kql
  // Recent errors and exceptions
  union isfuzzy=true AzureDiagnostics, AppServiceHTTPLogs, AppServiceAppLogs, AzureActivity
  | where TimeGenerated > ago(24h)
  // isnotempty() keeps rows that have no ResultType from being counted as failures
  | where Level == "Error" or (isnotempty(ResultType) and ResultType != "Success")
  | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
  | order by TimeGenerated desc
  ```

  Performance Analysis:
  ```kql
  // Performance degradation patterns
  Perf
  | where TimeGenerated > ago(7d)
  | where ObjectName == "Processor" and CounterName == "% Processor Time"
  | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
  | where avg_CounterValue > 80
  ```

  Application-Specific Queries:
  ```kql
  // Application Insights - Failed requests
  requests
  | where timestamp > ago(24h)
  | where success == false
  | summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
  | order by timestamp desc

  // Database - Connection failures
  AzureDiagnostics
  | where ResourceProvider == "MICROSOFT.SQL"
  | where Category == "SQLSecurityAuditEvents"
  | where action_name_s == "CONNECTION_FAILED"
  | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
  ```
- Pattern Recognition:
  - Identify recurring error patterns or anomalies
  - Correlate errors with deployment times or configuration changes
  - Analyze performance trends and degradation patterns
  - Look for dependency failures or external service issues
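The correlation step can be illustrated with the hourly buckets that the queries above produce. The sketch below is one possible heuristic, not a prescribed method: it flags an hour as a spike when its error count exceeds a multiple of the median, then links spikes to deployments that happened shortly before. The function names, the factor of 3, and the two-hour window are all assumptions.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List


def find_spikes(hourly_errors: Dict[datetime, int], factor: float = 3.0) -> List[datetime]:
    """Return hour buckets whose count exceeds factor x the median count.

    The median is floored at 1 so that a mostly-zero series does not
    flag every hour that has a single error.
    """
    if not hourly_errors:
        return []
    threshold = factor * max(median(hourly_errors.values()), 1)
    return sorted(hour for hour, count in hourly_errors.items() if count > threshold)


def spikes_after_deployments(
    spikes: List[datetime],
    deployments: List[datetime],
    window: timedelta = timedelta(hours=2),
) -> Dict[datetime, List[datetime]]:
    """Map each deployment to the spikes within `window` of its hour bucket."""
    linked: Dict[datetime, List[datetime]] = {}
    for deployed_at in deployments:
        # Align the deployment time with the start of its hourly bucket.
        bucket = deployed_at.replace(minute=0, second=0, microsecond=0)
        hits = [s for s in spikes if timedelta(0) <= s - bucket <= window]
        if hits:
            linked[deployed_at] = hits
    return linked
```

A deployment that maps to one or more spikes is a candidate root cause to examine in Step 5; an empty result suggests looking at dependencies or external services instead.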
Step 5: Issue Classification & Root Cause Analysis
Action: Categorize identified issues and determine root causes
Process:
- Issue Classification:
  - Critical: Service unavailable, data loss, security breaches
  - High: Performance degradation, intermittent failures, high error rates
  - Medium: Warnings, suboptimal configuration, minor performance issues
  - Low: Informational alerts, optimization opportunities
- Root Cause Analysis:
  - Configuration Issues: Incorrect settings, missing dependencies
  - Resource Constraints: CPU/memory/disk limitations, throttling
  - Network Issues: Connectivity problems, DNS resolution, firewall rules
  - Application Issues: Code bugs, memory leaks, inefficient queries
  - External Dependencies: Third-party service failures, API limits
  - Security Issues: Authentication failures, certificate expiration
- Impact Assessment:
  - Determine business impact and affected users/systems
  - Evaluate data integrity and security implications
  - Assess recovery time objectives and priorities
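The four severity levels above can be expressed as a lookup from observed symptoms to a level, with an issue taking the highest level among its symptoms. This is a minimal sketch: the symptom keys are hypothetical labels derived from the list, and treating unknown symptoms as Medium (so they still get reviewed) is an assumption.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical symptom labels, grouped as in the classification above.
SYMPTOM_SEVERITY = {
    "service_unavailable": Severity.CRITICAL,
    "data_loss": Severity.CRITICAL,
    "security_breach": Severity.CRITICAL,
    "performance_degradation": Severity.HIGH,
    "intermittent_failures": Severity.HIGH,
    "high_error_rate": Severity.HIGH,
    "warning": Severity.MEDIUM,
    "suboptimal_configuration": Severity.MEDIUM,
    "minor_performance_issue": Severity.MEDIUM,
    "informational_alert": Severity.LOW,
    "optimization_opportunity": Severity.LOW,
}


def classify(symptoms: Iterable[str]) -> Severity:
    """Return the highest severity among the symptoms.

    Unknown symptoms count as MEDIUM (an assumption); an issue with no
    symptoms at all is LOW.
    """
    levels = (SYMPTOM_SEVERITY.get(s, Severity.MEDIUM) for s in symptoms)
    return max(levels, default=Severity.LOW)
```

Because `Severity` is an `IntEnum`, a list of classified issues can be sorted in descending order to produce the prioritized view that the remediation plan needs.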
Step 6: Generate Remediation Plan
Action: Create a comprehensive plan to address identified issues
Process:
- Immediate Actions (Critical issues):
  - Emergency fixes to restore service availability
  - Temporary workarounds to mitigate impact
  - Escalation procedures for complex issues
- Short-term Fixes (High/Medium issues):
  - Configuration adjustments and resource scaling
  - Application updates and patches
  - Monitoring and alerting improvements
- Long-term Improvements (All issues):
  - Architectural changes for better resilience
  - Preventive measures and monitoring enhancements
  - Documentation and process improvements
- Implementation Steps:
  - Prioritized action items with specific Azure CLI commands
  - Testing and validation procedures
  - Rollback plans for each change
  - Monitoring to verify issue resolution
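The phase assignment above follows directly from severity: Critical issues go to Immediate Actions, High and Medium issues to Short-term Fixes, and every issue is considered for Long-term Improvements. A minimal sketch of that grouping, with a hypothetical `build_plan` helper that takes `(title, severity)` pairs:

```python
from typing import Dict, List, Tuple

PHASES = ("Immediate Actions", "Short-term Fixes", "Long-term Improvements")
KNOWN_SEVERITIES = {"Critical", "High", "Medium", "Low"}


def build_plan(issues: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group issue titles into the three remediation phases.

    Input order is preserved, so pass issues already sorted by priority.
    """
    plan: Dict[str, List[str]] = {phase: [] for phase in PHASES}
    for title, severity in issues:
        if severity not in KNOWN_SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        if severity == "Critical":
            plan["Immediate Actions"].append(title)
        elif severity in ("High", "Medium"):
            plan["Short-term Fixes"].append(title)
        # Long-term improvements cover all issues, including Low ones.
        plan["Long-term Improvements"].append(title)
    return plan
```

Low-severity items therefore appear only in the long-term phase, which matches the grouping described above.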
Step 7: User Confirmation & Report Generation
Action: Present findings and get approval for remediation actions
Process:
- Display Health Assessment Summary:

```
🏥 Azure Resource Health Assessment

📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Location: [Region]
• Last Analyzed: [Timestamp]

🚨 Issues Identified:
• Critical: X issues requiring immediate attention
• High: Y issues affecting performance/reliability
• Medium: Z issues for optimization
• Low: N informational items

🔍 Top Issues:
1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
3. [Issue Type]: [Description] - Impact: [High/Medium/Low]

🛠️ Remediation Plan:
• Immediate Actions: X items
• Short-term Fixes: Y items
• Long-term Improvements: Z items
• Estimated Resolution Time: [Timeline]

❓ Proceed with detailed remediation plan? (y/n)
```

- Generate Detailed Report:

````markdown
# Azure Resource Health Report: [Resource Name]

**Generated**: [Timestamp]
**Resource**: [Full Resource ID]
**Overall Health**: [Status with color indicator]

## 🔍 Executive Summary
[Brief overview of health status and key findings]

## 📊 Health Metrics
- **Availability**: X% over last 24h
- **Performance**: [Average response time/throughput]
- **Error Rate**: X% over last 24h
- **Resource Utilization**: [CPU/Memory/Storage percentages]

## 🚨 Issues Identified

### Critical Issues
- **[Issue 1]**: [Description]
  - **Root Cause**: [Analysis]
  - **Impact**: [Business impact]
  - **Immediate Action**: [Required steps]

### High Priority Issues
- **[Issue 2]**: [Description]
  - **Root Cause**: [Analysis]
  - **Impact**: [Performance/reliability impact]
  - **Recommended Fix**: [Solution steps]

## 🛠️ Remediation Plan

### Phase 1: Immediate Actions (0-2 hours)
```bash
# Critical fixes to restore service
[Azure CLI commands with explanations]
```

### Phase 2: Short-term Fixes (2-24 hours)
```bash
# Performance and reliability improvements
[Azure CLI commands with explanations]
```

### Phase 3: Long-term Improvements (1-4 weeks)
```bash
# Architectural and preventive measures
[Azure CLI commands and configuration changes]
```

## 📈 Monitoring Recommendations
- Alerts to Configure: [List of recommended alerts]
- Dashboards to Create: [Monitoring dashboard suggestions]
- Regular Health Checks: [Recommended frequency and scope]

## ✅ Validation Steps
- Verify issue resolution through logs
- Confirm performance improvements
- Test application functionality
- Update monitoring and alerting
- Document lessons learned

## 📝 Prevention Measures
- [Recommendations to prevent similar issues]
- [Process improvements]
- [Monitoring enhancements]
````
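The counts in the summary can be filled in from the classified issues. The sketch below renders only the overview and issue-count sections and is illustrative: `render_summary` and `status_from_counts` are hypothetical helpers, and the rule that any High or Medium issue makes the overall status Warning is an assumption.

```python
from collections import Counter
from typing import Iterable

# Severity labels and their descriptions, as used in the summary template.
SEVERITY_LINES = [
    ("Critical", "issues requiring immediate attention"),
    ("High", "issues affecting performance/reliability"),
    ("Medium", "issues for optimization"),
    ("Low", "informational items"),
]


def status_from_counts(counts: Counter) -> str:
    """Derive the overall status from issue counts (assumed rule)."""
    if counts["Critical"] > 0:
        return "Critical"
    if counts["High"] > 0 or counts["Medium"] > 0:
        return "Warning"
    return "Healthy"


def render_summary(name: str, resource_type: str, region: str,
                   severities: Iterable[str]) -> str:
    """Render the overview and issue-count sections of the summary."""
    counts = Counter(severities)  # missing severities count as 0
    lines = [
        "🏥 Azure Resource Health Assessment",
        "",
        "📊 Resource Overview:",
        f"• Resource: {name} ({resource_type})",
        f"• Status: {status_from_counts(counts)}",
        f"• Location: {region}",
        "",
        "🚨 Issues Identified:",
    ]
    lines += [f"• {sev}: {counts[sev]} {text}" for sev, text in SEVERITY_LINES]
    return "\n".join(lines)
```

The remaining sections (top issues, remediation item counts, and the confirmation prompt) would be appended in the same way before asking the user whether to proceed.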
Error Handling
- Resource Not Found: Provide guidance on resource name/location specification
- Authentication Issues: Guide user through Azure authentication setup
- Insufficient Permissions: List required RBAC roles for resource access
- No Logs Available: Suggest enabling diagnostic settings and waiting for data
- Query Timeouts: Break down analysis into smaller time windows
- Service-Specific Issues: Provide generic health assessment with limitations noted
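For the query-timeout case, breaking the analysis into smaller time windows can be sketched as follows. The helper names are hypothetical; the generated filter uses a half-open range on `TimeGenerated` so that adjacent windows do not count the same record twice, and naive datetimes are assumed to be UTC.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_window(start: datetime, end: datetime,
                 step: timedelta) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into consecutive windows of at most `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last window may be shorter
        windows.append((cursor, upper))
        cursor = upper
    return windows


def kql_time_filter(start: datetime, end: datetime) -> str:
    """Build a half-open KQL time filter for one window."""
    return (
        f"| where TimeGenerated >= datetime({start.isoformat()}) "
        f"and TimeGenerated < datetime({end.isoformat()})"
    )
```

Each window's filter would replace the `ago(...)` condition in a diagnostic query, and the per-window results are then combined before pattern analysis.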
Success Criteria
- ✅ Resource health status accurately assessed
- ✅ All significant issues identified and categorized
- ✅ Root cause analysis completed for major problems
- ✅ Actionable remediation plan with specific steps provided
- ✅ Monitoring and prevention recommendations included
- ✅ Clear prioritization of issues by business impact
- ✅ Implementation steps include validation and rollback procedures