azure-reliability

Azure Reliability Assessment & Configuration

Quick Reference

| Property | Details |
|---|---|
| Best for | Reliability posture assessment, zone redundancy enablement, multi-region failover setup |
| Primary capabilities | Reliability assessment table, zone redundancy configuration, multi-region IaC generation |
| Supported services | Azure Functions (App Service and Container Apps planned for a future version) |
| MCP tools | Azure Resource Graph queries, Azure CLI commands |

When to Use This Skill

Activate this skill when the user makes a request such as:
  • "Assess my Functions app's reliability"
  • "Check the reliability of my resource group" (Functions resources only)
  • "Is my function app zone redundant?"
  • "Make my function app zone redundant"
  • "Set up multi-region failover for my Functions app"
  • "Check my reliability posture"
  • "Find single points of failure" (in Functions workloads)
  • "Enable high availability for my Functions app"
  • "Check disaster recovery readiness"
  • "Improve my Functions app's resilience"
Scope note: This skill currently covers Azure Functions only. If the user asks about Azure App Service or Azure Container Apps reliability, acknowledge that support is planned but not yet available, and only proceed with the parts that apply to Functions resources in scope.

Prerequisites

  • Authentication: the user is logged in to Azure via `az login`
  • Permissions: Reader access on the target subscription/resource group (for assessment)
  • Permissions: Contributor access (for configuration changes)
  • Azure Resource Graph extension: `az extension add --name resource-graph`

MCP Tools

| Tool | Purpose |
|---|---|
| `mcp_azure_mcp_extension_cli_generate` | Generate `az` CLI commands for resource queries and configuration |
| `mcp_azure_mcp_subscription_list` | List available subscriptions |
| `mcp_azure_mcp_group_list` | List resource groups |

Primary query method: Azure Resource Graph via `az graph query` (requires `az extension add --name resource-graph`).

Assessment Workflow

Phase 1: Discover Resources

  1. Identify scope — Ask user for resource group, subscription, or app name
  2. Query Azure Resource Graph to discover all resources in scope
  3. Classify resources by service type (Functions, Storage, etc.). If non-Functions compute (App Service sites that aren't Function Apps, Container Apps) is found, note it but do not deep-dive — those services are planned for a future version of this skill.
Important: Always scope queries to the user's specified resource group or subscription. Add these filters to every Resource Graph query:
  • Resource group: `| where resourceGroup =~ '<rg-name>'`
  • Subscription: use the `--subscriptions <sub-id>` flag on `az graph query`
  • App name: `| where name =~ '<app-name>'`
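As a rough illustration of the scoping rule above, the sketch below assembles a Resource Graph query and the matching `az graph query` argument list. The helper names, the projected columns, and the decision to reject quotes in names are assumptions made for this example, not part of the skill.

```python
from typing import List, Optional


def _checked(value: str) -> str:
    """Reject values that would break out of a single-quoted KQL literal."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in scope value: {value!r}")
    return value


def build_scoped_query(resource_group: Optional[str] = None,
                       app_name: Optional[str] = None) -> str:
    """Build a Resource Graph query with the scope filters from the text."""
    parts = ["Resources"]
    if resource_group:
        parts.append(f"| where resourceGroup =~ '{_checked(resource_group)}'")
    if app_name:
        parts.append(f"| where name =~ '{_checked(app_name)}'")
    # Projected columns are an assumption chosen for the later classification step.
    parts.append("| project name, type, kind, location, resourceGroup")
    return " ".join(parts)


def build_graph_command(resource_group: Optional[str] = None,
                        subscription_id: Optional[str] = None,
                        app_name: Optional[str] = None) -> List[str]:
    """Return the argv for `az graph query`, refusing unscoped queries."""
    if not resource_group and not subscription_id:
        raise ValueError("scope the query to a resource group or subscription")
    argv = ["az", "graph", "query", "-q",
            build_scoped_query(resource_group, app_name)]
    if subscription_id:
        argv += ["--subscriptions", subscription_id]
    return argv
```

Returning an argument list rather than a shell string keeps quoting concerns out of the caller, which can hand it straight to `subprocess.run`.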

Phase 2: Assess Reliability

Two-step assessment: platform-level discovery first, then per-service deep dive.
Step 1 — Platform discovery (find what's there). Use these to enumerate resources in scope and detect cross-cutting reliability gaps:
| Platform check | Reference |
|---|---|
| Zone redundancy: discovery | references/zone-redundancy-checks.md |
| Storage redundancy (cross-service) | references/storage-redundancy-checks.md |
| Multi-region & global load balancers | references/multi-region-checks.md |
| Front Door / Traffic Manager / App Insights probes | references/health-probe-checks.md |
Step 2 — Per-service deep dive. For each compute resource discovered in Step 1, load the matching service reference. The service reference is the single source of truth for that service's plan/SKU rules, assessment queries, CLI commands, IaC patches (Bicep + Terraform + AVM), and reporting hints.
This skill version ships only the Azure Functions per-service reference. Other compute services are listed below explicitly so the dispatch logic is unambiguous: if a resource matches an unsupported row, do not attempt to load a reference, fabricate CLI commands, or generate IaC patches for it.
| Service detected | Reference |
|---|---|
| Azure Functions (`microsoft.web/serverfarms` with `kind contains 'functionapp'`) | references/services/functions/reliability.md |
| Azure App Service (non-Functions sites: `microsoft.web/sites` without `kind contains 'functionapp'`, `microsoft.web/serverfarms` without `kind contains 'functionapp'`) | ⚪ Not yet shipped; planned for a future version |
| Azure Container Apps (`microsoft.app/containerapps`, `microsoft.app/managedenvironments`) | ⚪ Not yet shipped; planned for a future version |

Handling unsupported services: If a resource matches an unsupported row above, surface it in the discovery summary, mark it as `⚪ not assessed (planned)` in the Phase 3 table, and skip the per-service remediation steps for it. Do not attempt to fabricate CLI commands or IaC patches for those services.
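The dispatch table above could be expressed roughly as follows. This is a sketch only: the function name and return shape are invented for illustration, and treating `microsoft.web/sites` resources whose kind contains `functionapp` as Azure Functions is an assumption that goes slightly beyond the table, which lists only `microsoft.web/serverfarms` in the Functions row.

```python
from typing import Optional

FUNCTIONS_REFERENCE = "references/services/functions/reliability.md"

WEB_TYPES = ("microsoft.web/serverfarms", "microsoft.web/sites")
CONTAINER_APP_TYPES = ("microsoft.app/containerapps",
                       "microsoft.app/managedenvironments")


def classify_resource(resource_type: str, kind: str = "") -> Optional[dict]:
    """Map a resource to a dispatch row; None means 'not a compute row'."""
    rtype = resource_type.lower()
    is_functions = "functionapp" in (kind or "").lower()
    if rtype in WEB_TYPES:
        if is_functions:
            # Assumption: Function App sites are handled like Functions plans.
            return {"service": "Azure Functions", "supported": True,
                    "reference": FUNCTIONS_REFERENCE}
        return {"service": "Azure App Service", "supported": False,
                "reference": None}
    if rtype in CONTAINER_APP_TYPES:
        return {"service": "Azure Container Apps", "supported": False,
                "reference": None}
    # Storage, Front Door, and similar resources are covered by the
    # platform checks in Step 1, not by a per-service reference.
    return None
```

Unsupported rows deliberately carry no reference, so a caller cannot accidentally load one for App Service or Container Apps.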

Phase 3: Generate Reliability Checklist

Present findings as a feature-pivoted table: one row per reliability feature (Zone redundancy on compute, Zone-redundant storage, Health probes, Multi-region failover), with a single status indicator and the specific resources that are relevant to that feature. This avoids the noise of one-row-per-resource with mostly `n/a` cells. Do not assign numeric scores or grades.
🔍 Reliability Assessment — {scope}
─────────────────────────────────────────────────────────────────────────────────────────────
Reliability Feature              Status      Resources
─────────────────────────────────────────────────────────────────────────────────────────────
Zone redundancy — compute        🔴 OFF      • plan-ii5trxva2ark4 (FC1)

Zone-redundant storage           🔴 GRS      • stii5trxva2ark4 (defaulted; no SKU set in IaC)

Health probes                    🔴 OFF      • func-api-ii5trxva2ark4 — needs code change (FC1)

Multi-region failover            🔴 OFF      • Single region (eastus) only — Front Door not configured
─────────────────────────────────────────────────────────────────────────────────────────────

Want me to fix the 🔴 items? I'll do the quick wins first (Function App
plan zone redundancy + health checks on supported plans), then ask before
storage migration and multi-region setup. (yes/no)
Rules for the table:
  • Four feature rows, in this order: Zone redundancy — compute · Zone-redundant storage · Health probes · Multi-region failover. Omit a row entirely only if no resource in scope could ever apply to it.
  • Status column is one symbol + one short word, no other characters:
    • `🟢 ON`: the feature is fully enabled across all relevant resources in scope
    • `🟡 PARTIAL`: some resources have it, some don't (or partial config like liveness-only)
    • `🔴 OFF`: the feature is missing on all relevant resources
    • For storage, replace `OFF` with the current SKU when relevant (`🔴 LRS`, `🔴 GRS`, `🟢 ZRS`, `🟢 GZRS`). When no SKU is set in IaC, label as `🔴 GRS` (ARM/AVM default) and note that in the resource line.
  • Resources column lists only what's relevant to that feature, one bullet per resource:
    • For "needs fixing" resources, include a short inline reason (`(FC1)`, `(defaulted; no SKU set)`, `liveness only`, `needs code change (FC1)`).
    • For resources that are already ON for that feature, mention them on the same row with `— already ON` so the user sees credit for what's right.
  • Do not include `n/a` or empty cells. If a feature doesn't apply to any resource in scope, drop the row.
  • Do not include numeric scores, grades, or point totals.
  • End the assessment with a single yes/no question that kicks off the staged remediation flow. Do not enumerate the per-resource fix list here — the user will see it after they say yes (Configuration Workflow Step 1).
UX Note: If the assessment finds the app already has all core reliability features (zone redundancy, ZRS/GZRS storage, health probes), skip the fix-it question and jump straight to Configuration Workflow Step 3 (Multi-region follow-up). Do NOT start any multi-region work without explicit consent.
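The status rules above can be read as a small roll-up function. The sketch below is one possible interpretation; the function names are hypothetical, and only the four storage SKUs named in the rules are handled (any other SKU is shown red here, which is an assumption).

```python
from typing import List, Optional

ZONE_REDUNDANT_SKUS = {"ZRS", "GZRS"}


def feature_status(enabled: List[bool]) -> Optional[str]:
    """Roll per-resource flags up into one status cell.

    Returns None when no resource is relevant, meaning the row is dropped.
    """
    if not enabled:
        return None
    if all(enabled):
        return "🟢 ON"
    if any(enabled):
        return "🟡 PARTIAL"
    return "🔴 OFF"


def storage_status(sku_name: Optional[str]) -> str:
    """Label a storage row with its SKU; no SKU set is treated as GRS."""
    full_name = sku_name or "Standard_GRS"
    short = full_name.split("_", 1)[-1].upper()  # "Standard_ZRS" -> "ZRS"
    symbol = "🟢" if short in ZONE_REDUNDANT_SKUS else "🔴"
    return f"{symbol} {short}"
```

For example, `feature_status([True, False])` yields the partial label, and `storage_status(None)` yields the defaulted GRS label described in the rules.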

Configuration Workflow

When the user wants to fix findings from the assessment:
⛔ ALWAYS confirm with user before executing changes. Show what will change, any cost implications, and any destructive actions (e.g., environment recreation).

Step 1: Present Fix Plan + Choose Path

After assessment, if user says "fix it" / "improve my reliability" / "enable zone redundancy":
  1. List each fixable finding with the specific action
  2. Flag any cost implications or breaking changes
  3. Ask user which path they want:
I'll start with the quick wins (no downtime, fast):

1. ✏️  Enable zone redundancy on plan-ii5trxva2ark4 (Flex Consumption — no cost change)
2. ✏️  Set health check path to /api/health on func-api-ii5trxva2ark4

Then, separately, I'll ask if you want to upgrade storage:

3. 🕒  Upgrade stii5trxva2ark4 from LRS → ZRS (small cost increase, migration takes hours)
   — Required for full zone redundancy, but I'll confirm with you before starting.

How would you like to apply these changes?

  A) Fix now — Run az CLI commands against your live resources (immediate, one-time)
  B) Patch my IaC — Update your Bicep/Terraform files so changes persist across deploys

(If you use azd or Terraform, option B is recommended so `azd up` won't overwrite changes.)

Path A: Fix Now (CLI)

Run fixes against live resources using `az` CLI commands. Quick wins first, then ask before the slow storage migration.
The exact CLI commands per service live in the per-service references — pick the one(s) matching the resources discovered in Phase 2:
| Fix | Reference |
|---|---|
| Enable zone redundancy / configure health probes (Functions) | references/services/functions/reliability.md |
| Upgrade storage replication (cross-service) | references/configure-storage.md |
| Set up multi-region (cross-service) | references/configure-multi-region.md |
| Platform overview / verification | references/configure-zone-redundancy.md, references/configure-health-probes.md |
Execution order — always quick wins first:
  1. Zone redundancy on compute (fast, in-place property update on the Function App's plan).
  2. Health probes (Premium / Dedicated only — in-place; for FC1 / Consumption, follow the consent gate in configure-health-probes.md).
  3. Verify the compute changes succeeded before doing anything else.
  4. ⛔ STOP — Ask about storage upgrade. Compute is now zone-redundant, but storage may still be LRS or GRS. Ask the user explicitly:
    ✅ Compute is now zone-redundant.
    
    To be **fully zone-redundant**, your storage account also needs to be upgraded:
      • stii5trxva2ark4: currently `Standard_LRS` → needs `Standard_ZRS`
    
    ⚠️  This is a live storage redundancy conversion:
       • Takes hours to days depending on data volume
       • Small ongoing cost increase (~$0.01/GB/month more)
       • Only supported for Standard general-purpose v2 accounts
    
    Do you want me to start the storage migration now? (yes / no / later)
    • yes → run `az storage account update --sku Standard_ZRS` (or `az storage account migration start` if needed); poll `az storage account show --query sku.name` until it reports `Standard_ZRS`.
    • no / later → leave storage as-is; note in the re-assessment that ZR storage remains a gap.
  5. Multi-region — do NOT auto-run. Handled in Step 3 below as an explicit follow-up after re-assessment.
⚠️ Warning: If the user later runs `azd up` or `terraform apply`, CLI-only changes may be overwritten by the IaC definitions. Recommend also patching IaC after CLI fixes.
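The "poll until it reports `Standard_ZRS`" step in the storage bullet above might look like the sketch below. The function name, the default interval, and the poll limit are assumptions; `fetch_sku` stands in for whatever wraps `az storage account show --name <account> --resource-group <rg> --query sku.name --output tsv`.

```python
import time
from typing import Callable


def wait_for_sku(fetch_sku: Callable[[], str],
                 target: str = "Standard_ZRS",
                 interval_s: float = 60.0,
                 max_polls: int = 10,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the current SKU until it matches `target` or polls run out.

    Returns True once the target SKU is reported, False on timeout so the
    caller can surface a "still running, check back later" checkpoint.
    """
    for attempt in range(max_polls):
        if fetch_sku().strip() == target:
            return True
        if attempt < max_polls - 1:
            sleep(interval_s)  # no sleep after the final attempt
    return False
```

Injecting `fetch_sku` and `sleep` keeps the loop testable without touching a live subscription, and a bounded poll count avoids blocking the conversation on a migration that can run for a long time.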

Path B: Patch IaC

Update the user's Bicep or Terraform files so reliability settings are persistent.
Step 1: Detect IaC type
  1. Look for an `infra/` folder in the project root
  2. If not found, check the project root for `*.bicep` or `*.tf` files
  3. If still not found, ask the user: "Where are your IaC files located?"
  4. Check for `*.bicep` files → use Bicep patching
  5. Check for `*.tf` files → use Terraform patching
  6. If both exist, ask the user which to patch
  7. If no IaC exists, fall back to Path A (CLI) and inform the user
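The detection steps above can be sketched as a small lookup. The function name and the string return values are hypothetical, and the search is limited to the top level of `infra/` and the project root, which is a simplifying assumption.

```python
from pathlib import Path


def detect_iac(project_root: str) -> str:
    """Return "bicep", "terraform", "both", or "none" for a project.

    "both" means the user should be asked which to patch; "none" means
    ask where the IaC lives or fall back to Path A (CLI).
    """
    root = Path(project_root)
    # Prefer infra/ when it exists, then fall back to the project root.
    for directory in (root / "infra", root):
        if not directory.is_dir():
            continue
        has_bicep = any(directory.glob("*.bicep"))
        has_terraform = any(directory.glob("*.tf"))
        if has_bicep and has_terraform:
            return "both"
        if has_bicep:
            return "bicep"
        if has_terraform:
            return "terraform"
    return "none"
```

A caller would map `"both"` to the "ask the user which to patch" step rather than picking one silently.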
Step 2: Classify each fix by risk level
| Fix | Risk level | What happens |
|---|---|---|
| Zone redundancy (Function App plan) | 🟢 Safe patch | In-place property update on next deploy |
| Storage LRS → ZRS | 🟡 Pre-migration required | Live storage migration must complete before the IaC SKU change can deploy. Never bundle with safe patches; use the two-deploy flow in Steps 3–5. |
| Health check path (Premium / Dedicated) | 🟢 Safe patch | In-place update, but causes app restart |
| Health check path (FC1 / Consumption) | ⚪ Code-only, ask first | `healthCheckPath` is unsupported. Adding a health endpoint requires adding an HTTP-triggered `/api/health` function to app code. Always ask the user for explicit consent before touching source code. Do not patch IaC. |
Step 3: Apply patches in two deploys (quick wins first)
The IaC patching framework (detection, AVM-module guidance, deploy-order rule, storage SKU patch) lives in:
| IaC type | Framework reference |
|---|---|
| Bicep | references/iac-patching-bicep.md |
| Terraform | references/iac-patching-terraform.md |
The actual per-service compute patches (Function App plan ZR, etc.) live in the per-service references — load the matching service file from Phase 2 for the exact Bicep / Terraform / AVM snippets. Only Azure Functions has a per-service reference in this skill version; non-Functions compute (App Service / Container Apps) is out of scope.
Deploy 1 — Quick wins only. Patch the 🟢 Safe items (zone redundancy on the Function App plan, health probes on Premium / Dedicated). Do NOT include the storage SKU patch in this deploy.
After patching, the skill runs the deploy itself (do not stop and tell the user to run it). Detect the deployment tool and confirm once before executing:
📦 Patches applied to your IaC. Ready to deploy:
   Tool detected: azd (found azure.yaml)
   Command:       azd up

Proceed with deployment? (yes / no)
On yes, run the appropriate command, stream output back to the user, and continue to the next step on success:
  • AZD project (has `azure.yaml`): `azd up`
  • Bicep-only: `az deployment group create --resource-group <rg> --template-file infra/main.bicep --parameters @infra/main.parameters.json`
  • Terraform: `terraform plan -out tfplan` → (show plan summary) → `terraform apply tfplan`
On no, stop and report the patched files; do not proceed to Step 4 / Re-Assess.
If deployment fails, surface the error and stop — do not continue to the storage step.
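One way to picture the tool detection and command choice described above is the sketch below. The function name and signature are hypothetical; the commands mirror the three cases in the list, with the Terraform plan flag written in its `-out=tfplan` form.

```python
from pathlib import Path
from typing import List


def choose_deploy_commands(project_root: str, iac_type: str,
                           resource_group: str) -> List[List[str]]:
    """Pick the deploy command sequence to confirm with the user."""
    root = Path(project_root)
    if (root / "azure.yaml").is_file():
        # An azd project takes precedence over the raw IaC type.
        return [["azd", "up"]]
    if iac_type == "bicep":
        return [["az", "deployment", "group", "create",
                 "--resource-group", resource_group,
                 "--template-file", "infra/main.bicep",
                 "--parameters", "@infra/main.parameters.json"]]
    if iac_type == "terraform":
        # Two steps: show the plan summary between them before applying.
        return [["terraform", "plan", "-out=tfplan"],
                ["terraform", "apply", "tfplan"]]
    raise ValueError(f"no deploy tool known for IaC type: {iac_type!r}")
```

The returned list is only a proposal: per the flow above, nothing should run until the user has answered yes, and a failed command should stop the sequence.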
⛔ STOP — Ask about storage upgrade before Deploy 2. After Deploy 1 succeeds, ask the user explicitly:
✅ Quick-win patches deployed. Compute is now zone-redundant.

To be **fully zone-redundant**, your storage account also needs to be upgraded:
  • stii5trxva2ark4: currently `Standard_LRS` → needs `Standard_ZRS`

⚠️  This is a two-part change:
   1. Live storage migration (`az storage account migration start`) — takes hours to days
   2. A second deploy to update your IaC's storage SKU to match

Do you want me to start the storage migration now? (yes / no / later)
  • yes → the skill runs the migration command itself, polls until complete, then patches the storage SKU in IaC and runs Deploy 2 (now a no-op confirmation). The user does not need to run anything manually.
  • no / later → leave the storage SKU patch unapplied. Note in the re-assessment that ZR storage remains a gap; suggest revisiting later.
Step 4: Storage migration (only if user said yes in Step 3)
The skill runs these commands itself — do not ask the user to run them. Show progress as you go:
🔄 Starting storage migration (this can take up to 72 hours)...

   az storage account migration start --name stii5trxva2ark4 \
     --resource-group rg-example --sku Standard_ZRS --no-wait

   Polling: az storage account show --name stii5trxva2ark4 --query sku.name
   ...
   ✅ Migration complete: sku.name = Standard_ZRS
For very long migrations, you may surface a checkpoint to the user ("this is still running, check back later") rather than blocking the entire conversation.
Step 5: Deploy 2 — storage SKU patch
After the migration completes, the skill patches the storage SKU in IaC and runs the same deploy command as Step 3 (e.g. `azd up`). This deploy is a no-op confirmation that the IaC matches the live state. Confirm once with the user before executing, then run it directly.

Step 2 (both paths): Re-Assess

After changes are applied (CLI) or deployed (IaC), automatically re-run the assessment and show the same feature-pivoted table as Phase 3, with each feature row's status updated to reflect the new state. Briefly call out what changed since the previous run.
🔄 Reliability Re-Assessment — rg-eventhubs-python-jan13 (eastus)
───────────────────────────────────────────────────────────────────────────────────────
Reliability Feature              Status      Resources
───────────────────────────────────────────────────────────────────────────────────────
Zone redundancy — compute        🟢 ON       • plan-ii5trxva2ark4 (FC1)              — now ON

Zone-redundant storage           🟢 ZRS      • stii5trxva2ark4                       — GRS → ZRS

Health probes                    🔴 OFF      • func-api-ii5trxva2ark4                — still off (FC1, code change declined)

Multi-region failover            🔴 OFF      • Single region (eastus) only
───────────────────────────────────────────────────────────────────────────────────────

What changed: Function App plan zone redundancy and storage replication.
(Multi-region offered next — see Step 3.)

Step 3 (both paths): Multi-region follow-up — ASK and WAIT

Multi-region is a significant cost/complexity step. Do NOT start it automatically. After re-assessment, only if all core single-region reliability features are 🟢 ON (zone-redundant compute, ZRS/GZRS storage, health probes), explicitly ask the user and wait for their response before doing anything:
🟢 Your app is now fully zone-redundant in {region}.

The next step (optional) is multi-region failover with Azure Front Door:
   • Deploys compute + storage in a second region (paired region recommended)
   • Adds Azure Front Door for global load balancing with health-probe-driven failover
   • Protects against full region outages
   • Estimated additional cost: ~2x compute (active-passive); Front Door ~$35/month base

Do you want me to set up multi-region failover now? (yes / no / later)
  • yes → proceed with references/configure-multi-region.md. Confirm secondary region choice with the user, then:
    1. Generate the multi-region IaC (Bicep / Terraform additions for the secondary region + Front Door).
    2. Confirm once with the user: "📦 Multi-region IaC generated. Ready to deploy with `azd up`. Proceed? (yes / no)"
    3. On yes, the skill runs the deploy itself (`azd up` / `az deployment group create` / `terraform apply`) and streams output. Do not stop and tell the user to run it.
    4. After successful deploy, run a final re-assessment so the user sees Multi-region failover flip to 🟢 ON.
  • no / later → leave the deployment as-is. Note that single-region zone-redundant is a reliable end state; multi-region can be revisited anytime.
⛔ Do not skip the wait. Do not generate multi-region IaC, deploy a Front Door, or modify any files until the user has explicitly said yes. If core reliability is not yet all 🟢, do not ask about multi-region — finish the core gaps first.
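The gate described above (only raise multi-region once every core single-region feature is green) could be checked with something like the sketch below. The dictionary keys and function name are hypothetical, and treating a missing row as "not green" is an assumption.

```python
from typing import Dict

# Hypothetical keys for the three core single-region feature rows.
CORE_FEATURES = ("compute_zone_redundancy", "zone_redundant_storage",
                 "health_probes")


def should_offer_multi_region(statuses: Dict[str, str]) -> bool:
    """True only when every core feature row starts with the green marker."""
    return all(statuses.get(feature, "").startswith("🟢")
               for feature in CORE_FEATURES)
```

A True result only permits asking the question; generating IaC or deploying still waits for the user's explicit yes.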

Priority Classification

| Priority | Criteria | Action |
|---|---|---|
| Critical | No zone redundancy AND production workload | Fix immediately |
| High | LRS storage on zone-redundant compute | Fix within days |
| Medium | No multi-region (single region but zone-redundant) | Plan for next sprint |
| Low | Missing health probes or monitoring gaps | Track and fix |
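Read top to bottom, the priority table behaves like a first-match rule list. The sketch below encodes that reading; the function name and parameters are hypothetical, it returns only the highest matching priority, and cases the table does not cover (for example a non-production app without zone redundancy) fall through to the later rows, which is an assumption.

```python
from typing import Optional


def classify_priority(zone_redundant_compute: bool,
                      production: bool,
                      storage_sku: str,
                      multi_region: bool,
                      health_probes: bool) -> Optional[str]:
    """Return the highest priority row that matches, or None if none do."""
    if not zone_redundant_compute and production:
        return "Critical"
    if zone_redundant_compute and storage_sku.upper().endswith("_LRS"):
        return "High"
    if zone_redundant_compute and not multi_region:
        return "Medium"
    if not health_probes:
        return "Low"
    return None
```

For instance, a production app with zone-redundant compute on `Standard_LRS` storage lands in the High row even if health probes are also missing.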

Error Handling

| Error | Message | Remediation |
|---|---|---|
| Authentication required | "Please login" | Run `az login` and retry |
| Access denied | "Forbidden" | Confirm Reader/Contributor role assignment |
| Plan doesn't support ZR | "Upgrade required" | Inform user of plan upgrade path + cost delta |
| Region doesn't support AZ | "Region limitation" | Suggest supported regions |

Best Practices

  • Run reliability assessments after every significant infrastructure change
  • Test failover scenarios periodically (at least quarterly)

Skill Boundaries

| Action | This skill does | Hand off to |
|---|---|---|
| Assess reliability posture | ✅ Yes | |
| Recommend improvements | ✅ Yes | |
| Enable zone redundancy (CLI commands) | ✅ Yes | |
| Patch Bicep/Terraform for reliability | ✅ Yes | |
| Generate multi-region IaC | ✅ Yes (additions for the secondary region + Front Door) | `azure-prepare` for full new-app IaC scaffolding |
| Deploy IaC for reliability changes | ✅ Yes (runs `azd up` / `terraform apply` / `az deployment` itself, after user confirmation) | `azure-deploy` for general/non-reliability deploys |
| Validate pre-deployment | Reliability checks only | `azure-validate` for full validation |