google-cloud-waf-reliability

Google Cloud Well-Architected Framework skill for the Reliability pillar

Overview

The Reliability pillar of the Google Cloud Well-Architected Framework provides principles and recommendations to help you design, deploy, and manage reliable, resilient, and highly available workloads in Google Cloud. A reliable system consistently performs its intended functions under defined conditions, is resilient to failures, and recovers gracefully from disruptions, thereby minimizing downtime, enhancing user experience, and ensuring data integrity.

Core principles

The recommendations in the reliability pillar of the Well-Architected Framework are aligned with the following core principles:

Define reliability based on user-experience goals: Measurement of reliability should reflect the actual experience of the system's users rather than merely relying on infrastructure metrics. Focus on outcomes that matter most to users. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/define-reliability-based-on-user-experience-goals
Set realistic targets for reliability: Determine appropriate Service Level Objectives (SLOs) that balance the cost and complexity of maximizing availability against business requirements. Use error budgets to balance the pace of feature releases against reliability work. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/set-targets
Build highly available systems through resource redundancy: Eliminate single points of failure by duplicating critical components across zones and regions to maintain operations during localized outages. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/build-highly-available-systems
Take advantage of horizontal scalability: Design system architectures to scale horizontally (adding more instances) to seamlessly accommodate load fluctuations and improve overall fault tolerance. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/horizontal-scalability
Detect potential failures by using observability: Implement thorough monitoring, logging, and alerting systems to proactively detect, diagnose, and address anomalies before they cause user-facing issues. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/observability
Design for graceful degradation: Architect systems to maintain critical functionality, even if at reduced performance or with limited features, when dependencies fail or the system experiences extreme stress. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/graceful-degradation
Perform testing for recovery from failures: Build confidence in system resilience by continuously simulating failures and verifying the effectiveness of automated and manual recovery procedures. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/perform-testing-for-recovery-from-failures
Perform testing for recovery from data loss: Regularly test backup and restore procedures to confirm that you can recover from data corruption or loss within the defined Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/perform-testing-for-recovery-from-data-loss
Conduct thorough postmortems: Investigate outages comprehensively and without blame to understand root causes, then implement measures that prevent recurrence. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/conduct-postmortems
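
To make the SLO and error-budget principle above concrete, the following is a minimal sketch of the arithmetic, not an official Google Cloud API. The function names, the 30-day window, and the request-based accounting are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over a window.

    `slo` is a fraction such as 0.999; `window_days` is an assumed window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    1.0 means no budget used, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of 1,000,000 spends about half of the budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

A team could, for example, pause risky releases when the remaining fraction approaches zero and resume them once the budget recovers; the exact policy is a business decision.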

Relevant Google Cloud products

Workload assessment questions

Ask appropriate questions to understand the reliability-related requirements and constraints of the workload and the user's organization. Choose questions from the following list:

How does your organization define and measure the reliability of your systems in relation to user experience?
How does your organization approach setting reliability targets for your services?
What is your organization's strategy for ensuring high availability through resource redundancy?
How does your organization leverage horizontal scalability to maintain performance and reliability?
How does your organization utilize observability (metrics, logs, traces) to gain insights and detect potential failures?
How does your organization manage alerting based on observability data to ensure timely responses to significant issues without causing alert fatigue?
What measures does your organization take to ensure systems can gracefully degrade during high load or partial failures?
How frequently and comprehensively does your organization test for recovery from system failures (e.g., regional failovers, release rollbacks)?
What is your organization's approach to testing for recovery from data loss?
How does your organization conduct and utilize postmortems after incidents?

Validation checklist

Use the following checklist to evaluate the architecture's alignment with reliability recommendations:

User-focused SLIs and SLOs are explicitly defined and actively monitored.
The architecture avoids single points of failure through cross-zone or cross-region redundancy.
Autoscaling is enabled to handle variable demand without manual intervention.
Application and infrastructure health checks are configured to trigger automated failovers.
Regular backup schedules are in place, and restoration processes are routinely tested.
The system architecture incorporates patterns like circuit breakers, retries with exponential backoff, and rate limiting to support graceful degradation.
Game days or chaos engineering exercises are held regularly to validate failure recovery.
A formalized, blameless postmortem process exists to ensure organizational learning from operational incidents.
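
The checklist mentions retries with exponential backoff. The following is a minimal sketch of that pattern with random jitter added, an assumption beyond the checklist that helps avoid synchronized retries. The helper names, default delays, and the injectable `sleep` parameter are hypothetical choices for illustration; production code would typically retry only errors known to be transient.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2**n, capped at `cap`."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Assumption: every exception is treated as retryable here.
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Full jitter: wait a random time up to the exponential bound.
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        # Hypothetical dependency that fails twice before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry_with_backoff(flaky, base=0.01))
```

Capping the delay and limiting the number of attempts keeps a failing dependency from being retried indefinitely, which complements circuit breakers and rate limiting rather than replacing them.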
