google-cloud-waf-reliability
Google Cloud Well-Architected Framework skill for the Reliability pillar
Overview
The Reliability pillar of the Google Cloud Well-Architected Framework provides
principles and recommendations to help you design, deploy, and manage reliable,
resilient, and highly available workloads in Google Cloud. A reliable system
consistently performs its intended functions under defined conditions, is
resilient to failures, and recovers gracefully from disruptions, thereby
minimizing downtime, enhancing user experience, and ensuring data integrity.
Core principles
The recommendations in the Reliability pillar of the Well-Architected Framework
are aligned with the following core principles:
- Define reliability based on user-experience goals: Measurement of reliability should reflect the actual experience of the system's users rather than merely relying on infrastructure metrics. Focus on outcomes that matter most to users. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/define-reliability-based-on-user-experience-goals
- Set realistic targets for reliability: Determine appropriate Service Level Objectives (SLOs) that balance the cost and complexity of maximizing availability against business requirements. Utilize error budgets to manage feature velocity. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/set-targets
- Build highly available systems through resource redundancy: Eliminate single points of failure by duplicating critical components across zones and regions to maintain operations during localized outages. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/build-highly-available-systems
- Take advantage of horizontal scalability: Design system architectures to scale horizontally (adding more instances) to seamlessly accommodate load fluctuations and improve overall fault tolerance. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/horizontal-scalability
- Detect potential failures by using observability: Implement thorough monitoring, logging, and alerting systems to proactively detect, diagnose, and address anomalies before they cause user-facing issues. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/observability
- Design for graceful degradation: Architect systems to maintain critical functionality, even if at reduced performance or with limited features, when dependencies fail or the system experiences extreme stress. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/graceful-degradation
- Perform testing for recovery from failures: Build confidence in system resilience by continuously simulating failures and verifying the effectiveness of automated and manual recovery procedures. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/perform-testing-for-recovery-from-failures
- Perform testing for recovery from data loss: Regularly test backup and restore protocols to ensure rapid recovery from data corruption or loss, remaining within the defined Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/perform-testing-for-recovery-from-data-loss
- Conduct thorough postmortems: Foster a blameless culture by investigating outages comprehensively to understand root causes, followed by implementing measures that prevent recurrence. Grounding document: https://docs.cloud.google.com/architecture/framework/reliability/conduct-postmortems
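The principle about realistic targets mentions SLOs and error budgets without showing the arithmetic. The following is a minimal sketch of a request-based error budget, not an implementation taken from the framework: the class name `ErrorBudget`, its fields, and the helper `allowed_downtime_minutes` are hypothetical, and the numbers in the usage section are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical names)."""

    slo_target: float   # fraction of good events required, e.g. 0.999
    total_events: int   # all valid requests observed in the window
    bad_events: int     # requests that failed the SLI's "good" criterion

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if not 0 <= self.bad_events <= self.total_events:
            raise ValueError("need 0 <= bad_events <= total_events")

    @property
    def sli(self) -> float:
        """Measured fraction of good events (1.0 when there is no traffic)."""
        if self.total_events == 0:
            return 1.0
        return (self.total_events - self.bad_events) / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        """How many bad events the SLO tolerates for the observed traffic."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def budget_remaining(self) -> float:
        """Share of the budget still unspent; negative means overspent."""
        if self.total_events == 0:
            return 1.0
        return 1.0 - self.bad_events / self.allowed_bad_events

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based view of the same budget: minutes of full outage allowed."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Illustrative numbers only.
    budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
    print(f"SLI: {budget.sli:.5f}")
    print(f"budget remaining: {budget.budget_remaining:.0%}")
    print(f"30-day allowance: {allowed_downtime_minutes(0.999, 30):.1f} min")
```

One way to use such a value is as a release gate: while `budget_remaining` is comfortably positive, feature work proceeds at normal velocity, and as it approaches zero the team shifts effort toward reliability work.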
Relevant Google Cloud products
The following are examples of Google Cloud products and features that are
relevant to reliability:
- Compute: Compute Engine Managed Instance Groups (MIGs), Google Kubernetes Engine (GKE), Cloud Run
- Networking: Cloud Load Balancing, Cloud CDN, Cloud DNS
- Storage and databases: Cloud Storage (multi-region), Cloud SQL High Availability, Spanner, Filestore, Firestore
- Operations: Cloud Monitoring, Cloud Logging, Google Cloud Managed Service for Prometheus
- Disaster recovery: Backup and DR Service, Filestore backups
Workload assessment questions
Ask appropriate questions to understand the reliability-related requirements and
constraints of the workload and the user's organization. Choose questions from
the following list:
- How does your organization define and measure the reliability of your systems in relation to user experience?
- How does your organization approach setting reliability targets for your services?
- What is your organization's strategy for ensuring high availability through resource redundancy?
- How does your organization leverage horizontal scalability to maintain performance and reliability?
- How does your organization utilize observability (metrics, logs, traces) to gain insights and detect potential failures?
- How does your organization manage alerting based on observability data to ensure timely responses to significant issues without causing alert fatigue?
- What measures does your organization take to ensure systems can gracefully degrade during high load or partial failures?
- How frequently and comprehensively does your organization test for recovery from system failures (e.g., regional failovers, release rollbacks)?
- What is your organization's approach to testing for recovery from data loss?
- How does your organization conduct and utilize postmortems after incidents?
Validation checklist
Use the following checklist to evaluate the architecture's alignment with
reliability recommendations:
- User-focused SLIs and SLOs are explicitly defined and actively monitored.
- The architecture avoids single points of failure through cross-zone or cross-region redundancy.
- Autoscaling is enabled to handle variable demand without manual intervention.
- Application and infrastructure health checks are configured to trigger automated failovers.
- Regular backup schedules are in place, and restoration processes are routinely tested.
- The system architecture incorporates patterns like circuit breakers, retries with exponential backoff, and rate limiting to support graceful degradation.
- Game days or chaos engineering exercises are held regularly to validate failure recovery.
- A formalized, blameless postmortem process exists to ensure organizational learning from operational incidents.
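The checklist names circuit breakers and retries with exponential backoff as patterns that support graceful degradation. The sketch below shows one plausible shape for both using only the Python standard library. Everything in it is an assumption made for illustration: the names `backoff_delays`, `retry_with_backoff`, and `CircuitBreaker`, the default thresholds, and the choice of "full jitter" (a random wait between zero and the exponential ceiling) are not prescribed by the framework, and the breaker is not thread-safe.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, attempts: int) -> list[float]:
    """Upper bound for each wait: base * 2**n, never more than `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def retry_with_backoff(
    call: Callable[[], T],
    *,
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> T:
    """Run `call` until it succeeds or `max_attempts` tries have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    ceilings = backoff_delays(base, cap, max_attempts)
    for attempt, ceiling in enumerate(ceilings, start=1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: a random wait in [0, ceiling] keeps many clients
            # from retrying against the dependency in lockstep.
            sleep(rng(0.0, ceiling))
    raise AssertionError("unreachable")


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures of a dependency."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            return fallback()  # degraded response instead of an error
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result
```

The `sleep`, `rng`, and `clock` parameters exist so that the behavior can be exercised in tests without real waiting. In a service, the `fallback` passed to `CircuitBreaker.call` would typically return a cached or reduced-feature response, which is the graceful-degradation behavior the checklist item describes.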