Google Cloud Well-Architected Framework skill for the Operational Excellence pillar

Overview

The operational excellence pillar in the Google Cloud Well-Architected Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads to meet the dynamic and ever-evolving needs in the cloud.

Core principles

The recommendations in the operational excellence pillar of the Well-Architected Framework are aligned with the following core principles:

- Ensure operational readiness: Define and measure criteria for a workload to be considered ready for production, including staffing, processes, and governance. Grounding document: https://docs.cloud.google.com/architecture/framework/operational-excellence/operational-readiness-and-performance-using-cloudops
- Manage incidents and problems: Establish structured processes for incident response, communication, and root cause analysis to minimize impact and prevent recurrence. Grounding document: https://docs.cloud.google.com/architecture/framework/operational-excellence/manage-incidents-and-problems
- Manage and optimize cloud resources: Monitor resource utilization and right-size environments to maintain performance while ensuring operational efficiency. Grounding document: https://docs.cloud.google.com/architecture/framework/operational-excellence/manage-and-optimize-cloud-resources
- Automate and manage change: Use Infrastructure as Code (IaC) and CI/CD pipelines to ensure consistent, repeatable, and low-risk deployments and configuration changes. Grounding document: https://docs.cloud.google.com/architecture/framework/operational-excellence/automate-and-manage-change
- Continuously improve and innovate: Regularly review architectures, monitor industry trends, and adapt operations to meet evolving business needs. Grounding document: https://docs.cloud.google.com/architecture/framework/operational-excellence/continuously-improve-and-innovate

Relevant Google Cloud products

Workload assessment questions

Ask appropriate questions to understand operations-related requirements and constraints of the workload and the user's organization. Choose questions from the following list:

Operational readiness and performance
- How do you define and measure operational readiness for your cloud workloads, and what specific criteria or metrics do you use?
- Describe your process for defining, tracking, and achieving service level objectives (SLOs) for your critical workloads.
Incident and problem management
- Describe your incident management process, including roles, responsibilities, and communication channels.
- How do you conduct post-incident reviews (PIRs) to identify root causes and implement preventive measures?
Resource management and optimization
- How do you ensure that your cloud resources are right-sized for your workloads, and what tools or techniques do you use?
Change automation
- Describe your change management process, including approval workflows, testing procedures, and deployment strategies.
- How do you automate deployments, ensure their consistency, and manage configuration?
Continuous improvement
- How do you ensure that your cloud operations are continuously adapting to meet evolving business needs and technological advancements?
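
When discussing the SLO question above, it can help to make the idea of an error budget concrete. The following is a minimal sketch, not part of the framework itself: the `SloWindow` type, the `error_budget_remaining` function, and the sample request counts are all hypothetical, and it assumes a simple request-based availability SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO compliance window (hypothetical type)."""
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < window.slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests <= 0:
        return 1.0  # no traffic yet, so none of the budget has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


# Illustrative numbers only: 250 failures out of 1,000,000 requests at 99.9%.
window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{error_budget_remaining(window):.0%} of the error budget remains")
```

A team that tracks this value over time has a direct answer to whether an SLO is being achieved, which is the kind of evidence the question is meant to surface.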

Validation checklist

Use the following checklist to evaluate the architecture's alignment with operational excellence recommendations:

Operational readiness
- A formal framework or set of criteria exists to assess operational readiness before production deployment.
- Service Level Objectives (SLOs) are explicitly defined and monitored using automated tools.
Incident management
- Incident response roles and communication channels are clearly defined and documented.
- A structured, blameless post-mortem process is followed for all major incidents.
Change automation
- All infrastructure changes are performed using Infrastructure as Code (IaC) to ensure consistency.
- CI/CD pipelines are integrated with automated testing for all deployment changes.
Resource optimization
- Resource utilization is regularly reviewed using recommendations from Active Assist or performance data.
Culture of improvement
- A documented strategy is in place for regularly reviewing and adapting cloud operations to industry advancements.
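
One possible way to record the outcome of a review against this checklist is to hold the items as data and report what is still open per category. This is a sketch under stated assumptions: the item wording is paraphrased from the checklist above, and the `CHECKLIST` structure, the `find_gaps` helper, and the sample review result are hypothetical.

```python
# Checklist items paraphrased from the list above, grouped by category.
CHECKLIST: dict[str, list[str]] = {
    "Operational readiness": [
        "Readiness criteria assessed before production deployment",
        "SLOs defined and monitored with automated tools",
    ],
    "Incident management": [
        "Incident roles and communication channels documented",
        "Blameless post-mortem followed for major incidents",
    ],
    "Change automation": [
        "Infrastructure changes made through IaC",
        "CI/CD pipelines include automated testing",
    ],
    "Resource optimization": [
        "Resource utilization reviewed regularly",
    ],
    "Culture of improvement": [
        "Documented strategy for reviewing and adapting operations",
    ],
}


def find_gaps(satisfied: set[str]) -> dict[str, list[str]]:
    """Map each category to the checklist items that are not yet satisfied."""
    gaps: dict[str, list[str]] = {}
    for category, items in CHECKLIST.items():
        missing = [item for item in items if item not in satisfied]
        if missing:
            gaps[category] = missing
    return gaps


# Hypothetical review outcome: only these two items were confirmed.
confirmed = {
    "SLOs defined and monitored with automated tools",
    "Infrastructure changes made through IaC",
}
for category, missing in find_gaps(confirmed).items():
    print(f"{category}: {len(missing)} open item(s)")
```

Categories with no open items are omitted from the result, so an empty dictionary means every item in the checklist was confirmed.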

google-cloud-waf-operational-excellence
