google-cloud-waf-operational-excellence

Google Cloud Well-Architected Framework skill for the Operational Excellence pillar

Overview

The operational excellence pillar in the Google Cloud Well-Architected Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads to meet the dynamic and ever-evolving needs in the cloud.

Core principles

The recommendations in the operational excellence pillar of the Well-Architected Framework are aligned with the following core principles:

Relevant Google Cloud products

The following are examples of Google Cloud products and features that are relevant to operational excellence:
  • Observability and monitoring
    • Cloud Monitoring: Full-stack observability for Google Cloud and hybrid environments.
    • Cloud Logging: Real-time log management and analysis at scale.
    • Error Reporting: Aggregates and displays errors for running cloud services.
    • Service Monitoring: Tools for defining and tracking Service Level Objectives (SLOs).
  • Automation and CI/CD
    • Cloud Build: Serverless platform for building, testing, and deploying software.
    • Cloud Deploy: Managed continuous delivery service for GKE and Cloud Run targets.
    • Infrastructure Manager: Managed service that automates Infrastructure as Code (IaC) deployments using Terraform.
    • Artifact Registry: Central repository for managing build artifacts and container images.
  • Resource management and optimization
    • Recommender (Active Assist): Automatically identifies idle resources and right-sizing opportunities.
    • Resource Manager: Hierarchical management of resources across organizations, folders, and projects.
  • Incident response
    • Incident response & management (IRM): Structured tools and processes for managing operational disruptions.
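
The SLO tracking mentioned under Service Monitoring rests on simple error-budget arithmetic. The following is a minimal sketch of that arithmetic for a request-based availability SLO. It is plain Python and does not call any Google Cloud API; the function name, field names, and sample numbers are illustrative assumptions, not part of the framework.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The error budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        # Fraction of the budget already spent; values above 1.0 mean the SLO is missed.
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": availability >= slo_target,
    }


# Illustrative numbers only: a 99.9% target over one million requests.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

A team might review `budget_consumed` before each release and slow down changes when most of the budget is spent; that policy is a common practice rather than something this document prescribes.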

Workload assessment questions

Ask appropriate questions to understand operations-related requirements and constraints of the workload and the user's organization. Choose questions from the following list:
  • Operational readiness and performance
    • How do you define and measure operational readiness for your cloud workloads, and what specific criteria or metrics do you use?
    • Describe your process for defining, tracking, and achieving SLOs for your critical workloads.
  • Incident and problem management
    • Describe your incident management process, including roles, responsibilities, and communication channels.
    • How do you conduct post-incident reviews (PIRs) to identify root causes and implement preventive measures?
  • Resource management and optimization
    • How do you ensure that your cloud resources are right-sized for your workloads, and what tools or techniques do you use?
  • Change automation
    • Describe your change management process, including approval workflows, testing procedures, and deployment strategies.
    • How do you automate deployments, ensure their consistency, and manage configuration?
  • Continuous improvement
    • How do you ensure that your cloud operations are continuously adapting to meet evolving business needs and technological advancements?
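
When a user answers the right-sizing question above, one technique they may describe is comparing peak utilization against provisioned capacity. The sketch below illustrates that idea with a nearest-rank 95th percentile over CPU utilization samples. It is not how Recommender computes its suggestions; the function names and the 30% and 80% thresholds are hypothetical values chosen for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def rightsizing_hint(cpu_utilization, low=0.30, high=0.80):
    """Classify a resource from CPU utilization samples given as fractions (0..1).

    The low and high thresholds are assumptions, not Google Cloud defaults.
    """
    p95 = percentile(cpu_utilization, 95)
    if p95 < low:
        return "downsize candidate"
    if p95 > high:
        return "upsize candidate"
    return "keep"


# Illustrative samples for three hypothetical instances.
print(rightsizing_hint([0.10, 0.12, 0.08, 0.15, 0.11]))
print(rightsizing_hint([0.85, 0.90, 0.95, 0.88]))
print(rightsizing_hint([0.40, 0.50, 0.60]))
```

Using a high percentile rather than the mean keeps short bursts of load from being averaged away, which matters when a downsizing decision could cause throttling.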

Validation checklist

Use the following checklist to evaluate the architecture's alignment with operational excellence recommendations:
  • Operational readiness
    • A formal framework or set of criteria exists to assess operational readiness before production deployment.
    • Service Level Objectives (SLOs) are explicitly defined and monitored using automated tools.
  • Incident management
    • Incident response roles and communication channels are clearly defined and documented.
    • A structured, blameless post-mortem process is followed for all major incidents.
  • Change automation
    • All infrastructure changes are performed using Infrastructure as Code (IaC) to ensure consistency.
    • CI/CD pipelines are integrated with automated testing for all deployment changes.
  • Resource optimization
    • Resource utilization is regularly reviewed using recommendations from Active Assist or performance data.
  • Culture of improvement
    • A documented strategy is in place for regularly reviewing and adapting cloud operations to industry advancements.
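
The checklist above can be recorded as structured data so that a review produces a per-category summary and a list of gaps. The following is a minimal sketch of one way to do that. The `ChecklistItem` type, the helper names, and the sample answers are hypothetical, and the statements are abbreviated from the checklist.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    category: str
    statement: str
    satisfied: bool


def summarize(items):
    """Return {category: (satisfied_count, total_count)} in first-seen order."""
    summary = {}
    for item in items:
        done, total = summary.get(item.category, (0, 0))
        summary[item.category] = (done + int(item.satisfied), total + 1)
    return summary


def gaps(items):
    """Return the statements that the architecture does not yet satisfy."""
    return [item.statement for item in items if not item.satisfied]


# Sample answers are made up for illustration.
review = [
    ChecklistItem("Operational readiness", "Readiness criteria exist before production deployment", True),
    ChecklistItem("Operational readiness", "SLOs are defined and monitored with automated tools", False),
    ChecklistItem("Incident management", "Roles and communication channels are documented", True),
    ChecklistItem("Change automation", "Infrastructure changes are performed using IaC", False),
]

for category, (done, total) in summarize(review).items():
    print(f"{category}: {done}/{total}")
for statement in gaps(review):
    print("GAP:", statement)
```

Keeping the unsatisfied statements verbatim makes it straightforward to turn each gap into a follow-up recommendation.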