GKE Observability

This reference covers monitoring, logging, and metrics configuration for GKE. The golden path enables comprehensive observability including control-plane metrics.

MCP Tools: `get_cluster`, `list_k8s_events`, `get_k8s_logs`, `get_k8s_cluster_info`, `describe_k8s_resource`. CLI-only: `gcloud container clusters update --monitoring=...`, `gcloud logging read`.

Golden Path Observability Defaults

Setting	Golden Path Value	Notes
`loggingConfig` components	SYSTEM_COMPONENTS, WORKLOADS	Full workload logging
`monitoringConfig` components	SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER	Full suite including control-plane
`managedPrometheusConfig.enabled`	`true`	Google-managed Prometheus
`advancedDatapathObservabilityConfig.enableMetrics`	`true`	Dataplane V2 flow metrics
`loggingService`	`logging.googleapis.com/kubernetes`	Cloud Logging
`monitoringService`	`monitoring.googleapis.com/kubernetes`	Cloud Monitoring

Control-Plane Metrics (Golden Path Addition)

Component	What It Monitors
`APISERVER`	API server request latency, error rates, admission webhook performance
`SCHEDULER`	Scheduling latency, pending pods, scheduling failures
`CONTROLLER_MANAGER`	Controller work queue depth, reconciliation latency

Enabling Full Monitoring

```bash
# Enable golden path monitoring suite
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \
  --quiet

# Enable Managed Prometheus
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-managed-prometheus \
  --quiet

# Enable Dataplane V2 observability metrics
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-dataplane-v2-flow-observability \
  --quiet
```
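After changing the monitoring configuration, it can help to confirm which components are actually enabled. The following is a minimal sketch, not an official procedure: `EXPECTED` and `missing_components` are hypothetical names, the expected list simply mirrors the golden path table, and the commented `gcloud` wiring assumes that the `value()` format joins list items with `;`.

```bash
#!/usr/bin/env bash
# Sketch: report which expected monitoring components are not enabled.
# EXPECTED mirrors the golden path table; adjust it to your own baseline.
EXPECTED="SYSTEM_COMPONENTS STORAGE POD DEPLOYMENT STATEFULSET DAEMONSET HPA JOBSET CADVISOR KUBELET DCGM APISERVER SCHEDULER CONTROLLER_MANAGER"

# missing_components "<space-separated list of enabled components>"
# Prints one missing component per line; prints nothing when all are present.
missing_components() {
  local enabled=" $1 "
  local component
  for component in $EXPECTED; do
    case "$enabled" in
      *" $component "*) ;;          # component is enabled
      *) echo "$component" ;;       # component is missing
    esac
  done
}

# Example wiring (needs gcloud and cluster access, so it is left commented out).
# Assumption: the value() format joins list items with ';'.
# enabled=$(gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
#   --format="value(monitoringConfig.componentConfig.enableComponents)" | tr ';' ' ')
# missing_components "$enabled"
```

An empty result means the cluster matches the expected baseline; any printed name is a component to add to the `--monitoring` list.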

Managed Prometheus

Golden path enables Google Managed Prometheus for metrics collection and querying.

Querying metrics:

- Use Cloud Monitoring Metrics Explorer in the console
- Use PromQL via the Prometheus UI or API
- Use Grafana dashboards via Managed Grafana

Key GKE metrics:

Metric	Source	Use
`container_cpu_usage_seconds_total`	cAdvisor	Pod CPU usage
`container_memory_working_set_bytes`	cAdvisor	Pod memory usage
`kube_pod_status_phase`	kube-state-metrics	Pod lifecycle
`apiserver_request_duration_seconds`	API Server	Control plane latency
`scheduler_scheduling_duration_seconds`	Scheduler	Scheduling performance
`node_cpu_seconds_total`	Kubelet	Node CPU
`DCGM_FI_DEV_GPU_UTIL`	DCGM	GPU utilization
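To make the "PromQL via the API" option concrete, here is a hedged sketch that builds a per-pod CPU query from the metrics above and the URL of the Managed Prometheus query endpoint. The function names are hypothetical, the 5-minute rate window is an arbitrary choice, and the endpoint path reflects the Cloud Monitoring Prometheus HTTP API as I understand it; verify it against current documentation before relying on it.

```bash
#!/usr/bin/env bash
# Sketch: build a PromQL query and the Managed Prometheus query URL.

# promql_pod_cpu <namespace>
# CPU cores used per pod in one namespace, averaged over 5 minutes.
promql_pod_cpu() {
  local namespace="$1"
  printf 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="%s"}[5m]))' "$namespace"
}

# prometheus_query_url <project_id>
# Assumed endpoint shape for the Prometheus-compatible query API.
prometheus_query_url() {
  local project_id="$1"
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query' "$project_id"
}

# Example call (needs curl and gcloud credentials, so it is left commented out):
# curl -s -G "$(prometheus_query_url <PROJECT_ID>)" \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   --data-urlencode "query=$(promql_pod_cpu <NAMESPACE>)"
```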

Live Resource Usage (kubectl-only)

No MCP or gcloud equivalent exists for live resource usage. Use `kubectl top`:

```bash
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top nodes
kubectl top pods --containers -n <NAMESPACE>  # per-container breakdown
```
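The output of `kubectl top` can also be filtered in a script. The sketch below is illustrative only: `pods_over_cpu` is a hypothetical helper, and it assumes the CPU column is reported in millicores (for example `250m`), as in the usual `NAME CPU(cores) MEMORY(bytes)` layout.

```bash
#!/usr/bin/env bash
# Sketch: list pods whose CPU usage exceeds a millicore threshold.

# pods_over_cpu <millicores>
# Reads `kubectl top pods --no-headers` output on stdin and prints the names
# of pods whose CPU column is above the threshold.
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $2
    sub(/m$/, "", cpu)              # strip the millicore suffix
    if (cpu + 0 > limit + 0) print $1
  }'
}

# Example (needs cluster access, so it is left commented out):
# kubectl top pods -n <NAMESPACE> --no-headers | pods_over_cpu 500
```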

Cloud Logging (gcloud-only)

Query cluster logs with `gcloud logging read` (no MCP equivalent exists):

```bash
# System component logs
gcloud logging read \
  'resource.type="k8s_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Workload logs for a specific namespace
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Audit logs (who did what)
gcloud logging read \
  'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet
```
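Because these filters share the same structure, they can be assembled by a small helper instead of being typed by hand. This is a sketch under stated assumptions: `build_container_filter` is a hypothetical function, and the optional severity clause and the `--freshness` and `--format=json` flags in the commented example are additions that the commands above do not use.

```bash
#!/usr/bin/env bash
# Sketch: assemble a Cloud Logging filter for workload logs.

# build_container_filter <cluster> <namespace> [min_severity]
build_container_filter() {
  local cluster="$1" namespace="$2" severity="${3:-}"
  local filter
  filter="resource.type=\"k8s_container\""
  filter="$filter AND resource.labels.cluster_name=\"$cluster\""
  filter="$filter AND resource.labels.namespace_name=\"$namespace\""
  if [ -n "$severity" ]; then
    # Optional clause: keep only entries at or above the given severity.
    filter="$filter AND severity>=$severity"
  fi
  printf '%s' "$filter"
}

# Example (needs gcloud credentials, so it is left commented out):
# gcloud logging read "$(build_container_filter <CLUSTER_NAME> <NAMESPACE> ERROR)" \
#   --project <PROJECT_ID> --freshness=1h --limit 50 --format=json
```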

Diagnostic Settings

For security monitoring and troubleshooting, enable control-plane audit logs. Start by checking which log components the cluster currently collects:

```bash
# View current logging config
gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
  --format="yaml(loggingConfig)" \
  --quiet
```

Alerting

Set up alerts for critical conditions:

Condition	Metric	Threshold
High API server latency	`apiserver_request_duration_seconds`	P99 > 5s
Pod crash loops	`kube_pod_container_status_restarts_total`	> 5 restarts in 10 min
Node not ready	`kube_node_status_condition`	condition=Ready, status!=True
High GPU utilization	`DCGM_FI_DEV_GPU_UTIL`	> 95% sustained
PVC near capacity	`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes`	> 85%
Scheduling failures	`scheduler_schedule_attempts_total{result="error"}`	> 0
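The thresholds in the table can be written as PromQL conditions. The sketch below is one possible translation, not a vetted rule set: `alert_expr` and its alert keys are hypothetical, and the rate and lookback windows (5m, 10m, 15m) are assumptions chosen for illustration, since the table specifies a window only for pod restarts.

```bash
#!/usr/bin/env bash
# Sketch: map each alert condition to a candidate PromQL expression.

# alert_expr <alert-key>
# Prints the PromQL condition for a known key; returns 1 for unknown keys.
alert_expr() {
  case "$1" in
    api-latency)
      echo 'histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) > 5' ;;
    crash-loop)
      echo 'increase(kube_pod_container_status_restarts_total[10m]) > 5' ;;
    node-not-ready)
      echo 'kube_node_status_condition{condition="Ready",status="true"} == 0' ;;
    gpu-util)
      echo 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) > 95' ;;
    pvc-capacity)
      echo 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85' ;;
    sched-errors)
      echo 'increase(scheduler_schedule_attempts_total{result="error"}[10m]) > 0' ;;
    *)
      echo "unknown alert: $1" >&2
      return 1 ;;
  esac
}

# Example: print the expression to paste into a PromQL-based alert condition.
# alert_expr api-latency
```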

Proposing Dashboards & Alerts (Production Rules)

When designing or proposing alerting and dashboard strategies for GKE:

- Always explicitly name Google Cloud Monitoring as the platform to implement these alerts and dashboards.
- Always include API server latency (via the `apiserver_request_duration_seconds` metric) on the dashboard as a critical indicator of control plane health, alongside node CPU/memory and pod crash loops.

Cost Considerations

Monitoring and logging have associated costs:

- Cloud Logging: Charged per GiB ingested beyond the free tier (50 GiB/project/month)
- Cloud Monitoring: Free for GKE system metrics; custom metrics are charged per time series
- Managed Prometheus: Charged per sample ingested
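The Cloud Logging free tier makes the billable volume a simple subtraction. The sketch below shows that arithmetic; the function names are hypothetical, and the price per GiB is deliberately a parameter because this document does not state a rate (look it up on the current pricing page).

```bash
#!/usr/bin/env bash
# Sketch: estimate billable Cloud Logging volume for one project and month.

# billable_logging_gib <ingested_gib> [free_tier_gib]
# Billable volume is ingestion above the free tier (default 50 GiB), never negative.
billable_logging_gib() {
  awk -v ingested="$1" -v free="${2:-50}" 'BEGIN {
    billable = ingested - free
    if (billable < 0) billable = 0
    printf "%g\n", billable
  }'
}

# estimated_logging_cost <ingested_gib> <price_per_gib>
# price_per_gib is a caller-supplied assumption, not a quoted rate.
estimated_logging_cost() {
  awk -v gib="$(billable_logging_gib "$1")" -v price="$2" 'BEGIN {
    printf "%.2f\n", gib * price
  }'
}

# Example: 80 GiB ingested leaves 30 GiB billable after the 50 GiB free tier.
# billable_logging_gib 80
```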

To reduce costs in non-production:

```bash
# Reduce to system-only monitoring
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM \
  --quiet
```

Distributed Tracing & Continuous Profiling (Recommended)

These are not golden path defaults, but they are recommended for production microservice architectures and performance-sensitive workloads.

- Cloud Trace: Add the OpenTelemetry SDK to your app with the `opentelemetry-operations-go` (or equivalent) exporter. Traces appear in the Cloud Trace console, where they help identify cross-service latency bottlenecks.
- Cloud Profiler: Add the Cloud Profiler agent to your app. It profiles CPU and memory usage in production with low overhead, identifies hotspots, and compares profiles across versions.

LQL Query Examples

Common Logging Query Language patterns for GKE troubleshooting:

```
-- Error logs for a specific container
resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR

-- OOMKilled events
resource.type="k8s_event" AND jsonPayload.reason="OOMKilling"

-- Pod scheduling failures
resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling"

-- Audit logs (who did what)
resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"
```
