gke-observability

GKE Observability

This reference covers monitoring, logging, and metrics configuration for GKE. The golden path enables comprehensive observability, including control-plane metrics.
MCP Tools: get_cluster, list_k8s_events, get_k8s_logs, get_k8s_cluster_info, describe_k8s_resource
CLI-only: gcloud container clusters update --monitoring=... and gcloud logging read

Golden Path Observability Defaults

| Setting | Golden Path Value | Notes |
|---|---|---|
| loggingConfig components | SYSTEM_COMPONENTS, WORKLOADS | Full workload logging |
| monitoringConfig components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | Full suite including control-plane |
| managedPrometheusConfig.enabled | true | Google-managed Prometheus |
| advancedDatapathObservabilityConfig.enableMetrics | true | Dataplane V2 flow metrics |
| loggingService | logging.googleapis.com/kubernetes | Cloud Logging |
| monitoringService | monitoring.googleapis.com/kubernetes | Cloud Monitoring |
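One way to check a cluster against these defaults is to compare its enabled monitoring components with the list in the table. The sketch below assumes the enabled components have already been extracted into a comma-separated string; the function name `missing_components` is hypothetical and not part of any GKE tooling.

```shell
#!/usr/bin/env bash
# Golden path monitoring components, copied from the table above.
GOLDEN="SYSTEM_COMPONENTS,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,JOBSET,CADVISOR,KUBELET,DCGM,APISERVER,SCHEDULER,CONTROLLER_MANAGER"

# missing_components ENABLED   (hypothetical helper)
# Prints each golden path component that is absent from the comma-separated
# ENABLED list, one per line. Prints nothing when every component is present.
missing_components() {
  local enabled=",$1,"   # wrap in commas so each name matches as a whole token
  local component
  local IFS=','          # split GOLDEN on commas inside this function only
  for component in $GOLDEN; do
    case "$enabled" in
      *",$component,"*) ;;                    # present: nothing to report
      *) printf '%s\n' "$component" ;;        # absent: report it
    esac
  done
}
```

On a live cluster the input could come from `gcloud container clusters describe` with a format expression that selects the monitoring components. The exact field path and list separator are assumptions to verify against your gcloud version, so normalise the output to commas before passing it in.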

Control-Plane Metrics (Golden Path Addition)

The golden path adds three control-plane monitoring components not present in default clusters:

| Component | What It Monitors |
|---|---|
| APISERVER | API server request latency, error rates, admission webhook performance |
| SCHEDULER | Scheduling latency, pending pods, scheduling failures |
| CONTROLLER_MANAGER | Controller work queue depth, reconciliation latency |

These are critical for diagnosing cluster-level issues (slow API responses, scheduling delays, stuck controllers).

Enabling Full Monitoring

bash
# Enable golden path monitoring suite
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \
  --quiet

# Enable Managed Prometheus
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-managed-prometheus \
  --quiet

# Enable Dataplane V2 observability metrics
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-dataplane-v2-flow-observability \
  --quiet

Managed Prometheus

The golden path enables Google Managed Prometheus for metrics collection and querying.
Querying metrics:
  • Use Cloud Monitoring Metrics Explorer in the console
  • Use PromQL via the Prometheus UI or API
  • Use Grafana dashboards via Managed Grafana
Key GKE metrics:

| Metric | Source | Use |
|---|---|---|
| container_cpu_usage_seconds_total | cAdvisor | Pod CPU usage |
| container_memory_working_set_bytes | cAdvisor | Pod memory usage |
| kube_pod_status_phase | kube-state-metrics | Pod lifecycle |
| apiserver_request_duration_seconds | API Server | Control plane latency |
| scheduler_scheduling_duration_seconds | Scheduler | Scheduling performance |
| node_cpu_seconds_total | Kubelet | Node CPU |
| DCGM_FI_DEV_GPU_UTIL | DCGM | GPU utilization |
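As an illustration of the PromQL-over-API route, the sketch below builds a request against the Prometheus-compatible query endpoint of Cloud Monitoring. The helper name `prom_query_url` and the example query are assumptions for illustration. The curl call is left commented out because it needs credentials and a real project.

```shell
#!/usr/bin/env bash
# prom_query_url PROJECT_ID   (hypothetical helper)
# Prints the Prometheus-compatible instant-query URL for a project.
prom_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}

# Example PromQL (an assumption, not from the original text): per-pod CPU
# usage rate over 5 minutes, based on the cAdvisor metric in the table above.
QUERY='sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))'

# Uncomment to run against a real project (requires gcloud credentials):
# curl -s "$(prom_query_url "<PROJECT_ID>")" \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   --data-urlencode "query=${QUERY}"
```

The same query string can be pasted into Metrics Explorer in PromQL mode, which is often quicker for ad hoc checks.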

Live Resource Usage (kubectl-only)

No MCP or gcloud equivalent exists for live resource usage. Use kubectl top:
bash
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top nodes
kubectl top pods --containers -n <NAMESPACE>  # per-container breakdown
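When the full listing is too long to scan, the output can be filtered with standard text tools. The sketch below is one possible filter, assuming the CPU column is reported in millicores (for example `250m`); the function name `pods_over_cpu` is hypothetical.

```shell
#!/usr/bin/env bash
# pods_over_cpu THRESHOLD_MILLICORES   (hypothetical helper)
# Reads lines shaped like `kubectl top pods --all-namespaces --no-headers`
# output (NAMESPACE NAME CPU MEMORY) on stdin and prints "namespace/pod cpu"
# for pods whose CPU value exceeds the threshold.
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $3
    sub(/m$/, "", cpu)                      # strip the millicore suffix
    if (cpu + 0 > limit + 0) print $1 "/" $2, $3
  }'
}

# Usage with a live cluster:
# kubectl top pods --all-namespaces --no-headers | pods_over_cpu 500
```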

Cloud Logging (gcloud-only)

Querying cluster logs (no MCP equivalent; use gcloud logging read):
bash
# System component logs
gcloud logging read \
  'resource.type="k8s_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Workload logs for a specific namespace
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Audit logs (who did what)
gcloud logging read \
  'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet
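Hand-written filters like the ones in this section are easy to mis-quote, so a small helper can assemble them. The sketch below is one possible approach; the function name `container_log_filter` and the optional severity argument are assumptions for illustration.

```shell
#!/usr/bin/env bash
# container_log_filter CLUSTER NAMESPACE [MIN_SEVERITY]   (hypothetical helper)
# Prints a Cloud Logging filter for workload logs in one namespace,
# optionally restricted to entries at or above MIN_SEVERITY.
container_log_filter() {
  local filter
  filter="resource.type=\"k8s_container\" AND resource.labels.cluster_name=\"$1\" AND resource.labels.namespace_name=\"$2\""
  if [ -n "${3:-}" ]; then
    filter="$filter AND severity>=$3"
  fi
  printf '%s\n' "$filter"
}

# Usage (requires gcloud and a real project):
# gcloud logging read "$(container_log_filter "<CLUSTER_NAME>" "<NAMESPACE>" ERROR)" \
#   --project "<PROJECT_ID>" --freshness=1h --limit 50 --format=json
```

Narrowing the time window with `--freshness` keeps queries fast and avoids reading more log data than the investigation needs.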

Diagnostic Settings

For security monitoring and troubleshooting, enable control-plane audit logs. Start by viewing the current logging configuration to see which components are already enabled:
bash
# View current logging config
gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
  --format="yaml(loggingConfig)" \
  --quiet

Alerting

Set up alerts for critical conditions:

| Condition | Metric | Threshold |
|---|---|---|
| High API server latency | apiserver_request_duration_seconds | P99 > 5s |
| Pod crash loops | kube_pod_container_status_restarts_total | > 5 in 10min |
| Node not ready | kube_node_status_condition | condition=Ready, status!=True |
| High GPU utilization | DCGM_FI_DEV_GPU_UTIL | > 95% sustained |
| PVC near capacity | kubelet_volume_stats_used_bytes / capacity | > 85% |
| Scheduling failures | scheduler_schedule_attempts_total{result="error"} | > 0 |
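The thresholds in the table can be written as PromQL conditions. The sketch below keeps one candidate expression per condition in a small lookup function. The function name `alert_expr`, the evaluation windows (5m, 10m, 15m), and the use of `kubelet_volume_stats_capacity_bytes` as the capacity denominator are all assumptions to adjust for your environment.

```shell
#!/usr/bin/env bash
# alert_expr NAME   (hypothetical helper)
# Prints a candidate PromQL alert condition for a row of the table above.
# Windows such as [5m] and [15m] are illustrative assumptions.
alert_expr() {
  case "$1" in
    api_latency)
      printf '%s\n' 'histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) > 5' ;;
    crash_loop)
      printf '%s\n' 'increase(kube_pod_container_status_restarts_total[10m]) > 5' ;;
    node_not_ready)
      printf '%s\n' 'kube_node_status_condition{condition="Ready",status="true"} == 0' ;;
    gpu_util)
      printf '%s\n' 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) > 95' ;;
    pvc_usage)
      printf '%s\n' 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85' ;;
    sched_errors)
      printf '%s\n' 'increase(scheduler_schedule_attempts_total{result="error"}[10m]) > 0' ;;
    *)
      return 1 ;;   # unknown condition name
  esac
}
```

For example, `alert_expr crash_loop` prints the restart-count condition, which can then be pasted into a PromQL-based alerting policy.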

Proposing Dashboards & Alerts (Production Rules)

When designing or proposing alerting and dashboard strategies for GKE:
  1. Always explicitly name Google Cloud Monitoring as the platform to implement these alerts and dashboards.
  2. Always include API server latency (via the apiserver_request_duration_seconds metric) on the dashboard as a critical indicator of control plane health, alongside node CPU/Memory and pod crash loops.

Cost Considerations

Monitoring and logging have associated costs:
  • Cloud Logging: Charged per GiB ingested beyond the free tier (50 GiB/project/month)
  • Cloud Monitoring: Free for GKE system metrics; custom metrics are charged per time series
  • Managed Prometheus: Charged per sample ingested
To reduce costs in non-production environments:
bash
# Reduce to system-only monitoring
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM \
  --quiet

Distributed Tracing & Continuous Profiling (Recommended)

These are not golden path defaults, but they are recommended for production microservice architectures and performance-sensitive workloads.
  • Cloud Trace: Add the OpenTelemetry SDK to your app with the opentelemetry-operations-go (or equivalent) exporter. Traces appear in the Cloud Trace console and help identify cross-service latency bottlenecks.
  • Cloud Profiler: Add the Cloud Profiler agent to your app. It profiles CPU and memory usage in production with low overhead, identifies hotspots, and lets you compare profiles across versions.

LQL Query Examples

Common Logging Query Language patterns for GKE troubleshooting:

Error logs for a specific container:
resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR

OOMKilled events:
resource.type="k8s_event" AND jsonPayload.reason="OOMKilling"

Pod scheduling failures:
resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling"

Audit logs (who did what):
resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"