GKE Observability
This reference covers monitoring, logging, and metrics configuration for GKE.
The golden path enables comprehensive observability including control-plane
metrics.
MCP Tools: `get_cluster`, `list_k8s_events`, `get_k8s_logs`, `get_k8s_cluster_info`, `describe_k8s_resource`. CLI-only: `gcloud container clusters update --monitoring=...`, `gcloud logging read`
Golden Path Observability Defaults
| Setting | Golden Path Value | Notes |
|---|---|---|
| | SYSTEM_COMPONENTS, WORKLOADS | Full workload logging |
| | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | Full suite including control-plane |
| | | Google-managed Prometheus |
| | | Dataplane V2 flow metrics |
| | | Cloud Logging |
| | | Cloud Monitoring |
Control-Plane Metrics (Golden Path Addition)
The golden path adds three control-plane monitoring components not present in
default clusters:
| Component | What It Monitors |
|---|---|
| `API_SERVER` | API server request latency, error rates, admission webhook performance |
| `SCHEDULER` | Scheduling latency, pending pods, scheduling failures |
| `CONTROLLER_MANAGER` | Controller work queue depth, reconciliation latency |
These are critical for diagnosing cluster-level issues (slow API responses,
scheduling delays, stuck controllers).
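One way to confirm that the three control-plane components are enabled is to compare a cluster's enabled monitoring components against the expected set. The sketch below is illustrative only: the function name `missing_control_plane_components` is hypothetical, and it assumes the component list is read from `gcloud container clusters describe` with `--format="value(monitoringConfig.componentConfig.enableComponents)"`, using the API enum names from the defaults table (`APISERVER`, `SCHEDULER`, `CONTROLLER_MANAGER`).

```bash
# Print the control-plane monitoring components that are NOT in the given list.
# $1: enabled components separated by ';' or ',' (assumed input format).
missing_control_plane_components() (
  # Normalise separators to commas and pad both ends so every name is ",NAME,".
  enabled=",$(printf '%s' "$1" | tr ';' ','),"
  for component in APISERVER SCHEDULER CONTROLLER_MANAGER; do
    case "$enabled" in
      *",$component,"*) ;;                   # already enabled
      *) printf '%s\n' "$component" ;;       # missing from the cluster config
    esac
  done
)

# Example with a hard-coded list; in practice the list would come from
# `gcloud container clusters describe` (see the assumption above).
missing_control_plane_components "SYSTEM_COMPONENTS;APISERVER;POD"
```

With the sample list, the function prints `SCHEDULER` and `CONTROLLER_MANAGER`. An empty result means all three components are already collected.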
Enabling Full Monitoring
```bash
# Enable golden path monitoring suite
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \
  --quiet

# Enable Managed Prometheus
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-managed-prometheus \
  --quiet

# Enable Dataplane V2 observability metrics
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --enable-dataplane-v2-flow-observability \
  --quiet
```
Managed Prometheus
Golden path enables Google Managed Prometheus for metrics collection and
querying.
Querying metrics:
- Use Cloud Monitoring Metrics Explorer in the console
- Use PromQL via the Prometheus UI or API
- Grafana dashboards via Managed Grafana
Key GKE metrics:
| Metric | Source | Use |
|---|---|---|
| | cAdvisor | Pod CPU usage |
| | cAdvisor | Pod memory usage |
| | kube-state-metrics | Pod lifecycle |
| | API Server | Control plane latency |
| | Scheduler | Scheduling performance |
| | Kubelet | Node CPU |
| | DCGM | GPU utilization |
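As an illustration of querying with PromQL through the API, the sketch below builds a request for P99 API server latency against the Managed Service for Prometheus HTTP endpoint. The helper name `gmp_query_url`, the `PROJECT_ID` variable, the 5-minute rate window, and the `_bucket` series suffix (the usual naming for Prometheus histograms) are assumptions; the base metric `apiserver_request_duration_seconds` is the one this document names for API server latency.

```bash
# Build the Managed Prometheus instant-query URL for a project (hypothetical helper).
gmp_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}

# PromQL: P99 API server request latency over the last 5 minutes.
P99_QUERY='histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))'

# Only call the API when PROJECT_ID is set, so the snippet is safe to source.
if [ -n "${PROJECT_ID:-}" ]; then
  curl -sS -G "$(gmp_query_url "$PROJECT_ID")" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "query=${P99_QUERY}"
fi
```

The same expression can be pasted into Metrics Explorer's PromQL editor, which avoids handling an access token by hand.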
Live Resource Usage (kubectl-only)
No MCP or gcloud equivalent exists for live resource usage. Use `kubectl top`:
```bash
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top nodes
kubectl top pods --containers -n <NAMESPACE>  # per-container breakdown
```
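The output of `kubectl top` is plain text, so it can be filtered in a pipeline. The sketch below is one possible filter, not part of any tool: the function name `pods_over_cpu` is hypothetical, and it assumes input in the column layout of `kubectl top pods --all-namespaces --no-headers` (namespace, pod, CPU, memory). Sample lines stand in for real cluster output.

```bash
# Read `kubectl top pods --all-namespaces --no-headers` output on stdin and
# print "namespace/pod cpu" for pods at or above a CPU threshold in millicores.
pods_over_cpu() (
  threshold_m="$1"
  awk -v limit="$threshold_m" '
    {
      cpu = $3
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); millicores = cpu + 0 }  # e.g. "250m"
      else            { millicores = cpu * 1000 }                   # e.g. "2" whole cores
      if (millicores >= limit) printf "%s/%s %dm\n", $1, $2, millicores
    }'
)

# Sample input standing in for real kubectl output.
printf '%s\n' \
  'default web-1 250m 128Mi' \
  'default web-2 40m 64Mi' \
  'kube-system dns-0 2 512Mi' | pods_over_cpu 200
```

Against a live cluster, the input would be `kubectl top pods --all-namespaces --no-headers | pods_over_cpu 200`.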
Cloud Logging (gcloud-only)
Querying cluster logs (no MCP equivalent; use `gcloud logging read`):
```bash
# System component logs
gcloud logging read \
  'resource.type="k8s_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Workload logs for a specific namespace
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet

# Audit logs (who did what)
gcloud logging read \
  'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \
  --project <PROJECT_ID> --limit 50 \
  --quiet
```
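When the same query is run for many namespaces, it may be easier to generate the filter than to edit the quoted string by hand. The following is a minimal sketch: the function name `gke_container_filter` and the sample cluster and namespace names are hypothetical, while the filter fields are the ones used in the commands above.

```bash
# Build a Logging filter for container logs in one namespace.
# $1: cluster name, $2: namespace, $3: optional minimum severity (e.g. ERROR).
gke_container_filter() (
  filter="resource.type=\"k8s_container\" AND resource.labels.cluster_name=\"$1\" AND resource.labels.namespace_name=\"$2\""
  if [ -n "${3:-}" ]; then
    filter="$filter AND severity>=$3"   # append the severity clause only when requested
  fi
  printf '%s\n' "$filter"
)

# Print the filter for inspection before running it.
gke_container_filter my-cluster payments ERROR
```

The result can be passed straight to the CLI, for example `gcloud logging read "$(gke_container_filter my-cluster payments ERROR)" --project <PROJECT_ID> --limit 50`. Adding `--freshness=1h` narrows the search to recent entries.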
Diagnostic Settings
For security monitoring and troubleshooting, enable control-plane audit logs. To check what the cluster currently logs:
```bash
# View current logging config
gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
  --format="yaml(loggingConfig)" \
  --quiet
```
Alerting
Set up alerts for critical conditions:
| Condition | Metric | Threshold |
|---|---|---|
| High API server latency | | P99 > 5s |
| Pod crash loops | | > 5 in 10min |
| Node not ready | | condition=Ready, status!=True |
| High GPU utilization | | > 95% sustained |
| PVC near capacity | | > 85% |
| Scheduling failures | | > 0 |
Proposing Dashboards & Alerts (Production Rules)
When designing or proposing alerting and dashboard strategies for GKE:
- Always explicitly name Google Cloud Monitoring as the platform to implement these alerts and dashboards.
- Always include API server latency (via the `apiserver_request_duration_seconds` metric) on the dashboard as a critical indicator of control plane health, alongside node CPU/memory and pod crash loops.
Cost Considerations
Monitoring and logging have associated costs:
- Cloud Logging: Charged per GiB ingested beyond free tier (50 GiB/project/month)
- Cloud Monitoring: Free for GKE system metrics; custom metrics are charged by the volume of metric data ingested
- Managed Prometheus: Charged per sample ingested
To reduce costs in non-production:
```bash
# Reduce to system-only monitoring
gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
  --monitoring=SYSTEM \
  --quiet
```
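Before reducing coverage, it can help to estimate how much log ingestion actually exceeds the free tier. The sketch below applies only the 50 GiB per-project monthly allowance stated above. The function name `billable_logging_gib` is hypothetical, input is assumed to be whole GiB, and no price is assumed.

```bash
# GiB of Cloud Logging ingestion billed after the 50 GiB per-project monthly free tier.
# $1: whole GiB ingested by the project in the month.
billable_logging_gib() (
  free_gib=50
  ingested="$1"
  if [ "$ingested" -gt "$free_gib" ]; then
    echo $((ingested - free_gib))   # only the excess over the free tier is billable
  else
    echo 0                          # fully covered by the free tier
  fi
)

billable_logging_gib 180   # prints 130
billable_logging_gib 35    # prints 0
```

Multiplying the result by the current per-GiB rate from the Cloud Logging pricing page gives a rough monthly figure.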
Distributed Tracing & Continuous Profiling (Recommended)
Not golden path defaults — recommended for production microservice
architectures and performance-sensitive workloads.
- Cloud Trace: Add the OpenTelemetry SDK to your app with the `opentelemetry-operations-go` (or equivalent) exporter. Traces appear in the Cloud Trace console. Identifies cross-service latency bottlenecks.
- Cloud Profiler: Add the Cloud Profiler agent to your app. Profiles CPU and memory usage in production with low overhead. Identifies hotspots and compares across versions.
LQL Query Examples
Common Logging Query Language patterns for GKE troubleshooting:
```
-- Error logs for a specific container
resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR

-- OOMKilled events
resource.type="k8s_event" AND jsonPayload.reason="OOMKilling"

-- Pod scheduling failures
resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling"

-- Audit logs (who did what)
resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"
```