Grafana Cloud Infrastructure Monitoring
Kubernetes Monitoring (k8s-monitoring Helm Chart)
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```

```yaml
# values.yaml
cluster:
  name: production-us-east

externalServices:
  prometheus:
    host: https://prometheus-prod-xx.grafana.net
    basicAuth:
      username: "123456"
      password:
        secretName: grafana-cloud-secret
        secretKey: api-key
  loki:
    host: https://logs-prod-xx.grafana.net
    basicAuth:
      username: "234567"
      password:
        secretName: grafana-cloud-secret
        secretKey: api-key
  tempo:
    host: https://tempo-prod-xx.grafana.net:443
    basicAuth:
      username: "345678"
      password:
        secretName: grafana-cloud-secret
        secretKey: api-key

metrics:
  enabled: true
  cost:
    enabled: true  # Kubernetes cost monitoring
  podMonitors:
    enabled: true
  serviceMonitors:
    enabled: true
  kube-state-metrics:
    enabled: true
  node-exporter:
    enabled: true
  cadvisor:
    enabled: true

logs:
  pod_logs:
    enabled: true
  cluster_events:
    enabled: true

traces:
  enabled: true

profiles:
  enabled: false

receivers:
  grpc:
    enabled: true
    port: 4317
  http:
    enabled: true
    port: 4318
```

```bash
# The secret lives in the monitoring namespace, so that namespace must exist
# before the secret is created (skip this line if it already exists).
kubectl create namespace monitoring

kubectl create secret generic grafana-cloud-secret \
  --from-literal=api-key=<your-api-key> \
  -n monitoring

helm install k8s-monitoring grafana/k8s-monitoring \
  -n monitoring --create-namespace \
  -f values.yaml
```
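
After the install, it can help to confirm that the release deployed and that the secret referenced in `values.yaml` exists. The following is a minimal sketch, assuming the release name `k8s-monitoring`, the `monitoring` namespace, and the `grafana-cloud-secret` secret used above; the function name `verify_k8s_monitoring` is hypothetical.

```bash
# Hypothetical helper: checks the Helm release, the secret, and the pods.
# $1 = namespace (default: monitoring), $2 = release (default: k8s-monitoring)
verify_k8s_monitoring() {
  local ns="${1:-monitoring}"
  local release="${2:-k8s-monitoring}"

  # Fails if the release is not installed in this namespace.
  helm status "$release" -n "$ns" >/dev/null || {
    echo "release $release not found in namespace $ns" >&2
    return 1
  }

  # values.yaml references this secret for all three backends.
  kubectl get secret grafana-cloud-secret -n "$ns" >/dev/null || {
    echo "secret grafana-cloud-secret missing in namespace $ns" >&2
    return 1
  }

  # List the pods in the namespace so their status can be inspected.
  kubectl get pods -n "$ns"
}
```

Calling `verify_k8s_monitoring` with no arguments checks the defaults; pass a namespace and release name if yours differ.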

Key Kubernetes Metrics
```promql
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{
  namespace="$namespace", container!=""}[5m])) by (pod)

# Memory usage by pod
sum(container_memory_working_set_bytes{
  namespace="$namespace", container!=""}) by (pod)

# Node CPU pressure
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Pod restarts
increase(kube_pod_container_status_restarts_total[1h])

# Deployment readiness
kube_deployment_status_replicas_ready / kube_deployment_spec_replicas

# PVC usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
```
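
`$namespace` in the queries above is a Grafana dashboard variable, so it has to be replaced with a concrete value before a query can be run outside Grafana. The following is a minimal sketch, assuming a Prometheus-compatible HTTP API reachable at `${PROM_URL}/api/v1/query` with basic auth; `render_query`, `run_query`, `PROM_URL`, `PROM_USER`, and `PROM_API_KEY` are hypothetical names, and the correct base URL for your Grafana Cloud stack is an assumption to verify.

```bash
# Replace the Grafana variable $namespace with a concrete namespace.
# $1 = PromQL expression, $2 = namespace name
render_query() {
  printf '%s\n' "$1" | sed 's/\$namespace/'"$2"'/g'
}

# Send an instant query to a Prometheus-compatible HTTP API.
# PROM_URL, PROM_USER and PROM_API_KEY are assumed to be set by the caller.
run_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" \
    -u "${PROM_USER}:${PROM_API_KEY}" \
    --data-urlencode "query=$1"
}
```

For example, `run_query "$(render_query 'sum(container_memory_working_set_bytes{namespace="$namespace", container!=""}) by (pod)' monitoring)"` would query memory usage for the `monitoring` namespace.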

AWS CloudWatch Integration
```alloy
// Alloy config for AWS CloudWatch scraping
prometheus.scrape "cloudwatch" {
  targets    = [{"__address__" = "cloudwatch-exporter:9106"}]
  forward_to = [prometheus.remote_write.cloud.receiver]
}
```

Or use the CloudWatch datasource directly:

```yaml
# provisioning/datasources/cloudwatch.yaml
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      defaultRegion: us-east-1
      authType: default  # uses EC2 instance role / ECS task role
      # Or explicit credentials (access key and secret key):
      # authType: keys
    # secureJsonData:
    #   accessKey: AKIAIOSFODNN7EXAMPLE
    #   secretKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

Azure Monitor Integration
```yaml
# provisioning/datasources/azure.yaml
apiVersion: 1
datasources:
  - name: Azure Monitor
    type: grafana-azure-monitor-datasource
    jsonData:
      cloudName: AzureCloud
      tenantId: your-tenant-id
      clientId: your-client-id
    secureJsonData:
      clientSecret: your-client-secret
```

GCP / Google Cloud Monitoring
```yaml
# provisioning/datasources/google.yaml
apiVersion: 1
datasources:
  - name: Google Cloud Monitoring
    type: stackdriver
    jsonData:
      authenticationType: gce  # uses GCE metadata server
      # Or JWT:
      # authenticationType: jwt
    # secureJsonData:
    #   privateKey: |
    #     { "type": "service_account", ... }
```

Node Exporter / Linux Host Monitoring
```alloy
// Alloy config for Linux host metrics
prometheus.exporter.unix "host" {
  rootfs_path       = "/"
  enable_collectors = ["cpu", "diskstats", "filesystem", "loadavg", "meminfo", "netdev", "stat", "time", "uname"]
}

prometheus.scrape "node" {
  targets         = prometheus.exporter.unix.host.targets
  forward_to      = [prometheus.remote_write.cloud.receiver]
  scrape_interval = "60s"
}
```
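
The scrape blocks in this document forward to `prometheus.remote_write.cloud.receiver`, which none of the snippets define. The following is a minimal sketch of a shell helper that prints such a component, assuming the Grafana Cloud remote-write path `/api/prom/push` and an API key exposed to Alloy through an environment variable. The helper name `print_remote_write` and the variable `GRAFANA_CLOUD_API_KEY` are hypothetical, and `sys.env` assumes a recent Alloy release.

```bash
# Hypothetical helper: print a prometheus.remote_write "cloud" block.
# $1 = Prometheus host (as in values.yaml), $2 = numeric username
print_remote_write() {
  cat <<EOF
prometheus.remote_write "cloud" {
  endpoint {
    url = "$1/api/prom/push"

    basic_auth {
      username = "$2"
      // Assumption: the API key is read from the environment at runtime.
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
}
EOF
}
```

For example, `print_remote_write https://prometheus-prod-xx.grafana.net 123456 >> config.alloy` appends the block to an existing Alloy config file.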

Docker / Container Monitoring
```alloy
// cAdvisor metrics via Alloy
prometheus.scrape "cadvisor" {
  targets      = [{"__address__" = "localhost:8080"}]
  metrics_path = "/metrics"
  forward_to   = [prometheus.remote_write.cloud.receiver]
}

// Docker container logs
loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.cloud.receiver]
}

discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}
```
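
The log pipeline above forwards to `loki.write.cloud.receiver`, which the snippet does not define. The following is a minimal sketch of a shell helper that prints a matching component, assuming the Loki push path `/loki/api/v1/push` on the Loki host from `values.yaml`. The helper name `print_loki_write` and the environment variable `GRAFANA_CLOUD_API_KEY` are hypothetical, and `sys.env` assumes a recent Alloy release.

```bash
# Hypothetical helper: print a loki.write "cloud" block.
# $1 = Loki host (as in values.yaml), $2 = numeric username
print_loki_write() {
  cat <<EOF
loki.write "cloud" {
  endpoint {
    url = "$1/loki/api/v1/push"

    basic_auth {
      username = "$2"
      // Assumption: the API key is read from the environment at runtime.
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
}
EOF
}
```

For example, `print_loki_write https://logs-prod-xx.grafana.net 234567` prints a block that can be pasted next to the `loki.source.docker` component.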

Common Infrastructure Dashboards (Grafana Cloud)
Pre-built dashboards available from the integrations catalog:
- Kubernetes / Cluster (ID: 15520)
- Kubernetes / Namespace (ID: 15521)
- Kubernetes / Pod (ID: 15522)
- Node Exporter Full (ID: 1860)
- cAdvisor (ID: 14282)
- AWS EC2 (via CloudWatch integration)
- Azure VMs (via Azure Monitor integration)
Alerting for Infrastructure
```yaml
# Common infrastructure alert rules
groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} crash looping"

      - alert: NodeMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} low memory (<10% free)"

      - alert: PersistentVolumeAlmostFull
        expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} almost full"
```
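
Both `NodeMemoryPressure` and `PersistentVolumeAlmostFull` match when an available/total ratio drops below 0.1. The comparison can be sketched for a single sample as follows; `ratio_below` is a hypothetical helper, and the byte counts in the usage line are made-up sample values.

```bash
# Succeeds (exit 0) when available/total is below the threshold.
# $1 = available bytes, $2 = total bytes, $3 = threshold (default 0.1)
ratio_below() {
  awk -v a="$1" -v t="$2" -v th="${3:-0.1}" \
    'BEGIN { if (t <= 0) exit 2; exit !((a / t) < th) }'
}
```

For example, `ratio_below 1500000000 16000000000 && echo "below 10%"` prints the message, since roughly 9.4% is available. In the real rules the condition must also hold for the full `for: 5m` window before the alert fires.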