grafana-platform-dashboard

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Grafana Platform Dashboard

Grafana平台仪表盘

Design platform operations dashboards so operators see tenant-impacting risk first, then drill into service-specific health without overload.
设计平台运维仪表盘,让运维人员首先看到影响租户的风险,然后无需过载即可深入查看特定服务的健康状态。

Quick Start

快速开始

Use this skill when the user asks for platform dashboard updates and reliability checks.
  1. Confirm dashboard target:
bash
oc --context <ctx> get grafanadashboard -A | rg -i '<dashboard-name-or-theme>'
  1. Export dashboard and JSON:
bash
skills/grafana-platform-dashboard/scripts/grafanadashboard_roundtrip.sh export \
  --context <ctx> \
  --namespace <ns> \
  --name <grafanadashboard-name> \
  --out-dir /tmp/<workspace>
  1. Edit the JSON and validate all PromQL:
bash
skills/grafana-platform-dashboard/scripts/promql_scan_thanos.sh \
  --context <ctx> \
  --dashboard-json /tmp/<workspace>/<name>.json
  1. Apply live safely:
bash
skills/grafana-platform-dashboard/scripts/grafanadashboard_roundtrip.sh apply \
  --context <ctx> \
  --namespace <ns> \
  --name <grafanadashboard-name> \
  --json /tmp/<workspace>/<name>.json
当用户需要更新平台仪表盘和进行可靠性检查时使用本技能。
  1. 确认仪表盘目标:
bash
oc --context <ctx> get grafanadashboard -A | rg -i '<dashboard-name-or-theme>'
  1. 导出仪表盘和JSON:
bash
skills/grafana-platform-dashboard/scripts/grafanadashboard_roundtrip.sh export \
  --context <ctx> \
  --namespace <ns> \
  --name <grafanadashboard-name> \
  --out-dir /tmp/<workspace>
  1. 编辑JSON并验证所有PromQL:
bash
skills/grafana-platform-dashboard/scripts/promql_scan_thanos.sh \
  --context <ctx> \
  --dashboard-json /tmp/<workspace>/<name>.json
  1. 安全地实时应用:
bash
skills/grafana-platform-dashboard/scripts/grafanadashboard_roundtrip.sh apply \
  --context <ctx> \
  --namespace <ns> \
  --name <grafanadashboard-name> \
  --json /tmp/<workspace>/<name>.json

Workflow

工作流

1) Lock Scope From Platform Contracts

1) 根据平台协议锁定范围

Use the platform contract in platform-contract.md before editing panels.
  1. Keep L1 command view constrained to critical pre-tenant-impact signals.
  2. Use gate-aligned components first (critical CO gate, nodes, MCP, core API/etcd/ingress).
  3. Keep service-specific sections (Crossplane, Keycloak) below L1.
编辑面板前请参考platform-contract.md中的平台协议。
  1. 保持L1命令视图仅限定于会预先影响租户的关键信号。
  2. 优先使用与关口对齐的组件(关键CO关口、节点、MCP、核心API/etcd/ingress)。
  3. 将特定服务板块(Crossplane、Keycloak)放在L1下方。

2) Enforce Information Architecture

2) 执行信息架构规范

Use layout-guidelines.md:
  1. L1: critical-only, immediate action, minimal panel budget.
  2. L2: platform services by dependency domain.
  3. L3: deep dives (for example future GPU dashboard), not in L1.
参考layout-guidelines.md
  1. L1:仅展示关键内容、需立即处理的信息,严格控制面板数量。
  2. L2:按依赖领域划分的平台服务。
  3. L3:深度排查内容(例如未来的GPU仪表盘),不放在L1中。

3) Build Queries From Known Library

3) 从已知库构建查询

Use promql-library.md:
  1. Start from known-good queries and adapt labels minimally.
  2. Prefer counts and action tables over decorative charts.
  3. Filter alert noise explicitly (for example ArgoCD/GitOps) when requested.
参考promql-library.md
  1. 从已验证可用的查询开始,仅对标签做最小调整。
  2. 优先使用计数和操作表,而非装饰性图表。
  3. 如有需要,显式过滤告警噪音(例如ArgoCD/GitOps产生的)。

4) Validate Before Apply

4) 应用前验证

Always run the scan script after edits:
bash
skills/grafana-platform-dashboard/scripts/promql_scan_thanos.sh \
  --context <ctx> \
  --dashboard-json <file.json> \
  --output <scan.tsv>
Pass criteria: all queries report
success
, zero bad/parse errors.
编辑完成后务必运行扫描脚本:
bash
skills/grafana-platform-dashboard/scripts/promql_scan_thanos.sh \
  --context <ctx> \
  --dashboard-json <file.json> \
  --output <scan.tsv>
通过标准:所有查询返回
success
,无错误/解析错误。

5) Apply and Verify Sync

5) 应用并验证同步

Apply only after validation succeeds:
bash
skills/grafana-platform-dashboard/scripts/grafanadashboard_roundtrip.sh apply ...
oc --context <ctx> -n <ns> get grafanadashboard <name> \
  -o jsonpath='{.status.conditions[?(@.type=="DashboardSynchronized")].status}{"|"}{.status.conditions[?(@.type=="DashboardSynchronized")].reason}{"\n"}'
仅在验证通过后应用变更:
bash
skills/grafana-platform-dashboard/scripts/grafanadashboard_roundtrip.sh apply ...
oc --context <ctx> -n <ns> get grafanadashboard <name> \
  -o jsonpath='{.status.conditions[?(@.type=="DashboardSynchronized")].status}{"|"}{.status.conditions[?(@.type=="DashboardSynchronized")].reason}{"\n"}'

6) Close With Operator-Focused Summary

6) 以运维为核心的收尾总结

Report:
  1. What changed (panel names and intent).
  2. Validation result (query count and failures).
  3. Sync status and any residual risk.
  4. Next step: promote live changes into GitOps-managed source.
报告内容:
  1. 变更内容(面板名称和调整目的)。
  2. 验证结果(查询数量和失败数)。
  3. 同步状态和所有剩余风险。
  4. 下一步:将实时变更推广到GitOps管理的源代码中。

Design Rules

设计规则

  1. Put critical tenant-impact predictors first.
  2. Every red panel must imply an action path.
  3. Avoid ambiguous panel names (for example replace “platform pods” with concrete namespace scope).
  4. Keep L1 low-noise; move detail below or to dedicated dashboards.
  5. Keep GPU deep diagnostics in a dedicated GPU dashboard, not mixed into L1.
  1. 优先放置影响租户的关键预测指标。
  2. 所有红色面板都必须对应明确的操作路径。
  3. 避免模糊的面板名称(例如将“平台pods”替换为具体的命名空间范围)。
  4. 保持L1低噪音;将详细内容移到下方或专属仪表盘。
  5. 将GPU深度诊断内容放在专属GPU仪表盘,不要混入L1。

References

参考资料

  1. Platform Contract
  2. PromQL Panel Library
  3. Layout Guidelines
  1. 平台协议
  2. PromQL面板库
  3. 布局规范