reviewing-cluster-health
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseReviewing Cluster Health
集群健康检查综述
Performs a comprehensive health check of a CockroachDB cluster. Before running diagnostics, this skill gathers deployment context to provide the right queries and tools for the operator's tier.
对CockroachDB集群执行全面的健康检查。在运行诊断之前,该技能会先收集部署上下文,以便为运维人员提供符合对应层级的查询语句和工具。
When to Use This Skill
适用场景
- Daily or shift-start operational health checks
- Before starting maintenance (Self-Hosted, Advanced, BYOC)
- After incidents to confirm recovery
- Verifying production readiness
- Monitoring capacity and performance
For live query issues: Use triaging-live-sql-activity.
For background jobs: Use monitoring-background-jobs.
For range analysis: Use analyzing-range-distribution.
- 日常或换班时的运行状态健康检查
- 开始维护前(Self-Hosted、Advanced、BYOC环境)
- 事件发生后确认恢复情况
- 验证生产环境就绪状态
- 监控容量与性能
针对实时查询问题: 使用triaging-live-sql-activity。
针对后台任务: 使用monitoring-background-jobs。
针对范围分析: 使用analyzing-range-distribution。
Step 1: Gather Context
步骤1:收集上下文
Required Context
必填上下文
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines available diagnostics and operator responsibilities |
| Reason for health check? | Daily check, Pre-maintenance, Post-incident, Pre-upgrade | Prioritizes which dimensions to check first |
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 部署层级? | Self-Hosted、Advanced、BYOC、Standard、Basic | 决定可用的诊断工具及运维人员职责 |
| 健康检查原因? | 日常检查、维护前、事件后、升级前 | 确定优先检查的维度 |
Additional Context (by tier)
各层级补充上下文
If Self-Hosted:
| Question | Options | Why It Matters |
|---|---|---|
| Access available? | SQL + CLI, SQL only | Determines which tools can be used |
| Cloud provider? | AWS, GCP, Azure, On-Premises | Affects infrastructure-level checks |
| Kubernetes deployment? | Yes (Operator, Helm, manual), No | Changes CLI commands and monitoring |
| Node count and regions? | e.g., 9 nodes, 3 regions | Sets expectations for query results |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring in your cloud account |
If Standard:
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for compute utilization assessment |
If Basic: No additional context needed.
若为Self-Hosted:
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 可用访问方式? | SQL + CLI、仅SQL | 决定可使用的工具 |
| 云服务商? | AWS、GCP、Azure、本地部署 | 影响基础设施级别的检查 |
| 是否基于Kubernetes部署? 是(Operator、Helm、手动)、否 | 会改变CLI命令与监控方式 | |
| 节点数量和区域? 例如:9节点,3个区域 | 为查询结果设定预期 |
若为Advanced或BYOC:
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 云服务商?(仅BYOC) | AWS、GCP、Azure | 用于在您的云账户中监控基础设施 |
若为Standard:
| 问题 | 选项 | 重要性说明 |
|---|---|---|
| 当前已配置的vCPU数量? | 具体数值 | 为计算资源利用率评估提供上下文 |
若为Basic: 无需补充上下文。
Context-Driven Routing
基于上下文的路由
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Health Check |
| Advanced | Advanced Health Check |
| BYOC | BYOC Health Check |
| Standard | Standard Health Check |
| Basic | Basic Health Check |
| 层级 | 跳转至 |
|---|---|
| Self-Hosted | 自托管集群健康检查 |
| Advanced | 高级版集群健康检查 |
| BYOC | BYOC集群健康检查 |
| Standard | 标准版集群健康检查 |
| Basic | 基础版集群健康检查 |
Self-Hosted Health Check
Self-Hosted集群健康检查
Applies when: Tier = Self-Hosted
适用场景: 层级 = Self-Hosted
Query 1: Node Liveness
查询1:节点存活状态
sql
SELECT
n.node_id, n.address, n.build_tag AS version, n.locality,
n.is_live, l.epoch,
CASE WHEN n.is_live THEN 'HEALTHY'
WHEN n.is_live IS NULL THEN 'UNKNOWN'
ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;- Any (from
is_live = false) requires immediate investigationgossip_nodes - High suggests repeated restarts (node flapping)
epoch
If CLI is available:
bash
cockroach node status --certs-dir=<certs-dir> --host=<node-address>sql
SELECT
n.node_id, n.address, n.build_tag AS version, n.locality,
n.is_live, l.epoch,
CASE WHEN n.is_live THEN 'HEALTHY'
WHEN n.is_live IS NULL THEN 'UNKNOWN'
ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;- 任何(来自
is_live = false)的情况都需要立即排查gossip_nodes - 较高的值表明节点反复重启(节点波动)
epoch
若CLI可用:
bash
cockroach node status --certs-dir=<certs-dir> --host=<node-address>Query 2: Version Consistency
查询2:版本一致性
sql
SELECT build_tag AS version, COUNT(*) AS node_count,
array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;- Single row = healthy. Two rows = acceptable during rolling upgrade. Three+ = investigate.
sql
SELECT build_tag AS version, COUNT(*) AS node_count,
array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;- 单行结果 = 健康状态。两行结果 = 滚动升级期间属于正常情况。三行及以上 = 需要排查。
Query 3: Storage Capacity
查询3:存储容量
sql
SELECT node_id, store_id,
ROUND(capacity / 1073741824.0, 2) AS total_gb,
ROUND(available / 1073741824.0, 2) AS available_gb,
ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
ELSE 'OK' END AS capacity_status,
range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;sql
SELECT node_id, store_id,
ROUND(capacity / 1073741824.0, 2) AS total_gb,
ROUND(available / 1073741824.0, 2) AS available_gb,
ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
ELSE 'OK' END AS capacity_status,
range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;Query 4: Range Health
查询4:数据范围健康状态
sql
SELECT
CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
ELSE 'unknown' END AS replication_status,
COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;sql
SELECT
CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
ELSE 'unknown' END AS replication_status,
COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;Query 5: Certificate Expiration
查询5:证书过期情况
sql
SELECT node_id,
to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
< now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;sql
SELECT node_id,
to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
< now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;Query 6: Critical Settings
查询6:关键配置
sql
SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'server.time_until_store_dead', 'admission.kv.enabled',
'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;sql
SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'server.time_until_store_dead', 'admission.kv.enabled',
'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;Query 7: Consolidated Summary
查询7:综合汇总
sql
SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];If reason = Pre-maintenance, also check for running jobs:
sql
WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;sql
SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];若检查原因为维护前,还需检查运行中的任务:
sql
WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;Query 8: Production Readiness Assessment
查询8:生产就绪状态评估
Use when verifying a cluster is ready for production workloads or during periodic operational reviews.
sql
-- Node count and replication (minimum 3 nodes for production)
SELECT COUNT(*) AS total_nodes,
COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);
-- Critical production settings check
SELECT variable, value,
CASE
WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
ELSE 'OK'
END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'admission.kv.enabled', 'cluster.preserve_downgrade_option',
'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;
-- Enterprise license status (Self-Hosted only)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];See production-readiness reference for the full production readiness checklist.
用于验证集群是否准备好承接生产工作负载,或定期运行状态评审。
sql
-- 节点数量与副本情况(生产环境至少需要3个节点)
SELECT COUNT(*) AS total_nodes,
COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);
-- 关键生产配置检查
SELECT variable, value,
CASE
WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
ELSE 'OK'
END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'admission.kv.enabled', 'cluster.preserve_downgrade_option',
'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;
-- 企业版许可证状态(仅Self-Hosted)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];完整的生产就绪检查清单请参阅production-readiness reference。
Advanced Health Check
Advanced集群健康检查
Applies when: Tier = Advanced
Advanced clusters are dedicated single-tenant clusters managed by Cockroach Labs. You have node-level visibility via both Cloud Console and SQL.
适用场景: 层级 = Advanced
Advanced集群是由Cockroach Labs管理的专用单租户集群。您可通过Cloud Console和SQL查看节点级别的状态。
Cloud Console Checks
Cloud Console检查项
- Cluster Overview — verify all nodes are live, check node count
- Metrics — CPU utilization, QPS, P99 latency, storage utilization
- Alerts — check for active alerts
- 集群概览 — 验证所有节点处于存活状态,检查节点数量
- 指标监控 — CPU利用率、QPS、P99延迟、存储利用率
- 告警信息 — 检查是否有活跃告警
SQL Checks
SQL检查项
sql
-- Node liveness (nodes are visible on Advanced)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- Version consistency
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;
-- Range health
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;sql
-- 节点存活状态(Advanced集群可见节点)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- 版本一致性
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;
-- 数据范围健康状态
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;Cloud API
Cloud API
bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'BYOC Health Check
BYOC集群健康检查
Applies when: Tier = BYOC
BYOC clusters are dedicated and run in your cloud account. You have the same CockroachDB visibility as Advanced, plus direct access to the underlying infrastructure.
适用场景: 层级 = BYOC
BYOC集群是专用集群,运行在您的云账户中。您拥有与Advanced集群相同的CockroachDB可见性,还可直接访问底层基础设施。
CockroachDB Health
CockroachDB健康检查
Run all Advanced Health Check steps.
执行所有Advanced集群健康检查步骤。
Cloud Provider Infrastructure Checks
云服务商基础设施检查
If AWS:
bash
aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"If GCP:
bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"If Azure:
bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"若为AWS:
bash
aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"若为GCP:
bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"若为Azure:
bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"Additional BYOC Checks
BYOC补充检查项
- Verify VPC/network connectivity (PrivateLink, PSC, VPC Peering)
- Check IAM roles — CRL service account permissions still valid
- Review cloud provider monitoring for infrastructure-level anomalies
- 验证VPC/网络连通性(PrivateLink、PSC、VPC对等连接)
- 检查IAM角色 — CRL服务账号权限是否仍然有效
- 查看云服务商监控,排查基础设施层面的异常
Standard Health Check
Standard集群健康检查
Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes to monitor — Cockroach Labs manages all infrastructure, replication, and capacity. Health checking focuses on your workload performance and provisioned compute.
适用场景: 层级 = Standard
Standard是多租户托管服务,无需监控单个节点 — Cockroach Labs管理所有基础设施、副本和容量。健康检查重点关注工作负载性能和已配置的计算资源。
Cloud Console Checks
Cloud Console检查项
- Cluster Overview — verify cluster state is
RUNNING - SQL Activity — statement and transaction latency, error rates
- Storage — current usage
- Compute — provisioned vCPU utilization
- 集群概览 — 验证集群状态为
RUNNING - SQL活动 — 语句与事务延迟、错误率
- 存储 — 当前使用量
- 计算资源 — 已配置vCPU的利用率
SQL Checks
SQL检查项
sql
-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';sql
-- 验证连通性
SELECT 1;
-- 当前版本
SELECT version();
-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';What to Monitor
监控重点
- P99 SQL latency — track via Cloud Console Metrics
- Error rates — check for spikes in statement errors
- Storage growth — plan based on usage trends
- Compute utilization — increase provisioned vCPUs if utilization is consistently high
Note: Node-level system tables (, , etc.) are not available on Standard. Use Cloud Console for all infrastructure health monitoring.
crdb_internal.gossip_nodeskv_store_status- P99 SQL延迟 — 通过Cloud Console指标跟踪
- 错误率 — 检查语句错误是否出现峰值
- 存储增长 — 根据使用趋势规划扩容
- 计算资源利用率 — 若利用率持续偏高,增加已配置vCPU数量
注意: Standard集群无法访问节点级系统表(、等)。基础设施健康监控请使用Cloud Console。
crdb_internal.gossip_nodeskv_store_statusBasic Health Check
Basic集群健康检查
Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to monitor. Cockroach Labs manages all infrastructure. Health checking focuses on connectivity, consumption, and spending.
适用场景: 层级 = Basic
Basic是无服务器自动扩缩容服务,无需监控节点或已配置计算资源。Cockroach Labs管理所有基础设施。健康检查重点关注连通性、资源消耗和支出情况。
Cloud Console Checks
Cloud Console检查项
- Cluster Overview — verify state is
RUNNING - Request Units — consumption rate and remaining budget
- Storage — current usage (10 GiB included free)
- Spending Limits — verify limits are configured to avoid unexpected charges
- 集群概览 — 验证状态为
RUNNING - 请求单元(Request Units) — 消耗速率与剩余预算
- 存储 — 当前使用量(包含10 GiB免费额度)
- 支出限额 — 验证已配置限额,避免意外收费
SQL Checks
SQL检查项
sql
-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';sql
-- 验证连通性
SELECT 1;
-- 当前版本
SELECT version();
-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';What to Monitor
监控重点
- Request Unit (RU) consumption — track via Cloud Console to stay within spending limits
- Storage usage — monitor growth relative to the 10 GiB free tier
- Query efficiency — optimize queries that consume excessive RUs
- Cold start latency — Basic clusters may scale to zero during inactivity; first connection after idle may have higher latency
- 请求单元(RU)消耗 — 通过Cloud Console跟踪,确保在支出限额内
- 存储使用量 — 监控增长情况,对比10 GiB免费额度
- 查询效率 — 优化消耗过多RU的查询
- 冷启动延迟 — Basic集群在闲置期间可能缩容至0;闲置后的首次连接可能延迟较高
Safety Considerations
注意事项
All queries in this skill are read-only. No data is modified.
- Self-Hosted: can be slow on large clusters — consider using
crdb_internal.ranges_no_leasesLIMIT - Advanced/BYOC: Some system tables may have restricted access depending on SQL user role
- Standard/Basic: Node-level system tables are not available — this is expected, not an error
本技能中的所有查询均为只读操作,不会修改任何数据。
- Self-Hosted: 在大型集群上可能运行缓慢 — 考虑使用
crdb_internal.ranges_no_leases限制结果LIMIT - Advanced/BYOC: 部分系统表可能因SQL用户角色限制无法访问
- Standard/Basic: 无法访问节点级系统表属于正常情况,并非错误
Troubleshooting
故障排查
| Issue | Tier | Fix |
|---|---|---|
| SH | Grant admin or VIEWCLUSTERMETADATA |
| STD/BAS | Expected — use Cloud Console |
| Node missing from gossip_nodes | SH | Check node process; verify --join address |
| Cloud Console shows degraded | ADV/BYOC | Check Cloud status page; contact support |
| High RU consumption | BAS | Profile queries; set spending limits |
| Cloud API returns 401 | ADV/BYOC | Regenerate API key |
| High latency on first connection | BAS | Expected cold start after idle period |
| 问题 | 层级 | 解决方法 |
|---|---|---|
| SH | 授予管理员权限或VIEWCLUSTERMETADATA权限 |
找不到 | STD/BAS | 正常情况 — 使用Cloud Console |
| 节点未出现在gossip_nodes中 | SH | 检查节点进程;验证--join地址 |
| Cloud Console显示集群状态降级 | ADV/BYOC | 查看云状态页面;联系支持团队 |
| RU消耗过高 | BAS | 分析查询性能;设置支出限额 |
| Cloud API返回401 | ADV/BYOC | 重新生成API密钥 |
| 首次连接延迟高 | BAS | 闲置后的冷启动属于正常情况 |
References
参考资料
Skill references:
- Production readiness checklist
Related skills:
- upgrading-cluster-version
- managing-cluster-capacity
- performing-cluster-maintenance
- monitoring-background-jobs
Official CockroachDB Documentation:
技能参考:
- 生产就绪检查清单
相关技能:
- upgrading-cluster-version
- managing-cluster-capacity
- performing-cluster-maintenance
- monitoring-background-jobs
官方CockroachDB文档: