reviewing-cluster-health

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Reviewing Cluster Health

集群健康检查综述

Performs a comprehensive health check of a CockroachDB cluster. Before running diagnostics, this skill gathers deployment context to provide the right queries and tools for the operator's tier.
对CockroachDB集群执行全面的健康检查。在运行诊断之前,该技能会先收集部署上下文,以便为运维人员提供符合对应层级的查询语句和工具。

When to Use This Skill

适用场景

  • Daily or shift-start operational health checks
  • Before starting maintenance (Self-Hosted, Advanced, BYOC)
  • After incidents to confirm recovery
  • Verifying production readiness
  • Monitoring capacity and performance
For live query issues: Use triaging-live-sql-activity. For background jobs: Use monitoring-background-jobs. For range analysis: Use analyzing-range-distribution.

  • 日常或换班时的运行状态健康检查
  • 开始维护前(Self-Hosted、Advanced、BYOC环境)
  • 事件发生后确认恢复情况
  • 验证生产环境就绪状态
  • 监控容量与性能
针对实时查询问题: 使用triaging-live-sql-activity针对后台任务: 使用monitoring-background-jobs针对范围分析: 使用analyzing-range-distribution

Step 1: Gather Context

步骤1:收集上下文

Required Context

必填上下文

QuestionOptionsWhy It Matters
Deployment tier?Self-Hosted, Advanced, BYOC, Standard, BasicDetermines available diagnostics and operator responsibilities
Reason for health check?Daily check, Pre-maintenance, Post-incident, Pre-upgradePrioritizes which dimensions to check first
问题选项重要性说明
部署层级?Self-Hosted、Advanced、BYOC、Standard、Basic决定可用的诊断工具及运维人员职责
健康检查原因?日常检查、维护前、事件后、升级前确定优先检查的维度

Additional Context (by tier)

各层级补充上下文

If Self-Hosted:
QuestionOptionsWhy It Matters
Access available?SQL + CLI, SQL onlyDetermines which tools can be used
Cloud provider?AWS, GCP, Azure, On-PremisesAffects infrastructure-level checks
Kubernetes deployment?Yes (Operator, Helm, manual), NoChanges CLI commands and monitoring
Node count and regions?e.g., 9 nodes, 3 regionsSets expectations for query results
If Advanced or BYOC:
QuestionOptionsWhy It Matters
Cloud provider? (BYOC only)AWS, GCP, AzureFor infrastructure-level monitoring in your cloud account
If Standard:
QuestionOptionsWhy It Matters
Current provisioned vCPUs?NumberContext for compute utilization assessment
If Basic: No additional context needed.
若为Self-Hosted:
问题选项重要性说明
可用访问方式?SQL + CLI、仅SQL决定可使用的工具
云服务商?AWS、GCP、Azure、本地部署影响基础设施级别的检查
是否基于Kubernetes部署? 是(Operator、Helm、手动)、否会改变CLI命令与监控方式
节点数量和区域? 例如:9节点,3个区域为查询结果设定预期
若为Advanced或BYOC:
问题选项重要性说明
云服务商?(仅BYOC)AWS、GCP、Azure用于在您的云账户中监控基础设施
若为Standard:
问题选项重要性说明
当前已配置的vCPU数量?具体数值为计算资源利用率评估提供上下文
若为Basic: 无需补充上下文。

Context-Driven Routing

基于上下文的路由

Self-Hosted Health Check

Self-Hosted集群健康检查

Applies when: Tier = Self-Hosted
适用场景: 层级 = Self-Hosted

Query 1: Node Liveness

查询1:节点存活状态

sql
SELECT
  n.node_id, n.address, n.build_tag AS version, n.locality,
  n.is_live, l.epoch,
  CASE WHEN n.is_live THEN 'HEALTHY'
       WHEN n.is_live IS NULL THEN 'UNKNOWN'
       ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;
  • Any
    is_live = false
    (from
    gossip_nodes
    ) requires immediate investigation
  • High
    epoch
    suggests repeated restarts (node flapping)
If CLI is available:
bash
cockroach node status --certs-dir=<certs-dir> --host=<node-address>
sql
SELECT
  n.node_id, n.address, n.build_tag AS version, n.locality,
  n.is_live, l.epoch,
  CASE WHEN n.is_live THEN 'HEALTHY'
       WHEN n.is_live IS NULL THEN 'UNKNOWN'
       ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;
  • 任何
    is_live = false
    (来自
    gossip_nodes
    )的情况都需要立即排查
  • 较高的
    epoch
    值表明节点反复重启(节点波动)
若CLI可用:
bash
cockroach node status --certs-dir=<certs-dir> --host=<node-address>

Query 2: Version Consistency

查询2:版本一致性

sql
SELECT build_tag AS version, COUNT(*) AS node_count,
  array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;
  • Single row = healthy. Two rows = acceptable during rolling upgrade. Three+ = investigate.
sql
SELECT build_tag AS version, COUNT(*) AS node_count,
  array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;
  • 单行结果 = 健康状态。两行结果 = 滚动升级期间属于正常情况。三行及以上 = 需要排查。

Query 3: Storage Capacity

查询3:存储容量

sql
SELECT node_id, store_id,
  ROUND(capacity / 1073741824.0, 2) AS total_gb,
  ROUND(available / 1073741824.0, 2) AS available_gb,
  ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
  CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
       WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
       ELSE 'OK' END AS capacity_status,
  range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;
sql
SELECT node_id, store_id,
  ROUND(capacity / 1073741824.0, 2) AS total_gb,
  ROUND(available / 1073741824.0, 2) AS available_gb,
  ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
  CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
       WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
       ELSE 'OK' END AS capacity_status,
  range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;

Query 4: Range Health

查询4:数据范围健康状态

sql
SELECT
  CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
       WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
       WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
       ELSE 'unknown' END AS replication_status,
  COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;
sql
SELECT
  CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
       WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
       WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
       ELSE 'unknown' END AS replication_status,
  COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;

Query 5: Certificate Expiration

查询5:证书过期情况

sql
SELECT node_id,
  to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
  to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
  CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
            < now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
       ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;
sql
SELECT node_id,
  to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
  to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
  CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
            < now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
       ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;

Query 6: Critical Settings

查询6:关键配置

sql
SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
  'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
  'server.time_until_store_dead', 'admission.kv.enabled',
  'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;
sql
SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
  'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
  'server.time_until_store_dead', 'admission.kv.enabled',
  'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;

Query 7: Consolidated Summary

查询7:综合汇总

sql
SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
  ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];
If reason = Pre-maintenance, also check for running jobs:
sql
WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;
sql
SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
  ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];
若检查原因为维护前,还需检查运行中的任务:
sql
WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;

Query 8: Production Readiness Assessment

查询8:生产就绪状态评估

Use when verifying a cluster is ready for production workloads or during periodic operational reviews.
sql
-- Node count and replication (minimum 3 nodes for production)
SELECT COUNT(*) AS total_nodes,
  COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
  COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);

-- Critical production settings check
SELECT variable, value,
  CASE
    WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
    WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
    WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
    WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
    ELSE 'OK'
  END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
  'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
  'admission.kv.enabled', 'cluster.preserve_downgrade_option',
  'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;

-- Enterprise license status (Self-Hosted only)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];
See production-readiness reference for the full production readiness checklist.

用于验证集群是否准备好承接生产工作负载,或定期运行状态评审。
sql
-- 节点数量与副本情况(生产环境至少需要3个节点)
SELECT COUNT(*) AS total_nodes,
  COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
  COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);

-- 关键生产配置检查
SELECT variable, value,
  CASE
    WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
    WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
    WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
    WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
    WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
    ELSE 'OK'
  END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
  'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
  'admission.kv.enabled', 'cluster.preserve_downgrade_option',
  'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;

-- 企业版许可证状态(仅Self-Hosted)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];
完整的生产就绪检查清单请参阅production-readiness reference

Advanced Health Check

Advanced集群健康检查

Applies when: Tier = Advanced
Advanced clusters are dedicated single-tenant clusters managed by Cockroach Labs. You have node-level visibility via both Cloud Console and SQL.
适用场景: 层级 = Advanced
Advanced集群是由Cockroach Labs管理的专用单租户集群。您可通过Cloud Console和SQL查看节点级别的状态。

Cloud Console Checks

Cloud Console检查项

  1. Cluster Overview — verify all nodes are live, check node count
  2. Metrics — CPU utilization, QPS, P99 latency, storage utilization
  3. Alerts — check for active alerts
  1. 集群概览 — 验证所有节点处于存活状态,检查节点数量
  2. 指标监控 — CPU利用率、QPS、P99延迟、存储利用率
  3. 告警信息 — 检查是否有活跃告警

SQL Checks

SQL检查项

sql
-- Node liveness (nodes are visible on Advanced)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- Version consistency
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;

-- Range health
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;
sql
-- 节点存活状态(Advanced集群可见节点)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- 版本一致性
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;

-- 数据范围健康状态
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;

Cloud API

Cloud API

bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'

bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'

BYOC Health Check

BYOC集群健康检查

Applies when: Tier = BYOC
BYOC clusters are dedicated and run in your cloud account. You have the same CockroachDB visibility as Advanced, plus direct access to the underlying infrastructure.
适用场景: 层级 = BYOC
BYOC集群是专用集群,运行在您的云账户中。您拥有与Advanced集群相同的CockroachDB可见性,还可直接访问底层基础设施。

CockroachDB Health

CockroachDB健康检查

Run all Advanced Health Check steps.
执行所有Advanced集群健康检查步骤。

Cloud Provider Infrastructure Checks

云服务商基础设施检查

If AWS:
bash
aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"
If GCP:
bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
If Azure:
bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"
若为AWS:
bash
aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"
若为GCP:
bash
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
若为Azure:
bash
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"

Additional BYOC Checks

BYOC补充检查项

  • Verify VPC/network connectivity (PrivateLink, PSC, VPC Peering)
  • Check IAM roles — CRL service account permissions still valid
  • Review cloud provider monitoring for infrastructure-level anomalies

  • 验证VPC/网络连通性(PrivateLink、PSC、VPC对等连接)
  • 检查IAM角色 — CRL服务账号权限是否仍然有效
  • 查看云服务商监控,排查基础设施层面的异常

Standard Health Check

Standard集群健康检查

Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes to monitor — Cockroach Labs manages all infrastructure, replication, and capacity. Health checking focuses on your workload performance and provisioned compute.
适用场景: 层级 = Standard
Standard是多租户托管服务,无需监控单个节点 — Cockroach Labs管理所有基础设施、副本和容量。健康检查重点关注工作负载性能和已配置的计算资源。

Cloud Console Checks

Cloud Console检查项

  1. Cluster Overview — verify cluster state is
    RUNNING
  2. SQL Activity — statement and transaction latency, error rates
  3. Storage — current usage
  4. Compute — provisioned vCPU utilization
  1. 集群概览 — 验证集群状态为
    RUNNING
  2. SQL活动 — 语句与事务延迟、错误率
  3. 存储 — 当前使用量
  4. 计算资源 — 已配置vCPU的利用率

SQL Checks

SQL检查项

sql
-- Verify connectivity
SELECT 1;

-- Current version
SELECT version();

-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';
sql
-- 验证连通性
SELECT 1;

-- 当前版本
SELECT version();

-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';

What to Monitor

监控重点

  • P99 SQL latency — track via Cloud Console Metrics
  • Error rates — check for spikes in statement errors
  • Storage growth — plan based on usage trends
  • Compute utilization — increase provisioned vCPUs if utilization is consistently high
Note: Node-level system tables (
crdb_internal.gossip_nodes
,
kv_store_status
, etc.) are not available on Standard. Use Cloud Console for all infrastructure health monitoring.

  • P99 SQL延迟 — 通过Cloud Console指标跟踪
  • 错误率 — 检查语句错误是否出现峰值
  • 存储增长 — 根据使用趋势规划扩容
  • 计算资源利用率 — 若利用率持续偏高,增加已配置vCPU数量
注意: Standard集群无法访问节点级系统表(
crdb_internal.gossip_nodes
kv_store_status
等)。基础设施健康监控请使用Cloud Console。

Basic Health Check

Basic集群健康检查

Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to monitor. Cockroach Labs manages all infrastructure. Health checking focuses on connectivity, consumption, and spending.
适用场景: 层级 = Basic
Basic是无服务器自动扩缩容服务,无需监控节点或已配置计算资源。Cockroach Labs管理所有基础设施。健康检查重点关注连通性、资源消耗和支出情况。

Cloud Console Checks

Cloud Console检查项

  1. Cluster Overview — verify state is
    RUNNING
  2. Request Units — consumption rate and remaining budget
  3. Storage — current usage (10 GiB included free)
  4. Spending Limits — verify limits are configured to avoid unexpected charges
  1. 集群概览 — 验证状态为
    RUNNING
  2. 请求单元(Request Units) — 消耗速率与剩余预算
  3. 存储 — 当前使用量(包含10 GiB免费额度)
  4. 支出限额 — 验证已配置限额,避免意外收费

SQL Checks

SQL检查项

sql
-- Verify connectivity
SELECT 1;

-- Current version
SELECT version();

-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';
sql
-- 验证连通性
SELECT 1;

-- 当前版本
SELECT version();

-- 近期失败的任务
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';

What to Monitor

监控重点

  • Request Unit (RU) consumption — track via Cloud Console to stay within spending limits
  • Storage usage — monitor growth relative to the 10 GiB free tier
  • Query efficiency — optimize queries that consume excessive RUs
  • Cold start latency — Basic clusters may scale to zero during inactivity; first connection after idle may have higher latency

  • 请求单元(RU)消耗 — 通过Cloud Console跟踪,确保在支出限额内
  • 存储使用量 — 监控增长情况,对比10 GiB免费额度
  • 查询效率 — 优化消耗过多RU的查询
  • 冷启动延迟 — Basic集群在闲置期间可能缩容至0;闲置后的首次连接可能延迟较高

Safety Considerations

注意事项

All queries in this skill are read-only. No data is modified.
  • Self-Hosted:
    crdb_internal.ranges_no_leases
    can be slow on large clusters — consider using
    LIMIT
  • Advanced/BYOC: Some system tables may have restricted access depending on SQL user role
  • Standard/Basic: Node-level system tables are not available — this is expected, not an error
本技能中的所有查询均为只读操作,不会修改任何数据。
  • Self-Hosted:
    crdb_internal.ranges_no_leases
    在大型集群上可能运行缓慢 — 考虑使用
    LIMIT
    限制结果
  • Advanced/BYOC: 部分系统表可能因SQL用户角色限制无法访问
  • Standard/Basic: 无法访问节点级系统表属于正常情况,并非错误

Troubleshooting

故障排查

IssueTierFix
crdb_internal.kv_node_status
empty
SHGrant admin or VIEWCLUSTERMETADATA
crdb_internal
table not found
STD/BASExpected — use Cloud Console
Node missing from gossip_nodesSHCheck node process; verify --join address
Cloud Console shows degradedADV/BYOCCheck Cloud status page; contact support
High RU consumptionBASProfile queries; set spending limits
Cloud API returns 401ADV/BYOCRegenerate API key
High latency on first connectionBASExpected cold start after idle period
问题层级解决方法
crdb_internal.kv_node_status
为空
SH授予管理员权限或VIEWCLUSTERMETADATA权限
找不到
crdb_internal
STD/BAS正常情况 — 使用Cloud Console
节点未出现在gossip_nodes中SH检查节点进程;验证--join地址
Cloud Console显示集群状态降级ADV/BYOC查看云状态页面;联系支持团队
RU消耗过高BAS分析查询性能;设置支出限额
Cloud API返回401ADV/BYOC重新生成API密钥
首次连接延迟高BAS闲置后的冷启动属于正常情况

References

参考资料

Skill references:
  • Production readiness checklist
Related skills:
  • upgrading-cluster-version
  • managing-cluster-capacity
  • performing-cluster-maintenance
  • monitoring-background-jobs
Official CockroachDB Documentation:
技能参考:
  • 生产就绪检查清单
相关技能:
  • upgrading-cluster-version
  • managing-cluster-capacity
  • performing-cluster-maintenance
  • monitoring-background-jobs
官方CockroachDB文档: