analyzing-schema-change-storage-risk

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Analyzing Schema Change Storage Risk

分析Schema变更存储风险

Estimates the storage headroom needed to safely run online schema changes. Mirrors the official guidance: some operations may temporarily require up to 3× the size of the affected table or index while the schema change is in flight.
For ongoing range-distribution monitoring, see analyzing-range-distribution.
估算安全执行在线Schema变更所需的存储余量。与官方指南一致:部分操作在Schema变更执行期间,可能临时需要最多3倍于受影响表或索引的大小的存储空间。
如需持续监控范围分布,请查看analyzing-range-distribution

When to Use This Skill

何时使用该技能

Run a quick estimate before issuing any of these operations on a table whose indexes are large (multiple GB per index, or many ranges per index):
  • CREATE INDEX
  • ADD COLUMN
    with
    INDEX
    or
    UNIQUE
  • ALTER PRIMARY KEY
  • CREATE MATERIALIZED VIEW
  • CREATE TABLE AS
  • REFRESH MATERIALIZED VIEW
  • ALTER TABLE ... SET LOCALITY
    (when the locality change rewrites data)
Tables whose indexes are small (kilobytes to a few megabytes) carry trivial storage risk; estimation is unnecessary.
当对索引较大的表(单索引占用数GB空间,或每个索引包含多个范围)执行以下任一操作前,先快速估算存储需求:
  • CREATE INDEX
  • INDEX
    UNIQUE
    ADD COLUMN
  • ALTER PRIMARY KEY
  • CREATE MATERIALIZED VIEW
  • CREATE TABLE AS
  • REFRESH MATERIALIZED VIEW
  • ALTER TABLE ... SET LOCALITY
    (当 locality 变更需要重写数据时)
索引较小的表(从千字节到几兆字节)存储风险极低,无需估算。

Background

背景知识

How much temporary space does a backfill actually need?

回填实际需要多少临时空间?

The honest answer depends on the operation:
  • CREATE INDEX
    /
    ADD COLUMN ... UNIQUE
    : needs roughly 1× the size of the new index — the indexed columns plus the primary key columns, written into a fresh index span. This is typically a small fraction of the table. Worst-case headroom is bounded by the size of that one index.
  • ALTER PRIMARY KEY
    : rewrites the primary index and any secondary indexes whose definitions depend on the old PK. Old data sticks around until GC, so peak on-disk usage during the change can approach the size of the table again.
  • All bulk-ingest backfills: extra MVCC versions and pre-compaction SSTs add overhead until Pebble compacts and GC runs.
The official docs round these up into a single conservative recommendation: plan for up to 3× the size of the affected table or index as free space. That figure is a safety bound, not a precise prediction. For most
CREATE INDEX
operations the real cost is much smaller; for
ALTER PRIMARY KEY
on a large table it is the right ballpark.
准确答案取决于具体操作:
  • CREATE INDEX
    /
    ADD COLUMN ... UNIQUE
    :大约需要1倍于新索引的大小——即索引列加上主键列,写入新的索引区间。这通常仅占表的一小部分。最坏情况下所需的余量以该单个索引的大小为上限。
  • ALTER PRIMARY KEY
    :会重写主索引以及所有依赖旧主键定义的二级索引。旧数据会保留直到垃圾回收(GC)完成,因此变更期间磁盘峰值使用量可能接近表的原始大小。
  • 所有批量导入回填操作:额外的MVCC版本和预压缩SST文件会增加开销,直到Pebble完成压缩并执行GC。
官方文档将这些情况汇总为一个保守的建议:预留最多3倍于受影响表或索引大小的可用空间。该数值是安全边界,而非精确预测。对于大多数
CREATE INDEX
操作,实际成本要小得多;而对于大型表的
ALTER PRIMARY KEY
操作,该数值则是合理的估算范围。

What happens if the cluster runs out of disk mid-backfill?

如果集群在回填过程中磁盘耗尽会发生什么?

Backfills bulk-ingest data via
AddSSTable
, which checks the per-store remaining capacity before each ingestion. If the remaining fraction falls below
kv.bulk_io_write.min_capacity_remaining_fraction
(default
0.05
, i.e. 5%), the ingest is rejected with
InsufficientSpaceError
. Both the legacy and declarative schema changers translate that error into a job pause request, so the schema change halts rather than wedging the cluster. To resume, free space (e.g. drop unused indexes, expand storage) and resume the paused job.
This is a reactive safety net, not a planning tool — by the time it fires, foreground writes on the affected store may already be unhealthy.
回填通过
AddSSTable
批量导入数据,每次导入前会检查每个存储节点的剩余容量。如果剩余容量比例低于
kv.bulk_io_write.min_capacity_remaining_fraction
(默认值为
0.05
,即5%),导入会被拒绝并返回
InsufficientSpaceError
。旧版和声明式schema变更器都会将该错误转换为作业暂停请求,因此schema变更会停止而非导致集群故障。如需恢复,需释放空间(例如删除未使用的索引、扩容存储)并恢复暂停的作业。
这是一个被动的安全保障,而非规划工具——当触发该机制时,受影响存储节点上的前台写入可能已经出现异常。

Estimating Capacity

容量估算步骤

Step 1 — Check free space per store

步骤1 — 检查每个存储节点的可用空间

The minimum free space across stores is what bounds the schema change, not the total cluster free space (replicas are spread across nodes).
sql
SELECT
  node_id,
  store_id,
  ROUND((capacity - used) / 1073741824.0, 2) AS free_gb,
  ROUND((used::FLOAT / capacity) * 100, 2)   AS used_pct
FROM crdb_internal.kv_store_status
ORDER BY free_gb ASC;
集群中所有存储节点的最小可用空间才是schema变更的限制因素,而非集群总可用空间(副本分布在不同节点上)。
sql
SELECT
  node_id,
  store_id,
  ROUND((capacity - used) / 1073741824.0, 2) AS free_gb,
  ROUND((used::FLOAT / capacity) * 100, 2)   AS used_pct
FROM crdb_internal.kv_store_status
ORDER BY free_gb ASC;

Step 2 — Estimate the affected table/index size

步骤2 — 估算受影响的表/索引大小

Use the docs-recommended form of
SHOW RANGES
:
sql
SHOW RANGES FROM TABLE <table> WITH DETAILS, KEYS, INDEXES;
The output includes one row per range, with
range_size_mb
and
index_name
. Aggregate by index for the per-index totals that matter for capacity planning:
sql
WITH r AS (SHOW RANGES FROM TABLE <table> WITH DETAILS, KEYS, INDEXES)
SELECT
  index_name,
  COUNT(*)                              AS range_count,
  ROUND(SUM(range_size_mb), 2)          AS index_size_mb,
  ROUND(SUM(range_size_mb) / 1024, 2)   AS index_size_gb
FROM r
GROUP BY index_name
ORDER BY index_size_mb DESC;
使用文档推荐的
SHOW RANGES
格式:
sql
SHOW RANGES FROM TABLE <table> WITH DETAILS, KEYS, INDEXES;
输出包含每个范围的一行数据,其中包含
range_size_mb
index_name
。按索引聚合得到每个索引的总大小,这对容量规划至关重要:
sql
WITH r AS (SHOW RANGES FROM TABLE <table> WITH DETAILS, KEYS, INDEXES)
SELECT
  index_name,
  COUNT(*)                              AS range_count,
  ROUND(SUM(range_size_mb), 2)          AS index_size_mb,
  ROUND(SUM(range_size_mb) / 1024, 2)   AS index_size_gb
FROM r
GROUP BY index_name
ORDER BY index_size_mb DESC;

Step 3 — Compare against the operation

步骤3 — 根据操作类型对比

OperationConservative free-space target (per store)
CREATE INDEX
/
ADD COLUMN ... UNIQUE
Up to 3× the size of the new index (its indexed + PK columns).
ALTER PRIMARY KEY
Up to 3× the size of the table (sum of the relevant indexes from step 2).
CREATE MATERIALIZED VIEW
/
CREATE TABLE AS
Up to 3× the expected size of the materialized result.
The new index does not exist yet, so estimate it from a comparable existing index (e.g. one on similarly typed columns) or from the source columns' contribution to the primary index.
If the smallest free-space figure from step 1 is well above the target, the operation is safe to run. If it is close, free space first (drop unused indexes, expand storage) before issuing the DDL.
操作类型保守可用空间目标(每个存储节点)
CREATE INDEX
/
ADD COLUMN ... UNIQUE
最多为索引大小的3倍(其索引列+主键列的大小)。
ALTER PRIMARY KEY
最多为大小的3倍(步骤2中相关索引的总和)。
CREATE MATERIALIZED VIEW
/
CREATE TABLE AS
最多为物化结果预期大小的3倍。
由于新索引尚未存在,可从类似的现有索引(例如具有相似类型列的索引)或源列在主索引中的占比来估算其大小。
如果步骤1中得到的最小可用空间远高于目标值,则操作可以安全执行。如果接近目标值,请先释放空间(删除未使用的索引、扩容存储),再执行DDL语句。

Operational Notes

操作注意事项

  • SHOW RANGES ... WITH DETAILS
    is expensive.
    It computes span statistics on demand. Always target a specific table, never run it cluster-wide, and prefer maintenance windows on tables with thousands of ranges.
  • Watch the job, not just disk. If a backfill pauses with
    InsufficientSpaceError
    , free disk on the affected store and resume the paused schema change job. Check with:
    sql
    SELECT job_id, status, error
    FROM crdb_internal.jobs
    WHERE job_type = 'SCHEMA CHANGE' AND status = 'paused';
  • Drop unused indexes first. Often the cheapest way to free headroom before a large backfill is to drop indexes that
    crdb_internal.index_usage_statistics
    shows are unused.
  • Statistics lag.
    range_size_mb
    is approximate and can lag actual disk usage; treat estimates as conservative ballparks, not exact figures.
  • SHOW RANGES ... WITH DETAILS
    开销较大
    。它会按需计算区间统计信息。请始终针对特定表执行,切勿在集群范围内运行,对于包含数千个范围的表,建议在维护窗口执行。
  • 不仅要监控磁盘,还要监控作业。如果回填因
    InsufficientSpaceError
    暂停,请在受影响的存储节点上释放磁盘空间,然后恢复暂停的schema变更作业。可通过以下查询检查:
    sql
    SELECT job_id, status, error
    FROM crdb_internal.jobs
    WHERE job_type = 'SCHEMA CHANGE' AND status = 'paused';
  • 先删除未使用的索引。在大型回填前释放余量最经济的方法通常是删除
    crdb_internal.index_usage_statistics
    显示未使用的索引。
  • 统计数据存在延迟
    range_size_mb
    是近似值,可能滞后于实际磁盘使用情况;请将估算值视为保守的大致范围,而非精确数值。

References

参考资料

Related Skills

相关技能

  • analyzing-range-distribution — range count, leaseholder placement, fragmentation
  • analyzing-range-distribution — 范围计数、租约持有者放置、碎片分析