auditing-table-statistics

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Auditing Table Statistics

表统计信息审计

Audits optimizer table statistics for staleness, missing column coverage, and row count drift to diagnose poor query performance caused by outdated or incomplete statistics. Uses

SHOW STATISTICS

for read-only SQL analysis of table-level and column-level statistics freshness, entirely without requiring DB Console access.

Complement to profiling-statement-fingerprints: This skill diagnoses optimizer statistics issues; for identifying historically slow queries, see profiling-statement-fingerprints.

检查优化器表统计信息的过期情况、列覆盖缺失以及行数偏差，以诊断由过期或不完整统计信息导致的查询性能低下问题。使用

SHOW STATISTICS

进行只读SQL分析，查看表级和列级统计信息的新鲜度，完全无需访问DB Console。

与profiling-statement-fingerprints互补： 本技能用于诊断优化器统计信息问题；如需识别历史慢查询，请查看profiling-statement-fingerprints。

When to Use This Skill

何时使用本技能

Query performance degrades after bulk INSERT, UPDATE, or DELETE operations
EXPLAIN plans show unexpected full table scans or suboptimal join orders
Plan instability: same query produces different execution plans over time
After schema changes: ADD COLUMN, DROP COLUMN, or CREATE INDEX operations
Tables experience >20-30% row count changes without statistics refresh
SQL-only diagnostics needed without DB Console access
Validating automatic statistics collection is working correctly

For historical query analysis: Use profiling-statement-fingerprints to identify slow statement patterns. For live query triage: Use triaging-live-sql-activity for immediate incident response.

批量执行INSERT、UPDATE或DELETE操作后查询性能下降
EXPLAIN计划显示意外的全表扫描或非最优连接顺序
计划不稳定：同一查询在不同时间生成不同的执行计划
模式变更后：执行ADD COLUMN、DROP COLUMN或CREATE INDEX操作
表的行数变化超过20-30%但未刷新统计信息
仅需SQL诊断，无需访问DB Console
验证自动统计信息收集功能是否正常工作

历史查询分析： 使用profiling-statement-fingerprints识别慢语句模式。 实时查询分类： 使用triaging-live-sql-activity进行即时事件响应。

Prerequisites

前提条件

SQL connection to CockroachDB cluster
Privilege requirement: Any privilege on target tables (SELECT, INSERT, UPDATE, DELETE, or admin role)
- Much less restrictive than VIEWACTIVITY: any table access grants statistics visibility for that table
Automatic statistics collection enabled (default):
```
sql.stats.automatic_collection.enabled = true
```

Check automatic collection status:

sql

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- Should return: true

Verify table access:

sql

SHOW GRANTS ON TABLE database_name.table_name;

与CockroachDB集群的SQL连接
权限要求： 目标表的任意权限（SELECT、INSERT、UPDATE、DELETE或管理员角色）
- 比VIEWACTIVITY权限限制更少：拥有任意表访问权限即可查看该表的统计信息
已启用自动统计信息收集（默认设置）：
```
sql.stats.automatic_collection.enabled = true
```

检查自动收集状态：

sql

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- 应返回：true

验证表访问权限：

sql

SHOW GRANTS ON TABLE database_name.table_name;

Core Concepts

核心概念

What Are Table Statistics?

什么是表统计信息？

Table statistics provide the optimizer with data distribution information to estimate query costs:

Statistic Type	Description	Impact on Optimizer
row_count	Total rows in table	Cardinality estimates for full scans
distinct_count	Unique values per column	Selectivity estimates for WHERE/JOIN predicates
null_count	NULL values per column	IS NULL / IS NOT NULL predicate costs
histogram	Value distribution buckets	Range scan selectivity (e.g., `WHERE age BETWEEN 20 AND 30` )

Multi-column statistics capture correlation between columns (e.g., city + state + zip) for more accurate multi-predicate estimates.

表统计信息为优化器提供数据分布信息，用于估算查询成本：

统计信息类型	描述	对优化器的影响
row_count	表中的总行数	全表扫描的基数估算
distinct_count	每列的唯一值数量	WHERE/JOIN谓词的选择性估算
null_count	每列的NULL值数量	IS NULL / IS NOT NULL谓词的成本估算
histogram	值分布桶	范围扫描的选择性（例如 `WHERE age BETWEEN 20 AND 30` ）

多列统计信息捕获列之间的相关性（例如city + state + zip），以更准确地估算多谓词查询的成本。

Statistics Lifecycle

统计信息生命周期

Automatic collection (default):

Triggered when row count changes by ~20% since last statistics collection
Runs as background job (non-blocking, but consumes resources)
May be delayed for very large tables (>10M rows)

Manual collection:

Explicit
```
CREATE STATISTICS
```
command for immediate refresh
Required for multi-column statistics (automatic collection only creates single-column stats)
Recommended after bulk data loads or significant schema changes

自动收集（默认）：

自上次统计信息收集以来行数变化约20%时触发
作为后台作业运行（非阻塞，但会消耗资源）
对于超大型表（>10M行）可能会延迟

手动收集：

使用显式
```
CREATE STATISTICS
```
命令即时刷新
多列统计信息需要手动创建（自动收集仅创建单列统计信息）
建议在批量数据加载或重大模式变更后执行

Staleness Indicators

过期指标

Indicator	Definition	Recommended Action
Age	Time since last statistics collection	Refresh if >7 days (OLTP) or >30 days (OLAP)
Row count drift	Percent difference between current and cached row_count	Refresh if >20-30% drift detected
Missing columns	Columns without statistics	CREATE STATISTICS for frequently queried columns
Missing histograms	Columns without distribution data	Automatic collection handles; may need manual refresh

See references/statistics-thresholds.md for workload-specific threshold guidance.

指标	定义	建议操作
Age	自上次统计信息收集以来的时间	如果超过7天（OLTP）或30天（OLAP）则刷新
行数偏差	当前行数与缓存row_count的百分比差异	如果检测到偏差超过20-30%则刷新
缺失列	没有统计信息的列	为频繁查询的列创建CREATE STATISTICS
缺失直方图	没有分布数据的列	自动收集会处理；可能需要手动刷新

有关特定工作负载的阈值指南，请参阅references/statistics-thresholds.md。

When Statistics Are Auto-Collected

自动收集统计信息的触发时机

Default trigger: ~20% row count change (controlled by

sql.stats.automatic_collection.fraction_stale_rows

)

Collection schedule:

Small tables (<10K rows): Immediate
Medium tables (10K-10M rows): Within minutes to hours
Large tables (>10M rows): May be delayed hours to avoid resource contention

Check pending jobs:

sql

SELECT job_id, description, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'AUTO CREATE STATS'
  AND status IN ('pending', 'running')
ORDER BY created DESC
LIMIT 20;

默认触发条件： 行数变化约20%（由

sql.stats.automatic_collection.fraction_stale_rows

控制）

收集计划：

小表（<10K行）：即时收集
中等表（10K-10M行）：数分钟至数小时内收集
大型表（>10M行）：可能延迟数小时以避免资源竞争

检查待处理作业：

sql

SELECT job_id, description, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'AUTO CREATE STATS'
  AND status IN ('pending', 'running')
ORDER BY created DESC
LIMIT 20;

Core Diagnostic Queries

核心诊断查询

Query 1: Identify Tables with Stale or Missing Statistics

查询1：识别统计信息过期或缺失的表

Finds tables with outdated statistics or no statistics at all, ranked by staleness.

sql

WITH table_stats AS (
  SELECT
    table_catalog,
    table_schema,
    table_name,
    column_names,
    row_count,
    created,
    now() - created AS stats_age
  FROM [SHOW STATISTICS FOR TABLE database_name.*]  -- Replace database_name
  WHERE column_names = '{}'  -- Table-level stats only (empty array)
)
SELECT
  table_schema || '.' || table_name AS full_table_name,
  row_count,
  created AS stats_created_at,
  stats_age,
  CASE
    WHEN created IS NULL THEN 'Missing statistics'
    WHEN stats_age > INTERVAL '30 days' THEN 'Very stale (>30d)'
    WHEN stats_age > INTERVAL '7 days' THEN 'Stale (>7d)'
    ELSE 'Fresh'
  END AS staleness_status
FROM table_stats
WHERE stats_age > INTERVAL '7 days' OR created IS NULL  -- Adjust threshold
ORDER BY stats_age DESC NULLS FIRST
LIMIT 50;

Customization:

Replace
```
database_name.*
```
with specific schema pattern (e.g.,
```
mydb.public.*
```
)
Adjust staleness threshold:
```
INTERVAL '7 days'
```
for OLTP,
```
'30 days'
```
for OLAP
Increase
```
LIMIT
```
to see more tables

Key columns:

```
staleness_status
```
: Quick classification of statistics freshness
```
stats_age
```
: Exact time since last collection
```
row_count
```
: Last known table size

查找统计信息过期或完全缺失的表，按过期程度排序。

sql

WITH table_stats AS (
  SELECT
    table_catalog,
    table_schema,
    table_name,
    column_names,
    row_count,
    created,
    now() - created AS stats_age
  FROM [SHOW STATISTICS FOR TABLE database_name.*]  -- 替换database_name
  WHERE column_names = '{}'  -- 仅表级统计信息（空数组）
)
SELECT
  table_schema || '.' || table_name AS full_table_name,
  row_count,
  created AS stats_created_at,
  stats_age,
  CASE
    WHEN created IS NULL THEN '缺失统计信息'
    WHEN stats_age > INTERVAL '30 days' THEN '严重过期(>30天)'
    WHEN stats_age > INTERVAL '7 days' THEN '过期(>7天)'
    ELSE '新鲜'
  END AS staleness_status
FROM table_stats
WHERE stats_age > INTERVAL '7 days' OR created IS NULL  -- 调整阈值
ORDER BY stats_age DESC NULLS FIRST
LIMIT 50;

自定义设置：

将
```
database_name.*
```
替换为特定模式（例如
```
mydb.public.*
```
）
调整过期阈值：OLTP工作负载用
```
INTERVAL '7 days'
```
，OLAP用
```
'30 days'
```
增大
```
LIMIT
```
以查看更多表

关键列：

```
staleness_status
```
：统计信息新鲜度的快速分类
```
stats_age
```
：自上次收集以来的准确时间
```
row_count
```
：上次记录的表大小

Query 2: Audit Statistics for Specific Table

查询2：审计特定表的统计信息

Shows all statistics for a single table, including table-level and per-column details.

sql

SELECT
  column_names,
  row_count,
  distinct_count,
  null_count,
  created,
  now() - created AS stats_age,
  CASE
    WHEN histogram_id IS NOT NULL THEN 'Yes'
    ELSE 'No'
  END AS has_histogram
FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
ORDER BY
  CASE WHEN column_names = '{}' THEN 0 ELSE 1 END,  -- Table-level first
  created DESC;

Customization:

Replace
```
database_name.schema_name.table_name
```
with fully-qualified table name

Key columns:

```
column_names
```
: Empty
```
{}
```
= table-level, single element = column-level
```
distinct_count
```
: Cardinality for selectivity estimates
```
null_count
```
: NULL value count for IS NULL predicates
```
has_histogram
```
: Distribution data availability

Interpretation:

First row (column_names = '{}') shows table-level row_count
Subsequent rows show per-column statistics
Missing columns indicate no statistics collected yet

显示单个表的所有统计信息，包括表级和列级详情。

sql

SELECT
  column_names,
  row_count,
  distinct_count,
  null_count,
  created,
  now() - created AS stats_age,
  CASE
    WHEN histogram_id IS NOT NULL THEN '是'
    ELSE '否'
  END AS has_histogram
FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
ORDER BY
  CASE WHEN column_names = '{}' THEN 0 ELSE 1 END,  -- 表级统计信息优先
  created DESC;

自定义设置：

将
```
database_name.schema_name.table_name
```
替换为全限定表名

关键列：

```
column_names
```
：空
```
{}
```
表示表级，单个元素表示列级
```
distinct_count
```
：用于选择性估算的基数
```
null_count
```
：IS NULL谓词的NULL值数量
```
has_histogram
```
：分布数据的可用性

解读：

第一行（column_names = '{}'）显示表级row_count
后续行显示列级统计信息
缺失的列表示尚未收集统计信息

Query 3: Detect Row Count Drift

查询3：检测行数偏差

Compares current table row count against cached statistics to identify significant drift.

sql

WITH current_count AS (
  SELECT count(*) AS actual_rows
  FROM database_name.schema_name.table_name  -- Replace with target table
),
stats_count AS (
  SELECT row_count, created
  FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
  WHERE column_names = '{}'  -- Table-level stats
  ORDER BY created DESC
  LIMIT 1
)
SELECT
  c.actual_rows,
  s.row_count AS stats_rows,
  s.created AS stats_created_at,
  now() - s.created AS stats_age,
  ABS(c.actual_rows - s.row_count) AS drift_absolute,
  ROUND(
    ABS(c.actual_rows - s.row_count)::NUMERIC /
    NULLIF(s.row_count, 0) * 100,
    2
  ) AS drift_pct,
  CASE
    WHEN ABS(c.actual_rows - s.row_count)::NUMERIC / NULLIF(s.row_count, 0) > 0.30 THEN 'High drift (>30%)'
    WHEN ABS(c.actual_rows - s.row_count)::NUMERIC / NULLIF(s.row_count, 0) > 0.20 THEN 'Medium drift (>20%)'
    WHEN ABS(c.actual_rows - s.row_count)::NUMERIC / NULLIF(s.row_count, 0) > 0.10 THEN 'Low drift (>10%)'
    ELSE 'Minimal drift (<10%)'
  END AS drift_status
FROM current_count c, stats_count s;

Customization:

Replace table name in both CTEs
Adjust drift thresholds (30%, 20%, 10%) based on workload tolerance

Key columns:

```
drift_pct
```
: Percentage difference between current and cached row count
```
drift_status
```
: Classification for prioritization
```
stats_age
```
: Time since statistics last refreshed

Interpretation:

>30% drift: Urgent refresh recommended, optimizer estimates likely very inaccurate
20-30% drift: Consider refresh if experiencing performance issues
10-20% drift: Monitor for trends, may trigger automatic collection soon
<10% drift: Normal variance, no action needed

比较当前表行数与缓存统计信息，识别显著偏差。

sql

WITH current_count AS (
  SELECT count(*) AS actual_rows
  FROM database_name.schema_name.table_name  -- 替换为目标表
),
stats_count AS (
  SELECT row_count, created
  FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
  WHERE column_names = '{}'  -- 表级统计信息
  ORDER BY created DESC
  LIMIT 1
)
SELECT
  c.actual_rows,
  s.row_count AS stats_rows,
  s.created AS stats_created_at,
  now() - s.created AS stats_age,
  ABS(c.actual_rows - s.row_count) AS drift_absolute,
  ROUND(
    ABS(c.actual_rows - s.row_count)::NUMERIC /
    NULLIF(s.row_count, 0) * 100,
    2
  ) AS drift_pct,
  CASE
    WHEN ABS(c.actual_rows - s.row_count)::NUMERIC / NULLIF(s.row_count, 0) > 0.30 THEN '高偏差(>30%)'
    WHEN ABS(c.actual_rows - s.row_count)::NUMERIC / NULLIF(s.row_count, 0) > 0.20 THEN '中偏差(>20%)'
    WHEN ABS(c.actual_rows - s.row_count)::NUMERIC / NULLIF(s.row_count, 0) > 0.10 THEN '低偏差(>10%)'
    ELSE '极小偏差(<10%)'
  END AS drift_status
FROM current_count c, stats_count s;

自定义设置：

在两个CTE中替换表名
根据工作负载容忍度调整偏差阈值（30%、20%、10%）

关键列：

```
drift_pct
```
：当前行数与缓存行数的百分比差异
```
drift_status
```
：用于优先级排序的分类
```
stats_age
```
：自上次统计信息刷新以来的时间

解读：

>30%偏差：建议立即刷新，优化器估算可能非常不准确
20-30%偏差：如果遇到性能问题，考虑刷新
10-20%偏差：监控趋势，可能很快触发自动收集
<10%偏差：正常波动，无需操作

Query 4: Identify Missing Column-Level Statistics

查询4：识别缺失的列级统计信息

Finds table columns without statistics, focusing on columns frequently used in WHERE/JOIN clauses.

sql

WITH table_columns AS (
  SELECT column_name
  FROM information_schema.columns
  WHERE table_schema = 'schema_name'  -- Replace
    AND table_name = 'table_name'    -- Replace
    AND is_hidden = 'NO'             -- Exclude internal columns
),
stats_columns AS (
  SELECT UNNEST(column_names) AS column_name
  FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
  WHERE column_names != '{}'  -- Exclude table-level stats
)
SELECT
  tc.column_name AS missing_column,
  'No statistics available' AS status
FROM table_columns tc
WHERE tc.column_name NOT IN (SELECT column_name FROM stats_columns)
ORDER BY tc.column_name;

Customization:

Replace schema_name, table_name, and database_name with target table

Interpretation:

Columns returned have no optimizer statistics
Prioritize creating statistics for columns used in:
- WHERE clause predicates (
```
WHERE user_id = 123
```
  )
- JOIN conditions (
```
JOIN orders ON users.id = orders.user_id
```
  )
- GROUP BY / ORDER BY expressions

Action: Generate CREATE STATISTICS commands (see Query 7)

查找没有统计信息的表列，重点关注WHERE/JOIN子句中频繁使用的列。

sql

WITH table_columns AS (
  SELECT column_name
  FROM information_schema.columns
  WHERE table_schema = 'schema_name'  -- 替换
    AND table_name = 'table_name'    -- 替换
    AND is_hidden = 'NO'             -- 排除内部列
),
stats_columns AS (
  SELECT UNNEST(column_names) AS column_name
  FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
  WHERE column_names != '{}'  -- 排除表级统计信息
)
SELECT
  tc.column_name AS missing_column,
  '无可用统计信息' AS status
FROM table_columns tc
WHERE tc.column_name NOT IN (SELECT column_name FROM stats_columns)
ORDER BY tc.column_name;

自定义设置：

将schema_name、table_name和database_name替换为目标表信息

解读：

返回的列没有优化器统计信息
优先为以下列创建统计信息：
- WHERE子句谓词（
```
WHERE user_id = 123
```
  ）
- JOIN条件（
```
JOIN orders ON users.id = orders.user_id
```
  ）
- GROUP BY / ORDER BY表达式

操作： 生成CREATE STATISTICS命令（见查询7）

Query 5: Histogram Coverage Analysis

查询5：直方图覆盖分析

Identifies columns with/without histogram data for range query optimization.

sql

SELECT
  UNNEST(column_names) AS column_name,
  created,
  now() - created AS stats_age,
  CASE
    WHEN histogram_id IS NOT NULL THEN 'Has histogram'
    ELSE 'Missing histogram'
  END AS histogram_status
FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
WHERE column_names != '{}'  -- Exclude table-level stats
ORDER BY
  CASE WHEN histogram_id IS NULL THEN 0 ELSE 1 END,  -- Missing first
  created DESC;

Customization:

Replace database_name.schema_name.table_name

Key columns:

```
histogram_status
```
: Indicates distribution data availability
```
stats_age
```
: Time since histogram last updated

Interpretation:

Has histogram: Optimizer can estimate range scan selectivity (BETWEEN, >, <)
Missing histogram: Optimizer uses uniform distribution assumption (less accurate)
Automatic collection creates histograms; missing indicates very new column or disabled collection

识别有/无直方图数据的列，用于优化范围查询。

sql

SELECT
  UNNEST(column_names) AS column_name,
  created,
  now() - created AS stats_age,
  CASE
    WHEN histogram_id IS NOT NULL THEN '有直方图'
    ELSE '缺失直方图'
  END AS histogram_status
FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
WHERE column_names != '{}'  -- 排除表级统计信息
ORDER BY
  CASE WHEN histogram_id IS NULL THEN 0 ELSE 1 END,  -- 缺失的优先
  created DESC;

自定义设置：

替换database_name.schema_name.table_name

关键列：

```
histogram_status
```
：分布数据的可用性
```
stats_age
```
：自上次直方图更新以来的时间

解读：

有直方图：优化器可以估算范围扫描的选择性（BETWEEN、>、<）
缺失直方图：优化器使用均匀分布假设（准确性较低）
自动收集会创建直方图；缺失表示列非常新或收集功能已禁用

Query 6: Multi-Column Statistics Detection

查询6：多列统计信息检测

Identifies existing multi-column (composite) statistics for correlated columns.

sql

SELECT
  column_names,
  created,
  now() - created AS stats_age,
  row_count,
  ARRAY_LENGTH(column_names, 1) AS column_count
FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
WHERE ARRAY_LENGTH(column_names, 1) > 1  -- Multi-column only
ORDER BY created DESC;

Customization:

Replace database_name.schema_name.table_name

Key columns:

```
column_names
```
: Array of correlated columns
```
column_count
```
: Number of columns in composite statistic

Interpretation:

Present: Manual multi-column statistics exist (automatic collection only creates single-column)
Absent: May need manual creation for correlated columns (e.g., city + state + zip)
Common use case: Composite index columns that are queried together

See references/create-statistics-examples.md for multi-column creation patterns.

识别关联列的现有多列（复合）统计信息。

sql

SELECT
  column_names,
  created,
  now() - created AS stats_age,
  row_count,
  ARRAY_LENGTH(column_names, 1) AS column_count
FROM [SHOW STATISTICS FOR TABLE database_name.schema_name.table_name]
WHERE ARRAY_LENGTH(column_names, 1) > 1  -- 仅多列统计信息
ORDER BY created DESC;

自定义设置：

替换database_name.schema_name.table_name

关键列：

```
column_names
```
：关联列的数组
```
column_count
```
：复合统计信息中的列数

解读：

存在：手动创建的多列统计信息（自动收集仅创建单列统计信息）
不存在：可能需要为关联列手动创建（例如city + state + zip）
常见用例：一起查询的复合索引列

有关多列创建模式，请参阅references/create-statistics-examples.md。

Query 7: Generate CREATE STATISTICS Recommendations

查询7：生成CREATE STATISTICS建议

Produces ready-to-run CREATE STATISTICS commands for tables with stale or missing statistics.

sql

WITH stale_tables AS (
  SELECT
    table_schema,
    table_name,
    created,
    now() - created AS stats_age
  FROM [SHOW STATISTICS FOR TABLE database_name.*]
  WHERE column_names = '{}'
    AND (created IS NULL OR now() - created > INTERVAL '7 days')  -- Adjust threshold
)
SELECT
  table_schema || '.' || table_name AS full_table_name,
  stats_age,
  'CREATE STATISTICS __auto__ FROM ' || table_schema || '.' || table_name || ';' AS create_command
FROM stale_tables
ORDER BY stats_age DESC NULLS FIRST
LIMIT 50;

Customization:

Replace
```
database_name.*
```
with schema pattern
Adjust
```
INTERVAL '7 days'
```
staleness threshold
Increase
```
LIMIT
```
for more recommendations

Output:

```
create_command
```
: Copy-paste ready SQL command
```
__auto__
```
: Uses automatic column selection (recommended default)

Execution:

Review generated commands before execution
Run during low-traffic periods for large tables (>10M rows)
Monitor job progress (see Query 6 for job monitoring)

为统计信息过期或缺失的表生成可直接运行的CREATE STATISTICS命令。

sql

WITH stale_tables AS (
  SELECT
    table_schema,
    table_name,
    created,
    now() - created AS stats_age
  FROM [SHOW STATISTICS FOR TABLE database_name.*]
  WHERE column_names = '{}'
    AND (created IS NULL OR now() - created > INTERVAL '7 days')  -- 调整阈值
)
SELECT
  table_schema || '.' || table_name AS full_table_name,
  stats_age,
  'CREATE STATISTICS __auto__ FROM ' || table_schema || '.' || table_name || ';' AS create_command
FROM stale_tables
ORDER BY stats_age DESC NULLS FIRST
LIMIT 50;

自定义设置：

将
```
database_name.*
```
替换为模式匹配规则
调整
```
INTERVAL '7 days'
```
过期阈值
增大
```
LIMIT
```
以获取更多建议

输出：

```
create_command
```
：可直接复制粘贴的SQL命令
```
__auto__
```
：使用自动列选择（推荐默认设置）

执行：

执行前检查生成的命令
对于大型表（>10M行），在低流量时段运行
监控作业进度（见查询6的作业监控）

Common Workflows

常见工作流

Workflow 1: Post-Bulk-Load Statistics Audit

工作流1：批量加载后统计信息审计

Scenario: After bulk INSERT/COPY/IMPORT operation, validate statistics are current.

Steps:

Identify affected tables:

sql

-- List tables modified in last 24 hours
SELECT DISTINCT table_schema || '.' || table_name AS full_table_name
FROM [SHOW TABLES]
WHERE table_schema = 'target_schema';  -- Replace

Check row count drift (Query 3): Run drift detection query for each affected table.

Generate and execute refresh commands (Query 7):

sql

CREATE STATISTICS __auto__ FROM schema_name.table_name;  -- From Query 7 output

Monitor collection job:

sql

SELECT job_id, status, fraction_completed, running_status
FROM [SHOW JOBS]
WHERE job_type = 'CREATE STATS'
  AND created > now() - INTERVAL '1 hour'
ORDER BY created DESC
LIMIT 10;

Verify refresh (Query 2): Re-run statistics audit to confirm
```
created
```
timestamp updated.

Expected outcome: Statistics age <1 hour, drift_pct <5%.

场景： 批量执行INSERT/COPY/IMPORT操作后，验证统计信息是否为最新。

步骤：

识别受影响的表：

sql

-- 列出过去24小时内修改的表
SELECT DISTINCT table_schema || '.' || table_name AS full_table_name
FROM [SHOW TABLES]
WHERE table_schema = 'target_schema';  -- 替换

检查行数偏差（查询3）： 对每个受影响的表运行偏差检测查询。

生成并执行刷新命令（查询7）：

sql

CREATE STATISTICS __auto__ FROM schema_name.table_name;  -- 来自查询7的输出

监控收集作业：

sql

SELECT job_id, status, fraction_completed, running_status
FROM [SHOW JOBS]
WHERE job_type = 'CREATE STATS'
  AND created > now() - INTERVAL '1 hour'
ORDER BY created DESC
LIMIT 10;

验证刷新结果（查询2）： 重新运行统计信息审计，确认
```
created
```
时间戳已更新。

预期结果： 统计信息时长<1小时，drift_pct<5%。

Workflow 2: Diagnose Unexpected Query Plan Changes

工作流2：诊断意外的查询计划变更

Scenario: Query performance suddenly degrades; EXPLAIN shows different plan.

Steps:

Identify affected query from profiling-statement-fingerprints: Find query with latency spike or plan hash change.
Extract table references: Parse query text to identify tables in FROM/JOIN clauses.
Audit statistics for each table (Query 2): Check staleness and row count currency.

Compare historical vs current row counts:

sql

-- Example: Check if table grew significantly
SELECT row_count, created
FROM [SHOW STATISTICS FOR TABLE users]
WHERE column_names = '{}'
ORDER BY created DESC
LIMIT 5;  -- Last 5 collections

Refresh stale statistics (Query 7): Execute CREATE STATISTICS for tables with high drift.
Validate plan stability: Re-run EXPLAIN to verify plan returns to expected structure.

Expected outcome: Plan hash stabilizes, latency returns to baseline after statistics refresh.

场景： 查询性能突然下降；EXPLAIN显示不同的计划。

步骤：

从profiling-statement-fingerprints识别受影响的查询： 找到延迟飙升或计划哈希变更的查询。
提取表引用： 解析查询文本，识别FROM/JOIN子句中的表。
审计每个表的统计信息（查询2）： 检查过期情况和行数的时效性。

比较历史与当前行数：

sql

-- 示例：检查表是否显著增长
SELECT row_count, created
FROM [SHOW STATISTICS FOR TABLE users]
WHERE column_names = '{}'
ORDER BY created DESC
LIMIT 5;  -- 最近5次收集

刷新过期统计信息（查询7）： 对高偏差表执行CREATE STATISTICS。
验证计划稳定性： 重新运行EXPLAIN，确认计划恢复为预期结构。

预期结果： 计划哈希稳定，统计信息刷新后延迟恢复到基线水平。

Workflow 3: Routine Statistics Health Check

工作流3：定期统计信息健康检查

Scenario: Periodic audit to proactively identify statistics issues before performance degrades.

Steps:

Run cluster-wide staleness scan (Query 1):

sql

-- All databases
SHOW STATISTICS FOR TABLE *.*;  -- Warning: May be slow on large clusters

Prioritize critical tables: Focus on high-traffic tables from profiling-statement-fingerprints.

Check automatic collection is enabled:

sql

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- Should be true

Review pending auto-collection jobs:

sql

SELECT job_id, description, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'AUTO CREATE STATS'
  AND status IN ('pending', 'running')
ORDER BY created DESC;

Generate batch refresh script (Query 7): Save output to file for scheduled execution.
Schedule refresh during maintenance window: Execute generated CREATE STATISTICS commands during low-traffic period.

Frequency: Weekly for OLTP, monthly for OLAP.

场景： 定期审计，在性能下降前主动识别统计信息问题。

步骤：

运行集群范围的过期扫描（查询1）：

sql

-- 所有数据库
SHOW STATISTICS FOR TABLE *.*;  -- 警告：在大型集群上可能较慢

优先处理关键表： 重点关注profiling-statement-fingerprints中的高流量表。

检查自动收集是否启用：

sql

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- 应返回true

查看待处理的自动收集作业：

sql

SELECT job_id, description, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'AUTO CREATE STATS'
  AND status IN ('pending', 'running')
ORDER BY created DESC;

生成批量刷新脚本（查询7）： 将输出保存到文件，用于计划执行。
在维护窗口安排刷新： 在低流量时段执行生成的CREATE STATISTICS命令。

频率： OLTP工作负载每周一次，OLAP工作负载每月一次。

Safety Considerations

安全注意事项

Production-Safe Operations

生产环境安全操作

SHOW STATISTICS:

Impact: Read-only, no cluster impact
Safe for production: Yes, run anytime without restrictions

CREATE STATISTICS:

Impact: CPU/IO-intensive, non-blocking table scans
Safety:
- Does NOT lock table or block writes
- Consumes resources (CPU, network, disk I/O)
- May impact query performance during collection on large tables
Best practices:
- Run during low-traffic periods for tables >10M rows
- Stagger execution (avoid creating statistics on many tables simultaneously)
- Monitor job progress and resource utilization

SHOW STATISTICS：

影响： 只读操作，对集群无影响
生产环境安全性： 是，可随时运行，无限制

CREATE STATISTICS：

影响： 消耗CPU/IO资源，非阻塞表扫描
安全性：
- 不会锁定表或阻止写入
- 消耗资源（CPU、网络、磁盘I/O）
- 对大型表收集统计信息期间可能影响查询性能
最佳实践：
- 对>10M行的表，在低流量时段运行
- 交错执行（避免同时对多个表创建统计信息）
- 监控作业进度和资源使用情况

Resource Consumption

资源消耗

Small tables (<10K rows): Negligible impact, safe anytime

Medium tables (10K-10M rows): Seconds to minutes, minor impact

Large tables (>10M rows): Minutes to hours, plan accordingly:

Schedule during maintenance windows
Monitor cluster metrics (CPU, disk I/O) during collection
Use
```
SHOW JOBS
```
to track progress

Cancellation (if needed):

sql

-- Find job ID
SELECT job_id, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'CREATE STATS' AND status = 'running';

-- Cancel job (non-destructive, existing statistics remain)
CANCEL JOB 123456789012345678;

小表（<10K行）： 影响可忽略，随时安全运行

中等表（10K-10M行）： 耗时数秒至数分钟，影响轻微

大型表（>10M行）： 耗时数分钟至数小时，需提前规划：

在维护窗口安排
收集期间监控集群指标（CPU、磁盘I/O）
使用
```
SHOW JOBS
```
跟踪进度

如需取消（必要时）：

sql

-- 查找作业ID
SELECT job_id, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'CREATE STATS' AND status = 'running';

-- 取消作业（非破坏性，现有统计信息保留）
CANCEL JOB 123456789012345678;

Batch Collection Best Practices

批量收集最佳实践

Avoid overwhelming cluster:

Collect statistics for 3-5 tables concurrently maximum
Wait for completion before starting next batch
Monitor cluster health metrics between batches

Example staggered script:

bash

undefined

避免压垮集群：

同时最多收集3-5个表的统计信息
等待一批完成后再开始下一批
批次之间监控集群健康指标

示例交错脚本：

bash

undefined

Collect statistics in batches with delays

分批收集统计信息并设置延迟

for table in table1 table2 table3; do cockroach sql -e "CREATE STATISTICS auto FROM $table;" & done wait # Wait for batch to complete

sleep 60 # Delay between batches

for table in table4 table5 table6; do cockroach sql -e "CREATE STATISTICS auto FROM $table;" & done wait


See [references/create-statistics-examples.md](references/create-statistics-examples.md) for detailed batch patterns.

for table in table1 table2 table3; do cockroach sql -e "CREATE STATISTICS auto FROM $table;" & done wait # 等待批处理完成

sleep 60 # 批次之间的延迟

for table in table4 table5 table6; do cockroach sql -e "CREATE STATISTICS auto FROM $table;" & done wait


有关详细的批量模式，请参阅[references/create-statistics-examples.md](references/create-statistics-examples.md)。

Troubleshooting

故障排除

Issue	Likely Cause	Fix
SHOW STATISTICS returns empty	No statistics ever collected	Run `CREATE STATISTICS __auto__ FROM table_name;`
row_count shows 0 for non-empty table	Statistics out of sync	Refresh: `CREATE STATISTICS __auto__ FROM table_name;`
Permission denied error	No privileges on table	Grant any privilege: `GRANT SELECT ON table_name TO user;`
CREATE STATISTICS job stuck	Large table with high write volume	Check `SHOW JOBS` status; consider `CANCEL JOB` and retry during low-traffic period
Automatic collection not triggering	Setting disabled or threshold not met	Verify `sql.stats.automatic_collection.enabled = true` and check row count drift
Statistics exist but query plans still poor	Stale statistics or missing multi-column stats	Refresh existing; create multi-column for correlated columns (see Query 6)
High drift but recent created timestamp	Extreme write volume between collections	Lower automatic collection threshold or increase manual refresh frequency

问题	可能原因	修复方法
SHOW STATISTICS返回空结果	从未收集过统计信息	运行 `CREATE STATISTICS __auto__ FROM table_name;`
非空表的row_count显示为0	统计信息不同步	刷新： `CREATE STATISTICS __auto__ FROM table_name;`
权限拒绝错误	无表权限	授予任意权限： `GRANT SELECT ON table_name TO user;`
CREATE STATISTICS作业卡住	高写入量的大型表	检查 `SHOW JOBS` 状态；考虑 `CANCEL JOB` 并在低流量时段重试
自动收集未触发	设置已禁用或未达到阈值	验证 `sql.stats.automatic_collection.enabled = true` 并检查行数偏差
统计信息存在但查询计划仍不佳	统计信息过期或缺失多列统计信息	刷新现有统计信息；为关联列创建多列统计信息（见查询6）
高偏差但created时间戳较新	收集间隔期间写入量极大	降低自动收集阈值或增加手动刷新频率

Defensive Query Patterns

防御性查询模式

Handle missing statistics:

sql

-- Use COALESCE for NULL created timestamps
SELECT COALESCE(created, '1970-01-01'::TIMESTAMP) AS stats_created_at
FROM [SHOW STATISTICS FOR TABLE table_name]
WHERE column_names = '{}';

Avoid division by zero in drift calculations:

sql

-- Use NULLIF to prevent divide-by-zero errors
SELECT
  ABS(actual - stats)::NUMERIC / NULLIF(stats, 0) * 100 AS drift_pct
FROM ...;

处理缺失的统计信息：

sql

-- 对NULL created时间戳使用COALESCE
SELECT COALESCE(created, '1970-01-01'::TIMESTAMP) AS stats_created_at
FROM [SHOW STATISTICS FOR TABLE table_name]
WHERE column_names = '{}';

避免偏差计算中的除零错误：

sql

-- 使用NULLIF防止除零错误
SELECT
  ABS(actual - stats)::NUMERIC / NULLIF(stats, 0) * 100 AS drift_pct
FROM ...;

Key Considerations

关键注意事项

Automatic vs Manual Collection

自动与手动收集对比

Automatic (default):

Pros: Zero maintenance, adapts to data changes, covers all single columns
Cons: May lag for very large tables, no multi-column statistics
Recommendation: Keep enabled for baseline coverage

Manual:

Pros: Immediate refresh, supports multi-column statistics, controlled timing
Cons: Requires monitoring and operational overhead
Recommendation: Use for critical tables, correlated columns, post-bulk-load scenarios

自动（默认）：

优点： 无需维护，适应数据变化，覆盖所有单列
缺点： 对超大型表可能延迟，不支持多列统计信息
建议： 保持启用以提供基线覆盖

手动：

优点： 即时刷新，支持多列统计信息，时间可控
缺点： 需要监控和运维开销
建议： 用于关键表、关联列、批量加载后的场景

Statistics Retention

统计信息保留

Default retention: Controlled by

sql.stats.persisted_rows.max

(default ~10 million rows across cluster)

Older statistics are pruned automatically
Typically 7-30 days of history retained depending on statistics volume

Historical analysis:

sql

-- View statistics history for a table
SELECT column_names, row_count, created
FROM [SHOW STATISTICS FOR TABLE table_name]
WHERE column_names = '{}'
ORDER BY created DESC
LIMIT 10;  -- Last 10 collections

默认保留期： 由

sql.stats.persisted_rows.max

控制（默认集群范围内约1000万行）

旧统计信息会自动清理
根据统计信息数量，通常保留7-30天的历史数据

历史分析：

sql

-- 查看表的统计信息历史
SELECT column_names, row_count, created
FROM [SHOW STATISTICS FOR TABLE table_name]
WHERE column_names = '{}'
ORDER BY created DESC
LIMIT 10;  -- 最近10次收集

Histogram Limitations

直方图限制

What histograms optimize:

Range queries:
```
WHERE age BETWEEN 20 AND 30
```
Inequality predicates:
```
WHERE price > 100
```
ORDER BY selectivity:
```
ORDER BY created_at LIMIT 10
```

What histograms don't optimize:

Exact equality:
```
WHERE id = 123
```
(uses distinct_count instead)
Multi-column predicates: Requires multi-column statistics
Very sparse columns: Histograms may be ineffective for highly skewed distributions

直方图优化的场景：

范围查询：
```
WHERE age BETWEEN 20 AND 30
```
不等式谓词：
```
WHERE price > 100
```
ORDER BY选择性：
```
ORDER BY created_at LIMIT 10
```

直方图不优化的场景：

精确相等：
```
WHERE id = 123
```
（使用distinct_count）
多列谓词：需要多列统计信息
极度稀疏的列：直方图对高度倾斜的分布可能无效

Multi-Column Statistics

多列统计信息

When to create:

Columns frequently queried together in WHERE clauses
Composite index columns with correlated values
Geographic columns (city + state + zip)
Time-based partitioning columns (year + month)

Creation:

sql

-- Example: Correlated columns
CREATE STATISTICS city_state_stats ON city, state FROM addresses;

Limitation: Only manual creation (automatic collection does NOT create multi-column statistics)

See references/create-statistics-examples.md for comprehensive patterns.

创建时机：

WHERE子句中频繁一起查询的列
具有关联值的复合索引列
地理列（city + state + zip）
基于时间的分区列（year + month）

创建方式：

sql

-- 示例：关联列
CREATE STATISTICS city_state_stats ON city, state FROM addresses;

限制： 仅支持手动创建（自动收集不会创建多列统计信息）

有关全面的模式，请参阅references/create-statistics-examples.md。

Performance Impact Mitigation

性能影响缓解策略

Large table strategies:

Schedule during maintenance windows (nights/weekends)
Use database-specific batching (one database at a time)
Monitor cluster metrics: CPU, disk I/O, network saturation
Consider
```
AS OF SYSTEM TIME
```
for historical analysis without impacting live traffic

Resource monitoring during collection:

sql

-- Check running statistics jobs
SELECT job_id, description, status, fraction_completed, running_status
FROM [SHOW JOBS]
WHERE job_type IN ('CREATE STATS', 'AUTO CREATE STATS')
  AND status = 'running';

大型表策略：

在维护窗口（夜间/周末）安排
按数据库分批处理（一次一个数据库）
监控集群指标：CPU、磁盘I/O、网络饱和度
考虑使用
```
AS OF SYSTEM TIME
```
进行历史分析，不影响实时流量

收集期间的资源监控：

sql

-- 检查运行中的统计信息作业
SELECT job_id, description, status, fraction_completed, running_status
FROM [SHOW JOBS]
WHERE job_type IN ('CREATE STATS', 'AUTO CREATE STATS')
  AND status = 'running';

Staleness Threshold Tuning

过期阈值调优

OLTP workloads:

Recommended refresh: 3-7 days
Rationale: Frequent updates, query patterns change rapidly

OLAP/Analytics workloads:

Recommended refresh: 14-30 days
Rationale: Batch-oriented, stable query patterns

Hybrid workloads:

Critical tables: 3-7 days
Archive/historical tables: 30+ days

See references/statistics-thresholds.md for detailed guidance.

OLTP工作负载：

建议刷新周期： 3-7天
理由： 更新频繁，查询模式变化快

OLAP/分析工作负载：

建议刷新周期： 14-30天
理由： 批量处理，查询模式稳定

混合工作负载：

关键表：3-7天
归档/历史表：30+天

有关详细指南，请参阅references/statistics-thresholds.md。

Privilege Requirements

权限要求

Required: Any privilege on table (SELECT, INSERT, UPDATE, DELETE, or admin role)

Comparison to other diagnostics:

Less restrictive than: VIEWACTIVITY (required for statement_statistics)
More restrictive than: Public cluster settings

Grant example:

sql

-- Grant SELECT (least privileged) for statistics visibility
GRANT SELECT ON TABLE database_name.table_name TO diagnostics_user;

必需： 表的任意权限（SELECT、INSERT、UPDATE、DELETE或管理员角色）

与其他诊断工具的对比：

比VIEWACTIVITY限制更少： VIEWACTIVITY是statement_statistics所需的权限
比公共集群设置限制更多：

授权示例：

sql

-- 授予SELECT（最低权限）以查看统计信息
GRANT SELECT ON TABLE database_name.table_name TO diagnostics_user;

References

参考资料

Official CockroachDB Documentation:

SHOW STATISTICS - Complete syntax and output schema
CREATE STATISTICS - Manual statistics collection guide
Cost-Based Optimizer - How optimizer uses statistics
Table Statistics - Statistics impact on query planning
SHOW JOBS - Job monitoring and management

Related Skills:

profiling-statement-fingerprints - Identify slow query patterns
triaging-live-sql-activity - Real-time query triage

Supplementary References:

Statistics Thresholds Guide - Workload-specific staleness and drift thresholds
CREATE STATISTICS Examples - Comprehensive collection patterns and batch strategies

官方CockroachDB文档：

SHOW STATISTICS - 完整语法和输出模式
CREATE STATISTICS - 手动统计信息收集指南
Cost-Based Optimizer - 优化器如何使用统计信息
Table Statistics - 统计信息对查询计划的影响
SHOW JOBS - 作业监控和管理

相关技能：

profiling-statement-fingerprints - 识别慢查询模式
triaging-live-sql-activity - 实时查询分类

补充参考：

Statistics Thresholds Guide - 特定工作负载的过期和偏差阈值
CREATE STATISTICS Examples - 全面的收集模式和批量策略