monitoring-background-jobs

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Monitoring Background Jobs

监控后台任务

Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.

通过识别与用户查询不同的失败、暂停和长时间运行的任务，监控后台任务的健康状态。使用纯SQL接口（SHOW JOBS和SHOW AUTOMATIC JOBS），无需访问DB Console即可查看模式变更、备份/恢复、自动统计信息收集以及SQL统计信息压缩任务。

When to Use This Skill

何时使用该技能

Schema changes appear stuck or delayed (ALTER TABLE, CREATE INDEX, DROP operations)
Backups or restores are failing or taking longer than expected
Need to verify automatic statistics collection is running
Investigating "waiting for MVCC GC" status in schema change cleanup
Troubleshooting failed jobs without DB Console access
Monitoring long-running operations that don't appear in query metrics

For live query monitoring: Use triaging-live-sql-activity to monitor currently executing user queries. Note that background jobs execute statements that may not appear in SHOW CLUSTER STATEMENTS.

For historical query analysis: Use profiling-statement-fingerprints for query pattern trends. Note that background jobs are excluded from statement statistics.

模式变更（ALTER TABLE、CREATE INDEX、DROP操作）出现停滞或延迟
备份或恢复失败或耗时超出预期
需要验证自动统计信息收集是否在运行
排查模式变更清理中的“waiting for MVCC GC”状态
在无DB Console访问权限时排查失败任务
监控未出现在查询指标中的长时间运行操作

实时查询监控： 使用triaging-live-sql-activity监控当前正在执行的用户查询。注意，后台任务执行的语句可能不会出现在SHOW CLUSTER STATEMENTS中。

历史查询分析： 使用profiling-statement-fingerprints分析查询模式趋势。注意，后台任务不包含在语句统计信息中。

Prerequisites

前提条件

Required SQL access:

Connection to any CockroachDB node
For cluster-wide job visibility:
```
VIEWJOB
```
system privilege (read-only monitoring)
For job control operations:
```
CONTROLJOB
```
role option (pause/cancel/resume jobs)
Without these: Limited visibility into jobs you created

Check your privileges:

sql

SHOW GRANTS ON ROLE <username>;

Look for:

```
VIEWJOB
```
in the
```
privilege_type
```
column (system privilege)
```
CONTROLJOB
```
in role options (check with
```
SHOW USERS
```
)

See permissions reference for detailed RBAC setup.

所需SQL权限：

连接到任意CockroachDB节点
集群级任务可见性：
```
VIEWJOB
```
系统权限（只读监控）
任务控制操作：
```
CONTROLJOB
```
角色选项（暂停/取消/恢复任务）
无上述权限：仅能查看自己创建的任务

检查你的权限：

sql

SHOW GRANTS ON ROLE <username>;

查找：

```
privilege_type
```
列中的
```
VIEWJOB
```
（系统权限）
角色选项中的
```
CONTROLJOB
```
（使用
```
SHOW USERS
```
查看）

详细RBAC设置请参考权限参考文档。

Core Concepts

核心概念

Jobs vs Statements

任务与语句

Key distinction:

Statements: User-initiated SQL queries tracked by SHOW CLUSTER STATEMENTS and statement statistics
Background jobs: Long-running operations tracked separately by SHOW JOBS

Background jobs are excluded from:

```
SHOW CLUSTER STATEMENTS
```
(live query monitoring)
```
crdb_internal.statement_statistics
```
(historical query analysis)
Statement fingerprint metrics and DB Console Statements page

Common job types:

Category	Job Types	Examples
User-initiated	SCHEMA CHANGE, BACKUP, RESTORE, IMPORT, CHANGEFEED	ALTER TABLE, CREATE INDEX, BACKUP DATABASE, RESTORE
Automatic	SCHEMA CHANGE GC, AUTO CREATE STATS, AUTO SQL STATS COMPACTION	Post-DROP cleanup, table statistics refresh, stats table maintenance

See job types reference for complete catalog.

关键区别：

语句： 用户发起的SQL查询，由SHOW CLUSTER STATEMENTS和语句统计信息跟踪
后台任务： 长时间运行的操作，由SHOW JOBS单独跟踪

后台任务不包含在以下内容中：

```
SHOW CLUSTER STATEMENTS
```
（实时查询监控）
```
crdb_internal.statement_statistics
```
（历史查询分析）
语句指纹指标和DB Console的语句页面

常见任务类型：

类别	任务类型	示例
用户发起	SCHEMA CHANGE、BACKUP、RESTORE、IMPORT、CHANGEFEED	ALTER TABLE、CREATE INDEX、BACKUP DATABASE、RESTORE
自动任务	SCHEMA CHANGE GC、AUTO CREATE STATS、AUTO SQL STATS COMPACTION	DROP操作后的清理、表统计信息刷新、统计信息表维护

完整任务类型目录请参考任务类型参考文档。

SHOW JOBS vs SHOW AUTOMATIC JOBS

SHOW JOBS与SHOW AUTOMATIC JOBS对比

Interface	Scope	Time Window	Use Case
`SHOW JOBS`	User-initiated + automatic	Last 12 hours (default)	Monitor backups, schema changes, user operations
`SHOW AUTOMATIC JOBS`	Automatic only	Configurable (recommend 24h)	Monitor AUTO CREATE STATS, AUTO SQL STATS COMPACTION, SCHEMA CHANGE GC

Time retention:

Default retention: 14 days in
```
crdb_internal.jobs
```
table

SHOW JOBS

display window: 12 hours (configurable with

SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ...

)

SHOW AUTOMATIC JOBS

display window: Configurable with

WHERE created > now() - INTERVAL '...'

接口	范围	时间窗口	使用场景
`SHOW JOBS`	用户发起+自动任务	默认最近12小时	监控备份、模式变更、用户操作
`SHOW AUTOMATIC JOBS`	仅自动任务	可配置（推荐24小时）	监控AUTO CREATE STATS、AUTO SQL STATS COMPACTION、SCHEMA CHANGE GC

时间保留规则：

默认保留：
```
crdb_internal.jobs
```
表中保留14天

SHOW JOBS

显示窗口：12小时（可通过

SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ...

配置）

SHOW AUTOMATIC JOBS

显示窗口：可通过

WHERE created > now() - INTERVAL '...'

配置

Job Status Values

任务状态值

Status	Meaning	Action Required
`running`	Job is actively executing	Monitor progress via `fraction_completed`
`succeeded`	Job completed successfully	None
`failed`	Job encountered an error	Investigate `error` column, may need to retry
`paused`	Job manually paused	Resume with `RESUME JOB` if appropriate
`canceled`	Job was canceled (terminal state)	Retry operation if needed
`pending`	Job queued but not started	Monitor; may indicate resource constraints
`reverting`	Job failed and is rolling back changes	Wait for completion; check error after

Running status sub-states:

```
performing backup
```
: Backup job actively transferring data
```
restoring
```
: Restore job actively applying data
```
waiting for MVCC GC
```
: SCHEMA CHANGE GC waiting for garbage collection eligibility

See job states reference for detailed state transitions and "waiting for MVCC GC" explanation.

状态	含义	所需操作
`running`	任务正在执行	通过 `fraction_completed` 监控进度
`succeeded`	任务成功完成	无
`failed`	任务遇到错误	查看 `error` 列，可能需要重试
`paused`	任务被手动暂停	若合适，使用 `RESUME JOB` 恢复
`canceled`	任务已被取消（终端状态）	若需要，重新执行操作
`pending`	任务已排队但未启动	监控；可能表示资源受限
`reverting`	任务失败并正在回滚变更	等待完成；之后检查错误

运行状态子状态：

```
performing backup
```
：备份任务正在传输数据
```
restoring
```
：恢复任务正在应用数据
```
waiting for MVCC GC
```
：SCHEMA CHANGE GC等待垃圾回收资格

详细状态转换和“waiting for MVCC GC”解释请参考任务状态参考文档。

Core Diagnostic Queries

核心诊断查询

Query 1: Failed Jobs (Last 12 Hours)

查询1：最近12小时内的失败任务

Identify jobs that failed with error messages:

sql

-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  finished,
  now() - created AS total_duration,
  error
FROM j
WHERE status = 'failed'
  AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;

Key columns:

```
error
```
: Failure reason (check for permission errors, disk space, network issues)
```
description
```
: Human-readable description of what the job was doing
```
total_duration
```
: How long the job ran before failing

Common failure patterns:

Permission denied: User lacks required privileges
Disk space: Backup destination full
Network timeout: External storage unreachable
Constraint violation: Restore conflicts with existing data

识别带有错误信息的失败任务：

sql

-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  finished,
  now() - created AS total_duration,
  error
FROM j
WHERE status = 'failed'
  AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;

关键列：

```
error
```
：失败原因（检查权限错误、磁盘空间、网络问题）
```
description
```
：任务执行内容的可读描述
```
total_duration
```
：任务失败前运行的时长

常见失败模式：

权限拒绝：用户缺少所需权限
磁盘空间不足：备份目标存储已满
网络超时：外部存储无法访问
约束冲突：恢复与现有数据冲突

Query 2: Long-Running Jobs

查询2：长时间运行的任务

Find jobs running longer than expected threshold:

sql

-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  status,
  running_status,
  created,
  now() - created AS running_for,
  fraction_completed,
  coordinator_id
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;

Key columns:

```
running_for
```
: Total elapsed time since job started
```
fraction_completed
```
: Progress estimate (0.0 to 1.0, NULL if unavailable)
```
running_status
```
: Sub-state details (e.g., "waiting for MVCC GC")

Customizable thresholds:

Schema changes: 30 minutes to several hours (depends on table size)
Backups: 1-6+ hours (depends on data volume)
Automatic jobs: Usually < 30 minutes

查找运行时长超出预期阈值的任务：

sql

-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  status,
  running_status,
  created,
  now() - created AS running_for,
  fraction_completed,
  coordinator_id
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;

关键列：

```
running_for
```
：任务启动后的总耗时
```
fraction_completed
```
：进度估计值（0.0到1.0，不可用时为NULL）
```
running_status
```
：子状态详情（如“waiting for MVCC GC”）

可自定义阈值：

模式变更：30分钟到数小时（取决于表大小）
备份：1-6小时以上（取决于数据量）
自动任务：通常少于30分钟

Query 3: Paused Jobs

查询3：暂停的任务

Identify jobs that are paused and may need attention:

sql

-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS paused_for,
  coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;

Action required: Resume with

RESUME JOB <job_id>

after verifying the pause reason.

Common reasons for paused jobs:

Manual user pause for maintenance
Resource constraints (cluster paused the job)
Error requiring manual intervention

识别已暂停且可能需要关注的任务：

sql

-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS paused_for,
  coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;

所需操作： 确认暂停原因后，使用

RESUME JOB <job_id>

恢复任务。

任务暂停的常见原因：

用户为维护手动暂停
资源受限（集群暂停任务）
需要手动干预的错误

Query 4: Schema Changes Waiting for MVCC GC

查询4：等待MVCC GC的模式变更任务

Find SCHEMA CHANGE GC jobs waiting for garbage collection:

sql

-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS waiting_for,
  running_status
FROM j
WHERE status = 'running'
  AND job_type = 'SCHEMA CHANGE GC'
  AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;

Interpretation:

Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on
```
gc.ttlseconds
```
setting (default 25 hours)
Expected duration: Up to
```
gc.ttlseconds
```
+ some overhead

When to worry: Waiting > 2x

gc.ttlseconds

(check setting with

SHOW CLUSTER SETTING gc.ttlseconds

)

Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.

See job states reference for detailed MVCC GC explanation.

查找等待垃圾回收的SCHEMA CHANGE GC任务：

sql

-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS waiting_for,
  running_status
FROM j
WHERE status = 'running'
  AND job_type = 'SCHEMA CHANGE GC'
  AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;

解读：

正常情况： SCHEMA CHANGE GC任务会根据
```
gc.ttlseconds
```
设置（默认25小时）等待数据变为可垃圾回收状态
预期时长： 最多
```
gc.ttlseconds
```
加上一些额外开销
需关注的情况： 等待时长超过
```
gc.ttlseconds
```
的2倍（使用
```
SHOW CLUSTER SETTING gc.ttlseconds
```
查看设置）

原因： 执行DROP TABLE/INDEX操作后，CockroachDB必须等待所有基于旧时间戳的读取完成，才能物理删除数据。这可以防止“时间旅行”查询失败。

详细MVCC GC解释请参考任务状态参考文档。

Query 5: Automatic Job Health (24h Window)

查询5：自动任务健康状态（24小时窗口）

Monitor automatic background jobs like statistics collection:

sql

-- Automatic jobs in last 24 hours
SELECT
  job_id,
  job_type,
  description,
  status,
  created,
  finished,
  COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;

Key job types:

```
AUTO CREATE STATS
```
: Automatic table statistics refresh (critical for query optimizer)
```
AUTO SQL STATS COMPACTION
```
: Periodic cleanup of statement/transaction statistics tables

Health indicators:

Healthy: Regular successful executions (every few hours)
Unhealthy: No recent executions, or high failure rate
Impact of failure: Stale statistics lead to poor query plans and slow queries

监控自动后台任务，如统计信息收集：

sql

-- Automatic jobs in last 24 hours
SELECT
  job_id,
  job_type,
  description,
  status,
  created,
  finished,
  COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;

关键任务类型：

```
AUTO CREATE STATS
```
：自动表统计信息刷新（对查询优化器至关重要）
```
AUTO SQL STATS COMPACTION
```
：定期清理语句/事务统计信息表

健康指标：

健康状态： 定期成功执行（每几小时一次）
不健康状态： 近期无执行记录，或失败率高
失败影响： 过时的统计信息会导致查询计划不佳和查询缓慢

Query 6: Jobs by Type and Status

查询6：按类型和状态分类的任务

Aggregated view for pattern analysis:

sql

-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  status,
  COUNT(*) AS job_count,
  MIN(created) AS oldest,
  MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;

Use case:

Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
Spot anomalies (e.g., unusual job type volume)
Track job success rates by type

用于模式分析的聚合视图：

sql

-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  status,
  COUNT(*) AS job_count,
  MIN(created) AS oldest,
  MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;

使用场景：

识别模式（如所有BACKUP任务失败、多个模式变更停滞）
发现异常（如异常的任务类型数量）
按类型跟踪任务成功率

Query 7: Backup and Restore Progress

查询7：备份与恢复进度

Track progress of backup/restore jobs:

sql

-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS running_for,
  ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
  CASE
    WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
      ((now() - created) / fraction_completed) - (now() - created)
    ELSE NULL
  END AS estimated_time_remaining,
  running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;

Key columns:

```
percent_complete
```
: Progress percentage (0-100)
```
estimated_time_remaining
```
: Rough estimate based on current progress rate
```
running_status
```
: Detailed status (e.g., "performing backup to s3://...")

Note:

fraction_completed

may be NULL for some job types or early in execution.

跟踪备份/恢复任务的进度：

sql

-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS running_for,
  ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
  CASE
    WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
      ((now() - created) / fraction_completed) - (now() - created)
    ELSE NULL
  END AS estimated_time_remaining,
  running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;

关键列：

```
percent_complete
```
：进度百分比（0-100）
```
estimated_time_remaining
```
：基于当前进度的大致剩余时间估计
```
running_status
```
：详细状态（如“performing backup to s3://...”）

注意： 部分任务类型或执行初期，

fraction_completed

可能为NULL。

Common Workflows

常见工作流

Workflow 1: Schema Change Stuck Investigation

工作流1：排查停滞的模式变更

Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.

Check for running schema changes:

sql

WITH j AS (SHOW JOBS)
SELECT job_id, description, created, now() - created AS running_for,
       fraction_completed, running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
ORDER BY created;

Identify MVCC GC waits:

sql

-- Use Query 4 to find "waiting for MVCC GC" jobs

Interpret results:
- If
```
running_status
```
  = "waiting for MVCC GC": Normal for post-DROP cleanup (wait up to
```
gc.ttlseconds
```
  )
- If long-running with low
```
fraction_completed
```
  : Check for contention, large table size, or resource constraints
- If failed: Check
```
error
```
  column for specific failure reason
Next steps:
- MVCC GC wait: Verify
```
SHOW CLUSTER SETTING gc.ttlseconds
```
  and wait
- Resource constraints: Check cluster CPU/memory usage
- Failed job: Address error (permissions, constraints) and retry operation

场景： 用户反馈ALTER TABLE或CREATE INDEX似乎停滞。

检查运行中的模式变更：

sql

WITH j AS (SHOW JOBS)
SELECT job_id, description, created, now() - created AS running_for,
       fraction_completed, running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
ORDER BY created;

识别MVCC GC等待：

sql

-- 使用查询4查找“waiting for MVCC GC”任务

解读结果：
- 如果
```
running_status
```
  = "waiting for MVCC GC"：DROP操作后的清理属于正常情况（等待时长不超过
```
gc.ttlseconds
```
  ）
- 如果长时间运行且
```
fraction_completed
```
  较低：检查是否存在竞争、表过大或资源受限
- 如果任务失败：查看
```
error
```
  列获取具体失败原因
下一步操作：
- MVCC GC等待：验证
```
SHOW CLUSTER SETTING gc.ttlseconds
```
  并等待
- 资源受限：检查集群CPU/内存使用情况
- 任务失败：解决错误（权限、约束）并重试操作

Workflow 2: Failed Backup Triage

工作流2：排查失败的备份

Scenario: Scheduled backup job failed.

Find recent failed backups:

sql

-- Use Query 1 filtered for BACKUP job type
WITH j AS (SHOW JOBS)
SELECT job_id, description, created, finished, error
FROM j
WHERE status = 'failed'
  AND job_type = 'BACKUP'
  AND created > now() - INTERVAL '24 hours'
ORDER BY created DESC;

Analyze error messages:
- "permission denied": Check external storage credentials
- "timeout": Network connectivity to backup destination
- "no space left": Destination storage full
- "connection refused": External storage endpoint unreachable

Verify backup destination:

sql

-- Check SHOW BACKUP for successful backups to same destination
SHOW BACKUP 's3://bucket/path';

Remediate and retry:
- Fix underlying issue (credentials, storage, network)
- Re-run backup command
- Monitor with Query 7 for progress

场景： 定时备份任务失败。

查找最近失败的备份：

sql

-- 使用查询1筛选BACKUP任务类型
WITH j AS (SHOW JOBS)
SELECT job_id, description, created, finished, error
FROM j
WHERE status = 'failed'
  AND job_type = 'BACKUP'
  AND created > now() - INTERVAL '24 hours'
ORDER BY created DESC;

分析错误信息：
- "permission denied"：检查外部存储凭证
- "timeout"：与备份目标的网络连接问题
- "no space left"：目标存储已满
- "connection refused"：外部存储端点无法访问

验证备份目标：

sql

-- 检查同一目标的成功备份记录
SHOW BACKUP 's3://bucket/path';

修复并重试：
- 修复根本问题（凭证、存储、网络）
- 重新运行备份命令
- 使用查询7监控进度

Workflow 3: Automatic Job Health Check

工作流3：自动任务健康检查

Scenario: Proactive monitoring of automatic background jobs.

Check AUTO CREATE STATS frequency:

sql

-- Use Query 5 to see recent automatic statistics jobs
SELECT job_type, status, COUNT(*) AS job_count,
       MAX(created) AS most_recent
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type = 'AUTO CREATE STATS'
GROUP BY job_type, status;

Expected pattern:
- Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
- Regular AUTO SQL STATS COMPACTION (typically once per hour)
Warning signs:
- No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
- High failure rate: Check cluster resource constraints or permission issues
- No AUTO SQL STATS COMPACTION: Stats table may grow unbounded

Verify settings:

sql

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- Should be true
SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;

场景： 主动监控自动后台任务。

检查AUTO CREATE STATS执行频率：

sql

-- 使用查询5查看最近的自动统计信息任务
SELECT job_type, status, COUNT(*) AS job_count,
       MAX(created) AS most_recent
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type = 'AUTO CREATE STATS'
GROUP BY job_type, status;

预期模式：
- 每天多次成功执行AUTO CREATE STATS任务（取决于表更新频率）
- 定期执行AUTO SQL STATS COMPACTION（通常每小时一次）
警告信号：
- 过去24小时内无AUTO CREATE STATS任务：统计信息收集可能已禁用
- 高失败率：检查集群资源受限或权限问题
- 无AUTO SQL STATS COMPACTION：统计信息表可能无限增长

验证设置：

sql

SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- 应设为true
SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;

Workflow 4: Long-Running Job Monitoring

工作流4：长时间运行任务监控

Scenario: Track progress of expected long-running operations.

Identify long-running jobs:

sql

-- Use Query 2 with custom threshold
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, description,
       now() - created AS running_for,
       fraction_completed
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '30 minutes'
ORDER BY created;

Monitor progress over time:

sql

-- Re-run every 10-15 minutes, track fraction_completed changes
-- Example: 0.25 → 0.40 → 0.55 indicates steady progress

Estimate completion:

sql

-- Use Query 7 for backup/restore jobs with time estimates

Decide on action:
- Steady progress: Continue monitoring
- Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
- Failed: Use Query 1 to check error

场景： 跟踪预期长时间运行操作的进度。

识别长时间运行的任务：

sql

-- 使用查询2并自定义阈值
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, description,
       now() - created AS running_for,
       fraction_completed
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '30 minutes'
ORDER BY created;

随时间监控进度：

sql

-- 每10-15分钟重新运行一次，跟踪fraction_completed的变化
-- 示例：0.25 → 0.40 → 0.55表示进度稳定

估计完成时间：

sql

-- 使用查询7获取备份/恢复任务的时间估计

决定操作：
- 进度稳定：继续监控
- 进度停滞（fraction_completed未增加）：使用triaging-live-sql-activity排查
- 任务失败：使用查询1检查错误

Safety Considerations

安全注意事项

Read-only operations (all diagnostic queries): All

SHOW JOBS

and

SHOW AUTOMATIC JOBS

queries are read-only and safe to run in production. No performance impact on cluster operations.

Job control operations (opt-in):

CAUTION: Pausing or canceling jobs can have data integrity implications

Only proceed with job control if:

You have
```
CONTROLJOB
```
role option
You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
You have authorization to interrupt cluster operations
You've verified the job is truly problematic (not just slow)

Job control commands:

sql

-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;

-- Resume a paused job
RESUME JOB <job_id>;

-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;

Risks by job type:

SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
RESTORE: Canceling may leave database partially restored; requires cleanup
AUTO CREATE STATS: Canceling is safe (will retry later automatically)

Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.

See permissions reference for CONTROLJOB role option setup.

只读操作（所有诊断查询）： 所有

SHOW JOBS

和

SHOW AUTOMATIC JOBS

查询均为只读，可安全在生产环境运行，不会对集群操作造成性能影响。

任务控制操作（可选）：

注意：暂停或取消任务可能影响数据完整性

仅在以下情况下执行任务控制：

你拥有
```
CONTROLJOB
```
角色选项
你了解潜在影响（如中途取消模式变更可能需要手动清理）
你有权中断集群操作
你已确认任务确实存在问题（并非只是运行缓慢）

任务控制命令：

sql

-- 暂停运行中的任务（之后可恢复）
PAUSE JOB <job_id>;

-- 恢复暂停的任务
RESUME JOB <job_id>;

-- 取消任务（终端操作 - 无法恢复）
CANCEL JOB <job_id>;

按任务类型划分的风险：

SCHEMA CHANGE： 取消可能导致模式处于不一致状态；优先选择暂停并排查问题
BACKUP： 取消是安全的（可重试）；临时问题优先选择暂停
RESTORE： 取消可能导致数据库部分恢复；需要清理
AUTO CREATE STATS： 取消是安全的（系统会自动重试）

最佳实践： 重点关注监控和诊断；仅在明确需要且获得授权时使用控制操作。

CONTROLJOB

角色选项设置请参考权限参考文档。

Troubleshooting

故障排查

Issue	Cause	Fix
`SHOW JOBS` returns empty	No jobs in last 12h, or insufficient privileges	Grant `VIEWJOB` privilege; verify cluster has recent job activity
"waiting for MVCC GC" for many hours	Normal behavior for SCHEMA CHANGE GC after DROP operations	Wait up to `gc.ttlseconds` (default 25h); check `SHOW CLUSTER SETTING gc.ttlseconds`
Can't pause/resume job: "permission denied"	Missing `CONTROLJOB` role option	Use `ALTER ROLE <username> WITH CONTROLJOB` (not GRANT SYSTEM)
Job stuck at same `fraction_completed`	Job may be processing large batch, or actually stuck	Wait 15-30 min and re-check; if no change, investigate with live query triage
No AUTO CREATE STATS jobs	Automatic collection disabled	Check `sql.stats.automatic_collection.enabled = true`
`SHOW AUTOMATIC JOBS` shows old jobs only	Need to filter by time window	Add `WHERE created > now() - INTERVAL '24 hours'`
Failed job with "schema change GC" error	Expected for post-DROP cleanup failures	Usually safe to ignore; job will retry automatically
Job error: "concurrent schema change"	Multiple schema changes on same table	Wait for first schema change to complete, then retry

问题	原因	解决方法
`SHOW JOBS` 返回空结果	最近12小时无任务，或权限不足	授予 `VIEWJOB` 权限；验证集群近期有任务活动
"waiting for MVCC GC"持续数小时	DROP操作后SCHEMA CHANGE GC的正常行为	等待时长不超过 `gc.ttlseconds` （默认25小时）；使用 `SHOW CLUSTER SETTING gc.ttlseconds` 查看设置
无法暂停/恢复任务："permission denied"	缺少 `CONTROLJOB` 角色选项	使用 `ALTER ROLE <username> WITH CONTROLJOB` （非GRANT SYSTEM）
任务的 `fraction_completed` 保持不变	任务可能在处理大数据批，或确实停滞	等待15-30分钟后重新检查；若无变化，使用实时查询排查工具
无AUTO CREATE STATS任务	自动收集已禁用	检查 `sql.stats.automatic_collection.enabled = true`
`SHOW AUTOMATIC JOBS` 仅显示旧任务	需要按时间窗口筛选	添加 `WHERE created > now() - INTERVAL '24 hours'`
任务出现"schema change GC"错误	DROP操作后清理失败的预期情况	通常可忽略；任务会自动重试
任务错误："concurrent schema change"	同一表存在多个模式变更	等待第一个模式变更完成后重试

Key Considerations

关键注意事项

Jobs vs queries: Background jobs execute statements that don't appear in SHOW STATEMENTS or statement statistics
Time windows: SHOW JOBS default 12h retention; use
```
crdb_internal.jobs
```
for up to 14 days
MVCC GC waiting: Normal and expected for post-DROP cleanup; duration tied to
```
gc.ttlseconds
```
LIMIT clauses: Always include for production safety (prevents overwhelming output)
Privilege model: VIEWJOB (system privilege) for read-only; CONTROLJOB (role option) for control
Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
Progress estimates:
```
fraction_completed
```
may be NULL or sparse for some job types
Job control risks: PAUSE is safer than CANCEL; some cancellations require manual cleanup

任务与查询： 后台任务执行的语句不会出现在SHOW STATEMENTS或语句统计信息中
时间窗口： SHOW JOBS默认保留12小时；使用
```
crdb_internal.jobs
```
可查看最多14天的任务
MVCC GC等待： DROP操作后的清理属于正常预期行为；时长与
```
gc.ttlseconds
```
相关
LIMIT子句： 生产环境中务必包含该子句（防止输出过多）
权限模型： VIEWJOB（系统权限）用于只读访问；CONTROLJOB（角色选项）用于控制操作
自动任务健康： 定期执行AUTO CREATE STATS对查询优化器性能至关重要
进度估计： 部分任务类型的
```
fraction_completed
```
可能为NULL或不连续
任务控制风险： 暂停比取消更安全；部分取消操作需要手动清理

References

参考资料

Skill references:

RBAC and privilege setup
Job states and transitions
Job types catalog
SQL query variations

Related skills:

triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
profiling-statement-fingerprints - For historical query analysis (background jobs excluded)

Official CockroachDB Documentation:

技能参考：

RBAC与权限设置
任务状态与转换
任务类型目录
SQL查询变体

相关技能：

triaging-live-sql-activity - 用于实时查询监控（任务执行的语句可能不会显示）
profiling-statement-fingerprints - 用于历史查询分析（不包含后台任务）

官方CockroachDB文档：