monitoring-background-jobs

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Monitoring Background Jobs

监控后台任务

Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
通过识别与用户查询不同的失败、暂停和长时间运行的任务,监控后台任务的健康状态。使用纯SQL接口(SHOW JOBS和SHOW AUTOMATIC JOBS),无需访问DB Console即可查看模式变更、备份/恢复、自动统计信息收集以及SQL统计信息压缩任务。

When to Use This Skill

何时使用该技能

  • Schema changes appear stuck or delayed (ALTER TABLE, CREATE INDEX, DROP operations)
  • Backups or restores are failing or taking longer than expected
  • Need to verify automatic statistics collection is running
  • Investigating "waiting for MVCC GC" status in schema change cleanup
  • Troubleshooting failed jobs without DB Console access
  • Monitoring long-running operations that don't appear in query metrics
For live query monitoring: Use triaging-live-sql-activity to monitor currently executing user queries. Note that background jobs execute statements that may not appear in SHOW CLUSTER STATEMENTS.
For historical query analysis: Use profiling-statement-fingerprints for query pattern trends. Note that background jobs are excluded from statement statistics.
  • 模式变更(ALTER TABLE、CREATE INDEX、DROP操作)出现停滞或延迟
  • 备份或恢复失败或耗时超出预期
  • 需要验证自动统计信息收集是否在运行
  • 排查模式变更清理中的“waiting for MVCC GC”状态
  • 在无DB Console访问权限时排查失败任务
  • 监控未出现在查询指标中的长时间运行操作
实时查询监控: 使用triaging-live-sql-activity监控当前正在执行的用户查询。注意,后台任务执行的语句可能不会出现在SHOW CLUSTER STATEMENTS中。
历史查询分析: 使用profiling-statement-fingerprints分析查询模式趋势。注意,后台任务不包含在语句统计信息中。

Prerequisites

前提条件

Required SQL access:
  • Connection to any CockroachDB node
  • For cluster-wide job visibility:
    VIEWJOB
    system privilege (read-only monitoring)
  • For job control operations:
    CONTROLJOB
    role option (pause/cancel/resume jobs)
  • Without these: Limited visibility into jobs you created
Check your privileges:
sql
SHOW GRANTS ON ROLE <username>;
Look for:
  • VIEWJOB
    in the
    privilege_type
    column (system privilege)
  • CONTROLJOB
    in role options (check with
    SHOW USERS
    )
See permissions reference for detailed RBAC setup.
所需SQL权限:
  • 连接到任意CockroachDB节点
  • 集群级任务可见性:
    VIEWJOB
    系统权限(只读监控)
  • 任务控制操作:
    CONTROLJOB
    角色选项(暂停/取消/恢复任务)
  • 无上述权限:仅能查看自己创建的任务
检查你的权限:
sql
SHOW GRANTS ON ROLE <username>;
查找:
  • privilege_type
    列中的
    VIEWJOB
    (系统权限)
  • 角色选项中的
    CONTROLJOB
    (使用
    SHOW USERS
    查看)
详细RBAC设置请参考权限参考文档

Core Concepts

核心概念

Jobs vs Statements

任务与语句

Key distinction:
  • Statements: User-initiated SQL queries tracked by SHOW CLUSTER STATEMENTS and statement statistics
  • Background jobs: Long-running operations tracked separately by SHOW JOBS
Background jobs are excluded from:
  • SHOW CLUSTER STATEMENTS
    (live query monitoring)
  • crdb_internal.statement_statistics
    (historical query analysis)
  • Statement fingerprint metrics and DB Console Statements page
Common job types:
CategoryJob TypesExamples
User-initiatedSCHEMA CHANGE, BACKUP, RESTORE, IMPORT, CHANGEFEEDALTER TABLE, CREATE INDEX, BACKUP DATABASE, RESTORE
AutomaticSCHEMA CHANGE GC, AUTO CREATE STATS, AUTO SQL STATS COMPACTIONPost-DROP cleanup, table statistics refresh, stats table maintenance
See job types reference for complete catalog.
关键区别:
  • 语句: 用户发起的SQL查询,由SHOW CLUSTER STATEMENTS和语句统计信息跟踪
  • 后台任务: 长时间运行的操作,由SHOW JOBS单独跟踪
后台任务不包含在以下内容中:
  • SHOW CLUSTER STATEMENTS
    (实时查询监控)
  • crdb_internal.statement_statistics
    (历史查询分析)
  • 语句指纹指标和DB Console的语句页面
常见任务类型:
类别任务类型示例
用户发起SCHEMA CHANGE、BACKUP、RESTORE、IMPORT、CHANGEFEEDALTER TABLE、CREATE INDEX、BACKUP DATABASE、RESTORE
自动任务SCHEMA CHANGE GC、AUTO CREATE STATS、AUTO SQL STATS COMPACTIONDROP操作后的清理、表统计信息刷新、统计信息表维护
完整任务类型目录请参考任务类型参考文档

SHOW JOBS vs SHOW AUTOMATIC JOBS

SHOW JOBS与SHOW AUTOMATIC JOBS对比

InterfaceScopeTime WindowUse Case
SHOW JOBS
User-initiated + automaticLast 12 hours (default)Monitor backups, schema changes, user operations
SHOW AUTOMATIC JOBS
Automatic onlyConfigurable (recommend 24h)Monitor AUTO CREATE STATS, AUTO SQL STATS COMPACTION, SCHEMA CHANGE GC
Time retention:
  • Default retention: 14 days in
    crdb_internal.jobs
    table
  • SHOW JOBS
    display window: 12 hours (configurable with
    SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ...
    )
  • SHOW AUTOMATIC JOBS
    display window: Configurable with
    WHERE created > now() - INTERVAL '...'
接口范围时间窗口使用场景
SHOW JOBS
用户发起+自动任务默认最近12小时监控备份、模式变更、用户操作
SHOW AUTOMATIC JOBS
仅自动任务可配置(推荐24小时)监控AUTO CREATE STATS、AUTO SQL STATS COMPACTION、SCHEMA CHANGE GC
时间保留规则:
  • 默认保留:
    crdb_internal.jobs
    表中保留14天
  • SHOW JOBS
    显示窗口:12小时(可通过
    SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ...
    配置)
  • SHOW AUTOMATIC JOBS
    显示窗口:可通过
    WHERE created > now() - INTERVAL '...'
    配置

Job Status Values

任务状态值

StatusMeaningAction Required
running
Job is actively executingMonitor progress via
fraction_completed
succeeded
Job completed successfullyNone
failed
Job encountered an errorInvestigate
error
column, may need to retry
paused
Job manually pausedResume with
RESUME JOB
if appropriate
canceled
Job was canceled (terminal state)Retry operation if needed
pending
Job queued but not startedMonitor; may indicate resource constraints
reverting
Job failed and is rolling back changesWait for completion; check error after
Running status sub-states:
  • performing backup
    : Backup job actively transferring data
  • restoring
    : Restore job actively applying data
  • waiting for MVCC GC
    : SCHEMA CHANGE GC waiting for garbage collection eligibility
See job states reference for detailed state transitions and "waiting for MVCC GC" explanation.
状态含义所需操作
running
任务正在执行通过
fraction_completed
监控进度
succeeded
任务成功完成
failed
任务遇到错误查看
error
列,可能需要重试
paused
任务被手动暂停若合适,使用
RESUME JOB
恢复
canceled
任务已被取消(终端状态)若需要,重新执行操作
pending
任务已排队但未启动监控;可能表示资源受限
reverting
任务失败并正在回滚变更等待完成;之后检查错误
运行状态子状态:
  • performing backup
    :备份任务正在传输数据
  • restoring
    :恢复任务正在应用数据
  • waiting for MVCC GC
    :SCHEMA CHANGE GC等待垃圾回收资格
详细状态转换和“waiting for MVCC GC”解释请参考任务状态参考文档

Core Diagnostic Queries

核心诊断查询

Query 1: Failed Jobs (Last 12 Hours)

查询1:最近12小时内的失败任务

Identify jobs that failed with error messages:
sql
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  finished,
  now() - created AS total_duration,
  error
FROM j
WHERE status = 'failed'
  AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;
Key columns:
  • error
    : Failure reason (check for permission errors, disk space, network issues)
  • description
    : Human-readable description of what the job was doing
  • total_duration
    : How long the job ran before failing
Common failure patterns:
  • Permission denied: User lacks required privileges
  • Disk space: Backup destination full
  • Network timeout: External storage unreachable
  • Constraint violation: Restore conflicts with existing data
识别带有错误信息的失败任务:
sql
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  finished,
  now() - created AS total_duration,
  error
FROM j
WHERE status = 'failed'
  AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;
关键列:
  • error
    :失败原因(检查权限错误、磁盘空间、网络问题)
  • description
    :任务执行内容的可读描述
  • total_duration
    :任务失败前运行的时长
常见失败模式:
  • 权限拒绝:用户缺少所需权限
  • 磁盘空间不足:备份目标存储已满
  • 网络超时:外部存储无法访问
  • 约束冲突:恢复与现有数据冲突

Query 2: Long-Running Jobs

查询2:长时间运行的任务

Find jobs running longer than expected threshold:
sql
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  status,
  running_status,
  created,
  now() - created AS running_for,
  fraction_completed,
  coordinator_id
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;
Key columns:
  • running_for
    : Total elapsed time since job started
  • fraction_completed
    : Progress estimate (0.0 to 1.0, NULL if unavailable)
  • running_status
    : Sub-state details (e.g., "waiting for MVCC GC")
Customizable thresholds:
  • Schema changes: 30 minutes to several hours (depends on table size)
  • Backups: 1-6+ hours (depends on data volume)
  • Automatic jobs: Usually < 30 minutes
查找运行时长超出预期阈值的任务:
sql
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  status,
  running_status,
  created,
  now() - created AS running_for,
  fraction_completed,
  coordinator_id
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;
关键列:
  • running_for
    :任务启动后的总耗时
  • fraction_completed
    :进度估计值(0.0到1.0,不可用时为NULL)
  • running_status
    :子状态详情(如“waiting for MVCC GC”)
可自定义阈值:
  • 模式变更:30分钟到数小时(取决于表大小)
  • 备份:1-6小时以上(取决于数据量)
  • 自动任务:通常少于30分钟

Query 3: Paused Jobs

查询3:暂停的任务

Identify jobs that are paused and may need attention:
sql
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS paused_for,
  coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;
Action required: Resume with
RESUME JOB <job_id>
after verifying the pause reason.
Common reasons for paused jobs:
  • Manual user pause for maintenance
  • Resource constraints (cluster paused the job)
  • Error requiring manual intervention
识别已暂停且可能需要关注的任务:
sql
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS paused_for,
  coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;
所需操作: 确认暂停原因后,使用
RESUME JOB <job_id>
恢复任务。
任务暂停的常见原因:
  • 用户为维护手动暂停
  • 资源受限(集群暂停任务)
  • 需要手动干预的错误

Query 4: Schema Changes Waiting for MVCC GC

查询4:等待MVCC GC的模式变更任务

Find SCHEMA CHANGE GC jobs waiting for garbage collection:
sql
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS waiting_for,
  running_status
FROM j
WHERE status = 'running'
  AND job_type = 'SCHEMA CHANGE GC'
  AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;
Interpretation:
  • Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on
    gc.ttlseconds
    setting (default 25 hours)
  • Expected duration: Up to
    gc.ttlseconds
    + some overhead
  • When to worry: Waiting > 2x
    gc.ttlseconds
    (check setting with
    SHOW CLUSTER SETTING gc.ttlseconds
    )
Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.
See job states reference for detailed MVCC GC explanation.
查找等待垃圾回收的SCHEMA CHANGE GC任务:
sql
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS waiting_for,
  running_status
FROM j
WHERE status = 'running'
  AND job_type = 'SCHEMA CHANGE GC'
  AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;
解读:
  • 正常情况: SCHEMA CHANGE GC任务会根据
    gc.ttlseconds
    设置(默认25小时)等待数据变为可垃圾回收状态
  • 预期时长: 最多
    gc.ttlseconds
    加上一些额外开销
  • 需关注的情况: 等待时长超过
    gc.ttlseconds
    的2倍(使用
    SHOW CLUSTER SETTING gc.ttlseconds
    查看设置)
原因: 执行DROP TABLE/INDEX操作后,CockroachDB必须等待所有基于旧时间戳的读取完成,才能物理删除数据。这可以防止“时间旅行”查询失败。
详细MVCC GC解释请参考任务状态参考文档

Query 5: Automatic Job Health (24h Window)

查询5:自动任务健康状态(24小时窗口)

Monitor automatic background jobs like statistics collection:
sql
-- Automatic jobs in last 24 hours
SELECT
  job_id,
  job_type,
  description,
  status,
  created,
  finished,
  COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;
Key job types:
  • AUTO CREATE STATS
    : Automatic table statistics refresh (critical for query optimizer)
  • AUTO SQL STATS COMPACTION
    : Periodic cleanup of statement/transaction statistics tables
Health indicators:
  • Healthy: Regular successful executions (every few hours)
  • Unhealthy: No recent executions, or high failure rate
  • Impact of failure: Stale statistics lead to poor query plans and slow queries
监控自动后台任务,如统计信息收集:
sql
-- Automatic jobs in last 24 hours
SELECT
  job_id,
  job_type,
  description,
  status,
  created,
  finished,
  COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;
关键任务类型:
  • AUTO CREATE STATS
    :自动表统计信息刷新(对查询优化器至关重要)
  • AUTO SQL STATS COMPACTION
    :定期清理语句/事务统计信息表
健康指标:
  • 健康状态: 定期成功执行(每几小时一次)
  • 不健康状态: 近期无执行记录,或失败率高
  • 失败影响: 过时的统计信息会导致查询计划不佳和查询缓慢

Query 6: Jobs by Type and Status

查询6:按类型和状态分类的任务

Aggregated view for pattern analysis:
sql
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  status,
  COUNT(*) AS job_count,
  MIN(created) AS oldest,
  MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;
Use case:
  • Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
  • Spot anomalies (e.g., unusual job type volume)
  • Track job success rates by type
用于模式分析的聚合视图:
sql
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  status,
  COUNT(*) AS job_count,
  MIN(created) AS oldest,
  MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;
使用场景:
  • 识别模式(如所有BACKUP任务失败、多个模式变更停滞)
  • 发现异常(如异常的任务类型数量)
  • 按类型跟踪任务成功率

Query 7: Backup and Restore Progress

查询7:备份与恢复进度

Track progress of backup/restore jobs:
sql
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS running_for,
  ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
  CASE
    WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
      ((now() - created) / fraction_completed) - (now() - created)
    ELSE NULL
  END AS estimated_time_remaining,
  running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;
Key columns:
  • percent_complete
    : Progress percentage (0-100)
  • estimated_time_remaining
    : Rough estimate based on current progress rate
  • running_status
    : Detailed status (e.g., "performing backup to s3://...")
Note:
fraction_completed
may be NULL for some job types or early in execution.
跟踪备份/恢复任务的进度:
sql
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS running_for,
  ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
  CASE
    WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
      ((now() - created) / fraction_completed) - (now() - created)
    ELSE NULL
  END AS estimated_time_remaining,
  running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;
关键列:
  • percent_complete
    :进度百分比(0-100)
  • estimated_time_remaining
    :基于当前进度的大致剩余时间估计
  • running_status
    :详细状态(如“performing backup to s3://...”)
注意: 部分任务类型或执行初期,
fraction_completed
可能为NULL。

Common Workflows

常见工作流

Workflow 1: Schema Change Stuck Investigation

工作流1:排查停滞的模式变更

Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.
  1. Check for running schema changes:
    sql
    WITH j AS (SHOW JOBS)
    SELECT job_id, description, created, now() - created AS running_for,
           fraction_completed, running_status
    FROM j
    WHERE status = 'running'
      AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
    ORDER BY created;
  2. Identify MVCC GC waits:
    sql
    -- Use Query 4 to find "waiting for MVCC GC" jobs
  3. Interpret results:
    • If
      running_status
      = "waiting for MVCC GC": Normal for post-DROP cleanup (wait up to
      gc.ttlseconds
      )
    • If long-running with low
      fraction_completed
      : Check for contention, large table size, or resource constraints
    • If failed: Check
      error
      column for specific failure reason
  4. Next steps:
    • MVCC GC wait: Verify
      SHOW CLUSTER SETTING gc.ttlseconds
      and wait
    • Resource constraints: Check cluster CPU/memory usage
    • Failed job: Address error (permissions, constraints) and retry operation
场景: 用户反馈ALTER TABLE或CREATE INDEX似乎停滞。
  1. 检查运行中的模式变更:
    sql
    WITH j AS (SHOW JOBS)
    SELECT job_id, description, created, now() - created AS running_for,
           fraction_completed, running_status
    FROM j
    WHERE status = 'running'
      AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
    ORDER BY created;
  2. 识别MVCC GC等待:
    sql
    -- 使用查询4查找“waiting for MVCC GC”任务
  3. 解读结果:
    • 如果
      running_status
      = "waiting for MVCC GC":DROP操作后的清理属于正常情况(等待时长不超过
      gc.ttlseconds
    • 如果长时间运行且
      fraction_completed
      较低:检查是否存在竞争、表过大或资源受限
    • 如果任务失败:查看
      error
      列获取具体失败原因
  4. 下一步操作:
    • MVCC GC等待:验证
      SHOW CLUSTER SETTING gc.ttlseconds
      并等待
    • 资源受限:检查集群CPU/内存使用情况
    • 任务失败:解决错误(权限、约束)并重试操作

Workflow 2: Failed Backup Triage

工作流2:排查失败的备份

Scenario: Scheduled backup job failed.
  1. Find recent failed backups:
    sql
    -- Use Query 1 filtered for BACKUP job type
    WITH j AS (SHOW JOBS)
    SELECT job_id, description, created, finished, error
    FROM j
    WHERE status = 'failed'
      AND job_type = 'BACKUP'
      AND created > now() - INTERVAL '24 hours'
    ORDER BY created DESC;
  2. Analyze error messages:
    • "permission denied": Check external storage credentials
    • "timeout": Network connectivity to backup destination
    • "no space left": Destination storage full
    • "connection refused": External storage endpoint unreachable
  3. Verify backup destination:
    sql
    -- Check SHOW BACKUP for successful backups to same destination
    SHOW BACKUP 's3://bucket/path';
  4. Remediate and retry:
    • Fix underlying issue (credentials, storage, network)
    • Re-run backup command
    • Monitor with Query 7 for progress
场景: 定时备份任务失败。
  1. 查找最近失败的备份:
    sql
    -- 使用查询1筛选BACKUP任务类型
    WITH j AS (SHOW JOBS)
    SELECT job_id, description, created, finished, error
    FROM j
    WHERE status = 'failed'
      AND job_type = 'BACKUP'
      AND created > now() - INTERVAL '24 hours'
    ORDER BY created DESC;
  2. 分析错误信息:
    • "permission denied":检查外部存储凭证
    • "timeout":与备份目标的网络连接问题
    • "no space left":目标存储已满
    • "connection refused":外部存储端点无法访问
  3. 验证备份目标:
    sql
    -- 检查同一目标的成功备份记录
    SHOW BACKUP 's3://bucket/path';
  4. 修复并重试:
    • 修复根本问题(凭证、存储、网络)
    • 重新运行备份命令
    • 使用查询7监控进度

Workflow 3: Automatic Job Health Check

工作流3:自动任务健康检查

Scenario: Proactive monitoring of automatic background jobs.
  1. Check AUTO CREATE STATS frequency:
    sql
    -- Use Query 5 to see recent automatic statistics jobs
    SELECT job_type, status, COUNT(*) AS job_count,
           MAX(created) AS most_recent
    FROM [SHOW AUTOMATIC JOBS]
    WHERE created > now() - INTERVAL '24 hours'
      AND job_type = 'AUTO CREATE STATS'
    GROUP BY job_type, status;
  2. Expected pattern:
    • Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
    • Regular AUTO SQL STATS COMPACTION (typically once per hour)
  3. Warning signs:
    • No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
    • High failure rate: Check cluster resource constraints or permission issues
    • No AUTO SQL STATS COMPACTION: Stats table may grow unbounded
  4. Verify settings:
    sql
    SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- Should be true
    SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
场景: 主动监控自动后台任务。
  1. 检查AUTO CREATE STATS执行频率:
    sql
    -- 使用查询5查看最近的自动统计信息任务
    SELECT job_type, status, COUNT(*) AS job_count,
           MAX(created) AS most_recent
    FROM [SHOW AUTOMATIC JOBS]
    WHERE created > now() - INTERVAL '24 hours'
      AND job_type = 'AUTO CREATE STATS'
    GROUP BY job_type, status;
  2. 预期模式:
    • 每天多次成功执行AUTO CREATE STATS任务(取决于表更新频率)
    • 定期执行AUTO SQL STATS COMPACTION(通常每小时一次)
  3. 警告信号:
    • 过去24小时内无AUTO CREATE STATS任务:统计信息收集可能已禁用
    • 高失败率:检查集群资源受限或权限问题
    • 无AUTO SQL STATS COMPACTION:统计信息表可能无限增长
  4. 验证设置:
    sql
    SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;  -- 应设为true
    SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;

Workflow 4: Long-Running Job Monitoring

工作流4:长时间运行任务监控

Scenario: Track progress of expected long-running operations.
  1. Identify long-running jobs:
    sql
    -- Use Query 2 with custom threshold
    WITH j AS (SHOW JOBS)
    SELECT job_id, job_type, description,
           now() - created AS running_for,
           fraction_completed
    FROM j
    WHERE status = 'running'
      AND created < now() - INTERVAL '30 minutes'
    ORDER BY created;
  2. Monitor progress over time:
    sql
    -- Re-run every 10-15 minutes, track fraction_completed changes
    -- Example: 0.25 → 0.40 → 0.55 indicates steady progress
  3. Estimate completion:
    sql
    -- Use Query 7 for backup/restore jobs with time estimates
  4. Decide on action:
    • Steady progress: Continue monitoring
    • Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
    • Failed: Use Query 1 to check error
场景: 跟踪预期长时间运行操作的进度。
  1. 识别长时间运行的任务:
    sql
    -- 使用查询2并自定义阈值
    WITH j AS (SHOW JOBS)
    SELECT job_id, job_type, description,
           now() - created AS running_for,
           fraction_completed
    FROM j
    WHERE status = 'running'
      AND created < now() - INTERVAL '30 minutes'
    ORDER BY created;
  2. 随时间监控进度:
    sql
    -- 每10-15分钟重新运行一次,跟踪fraction_completed的变化
    -- 示例:0.25 → 0.40 → 0.55表示进度稳定
  3. 估计完成时间:
    sql
    -- 使用查询7获取备份/恢复任务的时间估计
  4. 决定操作:
    • 进度稳定:继续监控
    • 进度停滞(fraction_completed未增加):使用triaging-live-sql-activity排查
    • 任务失败:使用查询1检查错误

Safety Considerations

安全注意事项

Read-only operations (all diagnostic queries): All
SHOW JOBS
and
SHOW AUTOMATIC JOBS
queries are read-only and safe to run in production. No performance impact on cluster operations.
Job control operations (opt-in):
CAUTION: Pausing or canceling jobs can have data integrity implications
Only proceed with job control if:
  • You have
    CONTROLJOB
    role option
  • You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
  • You have authorization to interrupt cluster operations
  • You've verified the job is truly problematic (not just slow)
Job control commands:
sql
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;

-- Resume a paused job
RESUME JOB <job_id>;

-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;
Risks by job type:
  • SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
  • BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
  • RESTORE: Canceling may leave database partially restored; requires cleanup
  • AUTO CREATE STATS: Canceling is safe (will retry later automatically)
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
只读操作(所有诊断查询): 所有
SHOW JOBS
SHOW AUTOMATIC JOBS
查询均为只读,可安全在生产环境运行,不会对集群操作造成性能影响。
任务控制操作(可选):
注意:暂停或取消任务可能影响数据完整性
仅在以下情况下执行任务控制:
  • 你拥有
    CONTROLJOB
    角色选项
  • 你了解潜在影响(如中途取消模式变更可能需要手动清理)
  • 你有权中断集群操作
  • 你已确认任务确实存在问题(并非只是运行缓慢)
任务控制命令:
sql
-- 暂停运行中的任务(之后可恢复)
PAUSE JOB <job_id>;

-- 恢复暂停的任务
RESUME JOB <job_id>;

-- 取消任务(终端操作 - 无法恢复)
CANCEL JOB <job_id>;
按任务类型划分的风险:
  • SCHEMA CHANGE: 取消可能导致模式处于不一致状态;优先选择暂停并排查问题
  • BACKUP: 取消是安全的(可重试);临时问题优先选择暂停
  • RESTORE: 取消可能导致数据库部分恢复;需要清理
  • AUTO CREATE STATS: 取消是安全的(系统会自动重试)
最佳实践: 重点关注监控和诊断;仅在明确需要且获得授权时使用控制操作。
CONTROLJOB
角色选项设置请参考权限参考文档

Troubleshooting

故障排查

IssueCauseFix
SHOW JOBS
returns empty
No jobs in last 12h, or insufficient privilegesGrant
VIEWJOB
privilege; verify cluster has recent job activity
"waiting for MVCC GC" for many hoursNormal behavior for SCHEMA CHANGE GC after DROP operationsWait up to
gc.ttlseconds
(default 25h); check
SHOW CLUSTER SETTING gc.ttlseconds
Can't pause/resume job: "permission denied"Missing
CONTROLJOB
role option
Use
ALTER ROLE <username> WITH CONTROLJOB
(not GRANT SYSTEM)
Job stuck at same
fraction_completed
Job may be processing large batch, or actually stuckWait 15-30 min and re-check; if no change, investigate with live query triage
No AUTO CREATE STATS jobsAutomatic collection disabledCheck
sql.stats.automatic_collection.enabled = true
SHOW AUTOMATIC JOBS
shows old jobs only
Need to filter by time windowAdd
WHERE created > now() - INTERVAL '24 hours'
Failed job with "schema change GC" errorExpected for post-DROP cleanup failuresUsually safe to ignore; job will retry automatically
Job error: "concurrent schema change"Multiple schema changes on same tableWait for first schema change to complete, then retry
问题原因解决方法
SHOW JOBS
返回空结果
最近12小时无任务,或权限不足授予
VIEWJOB
权限;验证集群近期有任务活动
"waiting for MVCC GC"持续数小时DROP操作后SCHEMA CHANGE GC的正常行为等待时长不超过
gc.ttlseconds
(默认25小时);使用
SHOW CLUSTER SETTING gc.ttlseconds
查看设置
无法暂停/恢复任务:"permission denied"缺少
CONTROLJOB
角色选项
使用
ALTER ROLE <username> WITH CONTROLJOB
(非GRANT SYSTEM)
任务的
fraction_completed
保持不变
任务可能在处理大数据批,或确实停滞等待15-30分钟后重新检查;若无变化,使用实时查询排查工具
无AUTO CREATE STATS任务自动收集已禁用检查
sql.stats.automatic_collection.enabled = true
SHOW AUTOMATIC JOBS
仅显示旧任务
需要按时间窗口筛选添加
WHERE created > now() - INTERVAL '24 hours'
任务出现"schema change GC"错误DROP操作后清理失败的预期情况通常可忽略;任务会自动重试
任务错误:"concurrent schema change"同一表存在多个模式变更等待第一个模式变更完成后重试

Key Considerations

关键注意事项

  • Jobs vs queries: Background jobs execute statements that don't appear in SHOW STATEMENTS or statement statistics
  • Time windows: SHOW JOBS default 12h retention; use
    crdb_internal.jobs
    for up to 14 days
  • MVCC GC waiting: Normal and expected for post-DROP cleanup; duration tied to
    gc.ttlseconds
  • LIMIT clauses: Always include for production safety (prevents overwhelming output)
  • Privilege model: VIEWJOB (system privilege) for read-only; CONTROLJOB (role option) for control
  • Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
  • Progress estimates:
    fraction_completed
    may be NULL or sparse for some job types
  • Job control risks: PAUSE is safer than CANCEL; some cancellations require manual cleanup
  • 任务与查询: 后台任务执行的语句不会出现在SHOW STATEMENTS或语句统计信息中
  • 时间窗口: SHOW JOBS默认保留12小时;使用
    crdb_internal.jobs
    可查看最多14天的任务
  • MVCC GC等待: DROP操作后的清理属于正常预期行为;时长与
    gc.ttlseconds
    相关
  • LIMIT子句: 生产环境中务必包含该子句(防止输出过多)
  • 权限模型: VIEWJOB(系统权限)用于只读访问;CONTROLJOB(角色选项)用于控制操作
  • 自动任务健康: 定期执行AUTO CREATE STATS对查询优化器性能至关重要
  • 进度估计: 部分任务类型的
    fraction_completed
    可能为NULL或不连续
  • 任务控制风险: 暂停比取消更安全;部分取消操作需要手动清理

References

参考资料

Skill references:
  • RBAC and privilege setup
  • Job states and transitions
  • Job types catalog
  • SQL query variations
Related skills:
  • triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
  • profiling-statement-fingerprints - For historical query analysis (background jobs excluded)
Official CockroachDB Documentation:
技能参考:
  • RBAC与权限设置
  • 任务状态与转换
  • 任务类型目录
  • SQL查询变体
相关技能:
  • triaging-live-sql-activity - 用于实时查询监控(任务执行的语句可能不会显示)
  • profiling-statement-fingerprints - 用于历史查询分析(不包含后台任务)
官方CockroachDB文档: