# Monitoring Background Jobs

Monitors background job health by identifying failed, paused, and long-running jobs, which are tracked separately from user queries. Uses SQL-only interfaces (`SHOW JOBS` and `SHOW AUTOMATIC JOBS`) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
## When to Use This Skill
- Schema changes appear stuck or delayed (ALTER TABLE, CREATE INDEX, DROP operations)
- Backups or restores are failing or taking longer than expected
- Need to verify automatic statistics collection is running
- Investigating "waiting for MVCC GC" status in schema change cleanup
- Troubleshooting failed jobs without DB Console access
- Monitoring long-running operations that don't appear in query metrics
**For live query monitoring**: Use triaging-live-sql-activity to monitor currently executing user queries. Note that background jobs execute statements that may not appear in `SHOW CLUSTER STATEMENTS`.

**For historical query analysis**: Use profiling-statement-fingerprints for query pattern trends. Note that background jobs are excluded from statement statistics.
## Prerequisites

Required SQL access:

- Connection to any CockroachDB node
- For cluster-wide job visibility: `VIEWJOB` system privilege (read-only monitoring)
- For job control operations: `CONTROLJOB` role option (pause/cancel/resume jobs)
- Without these: limited visibility into only the jobs you created

Check your privileges:

```sql
SHOW GRANTS ON ROLE <username>;
```

Look for:

- `VIEWJOB` in the privileges column (system privilege)
- `CONTROLJOB` in role options (check with `SHOW USERS`)

See permissions reference for detailed RBAC setup.
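If the monitoring role lacks these, a cluster admin can grant them. A sketch, assuming a role named `monitor_user` (hypothetical name):

```sql
-- Read-only, cluster-wide job visibility (system privilege)
GRANT SYSTEM VIEWJOB TO monitor_user;

-- Pause/resume/cancel capability (a role option, not a grant)
ALTER ROLE monitor_user WITH CONTROLJOB;
```

Note the asymmetry: `VIEWJOB` is granted with `GRANT SYSTEM`, while `CONTROLJOB` is set with `ALTER ROLE`.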
## Core Concepts

### Jobs vs Statements

Key distinction:

- **Statements**: User-initiated SQL queries tracked by `SHOW CLUSTER STATEMENTS` and statement statistics
- **Background jobs**: Long-running operations tracked separately by `SHOW JOBS`

Background jobs are excluded from:

- `SHOW CLUSTER STATEMENTS` (live query monitoring)
- `crdb_internal.statement_statistics` (historical query analysis)
- Statement fingerprint metrics and the DB Console Statements page
Common job types:
| Category | Job Types | Examples |
|---|---|---|
| User-initiated | SCHEMA CHANGE, BACKUP, RESTORE, IMPORT, CHANGEFEED | ALTER TABLE, CREATE INDEX, BACKUP DATABASE, RESTORE |
| Automatic | SCHEMA CHANGE GC, AUTO CREATE STATS, AUTO SQL STATS COMPACTION | Post-DROP cleanup, table statistics refresh, stats table maintenance |
See job types reference for complete catalog.
### SHOW JOBS vs SHOW AUTOMATIC JOBS

| Interface | Scope | Time Window | Use Case |
|---|---|---|---|
| `SHOW JOBS` | User-initiated + automatic | Last 12 hours (default) | Monitor backups, schema changes, user operations |
| `SHOW AUTOMATIC JOBS` | Automatic only | Configurable (recommend 24h) | Monitor AUTO CREATE STATS, AUTO SQL STATS COMPACTION, SCHEMA CHANGE GC |

Time retention:

- Default retention: 14 days in the jobs table
- `SHOW JOBS` display window: 12 hours (extend with `SELECT * FROM [SHOW JOBS] WHERE ...`)
- `SHOW AUTOMATIC JOBS` display window: Configurable with `WHERE created > now() - INTERVAL '...'`
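To look past the 12-hour display window (up to the 14-day retention), you can query the underlying virtual table directly. A sketch using `crdb_internal.jobs`:

```sql
-- Jobs from the last 7 days, regardless of the SHOW JOBS display window
SELECT job_id, job_type, status, created, finished
FROM crdb_internal.jobs
WHERE created > now() - INTERVAL '7 days'
ORDER BY created DESC
LIMIT 100;
```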
### Job Status Values

| Status | Meaning | Action Required |
|---|---|---|
| `running` | Job is actively executing | Monitor progress via `fraction_completed` |
| `succeeded` | Job completed successfully | None |
| `failed` | Job encountered an error | Investigate `error` column; may need to retry |
| `paused` | Job manually paused | Resume with `RESUME JOB` if appropriate |
| `canceled` | Job was canceled (terminal state) | Retry operation if needed |
| `pending` | Job queued but not started | Monitor; may indicate resource constraints |
| `reverting` | Job failed and is rolling back changes | Wait for completion; check error after |

Running status sub-states (`running_status` column):

- Backup jobs: actively transferring data to the destination
- Restore jobs: actively applying restored data
- `waiting for MVCC GC`: SCHEMA CHANGE GC waiting for garbage collection eligibility
See job states reference for detailed state transitions and "waiting for MVCC GC" explanation.
## Core Diagnostic Queries

### Query 1: Failed Jobs (Last 12 Hours)

Identify jobs that failed, with their error messages:

```sql
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  finished,
  COALESCE(finished, now()) - created AS total_duration,
  error
FROM j
WHERE status = 'failed'
  AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;
```

Key columns:

- `error`: Failure reason (check for permission errors, disk space, network issues)
- `description`: Human-readable description of what the job was doing
- `total_duration`: How long the job ran before failing
Common failure patterns:
- Permission denied: User lacks required privileges
- Disk space: Backup destination full
- Network timeout: External storage unreachable
- Constraint violation: Restore conflicts with existing data
### Query 2: Long-Running Jobs

Find jobs running longer than an expected threshold:

```sql
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  status,
  running_status,
  created,
  now() - created AS running_for,
  fraction_completed,
  coordinator_id
FROM j
WHERE status = 'running'
  AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;
```

Key columns:

- `running_for`: Total elapsed time since the job was created
- `fraction_completed`: Progress estimate (0.0 to 1.0, NULL if unavailable)
- `running_status`: Sub-state details (e.g., "waiting for MVCC GC")
Customizable thresholds:
- Schema changes: 30 minutes to several hours (depends on table size)
- Backups: 1-6+ hours (depends on data volume)
- Automatic jobs: Usually < 30 minutes
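The per-type thresholds above can be folded into a single query. A sketch that flags running jobs exceeding a type-specific threshold (the thresholds are illustrative; tune them to your workload):

```sql
-- Flag running jobs that exceed a per-type duration threshold
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, description, now() - created AS running_for
FROM j
WHERE status = 'running'
  AND now() - created >
      CASE
        WHEN job_type IN ('BACKUP', 'RESTORE') THEN INTERVAL '6 hours'
        WHEN job_type LIKE '%SCHEMA CHANGE%'   THEN INTERVAL '2 hours'
        ELSE INTERVAL '30 minutes'  -- automatic jobs
      END
ORDER BY running_for DESC
LIMIT 50;
```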
### Query 3: Paused Jobs

Identify jobs that are paused and may need attention:

```sql
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS paused_for,
  coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;
```

Action required: Resume with `RESUME JOB <job_id>` after verifying the pause reason.
Common reasons for paused jobs:
- Manual user pause for maintenance
- Resource constraints (cluster paused the job)
- Error requiring manual intervention
### Query 4: Schema Changes Waiting for MVCC GC

Find SCHEMA CHANGE GC jobs waiting for garbage collection:

```sql
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS waiting_for,
  running_status
FROM j
WHERE status = 'running'
  AND job_type = 'SCHEMA CHANGE GC'
  AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;
```

Interpretation:

- Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on the `gc.ttlseconds` zone configuration (default 25 hours)
- Expected duration: Up to `gc.ttlseconds` plus some overhead
- When to worry: Waiting > 2x `gc.ttlseconds` (check the setting with `SHOW ZONE CONFIGURATION`)
Why this happens:
After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.
See job states reference for detailed MVCC GC explanation.
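The effective TTL is a zone configuration, so it can differ per table. A sketch for inspecting it, assuming a table named `t` (hypothetical):

```sql
-- GC TTL governing when this table's dropped data becomes collectable
SHOW ZONE CONFIGURATION FOR TABLE t;

-- Cluster-wide default
SHOW ZONE CONFIGURATION FOR RANGE default;
```

Look for `gc.ttlseconds` in the output; the wait time for SCHEMA CHANGE GC jobs is bounded by that value.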
### Query 5: Automatic Job Health (24h Window)

Monitor automatic background jobs such as statistics collection:

```sql
-- Automatic jobs in last 24 hours
SELECT
  job_id,
  job_type,
  description,
  status,
  created,
  finished,
  COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
  AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;
```

Key job types:

- `AUTO CREATE STATS`: Automatic table statistics refresh (critical for the query optimizer)
- `AUTO SQL STATS COMPACTION`: Periodic cleanup of statement/transaction statistics tables
Health indicators:
- Healthy: Regular successful executions (every few hours)
- Unhealthy: No recent executions, or high failure rate
- Impact of failure: Stale statistics lead to poor query plans and slow queries
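If AUTO CREATE STATS looks unhealthy, you can check how stale a specific table's statistics actually are. A sketch, assuming a table named `orders` (hypothetical):

```sql
-- created_at shows when each statistic was last refreshed
SHOW STATISTICS FOR TABLE orders;
```

Statistics much older than the table's write rate would suggest indicate the refresh jobs are not keeping up.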
### Query 6: Jobs by Type and Status

Aggregated view for pattern analysis:

```sql
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  status,
  COUNT(*) AS job_count,
  MIN(created) AS oldest,
  MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;
Use case:
- Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
- Spot anomalies (e.g., unusual job type volume)
- Track job success rates by type
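The same CTE can compute success rates per type directly, using an aggregate `FILTER` clause:

```sql
-- Success rate per job type over the last 24 hours (terminal states only)
WITH j AS (SHOW JOBS)
SELECT
  job_type,
  COUNT(*) AS total,
  COUNT(*) FILTER (WHERE status = 'succeeded') AS succeeded,
  ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'succeeded') / COUNT(*), 1)
    AS success_pct
FROM j
WHERE created > now() - INTERVAL '24 hours'
  AND status IN ('succeeded', 'failed', 'canceled')
GROUP BY job_type
ORDER BY success_pct;
```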
### Query 7: Backup and Restore Progress

Track progress of backup/restore jobs:

```sql
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
  job_id,
  job_type,
  description,
  created,
  now() - created AS running_for,
  ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
  CASE
    WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
      ((now() - created) / fraction_completed) - (now() - created)
    ELSE NULL
  END AS estimated_time_remaining,
  running_status
FROM j
WHERE status = 'running'
  AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;
```

Key columns:

- `percent_complete`: Progress percentage (0-100)
- `estimated_time_remaining`: Rough estimate based on the average progress rate so far
- `running_status`: Detailed status (e.g., "performing backup to s3://...")

Note: `fraction_completed` may be NULL for some job types or early in execution.
## Common Workflows

### Workflow 1: Schema Change Stuck Investigation

Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.

- Check for running schema changes:

  ```sql
  WITH j AS (SHOW JOBS)
  SELECT job_id, description, created, now() - created AS running_for,
         fraction_completed, running_status
  FROM j
  WHERE status = 'running'
    AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
  ORDER BY created;
  ```

- Identify MVCC GC waits:

  ```sql
  -- Use Query 4 to find "waiting for MVCC GC" jobs
  ```

- Interpret results:
  - If `running_status` = "waiting for MVCC GC": Normal for post-DROP cleanup (wait up to `gc.ttlseconds`)
  - If long-running with low `fraction_completed`: Check for contention, large table size, or resource constraints
  - If failed: Check the `error` column for the specific failure reason

- Next steps:
  - MVCC GC wait: Verify the table's `gc.ttlseconds` zone configuration and wait
  - Resource constraints: Check cluster CPU/memory usage
  - Failed job: Address the error (permissions, constraints) and retry the operation
### Workflow 2: Failed Backup Triage

Scenario: Scheduled backup job failed.

- Find recent failed backups:

  ```sql
  -- Query 1 filtered for the BACKUP job type
  WITH j AS (SHOW JOBS)
  SELECT job_id, description, created, finished, error
  FROM j
  WHERE status = 'failed'
    AND job_type = 'BACKUP'
    AND created > now() - INTERVAL '24 hours'
  ORDER BY created DESC;
  ```

- Analyze error messages:
  - "permission denied": Check external storage credentials
  - "timeout": Network connectivity to the backup destination
  - "no space left": Destination storage full
  - "connection refused": External storage endpoint unreachable

- Verify the backup destination:

  ```sql
  -- Check SHOW BACKUP for successful backups to the same destination
  SHOW BACKUP 's3://bucket/path';
  ```

- Remediate and retry:
  - Fix the underlying issue (credentials, storage, network)
  - Re-run the backup command
  - Monitor with Query 7 for progress
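Once remediated, the retry can be started as a detached job so it survives the SQL session. A sketch with hypothetical database and bucket names:

```sql
-- Re-run the backup as a background job and return the job ID immediately
BACKUP DATABASE app INTO 's3://bucket/path'
  WITH detached;
```

The returned job ID can then be fed into Query 7 to track progress.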
### Workflow 3: Automatic Job Health Check

Scenario: Proactive monitoring of automatic background jobs.

- Check AUTO CREATE STATS frequency:

  ```sql
  -- Use Query 5 to see recent automatic statistics jobs
  SELECT job_type, status, COUNT(*) AS job_count,
         MAX(created) AS most_recent
  FROM [SHOW AUTOMATIC JOBS]
  WHERE created > now() - INTERVAL '24 hours'
    AND job_type = 'AUTO CREATE STATS'
  GROUP BY job_type, status;
  ```

- Expected pattern:
  - Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
  - Regular AUTO SQL STATS COMPACTION (typically once per hour)

- Warning signs:
  - No AUTO CREATE STATS in the last 24h: Statistics collection may be disabled
  - High failure rate: Check cluster resource constraints or permission issues
  - No AUTO SQL STATS COMPACTION: Stats tables may grow unbounded

- Verify settings:

  ```sql
  SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- Should be true
  SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
  ```
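If collection turns out to be disabled, an admin can re-enable it (a sketch; requires admin or the MODIFYCLUSTERSETTING privilege):

```sql
-- Re-enable automatic statistics collection cluster-wide
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = true;
```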
### Workflow 4: Long-Running Job Monitoring

Scenario: Track progress of expected long-running operations.

- Identify long-running jobs:

  ```sql
  -- Use Query 2 with a custom threshold
  WITH j AS (SHOW JOBS)
  SELECT job_id, job_type, description,
         now() - created AS running_for,
         fraction_completed
  FROM j
  WHERE status = 'running'
    AND created < now() - INTERVAL '30 minutes'
  ORDER BY created;
  ```

- Monitor progress over time:

  ```sql
  -- Re-run every 10-15 minutes and track fraction_completed changes
  -- Example: 0.25 → 0.40 → 0.55 indicates steady progress
  ```

- Estimate completion:

  ```sql
  -- Use Query 7 for backup/restore jobs with time estimates
  ```

- Decide on action:
  - Steady progress: Continue monitoring
  - Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
  - Failed: Use Query 1 to check the error
## Safety Considerations

Read-only operations (all diagnostic queries):

All `SHOW JOBS` and `SHOW AUTOMATIC JOBS` queries are read-only and safe to run in production, with no performance impact on cluster operations.

Job control operations (opt-in):

CAUTION: Pausing or canceling jobs can have data integrity implications.

Only proceed with job control if:

- You have the `CONTROLJOB` role option
- You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
- You have authorization to interrupt cluster operations
- You've verified the job is truly problematic (not just slow)

Job control commands:

```sql
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;

-- Resume a paused job
RESUME JOB <job_id>;

-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;
```
Risks by job type:
- SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
- BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
- RESTORE: Canceling may leave database partially restored; requires cleanup
- AUTO CREATE STATS: Canceling is safe (will retry later automatically)
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| `SHOW JOBS` returns empty | No jobs in last 12h, or insufficient privileges | Grant `VIEWJOB` privilege; verify the cluster has recent job activity |
| "waiting for MVCC GC" for many hours | Normal behavior for SCHEMA CHANGE GC after DROP operations | Wait up to `gc.ttlseconds` (default 25h); check the table's zone configuration |
| Can't pause/resume job: "permission denied" | Missing `CONTROLJOB` role option | Use `ALTER ROLE <username> WITH CONTROLJOB` (not GRANT SYSTEM) |
| Job stuck at same `fraction_completed` | Job may be processing a large batch, or actually stuck | Wait 15-30 min and re-check; if no change, investigate with live query triage |
| No AUTO CREATE STATS jobs | Automatic collection disabled | Check `sql.stats.automatic_collection.enabled = true` |
| `SHOW AUTOMATIC JOBS` shows old jobs only | Need to filter by time window | Add `WHERE created > now() - INTERVAL '24 hours'` |
| Failed job with "schema change GC" error | Expected for post-DROP cleanup failures | Usually safe to ignore; the job will retry automatically |
| Job error: "concurrent schema change" | Multiple schema changes on the same table | Wait for the first schema change to complete, then retry |
## Key Considerations

- Jobs vs queries: Background jobs execute statements that don't appear in `SHOW CLUSTER STATEMENTS` or statement statistics
- Time windows: `SHOW JOBS` displays 12h by default; query `crdb_internal.jobs` for up to 14 days of retained history
- MVCC GC waiting: Normal and expected for post-DROP cleanup; duration tied to `gc.ttlseconds`
- LIMIT clauses: Always include them for production safety (prevents overwhelming output)
- Privilege model: `VIEWJOB` (system privilege) for read-only; `CONTROLJOB` (role option) for control
- Automatic job health: Regular AUTO CREATE STATS runs are critical for query optimizer performance
- Progress estimates: `fraction_completed` may be NULL or sparse for some job types
- Job control risks: PAUSE is safer than CANCEL; some cancellations require manual cleanup
## References

Skill references:

- RBAC and privilege setup
- Job states and transitions
- Job types catalog
- SQL query variations

Related skills:

- triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
- profiling-statement-fingerprints - For historical query analysis (background jobs excluded)

Official CockroachDB Documentation: