triaging-live-sql-activity

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Triaging Live SQL Activity

实时SQL活动排查

Diagnoses live cluster performance issues by identifying currently active long-running queries, busy sessions, and active transactions. Uses SQL-only interfaces (SHOW statements and

crdb_internal

views) to provide immediate triage without requiring DB Console, HTTP endpoints, or Prometheus access.

仅通过SQL接口（SHOW语句和

crdb_internal

视图）识别当前正在运行的长时间查询、繁忙会话和活跃事务，诊断集群实时性能问题，无需依赖DB Console、HTTP端点或Prometheus访问即可快速排查。

When to Use This Skill

何时使用该技能

Users report "the cluster is slow right now"
High CPU or memory usage on cluster nodes
Need to identify runaway queries or stuck transactions
Want to find which applications/users are consuming resources
Require immediate triage without DB Console access
Need to generate SQL to cancel problematic sessions/queries

For historical performance analysis: Use profiling-statement-fingerprints to analyze query patterns over time, identify slow fingerprints, and investigate trends without needing live queries. For transaction-level analysis: Use profiling-transaction-fingerprints to analyze historical transaction retry patterns, commit latency trends, and statement composition. For background job monitoring: Use monitoring-background-jobs to monitor schema changes, backups, and automatic jobs that don't appear in SHOW CLUSTER STATEMENTS.

用户反馈“当前集群运行缓慢”
集群节点CPU或内存占用过高
需要识别失控查询或卡住的事务
想要查找消耗资源的应用/用户
无法访问DB Console时需要快速排查
需要生成SQL来终止有问题的会话/查询

历史性能分析场景： 使用profiling-statement-fingerprints分析一段时间内的查询模式，识别缓慢查询特征，无需实时查询即可排查趋势。 事务级分析场景： 使用profiling-transaction-fingerprints分析历史事务重试模式、提交延迟趋势和语句构成。 后台任务监控场景： 使用monitoring-background-jobs监控模式变更、备份以及未出现在SHOW CLUSTER STATEMENTS中的自动任务。

Prerequisites

前提条件

Required SQL access:

Connection to any CockroachDB node
For cluster-wide visibility:
```
VIEWACTIVITY
```
or
```
VIEWACTIVITYREDACTED
```
privilege
- ```
VIEWACTIVITYREDACTED
```
  : Redacts constants in other users' queries (recommended for privacy)
- ```
VIEWACTIVITY
```
  : Shows full query text for all users
- Without these: Only see your own sessions/queries
Basic understanding of SQL query execution
(Optional)
```
CANCELQUERY
```
/
```
CANCELSESSION
```
privileges for cancellation operations

Check your privileges:

sql

SHOW GRANTS ON ROLE <username>;

See permissions reference for detailed RBAC setup.

所需SQL权限：

可连接到任意CockroachDB节点
如需集群级可见性：
```
VIEWACTIVITY
```
或
```
VIEWACTIVITYREDACTED
```
权限
- ```
VIEWACTIVITYREDACTED
```
  ：屏蔽其他用户查询中的常量（出于隐私保护，推荐使用）
- ```
VIEWACTIVITY
```
  ：查看所有用户的完整查询文本
- 无上述权限：仅能查看自己的会话/查询
具备SQL查询执行的基础理解
（可选）执行终止操作所需的
```
CANCELQUERY
```
/
```
CANCELSESSION
```
权限

检查权限：

sql

SHOW GRANTS ON ROLE <username>;

如需详细的RBAC设置，请查看权限参考文档。

Core Diagnostic Approach

核心诊断方法

CockroachDB provides SQL-only interfaces for live activity triage:

Interface	Purpose	Cluster-wide?
`SHOW CLUSTER STATEMENTS`	Currently executing queries	Yes (with VIEWACTIVITY)
`SHOW CLUSTER SESSIONS`	Active client sessions	Yes (with VIEWACTIVITY)
`crdb_internal.cluster_transactions`	In-progress transactions	Yes (with VIEWACTIVITY)

Triage workflow:

Identify long-running queries (> 5-10 minutes)
Correlate to sessions and applications
Check transaction retry counts (high retries = contention)
Drill down by app/user/client
(Optional) Cancel runaway work

Safety: All diagnostic queries are read-only. Cancellation is opt-in with explicit warnings.

CockroachDB提供纯SQL接口用于实时活动排查：

接口	用途	支持集群级查看？
`SHOW CLUSTER STATEMENTS`	当前正在执行的查询	是（需VIEWACTIVITY权限）
`SHOW CLUSTER SESSIONS`	活跃客户端会话	是（需VIEWACTIVITY权限）
`crdb_internal.cluster_transactions`	进行中的事务	是（需VIEWACTIVITY权限）

排查流程：

识别运行时间超过5-10分钟的长时间查询
关联到对应的会话和应用
检查事务重试次数（高重试次数表示存在资源竞争）
按应用/用户/客户端深入排查
（可选）终止失控任务

安全性： 所有诊断查询均为只读操作。终止操作需手动选择并带有明确警告。

Core Diagnostic Queries

核心诊断查询

Long-Running Queries

长时间运行的查询

Identify queries running longer than a specified threshold:

sql

-- Queries running longer than 5 minutes
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
  query_id,
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  start,
  now() - start AS running_for,
  substring(query, 1, 200) AS query_preview,
  distributed,
  phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

Key columns:

```
running_for
```
: How long the query has been executing
```
query_preview
```
: First 200 characters (protects against massive queries)
```
phase
```
: execution phase (preparing, executing, etc.)
```
distributed
```
: whether query spans multiple nodes

Customizable thresholds:

Change

INTERVAL '5 minutes'

'10 minutes'

'30 seconds'

, etc.

Adjust
```
LIMIT
```
based on cluster size and expected load

识别运行时间超过指定阈值的查询：

sql

-- 运行时间超过5分钟的查询
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
  query_id,
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  start,
  now() - start AS running_for,
  substring(query, 1, 200) AS query_preview,
  distributed,
  phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

关键列说明：

```
running_for
```
：查询已执行时长
```
query_preview
```
：查询前200个字符（避免返回过大的查询内容）
```
phase
```
：执行阶段（准备中、执行中等）
```
distributed
```
：查询是否跨多个节点

可自定义阈值：

将

INTERVAL '5 minutes'

修改为

'10 minutes'

、

'30 seconds'

等

根据集群规模和预期负载调整
```
LIMIT
```
值

Active Sessions

活跃会话

Find sessions with long-running active queries:

sql

-- Sessions with active queries running > 5 minutes
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  status,
  active_query_start,
  now() - active_query_start AS active_query_for,
  substring(active_queries, 1, 200) AS active_queries_preview,
  substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
  AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;

Key columns:

```
active_query_for
```
: Duration of current active query
```
application_name
```
: Source application for drill-down
```
client_address
```
: Client IP/hostname for troubleshooting
```
status
```
: Session state (Idle, Active, etc.)

查找包含长时间运行查询的会话：

sql

-- 包含运行时长超过5分钟的活跃查询的会话
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  status,
  active_query_start,
  now() - active_query_start AS active_query_for,
  substring(active_queries, 1, 200) AS active_queries_preview,
  substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
  AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;

关键列说明：

```
active_query_for
```
：当前活跃查询的持续时长
```
application_name
```
：来源应用，用于深入排查
```
client_address
```
：客户端IP/主机名，用于故障排查
```
status
```
：会话状态（空闲、活跃等）

Active Transactions

活跃事务

Identify long-running transactions (potential blockers):

sql

-- Transactions running > 5 minutes
SELECT
  id AS txn_id,
  node_id,
  session_id,
  application_name,
  start,
  now() - start AS running_for,
  num_stmts,
  num_retries,
  num_auto_retries,
  substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

Key columns:

```
num_retries
```
/
```
num_auto_retries
```
: High retry counts indicate contention
```
num_stmts
```
: Number of statements in transaction (large = potentially problematic)
```
txn_string
```
: Transaction fingerprint

Production safety note:

crdb_internal.cluster_transactions

is production-approved and safe for triage.

识别长时间运行的事务（潜在阻塞源）：

sql

-- 运行时间超过5分钟的事务
SELECT
  id AS txn_id,
  node_id,
  session_id,
  application_name,
  start,
  now() - start AS running_for,
  num_stmts,
  num_retries,
  num_auto_retries,
  substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

关键列说明：

```
num_retries
```
/
```
num_auto_retries
```
：高重试次数表示存在资源竞争
```
num_stmts
```
：事务中的语句数量（数量过大可能存在问题）
```
txn_string
```
：事务特征码

生产环境安全提示：

crdb_internal.cluster_transactions

已通过生产环境验证，可安全用于排查。

Drill-Down by Application, User, or Client

按应用、用户或客户端深入排查

Once you identify suspicious activity, drill down by filtering:

识别到可疑活动后，可通过过滤条件深入排查：

Filter by Application

按应用过滤

sql

-- All activity from specific application
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;

sql

-- 特定应用的所有活动
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;

Filter by User

按用户过滤

sql

-- All activity from specific user
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
       active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
  AND active_query_start IS NOT NULL
ORDER BY active_query_start;

sql

-- 特定用户的所有活动
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
       active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
  AND active_query_start IS NOT NULL
ORDER BY active_query_start;

Filter by Client Address

按客户端地址过滤

sql

-- All sessions from specific client IP
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
       status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;

sql

-- 特定客户端IP的所有会话
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
       status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;

Combined Filters

组合过滤

sql

-- Long queries from specific app and user
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
  AND user_name = 'app_user'
  AND start < now() - INTERVAL '10 minutes'
ORDER BY start;

sql

-- 特定应用和用户的长时间查询
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
  AND user_name = 'app_user'
  AND start < now() - INTERVAL '10 minutes'
ORDER BY start;

Safety Considerations

安全注意事项

Read-only operations: All diagnostic queries (

SHOW

statements,

crdb_internal.cluster_transactions

) are read-only and safe to run in production.

Cancellation operations (opt-in):

CAUTION: Canceling queries/sessions terminates user work

Only proceed if:

You've confirmed the query/session is runaway or stuck
You have authorization to interrupt user workloads
You've notified stakeholders if appropriate
You have
```
CANCELQUERY
```
or
```
CANCELSESSION
```
privileges

只读操作： 所有诊断查询（

SHOW

语句、

crdb_internal.cluster_transactions

）均为只读操作，可在生产环境安全运行。

终止操作（需手动选择）：

注意：终止查询/会话会中断用户工作

仅在以下情况下执行：

已确认查询/会话属于失控或卡住状态
您有权中断用户工作负载
已通知相关利益相关方（如适用）
具备
```
CANCELQUERY
```
或
```
CANCELSESSION
```
权限

Canceling Runaway Work (Opt-In)

终止失控任务（需手动选择）

Cancel a Specific Query

终止特定查询

sql

-- 1. Identify the query_id from triage queries above
-- 2. Cancel it
CANCEL QUERY '<query_id>';

Example:

sql

CANCEL QUERY '15f9e0e91f072f0f0000000000000001';

sql

-- 1. 从上述排查查询中获取query_id
-- 2. 执行终止操作
CANCEL QUERY '<query_id>';

示例：

sql

CANCEL QUERY '15f9e0e91f072f0f0000000000000001';

Cancel an Entire Session

终止整个会话

sql

-- 1. Identify the session_id from triage queries above
-- 2. Cancel all queries in that session
CANCEL SESSION '<session_id>';

Example:

sql

CANCEL SESSION '15f9e0e91f072f0f';

Verification: After canceling, re-run the triage queries to confirm the query/session is gone.

Required privileges:

```
CANCELQUERY
```
system privilege to cancel queries
```
CANCELSESSION
```
system privilege to cancel sessions
Admin role has both by default

See permissions reference for granting these privileges.

sql

-- 1. 从上述排查查询中获取session_id
-- 2. 终止该会话中的所有查询
CANCEL SESSION '<session_id>';

示例：

sql

CANCEL SESSION '15f9e0e91f072f0f';

验证： 终止后，重新运行排查查询确认查询/会话已消失。

所需权限：

```
CANCELQUERY
```
系统权限：用于终止查询
```
CANCELSESSION
```
系统权限：用于终止会话
管理员角色默认具备上述两种权限

如需授予这些权限，请查看权限参考文档。

Common Triage Workflows

常见排查流程

Workflow 1: "Cluster is slow" investigation

流程1：“集群运行缓慢”排查

Scenario: Users report general slowness.

Check for long-running queries:

sql

-- Run the "Long-Running Queries" diagnostic
-- Look for queries running > 5-10 minutes

Identify source applications:

sql

-- Group by application to find culprits
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT application_name, COUNT(*) AS num_queries,
       AVG(now() - start) AS avg_duration
FROM q
WHERE start < now() - INTERVAL '5 minutes'
GROUP BY application_name
ORDER BY num_queries DESC;

Drill down into specific app:

sql

-- Filter by top application from step 2
-- Use "Filter by Application" query

Decide on action:
- Contact app team to investigate query patterns
- Cancel specific runaway queries if critical
- Check for schema/index issues if queries are legitimate

场景： 用户反馈整体运行缓慢。

检查长时间运行的查询：

sql

-- 运行“长时间运行的查询”诊断语句
-- 查找运行时长超过5-10分钟的查询

识别来源应用：

sql

-- 按应用分组查找问题来源
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT application_name, COUNT(*) AS num_queries,
       AVG(now() - start) AS avg_duration
FROM q
WHERE start < now() - INTERVAL '5 minutes'
GROUP BY application_name
ORDER BY num_queries DESC;

深入排查特定应用：

sql

-- 过滤步骤2中排名靠前的应用
-- 使用“按应用过滤”查询语句

决定操作：
- 联系应用团队排查查询模式
- 若情况紧急，终止特定失控查询
- 若查询合法，检查模式/索引问题

Workflow 2: Find high-retry transactions

流程2：查找高重试事务

Scenario: Suspect contention issues.

Check for high retry counts:

sql

SELECT application_name, AVG(num_retries) AS avg_retries,
       MAX(num_retries) AS max_retries, COUNT(*) AS num_txns
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
GROUP BY application_name
HAVING AVG(num_retries) > 5
ORDER BY avg_retries DESC;

Investigate specific transactions:

sql

-- Find transactions with >10 retries
SELECT id, application_name, num_retries, num_stmts,
       substring(txn_string, 1, 200) AS txn_preview
FROM crdb_internal.cluster_transactions
WHERE num_retries > 10
ORDER BY num_retries DESC;

Next steps:
- Review transaction patterns for contention
- Check for lock conflicts or hotspots
- Consider schema changes to reduce contention

场景： 怀疑存在资源竞争问题。

检查高重试次数：

sql

SELECT application_name, AVG(num_retries) AS avg_retries,
       MAX(num_retries) AS max_retries, COUNT(*) AS num_txns
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
GROUP BY application_name
HAVING AVG(num_retries) > 5
ORDER BY avg_retries DESC;

深入排查特定事务：

sql

-- 查找重试次数超过10次的事务
SELECT id, application_name, num_retries, num_stmts,
       substring(txn_string, 1, 200) AS txn_preview
FROM crdb_internal.cluster_transactions
WHERE num_retries > 10
ORDER BY num_retries DESC;

后续步骤：
- 审查事务模式以排查资源竞争
- 检查锁冲突或热点问题
- 考虑修改模式以减少资源竞争

Workflow 3: Identify resource hogs by user

流程3：按用户识别资源消耗大户

Scenario: Need to attribute load to specific users.

Count active queries per user:

sql

WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT user_name, COUNT(*) AS num_active_queries,
       AVG(now() - start) AS avg_duration
FROM q
GROUP BY user_name
ORDER BY num_active_queries DESC;

Drill down to specific user's activity:
sql
```
-- Use "Filter by User" query
```
Take action:
- Contact user if unexpected load
- Review user's query patterns
- Cancel if clearly runaway

场景： 需要将负载归因于特定用户。

统计每个用户的活跃查询数量：

sql

WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT user_name, COUNT(*) AS num_active_queries,
       AVG(now() - start) AS avg_duration
FROM q
GROUP BY user_name
ORDER BY num_active_queries DESC;

深入排查特定用户的活动：

sql

-- 使用“按用户过滤”查询语句

采取行动：
- 若负载异常，联系对应用户
- 审查用户的查询模式
- 若明确为失控查询，执行终止操作

Troubleshooting

故障排除

Issue	Cause	Fix
`SHOW CLUSTER STATEMENTS` returns empty	No active queries, or insufficient privileges	Grant `VIEWACTIVITY` or `VIEWACTIVITYREDACTED` ; verify cluster has active load
Query text shows `<hidden>`	Using `VIEWACTIVITYREDACTED` privilege	This is expected for privacy; use `VIEWACTIVITY` if full text needed
Can't cancel query: "permission denied"	Missing `CANCELQUERY` privilege	Grant `CANCELQUERY` system privilege to your user
`crdb_internal.cluster_transactions` slow	High transaction volume on cluster	Add filters (application_name, time threshold) to reduce result set
"relation does not exist" error	Typo in table name or old CockroachDB version	Verify you're using production-approved tables; check CockroachDB version compatibility
Triage queries themselves are slow	Cluster under extreme load	Use more aggressive filters (shorter time window, specific apps); consider canceling obvious runaway work first

问题	原因	解决方法
`SHOW CLUSTER STATEMENTS` 返回空结果	无活跃查询，或权限不足	授予 `VIEWACTIVITY` 或 `VIEWACTIVITYREDACTED` 权限；确认集群存在活跃负载
查询文本显示 `<hidden>`	使用了 `VIEWACTIVITYREDACTED` 权限	这是隐私保护的预期效果；如需完整文本，使用 `VIEWACTIVITY` 权限
无法终止查询：“permission denied”	缺少 `CANCELQUERY` 权限	为您的用户授予 `CANCELQUERY` 系统权限
`crdb_internal.cluster_transactions` 查询缓慢	集群事务量过高	添加过滤条件（应用名称、时间阈值）以减少结果集
出现“relation does not exist”错误	表名拼写错误或CockroachDB版本过旧	确认使用的是生产环境认可的表；检查CockroachDB版本兼容性
排查查询本身运行缓慢	集群处于极端负载状态	使用更严格的过滤条件（更短的时间窗口、特定应用）；优先终止明显的失控任务

Key Considerations

关键注意事项

Privacy: Use
```
VIEWACTIVITYREDACTED
```
instead of
```
VIEWACTIVITY
```
to protect sensitive query constants in multi-tenant environments
Performance impact: Triage queries are read-only and lightweight, but avoid running them in tight loops during extreme load
LIMIT clause: Always include
```
LIMIT
```
to prevent overwhelming output on large clusters
Time thresholds: Adjust
```
INTERVAL
```
based on your workload (5 minutes is a reasonable default, but fast OLTP may need 30 seconds)
Cancellation is disruptive: Only cancel queries/sessions after confirming they're problematic; coordinate with application teams when possible
Not for historical analysis: These queries show current state only; for trends over time, use DB Console or Prometheus metrics

Production-approved sources: Only use

SHOW CLUSTER STATEMENTS

SHOW CLUSTER SESSIONS

, and

crdb_internal.cluster_transactions

for production triage

隐私保护： 在多租户环境中，使用
```
VIEWACTIVITYREDACTED
```
而非
```
VIEWACTIVITY
```
以保护敏感查询常量
性能影响： 排查查询为只读且轻量，但在极端负载期间避免频繁循环执行
LIMIT子句： 始终添加
```
LIMIT
```
以避免在大型集群中返回过多结果
时间阈值： 根据您的工作负载调整
```
INTERVAL
```
（5分钟为合理默认值，但快速OLTP场景可能需要30秒）
终止操作具有破坏性： 仅在确认查询/会话存在问题后执行终止操作；尽可能与应用团队协调
不适用于历史分析： 这些查询仅显示当前状态；如需分析时间趋势，使用DB Console或Prometheus指标

生产环境认可的来源： 仅使用

SHOW CLUSTER STATEMENTS

、

SHOW CLUSTER SESSIONS

和

crdb_internal.cluster_transactions

进行生产环境排查

References

参考资料

Skill references:

SQL query variations and examples
RBAC and privilege setup

Related skills:

profiling-statement-fingerprints - For historical performance pattern analysis and trend identification
profiling-transaction-fingerprints - For historical transaction-level analysis including retry storms and commit latency

Official CockroachDB Documentation:

技能参考：

SQL查询变体及示例
RBAC及权限设置

相关技能：

profiling-statement-fingerprints - 用于历史性能模式分析和趋势识别
profiling-transaction-fingerprints - 用于历史事务级分析，包括重试风暴和提交延迟

官方CockroachDB文档：