triaging-live-sql-activity

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Triaging Live SQL Activity

实时SQL活动排查

Diagnoses live cluster performance issues by identifying currently active long-running queries, busy sessions, and active transactions. Uses SQL-only interfaces (SHOW statements and
crdb_internal
views) to provide immediate triage without requiring DB Console, HTTP endpoints, or Prometheus access.
仅通过SQL接口(SHOW语句和
crdb_internal
视图)识别当前正在运行的长时间查询、繁忙会话和活跃事务,诊断集群实时性能问题,无需依赖DB Console、HTTP端点或Prometheus访问即可快速排查。

When to Use This Skill

何时使用该技能

  • Users report "the cluster is slow right now"
  • High CPU or memory usage on cluster nodes
  • Need to identify runaway queries or stuck transactions
  • Want to find which applications/users are consuming resources
  • Require immediate triage without DB Console access
  • Need to generate SQL to cancel problematic sessions/queries
For historical performance analysis: Use profiling-statement-fingerprints to analyze query patterns over time, identify slow fingerprints, and investigate trends without needing live queries. For transaction-level analysis: Use profiling-transaction-fingerprints to analyze historical transaction retry patterns, commit latency trends, and statement composition. For background job monitoring: Use monitoring-background-jobs to monitor schema changes, backups, and automatic jobs that don't appear in SHOW CLUSTER STATEMENTS.
  • 用户反馈“当前集群运行缓慢”
  • 集群节点CPU或内存占用过高
  • 需要识别失控查询或卡住的事务
  • 想要查找消耗资源的应用/用户
  • 无法访问DB Console时需要快速排查
  • 需要生成SQL来终止有问题的会话/查询
历史性能分析场景: 使用profiling-statement-fingerprints分析一段时间内的查询模式,识别缓慢查询特征,无需实时查询即可排查趋势。 事务级分析场景: 使用profiling-transaction-fingerprints分析历史事务重试模式、提交延迟趋势和语句构成。 后台任务监控场景: 使用monitoring-background-jobs监控模式变更、备份以及未出现在SHOW CLUSTER STATEMENTS中的自动任务。

Prerequisites

前提条件

Required SQL access:
  • Connection to any CockroachDB node
  • For cluster-wide visibility:
    VIEWACTIVITY
    or
    VIEWACTIVITYREDACTED
    privilege
    • VIEWACTIVITYREDACTED
      : Redacts constants in other users' queries (recommended for privacy)
    • VIEWACTIVITY
      : Shows full query text for all users
    • Without these: Only see your own sessions/queries
  • Basic understanding of SQL query execution
  • (Optional)
    CANCELQUERY
    /
    CANCELSESSION
    privileges for cancellation operations
Check your privileges:
sql
SHOW GRANTS ON ROLE <username>;
See permissions reference for detailed RBAC setup.
所需SQL权限:
  • 可连接到任意CockroachDB节点
  • 如需集群级可见性:
    VIEWACTIVITY
    VIEWACTIVITYREDACTED
    权限
    • VIEWACTIVITYREDACTED
      :屏蔽其他用户查询中的常量(出于隐私保护,推荐使用)
    • VIEWACTIVITY
      :查看所有用户的完整查询文本
    • 无上述权限:仅能查看自己的会话/查询
  • 具备SQL查询执行的基础理解
  • (可选)执行终止操作所需的
    CANCELQUERY
    /
    CANCELSESSION
    权限
检查权限:
sql
SHOW GRANTS ON ROLE <username>;
如需详细的RBAC设置,请查看权限参考文档

Core Diagnostic Approach

核心诊断方法

CockroachDB provides SQL-only interfaces for live activity triage:
InterfacePurposeCluster-wide?
SHOW CLUSTER STATEMENTS
Currently executing queriesYes (with VIEWACTIVITY)
SHOW CLUSTER SESSIONS
Active client sessionsYes (with VIEWACTIVITY)
crdb_internal.cluster_transactions
In-progress transactionsYes (with VIEWACTIVITY)
Triage workflow:
  1. Identify long-running queries (> 5-10 minutes)
  2. Correlate to sessions and applications
  3. Check transaction retry counts (high retries = contention)
  4. Drill down by app/user/client
  5. (Optional) Cancel runaway work
Safety: All diagnostic queries are read-only. Cancellation is opt-in with explicit warnings.
CockroachDB提供纯SQL接口用于实时活动排查:
接口用途支持集群级查看?
SHOW CLUSTER STATEMENTS
当前正在执行的查询是(需VIEWACTIVITY权限)
SHOW CLUSTER SESSIONS
活跃客户端会话是(需VIEWACTIVITY权限)
crdb_internal.cluster_transactions
进行中的事务是(需VIEWACTIVITY权限)
排查流程:
  1. 识别运行时间超过5-10分钟的长时间查询
  2. 关联到对应的会话和应用
  3. 检查事务重试次数(高重试次数表示存在资源竞争)
  4. 按应用/用户/客户端深入排查
  5. (可选)终止失控任务
安全性: 所有诊断查询均为只读操作。终止操作需手动选择并带有明确警告。

Core Diagnostic Queries

核心诊断查询

Long-Running Queries

长时间运行的查询

Identify queries running longer than a specified threshold:
sql
-- Queries running longer than 5 minutes
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
  query_id,
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  start,
  now() - start AS running_for,
  substring(query, 1, 200) AS query_preview,
  distributed,
  phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;
Key columns:
  • running_for
    : How long the query has been executing
  • query_preview
    : First 200 characters (protects against massive queries)
  • phase
    : execution phase (preparing, executing, etc.)
  • distributed
    : whether query spans multiple nodes
Customizable thresholds:
  • Change
    INTERVAL '5 minutes'
    to
    '10 minutes'
    ,
    '30 seconds'
    , etc.
  • Adjust
    LIMIT
    based on cluster size and expected load
识别运行时间超过指定阈值的查询:
sql
-- 运行时间超过5分钟的查询
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
  query_id,
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  start,
  now() - start AS running_for,
  substring(query, 1, 200) AS query_preview,
  distributed,
  phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;
关键列说明:
  • running_for
    :查询已执行时长
  • query_preview
    :查询前200个字符(避免返回过大的查询内容)
  • phase
    :执行阶段(准备中、执行中等)
  • distributed
    :查询是否跨多个节点
可自定义阈值:
  • INTERVAL '5 minutes'
    修改为
    '10 minutes'
    '30 seconds'
  • 根据集群规模和预期负载调整
    LIMIT

Active Sessions

活跃会话

Find sessions with long-running active queries:
sql
-- Sessions with active queries running > 5 minutes
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  status,
  active_query_start,
  now() - active_query_start AS active_query_for,
  substring(active_queries, 1, 200) AS active_queries_preview,
  substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
  AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;
Key columns:
  • active_query_for
    : Duration of current active query
  • application_name
    : Source application for drill-down
  • client_address
    : Client IP/hostname for troubleshooting
  • status
    : Session state (Idle, Active, etc.)
查找包含长时间运行查询的会话:
sql
-- 包含运行时长超过5分钟的活跃查询的会话
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  status,
  active_query_start,
  now() - active_query_start AS active_query_for,
  substring(active_queries, 1, 200) AS active_queries_preview,
  substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
  AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;
关键列说明:
  • active_query_for
    :当前活跃查询的持续时长
  • application_name
    :来源应用,用于深入排查
  • client_address
    :客户端IP/主机名,用于故障排查
  • status
    :会话状态(空闲、活跃等)

Active Transactions

活跃事务

Identify long-running transactions (potential blockers):
sql
-- Transactions running > 5 minutes
SELECT
  id AS txn_id,
  node_id,
  session_id,
  application_name,
  start,
  now() - start AS running_for,
  num_stmts,
  num_retries,
  num_auto_retries,
  substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;
Key columns:
  • num_retries
    /
    num_auto_retries
    : High retry counts indicate contention
  • num_stmts
    : Number of statements in transaction (large = potentially problematic)
  • txn_string
    : Transaction fingerprint
Production safety note:
crdb_internal.cluster_transactions
is production-approved and safe for triage.
识别长时间运行的事务(潜在阻塞源):
sql
-- 运行时间超过5分钟的事务
SELECT
  id AS txn_id,
  node_id,
  session_id,
  application_name,
  start,
  now() - start AS running_for,
  num_stmts,
  num_retries,
  num_auto_retries,
  substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;
关键列说明:
  • num_retries
    /
    num_auto_retries
    :高重试次数表示存在资源竞争
  • num_stmts
    :事务中的语句数量(数量过大可能存在问题)
  • txn_string
    :事务特征码
生产环境安全提示:
crdb_internal.cluster_transactions
已通过生产环境验证,可安全用于排查。

Drill-Down by Application, User, or Client

按应用、用户或客户端深入排查

Once you identify suspicious activity, drill down by filtering:
识别到可疑活动后,可通过过滤条件深入排查:

Filter by Application

按应用过滤

sql
-- All activity from specific application
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;
sql
-- 特定应用的所有活动
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;

Filter by User

按用户过滤

sql
-- All activity from specific user
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
       active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
  AND active_query_start IS NOT NULL
ORDER BY active_query_start;
sql
-- 特定用户的所有活动
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
       active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
  AND active_query_start IS NOT NULL
ORDER BY active_query_start;

Filter by Client Address

按客户端地址过滤

sql
-- All sessions from specific client IP
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
       status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;
sql
-- 特定客户端IP的所有会话
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
       status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;

Combined Filters

组合过滤

sql
-- Long queries from specific app and user
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
  AND user_name = 'app_user'
  AND start < now() - INTERVAL '10 minutes'
ORDER BY start;
sql
-- 特定应用和用户的长时间查询
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
  AND user_name = 'app_user'
  AND start < now() - INTERVAL '10 minutes'
ORDER BY start;

Safety Considerations

安全注意事项

Read-only operations: All diagnostic queries (
SHOW
statements,
crdb_internal.cluster_transactions
) are read-only and safe to run in production.
Cancellation operations (opt-in):
CAUTION: Canceling queries/sessions terminates user work
Only proceed if:
  • You've confirmed the query/session is runaway or stuck
  • You have authorization to interrupt user workloads
  • You've notified stakeholders if appropriate
  • You have
    CANCELQUERY
    or
    CANCELSESSION
    privileges
只读操作: 所有诊断查询(
SHOW
语句、
crdb_internal.cluster_transactions
)均为只读操作,可在生产环境安全运行。
终止操作(需手动选择):
注意:终止查询/会话会中断用户工作
仅在以下情况下执行:
  • 已确认查询/会话属于失控或卡住状态
  • 您有权中断用户工作负载
  • 已通知相关利益相关方(如适用)
  • 具备
    CANCELQUERY
    CANCELSESSION
    权限

Canceling Runaway Work (Opt-In)

终止失控任务(需手动选择)

Cancel a Specific Query

终止特定查询

sql
-- 1. Identify the query_id from triage queries above
-- 2. Cancel it
CANCEL QUERY '<query_id>';
Example:
sql
CANCEL QUERY '15f9e0e91f072f0f0000000000000001';
sql
-- 1. 从上述排查查询中获取query_id
-- 2. 执行终止操作
CANCEL QUERY '<query_id>';
示例:
sql
CANCEL QUERY '15f9e0e91f072f0f0000000000000001';

Cancel an Entire Session

终止整个会话

sql
-- 1. Identify the session_id from triage queries above
-- 2. Cancel all queries in that session
CANCEL SESSION '<session_id>';
Example:
sql
CANCEL SESSION '15f9e0e91f072f0f';
Verification: After canceling, re-run the triage queries to confirm the query/session is gone.
Required privileges:
  • CANCELQUERY
    system privilege to cancel queries
  • CANCELSESSION
    system privilege to cancel sessions
  • Admin role has both by default
See permissions reference for granting these privileges.
sql
-- 1. 从上述排查查询中获取session_id
-- 2. 终止该会话中的所有查询
CANCEL SESSION '<session_id>';
示例:
sql
CANCEL SESSION '15f9e0e91f072f0f';
验证: 终止后,重新运行排查查询确认查询/会话已消失。
所需权限:
  • CANCELQUERY
    系统权限:用于终止查询
  • CANCELSESSION
    系统权限:用于终止会话
  • 管理员角色默认具备上述两种权限
如需授予这些权限,请查看权限参考文档

Common Triage Workflows

常见排查流程

Workflow 1: "Cluster is slow" investigation

流程1:“集群运行缓慢”排查

Scenario: Users report general slowness.
  1. Check for long-running queries:
    sql
    -- Run the "Long-Running Queries" diagnostic
    -- Look for queries running > 5-10 minutes
  2. Identify source applications:
    sql
    -- Group by application to find culprits
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT application_name, COUNT(*) AS num_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    ORDER BY num_queries DESC;
  3. Drill down into specific app:
    sql
    -- Filter by top application from step 2
    -- Use "Filter by Application" query
  4. Decide on action:
    • Contact app team to investigate query patterns
    • Cancel specific runaway queries if critical
    • Check for schema/index issues if queries are legitimate
场景: 用户反馈整体运行缓慢。
  1. 检查长时间运行的查询:
    sql
    -- 运行“长时间运行的查询”诊断语句
    -- 查找运行时长超过5-10分钟的查询
  2. 识别来源应用:
    sql
    -- 按应用分组查找问题来源
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT application_name, COUNT(*) AS num_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    ORDER BY num_queries DESC;
  3. 深入排查特定应用:
    sql
    -- 过滤步骤2中排名靠前的应用
    -- 使用“按应用过滤”查询语句
  4. 决定操作:
    • 联系应用团队排查查询模式
    • 若情况紧急,终止特定失控查询
    • 若查询合法,检查模式/索引问题

Workflow 2: Find high-retry transactions

流程2:查找高重试事务

Scenario: Suspect contention issues.
  1. Check for high retry counts:
    sql
    SELECT application_name, AVG(num_retries) AS avg_retries,
           MAX(num_retries) AS max_retries, COUNT(*) AS num_txns
    FROM crdb_internal.cluster_transactions
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    HAVING AVG(num_retries) > 5
    ORDER BY avg_retries DESC;
  2. Investigate specific transactions:
    sql
    -- Find transactions with >10 retries
    SELECT id, application_name, num_retries, num_stmts,
           substring(txn_string, 1, 200) AS txn_preview
    FROM crdb_internal.cluster_transactions
    WHERE num_retries > 10
    ORDER BY num_retries DESC;
  3. Next steps:
    • Review transaction patterns for contention
    • Check for lock conflicts or hotspots
    • Consider schema changes to reduce contention
场景: 怀疑存在资源竞争问题。
  1. 检查高重试次数:
    sql
    SELECT application_name, AVG(num_retries) AS avg_retries,
           MAX(num_retries) AS max_retries, COUNT(*) AS num_txns
    FROM crdb_internal.cluster_transactions
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    HAVING AVG(num_retries) > 5
    ORDER BY avg_retries DESC;
  2. 深入排查特定事务:
    sql
    -- 查找重试次数超过10次的事务
    SELECT id, application_name, num_retries, num_stmts,
           substring(txn_string, 1, 200) AS txn_preview
    FROM crdb_internal.cluster_transactions
    WHERE num_retries > 10
    ORDER BY num_retries DESC;
  3. 后续步骤:
    • 审查事务模式以排查资源竞争
    • 检查锁冲突或热点问题
    • 考虑修改模式以减少资源竞争

Workflow 3: Identify resource hogs by user

流程3:按用户识别资源消耗大户

Scenario: Need to attribute load to specific users.
  1. Count active queries per user:
    sql
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT user_name, COUNT(*) AS num_active_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    GROUP BY user_name
    ORDER BY num_active_queries DESC;
  2. Drill down to specific user's activity:
    sql
    -- Use "Filter by User" query
  3. Take action:
    • Contact user if unexpected load
    • Review user's query patterns
    • Cancel if clearly runaway
场景: 需要将负载归因于特定用户。
  1. 统计每个用户的活跃查询数量:
    sql
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT user_name, COUNT(*) AS num_active_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    GROUP BY user_name
    ORDER BY num_active_queries DESC;
  2. 深入排查特定用户的活动:
    sql
    -- 使用“按用户过滤”查询语句
  3. 采取行动:
    • 若负载异常,联系对应用户
    • 审查用户的查询模式
    • 若明确为失控查询,执行终止操作

Troubleshooting

故障排除

IssueCauseFix
SHOW CLUSTER STATEMENTS
returns empty
No active queries, or insufficient privilegesGrant
VIEWACTIVITY
or
VIEWACTIVITYREDACTED
; verify cluster has active load
Query text shows
<hidden>
Using
VIEWACTIVITYREDACTED
privilege
This is expected for privacy; use
VIEWACTIVITY
if full text needed
Can't cancel query: "permission denied"Missing
CANCELQUERY
privilege
Grant
CANCELQUERY
system privilege to your user
crdb_internal.cluster_transactions
slow
High transaction volume on clusterAdd filters (application_name, time threshold) to reduce result set
"relation does not exist" errorTypo in table name or old CockroachDB versionVerify you're using production-approved tables; check CockroachDB version compatibility
Triage queries themselves are slowCluster under extreme loadUse more aggressive filters (shorter time window, specific apps); consider canceling obvious runaway work first
问题原因解决方法
SHOW CLUSTER STATEMENTS
返回空结果
无活跃查询,或权限不足授予
VIEWACTIVITY
VIEWACTIVITYREDACTED
权限;确认集群存在活跃负载
查询文本显示
<hidden>
使用了
VIEWACTIVITYREDACTED
权限
这是隐私保护的预期效果;如需完整文本,使用
VIEWACTIVITY
权限
无法终止查询:“permission denied”缺少
CANCELQUERY
权限
为您的用户授予
CANCELQUERY
系统权限
crdb_internal.cluster_transactions
查询缓慢
集群事务量过高添加过滤条件(应用名称、时间阈值)以减少结果集
出现“relation does not exist”错误表名拼写错误或CockroachDB版本过旧确认使用的是生产环境认可的表;检查CockroachDB版本兼容性
排查查询本身运行缓慢集群处于极端负载状态使用更严格的过滤条件(更短的时间窗口、特定应用);优先终止明显的失控任务

Key Considerations

关键注意事项

  • Privacy: Use
    VIEWACTIVITYREDACTED
    instead of
    VIEWACTIVITY
    to protect sensitive query constants in multi-tenant environments
  • Performance impact: Triage queries are read-only and lightweight, but avoid running them in tight loops during extreme load
  • LIMIT clause: Always include
    LIMIT
    to prevent overwhelming output on large clusters
  • Time thresholds: Adjust
    INTERVAL
    based on your workload (5 minutes is a reasonable default, but fast OLTP may need 30 seconds)
  • Cancellation is disruptive: Only cancel queries/sessions after confirming they're problematic; coordinate with application teams when possible
  • Not for historical analysis: These queries show current state only; for trends over time, use DB Console or Prometheus metrics
  • Production-approved sources: Only use
    SHOW CLUSTER STATEMENTS
    ,
    SHOW CLUSTER SESSIONS
    , and
    crdb_internal.cluster_transactions
    for production triage
  • 隐私保护: 在多租户环境中,使用
    VIEWACTIVITYREDACTED
    而非
    VIEWACTIVITY
    以保护敏感查询常量
  • 性能影响: 排查查询为只读且轻量,但在极端负载期间避免频繁循环执行
  • LIMIT子句: 始终添加
    LIMIT
    以避免在大型集群中返回过多结果
  • 时间阈值: 根据您的工作负载调整
    INTERVAL
    (5分钟为合理默认值,但快速OLTP场景可能需要30秒)
  • 终止操作具有破坏性: 仅在确认查询/会话存在问题后执行终止操作;尽可能与应用团队协调
  • 不适用于历史分析: 这些查询仅显示当前状态;如需分析时间趋势,使用DB Console或Prometheus指标
  • 生产环境认可的来源: 仅使用
    SHOW CLUSTER STATEMENTS
    SHOW CLUSTER SESSIONS
    crdb_internal.cluster_transactions
    进行生产环境排查

References

参考资料

Skill references:
  • SQL query variations and examples
  • RBAC and privilege setup
Related skills:
  • profiling-statement-fingerprints - For historical performance pattern analysis and trend identification
  • profiling-transaction-fingerprints - For historical transaction-level analysis including retry storms and commit latency
Official CockroachDB Documentation:
技能参考:
  • SQL查询变体及示例
  • RBAC及权限设置
相关技能:
  • profiling-statement-fingerprints - 用于历史性能模式分析和趋势识别
  • profiling-transaction-fingerprints - 用于历史事务级分析,包括重试风暴和提交延迟
官方CockroachDB文档: