triaging-live-sql-activity
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseTriaging Live SQL Activity
实时SQL活动排查
Diagnoses live cluster performance issues by identifying currently active long-running queries, busy sessions, and active transactions. Uses SQL-only interfaces (SHOW statements and views) to provide immediate triage without requiring DB Console, HTTP endpoints, or Prometheus access.
crdb_internal仅通过SQL接口(SHOW语句和视图)识别当前正在运行的长时间查询、繁忙会话和活跃事务,诊断集群实时性能问题,无需依赖DB Console、HTTP端点或Prometheus访问即可快速排查。
crdb_internalWhen to Use This Skill
何时使用该技能
- Users report "the cluster is slow right now"
- High CPU or memory usage on cluster nodes
- Need to identify runaway queries or stuck transactions
- Want to find which applications/users are consuming resources
- Require immediate triage without DB Console access
- Need to generate SQL to cancel problematic sessions/queries
For historical performance analysis: Use profiling-statement-fingerprints to analyze query patterns over time, identify slow fingerprints, and investigate trends without needing live queries.
For transaction-level analysis: Use profiling-transaction-fingerprints to analyze historical transaction retry patterns, commit latency trends, and statement composition.
For background job monitoring: Use monitoring-background-jobs to monitor schema changes, backups, and automatic jobs that don't appear in SHOW CLUSTER STATEMENTS.
- 用户反馈“当前集群运行缓慢”
- 集群节点CPU或内存占用过高
- 需要识别失控查询或卡住的事务
- 想要查找消耗资源的应用/用户
- 无法访问DB Console时需要快速排查
- 需要生成SQL来终止有问题的会话/查询
历史性能分析场景: 使用profiling-statement-fingerprints分析一段时间内的查询模式,识别缓慢查询特征,无需实时查询即可排查趋势。
事务级分析场景: 使用profiling-transaction-fingerprints分析历史事务重试模式、提交延迟趋势和语句构成。
后台任务监控场景: 使用monitoring-background-jobs监控模式变更、备份以及未出现在SHOW CLUSTER STATEMENTS中的自动任务。
Prerequisites
前提条件
Required SQL access:
- Connection to any CockroachDB node
- For cluster-wide visibility: or
VIEWACTIVITYprivilegeVIEWACTIVITYREDACTED- : Redacts constants in other users' queries (recommended for privacy)
VIEWACTIVITYREDACTED - : Shows full query text for all users
VIEWACTIVITY - Without these: Only see your own sessions/queries
- Basic understanding of SQL query execution
- (Optional) /
CANCELQUERYprivileges for cancellation operationsCANCELSESSION
Check your privileges:
sql
SHOW GRANTS ON ROLE <username>;See permissions reference for detailed RBAC setup.
所需SQL权限:
- 可连接到任意CockroachDB节点
- 如需集群级可见性:或
VIEWACTIVITY权限VIEWACTIVITYREDACTED- :屏蔽其他用户查询中的常量(出于隐私保护,推荐使用)
VIEWACTIVITYREDACTED - :查看所有用户的完整查询文本
VIEWACTIVITY - 无上述权限:仅能查看自己的会话/查询
- 具备SQL查询执行的基础理解
- (可选)执行终止操作所需的/
CANCELQUERY权限CANCELSESSION
检查权限:
sql
SHOW GRANTS ON ROLE <username>;如需详细的RBAC设置,请查看权限参考文档。
Core Diagnostic Approach
核心诊断方法
CockroachDB provides SQL-only interfaces for live activity triage:
| Interface | Purpose | Cluster-wide? |
|---|---|---|
| Currently executing queries | Yes (with VIEWACTIVITY) |
| Active client sessions | Yes (with VIEWACTIVITY) |
| In-progress transactions | Yes (with VIEWACTIVITY) |
Triage workflow:
- Identify long-running queries (> 5-10 minutes)
- Correlate to sessions and applications
- Check transaction retry counts (high retries = contention)
- Drill down by app/user/client
- (Optional) Cancel runaway work
Safety: All diagnostic queries are read-only. Cancellation is opt-in with explicit warnings.
CockroachDB提供纯SQL接口用于实时活动排查:
| 接口 | 用途 | 支持集群级查看? |
|---|---|---|
| 当前正在执行的查询 | 是(需VIEWACTIVITY权限) |
| 活跃客户端会话 | 是(需VIEWACTIVITY权限) |
| 进行中的事务 | 是(需VIEWACTIVITY权限) |
排查流程:
- 识别运行时间超过5-10分钟的长时间查询
- 关联到对应的会话和应用
- 检查事务重试次数(高重试次数表示存在资源竞争)
- 按应用/用户/客户端深入排查
- (可选)终止失控任务
安全性: 所有诊断查询均为只读操作。终止操作需手动选择并带有明确警告。
Core Diagnostic Queries
核心诊断查询
Long-Running Queries
长时间运行的查询
Identify queries running longer than a specified threshold:
sql
-- Queries running longer than 5 minutes
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
query_id,
node_id,
session_id,
user_name,
client_address,
application_name,
start,
now() - start AS running_for,
substring(query, 1, 200) AS query_preview,
distributed,
phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;Key columns:
- : How long the query has been executing
running_for - : First 200 characters (protects against massive queries)
query_preview - : execution phase (preparing, executing, etc.)
phase - : whether query spans multiple nodes
distributed
Customizable thresholds:
- Change to
INTERVAL '5 minutes','10 minutes', etc.'30 seconds' - Adjust based on cluster size and expected load
LIMIT
识别运行时间超过指定阈值的查询:
sql
-- 运行时间超过5分钟的查询
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
query_id,
node_id,
session_id,
user_name,
client_address,
application_name,
start,
now() - start AS running_for,
substring(query, 1, 200) AS query_preview,
distributed,
phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;关键列说明:
- :查询已执行时长
running_for - :查询前200个字符(避免返回过大的查询内容)
query_preview - :执行阶段(准备中、执行中等)
phase - :查询是否跨多个节点
distributed
可自定义阈值:
- 将修改为
INTERVAL '5 minutes'、'10 minutes'等'30 seconds' - 根据集群规模和预期负载调整值
LIMIT
Active Sessions
活跃会话
Find sessions with long-running active queries:
sql
-- Sessions with active queries running > 5 minutes
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
node_id,
session_id,
user_name,
client_address,
application_name,
status,
active_query_start,
now() - active_query_start AS active_query_for,
substring(active_queries, 1, 200) AS active_queries_preview,
substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;Key columns:
- : Duration of current active query
active_query_for - : Source application for drill-down
application_name - : Client IP/hostname for troubleshooting
client_address - : Session state (Idle, Active, etc.)
status
查找包含长时间运行查询的会话:
sql
-- 包含运行时长超过5分钟的活跃查询的会话
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
node_id,
session_id,
user_name,
client_address,
application_name,
status,
active_query_start,
now() - active_query_start AS active_query_for,
substring(active_queries, 1, 200) AS active_queries_preview,
substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;关键列说明:
- :当前活跃查询的持续时长
active_query_for - :来源应用,用于深入排查
application_name - :客户端IP/主机名,用于故障排查
client_address - :会话状态(空闲、活跃等)
status
Active Transactions
活跃事务
Identify long-running transactions (potential blockers):
sql
-- Transactions running > 5 minutes
SELECT
id AS txn_id,
node_id,
session_id,
application_name,
start,
now() - start AS running_for,
num_stmts,
num_retries,
num_auto_retries,
substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;Key columns:
- /
num_retries: High retry counts indicate contentionnum_auto_retries - : Number of statements in transaction (large = potentially problematic)
num_stmts - : Transaction fingerprint
txn_string
Production safety note: is production-approved and safe for triage.
crdb_internal.cluster_transactions识别长时间运行的事务(潜在阻塞源):
sql
-- 运行时间超过5分钟的事务
SELECT
id AS txn_id,
node_id,
session_id,
application_name,
start,
now() - start AS running_for,
num_stmts,
num_retries,
num_auto_retries,
substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;关键列说明:
- /
num_retries:高重试次数表示存在资源竞争num_auto_retries - :事务中的语句数量(数量过大可能存在问题)
num_stmts - :事务特征码
txn_string
生产环境安全提示: 已通过生产环境验证,可安全用于排查。
crdb_internal.cluster_transactionsDrill-Down by Application, User, or Client
按应用、用户或客户端深入排查
Once you identify suspicious activity, drill down by filtering:
识别到可疑活动后,可通过过滤条件深入排查:
Filter by Application
按应用过滤
sql
-- All activity from specific application
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;sql
-- 特定应用的所有活动
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;Filter by User
按用户过滤
sql
-- All activity from specific user
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
AND active_query_start IS NOT NULL
ORDER BY active_query_start;sql
-- 特定用户的所有活动
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
AND active_query_start IS NOT NULL
ORDER BY active_query_start;Filter by Client Address
按客户端地址过滤
sql
-- All sessions from specific client IP
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;sql
-- 特定客户端IP的所有会话
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;Combined Filters
组合过滤
sql
-- Long queries from specific app and user
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
AND user_name = 'app_user'
AND start < now() - INTERVAL '10 minutes'
ORDER BY start;sql
-- 特定应用和用户的长时间查询
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
AND user_name = 'app_user'
AND start < now() - INTERVAL '10 minutes'
ORDER BY start;Safety Considerations
安全注意事项
Read-only operations:
All diagnostic queries ( statements, ) are read-only and safe to run in production.
SHOWcrdb_internal.cluster_transactionsCancellation operations (opt-in):
CAUTION: Canceling queries/sessions terminates user work
Only proceed if:
- You've confirmed the query/session is runaway or stuck
- You have authorization to interrupt user workloads
- You've notified stakeholders if appropriate
- You have or
CANCELQUERYprivilegesCANCELSESSION
只读操作:
所有诊断查询(语句、)均为只读操作,可在生产环境安全运行。
SHOWcrdb_internal.cluster_transactions终止操作(需手动选择):
注意:终止查询/会话会中断用户工作
仅在以下情况下执行:
- 已确认查询/会话属于失控或卡住状态
- 您有权中断用户工作负载
- 已通知相关利益相关方(如适用)
- 具备或
CANCELQUERY权限CANCELSESSION
Canceling Runaway Work (Opt-In)
终止失控任务(需手动选择)
Cancel a Specific Query
终止特定查询
sql
-- 1. Identify the query_id from triage queries above
-- 2. Cancel it
CANCEL QUERY '<query_id>';Example:
sql
CANCEL QUERY '15f9e0e91f072f0f0000000000000001';sql
-- 1. 从上述排查查询中获取query_id
-- 2. 执行终止操作
CANCEL QUERY '<query_id>';示例:
sql
CANCEL QUERY '15f9e0e91f072f0f0000000000000001';Cancel an Entire Session
终止整个会话
sql
-- 1. Identify the session_id from triage queries above
-- 2. Cancel all queries in that session
CANCEL SESSION '<session_id>';Example:
sql
CANCEL SESSION '15f9e0e91f072f0f';Verification:
After canceling, re-run the triage queries to confirm the query/session is gone.
Required privileges:
- system privilege to cancel queries
CANCELQUERY - system privilege to cancel sessions
CANCELSESSION - Admin role has both by default
See permissions reference for granting these privileges.
sql
-- 1. 从上述排查查询中获取session_id
-- 2. 终止该会话中的所有查询
CANCEL SESSION '<session_id>';示例:
sql
CANCEL SESSION '15f9e0e91f072f0f';验证:
终止后,重新运行排查查询确认查询/会话已消失。
所需权限:
- 系统权限:用于终止查询
CANCELQUERY - 系统权限:用于终止会话
CANCELSESSION - 管理员角色默认具备上述两种权限
如需授予这些权限,请查看权限参考文档。
Common Triage Workflows
常见排查流程
Workflow 1: "Cluster is slow" investigation
流程1:“集群运行缓慢”排查
Scenario: Users report general slowness.
-
Check for long-running queries:sql
-- Run the "Long-Running Queries" diagnostic -- Look for queries running > 5-10 minutes -
Identify source applications:sql
-- Group by application to find culprits WITH q AS (SHOW CLUSTER STATEMENTS) SELECT application_name, COUNT(*) AS num_queries, AVG(now() - start) AS avg_duration FROM q WHERE start < now() - INTERVAL '5 minutes' GROUP BY application_name ORDER BY num_queries DESC; -
Drill down into specific app:sql
-- Filter by top application from step 2 -- Use "Filter by Application" query -
Decide on action:
- Contact app team to investigate query patterns
- Cancel specific runaway queries if critical
- Check for schema/index issues if queries are legitimate
场景: 用户反馈整体运行缓慢。
-
检查长时间运行的查询:sql
-- 运行“长时间运行的查询”诊断语句 -- 查找运行时长超过5-10分钟的查询 -
识别来源应用:sql
-- 按应用分组查找问题来源 WITH q AS (SHOW CLUSTER STATEMENTS) SELECT application_name, COUNT(*) AS num_queries, AVG(now() - start) AS avg_duration FROM q WHERE start < now() - INTERVAL '5 minutes' GROUP BY application_name ORDER BY num_queries DESC; -
深入排查特定应用:sql
-- 过滤步骤2中排名靠前的应用 -- 使用“按应用过滤”查询语句 -
决定操作:
- 联系应用团队排查查询模式
- 若情况紧急,终止特定失控查询
- 若查询合法,检查模式/索引问题
Workflow 2: Find high-retry transactions
流程2:查找高重试事务
Scenario: Suspect contention issues.
-
Check for high retry counts:sql
SELECT application_name, AVG(num_retries) AS avg_retries, MAX(num_retries) AS max_retries, COUNT(*) AS num_txns FROM crdb_internal.cluster_transactions WHERE start < now() - INTERVAL '5 minutes' GROUP BY application_name HAVING AVG(num_retries) > 5 ORDER BY avg_retries DESC; -
Investigate specific transactions:sql
-- Find transactions with >10 retries SELECT id, application_name, num_retries, num_stmts, substring(txn_string, 1, 200) AS txn_preview FROM crdb_internal.cluster_transactions WHERE num_retries > 10 ORDER BY num_retries DESC; -
Next steps:
- Review transaction patterns for contention
- Check for lock conflicts or hotspots
- Consider schema changes to reduce contention
场景: 怀疑存在资源竞争问题。
-
检查高重试次数:sql
SELECT application_name, AVG(num_retries) AS avg_retries, MAX(num_retries) AS max_retries, COUNT(*) AS num_txns FROM crdb_internal.cluster_transactions WHERE start < now() - INTERVAL '5 minutes' GROUP BY application_name HAVING AVG(num_retries) > 5 ORDER BY avg_retries DESC; -
深入排查特定事务:sql
-- 查找重试次数超过10次的事务 SELECT id, application_name, num_retries, num_stmts, substring(txn_string, 1, 200) AS txn_preview FROM crdb_internal.cluster_transactions WHERE num_retries > 10 ORDER BY num_retries DESC; -
后续步骤:
- 审查事务模式以排查资源竞争
- 检查锁冲突或热点问题
- 考虑修改模式以减少资源竞争
Workflow 3: Identify resource hogs by user
流程3:按用户识别资源消耗大户
Scenario: Need to attribute load to specific users.
-
Count active queries per user:sql
WITH q AS (SHOW CLUSTER STATEMENTS) SELECT user_name, COUNT(*) AS num_active_queries, AVG(now() - start) AS avg_duration FROM q GROUP BY user_name ORDER BY num_active_queries DESC; -
Drill down to specific user's activity:sql
-- Use "Filter by User" query -
Take action:
- Contact user if unexpected load
- Review user's query patterns
- Cancel if clearly runaway
场景: 需要将负载归因于特定用户。
-
统计每个用户的活跃查询数量:sql
WITH q AS (SHOW CLUSTER STATEMENTS) SELECT user_name, COUNT(*) AS num_active_queries, AVG(now() - start) AS avg_duration FROM q GROUP BY user_name ORDER BY num_active_queries DESC; -
深入排查特定用户的活动:sql
-- 使用“按用户过滤”查询语句 -
采取行动:
- 若负载异常,联系对应用户
- 审查用户的查询模式
- 若明确为失控查询,执行终止操作
Troubleshooting
故障排除
| Issue | Cause | Fix |
|---|---|---|
| No active queries, or insufficient privileges | Grant |
Query text shows | Using | This is expected for privacy; use |
| Can't cancel query: "permission denied" | Missing | Grant |
| High transaction volume on cluster | Add filters (application_name, time threshold) to reduce result set |
| "relation does not exist" error | Typo in table name or old CockroachDB version | Verify you're using production-approved tables; check CockroachDB version compatibility |
| Triage queries themselves are slow | Cluster under extreme load | Use more aggressive filters (shorter time window, specific apps); consider canceling obvious runaway work first |
| 问题 | 原因 | 解决方法 |
|---|---|---|
| 无活跃查询,或权限不足 | 授予 |
查询文本显示 | 使用了 | 这是隐私保护的预期效果;如需完整文本,使用 |
| 无法终止查询:“permission denied” | 缺少 | 为您的用户授予 |
| 集群事务量过高 | 添加过滤条件(应用名称、时间阈值)以减少结果集 |
| 出现“relation does not exist”错误 | 表名拼写错误或CockroachDB版本过旧 | 确认使用的是生产环境认可的表;检查CockroachDB版本兼容性 |
| 排查查询本身运行缓慢 | 集群处于极端负载状态 | 使用更严格的过滤条件(更短的时间窗口、特定应用);优先终止明显的失控任务 |
Key Considerations
关键注意事项
- Privacy: Use instead of
VIEWACTIVITYREDACTEDto protect sensitive query constants in multi-tenant environmentsVIEWACTIVITY - Performance impact: Triage queries are read-only and lightweight, but avoid running them in tight loops during extreme load
- LIMIT clause: Always include to prevent overwhelming output on large clusters
LIMIT - Time thresholds: Adjust based on your workload (5 minutes is a reasonable default, but fast OLTP may need 30 seconds)
INTERVAL - Cancellation is disruptive: Only cancel queries/sessions after confirming they're problematic; coordinate with application teams when possible
- Not for historical analysis: These queries show current state only; for trends over time, use DB Console or Prometheus metrics
- Production-approved sources: Only use ,
SHOW CLUSTER STATEMENTS, andSHOW CLUSTER SESSIONSfor production triagecrdb_internal.cluster_transactions
- 隐私保护: 在多租户环境中,使用而非
VIEWACTIVITYREDACTED以保护敏感查询常量VIEWACTIVITY - 性能影响: 排查查询为只读且轻量,但在极端负载期间避免频繁循环执行
- LIMIT子句: 始终添加以避免在大型集群中返回过多结果
LIMIT - 时间阈值: 根据您的工作负载调整(5分钟为合理默认值,但快速OLTP场景可能需要30秒)
INTERVAL - 终止操作具有破坏性: 仅在确认查询/会话存在问题后执行终止操作;尽可能与应用团队协调
- 不适用于历史分析: 这些查询仅显示当前状态;如需分析时间趋势,使用DB Console或Prometheus指标
- 生产环境认可的来源: 仅使用、
SHOW CLUSTER STATEMENTS和SHOW CLUSTER SESSIONS进行生产环境排查crdb_internal.cluster_transactions
References
参考资料
Skill references:
- SQL query variations and examples
- RBAC and privilege setup
Related skills:
- profiling-statement-fingerprints - For historical performance pattern analysis and trend identification
- profiling-transaction-fingerprints - For historical transaction-level analysis including retry storms and commit latency
Official CockroachDB Documentation:
技能参考:
- SQL查询变体及示例
- RBAC及权限设置
相关技能:
- profiling-statement-fingerprints - 用于历史性能模式分析和趋势识别
- profiling-transaction-fingerprints - 用于历史事务级分析,包括重试风暴和提交延迟
官方CockroachDB文档: