dt-obs-frontends
Frontend Observability Skill
Monitor web and mobile frontends using Real User Monitoring (RUM) with DQL queries.
This skill targets the new RUM experience only; do not use classic RUM data.
Overview
This skill helps you:
- Monitor Core Web Vitals and frontend performance
- Track user sessions, engagement, and behavior
- Analyze errors and correlate with backend traces
- Optimize mobile app startup and stability
- Diagnose performance issues with detailed timing analysis
Data Sources:
- Metrics: `timeseries` with `dt.frontend.*` (trends, alerting)
- Events: `fetch user.events` (individual page views, requests, clicks, errors)
- Sessions: `fetch user.sessions` (session-level aggregates: duration, bounce, counts)
Quick Reference
Common Metrics
- `dt.frontend.user_action.count` - User action volume
- `dt.frontend.user_action.duration` - User action duration
- `dt.frontend.request.count` - Request volume
- `dt.frontend.request.duration` - Request latency (ms)
- `dt.frontend.error.count` - Error counts
- `dt.frontend.session.active.estimated_count` - Active sessions
- `dt.frontend.user.active.estimated_count` - Unique users
- `dt.frontend.web.page.cumulative_layout_shift` - CLS metric
- `dt.frontend.web.navigation.dom_interactive` - DOM interactive time
- `dt.frontend.web.page.first_input_delay` - FID metric (legacy; prefer INP)
- `dt.frontend.web.page.largest_contentful_paint` - LCP metric
- `dt.frontend.web.page.interaction_to_next_paint` - INP metric
- `dt.frontend.web.navigation.load_event_end` - Load event end
- `dt.frontend.web.navigation.time_to_first_byte` - Time to first byte
Common Filters
- `frontend.name` - Filter by frontend name (e.g. `my-frontend`)
- `dt.rum.user_type` - Exclude synthetic monitoring
- `geo.country.iso_code` - Geographic filtering
- `device.type` - Mobile, desktop, tablet
- `browser.name` - Browser filtering
Common Timeseries Dimensions
Use these for `dt.frontend.*` timeseries splits and breakdowns:
- `frontend.name` - Frontend name
- `geo.country.iso_code`
- `device.type`
- `browser.name`
- `os.name`
- `user_type` - `real_user`, `synthetic`, `robot`
```dql
fetch user.events, from: now() - 2h
| filter characteristics.has_page_summary == true
| summarize page_views = count(), by: {frontend.name}
| sort page_views desc
```
Event Characteristics
- `characteristics.has_page_summary` - Page views (web)
- `characteristics.has_view_summary` - Views (mobile)
- `characteristics.has_navigation` - Navigation events
- `characteristics.has_user_interaction` - Clicks, forms, etc.
- `characteristics.has_request` - Network request events
- `characteristics.has_error` - Error events
- `characteristics.has_crash` - Mobile crashes
- `characteristics.has_long_task` - Long JavaScript tasks
- `characteristics.has_csp_violation` - CSP violations
Session Data (`user.sessions`)
Session identity and context:
- `dt.rum.session.id` - Session ID (NOT `dt.rum.session_id`)
- `dt.rum.instance.id` - Instance ID
- `frontend.name` - Array of frontends involved in the session
- `dt.rum.application.type` - `web` or `mobile`
- `dt.rum.user_type` - `real_user`, `synthetic`, or `robot`
Session aggregates (underscore naming, NOT dot):
| Field | Description | ⚠️ NOT this |
|---|---|---|
| `navigation_count` | Number of navigations | |
| `user_interaction_count` | Clicks, form submissions | |
| | User actions | |
| | XHR/fetch requests | |
| | Total events in session | |
| | Page views (web) | |
| | Views (mobile/SPA) | |
Error fields (dot naming, same as events):
- `error.count`, `error.exception_count`, `error.http_4xx_count`, `error.http_5xx_count`
- `error.anr_count`, `error.csp_violation_count`, `error.has_crash`
Session lifecycle:
- `start_time`, `end_time`, `duration` (nanoseconds)
- `end_reason` - `timeout`, `synthetic_execution_finished`, etc.
- `characteristics.is_bounce` - Boolean bounce flag
- `characteristics.has_replay` - Session replay available
User identity:
- `dt.rum.user_tag` - User identifier (typically email, username or customerId), set via the `dtrum.identifyUser()` API call in the instrumented frontend. Not always populated: it is only present when the frontend explicitly calls `identifyUser()`.
- When `dt.rum.user_tag` is empty, `dt.rum.instance.id` is often the only user differentiator. The value is a random ID assigned by the RUM agent on the client side, so it is not personally identifiable but can be used to distinguish unique users when `user_tag` is not set. On web this is based on a persistent cookie, so it can be deleted by the user.
- The user tag is a session-level field: query it from `user.sessions`, not `user.events` (where it may be empty even if the session has one).
Client/device context:
- `browser.name`, `browser.version`, `device.type`, `os.name`
- `geo.country.iso_code`, `client.ip`, `client.isp`
Synthetic-only fields:
- `dt.entity.synthetic_test`, `dt.entity.synthetic_location`, `dt.entity.synthetic_test_step`
Time window behavior:
- `fetch user.sessions, from: X, to: Y` only returns sessions that started in `[X, Y]`, NOT sessions that were merely active during that window.
- Sessions can last 8h+ (the aggregation service waits 30+ minutes of inactivity before closing a session).
- To find all sessions active during a time window, extend the lookback by at least 8 hours: e.g., to cover events from the last 24h, query `fetch user.sessions, from: now() - 32h`.
- This matters for correlation queries (e.g., matching `user.events` to `user.sessions` by session ID): a narrow `user.sessions` window will miss long-running sessions and produce false "orphans."
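The lookback arithmetic in the list above can be sketched in a few lines of Python. This is only an illustration, not part of the skill: the helper name `sessions_fetch_clause` and its `extra_hours` parameter are hypothetical, and the 8-hour default is simply the minimum extension named above.

```python
def sessions_fetch_clause(event_lookback_hours: int, extra_hours: int = 8) -> str:
    """Build a user.sessions fetch clause that also covers long-running sessions.

    event_lookback_hours: how far back the user.events side of the query reaches.
    extra_hours: additional lookback for sessions that started before that window
                 (assumed default: the 8h minimum mentioned in the text).
    """
    if event_lookback_hours <= 0:
        raise ValueError("event_lookback_hours must be positive")
    if extra_hours < 0:
        raise ValueError("extra_hours must not be negative")
    total_hours = event_lookback_hours + extra_hours
    return f"fetch user.sessions, from: now() - {total_hours}h"


# Covering events from the last 24h means fetching sessions from the last 32h.
print(sessions_fetch_clause(24))
```

Keeping the extension as a parameter makes it easy to widen the window further if sessions in a given environment are known to run longer.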
Session creation delay:
- The session aggregation service waits for ~30+ minutes of inactivity before closing a session and writing the `user.sessions` record.
- This means recent events (last ~1 hour) will not yet have a matching `user.sessions` entry. This is normal, not a data gap.
- When correlating `user.events` with `user.sessions`, exclude recent data (e.g., use `to: now() - 1h`) to avoid counting in-progress sessions as orphans.
Zombie sessions (events without a `user.sessions` record):
- Not every `dt.rum.session.id` in `user.events` will have a corresponding `user.sessions` record. The session aggregation service intentionally skips zombie sessions: sessions with no real user activity (zero navigations and zero user interactions).
- Zombie sessions contain only background, machine-driven activity (e.g., automatic XHR requests, heartbeats) with no page views or clicks. Serializing them would add no value to users.
- When correlating `user.events` with `user.sessions`, expect a large number of unmatched session IDs. This is by design, not a data gap. Filter to sessions with activity before diagnosing orphans:
```dql
fetch user.events, from: now() - 2h, to: now() - 1h
| filter isNotNull(dt.rum.session.id)
| summarize
    navs = countIf(characteristics.has_navigation == true),
    interactions = countIf(characteristics.has_user_interaction == true),
    by: {dt.rum.session.id}
| filter navs > 0 or interactions > 0
```
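The three explanations for an unmatched session ID (still in progress, zombie, or a genuine orphan) can be combined into one classification step. The Python sketch below assumes the per-session counts have already been pulled from `user.events`; the `SessionActivity` type, the function name, and the 60-minute cutoff are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionActivity:
    """Per-session summary derived from user.events (hypothetical shape)."""
    session_id: str
    navigations: int
    interactions: int
    minutes_since_last_event: float


def classify_session(activity: SessionActivity,
                     session_record_ids: set,
                     close_delay_minutes: float = 60.0) -> str:
    """Explain why a session ID from user.events has or lacks a user.sessions record."""
    if activity.session_id in session_record_ids:
        return "matched"
    # Recent sessions may simply not have been closed and written yet.
    if activity.minutes_since_last_event < close_delay_minutes:
        return "in_progress"
    # No navigations and no interactions: skipped by design.
    if activity.navigations == 0 and activity.interactions == 0:
        return "zombie"
    return "orphan"


records = {"s-1"}
print(classify_session(SessionActivity("s-2", 0, 0, 240.0), records))
```

Only the `orphan` outcome points at a possible data gap worth investigating further.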
Example (bounce rate and session quality):
```dql
fetch user.sessions, from: now() - 24h
| filter dt.rum.user_type == "real_user"
| summarize
    total_sessions = count(),
    bounces = countIf(characteristics.is_bounce == true),
    zero_activity = countIf(toLong(navigation_count) == 0 and toLong(user_interaction_count) == 0),
    avg_duration_s = avg(toLong(duration)) / 1000000000
| fieldsAdd bounce_rate_pct = round((bounces * 100.0) / total_sessions, decimals: 1)
```
Performance Thresholds
- LCP: Good <2.5s | Poor >4.0s
- INP: Good <200ms | Poor >500ms
- CLS: Good <0.1 | Poor >0.25
- Cold Start: Good <3s | Poor >5s
- Long Tasks: >50ms problematic, >250ms severe
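As an illustration of how these thresholds might be applied to query results, here is a minimal Python sketch. The dictionary keys, the function name, and the "needs improvement" label for values between the two bounds are assumptions; only the numeric bounds come from the list above.

```python
# (good_below, poor_above) pairs copied from the thresholds above.
# Units: lcp_s and cold_start_s in seconds, inp_ms in milliseconds, cls unitless.
THRESHOLDS = {
    "lcp_s": (2.5, 4.0),
    "inp_ms": (200, 500),
    "cls": (0.1, 0.25),
    "cold_start_s": (3, 5),
}


def rate_metric(metric: str, value: float) -> str:
    """Return 'good', 'poor', or 'needs improvement' for a measured value."""
    good_below, poor_above = THRESHOLDS[metric]
    if value < good_below:
        return "good"
    if value > poor_above:
        return "poor"
    return "needs improvement"


print(rate_metric("lcp_s", 8.0))
print(rate_metric("inp_ms", 300))
```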
Core Workflows
1. Web Performance Monitoring
Track Core Web Vitals, page performance, and request latency for SEO and UX optimization.
Primary Files:
- `references/WebVitals.md` - Core Web Vitals (LCP, INP, CLS)
- `references/performance-analysis.md` - Request and page performance
Common Queries:
- All Core Web Vitals summary
- Web Vitals by page/device
- Request duration SLA monitoring
- Page load performance trends
2. User Session & Behavior Analysis
Understand user engagement, navigation patterns, and session characteristics. Analyze button clicks, form interactions, and user journeys.
Data source choice:
- Use `fetch user.sessions` for session-level analysis (bounce rate, session duration, session counts)
- Use `fetch user.events` for event-level detail (individual clicks, navigation timing, specific pages)
Primary Files:
- `references/user-sessions.md` - Session tracking and user analytics
- `references/performance-analysis.md` - Navigation and engagement patterns
Common Queries:
- Active sessions by frontend
- Sessions by custom property
- Bounce rate analysis (use `user.sessions` with `characteristics.is_bounce`)
- Session quality (zero-activity sessions via `navigation_count`, `user_interaction_count`)
- Click analysis on UI elements (use `user.events` with `characteristics.has_user_interaction`)
- External referrers (traffic sources)
3. Error Tracking & Debugging
Monitor error rates, analyze exceptions, and correlate frontend issues with backend.
Primary Files:
- `references/error-tracking.md` - Error analysis and debugging
- `references/performance-analysis.md` - Trace correlation
Common Queries:
- Error rate monitoring
- JavaScript exceptions by type
- Failed requests with backend traces
- Request timing breakdown
4. Mobile Frontend Monitoring
Track mobile app performance, startup times, and crash analytics for iOS and Android. Analyze app version performance and device-specific issues.
Primary Files:
- `references/mobile-monitoring.md` - App starts, crashes, and mobile-specific metrics
Common Queries:
- Cold start performance by app version (iOS, Android)
- Warm start and hot start metrics
- Crash rate by device model and OS version
- ANR events (Android)
- Native crash signals
- App version comparison
5. Advanced Performance Optimization
Deep performance diagnostics including JavaScript profiling, main thread blocking, UI jank analysis, and geographic performance.
Primary Files:
- `references/performance-analysis.md` - Advanced diagnostics and long tasks
Common Queries:
- Long JavaScript tasks blocking main thread
- UI jank and rendering delays
- Tasks >50ms impacting responsiveness
- Third-party long tasks (iframes)
- Single-page app performance issues
- Geographic performance distribution
- Performance degradation detection
Best Practices
- Use metrics for trends, events for debugging
  - Metrics: Timeseries dashboards, alerting, capacity planning
  - Events: Root cause analysis, detailed diagnostics
- Filter by frontend in multi-app environments
  - Always use `frontend.name` for clarity
- Match interval to time range
  - 5m intervals for hours, 1h for days, 1d for weeks
- Exclude synthetic traffic when analyzing real users
  - Filter `dt.rum.user_type` to focus on genuine behavior
- Combine metrics with events for complete insights
  - Start with metric trends, drill into events for details
- Extend `user.sessions` time window for correlation queries
  - `user.sessions` only returns sessions that started in the query window
  - Sessions can last 8h+, so extend lookback by at least 8h when joining with `user.events`
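The "match interval to time range" rule above can be sketched as a small Python helper. The function name and the exact cutoffs (under one day, under one week) are assumptions; the text only gives the 5m/1h/1d pairing for hours, days, and weeks.

```python
from datetime import timedelta


def pick_interval(time_range: timedelta) -> str:
    """Suggest a timeseries interval for a query time range.

    Assumed cutoffs: ranges shorter than a day count as "hours",
    ranges shorter than a week count as "days", anything longer as "weeks".
    """
    if time_range <= timedelta(0):
        raise ValueError("time_range must be positive")
    if time_range < timedelta(days=1):
        return "5m"
    if time_range < timedelta(weeks=1):
        return "1h"
    return "1d"


print(pick_interval(timedelta(hours=6)))
print(pick_interval(timedelta(days=3)))
print(pick_interval(timedelta(weeks=2)))
```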
Slow Page Load Playbook
Start by segmenting the problem by page, browser, geo location, and `dt.rum.user_type`.
Heuristics:
- High TTFB -> slow backend
- High LCP with normal TTFB -> render bottleneck
- High CLS -> layout shifts (late-loading content, ads, fonts)
- Long tasks dominate -> JavaScript execution bottlenecks (heavy frameworks, large bundles)
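These heuristics can be expressed as a rough triage function. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, the LCP and CLS cutoffs reuse the "Poor" thresholds from this document, and the TTFB cutoff (800 ms) and the long-task share cutoff (50% of load time) are placeholder values chosen here, not values stated by the skill.

```python
def diagnose_slow_page(ttfb_ms: float, lcp_s: float, cls: float,
                       long_task_share: float,
                       ttfb_high_ms: float = 800.0,       # assumed cutoff
                       lcp_high_s: float = 4.0,           # "Poor" LCP threshold
                       cls_high: float = 0.25,            # "Poor" CLS threshold
                       long_task_share_high: float = 0.5  # assumed cutoff
                       ) -> list:
    """Map page measurements to the likely bottlenecks named in the heuristics."""
    findings = []
    ttfb_is_high = ttfb_ms > ttfb_high_ms
    if ttfb_is_high:
        findings.append("slow backend")
    # High LCP only points at rendering when TTFB is normal.
    if lcp_s > lcp_high_s and not ttfb_is_high:
        findings.append("render bottleneck")
    if cls > cls_high:
        findings.append("layout shifts")
    if long_task_share > long_task_share_high:
        findings.append("JavaScript execution bottleneck")
    return findings


print(diagnose_slow_page(ttfb_ms=300, lcp_s=5.0, cls=0.3, long_task_share=0.7))
```

An empty result means none of the heuristics fired, in which case the network and third-party checks later in this playbook are the next place to look.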
Backend latency (high TTFB)
```dql
fetch user.events
| filter frontend.name == "my-frontend" and characteristics.has_request == true
| filter page.url.path == "/checkout"
| summarize avg_ttfb = avg(request.time_to_first_byte), avg_duration = avg(duration)
```
If TTFB is high, analyze backend spans by correlating frontend events with backend traces using `dt.rum.trace_id`.
Heavy JavaScript execution (long tasks)
Long tasks by page:
```dql
fetch user.events, from: now() - 2h
| filter characteristics.has_long_task == true
| summarize
    long_task_count = count(),
    total_blocking_time = sum(duration),
    by: {frontend.name, page.url.path}
| sort total_blocking_time desc
| limit 20
```
Long tasks by script source:
```dql
fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend"
| filter characteristics.has_long_task == true
| summarize
    long_task_count = count(),
    total_blocking_time = sum(duration),
    by: {long_task.attribution.container_src}
| sort total_blocking_time desc
| limit 20
```
Large JavaScript bundles
```dql
fetch user.events
| filter frontend.name == "my-frontend"
| filter characteristics.has_request
| filter endsWith(url.full, ".js")
| summarize dls = max(performance.decoded_body_size), by: url.full
| sort dls desc
| limit 20
```
Large resources
```dql
fetch user.events
| filter frontend.name == "my-frontend"
| filter characteristics.has_request
| summarize dls = max(performance.decoded_body_size), by: url.full
| sort dls desc
| limit 20
```
Cache effectiveness
```dql
fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend"
| filter characteristics.has_request == true
| fieldsAdd cache_status = if(
    performance.incomplete_reason == "local_cache" or performance.transfer_size == 0 and
    (performance.encoded_body_size > 0 or performance.decoded_body_size > 0),
    "cached",
    else: if(performance.transfer_size > 0, "network", else: "uncached")
  )
| summarize
    request_count = count(),
    avg_duration = avg(duration),
    by: {url.domain, cache_status}
```
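The `cache_status` expression in the query is dense, so here is the same classification written out as a Python function for readability. It is a sketch that mirrors the query's conditions; the function name is hypothetical, and treating missing values as zero is an assumption about how nulls should be handled rather than a statement about DQL semantics.

```python
from typing import Optional


def cache_status(incomplete_reason: Optional[str],
                 transfer_size: Optional[int],
                 encoded_body_size: Optional[int],
                 decoded_body_size: Optional[int]) -> str:
    """Classify a request as 'cached', 'network', or 'uncached'."""
    # A body was delivered if either size is reported as positive.
    has_body = (encoded_body_size or 0) > 0 or (decoded_body_size or 0) > 0
    # Cached: flagged as local cache, or a body arrived with zero bytes transferred.
    if incomplete_reason == "local_cache" or (transfer_size == 0 and has_body):
        return "cached"
    # Bytes crossed the network.
    if (transfer_size or 0) > 0:
        return "network"
    return "uncached"


print(cache_status(None, 0, 1200, 4000))
print(cache_status(None, 5300, 5000, 20000))
```

Writing the `and` clause in its own parentheses also makes the intended grouping explicit: the zero-transfer check applies only together with the body-size check.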
Compression waste
```dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| filter isNotNull(performance.encoded_body_size) and isNotNull(performance.decoded_body_size)
| filter performance.encoded_body_size > 0
| fieldsAdd
    expansion_ratio = performance.decoded_body_size / performance.encoded_body_size,
    wasted_bytes = performance.decoded_body_size - performance.encoded_body_size
| summarize
    requests = count(),
    avg_expansion_ratio = avg(expansion_ratio),
    total_wasted_bytes = sum(wasted_bytes),
    by: {request.url.host, request.url.path}
| sort total_wasted_bytes desc
| limit 50
```
Network issues
Compare by location and domain when TTFB is high but backend performance is good:
```dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| summarize
    request_count = count(),
    avg_duration = avg(duration),
    p75_duration = percentile(duration, 75),
    p95_duration = percentile(duration, 95),
    by: {geo.country.iso_code, request.url.domain}
| sort p95_duration desc
| limit 50
```
Analyze DNS time:
```dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| filter isNotNull(performance.domain_lookup_start) and isNotNull(performance.domain_lookup_end)
| fieldsAdd dns_ms = performance.domain_lookup_end - performance.domain_lookup_start
| summarize
    request_count = count(),
    avg_dns_ms = avg(dns_ms),
    p75_dns_ms = percentile(dns_ms, 75),
    p95_dns_ms = percentile(dns_ms, 95),
    by: {request.url.domain}
| sort p95_dns_ms desc
| limit 50
```
Analyze by protocol (http/1.1, h2, h3):
```dql
fetch user.events
| filter characteristics.has_request
| summarize cnt = count(), by: {url.domain, performance.next_hop_protocol}
| sort cnt desc
| limit 50
```
Third-party dependencies
Analyze request performance by domain:
```dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| summarize
    request_count = count(),
    avg_duration = avg(duration),
    p75_duration = percentile(duration, 75),
    p95_duration = percentile(duration, 95),
    by: {request.url.domain}
| sort p95_duration desc
| limit 50
```
Troubleshooting
Handling Zero Results
When queries return no data, follow this diagnostic workflow:
- Validate Timeframe
  - Check if timeframe is appropriate for the data type
  - RUM data may be delayed by 1-2 minutes for recent events
  - Verify timeframe syntax: `now()-1h to now()` or similar
  - Try expanding timeframe: `now()-24h` for initial exploration
- Verify Frontend Configuration
  - Confirm frontend is instrumented and sending RUM data
  - Check `frontend.name` filter is correct
  - Test without frontend filter to see if any RUM data exists
  - Verify frontend name matches the environment
- Check Data Availability
  - Run basic query: `fetch user.events | limit 1`
  - If no events exist, RUM may not be configured
  - Check if timeframe predates frontend deployment
  - Verify user has access to the environment
- Review Query Syntax
  - Validate filters aren't too restrictive
  - Check for typos in field names or metric names
  - Test query incrementally: start simple, add filters gradually
  - Verify characteristics filters match event types
When to Ask User for Clarification:
- No RUM data exists in environment → "Is RUM configured for this frontend?"
- Timeframe unclear → "What time period should I analyze?"
- Expected data missing → "Has this frontend sent data recently?"
Handling Anomalous Results
When query results seem unexpected or suspicious:
Unexpected High Values:
- Metric spikes: Verify interval aggregation (avg vs. max vs. sum)
- Session counts: Check for bot traffic or synthetic monitoring
- Error rates: Confirm error definition matches expectations
- Performance degradation: Look for deployment or infrastructure changes
Unexpected Low Values:
- Missing sessions: Verify the `dt.rum.user_type` filter isn't excluding real users
- Low request counts: Check if frontend filter is too narrow
- Few errors: Confirm error characteristics filter is correct
- Missing mobile data: Verify platform-specific fields exist
Inconsistent Data:
- Metrics vs. Events mismatch: Different aggregation methods are expected
- Geographic anomalies: Check timezone assumptions
- Device distribution skew: May reflect actual user base
- Version mismatches: Verify app version filtering logic
Decision Tree: Ask vs. Investigate
```
Query returns unexpected results
│
├─ Is this a zero-result scenario?
│   ├─ YES → Follow "Handling Zero Results" workflow
│   └─ NO → Continue
│
├─ Can I validate the result independently?
│   ├─ YES → Run validation query
│   │   ├─ Validation confirms result → Report findings
│   │   └─ Validation contradicts → Investigate further
│   └─ NO → Continue
│
├─ Is the anomaly clearly explained by data?
│   ├─ YES → Report with explanation
│   └─ NO → Continue
│
├─ Do I need domain knowledge to interpret?
│   ├─ YES → Ask user for context
│   │        Example: "The error rate is 15%. Is this expected for your frontend?"
│   └─ NO → Continue
│
└─ Is the issue ambiguous or requires clarification?
    ├─ YES → Ask specific question with data context
    │        Example: "I see two frontends named 'web-app'. Which frontend name should I use?"
    └─ NO → Investigate and report findings with caveats
```
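The same decision tree, read top to bottom, can be written as a chain of early returns. This Python sketch is only one possible encoding: the `Finding` fields and the returned strings are hypothetical names that mirror the branches above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Answers to the questions in the decision tree (hypothetical shape)."""
    zero_results: bool = False
    can_validate: bool = False
    validation_confirms: bool = False
    explained_by_data: bool = False
    needs_domain_knowledge: bool = False
    ambiguous: bool = False


def next_step(f: Finding) -> str:
    """Walk the tree in order and return the first action that applies."""
    if f.zero_results:
        return "follow zero-results workflow"
    if f.can_validate:
        return "report findings" if f.validation_confirms else "investigate further"
    if f.explained_by_data:
        return "report with explanation"
    if f.needs_domain_knowledge:
        return "ask user for context"
    if f.ambiguous:
        return "ask specific question with data context"
    return "investigate and report findings with caveats"


print(next_step(Finding(needs_domain_knowledge=True)))
```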
Common Investigation Steps
For Performance Issues:
- Compare to baseline: Query same metric for previous week
- Segment by dimension: Break down by device, browser, geography
- Check for outliers: Use percentiles (p50, p95, p99) vs. averages
- Correlate with deployments: Filter by app version or time windows
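To see why percentiles are preferred over averages when checking for outliers, consider this small Python example with made-up durations. The `percentile` helper uses the nearest-rank definition, which is an assumption made for simplicity; a query engine may estimate percentiles differently.

```python
import math
import statistics


def percentile(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Nine typical requests and one very slow outlier (illustrative data, in ms).
durations_ms = [120, 130, 125, 140, 135, 128, 132, 138, 126, 4000]

mean_ms = statistics.mean(durations_ms)
p50_ms = percentile(durations_ms, 50)
p95_ms = percentile(durations_ms, 95)

# The mean is pulled far above what a typical user sees; p50 is not,
# and p95 exposes the outlier directly.
print(mean_ms, p50_ms, p95_ms)
```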
For Data Availability Issues:
- Start broad: Query all RUM data without filters
- Add filters incrementally: Isolate which filter eliminates data
- Check related metrics: If events missing, try timeseries
- Validate entity relationships: Confirm frontend-to-service links
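The "start broad, add filters incrementally" steps can be sketched as a loop that stops at the first filter that removes all rows. In this Python illustration, `count_rows` is a hypothetical callback standing in for running a query with the given filters, and the in-memory `EVENTS` list is made-up sample data.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Filter = Tuple[str, str]  # (field, expected value): a stand-in for a query filter


def first_eliminating_filter(filters: Sequence[Filter],
                             count_rows: Callable[[List[Filter]], int]
                             ) -> Optional[Filter]:
    """Apply filters one at a time; return the first that drops the count to zero."""
    applied: List[Filter] = []
    if count_rows(applied) == 0:
        # Nothing exists even without filters, so no filter is to blame.
        raise LookupError("no rows even without filters")
    for candidate in filters:
        applied.append(candidate)
        if count_rows(applied) == 0:
            return candidate
    return None  # every filter still leaves some data


EVENTS = [
    {"frontend.name": "my-frontend", "device.type": "desktop"},
    {"frontend.name": "my-frontend", "device.type": "mobile"},
]


def count_events(filters: List[Filter]) -> int:
    """Count sample events matching all (field, value) filters."""
    return sum(all(e.get(field) == value for field, value in filters) for e in EVENTS)


culprit = first_eliminating_filter(
    [("frontend.name", "my-frontend"), ("device.type", "tablet")], count_events)
print(culprit)
```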
For Unexpected Patterns:
- Expand timeframe: Look for historical context
- Cross-reference data sources: Compare events and metrics
- Check sampling: Verify no sampling is affecting results
- Consider external factors: Holidays, outages, traffic changes
Red Flags: When to Stop and Ask
Always ask the user when:
- ❌ No RUM data exists anywhere in the environment
- ❌ Multiple frontends match the user's description
- ❌ Results contradict user's stated expectations explicitly
- ❌ Data suggests monitoring is misconfigured
- ❌ Query requires business context (e.g., "acceptable error rate")
- ❌ Timeframe is ambiguous and affects interpretation significantly
Example clarifying questions:
- "I found two frontends named 'checkout'. Which one: `checkout-web` or `checkout-mobile`?"
- "The query returns 0 results for the past hour. Should I expand the timeframe, or do you expect real-time data?"
- "The average LCP is 8 seconds, which exceeds the 4-second threshold. Is this frontend known to have performance issues?"
- "I see only synthetic traffic. Should I add the filter `dt.rum.user_type == "real_user"` to focus on real users?"
When to Use This Skill
Use frontend-observability skill when:
- Monitoring web or mobile frontend performance
- Analyzing Core Web Vitals for SEO
- Tracking user sessions, engagement, or behavior
- Analyzing click events and button interactions
- Debugging frontend errors or slow requests
- Correlating frontend issues with backend traces
- Optimizing mobile app startup or crash rates (iOS, Android)
- Analyzing app version performance
- Diagnosing UI jank and main thread blocking
- Analyzing security compliance (CSP violations)
- Profiling JavaScript performance (long tasks)
Do NOT use for:
- Backend service monitoring (use services skill)
- Infrastructure metrics (use infrastructure skill)
- Log analysis (use logs skill)
- Business process monitoring (use business-events skill)
Progressive Disclosure
Always Available
- FrontendBasics.md - RUM fundamentals and quick reference
Loaded by Workflow
- Web Performance: WebVitals.md, performance-analysis.md
- User Behavior: user-sessions.md, performance-analysis.md
- Error Analysis: error-tracking.md, performance-analysis.md
- Mobile Apps: mobile-monitoring.md
Load on Explicit Request
- Advanced diagnostics (long tasks, user actions)
- Security compliance (CSP violations, visibility tracking)
- Specialized mobile features (platform-specific phases)
Reference Files
Core Reference Documents
- `references/WebVitals.md` - Core Web Vitals monitoring
- `references/user-sessions.md` - Session and user analytics
- `references/error-tracking.md` - Error analysis and debugging
- `references/mobile-monitoring.md` - Mobile app performance and crashes
- `references/performance-analysis.md` - Advanced performance diagnostics