dt-obs-services
Application Services Skill
Monitor application service performance, health, and runtime-specific metrics using DQL.
Core Capabilities
1. Service Performance (RED Metrics)
Monitor service Rate, Errors, Duration using metrics-based timeseries queries.
Key Metrics:
- dt.service.request.response_time - Response time (microseconds)
- dt.service.request.count - Request count
- dt.service.request.failure_count - Failed request count
Common Use Cases:
- Response time monitoring (avg, p50, p95, p99)
- Error rate tracking and spike detection
- Traffic analysis (throughput, peaks, growth)
- Performance degradation detection
- Multi-cluster comparison
Quick Example:
dql
timeseries {
  p95 = percentile(dt.service.request.response_time, 95),
  total_requests = sum(dt.service.request.count),
  failures = sum(dt.service.request.failure_count)
}, by: {dt.service.name}
| fieldsAdd p95_ms = p95[] / 1000, error_rate_pct = (failures[] * 100.0) / total_requests[]
→ For detailed queries: See references/service-metrics.md
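For error rate tracking, it can help to collapse each series into one number per service and rank the results. The following is a minimal sketch using the same metrics; the 2-hour window and the 1% threshold are illustrative assumptions, not values taken from the reference files.

```dql
// Sketch: rank services by overall error rate (window and threshold are assumed)
timeseries {
  total_requests = sum(dt.service.request.count),
  failures = sum(dt.service.request.failure_count)
}, by: {dt.service.name}, from: now() - 2h
// arraySum collapses each series into a single value for the whole window
| fieldsAdd error_rate_pct = (arraySum(failures) * 100.0) / arraySum(total_requests)
| filter error_rate_pct > 1
| sort error_rate_pct desc
```

Dividing the summed arrays, rather than averaging per-interval rates, weights busy intervals more heavily, which is usually what an overall error rate should reflect.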
2. Advanced Service Analysis
Span-based queries for complex scenarios requiring flexible filtering and custom aggregations.
Use Cases:
- SLA compliance tracking with custom thresholds
- Service health scoring (multi-dimensional)
- Operation/endpoint-level performance analysis
- Custom error classification
- Failure pattern detection with error details
Quick Example:
dql
fetch spans, from: now() - 1h | filter request.is_root_span == true
| fieldsAdd meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)
| summarize total = count(), sla_compliant = sum(meets_sla), by: {dt.service.name}
| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total
→ For detailed queries: See references/service-metrics.md
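Operation-level analysis follows the same span-based pattern with one more grouping key. This sketch assumes span.name is an acceptable stand-in for the operation or endpoint; the 1-hour window and the limit of 20 are arbitrary choices for illustration.

```dql
// Sketch: slowest operations per service (span.name as the operation key is an assumption)
fetch spans, from: now() - 1h
| filter request.is_root_span == true
| summarize
    requests = count(),
    failed = countIf(request.is_failed == true),
    p95_duration = percentile(duration, 95),
    by: {dt.service.name, span.name}
| fieldsAdd failure_pct = (failed * 100.0) / requests
| sort p95_duration desc
| limit 20
```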
3. Service Messaging Metrics
Monitor message-based service communication (queues, topics).
Key Metrics:
- dt.service.messaging.publish.count - Messages sent to queues or topics
- dt.service.messaging.receive.count - Messages received from queues or topics
- dt.service.messaging.process.count - Messages successfully processed
- dt.service.messaging.process.failure_count - Messages that failed processing
Use Cases:
- Message throughput monitoring (publish/receive rates)
- Message processing failure tracking
- Queue/topic health analysis
- Consumer lag detection (publish vs receive rate comparison)
Quick Example:
dql
timeseries {
  published = sum(dt.service.messaging.publish.count),
  received = sum(dt.service.messaging.receive.count),
  processed = sum(dt.service.messaging.process.count),
  failed = sum(dt.service.messaging.process.failure_count)
}, by: {dt.service.name}
→ For detailed queries: See references/service-metrics.md
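The publish-versus-receive comparison and the processing failure rate can be derived from the same four series. The sketch below aggregates across all services, on the assumption that publishers and consumers are usually different services; it also assumes process.count covers only successful messages, as the metric description above states.

```dql
// Sketch: environment-wide publish/receive gap and processing failure rate
timeseries {
  published = sum(dt.service.messaging.publish.count),
  received = sum(dt.service.messaging.receive.count),
  processed = sum(dt.service.messaging.process.count),
  failed = sum(dt.service.messaging.process.failure_count)
}
| fieldsAdd
    // a persistently positive gap suggests consumers are falling behind
    publish_receive_gap = published[] - received[],
    // denominator assumes processed counts successes only
    process_failure_pct = (failed[] * 100.0) / (processed[] + failed[])
```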
4. Service Mesh Monitoring
Monitor service mesh ingress performance and overhead.
Key Metrics:
- dt.service.request.service_mesh.response_time - Mesh response time (microseconds)
- dt.service.request.service_mesh.count - Mesh request count
- dt.service.request.service_mesh.failure_count - Mesh failure count
Use Cases:
- Mesh vs direct performance comparison
- Mesh overhead calculation
- Mesh failure analysis
- gRPC traffic monitoring
- Multi-cluster mesh performance
Quick Example:
dql
timeseries {
  direct_p95 = percentile(dt.service.request.response_time, 95),
  mesh_p95 = percentile(dt.service.request.service_mesh.response_time, 95)
}, by: {dt.service.name}
| fieldsAdd mesh_overhead_ms = (mesh_p95[] - direct_p95[]) / 1000
→ For detailed queries: See references/service-metrics.md
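Mesh failure analysis can reuse the error-rate shape from the RED metrics. This is a minimal sketch built only from the mesh metrics listed above; the field names on the left-hand side are illustrative.

```dql
// Sketch: mesh failure rate per service, per interval
timeseries {
  mesh_requests = sum(dt.service.request.service_mesh.count),
  mesh_failures = sum(dt.service.request.service_mesh.failure_count)
}, by: {dt.service.name}
| fieldsAdd mesh_failure_pct = (mesh_failures[] * 100.0) / mesh_requests[]
```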
5. Runtime-Specific Monitoring
Technology-specific runtime performance and resource usage metrics.
Java/JVM - references/java.md
- Memory: heap, pools, metaspace
- GC: impact, suspension, frequency, pause time
- Threads: count monitoring, leak detection
- Classes: loading, unloading, growth
Node.js - references/nodejs.md
- Event loop: utilization, active handles
- V8 heap: memory used, total
- GC: collection time, suspension
- Process: RSS memory
.NET CLR - references/dotnet.md
- Memory: consumption by generation
- GC: collection count, suspension time
- Thread pool: threads, queued work
- JIT: compilation time
Python - references/python.md
- Threads: active thread count
- Heap: allocated blocks
- GC: collection by generation, pause time
- Objects: collected, uncollectable
PHP - references/php.md
- OPcache: hit ratio, memory, restarts
- GC: effectiveness, duration
- JIT: buffer usage
- Interned strings: usage, buffer
Go - references/go.md
- Goroutines: count, leak detection
- GC: suspension, collection time
- Memory: heap by state, committed
- Scheduler: worker threads, queue size
- CGo: call frequency
When to Use This Skill
✅ Use for:
- Monitoring service performance (response time, errors, traffic)
- Calculating SLA compliance
- Analyzing service mesh performance
- Monitoring messaging throughput and processing failures
- Troubleshooting runtime-specific issues (GC, memory, threads)
- Multi-cluster service comparison
- Operation/endpoint-level analysis
❌ Don't use for:
- Infrastructure metrics (use infrastructure skills)
- Log analysis (use logs skills)
- Distributed tracing workflows (use traces/spans skills)
- Database performance (use database skills)
Agent Instructions
Understanding User Intent
Map user questions to capabilities:
| User Request | Use Capability | Key Files |
|---|---|---|
| "service performance", "response time", "error rate" | Service Performance (RED) | service-metrics.md |
| "SLA tracking", "health scoring" | Advanced Service Analysis | service-metrics.md |
| "service mesh", "Istio", "Linkerd", "mesh overhead" | Service Mesh Monitoring | service-metrics.md |
| "messaging", "queue", "topic", "publish", "consumer" | Service Messaging Metrics | service-metrics.md |
| "JVM GC", "Java memory", "heap" | Runtime-Specific (Java) | java.md |
| "Node.js event loop", "V8 heap" | Runtime-Specific (Node.js) | nodejs.md |
| ".NET CLR", "GC generation" | Runtime-Specific (.NET) | dotnet.md |
| "Python GC", "thread count" | Runtime-Specific (Python) | python.md |
| "OPcache", "PHP GC" | Runtime-Specific (PHP) | php.md |
| "goroutines", "Go GC", "scheduler" | Runtime-Specific (Go) | go.md |
Query Construction Patterns
1. Metrics-based (timeseries)
- Use for: Standard monitoring, dashboards, alerting
- Pattern: timeseries <metric> = <aggregation>(<metric_name>), by: {dimensions}
- Files: service-metrics.md, all runtime-specific files
2. Span-based (fetch spans)
- Use for: Complex filtering, custom logic, detailed analysis
- Pattern: fetch spans | filter request.is_root_span == true | fieldsAdd ... | summarize ...
- Files: service-metrics.md (Advanced Service Analysis section)
3. Comparison queries
- Use append for baseline comparison
- Use shift: -15m for time-shifted baselines
- Example: Performance degradation detection
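A comparison query for degradation detection might combine both pieces as follows. This is a sketch under stated assumptions: the field names current_ms, baseline_ms, and change_pct are illustrative, and collapsing each series with arrayAvg before comparing is one possible design, not necessarily the approach used in service-metrics.md.

```dql
// Sketch: compare current average response time with a 15-minute-shifted baseline
timeseries rt = avg(dt.service.request.response_time), by: {dt.service.name}
| fieldsAdd current_ms = arrayAvg(rt) / 1000
| append [
    timeseries rt = avg(dt.service.request.response_time), by: {dt.service.name}, shift: -15m
    | fieldsAdd baseline_ms = arrayAvg(rt) / 1000
  ]
// append yields separate records; merge them into one record per service
| summarize current_ms = max(current_ms), baseline_ms = max(baseline_ms), by: {dt.service.name}
| fieldsAdd change_pct = ((current_ms - baseline_ms) * 100.0) / baseline_ms
| sort change_pct desc
```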
Response Construction Guidelines
Always include:
- Metric name(s) - Clear metric identifiers
- Aggregation - How data is aggregated (avg, sum, percentile)
- Grouping - Dimensions used (dt.service.name, k8s.workload.name, etc.)
- Unit conversion - Convert microseconds to milliseconds where appropriate
- Filtering - Relevant thresholds or conditions
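As an illustration, a single query can carry all five elements. In this sketch the 500 ms threshold is an assumed value chosen for the example, and p95_ms_avg is an illustrative field name.

```dql
// Metric name: dt.service.request.response_time; aggregation: 95th percentile
timeseries p95 = percentile(dt.service.request.response_time, 95),
  by: {dt.service.name}                       // grouping dimension
| fieldsAdd p95_ms_avg = arrayAvg(p95) / 1000 // unit conversion: microseconds to milliseconds
| filter p95_ms_avg > 500                     // filtering: assumed 500 ms threshold
| sort p95_ms_avg desc
```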
When referencing runtime-specific content:
- Check user's technology stack first
- Provide only relevant runtime queries (don't overwhelm with all 6 runtimes)
- Explain runtime-specific metrics (e.g., "OPcache hit ratio" measures PHP opcode cache efficiency)
Common Workflows
Workflow: Service Health Check
1. Check response time (RED metrics)
2. Check error rate (RED metrics)
3. Check traffic patterns (RED metrics)
4. If runtime-specific issues are suspected → Load the runtime-specific reference
Workflow: SLA Monitoring
1. Define SLA criteria (e.g., < 3s response time AND < 1% error rate)
2. Use span-based query for custom SLA logic
3. Calculate compliance percentage
4. Filter non-compliant services
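The four steps can be sketched as one span-based query. The 3-second limit comes from the example criteria above; the 99% compliance target is an assumption that corresponds to the "< 1% error rate" criterion only loosely, since this sketch counts slow requests as non-compliant too.

```dql
// Sketch: per-service SLA compliance, keeping only services below an assumed 99% target
fetch spans, from: now() - 1h
| filter request.is_root_span == true
| summarize
    total = count(),
    sla_compliant = countIf(request.is_failed == false AND duration < 3s),
    by: {dt.service.name}
| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total
| filter sla_compliance_pct < 99
| sort sla_compliance_pct asc
```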
Workflow: Service Mesh Analysis
1. Check mesh response time
2. Compare mesh vs direct performance
3. Calculate mesh overhead
4. Analyze mesh failure rates
Workflow: Runtime Troubleshooting
- Identify technology stack → Load runtime-specific reference
- Check memory/GC metrics → threads/goroutines → runtime features
References
Core Service Monitoring:
- references/service-metrics.md - Complete RED metrics, SLA tracking, service mesh queries
Runtime-Specific Monitoring:
- references/java.md - Java/JVM monitoring
- references/nodejs.md - Node.js monitoring
- references/dotnet.md - .NET CLR monitoring
- references/python.md - Python monitoring
- references/php.md - PHP monitoring
- references/go.md - Go runtime monitoring