agent-telemetry


# Agent Telemetry

Make application runtime behavior queryable by coding agents through structured logging and telemetry endpoints.

## Core Problem

Coding agents debugging issues often can't answer "what actually happened at runtime?" because:

- Logs don't exist, or are unstructured `console.log` noise
- Logs exist but there's no documented way for agents to query them
- Agent docs (CLAUDE.md, AGENTS.md) don't mention how to access telemetry

## Workflow

### Phase 1: Audit Current State

Determine what telemetry already exists.

**1. Check for logging infrastructure:**

```bash
# Find logging configuration and usage
grep -rE "winston|pino|bunyan|log4j|slog|Logger|logging.config" --include="*.{ts,js,py,rb,go,rs}" -l .

# Find log output configuration
grep -rE "LOG_LEVEL|LOG_FORMAT|LOG_FILE|OTEL_|SENTRY_DSN" .env* config/ -l 2>/dev/null
```

**2. Check for existing telemetry endpoints:**

```bash
# Health/debug/metrics endpoints
grep -rE "health|metrics|debug|status|readiness|liveness" --include="*.{ts,js,py,rb,go}" -l src/ app/ 2>/dev/null
```

**3. Check agent docs for log access instructions:**

```bash
# Do agent docs mention logs?
grep -riE "log|telemetry|debug|observ" CLAUDE.md AGENTS.md .claude/*.md .cursor/*.md 2>/dev/null
```

**4. Classify the result:**

| Finding | Action |
|---------|--------|
| No structured logging exists | Go to Phase 2 |
| Logging exists but no agent access | Go to Phase 3 |
| Logging + access exists but undocumented | Go to Phase 4 |
| Everything in place | Validate and suggest improvements |

### Phase 2: Add Structured Logging

If no structured logging exists, add it. See `references/logging-setup.md` for framework-specific patterns.

**Principles:**
- Use structured JSON logs, not string interpolation
- Include correlation IDs for request tracing
- Log at boundaries: incoming requests, outgoing calls, errors, state transitions
- Use consistent field names: `timestamp`, `level`, `message`, `requestId`, `userId`, `duration`, `error`

**Where to add logging (priority order):**
1. Request/response middleware (every request gets logged)
2. Error handlers (unhandled errors get captured with context)
3. External service calls (DB queries, API calls, queue operations)
4. Business logic decision points (state transitions, authorization decisions)

**Minimum viable logging:** add a request logger middleware that captures `{timestamp, level, requestId, method, path, statusCode, duration, userId?}`. This single addition makes most debugging possible.
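The minimum viable shape above can be sketched as a framework-agnostic helper. This is a sketch, not a prescribed implementation: the `logRequest` name, the `writeLine` sink parameter, and the level-from-status-code rule are all illustrative assumptions.

```typescript
// Minimal structured request logger: one JSON line per request.
// `writeLine` is a stand-in for the project's real sink (stdout, a file, etc.).
import { randomUUID } from "node:crypto";

interface RequestLog {
  timestamp: string;
  level: "info" | "warn" | "error";
  requestId: string;
  method: string;
  path: string;
  statusCode: number;
  duration: number; // milliseconds
  userId?: string;
}

export function logRequest(
  req: { method: string; path: string; userId?: string },
  statusCode: number,
  startedAt: number,
  writeLine: (line: string) => void = console.log,
): RequestLog {
  const entry: RequestLog = {
    timestamp: new Date().toISOString(),
    level: statusCode >= 500 ? "error" : statusCode >= 400 ? "warn" : "info",
    requestId: randomUUID(), // real middleware would reuse an incoming correlation ID
    method: req.method,
    path: req.path,
    statusCode,
    duration: Date.now() - startedAt,
    ...(req.userId ? { userId: req.userId } : {}),
  };
  writeLine(JSON.stringify(entry)); // structured JSON, not string interpolation
  return entry;
}
```

Wiring this into Express or Next.js is then a few lines in a response-finished hook; the log schema is the part worth standardizing first.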

### Phase 3: Expose Logs to Agents

Agents need a way to query logs without SSH access or cloud console dashboards. Provide at least one of:

**Option A: Log file (simplest)**
Write structured logs to a known file path agents can read directly.

```bash
# Agent reads recent errors
tail -100 logs/app.json | jq 'select(.level == "error")'

# Agent reads logs for a specific request
grep "requestId.*abc123" logs/app.json | jq .
```

**Option B: Dev log endpoint (recommended for web apps)**
Add a development-only endpoint that returns recent log entries with filtering.

```
GET /__dev/logs?level=error&last=50
GET /__dev/logs?path=/api/users&last=20
GET /__dev/logs?requestId=abc-123
```

This endpoint must:
- Only be available in development (`NODE_ENV=development` or equivalent)
- Return a JSON array of log entries
- Support filtering by level, path, time range, and requestId
- Limit response size (default 100 entries)

See `references/dev-endpoint.md` for implementation patterns by framework.

**Option C: CLI query tool**
Wrap log access in a script agents can execute:

```bash
# Query recent errors
./scripts/query-logs.sh --level error --last 50

# Query by request path
./scripts/query-logs.sh --path /api/users --since "5 minutes ago"
```

**Choose based on project context:**

| Project Type | Best Option |
|-------------|-------------|
| Next.js / Express / Rails with local dev | Option B (dev endpoint) |
| CLI tool or background worker | Option A (log file) |
| Docker-based development | Option A (mounted log volume) or Option C |
| Monorepo with multiple services | Option C (unified query script) |
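Option B's core can be sketched as a pure filtering function over an in-memory buffer; the HTTP wrapper then just parses query params and calls it. The names `LogEntry`, `LogQuery`, and `filterLogs` are illustrative, not from any framework:

```typescript
// Filtering core for a dev-only /__dev/logs endpoint. The route handler
// (Express, Next.js, etc.) would parse query params into a LogQuery and
// call filterLogs — and should only be mounted when NODE_ENV=development.

interface LogEntry {
  timestamp: string;
  level: "info" | "warn" | "error";
  path?: string;
  requestId?: string;
  [key: string]: unknown;
}

interface LogQuery {
  level?: string;
  path?: string;
  requestId?: string;
  last?: number; // cap on returned entries; defaults to 100
}

export function filterLogs(buffer: LogEntry[], q: LogQuery): LogEntry[] {
  const matches = buffer.filter(
    (e) =>
      (q.level === undefined || e.level === q.level) &&
      (q.path === undefined || e.path === q.path) &&
      (q.requestId === undefined || e.requestId === q.requestId),
  );
  // Keep only the most recent N entries (the response-size limit).
  return matches.slice(-(q.last ?? 100));
}
```

An Express handler would then be roughly `res.json(filterLogs(buffer, parseQuery(req)))` behind a `NODE_ENV` guard; the buffer itself can be a bounded array the logger pushes into.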

### Phase 4: Document in Agent Docs

This is critical. Without documentation, agents won't know telemetry exists.

Update CLAUDE.md (or equivalent agent doc) with a Debugging section:

````markdown
## Debugging

### Querying Application Logs

Structured JSON logs are available at [location].

Quick commands:

```bash
# View recent errors
[command to view errors]

# View logs for a specific endpoint
[command to filter by path]

# View logs for a specific request
[command to filter by request ID]

# View logs from the last N minutes
[command to filter by time]
```

**Log format:**

```json
{
  "timestamp": "ISO-8601",
  "level": "info|warn|error",
  "message": "Human-readable description",
  "requestId": "correlation-id",
  "method": "GET",
  "path": "/api/resource",
  "statusCode": 200,
  "duration": 45
}
```

**Common debugging workflows:**
- User reports error → query by time range and error level
- Flaky test → query by endpoint path during test run
- Performance issue → query by path, sort by duration
````

**Key rules for the documentation:**
- Include copy-pasteable commands (agents execute, not read)
- Show the log schema so agents know what fields to filter on
- List 3-4 common debugging workflows with exact commands
- Mention where log config lives for agents that need to adjust log levels

### Phase 5: Validate

Test the full loop:

1. Trigger a request: hit an endpoint or run an operation
2. Query the logs: use the documented method to find the log entry
3. Verify agent usability: can an agent find the relevant log in <3 commands?
4. Check error capture: trigger an error and verify it appears with full context

If any step fails, iterate on the logging or documentation.
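The loop above can be exercised without a running server: emit a log line the way Phase 2's middleware would, then query it back by requestId the way an agent would. The file path and helper names here are illustrative stand-ins, assuming Option A (a JSON-lines log file):

```typescript
// End-to-end check: write a structured log line, then find it again
// by requestId — the same trigger-then-query loop an agent runs.
import { appendFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const logFile = join(tmpdir(), "app.json"); // stand-in for logs/app.json

export function emit(entry: Record<string, unknown>): void {
  appendFileSync(logFile, JSON.stringify(entry) + "\n"); // one JSON object per line
}

export function findByRequestId(id: string): Record<string, unknown>[] {
  return readFileSync(logFile, "utf8")
    .split("\n")
    .filter(Boolean) // drop the trailing empty line
    .map((line) => JSON.parse(line) as Record<string, unknown>)
    .filter((e) => e.requestId === id);
}
```

If the query side of this loop takes an agent more than a couple of commands, that is the signal to simplify the access path or the docs.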

## Anti-Patterns

| Anti-Pattern | Why It's Bad | Do Instead |
|--------------|--------------|------------|
| `console.log("here")` | No structure, no context, no filtering | Structured JSON with consistent fields |
| Logs only in cloud dashboard | Agents can't access Datadog/CloudWatch | Local file or dev endpoint |
| Log everything at debug level | Too noisy, can't find signal | Log at boundaries, use appropriate levels |
| Logging sensitive data | PII in logs is a liability | Redact tokens, passwords, PII |
| No request correlation | Can't trace a request across log lines | Add requestId to every log entry |
| Docs say "check the logs" with no how | Agent doesn't know where or how | Exact commands with examples |
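The redaction row above can be made concrete with a small field-level scrubber applied before serialization. The `SENSITIVE_KEYS` list is an illustrative starting point, not an exhaustive policy; real projects should extend it and audit what actually reaches the log sink:

```typescript
// Redact sensitive fields from a log entry before it is serialized.
// Matches key names case-insensitively and recurses into nested objects.
const SENSITIVE_KEYS = ["password", "token", "secret", "authorization", "apikey", "ssn"];

export function redact(entry: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(entry)) {
    if (SENSITIVE_KEYS.some((k) => key.toLowerCase().includes(k))) {
      out[key] = "[REDACTED]"; // keep the key so the shape stays queryable
    } else if (value && typeof value === "object" && !Array.isArray(value)) {
      out[key] = redact(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Calling `redact` inside the logger (rather than at call sites) makes the guarantee structural instead of relying on every developer to remember it.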