observability-llm-obs
# LLM and Agentic Observability
Answer user questions about monitoring LLMs and agentic components using data ingested into Elastic only. Focus on
LLM performance, cost and token utilization, response quality, and call chaining or agentic workflow orchestration. Use
ES|QL, Elasticsearch APIs, and (where needed) Kibana APIs. Do not rely on Kibana UI; the skill works without it. A
given deployment typically uses one or more ingestion paths (APM/OTLP traces and/or integration metrics/logs)—
discover what is available before querying.
## Where to look
- **Trace and metrics data (APM / OTel):** Trace data in Elastic is stored in `traces-apm*` when collected by the Elastic APM Agent, and in `traces-generic.otel-default` (and similar) when collected by OpenTelemetry. Use the generic pattern `traces*` to find all trace data regardless of source. When the application is instrumented with OpenTelemetry (e.g. Elastic Distributions of OpenTelemetry (EDOT), OpenLLMetry, OpenLIT, Langtrace exporting to OTLP), LLM and agent spans land in these trace data streams; metrics may land in `metrics-apm*` or metrics-generic data streams. Query `traces*` and `metrics*` data streams for per-request and aggregated LLM signals.
- **Integration metrics and logs:** When the user collects data via Elastic LLM integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Amazon Bedrock, Bedrock AgentCore, GCP Vertex AI, etc.), metrics and logs go to integration data streams (e.g. `metrics*`, `logs*` with dataset/namespace per integration). Check which data streams exist.
- **Discover first:** Use Elasticsearch to list data streams or indices (e.g. `GET _data_stream`, `GET traces*/_mapping`, or `GET metrics*/_mapping`) and optionally sample a document to see which LLM-related fields are present. Do not assume both APM and integration data exist.
- **ES|QL:** Use the elasticsearch-esql skill for ES|QL syntax, commands, and query patterns when building queries against `traces*` or metrics data streams.
- **Alerts and SLOs:** Use the Observability SLOs API (Stack | Serverless) and Alerting API (Stack | Serverless) to find SLOs and alerting rules that target LLM-related data (e.g. services backed by `traces*`, or integration metrics). Firing alerts or violated/degrading SLOs point to potential degraded performance.
## Data available in Elastic
### From traces and metrics (`traces*`, `metrics-apm*` / `metrics-generic`)
Spans from OTel/EDOT (and compatible SDKs) carry span attributes that may follow
OpenTelemetry GenAI semantic conventions or
provider-specific names. In Elasticsearch, attributes typically appear under `span.attributes` (exact key names depend
on ingestion). Common attributes:

| Purpose | Example attribute names (OTel GenAI) |
|---|---|
| Operation / provider | `gen_ai.operation.name`, `gen_ai.provider.name` |
| Model | `gen_ai.request.model`, `gen_ai.response.model` |
| Token usage | `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens` |
| Request config | `gen_ai.request.temperature`, `gen_ai.request.max_tokens` |
| Errors | `error.type`, `gen_ai.response.finish_reasons` |
| Conversation / agent | `gen_ai.conversation.id`, `gen_ai.agent.name` |

Cost is not in the OTel spec; some instrumentations add custom attributes (e.g. `llm.response.cost.usd_estimate`).
Discover actual field names from the index mapping or a sample document (e.g. `span.attributes.*` or flattened keys).

Use `duration` and `event.outcome` on spans for latency and success/failure. Use `trace.id`, `span.id`, and
parent/child span relationships to analyze call chaining and agentic workflows (e.g. one root span, multiple LLM or
tool-call child spans).
### From LLM integrations
Integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Bedrock, Bedrock AgentCore, Vertex AI, etc.) ship metrics (and
where supported logs) to Elastic. Metrics typically include token usage, request counts, latency, and—where the
integration supports it—cost-related fields. Logs may include prompt/response or guardrail events. Exact field names and
data streams are defined by each integration package; discover them from the integration docs or from the target data
stream mapping.
## Determine what data is available
- **List data streams:** `GET _data_stream` and filter for `traces*`, `metrics-apm*` (or `metrics*`), and `metrics-*`/`logs-*` that match known LLM integration datasets (e.g. from Elastic LLM observability).
- **Inspect trace indices:** For `traces*`, run a small search or use mapping to see if spans contain `gen_ai.*` or `llm.*` (or similar) attributes. Confirm presence of token, model, and duration fields.
- **Inspect integration indices:** For metrics/logs data streams, check mapping or one document to see token, cost, latency, and model dimensions.
- **Use one source per use case:** If both APM and integration data exist, prefer one consistent source for a given question (e.g. use traces for per-request chain analysis, integration metrics for aggregate token/cost).
- **Check alerts and SLOs:** Use the SLOs API and Alerting API to list SLOs and alerting rules that target LLM-related services or integration metrics, and to get open or recently fired alerts. Firing alerts or SLOs in degrading/violated status point to potential degraded performance.
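As a sketch, the discovery step can be a couple of requests in Elasticsearch API (Dev Tools) syntax; the `span.attributes.gen_ai.operation.name` path here is an assumption to verify against your own mapping:

```text
# List data streams for traces and integration metrics
GET _data_stream/traces*
GET _data_stream/metrics-*

# Sample one span to see which LLM-related attributes are present
GET traces*/_search
{
  "size": 1,
  "query": { "exists": { "field": "span.attributes.gen_ai.operation.name" } },
  "_source": ["@timestamp", "service.name", "span.attributes"]
}
```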
## Use cases and query patterns
### LLM performance (latency, throughput, errors)
- **Traces:** ES|QL on `traces*` filtered by span attributes (e.g. `gen_ai.operation.name` or `gen_ai.provider.name` when present). Compute throughput (count per time bucket), latency (e.g. `duration.us` or span duration), and error rate (`event.outcome == "failure"`) by model, service, or time.
- **Integrations:** Query integration metrics for request rate, latency, and error metrics by model/dimension as exposed by the integration.
### Cost and token utilization
- **Traces:** Aggregate from spans in `traces*`: sum `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` (or equivalent attribute names) by time, model, or service. If a cost attribute exists (e.g. custom `llm.response.cost.*`), sum it for cost views.
- **Integrations:** Use integration metrics that expose token counts and/or cost; aggregate by time and model.
### Response quality and safety
- **Traces:** Use `event.outcome`, `error.type`, and span attributes (e.g. `gen_ai.response.finish_reasons`) in `traces*` to identify failures, timeouts, or content filters. Correlate with prompts/responses if captured in attributes (e.g. `gen_ai.input.messages`, `gen_ai.output.messages`) and not redacted.
- **Integrations:** Query integration logs for guardrail blocks, content filter events, or policy violations (e.g. Bedrock Guardrails) using the fields defined by that integration.
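As an illustrative sketch, finish reasons can be tallied directly from traces; the `span.attributes.gen_ai.response.finish_reasons` path is an assumption to confirm against the mapping:

```esql
FROM traces*
| WHERE @timestamp >= NOW() - 24 hours
  AND span.attributes.gen_ai.response.finish_reasons IS NOT NULL
| STATS requests = COUNT(*)
  BY span.attributes.gen_ai.response.finish_reasons, event.outcome
| SORT requests DESC
| LIMIT 50
```

Spikes in filter- or truncation-related finish reasons (e.g. `content_filter`, `length` for OpenAI-style providers), or in `failure` outcomes, are the rows to investigate.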
### Call chaining and agentic workflow orchestration
- **Traces only:** Use trace hierarchy in `traces*`. Filter by root service or trace attributes; group by `trace.id` and use parent/child span relationships (e.g. `parent.id`, `span.id`) to reconstruct chains (e.g. orchestration span → multiple LLM or tool-call spans). Aggregate by span name or `gen_ai.operation.name` to see distribution of steps (e.g. retrieval, LLM, tool use). Duration per span and per trace gives bottleneck and end-to-end latency.
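A sketch of the step-distribution idea, assuming the `gen_ai.operation.name` attribute is populated (the actual values, such as `chat` or `execute_tool`, depend on the instrumentation):

```esql
FROM traces*
| WHERE @timestamp >= NOW() - 24 hours
  AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS step_count = COUNT(*), avg_duration_us = AVG(span.duration.us)
  BY span.attributes.gen_ai.operation.name
| SORT step_count DESC
| LIMIT 20
```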
## Using ES|QL for LLM data
- **Availability:** ES|QL is available in Elasticsearch 8.11+ (GA in 8.14) and in Elastic Observability Serverless.
- **Scoping:** Always restrict by time range (`@timestamp`). When present, add `service.name` and optionally `service.environment`. For LLM-specific spans, filter by span attributes once you know the field names (e.g. a keyword field for `gen_ai.provider.name` or `gen_ai.operation.name`).
- **Performance:** Use `LIMIT`, coarse time buckets when only trends are needed, and avoid full scans over large windows.
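Putting the scoping rules together, a skeleton query might look like this (`my-agent-service` is a hypothetical service name):

```esql
FROM traces*
| WHERE @timestamp >= NOW() - 1 hour                      // always bound the time range
  AND service.name == "my-agent-service"                  // hypothetical service name
  AND span.attributes.gen_ai.provider.name IS NOT NULL    // LLM spans only
| STATS requests = COUNT(*) BY bucket = BUCKET(@timestamp, 5 minutes)
| SORT bucket
| LIMIT 100
```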
## Workflow
```text
LLM observability progress:
- [ ] Step 1: Determine available data (traces*, metrics-apm* or metrics*, or integration data streams)
- [ ] Step 2: Discover LLM-related field names (mapping or sample doc)
- [ ] Step 3: Run ES|QL or Elasticsearch queries for the user's question (performance, cost, quality, orchestration)
- [ ] Step 4: Check for active alerts or SLOs defined on LLM-related data (Alerting API, SLOs API); field names from Step 2 help identify related rules; firing alerts or violated/degrading SLOs indicate potential degraded performance
- [ ] Step 5: Summarize findings from ingested data only; include alert/SLO status when relevant
```
## Examples
### Example: Token usage over time from traces
Assume span attributes are available as `span.attributes.gen_ai.usage.input_tokens` and `span.attributes.gen_ai.usage.output_tokens` (adjust to actual field names from mapping):

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  AND span.attributes.gen_ai.provider.name IS NOT NULL
| STATS
  input_tokens = SUM(span.attributes.gen_ai.usage.input_tokens),
  output_tokens = SUM(span.attributes.gen_ai.usage.output_tokens)
  BY hour = BUCKET(@timestamp, 1 hour), span.attributes.gen_ai.request.model
| SORT hour
| LIMIT 500
```

### Example: Latency and error rate by model
```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  AND span.attributes.gen_ai.request.model IS NOT NULL
| STATS
  request_count = COUNT(*),
  failures = COUNT(*) WHERE event.outcome == "failure",
  avg_duration_us = AVG(span.duration.us)
  BY span.attributes.gen_ai.request.model
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| LIMIT 100
```

### Example: Agentic workflow (trace-level view)
Get trace IDs that contain at least one LLM span and count spans per trace to see chain length:
```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS span_count = COUNT(*), total_duration_us = SUM(span.duration.us) BY trace.id
| WHERE span_count > 1
| SORT total_duration_us DESC
| LIMIT 50
```

### Example: Integration metrics (Amazon Bedrock AgentCore)
The Amazon Bedrock AgentCore integration ships metrics to the `metrics-aws_bedrock_agentcore.metrics-*` data stream (a time series index). Use `TS` for aggregations on time series data streams (Elasticsearch 9.2+); use a time range with `TRANGE` (9.3+). The integration also ships dashboards and alerting rule templates.

Example: token usage (counter), invocations (counter), and average latency (gauge) by hour and agent:

```esql
TS metrics-aws_bedrock_agentcore.metrics-*
| WHERE TRANGE(7 days)
  AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
  total_tokens = SUM(RATE(aws.bedrock_agentcore.metrics.TokenCount.sum)),
  total_invocations = SUM(RATE(aws.bedrock_agentcore.metrics.Invocations.sum)),
  avg_latency_ms = AVG(AVG_OVER_TIME(aws.bedrock_agentcore.metrics.Latency.avg))
  BY TBUCKET(1 hour), aws.bedrock_agentcore.agent_name
| SORT TBUCKET(1 hour) DESC
```

For Elasticsearch 8.x or when `TS` is not available, use `FROM` with `BUCKET(@timestamp, 1 hour)` and `SUM`/`AVG` over the metric fields (as in the integration's alert rule templates). For other LLM integrations (OpenAI, Azure OpenAI, Vertex AI, etc.), use that integration's data stream index pattern and field names from its package (see Elastic LLM observability).

## Guidelines
- **Data only in Elastic:** Use only data collected and stored in Elastic (traces in `traces*`, metrics, or integration metrics/logs). Do not describe or rely on other vendors' UIs or products.
- **One technology per customer:** Assume a single ingestion path per deployment when answering; discover which (traces vs integration) exists and use it consistently for the question.
- **Discover field names:** Before writing ES|QL or Query DSL, confirm LLM-related attribute or metric names from `_mapping` or a sample document; naming may differ (e.g. `gen_ai.*` vs `llm.*` or integration-specific fields).
- **No Kibana UI dependency:** Prefer ES|QL and Elasticsearch APIs; use Kibana APIs only when needed (e.g. SLO, alerting). Do not instruct the user to open Kibana UI.
- **References:** LLM and agentic AI observability, Observability Labs – LLM Observability, OpenTelemetry GenAI spans. For ES|QL syntax and query patterns, use the elasticsearch-esql skill, or look through the ES|QL `TS` command reference for Elastic v9.3+ and Serverless, and the ES|QL `FROM` command reference for other Elastic versions.