observability-logs-search
Logs Search
Search and filter logs to support incident investigation. The workflow mirrors Kibana Discover: apply a time range and
scope filter, then iteratively add exclusion filters (NOT) until a small, interesting subset of logs remains—either
the root cause or the key document. Optionally view logs in context (preceding and following that document) or pivot to
another entity and start a fresh search. Use ES|QL only (`POST /_query`); do not use Query DSL.

When NOT to use
- Metrics or traces — use the dedicated metric or trace tools.
Parameter conventions
Use consistent names for Observability log search:

| Parameter | Type | Description |
|---|---|---|
| `t_start` | string | Start of time range (Elasticsearch date math, e.g. `now-1h`) |
| `t_end` | string | End of time range (e.g. `now`) |
| … | string | KQL query string to narrow results. Not Query DSL. |
| … | number | Maximum log samples to return (e.g. 10–100) |
| … | string | Optional field to group the histogram by (e.g. `service.name`) |

For entity filters, use ECS field names: `service.name`, `host.name`, `service.environment`, `kubernetes.pod.name`, `kubernetes.namespace`. Query ECS names only; OpenTelemetry aliases map automatically in Observability indices.

Context minimization
Keep the context window small. In the sample branch of the query, KEEP only a subset of fields; do not return full
documents by default. A small summary (e.g. 10 docs with KEEP) stays under ~1000 tokens; a single full JSON doc can
exceed 4000 tokens.
Recommended KEEP list for sample logs:
`message`, `error.message`, `service.name`, `container.name`, `host.name`, `container.id`, `agent.name`, `kubernetes.container.name`, `kubernetes.node.name`, `kubernetes.namespace`, `kubernetes.pod.name`

Message fallback: If present, use the first non-empty of: `body.text` (OTel), `message`, `error.message`, `event.original`, `exception.message`, `error.exception.message`, `attributes.exception.message` (OTel). Observability index templates often alias these; when building a single "message" for display, prefer that order.

Limit samples: Default to a small sample (10–20 logs) per query. Cap at 500; do not fetch thousands in one call. Each funnel step is only to decide the next call—only the final narrowed result is the one to keep in context and summarize.
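The fallback order above is mechanical enough to encode directly. A minimal sketch in Python, assuming the log document is available as a flat dict keyed by field path (the helper name is illustrative):

```python
# Fallback order from this section: first non-empty field wins.
FALLBACK_ORDER = [
    "body.text",                     # OTel
    "message",
    "error.message",
    "event.original",
    "exception.message",
    "error.exception.message",
    "attributes.exception.message",  # OTel
]

def display_message(doc):
    """Return the first non-empty message-like field, or '' if none."""
    for field in FALLBACK_ORDER:
        value = doc.get(field)
        if value:  # skips missing fields and empty strings
            return str(value)
    return ""

# body.text is empty, so the ECS message field is used instead.
print(display_message({"body.text": "", "message": "GET /health 200"}))  # GET /health 200
```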
The funnel workflow
You must iterate. Do not stop after one query. Keep excluding noise with `NOT` until fewer than 20 log patterns (distinct message categories) remain. Always keep the full filter when iterating: concatenate new NOTs to the previous KQL; do not "zoom out" or drop earlier exclusions.

- Round 1 — broad: Run a query with only the scope filter (e.g. `service.name: advertService`) and time range. Get total count, histogram, sample logs, and message categorization (common + rare patterns).
- Inspect: Look at the histogram (when spikes or drops occur), the sample messages, and the categorized patterns (fork4 = top patterns by count, fork5 = rare patterns). If the histogram shows a sharp spike at a specific time, narrow the time range (t_start, t_end) around that spike for the next round. Count how many distinct log patterns remain (from the categorization); identify high-volume noise to exclude.
- Round 2 — exclude noise: Add `NOT` clauses to the KQL filter for the dominant noise patterns. Run the query again with the full filter (all previous NOTs plus new ones).
- Repeat: Keep adding `NOT` clauses and re-running with the full filter. Do not stop after one or two rounds. Continue until fewer than 20 log patterns remain (use the categorization result to count distinct message categories). Then the remaining set is small enough to interpret as the interesting bits (errors, anomalies, root cause).
- Pivot (optional): Once the funnel isolates a specific entity (e.g. `host.name`, `container.id`), run one more query focused on that entity to see its "dying words" or surrounding context.
- Step back (if needed): If the funnel does not reveal the root cause, consider viewing logs in context (preceding and following the key document) or pivoting to a different entity and starting a fresh search.
If you stop before reaching fewer than 20 log patterns, you will report noise instead of the actual failures. Each
intermediate result is only for deciding the next call; only the final narrowed result should be kept in context and
summarized.
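The discipline above (accumulate NOTs, never drop them, stop only under 20 patterns) can be sketched as a driver loop. This is an illustrative skeleton, not a real client: `run_query`, the exclusion format, and the toy data are stand-ins for the actual ES|QL call and its fork4/fork5 categorization output.

```python
def funnel(run_query, scope_kql, max_rounds=10):
    """Iterate until fewer than 20 log patterns remain.

    run_query(kql) is a stand-in assumed to return a list of
    (pattern, count) pairs from the categorization branches.
    """
    exclusions = []  # accumulated NOT clauses; never dropped between rounds
    kql = scope_kql
    patterns = run_query(kql)
    for _ in range(max_rounds):
        if len(patterns) < 20:
            break  # small enough to interpret as the interesting bits
        # Exclude the single most dominant noise pattern this round.
        noisiest, _count = max(patterns, key=lambda p: p[1])
        exclusions.append(f"NOT message: *{noisiest}*")
        # Re-run with the FULL filter: scope plus every prior NOT.
        kql = " AND ".join([scope_kql] + exclusions)
        patterns = run_query(kql)
    return kql, patterns

# Toy stand-in: 25 patterns; each exclusion removes the matching one.
state = [(f"pat{i}", 100 - i) for i in range(25)]
def fake_run(kql):
    return [p for p in state if f"*{p[0]}*" not in kql]

final_kql, remaining = funnel(fake_run, "service.name: advertService")
print(len(remaining))  # 19
```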
ES|QL patterns for log search
Use ES|QL (`POST /_query`) only; do not use Query DSL. Always return in one request: a time-series histogram, total count, a small sample of logs, and message categorization (common and rare patterns). The histogram is the primary signal—it shows when spikes or drops occur and guides the next filter. Use `FORK` to compute trend, total, samples, and categorization in a single query.

FORK output interpretation: The response contains multiple result sets identified by a `_fork` column (or equivalent). Map them as: fork1 = trend (count per time bucket), fork2 = total count (single row), fork3 = sample logs, fork4 = common message patterns (top 20 by count, from up to 10k logs), fork5 = rare message patterns (bottom 20 by count, from up to 10k logs). Use fork1 to spot when to narrow the time range; use fork2 to see how much noise remains; use fork3 to decide which NOTs to add next; use fork4 and fork5 to see how many distinct log patterns remain and to choose the next exclusions—continue iterating until fewer than 20 log patterns remain.
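Splitting the response back into the five result sets is a matter of grouping rows on the `_fork` discriminator. A sketch assuming the usual ES|QL JSON response shape (`columns` naming each output column, `values` as row arrays); adjust the discriminator name if your deployment reports it differently:

```python
from collections import defaultdict

def split_forks(response):
    """Group ES|QL rows by their _fork value: {"fork1": [row_dict, ...], ...}."""
    names = [col["name"] for col in response["columns"]]
    fork_idx = names.index("_fork")
    by_fork = defaultdict(list)
    for row in response["values"]:
        by_fork[row[fork_idx]].append(dict(zip(names, row)))
    return dict(by_fork)

# Minimal fabricated response: fork2 carries the single-row total count.
resp = {
    "columns": [{"name": "_fork"}, {"name": "total"}],
    "values": [["fork2", 55000]],
}
print(split_forks(resp)["fork2"][0]["total"])  # 55000
```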
POST /_queryFORKKQL guidance
FORK输出解读
- Prefer phrase queries for specificity when the target text is tokenized as you expect (e.g. `message: "GET /health"`, `service.name: "advertService"`).
- If the target would not be tokenized as a single term, use a wildcard (e.g. `message: *Returning*`, `message: *WARNING*`). Do not put wildcard characters inside quoted phrases.
- Use explicit fielded KQL: `service.name: "payment-api"`, `message: "GET /health"`, `NOT kubernetes.namespace: "kube-system"`, `error.message: * AND NOT message: "Known benign warning"`.
- Filtering on `log.level` (e.g. `log.level: error`) can be useful, but it is often flawed: many logs have missing or incorrect level metadata (e.g. everything as "info", or level only in the message text). Prefer funneling by message content or `error.message` when hunting failures; treat `log.level` as a hint, not a reliable filter.
- Random full-text searches for words like "error" are also often flawed: they match harmless mentions (e.g. "no error", "error code 0", stack traces that reference the word). Prefer scoping by service/entity and iterating with NOT exclusions on actual message patterns rather than relying on a single keyword.
Basic log search with histogram, samples, and categorization
Include message categorization so you can count distinct log patterns and iterate until fewer than 20 remain. Use a
five-way FORK: trend, total, samples, common patterns, rare patterns.
```json
POST /_query
{
  "query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= TO_DATETIME(\"2025-03-06T10:00:00.000Z\") AND @timestamp <= TO_DATETIME(\"2025-03-06T11:00:00.000Z\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 10 | KEEP _id, _index, message, error.message, service.name, container.name, host.name, kubernetes.container.name, kubernetes.node.name, kubernetes.namespace, kubernetes.pod.name) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` DESC | LIMIT 20) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` ASC | LIMIT 20)"
}
```

- fork4 (common): top 20 message patterns by count, from up to 10,000 logs—use to add NOTs for dominant noise.
- fork5 (rare): bottom 20 message patterns by count—helps spot needles in the haystack.
Count distinct patterns across fork4/fork5 (and the overall categorization) and continue iterating until fewer than 20 log patterns remain.
Adjust the index pattern (e.g. `logs-*`, `logs-*-*`), time range, and bucket size (e.g. `30s`, `5m`, `1h`). Keep sample LIMIT small (10–20 by default; cap at 500). Use KEEP so the sample branch returns only summary fields, not full documents.
Adding a KQL filter
Narrow results with `KQL("...")`. The KQL expression is a single double-quoted string in ES|QL.

Escaping in the request body: The query is sent inside JSON, so every double quote that is part of the ES|QL string must be escaped. Use `\"` for the quotes that wrap the KQL expression. If the KQL expression itself contains double quotes (e.g. a phrase like `message: "GET /health"`), escape those in the JSON as `\\\"` so the KQL parser receives literal quote characters.

```json
POST /_query
{
  "query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= TO_DATETIME(\"2025-03-06T10:00:00.000Z\") AND @timestamp <= TO_DATETIME(\"2025-03-06T11:00:00.000Z\") | WHERE KQL(\"service.name: checkout AND log.level: error\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 10 | KEEP _id, _index, message, error.message, service.name, host.name, kubernetes.pod.name) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` DESC | LIMIT 20) (LIMIT 10000 | STATS COUNT(*) BY CATEGORIZE(message) | SORT `COUNT(*)` ASC | LIMIT 20)"
}
```
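Counting backslashes by hand is the main failure mode here; building the body with a JSON serializer produces the `\"` and `\\\"` layers automatically. A minimal sketch in Python (the helper name is illustrative; only the ES|QL-level escaping is done manually):

```python
import json

def esql_with_kql(base, kql):
    """Embed a KQL expression in ES|QL: double quotes inside the KQL
    string must be backslash-escaped so ES|QL sees one quoted string."""
    escaped = kql.replace('"', '\\"')
    return f'{base} | WHERE KQL("{escaped}")'

kql = 'service.name: "checkout" AND NOT message: "GET /health"'
esql = esql_with_kql("FROM logs-*", kql) + " | SORT @timestamp DESC | LIMIT 10"

# json.dumps applies the JSON layer, yielding \" for the KQL wrapper
# and \\\" for the quoted phrases, without hand-counting backslashes.
body = json.dumps({"query": esql})
```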
Excluding noise with NOT
Build the funnel by excluding known noise. In the request body, wrap the KQL string in `\"...\"` and escape any quotes inside the KQL expression as `\\\"`:

```json
"query": "... | WHERE KQL(\"NOT message: \\\"GET /health\\\" AND NOT kubernetes.namespace: \\\"kube-system\\\"\") | ..."
```

```json
"query": "... | WHERE KQL(\"error.message: * AND NOT message: \\\"Known benign warning\\\"\") | ..."
```
Histogram grouped by a dimension
Break down the trend by a second dimension (e.g. `log.level`, `service.name`) to see which level or entity drives the spike:

```text
STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 1m), log.level
```

Use a limited set of group values in the response to avoid explosion (e.g. top N by count, rest as `_other`).
Examples
Last hour of logs for a service
```json
POST /_query
{
  "query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= NOW() - 1 hour AND @timestamp <= NOW() | WHERE KQL(\"service.name: api-gateway\") | SORT @timestamp DESC | LIMIT 20"
}
```

Error logs with trend and samples
```json
POST /_query
{
  "query": "FROM logs-* METADATA _id, _index | WHERE @timestamp >= NOW() - 2 hours AND @timestamp <= NOW() | WHERE KQL(\"log.level: error\") | FORK (STATS count = COUNT(*) BY bucket = BUCKET(@timestamp, 5m) | SORT bucket) (STATS total = COUNT(*)) (SORT @timestamp DESC | LIMIT 15)"
}
```
Iterative funnel: NOT and NOT and NOT until the interesting bits
Do not stop after one exclusion. Each round, add more NOTs for the current top noise, then run again.

Round 1: `KQL("service.name: advertService")` → e.g. 55k logs; samples show "Returning N ads", "WARNING: request...", "received ad request".

Round 2: Exclude the biggest noise: `KQL("service.name: advertService AND NOT message: *Returning* AND NOT message: *WARNING*")` → re-run, check new total and samples.

Round 3: Exclude next noise (e.g. request/cache chatter): `KQL("service.name: advertService AND NOT message: *Returning* AND NOT message: *WARNING* AND NOT message: *received ad request* AND NOT message: *Adding* AND NOT message: *Cache miss*")` → re-run.

Round 4+: Keep adding NOTs for whatever still dominates the samples (use fork4/fork5 categorization to see patterns). Continue until fewer than 20 log patterns remain; then what remains is the signal to report (e.g. "error fetching ads", encoding issues).

Escaping: wrap the KQL string in `\"...\"` in the JSON; for quoted phrases inside KQL use `\\\"`.
Guidelines
- Funnel: iterate with NOT. Do not report findings after a single broad query. Add NOT clauses for dominant noise, re-run with the full filter (keep all previous NOTs), and repeat until fewer than 20 log patterns remain (use categorization fork4/fork5 to count). Stopping early yields noise, not signal.
- Histogram first: Use the trend (fork1) to see when spikes or drops occur; narrow the time range around the spike if needed before adding more NOTs.
- Context minimization: KEEP only summary fields in the sample branch; default LIMIT 10–20, cap at 500. Each funnel step is for deciding the next call; only the final narrowed result is for context and summary.
- Request body escaping: The `query` value is JSON. Escape double quotes in the ES|QL string: `\"` for the KQL wrapper, `\\\"` for quotes inside the KQL expression (e.g. phrase values).
- Use Elasticsearch date math for `start` and `end` (e.g. `now-1h`, `now-15m`) when building queries programmatically.
- Choose bucket size from the time range: aim for roughly 20–50 buckets (e.g. 1h window → `1m` or `2m`).
- Prefer ECS field names. In Observability index templates, OTel fields are aliased to ECS; see references/log-search-reference.md for resource metadata field fallbacks (container, host, cluster, namespace, pod, workload).
- `log.level`: Filtering or grouping by it can be OK but is often unreliable when levels are missing or mis-set; prefer message content or `error.message` for finding failures.
- Keyword searches: Searching only for words like "error" or "fail" is often flawed (e.g. "no error", "error code 0"); prefer scoping by entity and funneling with NOT on real message patterns.
References
- references/log-search-reference.md — ECS/OTel field mapping and index patterns