# APM Service Health

Assess APM service health using Observability APIs, ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.
## Where to look

- Observability APIs: Use the SLOs API (Stack | Serverless) to get SLO definitions, status, burn rate, and error budget. Use the Alerting API (Stack | Serverless) to list and manage alerting rules and their alerts for the service. Use the APM annotations API to create or search annotations when needed.
- ES|QL and Elasticsearch: Query `traces*apm*,traces*otel*` and `metrics*apm*,metrics*otel*` with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. for ES|QL or Query DSL) as documented in the Elasticsearch repo for indices and search.
- APM Correlations: Run the apm-correlations script to get attributes that correlate with high-latency or failed transactions for a given service. It tries the Kibana internal APM correlations API first, then falls back to Elasticsearch `significant_terms` on the APM trace indices. See APM Correlations script.
- Infrastructure: Correlate via resource attributes in traces (e.g. host, pod, and container identifiers); query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
- Logs: Use ES|QL or Elasticsearch search on log indices filtered by `service.name` or `trace.id` to explain behavior and root cause.
- Observability Labs: Browse Observability Labs and its APM tag for patterns and troubleshooting.
## Health criteria

Synthesize health from all of the following when available:

| Signal | What to check |
|---|---|
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES\|QL, APIs, Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |
Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe
degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to
explain why and suggest next steps.
## Using ES|QL for APM metrics

When querying APM data from Elasticsearch (`traces*apm*,traces*otel*`, `metrics*apm*,metrics*otel*`), use ES|QL by default where available.
- Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in the Elastic Observability Serverless Complete tier.
- Scoping to a service: Always filter by `service.name` (and `service.environment` when relevant). Combine with a time range on `@timestamp`:

  ```esql
  WHERE service.name == "my-service-name" AND service.environment == "production"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  ```
- Example patterns: For throughput, latency, and error rate over time, see the chart definitions and exact field names in Kibana trace_charts_definition.ts. Use `FROM` → `WHERE` → `STATS`/`EVAL` with `BUCKET` and `WHERE service.name == "<service_name>"`.
- Performance: Add `LIMIT` to cap rows and token usage. Prefer coarser `BUCKET` intervals (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
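Putting these patterns together, a latency-percentile query might look like the following sketch (the field `transaction.duration.us` is assumed from the default APM trace schema; verify against trace_charts_definition.ts before relying on it):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "my-service-name" AND service.environment == "production"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS avg_us = AVG(transaction.duration.us),
        p95_us = PERCENTILE(transaction.duration.us, 95),
        p99_us = PERCENTILE(transaction.duration.us, 99)
    BY ts = BUCKET(@timestamp, 1 hour)
| SORT ts
| LIMIT 500
```

The 1-hour `BUCKET` keeps the result small when only the trend matters; tighten it only when you need to localize a spike.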
## APM Correlations script

When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch `significant_terms` on the APM trace indices.
```bash
# Latency correlations (attributes over-represented in slow transactions)
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]

# Failed transaction correlations
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]

# Test Kibana connection
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]
```
Environment: a Kibana URL and an API key (or username/password) for Kibana; for the Elasticsearch fallback, an Elasticsearch URL and credentials. Use the same time range as the investigation.
## Workflow

```text
Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions
```
### Step 1: Identify the service

Confirm the service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on `traces*apm*,traces*otel*` or `metrics*apm*,metrics*otel*` (e.g. `WHERE service.name == "<name>"`) or Kibana repo APM routes to obtain service-level data. If the user has not provided a time range, assume the last hour.
### Step 2: Check SLOs and firing alerts

SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability): healthy/degrading/violated, burn rate, error budget.

Alerts: For active APM alerts, call `/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active`. When checking one service, include both rules whose service filter matches the service and rules where the service filter is absent (all-services rules). Do not query alert indices for active-state checks. Correlate firing alerts with SLO violations or metric changes.
### Step 3: Check ML anomalies

If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
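As a sketch of querying anomaly records directly with ES|QL (the index `.ml-anomalies-shared` and fields `result_type`, `record_score`, and `partition_field_value` are assumptions based on the default ML results schema; adjust to your job configuration):

```esql
FROM .ml-anomalies-shared
| WHERE result_type == "record"
    AND job_id LIKE "*apm*"
    AND record_score >= 75
    AND timestamp >= "2025-03-01T00:00:00Z"
| KEEP timestamp, job_id, record_score, partition_field_value
| SORT record_score DESC
| LIMIT 50
```

A `record_score` threshold of 75 roughly corresponds to "critical" severity in the Anomaly Explorer; lower it to also surface major/minor anomalies.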
### Step 4: Review throughput, latency, and error rate

Use ES|QL against `traces*apm*,traces*otel*` or `metrics*apm*,metrics*otel*` for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), and error rate (failed/total or 5xx/total). Example skeleton: `FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS ...`. Compare to the prior period or SLO targets. See Using ES|QL for APM metrics.
### Step 5: Assess dependency health

Obtain dependency and service-map data via ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
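A downstream-span aggregation along these lines can approximate dependency health (the fields `span.destination.service.resource`, `span.duration.us`, and `event.outcome` are assumptions from the default APM span schema):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>"
    AND span.destination.service.resource IS NOT NULL
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS calls = COUNT(*),
        avg_duration_us = AVG(span.duration.us),
        failures = COUNT(*) WHERE event.outcome == "failure"
    BY span.destination.service.resource
| EVAL failure_rate = TO_DOUBLE(failures) / calls
| SORT avg_duration_us DESC
| LIMIT 20
```

Filtering on a non-null destination resource restricts the aggregation to exit spans, i.e. calls the service makes to downstream dependencies.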
### Step 6: Correlate with infrastructure and logs

- APM Correlations (when only a subpopulation is affected): Run `node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...]` to get correlated attributes. Filter by those attributes and fetch trace samples or errors to confirm the root cause. See APM Correlations script.
- Infrastructure: Use resource attributes from traces (e.g. host, pod, and container identifiers) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation.
- Logs: Use ES|QL or Elasticsearch on log indices with `service.name == "<service_name>"` or `trace.id` to explain behavior and root cause (exceptions, timeouts, restarts).
### Step 7: Summarize and recommend

State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.
## Examples

### Example: ES|QL for a specific service

Scope with `WHERE service.name == "<service_name>"` and a time range. Throughput and error rate (1-hour buckets; `LIMIT` caps rows and tokens):
```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure"
    BY ts = BUCKET(@timestamp, 1 hour)
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| SORT ts
| LIMIT 500
```
Latency percentiles and exact field names: see Kibana trace_charts_definition.ts.
### Example: "Is service X healthy?"

- Resolve service X and the time range. Call the SLOs API and Alerting API; run ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo).
- Evaluate SLO status (violated/degrading?), firing rules, ML anomalies, and dependency health.
- Answer: Healthy / Degraded / Unhealthy with reasons and next steps (e.g. Observability Labs).
### Example: "Why is service Y slow?"

- Resolve service Y and the slowness time range. Call the SLOs API and Alerting API; run ES|QL for Y and its dependencies; query ML anomaly results.
- Compare latency (avg/p95/p99) to the prior period via ES|QL; from dependency data, identify high-latency or failing deps.
- Summarize (e.g. p99 up; dependency Z elevated) and recommend (investigate Z; Observability Labs for latency).
### Example: Correlate service to infrastructure (OpenTelemetry)

Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service. Then check CPU and memory for those resources in the same time window as the APM issue:
- From the service's traces or metrics, read resource attributes such as host, pod, container, or cloud-region identifiers.
- Run ES|QL or Elasticsearch search on infrastructure/metrics indices filtered by those resource values and the incident time range. Check CPU usage and memory consumption (e.g. `system.cpu.total.norm.pct`); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that align with APM latency or error spikes.
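A minimal sketch of the infrastructure check, assuming a `metrics-*` index pattern and the `host.name` field carried over from the trace resource attributes (substitute pod or container fields as appropriate for your schema):

```esql
FROM metrics-*
| WHERE host.name == "<host-from-resource-attributes>"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS avg_cpu = AVG(system.cpu.total.norm.pct),
        max_cpu = MAX(system.cpu.total.norm.pct)
    BY ts = BUCKET(@timestamp, 5 minutes)
| SORT ts
| LIMIT 500
```

Sustained `max_cpu` near 1.0 in the same buckets as an APM latency spike points to saturation or throttling as a likely cause.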
### Example: Filter logs by service or trace ID

To understand behavior for a specific service or a single trace, filter logs accordingly:
- By service: Run ES|QL or Elasticsearch search on log indices with `service.name == "<service_name>"` and a time range to get application logs (errors, warnings, restarts) in the service context.
- By trace ID: When investigating a specific request, take the trace ID from the APM trace and filter logs by `trace.id` (or the equivalent field in your log schema). Logs with that trace ID show the full request path and help explain failures or latency.
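The trace-scoped log lookup can be sketched as follows (the `logs-*` index pattern and the `log.level`/`message` fields are assumptions from the default Elastic log schema):

```esql
FROM logs-*
| WHERE trace.id == "<trace_id_from_apm>"
| KEEP @timestamp, service.name, log.level, message
| SORT @timestamp
| LIMIT 200
```

Sorting by `@timestamp` reconstructs the request's chronological path across every service that logged with that trace ID.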
## Guidelines

- Use Observability APIs (SLOs API, Alerting API) and ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` (8.11+ or Serverless), filtering by `service.name` (and `service.environment` when relevant). For active APM alerts, call `/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active`. When checking one service, evaluate both rule types: rules whose service filter matches the target service, and rules where the service filter is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query alert indices when determining currently active alerts; use the Alerting API response above as the source of truth. For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the Elasticsearch APIs in the Elasticsearch repo.
- Always use the user's time range; avoid assuming "last 1 hour" if the issue is historical.
- When SLOs exist, anchor the health summary to SLO status and burn rate; when they do not, rely on alerts, anomalies,
throughput, latency, error rate, and dependencies.
- When analyzing only application metrics ingested via OpenTelemetry, use the ES|QL TS (time series) command for
efficient metrics queries. The TS command is available in Elasticsearch 9.3+ and is always available in
Elastic Observability Serverless.
- Summary: one short health verdict plus bullet points for evidence and next steps.
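The TS guideline above might look like the following sketch. This is a hedged example, not a verified query: the metric `http.server.request.count` is a hypothetical OTel counter, and the `TS`/`RATE` syntax requires Elasticsearch 9.3+; check the ES|QL TS command documentation for the exact grammar on your version:

```esql
TS metrics*otel*
| WHERE service.name == "my-service-name"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS requests_per_sec = SUM(RATE(http.server.request.count))
    BY ts = BUCKET(@timestamp, 1 hour)
| SORT ts
```

`TS` treats each time series separately so per-series functions like `RATE` handle counter resets correctly before the outer `SUM` combines series.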