Loading...
Loading...
Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.
npx skill4agent add majiayu000/claude-arsenal observability-sreThese rules are mandatory. Violating them means the skill is not working correctly.
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
expr: cpu_usage > 70%
# Users don't care about CPU, they care about latency
- alert: MemoryHigh
expr: memory_usage > 80%
# Internal metric, may not affect users
# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
expr: slo:api_latency:p95 > 0.200
annotations:
summary: "Users experiencing slow response times"
- alert: ErrorRateHigh
expr: slo:api_errors:rate5m > 0.001
annotations:
summary: "Users encountering errors"# ❌ FORBIDDEN: High cardinality labels
labels:
user_id: "usr_123" # Millions of values!
order_id: "ord_456" # Millions of values!
request_id: "req_789" # Every request is unique!
# ✅ REQUIRED: Low cardinality only
labels:
namespace: "production" # Few values
app: "api-server" # Few values
level: "error" # 5-6 values
method: "GET" # ~10 values
# High cardinality data goes in log body:
logger.info({
user_id: "usr_123", # In JSON body, not label
order_id: "ord_456",
}, "Order processed");# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets
# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime
groups:
- name: slo_tracking
rules:
- record: slo:api_availability:ratio
expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
- alert: ErrorBudgetBurnRate
expr: slo:api_availability:ratio < 0.999
for: 5m
annotations:
summary: "Burning error budget too fast"// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");
// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
trace_id: span?.spanContext().traceId,
span_id: span?.spanContext().spanId,
order_id: "ord_123",
}, "Payment processed");
// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}| Scenario | Tool/Pattern | Reason |
|---|---|---|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |
| Pillar | What | When | Tools |
|---|---|---|---|
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down
# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)
# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level
# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level
# Global Prometheus config
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="kubernetes-nodes"}'
- '{__name__=~"job:.*"}' # Recording rules only
static_configs:
- targets:
- 'cluster-prom-us-east.internal:9090'
- 'cluster-prom-eu-west.internal:9090'# Precompute expensive queries
groups:
- name: api_performance
interval: 30s
rules:
# Request rate (requests per second)
- record: job:api_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job, method, status)
# Error rate
- record: job:api_errors:rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
# P95 latency
- record: job:api_latency:p95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))# Increase scrape interval for high-target deployments
scrape_interval: 30s # Default: 15s reduces load by 50%
# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
- source_labels: [__name__]
regex: 'go_.*|process_.*' # Drop Go runtime metrics
action: drop
# Limit sample retention
storage:
tsdb:
retention.time: 15d # Keep only 15 days locally
retention.size: 50GB # Or max 50GB// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
'@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
}),
],
});
sdk.start();import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service', '1.0.0');
async function processPayment(orderId: string, amount: number) {
// Create custom span for business operation
return tracer.startActiveSpan('processPayment', async (span) => {
try {
// Add business context
span.setAttributes({
'order.id': orderId,
'payment.amount': amount,
'payment.currency': 'USD',
});
// Child span for external API call
const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
const result = await stripe.charges.create({ amount, currency: 'usd' });
childSpan.setAttribute('stripe.charge_id', result.id);
childSpan.setStatus({ code: SpanStatusCode.OK });
childSpan.end();
return result;
});
span.setStatus({ code: SpanStatusCode.OK });
return paymentResult;
} catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
throw error;
} finally {
span.end();
}
});
}# OpenTelemetry Collector config
processors:
# Probabilistic sampling: Keep 10% of traces
probabilistic_sampler:
sampling_percentage: 10
# Tail sampling: Make decisions after seeing full trace
tail_sampling:
policies:
# Always sample errors
- name: error-traces
type: status_code
status_code: {status_codes: [ERROR]}
# Always sample slow requests
- name: slow-traces
type: latency
latency: {threshold_ms: 1000}
# Sample 5% of normal traffic
- name: normal-traces
type: probabilistic
probabilistic: {sampling_percentage: 5}// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';
// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
headers: {
// W3C Trace Context headers injected automatically:
// traceparent: 00-<trace-id>-<span-id>-01
// tracestate: vendor=value
},
});
// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });// Use structured logging library
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
timestamp: pino.stdTimeFunctions.isoTime,
// Include trace context in logs
mixin() {
const span = trace.getActiveSpan();
if (!span) return {};
const { traceId, spanId } = span.spanContext();
return {
trace_id: traceId,
span_id: spanId,
};
},
});
// Structured logging with context
logger.info(
{
user_id: '123',
order_id: 'ord_456',
amount: 99.99,
payment_method: 'card',
},
'Payment processed successfully'
);
// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging'); // Very verbose
logger.debug({ state }, 'Debug information'); // Development
logger.info({ event }, 'Normal operation'); // Production default
logger.warn({ issue }, 'Warning condition'); // Potential issues
logger.error({ error, context }, 'Error occurred'); // Errors
logger.fatal({ critical }, 'Fatal error'); // Process crash# Promtail config - ships logs to Loki
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Add pod labels as Loki labels (LOW cardinality only!)
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
pipeline_stages:
# Parse JSON logs
- json:
expressions:
level: level
trace_id: trace_id
# Extract fields as labels
- labels:
level:
trace_id:{app="api"} | json | user_id="123"# LogQL query examples
{namespace="production", app="api"} |= "error" # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}" # JSON parsing
rate({app="api"}[5m]) # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h])) # Count by level# Google SRE's key metrics for any service
1. Latency
SLI: P95 request latency
SLO: 95% of requests complete in < 200ms
2. Traffic
SLI: Requests per second
SLO: Handle 10,000 req/s peak load
3. Errors
SLI: Error rate (5xx / total)
SLO: < 0.1% error rate
4. Saturation
SLI: Resource utilization (CPU, memory, disk)
SLO: CPU < 70%, Memory < 80%# Error budget = 1 - SLO
SLO = 99.9% # "three nines"
Error_Budget = 100% - 99.9% = 0.1%
# Monthly calculation (30 days)
Total_Minutes = 30 * 24 * 60 = 43,200 minutes
Allowed_Downtime = 43,200 * 0.001 = 43.2 minutes
# If you've had 20 minutes downtime this month:
Budget_Remaining = 43.2 - 20 = 23.2 minutes
Budget_Consumed = 20 / 43.2 = 46.3%
# Policy: If budget > 90% consumed, freeze deployments# Recording rules for SLI calculation
groups:
- name: slo_availability
interval: 30s
rules:
# Total requests
- record: slo:api_requests:total
expr: sum(rate(http_requests_total[5m]))
# Successful requests (non-5xx)
- record: slo:api_requests:success
expr: sum(rate(http_requests_total{status!~"5.."}[5m]))
# Availability SLI
- record: slo:api_availability:ratio
expr: slo:api_requests:success / slo:api_requests:total
# 30-day availability
- record: slo:api_availability:30d
expr: avg_over_time(slo:api_availability:ratio[30d])
- name: slo_latency
interval: 30s
rules:
# P95 latency SLI
- record: slo:api_latency:p95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Alerting on SLO burn rate
- alert: HighErrorBudgetBurnRate
expr: |
(
slo:api_availability:ratio < 0.999 # Below 99.9% SLO
and
slo:api_availability:30d > 0.999 # But 30-day average still OK
)
for: 5m
annotations:
summary: "Burning error budget too fast"
description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"| Level | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |
Incident Commander (IC):
- Overall coordination and decision-making
- Declares incident start/end
- Decides on escalations
- Owns communication to leadership
Operations Lead (OL):
- Technical investigation and mitigation
- Coordinates engineers
- Implements fixes
- Reports status to IC
Communications Lead (CL):
- Internal/external status updates
- Customer communication
- Stakeholder notifications
- Status page updates1. Detection (Alert fires or user reports)
↓
2. Triage (Assess severity, assign IC)
↓
3. Response (Assemble team, create war room)
↓
4. Mitigation (Stop the bleeding, restore service)
↓
5. Resolution (Fix root cause)
↓
6. Postmortem (Blameless review, action items)
↓
7. Follow-up (Implement improvements)# Symptoms vs Causes alerting
# Alert on WHAT users experience, not WHY it's broken
# GOOD: Symptom-based alert
- alert: APILatencyHigh
expr: slo:api_latency:p95 > 0.200 # User-facing metric
annotations:
summary: "API is slow for users"
# BAD: Cause-based alert
- alert: CPUHigh
expr: cpu_usage > 70% # Internal metric, might not impact users
# Don't alert unless this affects SLOs
# Use SLO-based alerting
# Alert when error budget burn rate is too high# Incident Postmortem: [Title]
**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson
## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss
## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring
## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.
## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss
## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics
## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)
## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just podsInfrastructure:
- Pod/node termination
- Network latency/packet loss
- DNS failures
- Cloud region outage
Resources:
- CPU stress
- Memory exhaustion
- Disk I/O saturation
- File descriptor limits
Dependencies:
- Database connection failures
- API timeout/errors
- Cache unavailability
- Message queue backlog
Security:
- DDoS simulation
- Certificate expiration
- Unauthorized access attempts# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay
spec:
action: delay
mode: one
selector:
namespaces:
- production
labelSelectors:
app: payment-service
delay:
latency: "100ms"
correlation: "50"
jitter: "50ms"
duration: "5m"
scheduler:
cron: "@every 2h" # Run every 2 hours
---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-kill
spec:
action: pod-kill
mode: fixed-percent
value: "10" # Kill 10% of pods
selector:
namespaces:
- production
labelSelectors:
app: api-server
duration: "30s"Dynatrace Davis AI:
- Auto-detected 73% of incidents before customer impact
- Reduced alert noise by 90%
- Causal AI for root cause analysis
Datadog Watchdog:
- Anomaly detection across metrics, logs, traces
- Automated correlation of related issues
- LLM-powered investigation assistant
Elastic AIOps:
- Machine learning for log anomaly detection
- Automated baseline learning
- Predictive alerting
New Relic AI:
- Natural language query interface
- Automated incident summarization
- Proactive capacity recommendations# Monitor AI model performance
from opentelemetry import trace, metrics
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
# Create metrics for AI model
model_latency = meter.create_histogram(
"ai.model.latency",
description="AI model inference latency",
unit="ms"
)
model_tokens = meter.create_counter(
"ai.model.tokens",
description="Token usage"
)
async def run_ai_model(prompt: str):
with tracer.start_as_current_span("ai.inference") as span:
start = time.time()
span.set_attribute("ai.model", "gpt-4")
span.set_attribute("ai.prompt_length", len(prompt))
response = await openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
latency = (time.time() - start) * 1000
tokens = response.usage.total_tokens
# Record metrics
model_latency.record(latency, {"model": "gpt-4"})
model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})
# Add to span
span.set_attribute("ai.response_length", len(response.choices[0].message.content))
span.set_attribute("ai.tokens_used", tokens)
return responseRED Dashboard (for services):
- Rate: Requests per second
- Errors: Error rate
- Duration: Latency (P50, P95, P99)
USE Dashboard (for resources):
- Utilization: % of capacity used
- Saturation: Queue depth, wait time
- Errors: Error count
Four Golden Signals Dashboard:
- Latency
- Traffic
- Errors
- Saturation
SLO Dashboard:
- Current SLI value
- Error budget remaining
- Burn rate
- Trend (30-day){
"title": "API Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (method)",
"legendFormat": "{{ method }}"
}
],
"options": {
"tooltip": { "mode": "multi" },
"legend": { "displayMode": "table", "calcs": ["mean", "last"] }
},
"fieldConfig": {
"defaults": {
"unit": "reqps", // Requests per second
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10
}
}
}
}## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics
## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)
## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined
## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process
## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready
## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises