Observability & Site Reliability Engineering

Core Principles

Three Pillars — Metrics, Logs, and Traces provide holistic visibility
Observability-First — Build systems that explain their own behavior
SLO-Driven — Define reliability targets that matter to users
Proactive Detection — Find issues before customers do
Blameless Culture — Learn from failures without blame
Automate Toil — Reduce repetitive operational work
Continuous Improvement — Each incident makes systems more resilient
Full-Stack Visibility — Monitor from infrastructure to business metrics

Hard Rules (Must Follow)

These rules are mandatory. Violating them means the skill is not working correctly.

Symptom-Based Alerts Only

Alert on user-facing symptoms, not internal infrastructure metrics.

yaml

# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70%
  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80%
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"

Low Cardinality Labels

Loki/Prometheus labels must have low cardinality (<10 unique labels).

yaml

# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"      # Millions of values!
  order_id: "ord_456"     # Millions of values!
  request_id: "req_789"   # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

# High cardinality data goes in log body:
logger.info({
  user_id: "usr_123",      # In JSON body, not label
  order_id: "ord_456",
}, "Order processed");

SLO-Based Error Budgets

Every service must have defined SLOs with error budget tracking.

yaml

# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime

groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"

Trace Context in Logs

All logs must include trace_id for correlation with distributed traces.

typescript

// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}

Quick Reference

When to Use What

Scenario	Tool/Pattern	Reason
Metrics collection	Prometheus + Grafana	Industry standard, powerful query language
Distributed tracing	OpenTelemetry + Tempo/Jaeger	Vendor-neutral, CNCF standard
Log aggregation (cost-sensitive)	Grafana Loki	Indexes only labels, 10x cheaper
Log aggregation (search-heavy)	ELK Stack	Full-text search, advanced analytics
Unified observability	Elastic/Datadog/Dynatrace	Single pane of glass for all telemetry
Incident management	PagerDuty/Opsgenie	Alert routing, on-call scheduling
Chaos engineering	Gremlin/Chaos Mesh	Controlled failure injection
AIOps/Anomaly detection	Dynatrace/Datadog	AI-driven root cause analysis

The Three Pillars

Pillar	What	When	Tools
Metrics	Numerical time-series data	Real-time monitoring, alerting	Prometheus, StatsD, CloudWatch
Logs	Event records with context	Debugging, audit trails	Loki, ELK, Splunk
Traces	Request journey across services	Performance analysis, dependencies	OpenTelemetry, Jaeger, Zipkin

Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

Observability Architecture

Layered Prometheus Setup

yaml

# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
        - 'cluster-prom-us-east.internal:9090'
        - 'cluster-prom-eu-west.internal:9090'

Recording Rules for Performance

yaml

# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

Resource Optimization

yaml

# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Default: 15s reduces load by 50%

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

# Limit sample retention
storage:
  tsdb:
    retention.time: 15d  # Keep only 15 days locally
    retention.size: 50GB # Or max 50GB

Distributed Tracing with OpenTelemetry

Auto-Instrumentation Setup

typescript

// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();

Manual Instrumentation for Business Logic

typescript

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

yaml

# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: Keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: Make decisions after seeing full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Context Propagation

typescript

// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });

Structured Logging Best Practices

JSON Logging Format

typescript

// Use structured logging library
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};

    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}

Log Levels

typescript

// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');     // Very verbose
logger.debug({ state }, 'Debug information');          // Development
logger.info({ event }, 'Normal operation');            // Production default
logger.warn({ issue }, 'Warning condition');           // Potential issues
logger.error({ error, context }, 'Error occurred');    // Errors
logger.fatal({ critical }, 'Fatal error');             // Process crash

Grafana Loki Configuration

yaml

# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Extract fields as labels
      - labels:
          level:
          trace_id:

Loki Best Practices

Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
LogQL for Filtering — Use
```
{app="api"} | json | user_id="123"
```
Retention Policy — Keep recent logs longer, compress old logs

promql

# LogQL query examples
{namespace="production", app="api"} |= "error"  # Text search

{app="api"} | json | level="error" | line_format "{{.msg}}"  # JSON parsing

rate({app="api"}[5m])  # Log rate per second

sum by (level) (count_over_time({namespace="production"}[1h]))  # Count by level

SLO/SLI/SLA Management

Definitions

SLI (Service Level Indicator) — Quantifiable measurement of service behavior
- Examples: Request latency, error rate, availability, throughput
SLO (Service Level Objective) — Target value/range for an SLI
- Examples: 99.9% availability, P95 latency < 200ms
SLA (Service Level Agreement) — Formal commitment with consequences
- Examples: "99.9% uptime or 10% credit"

The Four Golden Signals

yaml

# Google SRE's key metrics for any service

1. Latency
   SLI: P95 request latency
   SLO: 95% of requests complete in < 200ms

2. Traffic
   SLI: Requests per second
   SLO: Handle 10,000 req/s peak load

3. Errors
   SLI: Error rate (5xx / total)
   SLO: < 0.1% error rate

4. Saturation
   SLI: Resource utilization (CPU, memory, disk)
   SLO: CPU < 70%, Memory < 80%

Error Budget

python

# Error budget = 1 - SLO
SLO = 99.9%  # "three nines"
Error_Budget = 100% - 99.9% = 0.1%

# Monthly calculation (30 days)
Total_Minutes = 30 * 24 * 60 = 43,200 minutes
Allowed_Downtime = 43,200 * 0.001 = 43.2 minutes

# If you've had 20 minutes downtime this month:
Budget_Remaining = 43.2 - 20 = 23.2 minutes
Budget_Consumed = 20 / 43.2 = 46.3%

# Policy: If budget > 90% consumed, freeze deployments

SLO Implementation with Prometheus

yaml

# Recording rules for SLI calculation
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Alerting on SLO burn rate
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      slo:api_availability:ratio < 0.999  # Below 99.9% SLO
      and
      slo:api_availability:30d > 0.999    # But 30-day average still OK
    )
  for: 5m
  annotations:
    summary: "Burning error budget too fast"
    description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"

Incident Response

Incident Severity Levels

Level	Impact	Response Time	Examples
SEV-1	Service down or major degradation	< 15 min	Complete outage, data loss, security breach
SEV-2	Significant impact, partial outage	< 1 hour	Feature unavailable, high error rates
SEV-3	Minor impact, workaround exists	< 4 hours	Single component degraded, slow performance
SEV-4	Cosmetic, no user impact	Next business day	UI glitches, logging errors

Incident Response Roles (IMAG Framework)

yaml

Incident Commander (IC):
  - Overall coordination and decision-making
  - Declares incident start/end
  - Decides on escalations
  - Owns communication to leadership

Operations Lead (OL):
  - Technical investigation and mitigation
  - Coordinates engineers
  - Implements fixes
  - Reports status to IC

Communications Lead (CL):
  - Internal/external status updates
  - Customer communication
  - Stakeholder notifications
  - Status page updates

Incident Workflow

1. Detection (Alert fires or user reports)
   ↓
2. Triage (Assess severity, assign IC)
   ↓
3. Response (Assemble team, create war room)
   ↓
4. Mitigation (Stop the bleeding, restore service)
   ↓
5. Resolution (Fix root cause)
   ↓
6. Postmortem (Blameless review, action items)
   ↓
7. Follow-up (Implement improvements)

On-Call Best Practices

Rotation — 1-week shifts, balanced across timezones
Escalation — Primary → Secondary → Manager (15 min each)
Playbooks — Step-by-step debugging guides for common issues
Runbooks — Automated remediation scripts
Handoff — 15-min sync at rotation change
Compensation — On-call pay or comp time
Health — No more than 2 incidents/night target

Alert Fatigue Prevention

yaml

# Symptoms vs Causes alerting
# Alert on WHAT users experience, not WHY it's broken

# GOOD: Symptom-based alert
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200  # User-facing metric
  annotations:
    summary: "API is slow for users"

# BAD: Cause-based alert
- alert: CPUHigh
  expr: cpu_usage > 70%  # Internal metric, might not impact users
  # Don't alert unless this affects SLOs

# Use SLO-based alerting
# Alert when error budget burn rate is too high

Blameless Postmortems

Core Principles

Assume Good Intentions — Everyone did their best with available information
Focus on Systems — Identify gaps in process/tooling, not people
Psychological Safety — No punishment for honest mistakes
Learning Culture — Incidents are opportunities to improve
Separate from Performance Reviews — Postmortem participation never affects evaluations

Postmortem Template

markdown

# Incident Postmortem: [Title]

**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson

## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss

## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring

## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.

## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss

## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics

## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)

## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just pods

Follow-up

Review postmortem in team meeting within 1 week
Track action items to completion (not optional!)
Share learnings across teams
Update runbooks and playbooks
Celebrate successful incident response

Chaos Engineering

Principles

Define Steady State — Normal system behavior (e.g., 99.9% success rate)
Hypothesize — Predict system will remain stable under failure
Inject Failures — Simulate real-world events
Disprove Hypothesis — Look for deviations from steady state
Learn and Improve — Fix weaknesses, increase resilience

Failure Types

yaml

Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts

Chaos Mesh Example

yaml

# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  scheduler:
    cron: "@every 2h"  # Run every 2 hours

---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"

Best Practices

Start Small — Non-production first, then canary production
Collect Baselines — Know normal metrics before experiments
Define Success — Clear criteria for what "stable" means
Monitor Everything — Watch metrics, logs, traces during tests
Automate Rollback — Stop experiment if SLOs violated
Game Days — Scheduled chaos exercises with full team
Blameless Reviews — Treat chaos failures like production incidents

AIOps and AI in Observability

2025 Trends

Anomaly Detection — AI spots unusual patterns in metrics/logs
Root Cause Analysis — Correlate failures across services automatically
Predictive Alerting — Predict failures before they happen
Auto-Remediation — AI suggests or applies fixes autonomously
Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
AI Observability — Monitor AI model drift, hallucinations, token usage

AI-Driven Platforms (2025)

yaml

Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations

Implementing AI Observability

python

# Monitor AI model performance
from opentelemetry import trace, metrics

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create metrics for AI model
model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()

        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        response = await openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response

Grafana Dashboards

3-3-3 Rule

3 rows of panels per dashboard
3 panels per row
3 key metrics per panel

Avoid "dashboard sprawl" — Each dashboard should answer ONE question.

Dashboard Categories

yaml

RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)

Panel Best Practices

json

{
  "title": "API Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",  // Requests per second
      "color": { "mode": "palette-classic" },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  }
}

Checklist

markdown

## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics

## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)

## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined

## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process

## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready

## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises

observability-sre

NPX Install

Tags

SKILL.md Content

Observability & Site Reliability Engineering

Core Principles

Hard Rules (Must Follow)

Symptom-Based Alerts Only

Low Cardinality Labels

SLO-Based Error Budgets

Trace Context in Logs

Quick Reference

When to Use What

The Three Pillars

Observability Architecture

Layered Prometheus Setup

Recording Rules for Performance

Resource Optimization

Distributed Tracing with OpenTelemetry

Auto-Instrumentation Setup

Manual Instrumentation for Business Logic

Sampling Strategies

Context Propagation

Structured Logging Best Practices

JSON Logging Format

Log Levels

Grafana Loki Configuration

Loki Best Practices

SLO/SLI/SLA Management

Definitions

The Four Golden Signals

Error Budget

SLO Implementation with Prometheus

Incident Response

Incident Severity Levels

Incident Response Roles (IMAG Framework)

Incident Workflow

On-Call Best Practices

Alert Fatigue Prevention

Blameless Postmortems

Core Principles

Postmortem Template

Follow-up

Chaos Engineering

Principles

Failure Types

Chaos Mesh Example

Best Practices

AIOps and AI in Observability

2025 Trends

AI-Driven Platforms (2025)

Implementing AI Observability

Grafana Dashboards

3-3-3 Rule

Dashboard Categories

Panel Best Practices

Checklist

See Also