Loading...
Loading...
Found 14 Skills
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning.
Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows. Use PROACTIVELY for monitoring infrastructure, performance optimization, or production reliability.
Use when working with incident response incident response
Create structured incident runbooks with diagnostic steps, resolution procedures, escalation paths, and communication templates for effective incident response. Use when documenting response procedures for recurring alerts, standardizing incident response across an on-call rotation, reducing MTTR with clear diagnostic steps, creating training materials for new team members, or linking alert annotations directly to resolution procedures.
Expert-level site reliability engineering, SLOs, incident management, and operational excellence
Expert in SRE practices, incident management, root cause analysis, and automated remediation.
Expert SRE incident responder specializing in rapid problem resolution.
Senior DevOps Engineer with expertise in CI/CD automation, infrastructure as code, monitoring, and SRE practices. Proficient in cloud platforms, containerization, configuration management, and building scalable DevOps pipelines with focus on automation and operational excellence.
Prometheus, Grafana, CloudWatch, Azure Monitor, Stackdriver, logging, alerting, and SRE practices
Expert guidance for designing, implementing, and maintaining cloud infrastructure using Experience in Infrastructure as Code (IaC) principles. Use this skill for architecting cloud solutions, setting up CI/CD pipelines, implementing observability, and following SRE best practices.
Production incident response automation. Reads logs, checks recent deploys, identifies root cause, suggests fixes, drafts incident comms, creates post-mortem templates. Severity classification (SEV1-4), escalation paths, status page updates. Generates incident-report.md with timeline, root cause, impact assessment, remediation steps, and prevention measures.
Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.