Loading...
Loading...
Found 14 Skills
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning.
Conduct systematic root cause analysis to identify underlying problems. Use structured methodologies to prevent recurring issues and drive improvements.
Automate PagerDuty tasks via Rube MCP (Composio): manage incidents, services, schedules, escalation policies, and on-call rotations. Always search tools first for current schemas.
Triage and manage production incidents. Trigger with "we have an incident", "production is down", "something is broken", "there's an outage", "SEV1", or when the user describes a production issue needing immediate response.
When the user wants to create, optimize, or structure a status page. Also use when the user mentions "status page," "status.yourdomain.com," "uptime," "service health," "incident page," or "system status."
DevOps and IT Ops automation - CI/CD, monitoring, incident management, and infrastructure workflows
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
This document describes the PagerDuty REST APIs.. Use when working with the PagerDuty API or when the user needs to interact with this API.
Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management. Masters incident command, blameless post-mortems, error budget management, and system reliability patterns. Handles critical outages, communication strategies, and continuous improvement. Use IMMEDIATELY for production incidents or SRE practices.
Expert delivery management covering continuous delivery, release management, deployment coordination, and service operations.
Expert-level site reliability engineering, SLOs, incident management, and operational excellence
Guidelines and workflows for community moderation, trust & safety, and escalation.