Loading...
Loading...
Found 38 Skills
Use when a Head of Ops, Knowledge Manager, or TPM-Internal needs to author, validate, or clean up company SOPs and internal runbooks (procurement intake, vendor offboarding, incident-comms cascade, employee onboarding, expense reimbursement, system-access provisioning, customer-escalation playbook) — including 5W2H completeness checks (Who-What-When-Where-Why-How-HowMuch), cross-link and orphan-page validation across a sprawling Notion/Confluence/Obsidian wiki, KB ingestion + hygiene reporting, ops onboarding doc generation, and runbook step verification (named owner, expected duration, observable success signal, rollback path, escalation contact). Pairs Kaoru Ishikawa's 5W2H method, Atul Gawande's *The Checklist Manifesto*, ISO 9001, ITIL v4 Service Operation, FDA 21 CFR Part 211, and Google SRE Workbook runbook discipline with deterministic stdlib-only Python tools that score completeness, detect anti-patterns, and emit prioritized cleanup lists. Distinct from `engineering/llm-wiki` (Karpathy-style personal PKM second brain), `engineering-team/runbook-generator` (system-ops production debugging runbook), `project-management/*` (Jira/Confluence delivery + ticket tracking), and sibling `business-operations/process-mapper` (BPMN process *design*, while knowledge-ops is process *documentation*).
Disaster recovery drill exercises and security checklists for web application projects (SPA, SSR, full-stack web apps). Focused on solo/indie developers using free-tier infrastructure (Vercel, Supabase, Cloudflare, Netlify, Railway, etc.). Bridges big-tech best practices (NIST, Google SRE DiRT, ISO 22301) to indie scale. Use when the user mentions drills, disaster recovery, security audit, incident simulation, project health check, resilience testing, backup strategies, secret rotation, or incident response for web projects. Not for mobile apps, desktop software, CLI tools, or games.
Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows. Use PROACTIVELY for monitoring infrastructure, performance optimization, or production reliability.
Use when working with incident response incident response
Create structured incident runbooks with diagnostic steps, resolution procedures, escalation paths, and communication templates for effective incident response. Use when documenting response procedures for recurring alerts, standardizing incident response across an on-call rotation, reducing MTTR with clear diagnostic steps, creating training materials for new team members, or linking alert annotations directly to resolution procedures.
Systematic incident investigation methodology. Use when investigating production issues, service degradation, errors, latency spikes, or outages.
An engineering runbook — service overview, alerts table, dashboards links, common procedures with copy-pasteable commands, on-call rotation, and an incident-response checklist. Use when the brief mentions "runbook", "ops doc", "on-call guide", "SRE doc", or "运维手册".
Generates comprehensive operational runbooks for any system or process. Reads codebase, infrastructure config, and deployment scripts to produce structured runbook.md files formatted for on-call engineers. Use when you need operations documentation, incident response guides, deployment procedures, or disaster recovery plans.
Production incident response automation. Reads logs, checks recent deploys, identifies root cause, suggests fixes, drafts incident comms, creates post-mortem templates. Severity classification (SEV1-4), escalation paths, status page updates. Generates incident-report.md with timeline, root cause, impact assessment, remediation steps, and prevention measures.
Incident Commander Skill
Runbook Generator
Injects managed chaos into environments to test system resilience. Validates that self-healing and monitoring systems work as expected under stress.