Search Results: sre

Found 38 Skills

Document Processing | alirezarezvani/claude-ski...

knowledge-ops

Use when a Head of Ops, Knowledge Manager, or TPM-Internal needs to author, validate, or clean up company SOPs and internal runbooks (procurement intake, vendor offboarding, incident-comms cascade, employee onboarding, expense reimbursement, system-access provisioning, customer-escalation playbook) — including 5W2H completeness checks (Who-What-When-Where-Why-How-HowMuch), cross-link and orphan-page validation across a sprawling Notion/Confluence/Obsidian wiki, KB ingestion + hygiene reporting, ops onboarding doc generation, and runbook step verification (named owner, expected duration, observable success signal, rollback path, escalation contact). Pairs Kaoru Ishikawa's 5W2H method, Atul Gawande's *The Checklist Manifesto*, ISO 9001, ITIL v4 Service Operation, FDA 21 CFR Part 211, and Google SRE Workbook runbook discipline with deterministic stdlib-only Python tools that score completeness, detect anti-patterns, and emit prioritized cleanup lists. Distinct from `engineering/llm-wiki` (Karpathy-style personal PKM second brain), `engineering-team/runbook-generator` (system-ops production debugging runbook), `project-management/*` (Jira/Confluence delivery + ticket tracking), and sibling `business-operations/process-mapper` (BPMN process *design*, while knowledge-ops is process *documentation*).

Security & Compliance | quangrau/vibekit

drill-recovery

Disaster recovery drill exercises and security checklists for web application projects (SPA, SSR, full-stack web apps). Focused on solo/indie developers using free-tier infrastructure (Vercel, Supabase, Cloudflare, Netlify, Railway, etc.). Bridges big-tech best practices (NIST, Google SRE DiRT, ISO 22301) to indie scale. Use when the user mentions drills, disaster recovery, security audit, incident simulation, project health check, resilience testing, backup strategies, secret rotation, or incident response for web projects. Not for mobile apps, desktop software, CLI tools, or games.

DevOps & Cloud Services | sickn33/antigravity-aweso...

observability-engineer

Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows. Use PROACTIVELY for monitoring infrastructure, performance optimization, or production reliability.

DevOps & Cloud Services | sickn33/antigravity-aweso...

incident-response-incident-response

Use when working with incident response.

DevOps & Cloud Services | pjt222/development-guides

write-incident-runbook

Create structured incident runbooks with diagnostic steps, resolution procedures, escalation paths, and communication templates for effective incident response. Use when documenting response procedures for recurring alerts, standardizing incident response across an on-call rotation, reducing MTTR with clear diagnostic steps, creating training materials for new team members, or linking alert annotations directly to resolution procedures.

DevOps & Cloud Services | incidentfox/incidentfox

investigate

Systematic incident investigation methodology. Use when investigating production issues, service degradation, errors, latency spikes, or outages.

DevOps & Cloud Services | nexu-io/open-design

eng-runbook

An engineering runbook: service overview, alerts table, dashboard links, common procedures with copy-pasteable commands, on-call rotation, and an incident-response checklist. Use when the brief mentions "runbook", "ops doc", "on-call guide", "SRE doc", or "运维手册".

DevOps & Cloud Services | onewave-ai/claude-skills

runbook-generator

Generates comprehensive operational runbooks for any system or process. Reads codebase, infrastructure config, and deployment scripts to produce structured runbook.md files formatted for on-call engineers. Use when you need operations documentation, incident response guides, deployment procedures, or disaster recovery plans.

DevOps & Cloud Services | onewave-ai/claude-skills

incident-responder

Production incident response automation. Reads logs, checks recent deploys, identifies root cause, suggests fixes, drafts incident comms, creates post-mortem templates. Severity classification (SEV1-4), escalation paths, status page updates. Generates incident-report.md with timeline, root cause, impact assessment, remediation steps, and prevention measures.

DevOps & Cloud Services | alirezarezvani/claude-ski...

incident-commander

Incident Commander Skill

DevOps & Cloud Services | alirezarezvani/claude-ski...

runbook-generator

Runbook Generator

1 scripts/Checked

DevOps & Cloud Services | famaoai-creator/gemini-sk...

chaos-monkey-orchestrator

Injects managed chaos into environments to test system resilience. Validates that self-healing and monitoring systems work as expected under stress.

1 scripts/Checked