Loading...
Loading...
Found 33 Skills
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
Expert guidance for designing, implementing, and maintaining cloud infrastructure using Experience in Infrastructure as Code (IaC) principles. Use this skill for architecting cloud solutions, setting up CI/CD pipelines, implementing observability, and following SRE best practices.
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Build autonomous AI agents with Claude Agent SDK. Structured outputs guarantee JSON schema validation, with plugins system and hooks for event-driven workflows. Prevents 14 documented errors. Use when: building coding agents, SRE systems, security auditors, or troubleshooting CLI not found, structured output validation, session forking errors, MCP config issues, subagent cleanup.
Guides architects on when and how to use goal-seeking agents as a design pattern. This skill helps evaluate whether autonomous agents are appropriate for a given problem, how to structure their objectives, integrate with goal_agent_generator, and reference real amplihack examples like AKS SRE automation, CI diagnostics, pre-commit workflows, and fix-agent pattern matching.
Expert knowledge for Azure Reliability development including best practices, decision making, architecture & design patterns, limits & quotas, and deployment. Use when designing AZ zone/zone-redundant setups, resilient Functions, AKS, MySQL HA migrations, or Queue Storage limits, and other Azure Reliability related development tasks. Not for Azure Resiliency (use azure-resiliency), Azure Monitor (use azure-monitor), Azure Service Health (use azure-service-health), Azure Sre Agent (use azure-sre-agent).
Review a resume like a professional resume writer AND an ATS (Applicant Tracking System) — produce an ATS score (0-100), a level detection (fresher / mid / senior), a breakdown of what's working, what's failing, the 7 Deadly Sins check, priority fixes, concrete before→after bullet rewrites, and missing keywords. Use this skill whenever the user asks to review, audit, critique, score, rate, roast, or improve a resume or CV, or types /resume-review, or drops a resume PDF with a job description. Trigger EVEN IF the user just says "look at my CV", "check my resume", "is this resume good", "why am I not getting interviews", "roast my resume", or shares a resume file without explicit review instructions — any interaction involving a resume file or CV should invoke this skill. Especially strong for DevOps, SRE, Cloud, Platform, and Infrastructure engineering resumes, but works for any technical resume.
Disaster recovery drill exercises and security checklists for web application projects (SPA, SSR, full-stack web apps). Focused on solo/indie developers using free-tier infrastructure (Vercel, Supabase, Cloudflare, Netlify, Railway, etc.). Bridges big-tech best practices (NIST, Google SRE DiRT, ISO 22301) to indie scale. Use when the user mentions drills, disaster recovery, security audit, incident simulation, project health check, resilience testing, backup strategies, secret rotation, or incident response for web projects. Not for mobile apps, desktop software, CLI tools, or games.
Complete ClickHouse operations guide for DevOps and SRE teams managing production deployments. Provides practical guidance on monitoring essential metrics (query latency, throughput, memory, disk), introspecting system tables, performance analysis, scaling strategies (vertical and horizontal), backup/disaster recovery, tuning at query/server/table levels, and troubleshooting common issues. Use when diagnosing ClickHouse problems, optimizing performance, planning capacity, setting up monitoring, implementing backups, or managing production clusters. Includes resource management strategies for disk space, connections, and background operations plus production checklists.
Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows. Use PROACTIVELY for monitoring infrastructure, performance optimization, or production reliability.
Injects managed chaos into environments to test system resilience. Validates that self-healing and monitoring systems work as expected under stress.
An engineering runbook — service overview, alerts table, dashboards links, common procedures with copy-pasteable commands, on-call rotation, and an incident-response checklist. Use when the brief mentions "runbook", "ops doc", "on-call guide", "SRE doc", or "运维手册".