Found 93 Skills
Establishes instrumentation, monitoring, and alerting foundations.
Configure Prometheus Alertmanager with routing trees, receivers (Slack, PagerDuty, email), inhibition rules, silences, and notification templates for actionable incident alerting. Use when implementing proactive monitoring with automated incident detection, routing alerts to the appropriate team by severity, reducing alert fatigue through grouping and deduplication, integrating with on-call systems like PagerDuty, or migrating from legacy alerting to Prometheus-based alerting.
Prometheus monitoring expert for PromQL, alerting rules, Grafana dashboards, and observability.
Unified decision tree for web research and competitive monitoring. Auto-selects WebFetch, Tavily, or agent-browser based on target site characteristics and available API keys. Includes competitor page tracking, snapshot diffing, and change alerting. Use when researching web content, scraping, extracting raw markdown, capturing documentation, or monitoring competitor changes.
Build interactive financial KPI dashboards with customizable metrics, drill-down analysis, variance explanations, and automated threshold-based alerting.
Configure Harness AI-powered operations (AIDA) via MCP. Set up predictive failure analysis with ML models for memory leaks, disk exhaustion, connection pool saturation, and latency degradation. Configure intelligent alert correlation and noise reduction to reduce alert volume. Use when asked to set up predictive failure analysis, configure AI-powered alerting, reduce alert noise, or enable ML-based anomaly detection. Do NOT use for pipeline debugging (use debug-pipeline instead) or SLO management (use manage-slos instead). Trigger phrases: AIDA, predictive failure, alert correlation, noise reduction, anomaly detection, AI ops, predictive analysis, alert fatigue, ML alerting, intelligent alerting.
Create, update, and manage Oodle monitors — alerting thresholds, query scoping, and best practices to avoid alert fatigue.
Monitoring, logging, and tracing implementation using OpenTelemetry as the unified standard. Use when building production systems requiring visibility into performance, errors, and behavior. Covers OpenTelemetry (metrics, logs, traces), Prometheus, Grafana, Loki, Jaeger, Tempo, structured logging (structlog, tracing, slog, pino), and alerting.
Prometheus monitoring and alerting for cloud-native observability. USE WHEN: Writing PromQL queries, configuring Prometheus scrape targets, creating alerting rules, setting up recording rules, instrumenting applications with Prometheus metrics, configuring service discovery. DO NOT USE: For building dashboards (use /grafana), for log analysis (use /logging-observability), for general observability architecture (use senior-software-engineer with infrastructure focus). TRIGGERS: metrics, prometheus, promql, counter, gauge, histogram, summary, alert, alertmanager, alerting rule, recording rule, scrape, target, label, service discovery, relabeling, exporter, instrumentation, slo, error budget.
Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
Real-time monitoring of ClickHouse metrics, events, and asynchronous metrics. Use for load average, connections, queue monitoring, and resource saturation.
Use when building comprehensive monitoring and observability systems.