Loading...
Loading...
Found 41 Skills
Use this skill when implementing logging, metrics, distributed tracing, alerting, or defining SLOs. Triggers on structured logging, Prometheus, Grafana, OpenTelemetry, Datadog, distributed tracing, error tracking, dashboards, alert fatigue, SLIs, SLOs, error budgets, and any task requiring system observability or monitoring setup.
Upgrade any Pulumi provider to a newer version and reconcile the resulting diff. Use when users want to upgrade or update a provider (including editing package.json, requirements.txt, pyproject.toml, go.mod, or Pulumi.yaml to bump a provider SDK), check for breaking changes before or during an upgrade, fix resources that broke after a provider upgrade, or resolve unexpected replacements, creates, or deletes in a post-upgrade preview. Applies to all providers (aws, azure-native, gcp, kubernetes, aws-native, cloudflare, datadog, etc.) — not just Tier 1. Do NOT use for querying which stacks use what package versions; use skill `package-usage` for cross-stack audits. Do NOT use for general infrastructure tasks.
Configures log and metric export for CockroachDB Cloud clusters to external monitoring services including AWS CloudWatch, GCP Cloud Logging, and Datadog. Use when setting up log export for audit compliance, configuring metric export for monitoring, or troubleshooting log delivery issues.
Query APIs, files, and live sources using Coral SQL. Use when the user asks about data from GitHub, Slack, Linear, Datadog, Sentry, or other connected sources.
Redis observability guidance — which metrics to monitor (memory, connections, hit ratio, ops/sec, rejected connections), which built-in commands to reach for during incident triage (SLOWLOG, INFO, MEMORY DOCTOR, CLIENT LIST, FT.PROFILE), and when to use the Redis Insight GUI. Use when setting up monitoring or alerts for a Redis instance, diagnosing a performance regression, profiling a slow FT.SEARCH query, or wiring Redis metrics into Prometheus, Datadog, or similar.
Use when adding logging to services, setting up monitoring, creating alerts, debugging production issues, designing SLIs/SLOs, or implementing structured logging (Pino, Winston), metrics (Prometheus, DataDog, CloudWatch), or distributed tracing (OpenTelemetry).
OpenTelemetry, distributed tracing, structured logging, metrics (Prometheus, Grafana, Datadog). Use when implementing monitoring, tracing, or debugging production issues.
Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).
Export cost-tracking telemetry in Prometheus textfile or webhook JSON formats — for external observability (Grafana, Datadog, custom dashboards)
Use this skill when working with SigNoz - open-source observability platform for application monitoring, distributed tracing, log management, metrics, alerts, and dashboards. Triggers on SigNoz setup, OpenTelemetry instrumentation for SigNoz, sending traces/logs/metrics to SigNoz, creating SigNoz dashboards, configuring SigNoz alerts, exception monitoring, and migrating from Datadog/Grafana/New Relic to SigNoz.
Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
Root cause analysis on production LLM traces. Diagnoses why an LLM application is failing — works from eval judge verdicts, runtime errors, or structural anomalies depending on what signals are present. Walks the span tree from symptom to root cause. Use when user says "what's wrong with my app", "why is my eval failing", "analyze errors", "root cause analysis", "diagnose failures", or wants to understand production failure patterns.