Health monitoring knowledge and procedures for infrastructure platforms. Use when assessing system health, running health audits, or setting up monitoring.

npx skill4agent add doubleslashse/claude-marketplace health-checks

| Check Type | Frequency | When |
|---|---|---|
| Quick | Every deploy | After any deployment |
| Daily | Daily | Morning/start of business |
| Weekly | Weekly | Beginning of week |
| Deep | Monthly | Beginning of month |
| Full Audit | Quarterly | Scheduled maintenance window |
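
The schedule above can be sketched as a small helper that lists which checks are due on a given day. This is only an illustration: the function name `checks_due`, the choice of Monday as "beginning of week", and day 1 as "beginning of month" are assumptions, not something the table specifies. The quarterly Full Audit is left out because it depends on a scheduled maintenance window rather than on the calendar.

```python
from datetime import date


def checks_due(today: date, just_deployed: bool = False) -> list[str]:
    """Return the check types from the schedule table that are due today."""
    due = []
    if just_deployed:
        # Quick checks run after any deployment.
        due.append("Quick")
    # Daily checks run every day.
    due.append("Daily")
    if today.weekday() == 0:
        # Assumption: "beginning of week" means Monday.
        due.append("Weekly")
    if today.day == 1:
        # Assumption: "beginning of month" means the first calendar day.
        due.append("Deep")
    return due


if __name__ == "__main__":
    print(checks_due(date.today(), just_deployed=True))
```

A scheduler or CI job could call this once per day and pass `just_deployed=True` from a deploy hook.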

GREEN - All systems healthy
- No critical issues
- Metrics within normal ranges
- Advisory count: 0

YELLOW - Warning state
- Non-critical issues present
- Metrics approaching limits
- Performance advisories present

RED - Critical state
- Service impaired or unavailable
- Critical metrics exceeded
- Security advisories present
- Immediate action required

| Condition | Status |
|---|---|
| Security advisory exists | RED |
| Service unavailable | RED |
| Error rate > 5% | RED |
| Connection utilization > 85% | RED |
| CI success rate < 75% | RED |
| Performance advisory exists | YELLOW |
| Error rate 1-5% | YELLOW |
| Connection utilization 70-85% | YELLOW |
| CI success rate 75-90% | YELLOW |
| Long-running queries present | YELLOW |
| All metrics normal | GREEN |
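
The status rules in the table above could be applied in code roughly as follows. This is a minimal sketch: the `HealthSnapshot` fields and the `overall_status` function are hypothetical names, and the handling of exact boundary values (for example, treating exactly 85% connection utilization as YELLOW) is an assumption, since the table gives ranges without saying which side is inclusive.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical container for the inputs named in the status table."""
    security_advisories: int = 0
    performance_advisories: int = 0
    service_available: bool = True
    error_rate_pct: float = 0.0
    connection_util_pct: float = 0.0
    ci_success_pct: float = 100.0
    long_running_queries: int = 0


def overall_status(s: HealthSnapshot) -> str:
    """Return RED if any RED condition holds, else YELLOW, else GREEN."""
    if (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_util_pct > 85
        or s.ci_success_pct < 75
    ):
        return "RED"
    if (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_util_pct >= 70
        or s.ci_success_pct <= 90
        or s.long_running_queries > 0
    ):
        return "YELLOW"
    return "GREEN"


if __name__ == "__main__":
    print(overall_status(HealthSnapshot(error_rate_pct=3.0)))
```

RED conditions are checked first so that a single critical condition always outranks any number of warnings.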

| Platform | Metric | Good | Warning | Critical |
|---|---|---|---|---|
| Database | Connection % | <70% | 70-85% | >85% |
| Database | Query Duration | <100ms | 100-500ms | >500ms |
| Database | Dead Rows % | <10% | 10-20% | >20% |
| API | Error Rate | <1% | 1-5% | >5% |
| API | Response Time P95 | <500ms | 500-2000ms | >2000ms |
| CI/CD | Success Rate | >90% | 75-90% | <75% |
| CI/CD | Build Time | <5min | 5-15min | >15min |
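
One way to make the thresholds above machine-checkable is a lookup table keyed by platform and metric. The sketch below is an assumption-laden illustration: `THRESHOLDS` and `classify` are hypothetical names, values are expected in the units shown in the table (percent, milliseconds, minutes), and values that fall exactly on a boundary are treated as Warning.

```python
# (warning_bound, critical_bound, higher_is_worse) per metric, copied from the table.
THRESHOLDS = {
    ("Database", "Connection %"): (70, 85, True),
    ("Database", "Query Duration"): (100, 500, True),   # milliseconds
    ("Database", "Dead Rows %"): (10, 20, True),
    ("API", "Error Rate"): (1, 5, True),                # percent
    ("API", "Response Time P95"): (500, 2000, True),    # milliseconds
    ("CI/CD", "Success Rate"): (90, 75, False),         # percent, lower is worse
    ("CI/CD", "Build Time"): (5, 15, True),             # minutes
}


def classify(platform: str, metric: str, value: float) -> str:
    """Classify a metric value as Good, Warning, or Critical."""
    warn, crit, higher_is_worse = THRESHOLDS[(platform, metric)]
    if higher_is_worse:
        if value > crit:
            return "Critical"
        if value >= warn:
            return "Warning"
        return "Good"
    # Metrics such as success rate get worse as the value drops.
    if value < crit:
        return "Critical"
    if value <= warn:
        return "Warning"
    return "Good"


if __name__ == "__main__":
    print(classify("Database", "Query Duration", 250))
```

Keeping the thresholds in data rather than in branching logic makes it easier to adjust them per environment without touching the classifier.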
1. [ ] Check for active incidents (any platform)
2. [ ] Verify all services responding
3. [ ] Check for critical advisories
4. [ ] Review last hour error rate
5. [ ] Check connection pool status

1. [ ] Run quick health check
2. [ ] Review 24-hour error trends
3. [ ] Check CI/CD success rate
4. [ ] Review all advisories
5. [ ] Check slow query log
6. [ ] Verify backups completed
7. [ ] Review resource utilization

1. [ ] Run daily health check
2. [ ] Analyze weekly error patterns
3. [ ] Review index usage stats
4. [ ] Check for table bloat
5. [ ] Review connection patterns
6. [ ] Assess capacity trends
7. [ ] Review deployment frequency
8. [ ] Check certificate expirations

1. [ ] Run weekly health check
2. [ ] Full index analysis
3. [ ] Query performance review
4. [ ] Security configuration audit
5. [ ] Capacity planning review
6. [ ] Cost analysis
7. [ ] Documentation review
8. [ ] Disaster recovery test

# Infrastructure Health Report
**Generated**: {TIMESTAMP}
**Report Type**: {Quick | Daily | Weekly | Monthly}
**Overall Status**: {GREEN | YELLOW | RED}
## Executive Summary
{2-3 sentence overview}
## Platform Status
| Platform | Status | Issues | Warnings |
|----------|--------|--------|----------|
| GitHub Actions | {STATUS} | {N} | {N} |
| Railway | {STATUS} | {N} | {N} |
| Supabase | {STATUS} | {N} | {N} |
| Postgres | {STATUS} | {N} | {N} |
## Key Metrics
### Database
- Connections: {N}/{MAX} ({PCT}%)
- Query P95: {MS}ms
- Dead Rows: {PCT}%
### API
- Error Rate: {PCT}%
- Response Time P95: {MS}ms
### CI/CD
- Success Rate: {PCT}%
- Avg Build Time: {MIN}m
## Advisories
### Security
{List or "None"}
### Performance
{List or "None"}
## Issues Requiring Attention
### Immediate
{List or "None"}
### This Week
{List or "None"}
## Trends
{Notable changes from previous period}
## Recommendations
{Specific actions to improve health}
---
*Next health check: {TIMESTAMP}*

1. Check for connection leaks
2. Identify idle connections
3. Review connection pool settings
4. Consider connection pooler (PgBouncer/Supavisor)
5. Optimize application connection handling

1. Identify error types
2. Check recent deployments
3. Review affected endpoints
4. Check downstream dependencies
5. Roll back if deployment-related

1. Identify slow queries (pg_stat_statements)
2. Run EXPLAIN ANALYZE
3. Check for missing indexes
4. Review query patterns
5. Consider query optimization or caching

1. Review failure logs
2. Check for flaky tests
3. Verify dependencies available
4. Check for environment issues
5. Review recent changes
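
For the "check for flaky tests" step in the CI steps above, one common heuristic is to flag any test that both passed and failed on the same commit, since the code did not change between runs. The sketch below assumes run history is available as `(test_name, commit_sha, passed)` tuples; that input shape and the function name `find_flaky_tests` are hypothetical and would need to be adapted to whatever the CI system actually exports.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return sorted names of tests with mixed outcomes on the same commit.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))
    flaky = {
        test_name
        for (test_name, _commit), seen in outcomes.items()
        if seen == {True, False}
    }
    return sorted(flaky)


if __name__ == "__main__":
    history = [
        ("test_login", "abc123", True),
        ("test_login", "abc123", False),   # same commit, different result
        ("test_payment", "abc123", False),
        ("test_payment", "abc123", False), # consistently failing, not flaky
    ]
    print(find_flaky_tests(history))
```

Tests that fail consistently on a commit are not reported here; those point to a real regression or an environment issue, which the remaining steps in the list cover.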