health-checks
Health Checks Skill
Overview
This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.
Health Check Philosophy
Why Regular Health Checks?
- Proactive Detection: Find issues before users do
- Trend Identification: Spot degradation early
- Capacity Planning: Know when to scale
- Compliance: Maintain system hygiene
- Documentation: Track system state over time
Health Check Frequency
| Check Type | Frequency | When |
|---|---|---|
| Quick | Every deploy | After any deployment |
| Daily | Daily | Morning/start of business |
| Weekly | Weekly | Beginning of week |
| Deep | Monthly | Beginning of month |
| Full Audit | Quarterly | Scheduled maintenance window |
Health Status Framework
Traffic Light System
GREEN - All systems healthy
- No critical issues
- Metrics within normal ranges
- Advisory count: 0
YELLOW - Warning state
- Non-critical issues present
- Metrics approaching limits
- Performance advisories present
RED - Critical state
- Service impaired or unavailable
- Critical metrics exceeded
- Security advisories present
- Immediate action required
Status Determination Rules
| Condition | Status |
|---|---|
| Security advisory exists | RED |
| Service unavailable | RED |
| Error rate > 5% | RED |
| Connection utilization > 85% | RED |
| CI success rate < 75% | RED |
| Performance advisory exists | YELLOW |
| Error rate 1-5% | YELLOW |
| Connection utilization 70-85% | YELLOW |
| CI success rate 75-90% | YELLOW |
| Long-running queries present | YELLOW |
| All metrics normal | GREEN |
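The rules in the table above can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the `Snapshot` class and its field names are hypothetical, and treating the range boundaries (1%, 70%, 85%, 90%) as inclusive on the YELLOW side is an assumption, since the table does not say which side the exact boundary values fall on.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for one round of collected metrics."""
    security_advisories: int = 0
    performance_advisories: int = 0
    service_available: bool = True
    error_rate_pct: float = 0.0
    connection_util_pct: float = 0.0
    ci_success_rate_pct: float = 100.0
    long_running_queries: int = 0


def determine_status(s: Snapshot) -> str:
    """Return RED, YELLOW, or GREEN; RED rules are checked first."""
    if (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_util_pct > 85
        or s.ci_success_rate_pct < 75
    ):
        return "RED"
    if (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_util_pct >= 70
        or s.ci_success_rate_pct <= 90
        or s.long_running_queries > 0
    ):
        return "YELLOW"
    return "GREEN"
```

Checking the RED conditions before the YELLOW ones means a single critical condition always wins, which matches the ordering of the table.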
Health Metrics
Key Performance Indicators
| Platform | Metric | Good | Warning | Critical |
|---|---|---|---|---|
| Database | Connection % | <70% | 70-85% | >85% |
| Database | Query Duration | <100ms | 100-500ms | >500ms |
| Database | Dead Rows % | <10% | 10-20% | >20% |
| API | Error Rate | <1% | 1-5% | >5% |
| API | Response Time P95 | <500ms | 500-2000ms | >2000ms |
| CI/CD | Success Rate | >90% | 75-90% | <75% |
| CI/CD | Build Time | <5min | 5-15min | >15min |
Platform-Specific Metrics
Supabase
- API error rate
- Auth failure rate
- Storage utilization
- Edge function cold starts
- Realtime connection count
- Advisory count (security/performance)
GitHub Actions
- Workflow success rate
- Average build time
- Queue wait time
- Cache hit rate
- Failed workflow count
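As one possible way to compute the workflow success rate, the sketch below takes a list of run conclusions and returns a percentage. The strings follow the `conclusion` values GitHub reports for workflow runs (`success`, `failure`, `timed_out`, `cancelled`, `skipped`); the function name and the choice to leave cancelled, skipped, and still-running runs out of the denominator are assumptions, not something this skill specifies.

```python
from collections import Counter
from typing import Iterable, Optional

# Assumption: only these conclusions count as a finished, meaningful run.
COUNTED = ("success", "failure", "timed_out")


def ci_success_rate(conclusions: Iterable[Optional[str]]) -> Optional[float]:
    """Percentage of counted runs that succeeded, or None if none counted."""
    counts = Counter(c for c in conclusions if c in COUNTED)
    total = sum(counts.values())
    if total == 0:
        return None  # no finished runs in the window; avoid dividing by zero
    return 100.0 * counts["success"] / total
```

Returning `None` for an empty window keeps "no data" distinct from a 0% success rate, which would otherwise trip the RED threshold.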
Railway
- Service uptime
- Deploy success rate
- Memory utilization
- CPU utilization
- Health check pass rate
Postgres
- Connection utilization
- Query duration distribution
- Lock contention
- Dead tuple ratio
- Index usage efficiency
- Table bloat
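The dead tuple ratio can be estimated from the standard `pg_stat_user_tables` view. The sketch below is illustrative only: the query constant and helper names are hypothetical, and defining the ratio as dead / (live + dead) is an assumption, since the skill does not state the formula. The 10% default mirrors the Warning threshold for Dead Rows % in the KPI table.

```python
from typing import Iterable, List, Tuple

# Assumed query; n_live_tup and n_dead_tup are estimates kept by Postgres.
DEAD_TUPLE_SQL = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""


def dead_row_pct(n_live_tup: int, n_dead_tup: int) -> float:
    """Dead tuples as a percentage of all tuples (assumed definition)."""
    total = n_live_tup + n_dead_tup
    return 100.0 * n_dead_tup / total if total else 0.0


def tables_over_threshold(
    rows: Iterable[Tuple[str, int, int]], threshold_pct: float = 10.0
) -> List[Tuple[str, float]]:
    """Return (table, pct) pairs at or above the threshold."""
    result = []
    for relname, live, dead in rows:
        pct = dead_row_pct(live, dead)
        if pct >= threshold_pct:
            result.append((relname, pct))
    return result
```

The rows passed to `tables_over_threshold` are expected in the column order of the query, for example as returned by a database cursor's `fetchall()`.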
Health Check Procedures
Quick Health Check (5 min)
Purpose: Verify basic system functionality
1. [ ] Check for active incidents (any platform)
2. [ ] Verify all services responding
3. [ ] Check for critical advisories
4. [ ] Review last hour error rate
5. [ ] Check connection pool status
Daily Health Check (15 min)
Purpose: Assess overall system health
1. [ ] Run quick health check
2. [ ] Review 24-hour error trends
3. [ ] Check CI/CD success rate
4. [ ] Review all advisories
5. [ ] Check slow query log
6. [ ] Verify backups completed
7. [ ] Review resource utilization
Weekly Health Check (30 min)
Purpose: Comprehensive review and trending
1. [ ] Run daily health check
2. [ ] Analyze weekly error patterns
3. [ ] Review index usage stats
4. [ ] Check for table bloat
5. [ ] Review connection patterns
6. [ ] Assess capacity trends
7. [ ] Review deployment frequency
8. [ ] Check certificate expirations
Monthly Deep Check (1+ hours)
Purpose: Full system audit
1. [ ] Run weekly health check
2. [ ] Full index analysis
3. [ ] Query performance review
4. [ ] Security configuration audit
5. [ ] Capacity planning review
6. [ ] Cost analysis
7. [ ] Documentation review
8. [ ] Disaster recovery test
Alert Thresholds
Immediate Alerts (Page)
- Service unavailable > 1 minute
- Error rate > 10%
- Database connections > 90%
- Security advisory created
- Deployment failure (production)
- Health check failure > 5 minutes
Warning Alerts (Slack/Email)
- Error rate > 2%
- Database connections > 75%
- Performance advisory created
- Build time increase > 50%
- Response time P95 > 1s
- Disk usage > 80%
Info Alerts (Daily Digest)
- New advisory (any type)
- Build time change
- Resource trend change
- Configuration change
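For the two metrics that appear in both the Immediate and Warning tiers (error rate and database connections), tiered routing might look like the sketch below. The metric keys and the returned channel labels are hypothetical names chosen for illustration; only the numeric thresholds come from the lists above.

```python
# (warning threshold, page threshold), taken from the alert lists above.
THRESHOLDS = {
    "error_rate_pct": (2.0, 10.0),
    "db_connection_pct": (75.0, 90.0),
}


def route_alert(metric: str, value: float) -> str:
    """Return 'page', 'warning', or 'none' for a single metric reading."""
    warn, page = THRESHOLDS[metric]
    if value > page:
        return "page"      # Immediate Alerts (Page)
    if value > warn:
        return "warning"   # Warning Alerts (Slack/Email)
    return "none"
```

Keeping the thresholds in one table makes it easier to add further tiered metrics later without touching the routing logic.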
Health Report Template
Infrastructure Health Report
Generated: {TIMESTAMP}
Report Type: {Quick | Daily | Weekly | Monthly}
Overall Status: {GREEN | YELLOW | RED}
Executive Summary
{2-3 sentence overview}
Platform Status
| Platform | Status | Issues | Warnings |
|---|---|---|---|
| GitHub Actions | {STATUS} | {N} | {N} |
| Railway | {STATUS} | {N} | {N} |
| Supabase | {STATUS} | {N} | {N} |
| Postgres | {STATUS} | {N} | {N} |
Key Metrics
Database
- Connections: {N}/{MAX} ({PCT}%)
- Query P95: {MS}ms
- Dead Rows: {PCT}%
API
- Error Rate: {PCT}%
- Response Time P95: {MS}ms
CI/CD
- Success Rate: {PCT}%
- Avg Build Time: {MIN}m
Advisories
Security
{List or "None"}
Performance
{List or "None"}
Issues Requiring Attention
Immediate
{List or "None"}
This Week
{List or "None"}
Trends
{Notable changes from previous period}
Recommendations
{Specific actions to improve health}
Next health check: {TIMESTAMP}
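Filling in the Platform Status table and the Overall Status field of the template can be automated. The sketch below is one way to do it; the helper names are hypothetical, and deriving the overall status as the worst per-platform status is an assumption that the template itself does not state.

```python
from typing import Iterable, Tuple

SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(statuses: Iterable[str]) -> str:
    """Worst status wins (assumed rule); an empty input yields GREEN."""
    return max(statuses, key=SEVERITY.__getitem__, default="GREEN")


def render_platform_status(rows: Iterable[Tuple[str, str, int, int]]) -> str:
    """Render (platform, status, issues, warnings) rows as the report table."""
    lines = ["| Platform | Status | Issues | Warnings |", "|---|---|---|---|"]
    for name, status, issues, warnings in rows:
        lines.append(f"| {name} | {status} | {issues} | {warnings} |")
    return "\n".join(lines)
```

A caller would pass one row per platform, for example `("Railway", "GREEN", 0, 1)`, and place the returned string under the Platform Status heading.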
Remediation Playbooks
High Connection Utilization
1. Check for connection leaks
2. Identify idle connections
3. Review connection pool settings
4. Consider connection pooler (PgBouncer/Supavisor)
5. Optimize application connection handling
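Steps 1 and 2 of this playbook usually start from `pg_stat_activity`. The sketch below pairs an assumed query with a small helper that turns its single result row into percentages. The constant and function names are hypothetical, and the `backend_type` filter assumes Postgres 10 or later, where that column exists.

```python
# Assumed query: counts client connections and how many of them are idle.
CONNECTION_SQL = """
SELECT count(*) AS total,
       count(*) FILTER (WHERE state = 'idle') AS idle,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';
"""


def connection_report(total: int, idle: int, max_connections: int) -> dict:
    """Summarize utilization and the share of connections sitting idle."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return {
        "utilization_pct": 100.0 * total / max_connections,
        "idle_pct": 100.0 * idle / total if total else 0.0,
    }
```

High utilization combined with a high idle share tends to point toward leaks or an oversized application pool rather than genuine load, which is why the playbook checks both before changing pool settings.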
High Error Rate
1. Identify error types
2. Check recent deployments
3. Review affected endpoints
4. Check downstream dependencies
5. Roll back if deployment-related
Slow Queries
1. Identify slow queries (pg_stat_statements)
2. Run EXPLAIN ANALYZE
3. Check for missing indexes
4. Review query patterns
5. Consider query optimization or caching
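Step 1 of this playbook can be sketched as a query against `pg_stat_statements` plus a filter that applies the Query Duration thresholds from the KPI table (100 ms warning, 500 ms critical). The names below are hypothetical, the extension must already be enabled, and the `mean_exec_time` column assumes Postgres 13 or later (earlier versions call it `mean_time`).

```python
from typing import Iterable, List, Tuple

# Assumed query; times in pg_stat_statements are reported in milliseconds.
SLOW_QUERY_SQL = """
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
"""


def flag_slow_queries(
    rows: Iterable[Tuple[str, int, float]],
    warning_ms: float = 100.0,
    critical_ms: float = 500.0,
) -> List[Tuple[str, str, int, float]]:
    """Return (level, query, calls, mean_ms) for queries at or above warning."""
    flagged = []
    for query, calls, mean_ms in rows:
        if mean_ms > critical_ms:
            level = "critical"
        elif mean_ms >= warning_ms:
            level = "warning"
        else:
            continue  # within the Good range; nothing to report
        flagged.append((level, query, calls, mean_ms))
    return flagged
```

Each flagged query is then a candidate for the `EXPLAIN ANALYZE` step that follows in the playbook.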
Build Failures
1. Review failure logs
2. Check for flaky tests
3. Verify dependencies are available
4. Check for environment issues
5. Review recent changes
See checklists.md for detailed health check checklists.