health-checks


Health Checks Skill

Overview

This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.

Health Check Philosophy

Why Regular Health Checks?

  1. Proactive Detection: Find issues before users do
  2. Trend Identification: Spot degradation early
  3. Capacity Planning: Know when to scale
  4. Compliance: Maintain system hygiene
  5. Documentation: Track system state over time

Health Check Frequency

| Check Type | Frequency | When |
| --- | --- | --- |
| Quick | Every deploy | After any deployment |
| Daily | Daily | Morning/start of business |
| Weekly | Weekly | Beginning of week |
| Deep | Monthly | Beginning of month |
| Full Audit | Quarterly | Scheduled maintenance window |

Health Status Framework

Traffic Light System

GREEN  - All systems healthy
         - No critical issues
         - Metrics within normal ranges
         - Advisory count: 0

YELLOW - Warning state
         - Non-critical issues present
         - Metrics approaching limits
         - Performance advisories present

RED    - Critical state
         - Service impaired or unavailable
         - Critical metrics exceeded
         - Security advisories present
         - Immediate action required

Status Determination Rules

| Condition | Status |
| --- | --- |
| Security advisory exists | RED |
| Service unavailable | RED |
| Error rate > 5% | RED |
| Connection utilization > 85% | RED |
| CI success rate < 75% | RED |
| Performance advisory exists | YELLOW |
| Error rate 1-5% | YELLOW |
| Connection utilization 70-85% | YELLOW |
| CI success rate 75-90% | YELLOW |
| Long-running queries present | YELLOW |
| All metrics normal | GREEN |
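
The rules above can be sketched as a single function that checks RED conditions first, then YELLOW, and falls through to GREEN. This is a minimal illustration, not part of the skill itself: the `Snapshot` class and its field names are hypothetical, and boundary values (for example exactly 85% connection utilization) are assumed to belong to the YELLOW band, following the ranges in the table.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for the inputs named in the rules table."""
    security_advisories: int = 0
    performance_advisories: int = 0
    service_available: bool = True
    error_rate_pct: float = 0.0
    connection_utilization_pct: float = 0.0
    ci_success_rate_pct: float = 100.0
    long_running_queries: int = 0


def determine_status(s: Snapshot) -> str:
    # Any single RED condition makes the overall status RED.
    red = (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_utilization_pct > 85
        or s.ci_success_rate_pct < 75
    )
    if red:
        return "RED"
    # Otherwise any YELLOW condition downgrades GREEN to YELLOW.
    yellow = (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_utilization_pct >= 70
        or s.ci_success_rate_pct <= 90
        or s.long_running_queries > 0
    )
    return "YELLOW" if yellow else "GREEN"
```

Checking RED before YELLOW means a security advisory always wins over a mild warning, which matches the ordering of the table.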

Health Metrics

Key Performance Indicators

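| Platform | Metric | Good | Warning | Critical |
| --- | --- | --- | --- | --- |
| Database | Connection % | <70% | 70-85% | >85% |
| Database | Query Duration | <100ms | 100-500ms | >500ms |
| Database | Dead Rows % | <10% | 10-20% | >20% |
| API | Error Rate | <1% | 1-5% | >5% |
| API | Response Time P95 | <500ms | 500-2000ms | >2000ms |
| CI/CD | Success Rate | >90% | 75-90% | <75% |
| CI/CD | Build Time | <5min | 5-15min | >15min |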
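
One way to apply these bands is a small lookup table keyed by metric. The sketch below is illustrative only: the metric keys and the `classify` helper are hypothetical names, and values that sit exactly on a boundary are assumed to fall in the warning band.

```python
# (warning bound, critical bound, higher_is_worse) per metric.
# Keys are hypothetical identifiers for the rows of the KPI table.
THRESHOLDS = {
    "connection_pct": (70, 85, True),
    "query_duration_ms": (100, 500, True),
    "dead_rows_pct": (10, 20, True),
    "error_rate_pct": (1, 5, True),
    "response_time_p95_ms": (500, 2000, True),
    "ci_success_rate_pct": (90, 75, False),  # lower is worse
    "build_time_min": (5, 15, True),
}


def classify(metric: str, value: float) -> str:
    warn, crit, higher_is_worse = THRESHOLDS[metric]
    if not higher_is_worse:
        # Flip the sign so the same comparisons work for both directions.
        value, warn, crit = -value, -warn, -crit
    if value > crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "good"
```

For example, `classify("ci_success_rate_pct", 92)` returns `"good"` and `classify("connection_pct", 88)` returns `"critical"`.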

Platform-Specific Metrics

Supabase

  • API error rate
  • Auth failure rate
  • Storage utilization
  • Edge function cold starts
  • Realtime connection count
  • Advisory count (security/performance)

GitHub Actions

  • Workflow success rate
  • Average build time
  • Queue wait time
  • Cache hit rate
  • Failed workflow count

Railway

  • Service uptime
  • Deploy success rate
  • Memory utilization
  • CPU utilization
  • Health check pass rate

Postgres

  • Connection utilization
  • Query duration distribution
  • Lock contention
  • Dead tuple ratio
  • Index usage efficiency
  • Table bloat
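
Two of these metrics reduce to simple ratios. The sketch below assumes the inputs come from the `n_live_tup` and `n_dead_tup` columns of `pg_stat_user_tables` and from the `max_connections` setting; defining the dead tuple ratio as dead / (live + dead) is an assumption, since the skill does not spell out the formula.

```python
def dead_tuple_ratio_pct(n_live_tup: int, n_dead_tup: int) -> float:
    """Dead tuples as a percentage of all tuples in a table (assumed definition)."""
    total = n_live_tup + n_dead_tup
    if total == 0:
        return 0.0  # an empty table has nothing to vacuum
    return 100.0 * n_dead_tup / total


def connection_utilization_pct(current_connections: int, max_connections: int) -> float:
    """Open connections as a percentage of the configured maximum."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * current_connections / max_connections
```

A table with 900 live and 100 dead tuples gives a ratio of 10%, which sits at the start of the warning band in the KPI table.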

Health Check Procedures

Quick Health Check (5 min)

Purpose: Verify basic system functionality
1. [ ] Check for active incidents (any platform)
2. [ ] Verify all services responding
3. [ ] Check for critical advisories
4. [ ] Review last hour error rate
5. [ ] Check connection pool status

Daily Health Check (15 min)

Purpose: Assess overall system health
1. [ ] Run quick health check
2. [ ] Review 24-hour error trends
3. [ ] Check CI/CD success rate
4. [ ] Review all advisories
5. [ ] Check slow query log
6. [ ] Verify backups completed
7. [ ] Review resource utilization

Weekly Health Check (30 min)

Purpose: Comprehensive review and trending
1. [ ] Run daily health check
2. [ ] Analyze weekly error patterns
3. [ ] Review index usage stats
4. [ ] Check for table bloat
5. [ ] Review connection patterns
6. [ ] Assess capacity trends
7. [ ] Review deployment frequency
8. [ ] Check certificate expirations

Monthly Deep Check (1+ hours)

Purpose: Full system audit
1. [ ] Run weekly health check
2. [ ] Full index analysis
3. [ ] Query performance review
4. [ ] Security configuration audit
5. [ ] Capacity planning review
6. [ ] Cost analysis
7. [ ] Documentation review
8. [ ] Disaster recovery test
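
Each procedure starts by running the previous one, so the four checklists nest. The following sketch shows one possible way to model that nesting and expand a checklist into its full list of steps; the `CHECKLISTS` structure and the `expand` helper are hypothetical and only mirror the steps listed above.

```python
# name -> (parent checklist or None, steps specific to this level)
CHECKLISTS = {
    "quick": (None, [
        "Check for active incidents (any platform)",
        "Verify all services responding",
        "Check for critical advisories",
        "Review last hour error rate",
        "Check connection pool status",
    ]),
    "daily": ("quick", [
        "Review 24-hour error trends",
        "Check CI/CD success rate",
        "Review all advisories",
        "Check slow query log",
        "Verify backups completed",
        "Review resource utilization",
    ]),
    "weekly": ("daily", [
        "Analyze weekly error patterns",
        "Review index usage stats",
        "Check for table bloat",
        "Review connection patterns",
        "Assess capacity trends",
        "Review deployment frequency",
        "Check certificate expirations",
    ]),
    "monthly": ("weekly", [
        "Full index analysis",
        "Query performance review",
        "Security configuration audit",
        "Capacity planning review",
        "Cost analysis",
        "Documentation review",
        "Disaster recovery test",
    ]),
}


def expand(name: str) -> list:
    """Return every step of a checklist, inherited steps first."""
    parent, steps = CHECKLISTS[name]
    inherited = expand(parent) if parent else []
    return inherited + steps
```

Under this model, `expand("monthly")` yields 25 steps, beginning with the quick-check items and ending with the disaster recovery test.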

Alert Thresholds

Immediate Alerts (Page)

  • Service unavailable > 1 minute
  • Error rate > 10%
  • Database connections > 90%
  • Security advisory created
  • Deployment failure (production)
  • Health check failure > 5 minutes

Warning Alerts (Slack/Email)

  • Error rate > 2%
  • Database connections > 75%
  • Performance advisory created
  • Build time increase > 50%
  • Response time P95 > 1s
  • Disk usage > 80%

Info Alerts (Daily Digest)

  • New advisory (any type)
  • Build time change
  • Resource trend change
  • Configuration change
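
The page and warning thresholds can be sketched as a routing function that returns the most severe tier a set of measurements triggers. The metric keys below are hypothetical names chosen for this illustration, missing keys are treated as zero, and the info tier is left out because it is driven by change events rather than numeric thresholds.

```python
from typing import Mapping, Optional


def alert_tier(m: Mapping[str, float]) -> Optional[str]:
    """Return "page", "warning", or None for one set of measurements."""
    page = (
        m.get("service_unavailable_s", 0) > 60
        or m.get("error_rate_pct", 0) > 10
        or m.get("db_connection_pct", 0) > 90
        or m.get("security_advisories_created", 0) > 0
        or m.get("production_deploy_failures", 0) > 0
        or m.get("health_check_failing_s", 0) > 300
    )
    if page:
        return "page"
    warning = (
        m.get("error_rate_pct", 0) > 2
        or m.get("db_connection_pct", 0) > 75
        or m.get("performance_advisories_created", 0) > 0
        or m.get("build_time_increase_pct", 0) > 50
        or m.get("response_time_p95_ms", 0) > 1000
        or m.get("disk_usage_pct", 0) > 80
    )
    if warning:
        return "warning"
    return None
```

For instance, `alert_tier({"error_rate_pct": 3})` returns `"warning"`, while an error rate of 12% returns `"page"`.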

Health Report Template

Infrastructure Health Report

Generated: {TIMESTAMP}
Report Type: {Quick | Daily | Weekly | Monthly}
Overall Status: {GREEN | YELLOW | RED}

Executive Summary

{2-3 sentence overview}

Platform Status

| Platform | Status | Issues | Warnings |
| --- | --- | --- | --- |
| GitHub Actions | {STATUS} | {N} | {N} |
| Railway | {STATUS} | {N} | {N} |
| Supabase | {STATUS} | {N} | {N} |
| Postgres | {STATUS} | {N} | {N} |

Key Metrics

Database

  • Connections: {N}/{MAX} ({PCT}%)
  • Query P95: {MS}ms
  • Dead Rows: {PCT}%

API

  • Error Rate: {PCT}%
  • Response Time P95: {MS}ms

CI/CD

  • Success Rate: {PCT}%
  • Avg Build Time: {MIN}m

Advisories

Security

{List or "None"}

Performance

{List or "None"}

Issues Requiring Attention

Immediate

{List or "None"}

This Week

{List or "None"}

Trends

{Notable changes from previous period}

Recommendations

{Specific actions to improve health}

Next health check: {TIMESTAMP}
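
Filling in the top of this template could look like the sketch below. It is only an illustration: the `render_header` function and the shape of the `platforms` dictionary are hypothetical, and deriving the overall status as the worst per-platform status is an assumption the template itself does not state.

```python
from datetime import datetime, timezone

# Higher number means more severe.
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def render_header(report_type: str, platforms: dict, generated: datetime) -> str:
    """Render the report header and the platform status table as text."""
    statuses = (p["status"] for p in platforms.values())
    # Assumption: the overall status is the most severe platform status.
    overall = max(statuses, key=SEVERITY.__getitem__, default="GREEN")
    lines = [
        "Infrastructure Health Report",
        "",
        f"Generated: {generated.isoformat()}",
        f"Report Type: {report_type}",
        f"Overall Status: {overall}",
        "",
        "| Platform | Status | Issues | Warnings |",
        "| --- | --- | --- | --- |",
    ]
    for name, p in platforms.items():
        lines.append(f"| {name} | {p['status']} | {p['issues']} | {p['warnings']} |")
    return "\n".join(lines)


example = render_header(
    "Daily",
    {
        "GitHub Actions": {"status": "GREEN", "issues": 0, "warnings": 0},
        "Postgres": {"status": "YELLOW", "issues": 0, "warnings": 2},
    },
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

The remaining sections (key metrics, advisories, trends, recommendations) could be appended in the same way.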

Remediation Playbooks

High Connection Utilization

1. Check for connection leaks
2. Identify idle connections
3. Review connection pool settings
4. Consider connection pooler (PgBouncer/Supavisor)
5. Optimize application connection handling
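
Steps 1 and 2 can be approached by summarizing rows from Postgres's `pg_stat_activity` view. The sketch below assumes the rows have already been fetched as dictionaries with `pid`, `state`, and `state_change` keys (all real columns of that view); the `summarize_connections` helper and the 10-minute idle cutoff are hypothetical choices for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_connections(rows, now, idle_after=timedelta(minutes=10)):
    """Count connections per state and list PIDs idle longer than idle_after."""
    # state is NULL for some background processes, so fall back to "unknown".
    by_state = Counter((r["state"] or "unknown") for r in rows)
    stale = [
        r["pid"]
        for r in rows
        if (r["state"] or "").startswith("idle")
        and now - r["state_change"] > idle_after
    ]
    return by_state, stale
```

Connections stuck in `idle in transaction` for a long time are a common sign of a leak in application code, which is why they are included alongside plain `idle` sessions.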

High Error Rate

1. Identify error types
2. Check recent deployments
3. Review affected endpoints
4. Check downstream dependencies
5. Roll back if deployment-related
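
Deciding whether a spike is deployment-related (steps 2 and 5) often comes down to comparing timestamps. A minimal sketch, assuming deployment times are available as `datetime` values; the `suspect_deployment` helper and the 30-minute window are hypothetical.

```python
from datetime import datetime, timedelta


def suspect_deployment(spike_start, deployments, window=timedelta(minutes=30)):
    """Return the latest deployment within `window` before the spike, or None."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= spike_start - d <= window
    ]
    return max(candidates, default=None)
```

If this returns a deployment, that release is the first rollback candidate; if it returns `None`, downstream dependencies are the more likely place to look.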

Slow Queries

1. Identify slow queries (pg_stat_statements)
2. Run EXPLAIN ANALYZE
3. Check for missing indexes
4. Review query patterns
5. Consider query optimization or caching
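
Step 1 can be sketched as a query against `pg_stat_statements` followed by a filter on mean execution time. The SQL below assumes PostgreSQL 13 or later, where the column is named `mean_exec_time` (earlier versions call it `mean_time`); the `flag_slow` helper is hypothetical and reuses the 100 ms and 500 ms bands from the KPI table.

```python
# Assumes the pg_stat_statements extension is installed (PostgreSQL 13+ column names).
TOP_SLOW_SQL = """
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20
"""


def flag_slow(rows, warn_ms=100.0, crit_ms=500.0):
    """Label rows by mean execution time and return them slowest first."""
    flagged = []
    for r in rows:
        mean = r["mean_exec_time"]
        if mean > crit_ms:
            level = "critical"
        elif mean >= warn_ms:
            level = "warning"
        else:
            continue  # within the "good" band, nothing to report
        flagged.append((level, r["query"], mean))
    flagged.sort(key=lambda item: item[2], reverse=True)
    return flagged
```

Each flagged query is then a candidate for `EXPLAIN ANALYZE` in step 2.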

Build Failures

1. Review failure logs
2. Check for flaky tests
3. Verify dependencies available
4. Check for environment issues
5. Review recent changes
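
For step 2, one common heuristic treats a test as flaky when it both passes and fails on the same commit. The sketch below assumes test results are available as `(commit, test_name, passed)` tuples; that shape and the `find_flaky` helper are hypothetical.

```python
from collections import defaultdict


def find_flaky(results):
    """Return names of tests with both a pass and a fail on the same commit."""
    outcomes = defaultdict(set)
    for commit, test, passed in results:
        outcomes[(commit, test)].add(passed)
    return sorted(
        {test for (_, test), seen in outcomes.items() if seen == {True, False}}
    )
```

A test that fails consistently on one commit is not reported here, since that pattern points to a real regression rather than flakiness.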

See checklists.md for detailed health check checklists.