health-checks

Health Checks Skill

Overview

This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.

Health Check Philosophy

Why Regular Health Checks?

Proactive Detection: Find issues before users do
Trend Identification: Spot degradation early
Capacity Planning: Know when to scale
Compliance: Maintain system hygiene
Documentation: Track system state over time

Health Check Frequency

Check Type	Frequency	When
Quick	Every deploy	After any deployment
Daily	Daily	Morning/start of business
Weekly	Weekly	Beginning of week
Deep	Monthly	Beginning of month
Full Audit	Quarterly	Scheduled maintenance window

Health Status Framework

Traffic Light System

GREEN  - All systems healthy
         - No critical issues
         - Metrics within normal ranges
         - Advisory count: 0

YELLOW - Warning state
         - Non-critical issues present
         - Metrics approaching limits
         - Performance advisories present

RED    - Critical state
         - Service impaired or unavailable
         - Critical metrics exceeded
         - Security advisories present
         - Immediate action required

Status Determination Rules

Condition	Status
Security advisory exists	RED
Service unavailable	RED
Error rate > 5%	RED
Connection utilization > 85%	RED
CI success rate < 75%	RED
Performance advisory exists	YELLOW
Error rate 1-5%	YELLOW
Connection utilization 70-85%	YELLOW
CI success rate 75-90%	YELLOW
Long-running queries present	YELLOW
All metrics normal	GREEN
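
The rules above can be sketched as a single function that checks RED conditions first, then YELLOW, and falls through to GREEN. This is a minimal illustration, not a reference implementation: the `HealthSnapshot` class and its field names are hypothetical, and the boundary handling (exactly 5% error rate or exactly 85% connection utilization counts as YELLOW) is an assumption drawn from how the ranges in the table are written.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Inputs to the status rules. Field names are illustrative only."""
    security_advisories: int = 0
    performance_advisories: int = 0
    service_available: bool = True
    error_rate_pct: float = 0.0
    connection_utilization_pct: float = 0.0
    ci_success_rate_pct: float = 100.0
    long_running_queries: int = 0


def determine_status(s: HealthSnapshot) -> str:
    """Return "RED", "YELLOW", or "GREEN"; RED conditions take precedence."""
    if (
        s.security_advisories > 0
        or not s.service_available
        or s.error_rate_pct > 5
        or s.connection_utilization_pct > 85
        or s.ci_success_rate_pct < 75
    ):
        return "RED"
    if (
        s.performance_advisories > 0
        or s.error_rate_pct >= 1
        or s.connection_utilization_pct >= 70
        or s.ci_success_rate_pct <= 90
        or s.long_running_queries > 0
    ):
        return "YELLOW"
    return "GREEN"
```

For example, `determine_status(HealthSnapshot(error_rate_pct=3))` falls in the 1-5% band and yields YELLOW, while any open security advisory yields RED regardless of the other values.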

Health Metrics

Key Performance Indicators

Platform	Metric	Good	Warning	Critical
Database	Connection %	<70%	70-85%	>85%
Database	Query Duration	<100ms	100-500ms	>500ms
Database	Dead Rows %	<10%	10-20%	>20%
API	Error Rate	<1%	1-5%	>5%
API	Response Time P95	<500ms	500-2000ms	>2000ms
CI/CD	Success Rate	>90%	75-90%	<75%
CI/CD	Build Time	<5min	5-15min	>15min

Platform-Specific Metrics

Supabase

API error rate
Auth failure rate
Storage utilization
Edge function cold starts
Realtime connection count
Advisory count (security/performance)

GitHub Actions

Workflow success rate
Average build time
Queue wait time
Cache hit rate
Failed workflow count
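
As a rough sketch of how the workflow success rate and failed workflow count might be derived, the function below summarizes a list of run records. It assumes each record is a dict with a `conclusion` key, as in the workflow-run objects of the GitHub REST API; fetching the runs is out of scope here. Leaving cancelled and skipped runs out of the rate is a choice made for this example, not something the skill prescribes.

```python
def workflow_stats(runs: list[dict]) -> dict:
    """Summarize workflow runs into total, failed, and success rate.

    Runs still in progress have conclusion None and are ignored.
    Cancelled, skipped, and neutral runs are also excluded (assumption).
    """
    counted = [
        r for r in runs
        if r.get("conclusion") in ("success", "failure", "timed_out")
    ]
    failed = sum(1 for r in counted if r["conclusion"] != "success")
    total = len(counted)
    # With no counted runs the rate is undefined, so report None.
    rate = 100.0 * (total - failed) / total if total else None
    return {"total": total, "failed": failed, "success_rate_pct": rate}
```

The resulting `success_rate_pct` can be compared directly against the CI/CD thresholds in the Key Performance Indicators table.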

Railway

Service uptime
Deploy success rate
Memory utilization
CPU utilization
Health check pass rate

Postgres

Connection utilization
Query duration distribution
Lock contention
Dead tuple ratio
Index usage efficiency
Table bloat
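
Two of these metrics, connection utilization and dead tuple ratio, can be read from the standard statistics views. The sketch below keeps the SQL as plain strings and the arithmetic as pure functions, so it makes no assumption about which database driver is used. The constant and function names are illustrative; the `backend_type` filter requires PostgreSQL 10 or later.

```python
# Client connections in use versus the configured maximum.
CONNECTION_SQL = """
SELECT count(*) AS used,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
"""

# Live and dead tuple estimates per user table.
DEAD_TUPLE_SQL = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
"""


def utilization_pct(used: int, max_connections: int) -> float:
    """Connection utilization as a percentage of max_connections."""
    return 100.0 * used / max_connections


def dead_tuple_pct(n_live_tup: int, n_dead_tup: int) -> float:
    """Dead tuples as a percentage of all tuples; 0.0 for an empty table."""
    total = n_live_tup + n_dead_tup
    return 100.0 * n_dead_tup / total if total else 0.0
```

Note that `n_live_tup` and `n_dead_tup` are estimates maintained by the statistics collector, so the ratio is an indicator rather than an exact measurement.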

Health Check Procedures

Quick Health Check (5 min)

Purpose: Verify basic system functionality

1. [ ] Check for active incidents (any platform)
2. [ ] Verify all services are responding
3. [ ] Check for critical advisories
4. [ ] Review last hour error rate
5. [ ] Check connection pool status
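
Step 2 of the checklist above could be automated along these lines. This is a sketch under assumptions: the endpoint names and URLs are whatever your services expose, treating only a 2xx response as "responding" is a choice made here, and the probe is injectable so the logic can be exercised without network access.

```python
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def unresponsive_services(
    endpoints: dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> list[str]:
    """Return the names of services whose health endpoint did not respond."""
    return [name for name, url in endpoints.items() if not probe(url)]
```

An empty result means every listed service answered; any names returned are candidates for the "Service unavailable" rule.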

Daily Health Check (15 min)

Purpose: Assess overall system health

1. [ ] Run quick health check
2. [ ] Review 24-hour error trends
3. [ ] Check CI/CD success rate
4. [ ] Review all advisories
5. [ ] Check slow query log
6. [ ] Verify backups have completed
7. [ ] Review resource utilization

Weekly Health Check (30 min)

Purpose: Comprehensive review and trending

1. [ ] Run daily health check
2. [ ] Analyze weekly error patterns
3. [ ] Review index usage stats
4. [ ] Check for table bloat
5. [ ] Review connection patterns
6. [ ] Assess capacity trends
7. [ ] Review deployment frequency
8. [ ] Check certificate expirations

Monthly Deep Check (1+ hours)

Purpose: Full system audit

1. [ ] Run weekly health check
2. [ ] Full index analysis
3. [ ] Query performance review
4. [ ] Security configuration audit
5. [ ] Capacity planning review
6. [ ] Cost analysis
7. [ ] Documentation review
8. [ ] Disaster recovery test

Alert Thresholds

Immediate Alerts (Page)

Service unavailable > 1 minute
Error rate > 10%
Database connections > 90%
Security advisory created
Deployment failure (production)
Health check failure > 5 minutes

Warning Alerts (Slack/Email)

Error rate > 2%
Database connections > 75%
Performance advisory created
Build time increase > 50%
Response time P95 > 1s
Disk usage > 80%

Info Alerts (Daily Digest)

New advisory (any type)
Build time change
Resource trend change
Configuration change
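
The numeric thresholds in the Immediate and Warning tiers can be sketched as a small routing function. Only the measurable signals are covered; event-style triggers such as a new advisory, a failed production deployment, or a configuration change would need separate handling. The function name, parameters, and tier labels are hypothetical.

```python
def alert_tier(
    error_rate_pct: float,
    db_connection_pct: float,
    p95_response_s: float,
    disk_usage_pct: float,
) -> str:
    """Return the highest alert tier that the given readings trigger."""
    # Immediate alerts: page someone.
    if error_rate_pct > 10 or db_connection_pct > 90:
        return "page"
    # Warning alerts: Slack or email.
    if (
        error_rate_pct > 2
        or db_connection_pct > 75
        or p95_response_s > 1
        or disk_usage_pct > 80
    ):
        return "warning"
    return "none"
```

Checking the paging conditions first ensures that a reading which crosses both thresholds (for example a 12% error rate) is escalated rather than reported as a warning.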

Health Report Template

Infrastructure Health Report

Generated: {TIMESTAMP}
Report Type: {Quick | Daily | Weekly | Monthly}
Overall Status: {GREEN | YELLOW | RED}

Executive Summary

{2-3 sentence overview}

Platform Status

Platform	Status	Issues	Warnings
GitHub Actions	{STATUS}	{N}	{N}
Railway	{STATUS}	{N}	{N}
Supabase	{STATUS}	{N}	{N}
Postgres	{STATUS}	{N}	{N}

Key Metrics

Database

Connections: {N}/{MAX} ({PCT}%)
Query P95: {MS}ms
Dead Rows: {PCT}%

API

Error Rate: {PCT}%
Response Time P95: {MS}ms

CI/CD

Success Rate: {PCT}%
Avg Build Time: {MIN}m

Advisories

Security

{List or "None"}

Performance

{List or "None"}

Issues Requiring Attention

Immediate

{List or "None"}

This Week

{List or "None"}

Trends

{Notable changes from previous period}

Recommendations

{Specific actions to improve health}

Next health check: {TIMESTAMP}
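
One way to fill in the report's Overall Status field is to take the most severe status across the four platforms. The template does not say how the overall value is derived, so this roll-up rule is an assumption, as are the names `SEVERITY` and `overall_status`.

```python
# Higher number means more severe.
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(platform_statuses: dict[str, str]) -> str:
    """Return the most severe platform status, or GREEN if none are given."""
    return max(
        platform_statuses.values(),
        key=SEVERITY.__getitem__,
        default="GREEN",
    )
```

With this rule, a single YELLOW platform makes the whole report YELLOW, and a single RED platform makes it RED.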

Remediation Playbooks

High Connection Utilization

1. Check for connection leaks
2. Identify idle connections
3. Review connection pool settings
4. Consider a connection pooler (PgBouncer/Supavisor)
5. Optimize application connection handling
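
Steps 1 and 2 above might start from `pg_stat_activity`, as in the following sketch. It assumes the driver returns the `idle_for` interval as a `datetime.timedelta`, and the 10-minute cutoff for a suspicious "idle in transaction" session is an arbitrary example value, not a recommendation from this skill.

```python
from datetime import timedelta

# Idle sessions and how long each has been in its current state.
IDLE_SQL = """
SELECT pid, state, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state IN ('idle', 'idle in transaction')
ORDER BY idle_for DESC
"""


def suspected_leaks(rows, max_idle=timedelta(minutes=10)):
    """Return pids of sessions stuck "idle in transaction" beyond max_idle.

    rows: iterable of (pid, state, idle_for) tuples, idle_for a timedelta.
    Plain "idle" sessions are skipped because pooled connections
    normally sit idle between requests.
    """
    return [
        pid
        for pid, state, idle_for in rows
        if state == "idle in transaction" and idle_for > max_idle
    ]
```

Sessions flagged this way hold a transaction open without doing work, which is a common symptom of an application path that fails to commit or release its connection.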

High Error Rate

1. Identify error types
2. Check recent deployments
3. Review affected endpoints
4. Check downstream dependencies
5. Roll back if deployment-related

Slow Queries

1. Identify slow queries (pg_stat_statements)
2. Run EXPLAIN ANALYZE
3. Check for missing indexes
4. Review query patterns
5. Consider query optimization or caching
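
Step 1 above could be sketched as follows. The query assumes the `pg_stat_statements` extension is installed and uses the PostgreSQL 13+ column names (`mean_exec_time`, `total_exec_time`; earlier versions call them `mean_time` and `total_time`). The helper name `prioritize` and the default 100 ms cutoff, taken from the Query Duration warning threshold, are illustrative.

```python
# Statements with the highest mean execution time, in milliseconds.
SLOW_SQL = """
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20
"""


def prioritize(rows, min_mean_ms=100.0):
    """Keep statements at or above min_mean_ms, largest total time first.

    rows: iterable of (query, calls, mean_exec_time, total_exec_time).
    Sorting by total time surfaces statements that are both slow and
    frequent, which usually matter more than rare outliers.
    """
    slow = [r for r in rows if r[2] >= min_mean_ms]
    return sorted(slow, key=lambda r: r[3], reverse=True)
```

Keep in mind when moving on to step 2 that `EXPLAIN ANALYZE` actually executes the statement, so data-modifying queries are best analyzed inside a transaction that is rolled back.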

Build Failures

1. Review failure logs
2. Check for flaky tests
3. Verify dependencies are available
4. Check for environment issues
5. Review recent changes
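
For step 2, one simple heuristic is to call a test flaky when it has both passed and failed on the same commit, since the code under test did not change between those runs. The sketch below assumes test results have already been collected as `(commit_sha, test_name, passed)` tuples; that record shape is hypothetical.

```python
from collections import defaultdict


def flaky_tests(results):
    """Return sorted names of tests that both passed and failed on one commit.

    results: iterable of (commit_sha, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for sha, name, passed in results:
        outcomes[(sha, name)].add(bool(passed))
    return sorted(
        {name for (_, name), seen in outcomes.items() if seen == {True, False}}
    )
```

A test that fails consistently on a commit is not reported here; that pattern points to a real regression and belongs under step 5 instead.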

See checklists.md for detailed health check checklists.
