Service Health Check Skill

Operator Context

This skill operates as an operator for service health monitoring workflows, configuring Claude's behavior for structured, read-only health assessment. It implements the Discover-Check-Report pattern — find services, gather health signals, produce actionable output — with deterministic process and health file evaluation.

Hardcoded Behaviors (Always Apply)

Read-Only: NEVER restart, stop, or modify services — report only
CLAUDE.md Compliance: Read and follow repository CLAUDE.md before checking
No Side Effects: Only read process tables, health files, and ports — no writes
Structured Output: Always produce machine-parseable health report
Evidence-Based Status: Every status determination requires at least one concrete signal (process check, health file, or port probe)

Default Behaviors (ON unless disabled)

Process Verification: Check process existence via pgrep/ps before anything else
Staleness Detection: Flag health files older than configured threshold (default 300s)
Port Listening Check: Verify expected ports are bound when port is configured
Actionable Recommendations: Provide specific commands to resolve issues
Staleness Threshold Enforcement: Default 300s, configurable per service

Optional Behaviors (OFF unless enabled)

Auto-Restart Execution: Run restart commands (requires explicit user flag)
Metrics Collection: Gather detailed performance metrics from health files
Alert Integration: Format output for monitoring system ingestion
Historical Comparison: Compare against previous health snapshots

What This Skill CAN Do

Check if processes are running via pgrep/ps
Parse JSON health files for status, connection state, and metrics
Detect stale health data based on configurable thresholds
Verify ports are listening with ss/netstat
Produce structured health reports with actionable restart recommendations
Evaluate service degradation (disconnected, reconnecting states)

What This Skill CANNOT Do

Restart, stop, or modify services (report-only by design)
Perform deep log analysis (use systematic-debugging instead)
Probe remote health endpoints over HTTP (use endpoint-validator instead)
Inspect container internals (basic host-level process checks only)
Authenticate against secured health endpoints
Skip the Discover phase — services must be identified before checking

Instructions

Phase 1: DISCOVER

Goal: Identify all services to check before running any health probes.

Step 1: Locate service definitions

Search for service configuration in this order:

```
services.json
```
in project root
Docker/docker-compose files for service definitions
systemd unit files or process manager configs
User-provided service specification

Step 2: Build service manifest

For each service, establish:

markdown

## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |

Step 3: Validate manifest

Confirm each process pattern is specific enough to avoid false matches
Verify health file paths are absolute
Ensure port numbers are within valid range (1-65535)

Gate: Service manifest complete with at least one service. Proceed only when gate passes.

Phase 2: CHECK

Goal: Gather health signals for every service in the manifest.

Step 1: Check process status

For each service, run process check:

bash

pgrep -f "<process_pattern>"

Record: running (true/false), PIDs, process count.

Step 2: Parse health files (if configured)

Read and parse JSON health files. Evaluate:

Does the file exist?
Does it parse as valid JSON?
How old is the timestamp (staleness)?
What status does the service self-report?
What is the connection state?

Step 3: Probe ports (if configured)

Check if expected ports are listening:

bash

ss -tlnp "sport = :<port>"

Flag processes that are running but not listening on expected ports.

Step 4: Evaluate health per service

Apply this decision tree:

Process not running → DOWN
Process running + health file missing → WARNING
Process running + health file stale → WARNING (restart recommended)
Process running + status=error → ERROR (restart recommended)
Process running + disconnected > 30min → WARNING (restart recommended)
Process running + disconnected < 30min → DEGRADED (allow reconnection)
Process running + healthy → HEALTHY
Process running + no health file configured → RUNNING (limited visibility)

Gate: All services evaluated with evidence-based status. Proceed only when gate passes.

Phase 3: REPORT

Goal: Produce structured, actionable health report.

Step 1: Generate summary

SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name         [OK  ] HEALTHY     PID 12345, uptime 2d 4h
  background-worker    [WARN] WARNING     Health file stale (15 min)
  cache-service        [DOWN] DOWN        Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service

Step 2: Set exit status

All HEALTHY/RUNNING → exit 0
Any WARNING/DEGRADED/ERROR/DOWN → exit 1

Step 3: Present to user

Lead with the summary line (X/N healthy)
Highlight any services needing action
Provide copy-pasteable commands for remediation
If user has auto-restart enabled, confirm before executing

Gate: Report delivered with actionable recommendations for all non-healthy services.

Examples

Example 1: Routine Health Check

User says: "Are all services up?" Actions:

Locate services.json, build manifest (DISCOVER)
Check each process, parse health files, probe ports (CHECK)
Output structured report showing 3/3 healthy (REPORT) Result: Clean report, no action needed

Example 2: Stale Worker Detection

User says: "The background worker seems stuck" Actions:

Identify worker service from config (DISCOVER)
Find process running but health file 20 minutes stale (CHECK)
Report WARNING with restart recommendation (REPORT) Result: Specific diagnosis with actionable command

Error Handling

Error: "No Service Configuration Found"

Cause: No services.json, docker-compose, or systemd units discovered Solution:

Ask user for service name and process pattern
Build minimal manifest from user input
Proceed with manual configuration

Error: "Process Pattern Matches Too Many PIDs"

Cause: Pattern too broad (e.g., "python" matches all Python processes) Solution:

Narrow pattern with full command path or arguments
Use
```
ps aux | grep
```
to identify distinguishing arguments
Update manifest with more specific pattern

Error: "Health File Exists But Cannot Parse"

Cause: Malformed JSON, permissions issue, or file being written during read Solution:

Check file permissions with
```
ls -la
```
Attempt raw read to inspect content
If mid-write, retry after 2-second delay
Report as WARNING with parse error details

Anti-Patterns

Anti-Pattern 1: Restarting Without Diagnosing

What it looks like: Service shows WARNING, immediately run

systemctl restart

Why wrong: Masks root cause. Service may crash again immediately. Do instead: Report finding, let user decide. Never auto-restart without explicit flag.

Anti-Pattern 2: Trusting Health File Alone

What it looks like: Health file says "healthy" so skip process check Why wrong: Process could be zombie, health file could be stale from before crash. Do instead: Always check process status independently of health file content.

Anti-Pattern 3: Ignoring Port Mismatch

What it looks like: Process running, skip port check, report HEALTHY Why wrong: Process may have started but failed to bind port — effectively down. Do instead: When port is configured, always verify it is listening.

Anti-Pattern 4: Broad Process Patterns

What it looks like: Using "python" as process pattern for a Flask app Why wrong: Matches every Python process on the system, giving false positives. Do instead: Use specific patterns like

gunicorn.*myapp:app

or full command paths.

References

This skill uses these shared patterns:

Anti-Rationalization - Prevents shortcut rationalizations
Verification Checklist - Pre-completion checks

Domain-Specific Anti-Rationalization

Rationalization	Why It's Wrong	Required Action
"Process is running, must be healthy"	Running ≠ functional	Check health file and port
"Health file looks fine"	File could be stale from before crash	Verify timestamp freshness
"Just restart it"	Restart masks root cause	Report first, restart only if flagged
"No config, skip the check"	User still needs an answer	Ask user for service details

Health File Format Reference

Services should write health files as:

json

{
    "timestamp": "ISO8601, updated every 30-60s",
    "status": "healthy|degraded|error",
    "connection": "connected|disconnected|reconnecting",
    "last_activity": "ISO8601 of last meaningful action",
    "running": true,
    "uptime_seconds": 12345,
    "metrics": {}
}

service-health-check

NPX Install

Tags

SKILL.md Content