Works with the infrastructure-monitor.sh script, a systemd timer, and ntfy.sh push notifications.

Infrastructure Monitoring Setup Skill

Complete setup and configuration of automated infrastructure monitoring with mobile push notifications and auto-recovery capabilities.

Quick Start

Quick setup for monitoring (5 minutes):

```bash
# 1. Create unique ntfy topic
TOPIC="infra-$(openssl rand -hex 8)"
echo "Your topic: $TOPIC"

# 2. Add to .env
echo "ALERT_ENABLED=true" >> /home/dawiddutoit/projects/network/.env
echo "NTFY_SERVER=https://ntfy.sh" >> /home/dawiddutoit/projects/network/.env
echo "NTFY_TOPIC=$TOPIC" >> /home/dawiddutoit/projects/network/.env
echo "AUTO_RECOVER=true" >> /home/dawiddutoit/projects/network/.env

# 3. Install systemd service
sudo cp /home/dawiddutoit/projects/network/systemd/infrastructure-monitor.* /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now infrastructure-monitor.timer

# 4. Test
/home/dawiddutoit/projects/network/scripts/infrastructure-monitor.sh
```

Then install the ntfy app on your phone and subscribe to your topic.
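
As a small sanity check on step 1, the sketch below generates a topic the same way and confirms it has the expected shape: the `infra-` prefix plus 16 hex characters, since `openssl rand -hex 8` prints 8 random bytes as hex. The function names `generate_topic` and `is_valid_topic` are hypothetical helpers, not part of infrastructure-monitor.sh, and the `/dev/urandom` fallback is an assumption for hosts without openssl.

```bash
# Hypothetical helpers; not part of infrastructure-monitor.sh.

# Print a topic of the form infra-<16 hex characters>.
generate_topic() {
  if command -v openssl >/dev/null 2>&1; then
    printf 'infra-%s\n' "$(openssl rand -hex 8)"
  else
    # Assumed fallback: read 8 random bytes and print them as hex.
    printf 'infra-%s\n' "$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
  fi
}

# Succeed only if the argument looks like a generated topic.
is_valid_topic() {
  printf '%s\n' "$1" | grep -Eq '^infra-[0-9a-f]{16}$'
}

TOPIC=$(generate_topic)
if is_valid_topic "$TOPIC"; then
  echo "Your topic: $TOPIC"
else
  echo "Unexpected topic format: $TOPIC" >&2
fi
```

A check like this mainly guards against pasting a short, guessable name such as `infra-test` into .env by hand.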

When to Use This Skill
What This Skill Does
Instructions
- 3.1 Install ntfy Mobile App
- 3.2 Configure Monitoring in .env
- 3.3 Install Systemd Timer
- 3.4 Test Monitoring and Alerts
- 3.5 Configure Home Assistant Integration (Optional)
- 3.6 Verify Auto-Recovery
- 3.7 View Monitoring Logs
Supporting Files
Expected Outcomes
Requirements
Red Flags to Avoid

When to Use This Skill

Explicit Triggers:

- "Setup monitoring"
- "Configure mobile alerts"
- "Enable auto-recovery"
- "Setup ntfy notifications"
- "Configure Home Assistant alerts"

Implicit Triggers:

- Want to be notified of infrastructure failures
- Need automated recovery for common issues
- Infrastructure has been down without detection
- Want proactive monitoring

Debugging Triggers:

- "Why am I not getting alerts?"
- "Is monitoring working?"
- "How to test notifications?"

What This Skill Does

- Mobile Alerts - Configures ntfy.sh push notifications to your phone
- Auto-Recovery - Enables automatic fixes for common failures
- HA Integration - Optional Home Assistant notification integration
- Systemd Service - Installs a timer to run monitoring every 5 minutes
- Tests Setup - Verifies that notifications and recovery work
- Logs Access - Shows how to view monitoring logs
- Troubleshooting - Diagnoses alert delivery issues

Instructions

3.1 Install ntfy Mobile App

Install app:

Subscribe to topic:

1. Open the ntfy app
2. Tap "+" to add a subscription
3. Enter topic: `infra-YOUR-RANDOM-ID` (you'll generate this in step 3.2)
4. Server: `https://ntfy.sh`
5. Tap "Subscribe"

Note: You need the topic ID from step 3.2 before subscribing. Come back here after generating it.

3.2 Configure Monitoring in .env

**Generate unique topic ID:**

```bash
TOPIC="infra-$(openssl rand -hex 8)"
echo "Your unique topic: $TOPIC"
```

Save this topic ID - you'll use it in the ntfy app.

**Add monitoring configuration to .env:**

```bash
# Navigate to project directory
cd /home/dawiddutoit/projects/network

# Add monitoring variables
cat >> .env << EOF

# Monitoring & Alerts
ALERT_ENABLED=true
NTFY_SERVER=https://ntfy.sh
NTFY_TOPIC=$TOPIC
AUTO_RECOVER=true
EOF
```

**Verify configuration:**

```bash
grep -A4 "Monitoring & Alerts" /home/dawiddutoit/projects/network/.env
```

Expected:

```
# Monitoring & Alerts
ALERT_ENABLED=true
NTFY_SERVER=https://ntfy.sh
NTFY_TOPIC=infra-a3f7d92b4c8e1f56
AUTO_RECOVER=true
```

**Configuration options:**

| Variable | Purpose | Default |
|----------|---------|---------|
| `ALERT_ENABLED` | Enable mobile push notifications | `false` |
| `NTFY_SERVER` | ntfy.sh server URL | `https://ntfy.sh` |
| `NTFY_TOPIC` | Unique topic for your alerts | None (required) |
| `AUTO_RECOVER` | Enable automatic recovery | `true` |

**To disable auto-recovery but keep alerts:**
```bash
# Edit .env
nano /home/dawiddutoit/projects/network/.env

# Change: AUTO_RECOVER=false
```
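
Because `cat >> .env` appends, running the setup twice leaves duplicate keys in the file. The sketch below shows one way to set a key idempotently. `set_env_var` is a hypothetical helper, not something the project ships, and the demo writes to a temporary file rather than the real .env.

```bash
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing
# assignment of KEY instead of appending a duplicate.
set_env_var() {
  file=$1; key=$2; value=$3
  tmp=$(mktemp)
  if [ -f "$file" ]; then
    # Keep every line except an existing assignment of this key.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # The file is replaced by the temp file, so it ends up with the
  # owner-only permissions that mktemp assigns.
  mv "$tmp" "$file"
}

# Demo on a scratch file (assumption: the real .env uses plain KEY=VALUE lines).
demo_env=$(mktemp)
set_env_var "$demo_env" ALERT_ENABLED true
set_env_var "$demo_env" AUTO_RECOVER true
set_env_var "$demo_env" AUTO_RECOVER false   # replaces, does not duplicate
cat "$demo_env"
# ALERT_ENABLED=true
# AUTO_RECOVER=false
```

The same helper could stand in for the manual nano edit when switching `AUTO_RECOVER` off, at the cost of moving the changed key to the end of the file.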

3.3 Install Systemd Timer

Install the systemd service and timer to run monitoring every 5 minutes:

```bash
# Copy service files
sudo cp /home/dawiddutoit/projects/network/systemd/infrastructure-monitor.service /etc/systemd/system/
sudo cp /home/dawiddutoit/projects/network/systemd/infrastructure-monitor.timer /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start timer
sudo systemctl enable infrastructure-monitor.timer
sudo systemctl start infrastructure-monitor.timer
```

**Verify timer is active:**

```bash
# Check timer status
systemctl list-timers infrastructure-monitor.timer

# Check service status
sudo systemctl status infrastructure-monitor.timer
```

Expected:

```
● infrastructure-monitor.timer - Run infrastructure monitoring every 5 minutes
     Loaded: loaded (/etc/systemd/system/infrastructure-monitor.timer; enabled)
     Active: active (waiting) since...
```

**Timer configuration:**
- Runs every 5 minutes
- Starts 1 minute after boot
- Persistent (survives reboots)
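
The shipped infrastructure-monitor.timer is not reproduced in this guide. As a rough sketch of a timer with the three properties listed above, the snippet below writes a hypothetical unit file to a scratch directory for inspection. The directive values are assumptions, not the project's actual file. Note that systemd's `Persistent=` only takes effect together with `OnCalendar=`, which is why the sketch uses a calendar expression for the 5-minute interval.

```bash
# Hypothetical timer unit, written to a scratch directory for review only.
# It is NOT the file shipped in the project's systemd/ directory.
unit_dir=$(mktemp -d)

cat > "$unit_dir/infrastructure-monitor.timer" << 'EOF'
[Unit]
Description=Run infrastructure monitoring every 5 minutes

[Timer]
# First run 1 minute after boot.
OnBootSec=1min
# Then at every minute divisible by 5.
OnCalendar=*:0/5
# Catch up on a run that was missed while the machine was off.
Persistent=true

[Install]
WantedBy=timers.target
EOF

cat "$unit_dir/infrastructure-monitor.timer"
```

Comparing this against the output of `systemctl cat infrastructure-monitor.timer` is a quick way to see how the installed timer actually implements the same behaviour.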

3.4 Test Monitoring and Alerts

**Test monitoring script:**

```bash
# Run monitoring manually
/home/dawiddutoit/projects/network/scripts/infrastructure-monitor.sh
```

Expected output shows:
- Docker containers checked
- Tunnel connectivity tested
- Service health verified
- Network interface status
- Alert sent to ntfy topic

**Test alert delivery:**

Within 30 seconds, you should receive a push notification on your phone with the infrastructure status.

**If no notification received:**

Check ntfy topic subscription:
```bash
# Test sending to topic directly
curl -d "Test from infrastructure monitoring" https://ntfy.sh/$TOPIC
```

If direct curl works but monitoring doesn't:
- Check ALERT_ENABLED=true in .env
- Verify NTFY_TOPIC matches the app subscription
- Check the script has network access
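
`$TOPIC` only exists in the shell session that generated it, so in a fresh terminal the curl test above can end up posting to an empty topic. A sketch that reads the values back from .env instead is shown below. `env_get` and `send_test_alert` are hypothetical helper names, the `DRY_RUN` switch is an assumption added so the example can run without network access, and `Title` is a standard ntfy publishing header.

```bash
# Hypothetical helpers; the real script may load .env differently.

# Print the value of KEY from a KEY=VALUE file (last assignment wins).
env_get() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Send (or, with DRY_RUN=1, only describe) a test alert using .env values.
send_test_alert() {
  env_file=$1
  server=$(env_get "$env_file" NTFY_SERVER)
  topic=$(env_get "$env_file" NTFY_TOPIC)
  if [ -z "$topic" ]; then
    echo "NTFY_TOPIC is not set in $env_file" >&2
    return 1
  fi
  url="${server:-https://ntfy.sh}/$topic"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would POST to $url"
  else
    curl -fsS -H "Title: Monitoring test" \
      -d "Test from infrastructure monitoring" "$url"
  fi
}

# Demo against a scratch env file, without touching the network.
demo_env=$(mktemp)
printf 'NTFY_SERVER=https://ntfy.sh\nNTFY_TOPIC=infra-a3f7d92b4c8e1f56\n' > "$demo_env"
DRY_RUN=1 send_test_alert "$demo_env"
# would POST to https://ntfy.sh/infra-a3f7d92b4c8e1f56
```

Pointing `send_test_alert` at the real .env without `DRY_RUN` sends an actual notification, which also confirms that the topic in the file matches the one subscribed to in the app.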

3.5 Configure Home Assistant Integration (Optional)

**Why use Home Assistant integration:**

- Centralized home automation alerts
- Can trigger automations based on infrastructure status
- Redundancy with ntfy.sh
- Integration with existing HA notifications

**Prerequisites:**

- Home Assistant running and accessible
- HA mobile app installed (for the notify.mobile_app_* service)

**Step 1: Create Long-Lived Access Token**

1. Go to Home Assistant: http://192.168.68.123:8123
2. Click your profile (bottom left)
3. Scroll to "Long-Lived Access Tokens"
4. Click "Create Token"
5. Name: "Infrastructure Monitoring"
6. Copy the token (shown only once)

**Step 2: Find Notification Service Name**

1. In Home Assistant: Developer Tools → Services
2. Filter by "notify"
3. Find your mobile app service: `notify.mobile_app_your_phone`

**Step 3: Add to .env**

```bash
# Edit .env
nano /home/dawiddutoit/projects/network/.env

# Add HA configuration
HA_NOTIFICATIONS_ENABLED=true
HA_BASE_URL=http://192.168.68.123:8123
HA_ACCESS_TOKEN=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
HA_NOTIFY_SERVICE=notify.mobile_app_your_phone
```

**Step 4: Test HA Notifications**

```bash
# Run monitoring (should send to both ntfy and HA)
/home/dawiddutoit/projects/network/scripts/infrastructure-monitor.sh
```

Check that you receive the notification in the Home Assistant companion app.

**Troubleshooting HA notifications:**

```bash
# Test HA API access
curl -H "Authorization: Bearer YOUR_TOKEN" \
  http://192.168.68.123:8123/api/

# Test notification service
curl -X POST \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Test from infrastructure monitoring"}' \
  http://192.168.68.123:8123/api/services/notify/mobile_app_your_phone
```
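
When the curl tests above fail, the HTTP status code usually narrows down the cause faster than the response body. The sketch below is one possible way to turn a status code into a hint. `explain_ha_status` and `check_ha_api` are hypothetical helpers, and the hint texts are suggestions rather than messages produced by Home Assistant or the monitoring script.

```bash
# Hypothetical helpers for interpreting Home Assistant REST API responses.

# Map an HTTP status code to a troubleshooting hint.
explain_ha_status() {
  case "$1" in
    200|201) echo "OK: Home Assistant accepted the request" ;;
    401)     echo "Unauthorized: check HA_ACCESS_TOKEN" ;;
    000)     echo "No response: check HA_BASE_URL and network reachability" ;;
    *)       echo "Unexpected HTTP status $1: check HA_NOTIFY_SERVICE and the Home Assistant log" ;;
  esac
}

# Query the API root and explain the result.
# curl prints 000 for %{http_code} when no HTTP response was received.
check_ha_api() {
  base_url=$1; token=$2
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $token" "$base_url/api/")
  explain_ha_status "$code"
}

# Offline demo of the mapping; a real check would be:
#   check_ha_api "http://192.168.68.123:8123" "YOUR_TOKEN"
explain_ha_status 401
# Unauthorized: check HA_ACCESS_TOKEN
```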

3.6 Verify Auto-Recovery

Monitor logs to see auto-recovery in action:

```bash
# View live monitoring logs
sudo journalctl -u infrastructure-monitor.service -f

# Or check persistent log
tail -f /var/log/infrastructure-monitor.log
```

**Auto-recovery capabilities:**

| Issue | Detection | Recovery Action |
|-------|-----------|----------------|
| Stuck cloudflared | No registrations in 10 min | Restart cloudflared container |
| Docker network isolation | Ping fails between containers | Recreate bridge network |
| Inactive Ethernet | WiFi used instead of eth0 | Activate Ethernet connection |
| Service failures | HTTP health checks fail | Restart affected containers |

**Test auto-recovery:**

```bash
# Simulate stuck tunnel
docker stop cloudflared

# Wait 5 minutes (next monitoring run)

# Check logs - should show tunnel restarted

# Verify tunnel recovered
docker ps | grep cloudflared
# 2>&1 so that log lines written to stderr are also searched
docker logs cloudflared 2>&1 | grep "Registered tunnel"
```
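
Rather than watching the clock for the next timer run, the wait in the recovery test can be scripted. The sketch below polls a condition until it succeeds or a timeout expires. `wait_until` and `cloudflared_running` are hypothetical helpers, and the 420-second timeout in the usage comment is an assumption (one 5-minute interval plus some slack).

```bash
# Hypothetical helper: run a command repeatedly until it succeeds.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1   # gave up: condition never became true
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Succeeds when Docker reports the cloudflared container as running.
cloudflared_running() {
  [ "$(docker inspect -f '{{.State.Running}}' cloudflared 2>/dev/null)" = "true" ]
}

# Real usage after `docker stop cloudflared`:
#   wait_until 420 15 cloudflared_running && echo "tunnel container is back"

# Offline demo with a condition that is immediately true.
wait_until 2 1 true && echo "condition met"
```

This only confirms that the container is running again; the `Registered tunnel` log check is still needed to confirm the tunnel itself reconnected.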

3.7 View Monitoring Logs

**View systemd service logs:**

```bash
# Live monitoring logs
sudo journalctl -u infrastructure-monitor.service -f

# Last 50 lines
sudo journalctl -u infrastructure-monitor.service -n 50

# Logs from today
sudo journalctl -u infrastructure-monitor.service --since today

# Logs with timestamps
sudo journalctl -u infrastructure-monitor.service -o short-iso
```

**View persistent log file:**

```bash
# Live tail
tail -f /var/log/infrastructure-monitor.log

# Last 100 lines
tail -n 100 /var/log/infrastructure-monitor.log

# Search for errors
grep -i error /var/log/infrastructure-monitor.log

# Search for recoveries
grep -i "recovered" /var/log/infrastructure-monitor.log
```

**Check timer schedule:**

```bash
# Show next run time
systemctl list-timers infrastructure-monitor.timer

# Show timer configuration
systemctl cat infrastructure-monitor.timer
```

**Monitoring controls:**

```bash
# Stop monitoring temporarily
sudo systemctl stop infrastructure-monitor.timer

# Restart monitoring
sudo systemctl start infrastructure-monitor.timer

# Disable monitoring (survives reboot)
sudo systemctl disable infrastructure-monitor.timer

# Re-enable monitoring
sudo systemctl enable infrastructure-monitor.timer
```
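
The two grep searches above can be folded into a quick summary. The sketch below counts matching lines in a log file. `summarize_log` is a hypothetical helper, and the sample log lines in the demo are made up for illustration, since the script's real log format is not shown in this guide.

```bash
# Hypothetical helper: count error and recovery lines in a log file.
summarize_log() {
  log=$1
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 1
  fi
  # grep -c prints 0 and exits non-zero when nothing matches; ignore the exit code.
  errors=$(grep -ci 'error' "$log" || true)
  recoveries=$(grep -ci 'recovered' "$log" || true)
  echo "errors=$errors recoveries=$recoveries"
}

# Demo on a scratch file with made-up lines (not real script output).
demo_log=$(mktemp)
cat > "$demo_log" << 'EOF'
2024-01-01T00:00:00 ERROR example failure line
2024-01-01T00:05:00 INFO example service recovered
2024-01-01T00:10:00 INFO example all checks passed
EOF

summarize_log "$demo_log"
# errors=1 recoveries=1
```

For the real log this would be called as `summarize_log /var/log/infrastructure-monitor.log`.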

Supporting Files

| File | Purpose |
|------|---------|
| `references/reference.md` | Monitoring architecture, recovery strategies, ntfy.sh details |
| `examples/examples.md` | Example configurations, alert formats, log outputs |
| `scripts/test-notifications.sh` | Test script for alert delivery |

Expected Outcomes

Success:

- ntfy app receives push notifications
- Monitoring runs every 5 minutes
- Auto-recovery fixes common failures within 5 minutes
- Logs show monitoring activity
- Home Assistant notifications working (if configured)

Partial Success:

- Monitoring runs but alerts not received (check topic subscription)
- Alerts received but auto-recovery disabled (set AUTO_RECOVER=true)

Failure Indicators:

- No notifications received after 10 minutes
- Timer not running (check systemctl status)
- Script fails with errors (check logs)
- HA notifications not working (check token/service name)

Requirements

- Infrastructure server running Linux with systemd
- Mobile device with the ntfy app installed
- Internet connectivity for ntfy.sh
- .env file with monitoring configuration
- Home Assistant (optional, for HA integration)

Red Flags to Avoid

- Do not use a public or guessable ntfy topic (security risk)
- Do not share the ntfy topic publicly (anyone who knows it can subscribe)
- Do not disable monitoring without alternative alerting
- Do not ignore persistent alerts (investigate the root cause)
- Do not run the monitoring script too frequently (causes noise)
- Do not commit .env with the ntfy topic to git (privacy)
- Do not use AUTO_RECOVER=false without manual monitoring
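
To back up the rule about not committing .env, the file can be added to .gitignore. The sketch below does this idempotently. `ensure_env_ignored` is a hypothetical helper, and the demo runs against a scratch directory rather than the project repository.

```bash
# Hypothetical helper: make sure .env is listed in a repo's .gitignore.
ensure_env_ignored() {
  repo_dir=$1
  ignore_file="$repo_dir/.gitignore"
  # -x matches the whole line, -F treats ".env" literally (no regex dot).
  if [ -f "$ignore_file" ] && grep -qxF '.env' "$ignore_file"; then
    echo ".env already ignored"
  else
    echo '.env' >> "$ignore_file"
    echo "added .env to $ignore_file"
  fi
}

# Demo in a scratch directory; the second call changes nothing.
demo_repo=$(mktemp -d)
ensure_env_ignored "$demo_repo"
ensure_env_ignored "$demo_repo"
```

An ignore rule does not affect a file that is already tracked. In that case `git rm --cached .env` is needed as well, and `git check-ignore -v .env` can confirm which rule applies.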

Notes

- Monitoring checks run every 5 minutes via the systemd timer
- ntfy.sh is free and doesn't require an account
- Topic ID should be random and private (security by obscurity)
- Auto-recovery attempts fixes before alerting as critical
- Alert levels: 🔴 Critical (manual intervention), ⚠️ Warning (recovery in progress)
- HA integration is optional and works alongside ntfy.sh
- Logs persist across reboots at /var/log/infrastructure-monitor.log
- Maximum detection time: 5 minutes (timer interval)
- Monitoring survives server reboots (systemd timer enabled)
- Use the infrastructure-health-check skill for manual on-demand checks
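
The two alert levels could be mapped onto ntfy message priorities so that critical alerts stand out on the phone. The sketch below is one possible mapping. `ntfy_priority_for` is a hypothetical helper and the mapping is an assumption: this guide does not state which priorities infrastructure-monitor.sh actually sends. The priority names `urgent`, `high` and `default` are standard ntfy values.

```bash
# Hypothetical mapping from alert level to ntfy priority.
# The real script's behaviour may differ.
ntfy_priority_for() {
  case "$1" in
    critical) echo "urgent" ;;   # manual intervention needed
    warning)  echo "high" ;;     # recovery in progress
    *)        echo "default" ;;
  esac
}

# Real usage would pass the result in the Priority header, for example:
#   curl -H "Priority: $(ntfy_priority_for critical)" \
#     -d "cloudflared down" "$NTFY_SERVER/$NTFY_TOPIC"

echo "critical -> $(ntfy_priority_for critical)"
echo "warning  -> $(ntfy_priority_for warning)"
```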
