linux-admin

When this skill is activated, always start your first response with the 🧢 emoji.

Linux Administration

A production-focused Linux administration skill covering shell scripting, service management, networking, and security hardening. This skill treats every Linux system as a production asset - configuration is explicit, changes are auditable, and security is a constraint from the start, not an afterthought. Designed for engineers who need to move confidently between writing a deploy script, debugging a network issue, and locking down a fresh server.

When to use this skill

Trigger this skill when the user:

- Writes or debugs a bash script (especially anything running in CI, cron, or production)
- Creates or modifies a systemd service, timer, socket, or target unit
- Configures or audits SSH daemon settings and access controls
- Debugs a networking issue (routing, DNS, firewall, port connectivity)
- Sets up or modifies iptables/nftables/ufw firewall rules
- Manages file permissions, ownership, ACLs, or setuid/setgid bits
- Monitors or investigates running processes (CPU, memory, open files, syscalls)
- Sets up cron jobs or scheduled tasks
- Manages disk space, log rotation, or filesystem mounts

Do NOT trigger this skill for:

- Container orchestration specifics (Kubernetes networking, Docker Compose config) - use a Docker/K8s skill instead
- Cloud provider IAM, VPC routing, or managed service configuration - those are cloud platform concerns, not OS-level Linux administration

Key principles

- Principle of least privilege - Every process, user, and service should run with the minimum permissions required. Use dedicated service accounts (not root), restrict file permissions to exactly what is needed, and audit sudo rules regularly.
- Automate repeatable tasks - If you run a command twice, script it. Scripts should be idempotent - running them again should produce the same result, not break things. Store scripts in version control.
- Log everything that matters - Structured logs, audit logs (auditd), and systemd journal entries are your incident response safety net. Log authentication events, privilege escalations, and configuration changes. Log rotation prevents disk exhaustion.
- Immutable servers when possible - Prefer rebuilding servers from a known-good image over patching in place. Use configuration management (Ansible, cloud-init) to define state declaratively. Manual "snowflake" servers drift and fail unpredictably.
- Test in staging - Every script, service unit, and firewall rule change should be validated in a non-production environment first. Before applying, use `--dry-run` where a tool offers it, `bash -n` to syntax-check scripts, and `iptables --check` to test whether a rule is already present.
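
The idempotency principle above can be sketched in shell. This is a minimal illustration rather than a prescribed pattern: the `ensure_line` helper, the sample setting, and the temp file are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# ensure_line FILE LINE: append LINE to FILE only if it is not already present.
# (Hypothetical helper for illustration.)
ensure_line() {
  local file=$1 line=$2
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

demo_dir=$(mktemp -d)
conf="$demo_dir/app.conf"

# Running the same step twice leaves exactly one copy of the line.
ensure_line "$conf" "max_connections=100"
ensure_line "$conf" "max_connections=100"

line_count=$(wc -l < "$conf")
echo "lines in file: $line_count"
rm -rf "$demo_dir"
```

The same check-then-act shape applies to users, directories (`mkdir -p`), and packages: test for the desired state first, and change something only when it is missing.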

Core concepts

File permissions

Linux permissions have three layers (owner, group, others) and three bits (read, write, execute). Octal notation is the authoritative form.

Octal   Symbolic   Meaning
 0       ---       no permissions
 1       --x       execute only
 2       -w-       write only
 4       r--       read only
 6       rw-       read + write
 7       rwx       read + write + execute
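
As a worked example of the octal notation, each digit is the sum of read (4), write (2), and execute (1), given for owner, group, and others in that order. The sketch below assumes GNU coreutils `stat` (for the `-c` format flag) and works on a throwaway temp file.

```bash
#!/usr/bin/env bash
set -euo pipefail

demo_file=$(mktemp)

# 6 = 4+2 (rw-) for owner, 4 (r--) for group, 0 (---) for others
chmod 640 "$demo_file"

# %a prints the octal mode, %A the symbolic form
octal_mode=$(stat -c '%a' "$demo_file")
symbolic_mode=$(stat -c '%A' "$demo_file")
echo "$octal_mode $symbolic_mode"

rm -f "$demo_file"
```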

Common patterns

```bash
chmod 600 ~/.ssh/id_rsa              # private key: owner read/write only
chmod 644 /etc/nginx/nginx.conf      # config: owner rw, others read
chmod 755 /usr/local/bin/script      # executable: owner rwx, others rx
chmod 700 /root/.gnupg               # directory: only owner can enter
```

Special bits:
- `setuid (4xxx)`: executable runs as file owner, not caller. Dangerous on scripts (and the Linux kernel ignores it for interpreted scripts).
- `setgid (2xxx)`: new files in directory inherit group. Useful for shared dirs.
- `sticky (1xxx)`: in a directory, only a file's owner, the directory's owner, or root can delete or rename that file (e.g., `/tmp`).
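
To show how the special bits combine with the ordinary mode digits, here is a small sketch on a temporary directory. It assumes GNU `stat`, and the shared-directory scenario is hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

shared_dir=$(mktemp -d)

# Leading digit 3 = setgid (2) + sticky (1); 775 = rwxrwxr-x
# setgid: new files inherit the directory's group
# sticky: users cannot delete each other's files
chmod 3775 "$shared_dir"

special_mode=$(stat -c '%a' "$shared_dir")
echo "mode: $special_mode ($(stat -c '%A' "$shared_dir"))"

rmdir "$shared_dir"
```

In the symbolic form, setgid appears as `s` in the group execute position and sticky as `t` in the others execute position.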

Process management

Key signals for process control:

| Signal | Number | Meaning |
|---|---|---|
| SIGTERM | 15 | Polite shutdown - process should clean up |
| SIGKILL | 9 | Immediate kill - kernel enforced, unblockable |
| SIGHUP | 1 | Reload config (many daemons re-read on SIGHUP) |
| SIGINT | 2 | Interrupt (Ctrl+C) |
| SIGUSR1/2 | 10/12 | Application-defined |

Niceness runs from -20 (highest priority) to 19 (lowest). Use `nice -n 10 cmd` for background tasks and `renice` to adjust running processes.
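
The difference between a polite SIGTERM and an unblockable SIGKILL is that a process can trap the former and clean up. A minimal sketch follows, in which the `worker` function and its message are hypothetical:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical worker: loops until it receives SIGTERM, then cleans up.
worker() {
  trap 'echo "worker: SIGTERM received, cleaning up"; exit 0' TERM
  while true; do
    sleep 0.1
  done
}

out_file=$(mktemp)
worker > "$out_file" &
worker_pid=$!

sleep 0.5                   # give the worker time to install its trap
kill -TERM "$worker_pid"    # polite shutdown request
wait "$worker_pid" && worker_status=0 || worker_status=$?

worker_output=$(cat "$out_file")
rm -f "$out_file"
echo "$worker_output (exit status $worker_status)"
```

Sending `kill -KILL` instead would end the worker without running the trap, which is why SIGTERM should be tried first.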

systemd unit hierarchy

Targets (grouping)         -> multi-user.target, network.target
  Services (.service)      -> long-running daemons, oneshot tasks
  Timers (.timer)          -> scheduled execution (replaces cron)
  Sockets (.socket)        -> socket-activated services
  Mounts (.mount)          -> filesystem mounts managed by systemd
  Paths (.path)            -> filesystem change triggers

Dependency directives: `Requires=` (hard), `Wants=` (soft), `After=` (ordering only). `After=network-online.target`, paired with `Wants=network-online.target`, is the correct way to wait for network connectivity.

Networking stack

Key tools and their roles:

| Tool | Layer | Purpose |
|---|---|---|
| `ip addr` / `ip link` | L2/L3 | Interface state, IP addresses |
| `ip route` | L3 | Routing table inspection and management |
| `ss -tulpn` | L4 | Listening ports, socket state, owning process |
| `iptables -L -n -v` | L3/L4 | Firewall rules, packet counts |
| `dig` / `resolvectl` | DNS | Name resolution debugging |
| `traceroute` / `mtr` | L3 | Path tracing, hop-by-hop latency |
| `tcpdump` | L2-L7 | Packet capture for deep inspection |

Common tasks

Write a robust bash script

Always use the safety triplet at the top of every non-trivial script.

```bash
#!/usr/bin/env bash
set -euo pipefail
# -e: exit on error
# -u: treat unset variables as errors
# -o pipefail: pipeline fails if any command in it fails

# Cleanup on exit - runs on success, error, and signals
TMPDIR_WORK=""
cleanup() {
  local exit_code=$?
  [[ -n "$TMPDIR_WORK" ]] && rm -rf "$TMPDIR_WORK"
  exit "$exit_code"
}
trap cleanup EXIT INT TERM

# Argument parsing with defaults and validation
usage() {
  echo "Usage: $0 [-e ENV] [-d] <target>"
  echo "  -e ENV   Environment (default: staging)"
  echo "  -d       Dry-run mode"
  exit 1
}

ENV="staging"
DRY_RUN=false

while getopts ":e:dh" opt; do
  case $opt in
    e) ENV="$OPTARG" ;;
    d) DRY_RUN=true ;;
    h) usage ;;
    :) echo "Option -$OPTARG requires an argument." >&2; usage ;;
    \?) echo "Unknown option: -$OPTARG" >&2; usage ;;
  esac
done
shift $((OPTIND - 1))

[[ $# -lt 1 ]] && { echo "Error: target required" >&2; usage; }
TARGET="$1"

# Use mktemp for safe temp directories
TMPDIR_WORK=$(mktemp -d)

# Log with timestamps
log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*"; }
log "Starting deploy: env=$ENV target=$TARGET dry_run=$DRY_RUN"

# Dry-run wrapper: print the command instead of running it
run() {
  if [[ "$DRY_RUN" == true ]]; then
    echo "[DRY-RUN] $*"
  else
    "$@"
  fi
}

run rsync -av --exclude='.git' "./" "deploy@${TARGET}:/opt/app/"
log "Deploy complete"
```
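
To see why `pipefail` belongs in the safety triplet, the following sketch compares the exit status of the same failing pipeline with and without it. Each case runs in a child `bash -c` so the option does not leak between them; the variable names are illustrative only.

```bash
#!/usr/bin/env bash

# Without pipefail, a pipeline reports only the status of its last command.
without=$(bash -c 'false | true; echo $?')

# With pipefail, the failure of `false` is propagated.
with=$(bash -c 'set -o pipefail; false | true; echo $?')

echo "without pipefail: exit status $without"
echo "with pipefail:    exit status $with"
```

Combined with `-e`, this means a failed `curl` or `pg_dump` at the start of a pipeline stops the script instead of silently producing an empty result.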

Create a systemd service unit

A service + timer pair for a scheduled task (replacing cron):

```ini
# /etc/systemd/system/db-backup.service
[Unit]
Description=Database backup
After=network-online.target postgresql.service
Wants=network-online.target
# Prevent starting if PostgreSQL is not running
Requires=postgresql.service

[Service]
Type=oneshot
User=backup
Group=backup

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/backups/db
PrivateTmp=true

ExecStart=/usr/local/bin/db-backup.sh
StandardOutput=journal
StandardError=journal

# Retry on failure (Restart= with Type=oneshot needs systemd 244 or newer)
Restart=on-failure
RestartSec=60

[Install]
WantedBy=multi-user.target
```

```ini
# /etc/systemd/system/db-backup.timer
[Unit]
Description=Run database backup daily at 02:00
# No Requires=db-backup.service here: a timer activates the service of the
# same name by default, and Requires= would also start the backup every time
# the timer itself is started (for example at boot).

[Timer]
# Run at 02:00 every day
OnCalendar=*-*-* 02:00:00
# Run immediately if last run was missed (e.g., server was down)
Persistent=true
# Randomize start within 5 minutes to avoid thundering herd
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```

```bash
# Deploy and enable
sudo systemctl daemon-reload
sudo systemctl enable --now db-backup.timer

# Inspect
systemctl status db-backup.timer
systemctl list-timers db-backup.timer
journalctl -u db-backup.service -n 50
```

Configure SSH hardening

Edit `/etc/ssh/sshd_config` with these settings:

```
# /etc/ssh/sshd_config - production hardening

# Use SSH protocol 2 only (default in modern OpenSSH, make it explicit)
# Note: recent OpenSSH releases support only protocol 2 and report this
# option as deprecated; drop the line if `sshd -t` warns about it.
Protocol 2

# Disable root login - use a dedicated admin user with sudo
PermitRootLogin no

# Disable password authentication - key-based only
# (newer OpenSSH names ChallengeResponseAuthentication KbdInteractiveAuthentication)
PasswordAuthentication no
ChallengeResponseAuthentication no
UsePAM yes

# Disable X11 forwarding unless needed
X11Forwarding no

# Limit login window to prevent slowloris-style attacks
LoginGraceTime 30
MaxAuthTries 4
MaxSessions 10

# Only allow specific groups to SSH
AllowGroups sshusers admins

# Restrict ciphers, MACs, and key exchange to modern algorithms
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com
KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org

# Use privilege separation
# Note: recent OpenSSH releases always enable privilege separation and report
# this option as deprecated; drop the line if `sshd -t` warns about it.
UsePrivilegeSeparation sandbox

# Log at verbose level to capture key fingerprints on auth
LogLevel VERBOSE

# Set idle timeout: disconnect after 15 minutes of inactivity (300s x 3)
ClientAliveInterval 300
ClientAliveCountMax 3
```

```bash
# Validate before restarting
sudo sshd -t

# Restart sshd (keep current session open until verified)
# On Debian/Ubuntu the unit is named ssh rather than sshd
sudo systemctl restart sshd

# Verify from a NEW session before closing the old one
ssh -v user@host
```

> Never close your existing SSH session until you have verified a new session works.
> A broken sshd config can lock you out of the server permanently.
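
A quick audit of the hardening settings above can be scripted. The sketch below is a rough illustration, not a full parser: `check_setting` is a hypothetical helper, it ignores `Match` blocks and included files, and it runs against a throwaway sample file instead of `/etc/ssh/sshd_config`. On a real host, `sudo sshd -T` prints the effective configuration and is the more reliable source.

```bash
#!/usr/bin/env bash
set -euo pipefail

# check_setting FILE KEY VALUE: succeed if FILE has an uncommented "KEY VALUE" line.
# (Hypothetical helper; does not handle Match blocks or Include files.)
check_setting() {
  local file=$1 key=$2 value=$3
  grep -Eiq "^[[:space:]]*${key}[[:space:]]+${value}[[:space:]]*$" "$file"
}

# Throwaway sample config so the sketch runs without touching /etc/ssh.
sample_conf=$(mktemp)
printf '%s\n' 'PermitRootLogin no' '#PasswordAuthentication no' 'X11Forwarding no' > "$sample_conf"

audit_result=""
for pair in "PermitRootLogin no" "PasswordAuthentication no" "X11Forwarding no"; do
  key=${pair%% *}
  value=${pair#* }
  if check_setting "$sample_conf" "$key" "$value"; then
    audit_result+="OK:$key "
  else
    audit_result+="MISSING:$key "
  fi
done

echo "$audit_result"
rm -f "$sample_conf"
```

The commented-out `PasswordAuthentication` line in the sample is reported as missing, which is the case such a check is meant to catch.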

Debug networking issues

Follow this workflow top-down:

```bash
# 1. Check interface state and IP assignment
ip addr show
ip link show

# 2. Check routing table
ip route show
# Expected: default route via gateway, local subnet route

# 3. Test gateway reachability
ping -c 4 "$(ip route | awk '/default/ {print $3; exit}')"

# 4. Test DNS resolution
dig +short google.com @8.8.8.8   # direct to external resolver
resolvectl query google.com      # use system resolver (systemd-resolved)
cat /etc/resolv.conf             # check configured resolvers

# 5. Check listening ports and owning processes
ss -tulpn
# -t: TCP  -u: UDP  -l: listening  -p: process  -n: no name resolution

# 6. Test specific port connectivity
nc -zv 10.0.0.5 5432             # check if port is open
timeout 3 bash -c "</dev/tcp/10.0.0.5/5432" && echo open || echo closed

# 7. Trace the path
traceroute -n 8.8.8.8            # path tracing (UDP probes by default; add -I for ICMP)
mtr --report 8.8.8.8             # per-hop loss and latency stats (better than traceroute)

# 8. Capture traffic for deep inspection
# Capture all traffic on eth0 to/from a host on port 443
sudo tcpdump -i eth0 -n host 10.0.0.5 and port 443 -w /tmp/capture.pcap

# Quick view without saving
sudo tcpdump -i eth0 -n port 53  # watch DNS queries live
```
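
The port check from step 6 can be wrapped in a reusable function for scripts that must wait on a dependency. This is a sketch under assumptions: `port_open` is a hypothetical helper, it relies on bash's `/dev/tcp` pseudo-device and coreutils `timeout`, and the probed address is only an example.

```bash
#!/usr/bin/env bash

# port_open HOST PORT [TIMEOUT_SECONDS]: succeed if a TCP connection can be opened.
# (Hypothetical helper; relies on bash's /dev/tcp and coreutils timeout.)
port_open() {
  local host=$1 port=$2 wait_s=${3:-3}
  # Inside the child shell, $0 is the host and $1 is the port.
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example: probe the SSH port on the local machine with a 1-second limit.
if port_open 127.0.0.1 22 1; then
  echo "127.0.0.1:22 open"
else
  echo "127.0.0.1:22 closed or filtered"
fi
```

A non-zero return covers both a refused connection and a timeout, so it cannot tell a closed port from a filtered one; `nc -zv` or `tcpdump` is needed to separate the two.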

Set up firewall rules

Using `ufw` for simple servers, raw `iptables` for complex setups:

```bash
# --- ufw approach (recommended for most servers) ---

# Reset to defaults
sudo ufw --force reset
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH (do this BEFORE enabling to avoid lockout)
sudo ufw allow 22/tcp comment 'SSH'

# Web server
sudo ufw allow 80/tcp comment 'HTTP'
sudo ufw allow 443/tcp comment 'HTTPS'

# Allow a specific source subnet to reach PostgreSQL
sudo ufw allow from 192.168.1.0/24 to any port 5432 comment 'Postgres from internal'

# Enable and verify
sudo ufw --force enable
sudo ufw status verbose
```

```bash
# --- iptables approach for precise control ---
# Run as root, as one script or from a console: the DROP policy below takes
# effect before the SSH allow rules are added.

# Flush existing rules
iptables -F
iptables -X

# Default policies: drop inbound and forwarded traffic, allow outbound
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT

# Allow loopback
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT

# Allow established/related connections
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow SSH (rate-limit to prevent brute force)
iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
  -m recent --set --name SSH --rsource
iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
  -m recent --update --seconds 60 --hitcount 4 --name SSH --rsource -j DROP
iptables -A INPUT -p tcp --dport 22 -j ACCEPT

# Allow HTTP/HTTPS
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j ACCEPT

# Save rules
iptables-save > /etc/iptables/rules.v4
```

Manage disk space

```bash
# Check disk usage overview
df -hT
# -h: human readable  -T: show filesystem type

# Find large directories (top 10, depth-limited)
du -h --max-depth=2 /var | sort -rh | head -10

# Interactive disk usage explorer (install ncdu first)
ncdu /var/log

# Find large files
find /var -type f -size +100M -exec ls -lh {} + 2>/dev/null | sort -k5 -rh

# Check journal size and truncate if needed
journalctl --disk-usage
sudo journalctl --vacuum-size=500M   # keep last 500MB
sudo journalctl --vacuum-time=30d    # keep last 30 days
```

```
# /etc/logrotate.d/myapp - custom log rotation
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        systemctl reload myapp 2>/dev/null || true
    endscript
}
```

```bash
# Test logrotate config without running it
logrotate --debug /etc/logrotate.d/myapp

# Force a rotation run
logrotate --force /etc/logrotate.d/myapp
```
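
The manual checks above can be turned into a simple threshold alert for cron or a systemd timer. The following sketch assumes GNU `df` (for `--output`); the `usage_percent` helper and the 90% threshold are hypothetical choices.

```bash
#!/usr/bin/env bash
set -euo pipefail

# usage_percent PATH: print the used-space percentage (digits only) of the
# filesystem holding PATH. Assumes GNU df, which provides --output.
usage_percent() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

threshold=90   # hypothetical alert threshold, in percent
root_usage=$(usage_percent /)
root_usage=${root_usage:-0}   # df prints "-" for some pseudo filesystems

if (( root_usage >= threshold )); then
  echo "WARNING: / is ${root_usage}% full (threshold ${threshold}%)"
else
  echo "OK: / is ${root_usage}% full"
fi
```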

Monitor processes

```bash
# Overview: CPU, memory, load average
top -b -n 1 -o %CPU | head -20   # batch mode, sort by CPU
htop                             # interactive, colored, tree view

# Find what a process is doing
pid=$(pgrep -x nginx | head -1)

# Open files and network connections
lsof -p "$pid"          # all open files
lsof -a -p "$pid" -i    # only network connections (-a ANDs the two filters)
lsof -i :8080           # what process owns port 8080

# System calls (strace) - use when a process behaves unexpectedly
strace -p "$pid" -f -e trace=network   # network syscalls only
strace -p "$pid" -f -c                 # count syscall frequency (summary)
strace -c cmd arg                      # profile syscalls of a new command

# Memory inspection
grep -E 'Vm|Threads' /proc/"$pid"/status
cat /proc/"$pid"/smaps_rollup          # detailed memory breakdown

# Check zombie/defunct processes (STAT starts with Z, e.g. Z, Z+, Zs)
ps aux | awk '$8 ~ /^Z/ {print}'

# Kill process tree (all children too) by signalling the process group
kill -TERM -- -"$(ps -o pgid= -p "$pid" | tr -d ' ')"
```
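
For scripted checks, the same `/proc/<pid>/status` fields can be read one at a time. This is a minimal sketch: `proc_field` is a hypothetical helper, and it inspects the script's own process so it runs anywhere `/proc` is mounted.

```bash
#!/usr/bin/env bash
set -euo pipefail

# proc_field PID FIELD: print the value of FIELD from /proc/PID/status.
# (Hypothetical helper for illustration.)
proc_field() {
  awk -v key="$2:" '$1 == key {print $2}' "/proc/$1/status"
}

self_pid=$$
rss_kb=$(proc_field "$self_pid" VmRSS)      # resident memory, in kB
threads=$(proc_field "$self_pid" Threads)   # number of threads

echo "pid=$self_pid rss_kb=$rss_kb threads=$threads"
```

Swapping `$$` for the `pid` found with `pgrep` gives a lightweight way to log a daemon's memory use over time.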

---

Error handling

| Error | Likely cause | Resolution |
|---|---|---|
| `Permission denied (publickey)` on SSH | Wrong key, wrong user, or sshd config restricts access | Check `~/.ssh/authorized_keys` permissions (should be 600, with `~/.ssh` at 700), verify `AllowGroups` in sshd_config, run `ssh -v` for detail |
| `Unit not found` in systemctl | Unit file not in a searched path or daemon not reloaded | Run `systemctl daemon-reload`, verify unit file path with `systemctl show -p FragmentPath <unit>` |
| `Job for X failed. See journalctl -xe` | Service exited non-zero at startup | Run `journalctl -u service-name -n 50 --no-pager` to see startup errors |
| `RTNETLINK answers: File exists` when adding route | Route already exists in the routing table | Check with `ip route show`, delete conflicting route with `ip route del`, then re-add |
| `iptables: No chain/target/match by that name` | Missing kernel module or typo in chain name | Load module with `modprobe xt_conntrack`, check spelling of built-in chains (INPUT, OUTPUT, FORWARD) |
| Script exits unexpectedly with no error message | `set -e` triggered on a command that returned non-zero | Add `

References

For detailed guidance on specific security domains, read the relevant file from the `references/` folder:

- `references/security-hardening.md` - SSH, firewall, user management, kernel hardening params, and audit logging checklist

Only load the references file when the current task requires it - it is detailed and will consume context.
