# Terraform Operational Traps

Failure patterns from real deployments. Every item caused an incident. Organized as: exact error → root cause → copy-paste fix.

## Provisioner traps (symptom → fix)

### `docker: not found` in remote-exec

cloud-init is still installing Docker when the provisioner SSHs in.

```hcl
provisioner "remote-exec" {
  inline = [
    "cloud-init status --wait || true",
    "which docker || { echo 'FATAL: Docker not ready'; exit 1; }",
  ]
}
```

### `rsync: connection unexpectedly closed` in local-exec

Terraform holds its SSH connection open; a local-exec rsync opens a second one that gets rejected. Never use local-exec for file transfer to the remote host. Use a tarball + the file provisioner:

```hcl
provisioner "local-exec" {
  command = "tar czf /tmp/src.tar.gz --exclude=node_modules --exclude=.git -C ${path.module}/../../.. myproject"
}
provisioner "file" {
  source      = "/tmp/src.tar.gz"
  destination = "/tmp/src.tar.gz"
}
provisioner "remote-exec" {
  inline = ["tar xzf /tmp/src.tar.gz -C /data/ && rm -f /tmp/src.tar.gz"]
}
```

macOS BSD tar: `--exclude` must come BEFORE the source argument.
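
A minimal illustration of that ordering (paths are placeholders):

```bash
# works on macOS: --exclude flags before the source argument
tar czf /tmp/src.tar.gz --exclude=node_modules --exclude=.git -C .. myproject

# does NOT work with BSD tar: --exclude after the source argument
tar czf /tmp/src.tar.gz -C .. myproject --exclude=node_modules --exclude=.git
```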

### `cloud-init status` shows "running" forever

A debconf prompt is blocking `apt-get -y`: `iptables-persistent` waits for interactive input, so cloud-init never finishes. Pre-seed the answers and force noninteractive mode:

```yaml
- |
  echo iptables-persistent iptables-persistent/autosave_v4 boolean true | debconf-set-selections
  echo iptables-persistent iptables-persistent/autosave_v6 boolean true | debconf-set-selections
  DEBIAN_FRONTEND=noninteractive apt-get install -y iptables-persistent
```

Known offenders: `iptables-persistent`, `postfix`, `mysql-server`, `wireshark-common`.

### `EACCES: permission denied` in container logs, container `Restarting`

Host volume dirs are root-owned; the container runs as non-root (uid 1001). Fix before `docker compose up`:

```bash
mkdir -p /data/myapp/data /data/myapp/logs
chown -R 1001:1001 /data/myapp/data /data/myapp/logs
```

Find the UID: grep `adduser.*-u` or `USER` in the Dockerfile.
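
As a copy-paste check (Dockerfile path assumed to be in the build context):

```bash
# Look for an explicit uid (adduser ... -u <uid>) or a USER directive
grep -E 'adduser.*-u|^USER' Dockerfile
```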

### Provisioner fails but no diagnostic output

With `set -e`, the script aborts on the first failing command before `docker logs` can run. Use `set -u` (not `-e`), capture `docker logs` and `docker ps` output with `|| true`, and fail explicitly on the final health check:

```hcl
provisioner "remote-exec" {
  inline = [
    "set -u",
    "docker compose up -d",
    "sleep 15",
    "docker logs myapp --tail 20 2>&1 || true",
    "docker ps --format 'table {{.Names}}\\t{{.Status}}' || true",
    "docker ps --filter name=myapp --format '{{.Status}}' | grep -q healthy || exit 1",
  ]
}
```

### Container `Restarting` — database tables missing

DB migrations are not in the provisioner. PostgreSQL's `docker-entrypoint-initdb.d` only runs on an empty data dir. Explicitly create the DB and run migrations.

After postgres is healthy:

```bash
docker exec pg psql -U postgres -tc "SELECT 1 FROM pg_database WHERE datname='mydb'" | grep -q 1 \
  || docker exec pg psql -U postgres -c "CREATE DATABASE mydb;"
```

Idempotent migrations:

```bash
for f in migrations/*.sql; do
  VER=$(basename $f)
  APPLIED=$($PSQL -tAc "SELECT 1 FROM schema_migrations WHERE version='$VER'" | tr -d ' ')
  [ "$APPLIED" = "1" ] && continue
  { echo 'BEGIN;'; cat $f; echo 'COMMIT;'; } | $PSQL
  $PSQL -tAc "INSERT INTO schema_migrations(version) VALUES ('$VER') ON CONFLICT DO NOTHING"
done
```
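
`$PSQL` is used but not defined in these snippets; the assumption is a wrapper along these lines (container and database names are placeholders):

```bash
# Assumed helper: psql inside the postgres container, pointed at the app database
PSQL="docker exec pg psql -U postgres -d mydb"
```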

### `docker compose build` ignores env var override

Compose reads build args from the `.env` file, not the shell env. `VAR=x docker compose build` does NOT work.

**WRONG**

```bash
DOCKER_WITH_PROXY_MODE=disabled docker compose build
```

**RIGHT**

```bash
grep -q DOCKER_WITH_PROXY_MODE .env || echo 'DOCKER_WITH_PROXY_MODE=disabled' >> .env
docker compose build
```

### TLS handshake fails: `Invalid format for Authorization header`

Caddy DNS-01 ACME needs a Cloudflare API Token (`cfut_` prefix, 40+ chars, Bearer auth). A Global API Key (37 hex chars, X-Auth-Key auth) causes `HTTP 400 Code:6003`. Production may appear to work because it has cached certificates; fresh environments fail on the first cert request.

Verify the token format before deploy:

```bash
TOKEN=$(grep CLOUDFLARE_API_TOKEN .env | cut -d= -f2)
echo "$TOKEN" | grep -q "^cfut_" || echo "FATAL: needs API Token, not Global Key"
```

Create a scoped token via the API:

```bash
curl -s "https://api.cloudflare.com/client/v4/user/tokens" -X POST \
  -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_GLOBAL_KEY" \
  -d '{"name":"caddy-dns-acme","policies":[{"effect":"allow",
    "resources":{"com.cloudflare.api.account.zone.<ZONE_ID>":"*"},
    "permission_groups":[
      {"id":"4755a26eedb94da69e1066d98aa820be","name":"DNS Write"},
      {"id":"c8fed203ed3043cba015a93ad1616f1f","name":"Zone Read"}]}]}'
```
{"id":"c8fed203ed3043cba015a93ad1616f1f","name":"Zone Read"}]}]}'TLS fails on staging but works on production — hardcoded domains
TLS在预发布环境失败但生产环境正常 — 硬编码域名
Caddyfile or compose has literal domain names. Staging Caddy loads production config, tries to get certs for domains it doesn't own → ACME fails.
Caddyfile: Use — Caddy evaluates env vars at startup.
{$VAR}caddy
undefinedCaddyfile或compose中存在字面域名。预发布环境的Caddy加载生产环境配置,尝试为其不拥有的域名获取证书 → ACME失败。
Caddyfile:使用 — Caddy会在启动时解析环境变量。
{$VAR}caddy
undefinedWRONG
错误写法
gpt-6.pro { tls { dns cloudflare {env.CLOUDFLARE_API_TOKEN} } }
gpt-6.pro { tls { dns cloudflare {env.CLOUDFLARE_API_TOKEN} } }
RIGHT
正确写法
{$LOBEHUB_DOMAIN} { tls { dns cloudflare {env.CLOUDFLARE_API_TOKEN} } }
**Compose**: Use `${VAR:?required}` — fail-fast if unset.
```yaml{$LOBEHUB_DOMAIN} { tls { dns cloudflare {env.CLOUDFLARE_API_TOKEN} } }
**Compose**:使用`${VAR:?required}` — 若未设置则快速失败。
```yamlWRONG
错误写法
- APP_URL=https://gpt-6.pro
- APP_URL=https://gpt-6.pro
RIGHT
正确写法
- APP_URL=${APP_URL:?APP_URL is required}
Pass the env var to the gateway container so Caddy can read it:
```yaml
environment:
- LOBEHUB_DOMAIN=${LOBEHUB_DOMAIN:?LOBEHUB_DOMAIN is required}
- CLOUDFLARE_API_TOKEN=${CLOUDFLARE_API_TOKEN:?required for DNS-01 TLS}- APP_URL=${APP_URL:?APP_URL is required}
将环境变量传递给网关容器,以便Caddy读取:
```yaml
environment:
- LOBEHUB_DOMAIN=${LOBEHUB_DOMAIN:?LOBEHUB_DOMAIN is required}

### OAuth login fails: `Social sign in failed`

Casdoor `init_data.json` contains hardcoded redirect URIs. `--createDatabase=true` only applies init_data on first-ever DB creation — not on restarts. Fix via SQL in the provisioner.

Replace the production domain with staging in the existing Casdoor DB:

```bash
$PSQL -c "UPDATE application SET redirect_uris = REPLACE(redirect_uris,
  'gpt-6.pro', 'staging.gpt-6.pro')
  WHERE name='lobechat'
  AND redirect_uris LIKE '%gpt-6.pro%'
  AND redirect_uris NOT LIKE '%staging.gpt-6.pro%';"
```

Also check `AUTH_CASDOOR_ISSUER` — it must match the Casdoor subdomain (`auth.staging.example.com`), not the app root domain.
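
One way to enforce that, reusing the `${VAR:?required}` pattern from above (whether the issuer is injected this way in this project is an assumption):

```yaml
environment:
  # must be the Casdoor subdomain for this environment, not the app root domain
  - AUTH_CASDOOR_ISSUER=${AUTH_CASDOOR_ISSUER:?e.g. https://auth.staging.example.com}
```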

## Multi-environment isolation

Before creating a second environment, grep `.tf` files for hardcoded names. See references/multi-env-isolation.md for the complete matrix.

Will fail on apply (globally unique):

| Resource | Scope | Fix |
|---|---|---|
| SSH key pair | Region | |
| SLS log project | Account | |
| CloudMonitor contact | Account | |
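
A common way to avoid these collisions is to prefix names with the environment; the sketch below is illustrative (the `environment` variable and resource names are assumptions, not this project's exact configuration):

```hcl
variable "environment" {
  type = string # "staging", "prod", ...
}

# Key pair names only need to be unique per region; the same prefix pattern
# applies to the SLS log project and CloudMonitor contact names.
resource "alicloud_ecs_key_pair" "deploy" {
  key_pair_name = "${var.environment}-deploy-key"
}
```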

**DNS duplication trap**: Two environments creating A records for the same name in the same Cloudflare zone → two independent record IDs → DNS round-robin → ~50% of traffic to the wrong instance. Fix: use subdomain isolation (`staging.example.com`) or separate zones. Remember to create DNS records for ALL subdomains Caddy serves (e.g., `auth.staging`, `minio.staging`).
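
A sketch of the subdomain-isolation fix, assuming the Cloudflare provider's `cloudflare_record` resource and illustrative variable names (`value` is `content` in newer provider versions):

```hcl
resource "cloudflare_record" "caddy" {
  # one A record per Caddy-served hostname, all scoped to this environment
  for_each = toset(["staging", "auth.staging", "minio.staging"])

  zone_id = var.cloudflare_zone_id
  name    = each.key # staging.example.com, auth.staging.example.com, ...
  type    = "A"
  value   = var.instance_public_ip
  proxied = false
}
```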

**Snapshot cross-contamination**: An unfiltered `data "alicloud_ecs_snapshots"` returns ALL account snapshots. A new environment inherits the old 100GB snapshot and fails creating its 40GB disk. Gate with a variable:

```hcl
locals {
  latest_snapshot_id = (
    var.enable_snapshot_recovery && length(local.available_snapshots) > 0
    ? local.available_snapshots[0].snapshot_id
    : null
  )
}
```

Do NOT add `count` to the data source — it changes the state address and causes drift.
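
One way to keep `local.available_snapshots` scoped to the current environment without adding `count`, assuming snapshots follow an environment-prefixed naming convention and using the data source's `name_regex` filter:

```hcl
# Assumed convention: snapshots are named "<environment>-data-<timestamp>"
data "alicloud_ecs_snapshots" "existing" {
  name_regex = "^${var.environment}-data-"
}

locals {
  available_snapshots = data.alicloud_ecs_snapshots.existing.snapshots
}
```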

## Pre-deploy validation

Run a validation script before `terraform apply` to catch configuration errors locally. This eliminates the deploy→discover→fix→redeploy cycle.

Key checks (see references/pre-deploy-validation.md):

- `terraform validate` — syntax
- No hardcoded domains in Caddyfiles or compose files
- Required env vars present (`LOBEHUB_DOMAIN`, `CLAUDE4DEV_DOMAIN`, `CLOUDFLARE_API_TOKEN`, `APP_URL`, etc.)
- Cloudflare API Token format (not Global API Key)
- DNS records exist for all Caddy-served domains
- Casdoor issuer URL matches the `auth.*` subdomain
- SSH private key exists

Integrate into the Makefile: `make pre-deploy ENV=staging` before `make apply`.
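
A minimal sketch of such a script covering a few of the checks, assuming a per-environment `envs/<env>/` layout with its own `.env`, `Caddyfile`, and `docker-compose.yml` (paths, file names, and the grepped domain are placeholders):

```bash
#!/usr/bin/env bash
# Illustrative only: the envs/<env>/ layout and file names are assumptions
# about the repo structure, not guaranteed by this document.
set -u
ENV="${1:?usage: pre-deploy.sh <env>}"
ENV_FILE="envs/$ENV/.env"
FAIL=0

# 1. Terraform syntax
terraform -chdir="envs/$ENV" validate || FAIL=1

# 2. No hardcoded production domain in Caddyfile / compose
if grep -rn 'gpt-6\.pro' "envs/$ENV/Caddyfile" "envs/$ENV/docker-compose.yml" 2>/dev/null; then
  echo "FATAL: hardcoded domain found"; FAIL=1
fi

# 3. Required env vars present
for var in LOBEHUB_DOMAIN CLAUDE4DEV_DOMAIN CLOUDFLARE_API_TOKEN APP_URL; do
  grep -q "^$var=" "$ENV_FILE" || { echo "FATAL: $var missing in $ENV_FILE"; FAIL=1; }
done

# 4. Cloudflare API Token, not Global API Key
grep '^CLOUDFLARE_API_TOKEN=' "$ENV_FILE" | cut -d= -f2 | grep -q '^cfut_' \
  || { echo "FATAL: needs Cloudflare API Token, not Global Key"; FAIL=1; }

exit "$FAIL"
```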

## Zero-to-deployment

Fresh disks expose every implicit dependency. See references/zero-to-deploy-checklist.md.

Key items that break provisioners on fresh instances:

- Directories: `mkdir -p /data/{svc1,svc2}` in cloud-init — the `file` provisioner fails if the target dir is missing
- Databases: Explicit `CREATE DATABASE` — PG init scripts only run on an empty data dir
- Migrations: Tracked in a `schema_migrations` table, applied idempotently
- Provisioner ordering: `depends_on` between resources sharing Docker networks (see the sketch after this list)
- Memory: Stop non-critical containers during Docker builds on small instances (≤8GB)
- Domain parameterization: Every domain in Caddyfile/compose must be `{$VAR}` / `${VAR:?required}`
- Credential format: Caddy needs an API Token (`cfut_`), not a Global API Key
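
A sketch of the `depends_on` ordering point; the network name, resource names, and connection variables are assumptions, not this project's actual configuration:

```hcl
resource "null_resource" "docker_network" {
  connection {
    type        = "ssh"
    host        = var.instance_ip
    user        = "root"
    private_key = file(var.ssh_private_key_path)
  }

  # Create the shared network once, idempotently
  provisioner "remote-exec" {
    inline = ["docker network inspect edge >/dev/null 2>&1 || docker network create edge"]
  }
}

resource "null_resource" "gateway_stack" {
  # Without this, Terraform may run both provisioners in parallel and the
  # gateway's "docker compose up" fails because the network does not exist yet.
  depends_on = [null_resource.docker_network]

  connection {
    type        = "ssh"
    host        = var.instance_ip
    user        = "root"
    private_key = file(var.ssh_private_key_path)
  }

  provisioner "remote-exec" {
    inline = ["docker compose -f /data/gateway/docker-compose.yml up -d"]
  }
}
```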