terraform-skill


Terraform Operational Traps


Failure patterns from real deployments. Every item caused an incident. Organized as: exact error → root cause → copy-paste fix.

Provisioner traps (symptom → fix)


`docker: not found` in `remote-exec`

cloud-init is still installing Docker when the provisioner SSHs in.

```hcl
provisioner "remote-exec" {
  inline = [
    "cloud-init status --wait || true",
    "which docker || { echo 'FATAL: Docker not ready'; exit 1; }",
  ]
}
```
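If `cloud-init status --wait` is unreliable on your image (the `|| true` swallows its exit code), a bounded poll on the binary itself is a safer gate. A minimal sketch, assuming a POSIX shell on the target; `wait_for_cmd` and its timeout are hypothetical names:

```shell
# wait_for_cmd NAME TIMEOUT_SECS: poll until NAME appears on PATH, or fail.
wait_for_cmd() {
  name=$1; timeout=${2:-300}; waited=0
  until command -v "$name" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "FATAL: $name not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 5; waited=$((waited + 5))
  done
}

# In remote-exec you would gate on Docker, e.g.: wait_for_cmd docker 300
```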

`rsync: connection unexpectedly closed` in `local-exec`

Terraform holds its SSH connection open; a local-exec rsync opens a second connection that gets rejected. Never use local-exec to transfer files to the remote host. Use a tarball + the file provisioner:

```hcl
provisioner "local-exec" {
  command = "tar czf /tmp/src.tar.gz --exclude=node_modules --exclude=.git -C ${path.module}/../../.. myproject"
}
provisioner "file" {
  source      = "/tmp/src.tar.gz"
  destination = "/tmp/src.tar.gz"
}
provisioner "remote-exec" {
  inline = ["tar xzf /tmp/src.tar.gz -C /data/ && rm -f /tmp/src.tar.gz"]
}
```

macOS BSD tar: `--exclude` must come BEFORE the source argument.
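The exclusion ordering can be checked locally before baking it into a provisioner. A quick sketch with hypothetical paths; `--exclude` before the source argument is required by BSD tar and accepted by GNU tar:

```shell
# Build a tree with a directory we want excluded, then verify the archive.
workdir=$(mktemp -d)
mkdir -p "$workdir/myproject/node_modules" "$workdir/myproject/src"
echo 'code' > "$workdir/myproject/src/app.js"
echo 'dep'  > "$workdir/myproject/node_modules/left-pad.js"

# --exclude BEFORE the source argument works with both tar flavors.
tar czf "$workdir/src.tar.gz" --exclude=node_modules -C "$workdir" myproject

tar tzf "$workdir/src.tar.gz"   # src/app.js is present, node_modules is not
```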

`cloud-init status` shows "running" forever

`apt-get -y` does not suppress debconf dialogs. Packages like `iptables-persistent` block on TTY prompts.

```yaml
- |
    echo iptables-persistent iptables-persistent/autosave_v4 boolean true | debconf-set-selections
    echo iptables-persistent iptables-persistent/autosave_v6 boolean true | debconf-set-selections
    DEBIAN_FRONTEND=noninteractive apt-get install -y iptables-persistent
```

Known offenders: `iptables-persistent`, `postfix`, `mysql-server`, `wireshark-common`.

`EACCES: permission denied` in container logs, container `Restarting`

Host volume directories are root-owned; the container runs as a non-root user (uid 1001). Fix before `docker compose up`:

```bash
mkdir -p /data/myapp/data /data/myapp/logs
chown -R 1001:1001 /data/myapp/data /data/myapp/logs
```

Find the UID: grep for `adduser.*-u` or `USER` in the Dockerfile.
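That grep can be scripted rather than eyeballed. A sketch assuming the Dockerfile declares either a numeric `USER` or an `adduser ... -u <uid>` line; `dockerfile_uid` is a hypothetical helper:

```shell
# Print the numeric UID a Dockerfile appears to run as, if declared.
dockerfile_uid() {
  awk '
    toupper($1) == "USER" && $2 ~ /^[0-9]+$/ { uid = $2 }
    /adduser/ { for (i = 1; i < NF; i++) if ($i == "-u") uid = $(i + 1) }
    END { if (uid != "") print uid }
  ' "$1"
}
```

Usage, e.g.: `chown -R "$(dockerfile_uid Dockerfile)" /data/myapp/data`.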

Provisioner fails but no diagnostic output


`set -e` exits on the first error, hiding subsequent `docker logs` output. Use `set -u` without `-e`, and put one verification gate at the end:

```hcl
provisioner "remote-exec" {
  inline = [
    "set -u",
    "docker compose up -d",
    "sleep 15",
    "docker logs myapp --tail 20 2>&1 || true",
    "docker ps --format 'table {{.Names}}\\t{{.Status}}' || true",
    "docker ps --filter name=myapp --format '{{.Status}}' | grep -q healthy || exit 1",
  ]
}
```
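A fixed `sleep 15` before the gate is still a race on slow starts; a bounded poll on the health status is more robust. A sketch using the same `docker ps` filter as the gate; `wait_healthy` and its retry count are hypothetical:

```shell
# wait_healthy NAME RETRIES: poll `docker ps` until the container reports healthy.
wait_healthy() {
  name=$1; retries=${2:-30}
  i=0
  while [ "$i" -lt "$retries" ]; do
    status=$(docker ps --filter "name=$name" --format '{{.Status}}' 2>/dev/null)
    case $status in *healthy*) return 0 ;; esac
    sleep 2; i=$((i + 1))
  done
  echo "FATAL: $name not healthy after $((retries * 2))s" >&2
  return 1
}
```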

Container `Restarting` — database tables missing

DB migrations were not in the provisioner. PostgreSQL `docker-entrypoint-initdb.d` only runs on an empty data dir. Explicitly create the DB and run migrations.

After postgres is healthy:

```bash
docker exec pg psql -U postgres -tc "SELECT 1 FROM pg_database WHERE datname='mydb'" | grep -q 1 \
  || docker exec pg psql -U postgres -c "CREATE DATABASE mydb;"
```

Idempotent migrations:

```bash
for f in migrations/*.sql; do
  VER=$(basename "$f")
  APPLIED=$($PSQL -tAc "SELECT 1 FROM schema_migrations WHERE version='$VER'" | tr -d ' ')
  [ "$APPLIED" = "1" ] && continue
  { echo 'BEGIN;'; cat "$f"; echo 'COMMIT;'; } | $PSQL
  $PSQL -tAc "INSERT INTO schema_migrations(version) VALUES ('$VER') ON CONFLICT DO NOTHING"
done
```
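The ledger pattern itself does not depend on Postgres, so the idempotency logic can be exercised locally with a plain file standing in for the `schema_migrations` table. A sketch (file-based stand-in, not the real `$PSQL` path; `apply_migrations` is a hypothetical name):

```shell
# Apply each "migration" at most once, recording versions in a ledger file.
apply_migrations() {
  dir=$1; ledger=$2
  touch "$ledger"
  for f in "$dir"/*.sql; do
    [ -e "$f" ] || continue
    ver=$(basename "$f")
    grep -qx "$ver" "$ledger" && continue   # already applied: skip
    echo "applying $ver"                    # real version pipes BEGIN/COMMIT to $PSQL
    echo "$ver" >> "$ledger"
  done
}
```

Running it twice against the same directory applies each file only once, which is exactly the property the `schema_migrations` check provides.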

`docker compose build` ignores env var override

Compose reads build args from the `.env` file, not the shell environment. `VAR=x docker compose build` does NOT work.

WRONG:

```bash
DOCKER_WITH_PROXY_MODE=disabled docker compose build
```

RIGHT:

```bash
grep -q DOCKER_WITH_PROXY_MODE .env || echo 'DOCKER_WITH_PROXY_MODE=disabled' >> .env
docker compose build
```
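Note the grep-or-append fix only adds the key when it is absent; a stale existing value is left in place. A small upsert sketch handles both cases (`env_upsert` is a hypothetical helper; it rewrites via a temp file to stay portable across GNU and BSD tools):

```shell
# env_upsert KEY VALUE [FILE]: set KEY=VALUE in FILE, replacing any existing line.
env_upsert() {
  key=$1; value=$2; file=${3:-.env}
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    awk -v kv="${key}=${value}" -v k="${key}=" \
      'index($0, k) == 1 { print kv; next } { print }' \
      "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  else
    echo "${key}=${value}" >> "$file"
  fi
}
```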

TLS handshake fails: `Invalid format for Authorization header`

Caddy DNS-01 ACME needs a Cloudflare API Token (`cfut_` prefix, 40+ chars, Bearer auth). A Global API Key (37 hex chars, X-Auth-Key auth) causes `HTTP 400 Code:6003`. Production may appear to work because it has cached certificates; fresh environments fail on the first cert request.

Verify token format before deploy:

```bash
TOKEN=$(grep CLOUDFLARE_API_TOKEN .env | cut -d= -f2)
echo "$TOKEN" | grep -q "^cfut_" || echo "FATAL: needs API Token, not Global Key"
```

Create scoped token via API:

```bash
curl -s "https://api.cloudflare.com/client/v4/user/tokens" -X POST \
  -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_GLOBAL_KEY" \
  -d '{"name":"caddy-dns-acme","policies":[{"effect":"allow",
    "resources":{"com.cloudflare.api.account.zone.<ZONE_ID>":"*"},
    "permission_groups":[
      {"id":"4755a26eedb94da69e1066d98aa820be","name":"DNS Write"},
      {"id":"c8fed203ed3043cba015a93ad1616f1f","name":"Zone Read"}]}]}'
```
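The format check can be wrapped into a single pass/fail gate for a pre-deploy script. A sketch following the token shape this section describes (`cfut_` prefix, 40+ chars); `cf_token_ok` is a hypothetical name:

```shell
# cf_token_ok TOKEN: succeed only for API-Token-shaped values, not Global Keys.
cf_token_ok() {
  token=$1
  case $token in cfut_*) : ;; *) return 1 ;; esac
  [ "${#token}" -ge 40 ]
}
```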

TLS fails on staging but works on production — hardcoded domains


The Caddyfile or compose file has literal domain names. Staging Caddy loads the production config, tries to get certs for domains it doesn't own → ACME fails.

Caddyfile: Use `{$VAR}` — Caddy evaluates env vars at startup.

WRONG:

```caddy
gpt-6.pro { tls { dns cloudflare {env.CLOUDFLARE_API_TOKEN} } }
```

RIGHT:

```caddy
{$LOBEHUB_DOMAIN} { tls { dns cloudflare {env.CLOUDFLARE_API_TOKEN} } }
```

**Compose**: Use `${VAR:?required}` — fail-fast if unset.

```yaml
  - APP_URL=${APP_URL:?APP_URL is required}
```

Pass the env var to the gateway container so Caddy can read it:

```yaml
environment:
  - LOBEHUB_DOMAIN=${LOBEHUB_DOMAIN:?LOBEHUB_DOMAIN is required}
  - CLOUDFLARE_API_TOKEN=${CLOUDFLARE_API_TOKEN:?required for DNS-01 TLS}
```

OAuth login fails: `Social sign in failed`

Casdoor `init_data.json` contains hardcoded redirect URIs. `--createDatabase=true` only applies init_data on first-ever DB creation — not on restarts. Fix via SQL in the provisioner.

Replace the production domain with staging in the existing Casdoor DB:

```bash
$PSQL -c "UPDATE application SET redirect_uris = REPLACE(redirect_uris, 'gpt-6.pro', 'staging.gpt-6.pro') WHERE name='lobechat' AND redirect_uris LIKE '%gpt-6.pro%' AND redirect_uris NOT LIKE '%staging.gpt-6.pro%';"
```

Also check `AUTH_CASDOOR_ISSUER` — it must match the Casdoor subdomain (`auth.staging.example.com`), not the app root domain.
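The issuer check is easy to automate alongside the redirect-URI fix. A sketch that asserts the issuer host is an `auth.*` subdomain (`issuer_matches_auth` is a hypothetical helper; assumes the issuer is a plain http(s) URL):

```shell
# issuer_matches_auth URL: succeed if the issuer host is an auth.* subdomain.
issuer_matches_auth() {
  host=${1#https://}; host=${host#http://}; host=${host%%/*}
  case $host in auth.*) return 0 ;; *) return 1 ;; esac
}
```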

Multi-environment isolation


Before creating a second environment, grep `.tf` files for hardcoded names. See references/multi-env-isolation.md for the complete matrix.

Will fail on apply (globally unique):

| Resource | Scope | Fix |
|---|---|---|
| SSH key pair | Region | `"${env}-deploy"` |
| SLS log project | Account | `"${env}-logs"` |
| CloudMonitor contact | Account | `"${env}-ops"` |

DNS duplication trap: Two environments creating A records for the same name in the same Cloudflare zone → two independent record IDs → DNS round-robin → ~50% of traffic to the wrong instance. Fix: use subdomain isolation (`staging.example.com`) or separate zones. Remember to create DNS records for ALL subdomains Caddy serves (e.g., `auth.staging`, `minio.staging`).

Snapshot cross-contamination: An unfiltered `data "alicloud_ecs_snapshots"` returns ALL account snapshots. A new env inherits the old 100GB snapshot and fails creating a 40GB disk. Gate with a variable:

```hcl
locals {
  latest_snapshot_id = var.enable_snapshot_recovery && length(local.available_snapshots) > 0
    ? local.available_snapshots[0].snapshot_id : null
}
```

Do NOT add `count` to the data source — it changes the state address and causes drift.
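The pre-clone grep can be made concrete. A sketch that scans `.tf` files for a literal production domain before a second environment is created (`check_hardcoded`, the domain, and the directory are placeholders):

```shell
# Fail if any .tf file still carries a hardcoded production value.
check_hardcoded() {
  domain=$1; dir=${2:-.}
  hits=$(grep -rn --include='*.tf' "$domain" "$dir" 2>/dev/null)
  if [ -n "$hits" ]; then
    echo "$hits"
    echo "FATAL: hardcoded '$domain' found; parameterize with a variable" >&2
    return 1
  fi
}
```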

Pre-deploy validation


Run a validation script before `terraform apply` to catch configuration errors locally. This eliminates the deploy→discover→fix→redeploy cycle.

Key checks (see references/pre-deploy-validation.md):

1. `terraform validate` — syntax
2. No hardcoded domains in Caddyfiles or compose files
3. Required env vars present (`LOBEHUB_DOMAIN`, `CLAUDE4DEV_DOMAIN`, `CLOUDFLARE_API_TOKEN`, `APP_URL`, etc.)
4. Cloudflare API Token format (not Global API Key)
5. DNS records exist for all Caddy-served domains
6. Casdoor issuer URL matches the `auth.*` subdomain
7. SSH private key exists

Integrate into Makefile: `make pre-deploy ENV=staging` before `make apply`.
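A minimal runner for checks like these keeps going after a failure and reports the total, so one pass surfaces every problem instead of the first. A skeleton sketch (check names in the example wiring are illustrative):

```shell
# Run each named check function; count failures instead of stopping at the first.
run_checks() {
  fails=0
  for check in "$@"; do
    if "$check"; then
      echo "ok   $check"
    else
      echo "FAIL $check"
      fails=$((fails + 1))
    fi
  done
  [ "$fails" -eq 0 ] || { echo "$fails check(s) failed" >&2; return 1; }
}

# Example wiring: run_checks check_tf_validate check_env_vars check_dns_records
```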

Zero-to-deployment


Fresh disks expose every implicit dependency. See references/zero-to-deploy-checklist.md.

Key items that break provisioners on fresh instances:

1. Directories: `mkdir -p /data/{svc1,svc2}` in cloud-init — the `file` provisioner fails if the target dir is missing
2. Databases: Explicit `CREATE DATABASE` — PG init scripts only run on an empty data dir
3. Migrations: Tracked in a `schema_migrations` table, applied idempotently
4. Provisioner ordering: `depends_on` between resources sharing Docker networks
5. Memory: Stop non-critical containers during Docker build on small instances (≤8GB)
6. Domain parameterization: Every domain in Caddyfile/compose must be `{$VAR}` / `${VAR:?required}`
7. Credential format: Caddy needs an API Token (`cfut_`), not a Global API Key