cloud-architecture
Cloud Architecture
Use When
- Use when designing cloud deployments, Dockerising applications, laying out AWS or GCP environments, choosing a deployment pattern, or moving a workload from a single VM to a resilient multi-AZ topology.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to `cloud-architecture` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve; load `references` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Correctness | Cloud topology decision record | Markdown doc per | |
| Security | Cloud account hardening checklist | Markdown doc covering root-account, IAM, network, and logging baseline | |
References
- Use the `references/` directory for deep detail after reading the core workflow below.
Load Order
- Load `world-class-engineering` for the production bar.
- Load `system-architecture-design` for decomposition and contracts.
- Load this skill for the cloud runtime shape.
- Pair with `cicd-pipelines` for delivery, `cicd-devsecops` for gate policy, `observability-monitoring` for telemetry, `deployment-release-engineering` for rollout, and `reliability-engineering` for failure design.
Executable Outputs
For meaningful cloud architecture work produce: workload classification (stateless, stateful, async, batch, scheduled), chosen compute model with rationale, VPC + subnet + routing layout across AZs, Dockerfile (multi-stage, pinned base), `docker-compose.yml` mirroring production, IAM role inventory with least-privilege policies, deployment pattern choice and rollback runbook, cost posture (reserved/on-demand/spot split, Savings Plan assessment), and CDN/TLS/WAF/auto-scaling configuration.
Cloud Provider Selection
East African SaaS workloads (Uganda, Kenya, Tanzania) weigh four dimensions: latency to users, data-residency obligations under Uganda DPPA 2019, support hours overlapping EAT (UTC+3), and price-per-workload.
| Dimension | AWS | GCP | Azure |
|---|---|---|---|
| Closest region | af-south-1 | None in Africa (no ZA region GA) | southafricanorth |
| Data-residency fit | Strong (af-south-1 + KMS) | Weak (no ZA region for many services) | Strong (ZA North + Customer Lockbox) |
| Support in EAT | 24/7 Business; EMEA TAM overlap | 24/7 Standard | 24/7 ProDirect; ZA partners |
| Managed services breadth | Widest | Data/ML led | Microsoft-stack integration |
Default to AWS `af-south-1` for Uganda workloads with S-tier DPPA 2019 data; use Azure `southafricanorth` only for .NET-heavy stacks with an existing EA licence; avoid GCP as primary for DPPA-scoped data until a ZA region is GA.

```bash
aws configure set region af-south-1 --profile ug-prod
aws ec2 describe-availability-zones --region af-south-1 --query "AvailabilityZones[].ZoneName"
```

Compute Model Decision Rules
- Single app, low traffic, one region → EC2 + Docker Compose, backed by RDS Multi-AZ and S3.
- Multiple services, scaling needs, no Kubernetes skill → ECS Fargate with ALB.
- Multiple services, platform-ready team, polyglot runtime, multi-tenant isolation → Kubernetes (defer to `kubernetes-platform`).
- Async fan-out, batch, or event pipeline → Lambda + SQS + EventBridge, with state in DynamoDB or RDS.

Kubernetes is a commitment, not a default.
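The rules above can be sketched as a small decision helper; the function name and trait flags are illustrative, and real selection also weighs team skill, cost, and roadmap:

```python
def pick_compute_model(services: int, needs_scaling: bool,
                       k8s_ready: bool, event_driven: bool) -> str:
    """Map workload traits to a compute model, mirroring the decision rules."""
    if event_driven:
        # Async fan-out, batch, or event pipeline.
        return "Lambda + SQS + EventBridge"
    if services <= 1 and not needs_scaling:
        # Single app, low traffic, one region.
        return "EC2 + Docker Compose"
    if k8s_ready:
        # Platform-ready team, polyglot runtime, multi-tenant isolation.
        return "Kubernetes"
    # Multiple services, scaling needs, no Kubernetes skill.
    return "ECS Fargate + ALB"

# Example: three services, scaling required, no platform team yet.
print(pick_compute_model(services=3, needs_scaling=True,
                         k8s_ready=False, event_driven=False))
```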
Docker Fundamentals
Images are immutable, content-addressed layers. Containers are processes isolated by namespaces and cgroups. Disciplined Dockerfile authorship controls image size, cache behaviour, and attack surface.
Dockerfile Checklist
- Multi-stage: compile/install in `builder`, copy only runtime artifacts to the final stage.
- Pin base images by version and digest (`node:22.11.0-slim@sha256:...`).
- Prefer distroless or `alpine` for runtime; target image ≤ 200 MB.
- Run as non-root (`USER nonroot` or a dedicated UID ≥ 10000). Set `WORKDIR`, `EXPOSE`, and `HEALTHCHECK` explicitly.
- Secrets via mounted files or orchestrator env — never baked in. `.dockerignore` excludes `.git`, `node_modules`, logs, fixtures, editor config.
- Order `COPY` from least-changing (manifests) to most-changing (source) to preserve layer caching.
Production Node.js Dockerfile
```dockerfile
# syntax=docker/dockerfile:1.7
FROM node:22.11.0-slim@sha256:<digest> AS builder
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm npm ci --include=dev
COPY . .
RUN npm run build && npm prune --omit=dev

FROM gcr.io/distroless/nodejs22-debian12:nonroot AS runtime
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder --chown=nonroot:nonroot /app/node_modules ./node_modules
COPY --from=builder --chown=nonroot:nonroot /app/dist ./dist
COPY --from=builder --chown=nonroot:nonroot /app/package.json ./
USER nonroot
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 CMD ["node", "dist/healthcheck.js"]
CMD ["dist/server.js"]
```

Docker Compose
One `docker-compose.yml` in the repo root mirrors production. Named volumes for stateful services; never bind-mount databases. Declare a `healthcheck` on every dependency and gate startup with `depends_on.condition: service_healthy`.

```yaml
name: saas-local
services:
  web:
    build: .
    env_file: .env
    ports: ["3000:3000"]
    depends_on:
      db: { condition: service_healthy }
      redis: { condition: service_healthy }
    healthcheck:
      test: ["CMD", "node", "dist/healthcheck.js"]
      interval: 30s
      timeout: 5s
      retries: 3
  db:
    image: postgres:16.4-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
      POSTGRES_DB: app
    volumes: ["db-data:/var/lib/postgresql/data"]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d app"]
      interval: 10s
      timeout: 3s
      retries: 5
    secrets: [db_password]
  redis:
    image: redis:7.4-alpine
    command: ["redis-server", "--appendonly", "yes"]
    volumes: ["redis-data:/data"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
volumes:
  db-data: {}
  redis-data: {}
secrets:
  db_password: { file: ./.secrets/db_password }
```

Commit `.env.example`, ignore `.env`, and provide env through the orchestrator in production. See `references/docker-compose-patterns.md` for the full template.
AWS Core Services
Compute
Instance families: `t3`/`t4g` burstable (dev, low-traffic), `m6i`/`m7i` balanced production, `c6i`/`c7i` CPU-bound, `r6i`/`r7i` memory-bound, `i4i` NVMe-heavy. Place production instances in private subnets; expose only via ALB/NLB. Build AMIs with Packer or EC2 Image Builder; no manual console edits.

```yaml
LaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateName: app-prod-lt
    LaunchTemplateData:
      ImageId: ami-0123456789abcdef0
      InstanceType: m6i.large
      IamInstanceProfile: { Name: app-prod-instance-profile }
      SecurityGroupIds: [sg-app]
      MetadataOptions: { HttpTokens: required, HttpEndpoint: enabled }
AppASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 10
    DesiredCapacity: 3
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120
    VPCZoneIdentifier: [subnet-priv-a, subnet-priv-b, subnet-priv-c]
    LaunchTemplate:
      LaunchTemplateId: !Ref LaunchTemplate
      Version: !GetAtt LaunchTemplate.LatestVersionNumber
    TargetGroupARNs: [!Ref AppTargetGroup]
```
Storage
Enable default encryption, block public access, and turn on versioning for any data you cannot reconstruct. Lifecycle: transition > 30 days to Standard-IA, > 90 days to Glacier Instant Retrieval, expire multipart uploads > 7 days. Use presigned URLs for customer uploads/downloads; never hand out credentials. Multipart upload threshold ≥ 100 MB; part size 8–16 MB.

```bash
aws s3 presign s3://app-prod-uploads/customer/42/invoice.pdf \
  --expires-in 900 --region af-south-1
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 16MB
```

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::app-prod-uploads", "arn:aws:s3:::app-prod-uploads/*"],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```
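The lifecycle rules above can be expressed as a bucket lifecycle configuration, a sketch intended for `aws s3api put-bucket-lifecycle-configuration` (the rule ID is illustrative):

```json
{
  "Rules": [
    {
      "ID": "tier-and-expire",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```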
Database
Multi-AZ for every production RDS MySQL/PostgreSQL; synchronous standby in a second AZ. Automated backups retention 7–35 days with PITR. Read replicas for read-heavy paths, never for durability. Parameter groups hold tunings; never edit defaults in place.

```bash
aws rds create-db-parameter-group --db-parameter-group-name app-pg16-prod \
  --db-parameter-group-family postgres16 --description "Prod PG16 params"
aws rds create-db-instance --db-instance-identifier app-prod \
  --engine postgres --engine-version 16.4 --db-instance-class db.m6i.large \
  --allocated-storage 200 --storage-type gp3 --storage-encrypted \
  --multi-az --backup-retention-period 14 --db-parameter-group-name app-pg16-prod \
  --monitoring-interval 60 --enable-performance-insights
```
Serverless
Lambda triggers: S3 object-created, SQS queue, API Gateway, EventBridge schedule, DynamoDB Streams. Cold-start mitigation: provisioned concurrency for latency-sensitive paths; a 5-minute EventBridge keep-warm rule as a low-cost fallback. Keep deployment package ≤ 50 MB zipped; container images only when native deps demand it.

```bash
aws lambda put-provisioned-concurrency-config \
  --function-name order-api --qualifier live \
  --provisioned-concurrent-executions 5
```
IAM
Roles, not users, for workloads — instance profiles on EC2, task roles on ECS. Policy statements scoped to specific ARNs and actions — no `*:*`. CI uses OIDC federation to assume role; no long-lived keys. MFA on every human account; root locked away with hardware MFA.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AppReadUploads",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::app-prod-uploads/*"
    },
    {
      "Sid": "AppReadSecrets",
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:af-south-1:111122223333:secret:app/prod/*"
    }
  ]
}
```
Networking
Design the VPC across ≥ 3 AZs for production, 2 for non-production. Allocate a /16 and carve /20 public and /20 private subnets per AZ. One NAT gateway per AZ in production — single-AZ NAT is a SPOF and cross-AZ data charges bite.
| Layer | CIDR example | Routing |
|---|---|---|
| Public subnets | 10.20.0.0/20 per AZ | IGW default route |
| Private app subnets | 10.20.32.0/20 per AZ | NAT gateway in same AZ |
| Private data subnets | 10.20.64.0/20 per AZ | No outbound route |
Security groups are stateful instance-level allow-lists — the primary tool. NACLs are stateless subnet-level deny/allow lists — use only for coarse boundaries (blocking known-bad CIDRs). Reserve ≥ /18 headroom for peering or Transit Gateway.
```bash
aws ec2 create-vpc --cidr-block 10.20.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ug-prod-vpc}]'
aws ec2 create-nat-gateway --subnet-id subnet-pub-a --allocation-id eipalloc-aaa
```
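The carving rule is easy to sanity-check with Python's `ipaddress` module; a minimal planner, assuming sequential tier allocation (so exact CIDRs differ from the example table above):

```python
import ipaddress

def plan_subnets(vpc_cidr: str, azs: list[str]) -> dict:
    """Carve one public, app, and data /20 per AZ; record leftover /20 count."""
    pool = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=20))
    plan = {}
    for tier in ("public", "app", "data"):
        for az in azs:
            plan[f"{tier}-{az}"] = pool.pop(0)
    plan["spare /20s"] = len(pool)
    return plan

plan = plan_subnets("10.20.0.0/16", ["a", "b", "c"])
# A /18 of headroom is four /20s; 16 - 9 allocated leaves 7, so this holds.
assert plan["spare /20s"] >= 4
print(plan["public-a"], plan["app-a"], plan["data-a"])
```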
Load Balancers
| Feature | ALB | NLB |
|---|---|---|
| Layer | 7 (HTTP/HTTPS/gRPC) | 4 (TCP/UDP/TLS) |
| Routing | Host, path, header, query | Port-based |
| TLS termination | At ALB | Passthrough or at NLB |
| Sticky sessions | Cookie-based | Source-IP flow hash |
| Use case | Web APIs, microservices | High-throughput TCP, static IPs, PrivateLink |
Health checks hit a dedicated `/healthz` path on a dedicated port when feasible; verify dependencies shallowly — not deeply, or cascading failures evict healthy targets.

```bash
aws elbv2 create-target-group --name app-tg-blue --protocol HTTP --port 3000 \
  --vpc-id vpc-0abc --health-check-path /healthz --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 --unhealthy-threshold-count 3 --matcher HttpCode=200
aws elbv2 create-listener --load-balancer-arn $ALB_ARN --protocol HTTPS --port 443 \
  --certificates CertificateArn=$ACM_ARN \
  --ssl-policy ELBSecurityPolicy-TLS13-1-2-2021-06 \
  --default-actions Type=forward,TargetGroupArn=$TG_BLUE
```
CDN
CloudFront or Cloudflare in front of every static asset and cacheable API response. Enable Origin Shield in a region close to the origin to cut origin fetches by 60–80%. Attach AWS WAF with the Managed Rules Core Rule Set plus Known Bad Inputs and IP-Reputation lists; add a rate-based rule at 2000 requests per 5 minutes per IP for unauthenticated endpoints.

```bash
aws cloudfront create-distribution --distribution-config file://cf-dist.json
aws wafv2 create-web-acl --name app-prod-waf --scope CLOUDFRONT --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=app-prod-waf \
  --rules file://waf-managed-rules.json
```

Invalidate surgically — never `invalidate /*` on every deploy; use versioned asset paths (`/static/v=<build-sha>/`) and cache-bust only HTML.
SSL/TLS Automation
- AWS ALB, CloudFront, API Gateway → ACM certificates: free, auto-renewed, DNS-validated via Route 53.
- VPS or single host → Certbot + Let's Encrypt with the installer's systemd timer; nightly cron only when systemd is unavailable.
- Kubernetes → `cert-manager` with a `ClusterIssuer` for Let's Encrypt ACME HTTP-01 or DNS-01.

```bash
aws acm request-certificate --domain-name app.example.co.ug \
  --subject-alternative-names "*.app.example.co.ug" \
  --validation-method DNS --key-algorithm RSA_2048
sudo certbot --nginx -d app.example.co.ug --deploy-hook "systemctl reload nginx"
kubectl apply -f cert-manager/letsencrypt-prod-issuer.yaml
```

TLS 1.2 minimum, prefer 1.3. Enable HSTS (`max-age=31536000; includeSubDomains; preload`) once the production cert path is stable.
- VPS或单主机 → Certbot + Let's Encrypt,使用安装程序的systemd定时器;仅当systemd不可用时使用夜间cron任务。
- Kubernetes → 搭配
cert-manager,使用Let's Encrypt ACME HTTP-01或DNS-01验证方式。ClusterIssuer
bash
aws acm request-certificate --domain-name app.example.co.ug \
--subject-alternative-names "*.app.example.co.ug" \
--validation-method DNS --key-algorithm RSA_2048
sudo certbot --nginx -d app.example.co.ug --deploy-hook "systemctl reload nginx"
kubectl apply -f cert-manager/letsencrypt-prod-issuer.yaml最低要求TLS 1.2,优先使用TLS 1.3。生产环境证书路径稳定后,启用HSTS 。
Auto-Scaling
Target tracking first, step scaling second, predictive third. Scale on request count per target and P95 latency — not CPU alone.

```bash
aws application-autoscaling put-scaling-policy --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount --resource-id service/app-cluster/app-svc \
  --policy-name tt-reqcount --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/alb-arn/tg-arn"
    },
    "ScaleOutCooldown": 60, "ScaleInCooldown": 300
  }'
```

- CPU target 70% for CPU-bound services; never below 40% (wastes capacity).
- Scheduled scaling for predictable load (EAT business hours 07:00–19:00).
- Predictive scaling requires ≥ 14 days of CloudWatch history and a regular daily/weekly pattern — otherwise predictions are noise.
- Warm pools for slow-booting AMIs (> 3 min boot).
Zero-Downtime Deployments
Blue-green via ALB target-group swap for stateful-client apps; ASG instance refresh for stateless fleets. Canary for risky changes (pull weight to zero to roll back); shadow for unproven services receiving mirrored traffic. Automatic rollback triggers on health-check failure, 5xx-rate regression > 0.5% over 5 min, or P95 latency regression beyond the SLO budget.

Blue-green procedure: register green with `app-tg-green`, wait for all targets `healthy` via `aws elbv2 describe-target-health`, then swap the listener:

```bash
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$TG_GREEN
```

Hold blue for 30 minutes as a hot rollback target; deregister only after error-rate and latency SLOs hold. Rolling update via ASG instance refresh:

```bash
aws autoscaling start-instance-refresh --auto-scaling-group-name app-prod-asg \
  --strategy Rolling --preferences '{
    "MinHealthyPercentage": 90, "InstanceWarmup": 180,
    "CheckpointPercentages": [25, 50, 100], "CheckpointDelay": 600
  }'
```

Rollback: re-point the listener to `app-tg-blue` (blue-green), or `aws autoscaling cancel-instance-refresh` and roll forward with the prior Launch Template version. Schema migrations must be backwards-compatible across two application versions (expand → migrate → contract). Every deploy writes a signed record: who, what, when, artifact digest.
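The automatic-rollback triggers can be encoded as a deploy-gate check; a sketch using the thresholds above, with CloudWatch metric plumbing left out:

```python
def should_roll_back(errors_5xx: int, requests: int,
                     p95_ms: float, slo_p95_ms: float,
                     healthy_targets: int, total_targets: int) -> bool:
    """True when any automatic rollback trigger fires over the 5-minute window."""
    if healthy_targets < total_targets:
        return True  # health-check failure
    if requests and errors_5xx / requests > 0.005:
        return True  # 5xx-rate regression > 0.5%
    return p95_ms > slo_p95_ms  # latency beyond SLO budget

# 0.6% 5xx over the window trips the gate; a clean deploy does not.
assert should_roll_back(60, 10_000, 180.0, 250.0, 3, 3)
assert not should_roll_back(10, 10_000, 180.0, 250.0, 3, 3)
```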
Backup & Disaster Recovery
Define RTO (how fast to recover) and RPO (how much data loss is tolerable) before picking tools. Typical production SaaS targets RTO ≤ 4 h, RPO ≤ 15 min.

- RDS: automated backups retention 7–35 days with PITR; weekly manual snapshots retained 90 days; cross-region snapshot copy to `eu-west-1` as a sovereignty-preserving DR site.
- S3: versioning on every data bucket; lifecycle moves non-current versions to Glacier Deep Archive after 60 days; Cross-Region Replication for critical buckets.
- EBS: daily snapshots via AWS Backup with a 30-day retention plan.

```bash
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:af-south-1:111122223333:snapshot:app-prod-2026-04-15 \
  --target-db-snapshot-identifier app-prod-2026-04-15-dr \
  --kms-key-id alias/rds-dr --source-region af-south-1 --region eu-west-1
aws s3api put-bucket-versioning --bucket app-prod-uploads --versioning-configuration Status=Enabled
```

Rehearse restore quarterly — an untested backup is a hypothesis, not a backup.
Cost Optimisation
- Reserved Instances or Savings Plans for steady baseline (70–80% of average compute); on-demand for burst. Prefer Compute Savings Plans (1y no-upfront starting posture; 3y only when headcount and roadmap are certain) — they apply across EC2, Fargate, Lambda.
- Spot for non-critical async workers and CI runners with a graceful shutdown handler for the 2-minute interruption notice.
- S3 Intelligent-Tiering on buckets with unpredictable access; tag every resource with `Environment`, `Team`, `CostCenter`, and `Project`, and activate these as cost-allocation tags in Billing.
- Cost Explorer, Cost Anomaly Detection, and per-environment budgets turned on from day one.

```bash
aws ce list-cost-allocation-tags --status Active --region us-east-1
aws budgets create-budget --account-id 111122223333 --budget '{
  "BudgetName": "ug-prod-monthly",
  "BudgetLimit": { "Amount": "5000", "Unit": "USD" },
  "TimeUnit": "MONTHLY", "BudgetType": "COST",
  "CostFilters": { "TagKeyValue": ["user:Environment$prod"] }
}'
```
Multi-Region Considerations
- Latency from East Africa: `af-south-1` ~ 30 ms; `eu-west-1` ~ 150 ms; `us-east-1` ~ 220 ms. Place user-facing tiers in `af-south-1` whenever available.
- Data residency: Uganda DPPA 2019 requires personal data of Ugandan data subjects to be processed in a jurisdiction with adequate protection; `af-south-1` with KMS customer-managed keys is the low-friction default. Log the data-flow and cross-border transfer basis in `_context/compliance.md`.
- Replication: active-passive (primary `af-south-1`, warm standby `eu-west-1`) is the common starting posture; active-active only when conflict resolution is designed in (DynamoDB Global Tables, Aurora Global Database with write forwarding). Use Route 53 health-checked failover records for DR, not client-side retry loops.
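The latency-driven placement rule above can be sketched as a tiny preference helper — the function name and calling convention are illustrative, not part of any AWS tooling:

```bash
# Hypothetical helper: return the first preferred region that is currently usable.
# Preference order mirrors the latency note above: af-south-1, then eu-west-1, then us-east-1.
pick_region() {
  local available="$1"; shift   # space-separated list of usable regions
  local r
  for r in "$@"; do             # remaining args: preference order
    case " $available " in
      *" $r "*) echo "$r"; return 0 ;;
    esac
  done
  return 1                      # none of the preferred regions is usable
}

# If af-south-1 is unavailable, fall back to eu-west-1:
pick_region "eu-west-1 us-east-1" af-south-1 eu-west-1 us-east-1   # → eu-west-1
```

The same preference order belongs in Route 53 failover records, so DNS — not application code — decides where traffic lands.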
```bash
aws route53 create-health-check --caller-reference "ug-app-$(date +%s)" --health-check-config file://hc.json
aws dynamodb update-table --table-name orders --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'
```
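The `create-health-check` call reads `hc.json`, which the text never shows. A plausible shape — the domain and path are placeholders, and the interval/threshold values are common Route 53 settings, not mandated ones:

```bash
# Write an illustrative hc.json for the create-health-check call above.
# app.example.com and /healthz are placeholders; 30 s / 3 failures are common values.
cat > hc.json <<'EOF'
{
  "Type": "HTTPS",
  "FullyQualifiedDomainName": "app.example.com",
  "ResourcePath": "/healthz",
  "RequestInterval": 30,
  "FailureThreshold": 3
}
EOF
```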
Security Baseline
Enable these on the management account and every member account on day one. All commands are idempotent — safe to re-run.
```bash
aws cloudtrail create-trail --name org-trail --s3-bucket-name org-cloudtrail-logs \
  --is-multi-region-trail --is-organization-trail --enable-log-file-validation \
  --kms-key-id alias/cloudtrail
aws cloudtrail start-logging --name org-trail
aws s3api put-bucket-versioning --bucket org-cloudtrail-logs --versioning-configuration Status=Enabled
aws configservice start-configuration-recorder --configuration-recorder-name default
aws guardduty create-detector --enable --finding-publishing-frequency FIFTEEN_MINUTES
aws securityhub enable-security-hub --enable-default-standards
aws accessanalyzer create-analyzer --analyzer-name org-analyzer --type ORGANIZATION
```
- CloudTrail: all regions, S3 bucket with versioning, log-file validation, KMS-encrypted.
- AWS Config: enable the AWS Foundational Security Best Practices conformance pack.
- GuardDuty: detector in every region with S3 and EKS protection on.
- Security Hub: aggregate findings in a delegated admin account; resolve Critical/High within the team SLO.
- IAM Access Analyzer: organization-level, reviewed weekly.
Review Checklist
- Workload classified; compute model justified in writing.
- VPC spans ≥ 2 AZs; data stores Multi-AZ.
- No credentials in images, committed files, or Git history; IAM uses roles + OIDC, not long-lived keys.
- Deployment pattern chosen with rollback runbook validated; TLS, CDN, WAF posture documented.
- Auto-scaling signal is request- or latency-driven, not CPU-only.
- CloudTrail, Config, GuardDuty, Security Hub enabled across all regions; backups tested with a quarterly restore rehearsal (RTO/RPO documented); billing alerts active, Cost Explorer tags applied, Spot use paired with shutdown handling.
Platform Notes
- Claude Code: the `aws` CLI and `docker` CLI are the primary surface. Configure profiles with `aws configure sso`; use named profiles per environment.
- Codex: treat every command as a patch candidate; keep commands in shell blocks so they stay portable.
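Per-environment named profiles can be laid out as below — a sketch written to a local file for inspection; the account ID, role names, and start URL are placeholders, not real values:

```bash
# Write an example ~/.aws/config layout to a local file; merge it into
# ~/.aws/config by hand. Every identifier below is a placeholder.
cat > example-aws-config <<'EOF'
[profile ug-dev]
sso_session = org
sso_account_id = 111122223333
sso_role_name = DeveloperAccess
region = af-south-1

[profile ug-prod]
sso_session = org
sso_account_id = 111122223333
sso_role_name = ReadOnlyPlusDeploy
region = af-south-1

[sso-session org]
sso_start_url = https://example.awsapps.com/start
sso_region = eu-west-1
sso_registration_scopes = sso:account:access
EOF
```

With this layout, `aws --profile ug-prod …` selects the production role, and `aws sso login --sso-session org` refreshes credentials for all profiles sharing the session.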
References
- references/aws-core-services.md: EC2, S3, RDS, IAM, ALB, ASG, CloudFront CLI recipes.
- references/docker-compose-patterns.md: Full local-parity stack template.
- references/deployment-patterns.md: Blue-green and canary runbooks with rollback steps.
- AWS Well-Architected Framework: aws.amazon.com/architecture/well-architected
- Docker Deep Dive — Nigel Poulton (reading programme, Phase 01 priority 1).