cloud-architecture
Cloud Architecture
Use When
- Use when designing cloud deployments, Dockerising applications, laying out AWS or GCP environments, choosing a deployment pattern, or moving a workload from a single VM to a resilient multi-AZ topology.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to `cloud-architecture` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve; load `references` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Correctness | Cloud topology decision record | Markdown doc per | |
| Security | Cloud account hardening checklist | Markdown doc covering root-account, IAM, network, and logging baseline | |
References
- Use the `references/` directory for deep detail after reading the core workflow below.
Load Order
- Load `world-class-engineering` for the production bar.
- Load `system-architecture-design` for decomposition and contracts.
- Load this skill for the cloud runtime shape.
- Pair with `cicd-pipelines` for delivery, `cicd-devsecops` for gate policy, `observability-monitoring` for telemetry, `deployment-release-engineering` for rollout, and `reliability-engineering` for failure design.
Executable Outputs
For meaningful cloud architecture work produce: workload classification (stateless, stateful, async, batch, scheduled), chosen compute model with rationale, VPC + subnet + routing layout across AZs, Dockerfile (multi-stage, pinned base), `docker-compose.yml` mirroring production, IAM role inventory with least-privilege policies, deployment pattern choice and rollback runbook, cost posture (reserved/on-demand/spot split, Savings Plan assessment), and CDN/TLS/WAF/auto-scaling configuration.
Cloud Provider Selection
East African SaaS workloads (Uganda, Kenya, Tanzania) weigh four dimensions: latency to users, data-residency obligations under Uganda DPPA 2019, support hours overlapping EAT (UTC+3), and price-per-workload.
| Dimension | AWS | GCP | Azure |
|---|---|---|---|
| Closest region | af-south-1 | None in Africa (no ZA region GA) | southafricanorth |
| Data-residency fit | Strong (af-south-1 + KMS) | Weak (no ZA region for many services) | Strong (ZA North + Customer Lockbox) |
| Support in EAT | 24/7 Business; EMEA TAM overlap | 24/7 Standard | 24/7 ProDirect; ZA partners |
| Managed services breadth | Widest | Data/ML led | Microsoft-stack integration |
Default to AWS `af-south-1` for Uganda workloads with S-tier DPPA 2019 data; use Azure `southafricanorth` only for .NET-heavy stacks with an existing EA licence; avoid GCP as primary for DPPA-scoped data until a ZA region is GA.

```bash
aws configure set region af-south-1 --profile ug-prod
aws ec2 describe-availability-zones --region af-south-1 --query "AvailabilityZones[].ZoneName"
```

Compute Model Decision Rules
- Single app, low traffic, one region → EC2 + Docker Compose, backed by RDS Multi-AZ and S3.
- Multiple services, scaling needs, no Kubernetes skill → ECS Fargate with ALB.
- Multiple services, platform-ready team, polyglot runtime, multi-tenant isolation → Kubernetes (defer to `kubernetes-platform`).
- Async fan-out, batch, or event pipeline → Lambda + SQS + EventBridge, with state in DynamoDB or RDS.

Kubernetes is a commitment, not a default.
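The rules above can be sketched as a small decision helper; the function name and trait flags are illustrative, and real selection also weighs team skill, cost, and roadmap:

```python
def pick_compute_model(services: int, needs_scaling: bool,
                       k8s_ready: bool, event_driven: bool) -> str:
    """Map workload traits to a compute model, mirroring the decision rules."""
    if event_driven:
        # Async fan-out, batch, or event pipeline.
        return "Lambda + SQS + EventBridge"
    if services <= 1 and not needs_scaling:
        # Single app, low traffic, one region.
        return "EC2 + Docker Compose"
    if k8s_ready:
        # Platform-ready team, polyglot runtime, multi-tenant isolation.
        return "Kubernetes"
    # Multiple services, scaling needs, no Kubernetes skill.
    return "ECS Fargate + ALB"

# Example: three services, scaling required, no platform team yet.
print(pick_compute_model(services=3, needs_scaling=True,
                         k8s_ready=False, event_driven=False))
```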
Docker Fundamentals
Images are immutable, content-addressed layers. Containers are processes isolated by namespaces and cgroups. Disciplined Dockerfile authorship controls image size, cache behaviour, and attack surface.
Dockerfile Checklist
- Multi-stage: compile/install in `builder`, copy only runtime artifacts to the final stage.
- Pin base images by version and digest (`node:22.11.0-slim@sha256:...`).
- Prefer distroless or `alpine` for runtime; target image ≤ 200 MB.
- Run as non-root (`USER nonroot` or a dedicated UID ≥ 10000). Set `WORKDIR`, `EXPOSE`, and `HEALTHCHECK` explicitly.
- Secrets via mounted files or orchestrator env — never baked in. `.dockerignore` excludes `.git`, `node_modules`, logs, fixtures, editor config.
- Order `COPY` from least-changing (manifests) to most-changing (source) to preserve layer caching.
Production Node.js Dockerfile
```dockerfile
# syntax=docker/dockerfile:1.7
FROM node:22.11.0-slim@sha256:<digest> AS builder
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm npm ci --include=dev
COPY . .
RUN npm run build && npm prune --omit=dev

FROM gcr.io/distroless/nodejs22-debian12:nonroot AS runtime
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder --chown=nonroot:nonroot /app/node_modules ./node_modules
COPY --from=builder --chown=nonroot:nonroot /app/dist ./dist
COPY --from=builder --chown=nonroot:nonroot /app/package.json ./
USER nonroot
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 CMD ["node", "dist/healthcheck.js"]
CMD ["dist/server.js"]
```

Docker Compose
One `docker-compose.yml` in the repo root mirrors production. Named volumes for stateful services; never bind-mount databases. Declare a `healthcheck` on every dependency and gate startup with `depends_on.condition: service_healthy`.

```yaml
name: saas-local
services:
  web:
    build: .
    env_file: .env
    ports: ["3000:3000"]
    depends_on:
      db: { condition: service_healthy }
      redis: { condition: service_healthy }
    healthcheck:
      test: ["CMD", "node", "dist/healthcheck.js"]
      interval: 30s
      timeout: 5s
      retries: 3
  db:
    image: postgres:16.4-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
      POSTGRES_DB: app
    volumes: ["db-data:/var/lib/postgresql/data"]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d app"]
      interval: 10s
      timeout: 3s
      retries: 5
    secrets: [db_password]
  redis:
    image: redis:7.4-alpine
    command: ["redis-server", "--appendonly", "yes"]
    volumes: ["redis-data:/data"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
volumes:
  db-data: {}
  redis-data: {}
secrets:
  db_password: { file: ./.secrets/db_password }
```

Commit `.env.example`, ignore `.env`, and provide env through the orchestrator in production. See `references/docker-compose-patterns.md` for the full template.
AWS Core Services
Compute
Instance families: `t3`/`t4g` burstable (dev, low-traffic), `m6i`/`m7i` balanced production, `c6i`/`c7i` CPU-bound, `r6i`/`r7i` memory-bound, `i4i` NVMe-heavy. Place production instances in private subnets; expose only via ALB/NLB. Build AMIs with Packer or EC2 Image Builder; no manual console edits.

```yaml
LaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateName: app-prod-lt
    LaunchTemplateData:
      ImageId: ami-0123456789abcdef0
      InstanceType: m6i.large
      IamInstanceProfile: { Name: app-prod-instance-profile }
      SecurityGroupIds: [sg-app]
      MetadataOptions: { HttpTokens: required, HttpEndpoint: enabled }
AppASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 10
    DesiredCapacity: 3
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120
    VPCZoneIdentifier: [subnet-priv-a, subnet-priv-b, subnet-priv-c]
    LaunchTemplate:
      LaunchTemplateId: !Ref LaunchTemplate
      Version: !GetAtt LaunchTemplate.LatestVersionNumber
    TargetGroupARNs: [!Ref AppTargetGroup]
```
Storage
Enable default encryption, block public access, and turn on versioning for any data you cannot reconstruct. Lifecycle: transition > 30 days to Standard-IA, > 90 days to Glacier Instant Retrieval, expire multipart uploads > 7 days. Use presigned URLs for customer uploads/downloads; never hand out credentials. Multipart upload threshold ≥ 100 MB; part size 8–16 MB.

```bash
aws s3 presign s3://app-prod-uploads/customer/42/invoice.pdf \
  --expires-in 900 --region af-south-1
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 16MB
```

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::app-prod-uploads", "arn:aws:s3:::app-prod-uploads/*"],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```
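The lifecycle rules above can be expressed as a bucket lifecycle configuration, a sketch intended for `aws s3api put-bucket-lifecycle-configuration` (the rule ID is illustrative):

```json
{
  "Rules": [
    {
      "ID": "tier-and-expire",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```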
Database
Multi-AZ for every production RDS MySQL/PostgreSQL; synchronous standby in a second AZ. Automated backups retention 7–35 days with PITR. Read replicas for read-heavy paths, never for durability. Parameter groups hold tunings; never edit defaults in place.

```bash
aws rds create-db-parameter-group --db-parameter-group-name app-pg16-prod \
  --db-parameter-group-family postgres16 --description "Prod PG16 params"
aws rds create-db-instance --db-instance-identifier app-prod \
  --engine postgres --engine-version 16.4 --db-instance-class db.m6i.large \
  --allocated-storage 200 --storage-type gp3 --storage-encrypted \
  --multi-az --backup-retention-period 14 --db-parameter-group-name app-pg16-prod \
  --monitoring-interval 60 --enable-performance-insights
```
Serverless
Lambda triggers: S3 object-created, SQS queue, API Gateway, EventBridge schedule, DynamoDB Streams. Cold-start mitigation: provisioned concurrency for latency-sensitive paths; a 5-minute EventBridge keep-warm rule as a low-cost fallback. Keep deployment package ≤ 50 MB zipped; container images only when native deps demand it.

```bash
aws lambda put-provisioned-concurrency-config \
  --function-name order-api --qualifier live \
  --provisioned-concurrent-executions 5
```
IAM
Roles, not users, for workloads — instance profiles on EC2, task roles on ECS. Policy statements scoped to specific ARNs and actions — no `*:*`. CI uses OIDC federation to assume role; no long-lived keys. MFA on every human account; root locked away with hardware MFA.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AppReadUploads",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::app-prod-uploads/*"
    },
    {
      "Sid": "AppReadSecrets",
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:af-south-1:111122223333:secret:app/prod/*"
    }
  ]
}
```
Networking
Design the VPC across ≥ 3 AZs for production, 2 for non-production. Allocate a /16 and carve /20 public and /20 private subnets per AZ. One NAT gateway per AZ in production — single-AZ NAT is a SPOF and cross-AZ data charges bite.
| Layer | CIDR example | Routing |
|---|---|---|
| Public subnets | 10.20.0.0/20 per AZ | IGW default route |
| Private app subnets | 10.20.32.0/20 per AZ | NAT gateway in same AZ |
| Private data subnets | 10.20.64.0/20 per AZ | No outbound route |
Security groups are stateful instance-level allow-lists — the primary tool. NACLs are stateless subnet-level deny/allow lists — use only for coarse boundaries (blocking known-bad CIDRs). Reserve ≥ /18 headroom for peering or Transit Gateway.
```bash
aws ec2 create-vpc --cidr-block 10.20.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ug-prod-vpc}]'
aws ec2 create-nat-gateway --subnet-id subnet-pub-a --allocation-id eipalloc-aaa
```
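The carving rule is easy to sanity-check with Python's `ipaddress` module; a minimal planner, assuming sequential tier allocation (so exact CIDRs differ from the example table above):

```python
import ipaddress

def plan_subnets(vpc_cidr: str, azs: list[str]) -> dict:
    """Carve one public, app, and data /20 per AZ; record leftover /20 count."""
    pool = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=20))
    plan = {}
    for tier in ("public", "app", "data"):
        for az in azs:
            plan[f"{tier}-{az}"] = pool.pop(0)
    plan["spare /20s"] = len(pool)
    return plan

plan = plan_subnets("10.20.0.0/16", ["a", "b", "c"])
# A /18 of headroom is four /20s; 16 - 9 allocated leaves 7, so this holds.
assert plan["spare /20s"] >= 4
print(plan["public-a"], plan["app-a"], plan["data-a"])
```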
Load Balancers
| Feature | ALB | NLB |
|---|---|---|
| Layer | 7 (HTTP/HTTPS/gRPC) | 4 (TCP/UDP/TLS) |
| Routing | Host, path, header, query | Port-based |
| TLS termination | At ALB | Passthrough or at NLB |
| Sticky sessions | Cookie-based | Source-IP flow hash |
| Use case | Web APIs, microservices | High-throughput TCP, static IPs, PrivateLink |
Health checks hit a dedicated `/healthz` path on a dedicated port when feasible; verify dependencies shallowly — not deeply, or cascading failures evict healthy targets.

```bash
aws elbv2 create-target-group --name app-tg-blue --protocol HTTP --port 3000 \
  --vpc-id vpc-0abc --health-check-path /healthz --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 --unhealthy-threshold-count 3 --matcher HttpCode=200
aws elbv2 create-listener --load-balancer-arn $ALB_ARN --protocol HTTPS --port 443 \
  --certificates CertificateArn=$ACM_ARN \
  --ssl-policy ELBSecurityPolicy-TLS13-1-2-2021-06 \
  --default-actions Type=forward,TargetGroupArn=$TG_BLUE
```
CDN
CloudFront or Cloudflare in front of every static asset and cacheable API response. Enable Origin Shield in a region close to the origin to cut origin fetches by 60–80%. Attach AWS WAF with the Managed Rules Core Rule Set plus Known Bad Inputs and IP-Reputation lists; add a rate-based rule at 2000 requests per 5 minutes per IP for unauthenticated endpoints.

```bash
aws cloudfront create-distribution --distribution-config file://cf-dist.json
aws wafv2 create-web-acl --name app-prod-waf --scope CLOUDFRONT --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=app-prod-waf \
  --rules file://waf-managed-rules.json
```

Invalidate surgically — never `invalidate /*` on every deploy; use versioned asset paths (`/static/v=<build-sha>/`) and cache-bust only HTML.
SSL/TLS Automation
- AWS ALB, CloudFront, API Gateway → ACM certificates: free, auto-renewed, DNS-validated via Route 53.
- VPS or single host → Certbot + Let's Encrypt with the installer's systemd timer; nightly cron only when systemd is unavailable.
- Kubernetes → `cert-manager` with a `ClusterIssuer` for Let's Encrypt ACME HTTP-01 or DNS-01.

```bash
aws acm request-certificate --domain-name app.example.co.ug \
  --subject-alternative-names "*.app.example.co.ug" \
  --validation-method DNS --key-algorithm RSA_2048
sudo certbot --nginx -d app.example.co.ug --deploy-hook "systemctl reload nginx"
kubectl apply -f cert-manager/letsencrypt-prod-issuer.yaml
```

TLS 1.2 minimum, prefer 1.3. Enable HSTS (`max-age=31536000; includeSubDomains; preload`) once the production cert path is stable.
- VPS或单主机 → Certbot + Let's Encrypt,使用安装程序的systemd定时器;仅当systemd不可用时使用夜间cron任务。
- Kubernetes → 搭配
cert-manager,使用Let's Encrypt ACME HTTP-01或DNS-01验证方式。ClusterIssuer
bash
aws acm request-certificate --domain-name app.example.co.ug \
--subject-alternative-names "*.app.example.co.ug" \
--validation-method DNS --key-algorithm RSA_2048
sudo certbot --nginx -d app.example.co.ug --deploy-hook "systemctl reload nginx"
kubectl apply -f cert-manager/letsencrypt-prod-issuer.yaml最低要求TLS 1.2,优先使用TLS 1.3。生产环境证书路径稳定后,启用HSTS 。
Auto-Scaling
Target tracking first, step scaling second, predictive third. Scale on request count per target and P95 latency — not CPU alone.

```bash
aws application-autoscaling put-scaling-policy --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount --resource-id service/app-cluster/app-svc \
  --policy-name tt-reqcount --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/alb-arn/tg-arn"
    },
    "ScaleOutCooldown": 60, "ScaleInCooldown": 300
  }'
```

- CPU target 70% for CPU-bound services; never below 40% (wastes capacity).
- Scheduled scaling for predictable load (EAT business hours 07:00–19:00).
- Predictive scaling requires ≥ 14 days of CloudWatch history and a regular daily/weekly pattern — otherwise predictions are noise.
- Warm pools for slow-booting AMIs (> 3 min boot).
Zero-Downtime Deployments
Blue-green via ALB target-group swap for stateful-client apps; ASG instance refresh for stateless fleets. Canary for risky changes (pull weight to zero to roll back); shadow for unproven services receiving mirrored traffic. Automatic rollback triggers on health-check failure, 5xx-rate regression > 0.5% over 5 min, or P95 latency regression beyond the SLO budget.

Blue-green procedure: register green with `app-tg-green`, wait for all targets `healthy` via `aws elbv2 describe-target-health`, then swap the listener:

```bash
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$TG_GREEN
```

Hold blue for 30 minutes as a hot rollback target; deregister only after error-rate and latency SLOs hold. Rolling update via ASG instance refresh:

```bash
aws autoscaling start-instance-refresh --auto-scaling-group-name app-prod-asg \
  --strategy Rolling --preferences '{
    "MinHealthyPercentage": 90, "InstanceWarmup": 180,
    "CheckpointPercentages": [25, 50, 100], "CheckpointDelay": 600
  }'
```

Rollback: re-point the listener to `app-tg-blue` (blue-green), or `aws autoscaling cancel-instance-refresh` and roll forward with the prior Launch Template version. Schema migrations must be backwards-compatible across two application versions (expand → migrate → contract). Every deploy writes a signed record: who, what, when, artifact digest.
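The automatic-rollback triggers can be encoded as a deploy-gate check; a sketch using the thresholds above, with CloudWatch metric plumbing left out:

```python
def should_roll_back(errors_5xx: int, requests: int,
                     p95_ms: float, slo_p95_ms: float,
                     healthy_targets: int, total_targets: int) -> bool:
    """True when any automatic rollback trigger fires over the 5-minute window."""
    if healthy_targets < total_targets:
        return True  # health-check failure
    if requests and errors_5xx / requests > 0.005:
        return True  # 5xx-rate regression > 0.5%
    return p95_ms > slo_p95_ms  # latency beyond SLO budget

# 0.6% 5xx over the window trips the gate; a clean deploy does not.
assert should_roll_back(60, 10_000, 180.0, 250.0, 3, 3)
assert not should_roll_back(10, 10_000, 180.0, 250.0, 3, 3)
```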
Backup & Disaster Recovery
Define RTO (how fast to recover) and RPO (how much data loss is tolerable) before picking tools. Typical production SaaS targets RTO ≤ 4 h, RPO ≤ 15 min.

- RDS: automated backups retention 7–35 days with PITR; weekly manual snapshots retained 90 days; cross-region snapshot copy to `eu-west-1` as a sovereignty-preserving DR site.
- S3: versioning on every data bucket; lifecycle moves non-current versions to Glacier Deep Archive after 60 days; Cross-Region Replication for critical buckets.
- EBS: daily snapshots via AWS Backup with a 30-day retention plan.

```bash
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:af-south-1:111122223333:snapshot:app-prod-2026-04-15 \
  --target-db-snapshot-identifier app-prod-2026-04-15-dr \
  --kms-key-id alias/rds-dr --source-region af-south-1 --region eu-west-1
aws s3api put-bucket-versioning --bucket app-prod-uploads --versioning-configuration Status=Enabled
```

Rehearse restore quarterly — an untested backup is a hypothesis, not a backup.
Cost Optimisation
- Reserved Instances or Savings Plans for steady baseline (70–80% of average compute); on-demand for burst. Prefer Compute Savings Plans (1y no-upfront starting posture; 3y only when headcount and roadmap are certain) — they apply across EC2, Fargate, Lambda.
- Spot for non-critical async workers and CI runners with a graceful shutdown handler for the 2-minute interruption notice.
- S3 Intelligent-Tiering on buckets with unpredictable access; tag every resource with `Environment`, `Team`, `CostCenter`, and `Project`, and activate these as cost-allocation tags in Billing.
- Cost Explorer, Cost Anomaly Detection, and per-environment budgets turned on from day one.

```bash
aws ce list-cost-allocation-tags --status Active --region us-east-1
aws budgets create-budget --account-id 111122223333 --budget '{
  "BudgetName": "ug-prod-monthly",
  "BudgetLimit": { "Amount": "5000", "Unit": "USD" },
  "TimeUnit": "MONTHLY", "BudgetType": "COST",
  "CostFilters": { "TagKeyValue": ["user:Environment$prod"] }
}'
```
Multi-Region Considerations
- Latency from East Africa: `af-south-1` ~ 30 ms; `eu-west-1` ~ 150 ms; `us-east-1` ~ 220 ms. Place user-facing tiers in `af-south-1` whenever available.
- Data residency: Uganda DPPA 2019 requires personal data of Ugandan data subjects to be processed in a jurisdiction with adequate protection; `af-south-1` with KMS customer-managed keys is the low-friction default. Log the data-flow and cross-border transfer basis in `_context/compliance.md`.
- Replication: active-passive (primary `af-south-1`, warm standby `eu-west-1`) is the common starting posture; active-active only when conflict resolution is designed in (DynamoDB Global Tables, Aurora Global Database with write forwarding). Use Route 53 health-checked failover records for DR, not client-side retry loops.
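The latency-driven placement rule above can be sketched as a tiny preference helper — the function name and calling convention are illustrative, not part of any AWS tooling:

```bash
# Hypothetical helper: return the first preferred region that is currently usable.
# Preference order mirrors the latency note above: af-south-1, then eu-west-1, then us-east-1.
pick_region() {
  local available="$1"; shift   # space-separated list of usable regions
  local r
  for r in "$@"; do             # remaining args: preference order
    case " $available " in
      *" $r "*) echo "$r"; return 0 ;;
    esac
  done
  return 1                      # none of the preferred regions is usable
}

# If af-south-1 is unavailable, fall back to eu-west-1:
pick_region "eu-west-1 us-east-1" af-south-1 eu-west-1 us-east-1   # → eu-west-1
```

The same preference order belongs in Route 53 failover records, so DNS — not application code — decides where traffic lands.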
```bash
aws route53 create-health-check --caller-reference "ug-app-$(date +%s)" --health-check-config file://hc.json
aws dynamodb update-table --table-name orders --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'
```
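The `create-health-check` call reads `hc.json`, which the text never shows. A plausible shape — the domain and path are placeholders, and the interval/threshold values are common Route 53 settings, not mandated ones:

```bash
# Write an illustrative hc.json for the create-health-check call above.
# app.example.com and /healthz are placeholders; 30 s / 3 failures are common values.
cat > hc.json <<'EOF'
{
  "Type": "HTTPS",
  "FullyQualifiedDomainName": "app.example.com",
  "ResourcePath": "/healthz",
  "RequestInterval": 30,
  "FailureThreshold": 3
}
EOF
```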
Security Baseline
Enable these on the management account and every member account on day one. All commands are idempotent — safe to re-run.
```bash
aws cloudtrail create-trail --name org-trail --s3-bucket-name org-cloudtrail-logs \
  --is-multi-region-trail --is-organization-trail --enable-log-file-validation \
  --kms-key-id alias/cloudtrail
aws cloudtrail start-logging --name org-trail
aws s3api put-bucket-versioning --bucket org-cloudtrail-logs --versioning-configuration Status=Enabled
aws configservice start-configuration-recorder --configuration-recorder-name default
aws guardduty create-detector --enable --finding-publishing-frequency FIFTEEN_MINUTES
aws securityhub enable-security-hub --enable-default-standards
aws accessanalyzer create-analyzer --analyzer-name org-analyzer --type ORGANIZATION
```
- CloudTrail: all regions, S3 bucket with versioning, log-file validation, KMS-encrypted.
- AWS Config: enable the AWS Foundational Security Best Practices conformance pack.
- GuardDuty: detector in every region with S3 and EKS protection on.
- Security Hub: aggregate findings in a delegated admin account; resolve Critical/High within the team SLO.
- IAM Access Analyzer: organization-level, reviewed weekly.
Review Checklist
- Workload classified; compute model justified in writing.
- VPC spans ≥ 2 AZs; data stores Multi-AZ.
- No credentials in images, committed files, or Git history; IAM uses roles + OIDC, not long-lived keys.
- Deployment pattern chosen with rollback runbook validated; TLS, CDN, WAF posture documented.
- Auto-scaling signal is request- or latency-driven, not CPU-only.
- CloudTrail, Config, GuardDuty, Security Hub enabled across all regions; backups tested with a quarterly restore rehearsal (RTO/RPO documented); billing alerts active, Cost Explorer tags applied, Spot use paired with shutdown handling.
Platform Notes
- Claude Code: the `aws` CLI and `docker` CLI are the primary surface. Configure profiles with `aws configure sso`; use named profiles per environment.
- Codex: treat every command as a patch candidate; keep commands in shell blocks so they stay portable.
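Per-environment named profiles can be laid out as below — a sketch written to a local file for inspection; the account ID, role names, and start URL are placeholders, not real values:

```bash
# Write an example ~/.aws/config layout to a local file; merge it into
# ~/.aws/config by hand. Every identifier below is a placeholder.
cat > example-aws-config <<'EOF'
[profile ug-dev]
sso_session = org
sso_account_id = 111122223333
sso_role_name = DeveloperAccess
region = af-south-1

[profile ug-prod]
sso_session = org
sso_account_id = 111122223333
sso_role_name = ReadOnlyPlusDeploy
region = af-south-1

[sso-session org]
sso_start_url = https://example.awsapps.com/start
sso_region = eu-west-1
sso_registration_scopes = sso:account:access
EOF
```

With this layout, `aws --profile ug-prod …` selects the production role, and `aws sso login --sso-session org` refreshes credentials for all profiles sharing the session.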
References
- references/aws-core-services.md: EC2, S3, RDS, IAM, ALB, ASG, CloudFront CLI recipes.
- references/docker-compose-patterns.md: Full local-parity stack template.
- references/deployment-patterns.md: Blue-green and canary runbooks with rollback steps.
- AWS Well-Architected Framework: aws.amazon.com/architecture/well-architected
- Docker Deep Dive — Nigel Poulton (reading programme, Phase 01 priority 1).