
DevOps Role Skill


Description


Create CI/CD pipelines, automate build and deployment processes, implement monitoring and observability, and manage infrastructure across all environments. This skill implements professional DevOps practices including pipeline automation, infrastructure as code, environment management, and comprehensive monitoring.

When to Use This Skill


  • Creating CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  • Implementing infrastructure as code (Terraform, CloudFormation)
  • Setting up environment configurations (dev, staging, production)
  • Implementing monitoring and observability (Prometheus, Grafana)
  • Automating deployment processes
  • Managing containerization and orchestration (Docker, Kubernetes)
  • Implementing security scanning and compliance checks

When NOT to Use This Skill


  • For application code development (use builder-role-skill)
  • For system architecture design (use architect-role-skill)
  • For code testing and validation (use validator-role-skill)
  • For documentation writing (use scribe-role-skill)

Prerequisites


  • Access to CI/CD platform (GitHub Actions, GitLab, Jenkins)
  • Cloud provider credentials (AWS, GCP, Azure)
  • Infrastructure as Code tools installed (Terraform, Ansible)
  • Container registry access
  • Kubernetes cluster access (if using K8s)
  • Monitoring tools configured

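Before starting, a small preflight sketch can report which of these CLI tools are absent. The tool list below is illustrative; extend it with your cloud CLI (aws, gcloud, az), helm, ansible, and so on:

```shell
# Report which of the required CLI tools are not installed.
# The tool list is an assumption; extend it for your own stack.
missing_tools() {
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || printf '%s ' "$t"
  done
}

missing=$(missing_tools git docker kubectl terraform)
if [ -n "$missing" ]; then
  echo "Missing: $missing"
else
  echo "All prerequisite CLIs found"
fi
```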

Workflow


Phase 1: CI/CD Pipeline Creation

Implement automated build, test, and deployment workflows.

**Step 1.1: Requirements Analysis**

Load context files:

- DEVELOPMENT_PLAN.md (deployment strategy)
- ARCHITECTURE.md (system components)
- README.md (project overview)
- Security requirements

**Step 1.2: Create Pipeline Configuration**

GitHub Actions Example


Create `.github/workflows/ci-cd.yml`:

```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '18.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ============================================
  # LINT & FORMAT CHECK
  # ============================================
  lint:
    name: Lint and Format Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Check formatting
        run: npm run format:check

  # ============================================
  # BUILD
  # ============================================
  build:
    name: Build Application
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run build

      - name: Upload build artifacts
        uses: actions/upload-artifact@v4
        with:
          name: build-artifacts
          path: dist/
          retention-days: 7

  # ============================================
  # TEST
  # ============================================
  test:
    name: Run Tests
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        test-type: [unit, integration]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run ${{ matrix.test-type }} tests
        run: npm run test:${{ matrix.test-type }}

      - name: Generate coverage report
        if: matrix.test-type == 'unit'
        run: npm run coverage

      - name: Upload coverage to Codecov
        if: matrix.test-type == 'unit'
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/coverage-final.json
          flags: unittests

  # ============================================
  # SECURITY SCAN
  # ============================================
  security:
    name: Security Scanning
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v4

      - name: Run dependency audit
        run: npm audit --audit-level=moderate

      - name: Run Snyk security scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

  # ============================================
  # BUILD DOCKER IMAGE
  # ============================================
  docker:
    name: Build and Push Docker Image
    runs-on: ubuntu-latest
    needs: [build, test, security]
    if: github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            # format=long so the tag matches the full github.sha used at deploy time
            type=sha,prefix={{branch}}-,format=long
            type=semver,pattern={{version}}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ============================================
  # DEPLOY TO STAGING
  # ============================================
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: docker
    if: github.ref == 'refs/heads/develop'
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}

      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:develop-${{ github.sha }}
          kubectl rollout status deployment/app

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run smoke tests
        run: |
          npm run test:smoke -- --url=https://staging.example.com

  # ============================================
  # DEPLOY TO PRODUCTION
  # ============================================
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: docker
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://www.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_PROD }}

      - name: Deploy to Kubernetes (Blue-Green)
        run: |
          # Deploy green version
          kubectl apply -f k8s/green-deployment.yaml
          kubectl set image deployment/app-green \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:main-${{ github.sha }}

          # Wait for rollout
          kubectl rollout status deployment/app-green

          # Run health checks
          kubectl run health-check --rm -i --restart=Never \
            --image=curlimages/curl -- \
            curl http://app-green-service/health

          # Switch traffic to green
          kubectl patch service app-service \
            -p '{"spec":{"selector":{"version":"green"}}}'

          # Delete old blue deployment
          kubectl delete deployment app-blue || true

          # Rename green to blue for next deploy
          kubectl label deployment app-green version=blue --overwrite

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run production smoke tests
        run: |
          npm run test:smoke -- --url=https://www.example.com

      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment successful!",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Production Deployment Complete* :rocket:\nCommit: ${{ github.sha }}\nAuthor: ${{ github.actor }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```
**Step 1.3: Create Dockerfile**

Create `Dockerfile`:

```dockerfile
# ============================================
# Build stage
# ============================================
# Multi-stage build for optimal image size
FROM node:18-alpine AS builder
WORKDIR /app

# Copy package files
COPY package*.json ./

# Install all dependencies (dev dependencies are needed for the build)
RUN npm ci

# Copy source code
COPY . .

# Build application
RUN npm run build

# Drop dev dependencies so only production modules are copied below
RUN npm prune --omit=dev

# ============================================
# Production image
# ============================================
FROM node:18-alpine

# Install dumb-init for proper signal handling
RUN apk add --no-cache dumb-init

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

WORKDIR /app

# Copy built artifacts and dependencies
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package.json ./

# Switch to non-root user
USER nodejs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# Start application
CMD ["node", "dist/main.js"]
```

---

Phase 2: Infrastructure as Code


Manage infrastructure using declarative configuration.

**Step 2.1: Terraform Configuration**

Create `terraform/main.tf`:

```hcl
# ============================================
# Provider Configuration
# ============================================
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-bucket"
    key            = "app/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "MyApp"
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}

# ============================================
# VPC and Networking
# ============================================
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "${var.project_name}-${var.environment}-vpc"
  cidr = var.vpc_cidr

  azs             = var.availability_zones
  private_subnets = var.private_subnet_cidrs
  public_subnets  = var.public_subnet_cidrs

  enable_nat_gateway = true
  enable_vpn_gateway = false

  enable_dns_hostnames = true
  enable_dns_support   = true
}

# ============================================
# EKS Cluster
# ============================================
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = "${var.project_name}-${var.environment}"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      desired_size = var.node_desired_size
      min_size     = var.node_min_size
      max_size     = var.node_max_size

      instance_types = var.node_instance_types
      capacity_type  = "ON_DEMAND"

      labels = {
        role = "general"
      }

      tags = {
        NodeGroup = "general"
      }
    }
  }

  # Cluster access entry
  enable_cluster_creator_admin_permissions = true
}

# ============================================
# RDS Database
# ============================================
module "db" {
  source = "terraform-aws-modules/rds/aws"

  identifier = "${var.project_name}-${var.environment}-db"

  engine               = "postgres"
  engine_version       = "15.4"
  family               = "postgres15"
  major_engine_version = "15"
  instance_class       = var.db_instance_class

  allocated_storage     = var.db_allocated_storage
  max_allocated_storage = var.db_max_allocated_storage

  db_name  = var.db_name
  username = var.db_username
  port     = 5432

  multi_az               = var.environment == "production"
  db_subnet_group_name   = module.vpc.database_subnet_group
  vpc_security_group_ids = [aws_security_group.database.id]

  backup_retention_period = var.environment == "production" ? 30 : 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  deletion_protection = var.environment == "production"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name = "${var.project_name}-${var.environment}-db"
  }
}

# ============================================
# ElastiCache Redis
# ============================================
module "redis" {
  source = "terraform-aws-modules/elasticache/aws"

  cluster_id             = "${var.project_name}-${var.environment}-redis"
  engine                 = "redis"
  engine_version         = "7.0"
  node_type              = var.redis_node_type
  num_cache_nodes        = 1
  parameter_group_family = "redis7"

  subnet_ids         = module.vpc.private_subnets
  security_group_ids = [aws_security_group.redis.id]

  snapshot_retention_limit = var.environment == "production" ? 5 : 1
  snapshot_window          = "05:00-06:00"
  maintenance_window       = "sun:06:00-sun:07:00"
}

# ============================================
# Security Groups
# ============================================
resource "aws_security_group" "database" {
  name_prefix = "${var.project_name}-${var.environment}-db-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [module.eks.node_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "redis" {
  name_prefix = "${var.project_name}-${var.environment}-redis-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [module.eks.node_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# ============================================
# Outputs
# ============================================
output "cluster_endpoint" {
  value = module.eks.cluster_endpoint
}

output "database_endpoint" {
  value = module.db.db_instance_endpoint
}

output "redis_endpoint" {
  value = module.redis.cache_nodes[0].address
}
```
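The S3 backend in `main.tf` assumes the state bucket and lock table already exist; they must be created once, outside this configuration (for example with a local backend). A sketch, reusing the names from the backend block:

```hcl
# One-time bootstrap for the remote state backend (illustrative)
resource "aws_s3_bucket" "tf_state" {
  bucket = "terraform-state-bucket"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```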

**Step 2.2: Variables and Environments**

Create `terraform/variables.tf`:

```hcl
variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-west-2"
}

variable "project_name" {
  description = "Project name"
  type        = string
}

variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
}

variable "private_subnet_cidrs" {
  description = "Private subnet CIDR blocks"
  type        = list(string)
}

variable "public_subnet_cidrs" {
  description = "Public subnet CIDR blocks"
  type        = list(string)
}

variable "node_desired_size" {
  description = "Desired number of EKS nodes"
  type        = number
  default     = 2
}

variable "node_min_size" {
  description = "Minimum number of EKS nodes"
  type        = number
  default     = 1
}

variable "node_max_size" {
  description = "Maximum number of EKS nodes"
  type        = number
  default     = 5
}

variable "node_instance_types" {
  description = "EKS node instance types"
  type        = list(string)
  default     = ["t3.medium"]
}

variable "db_instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.micro"
}

variable "db_allocated_storage" {
  description = "RDS allocated storage (GB)"
  type        = number
  default     = 20
}

variable "db_max_allocated_storage" {
  description = "RDS maximum allocated storage (GB)"
  type        = number
  default     = 100
}

variable "db_name" {
  description = "Database name"
  type        = string
}

variable "db_username" {
  description = "Database username"
  type        = string
}

variable "redis_node_type" {
  description = "ElastiCache Redis node type"
  type        = string
  default     = "cache.t3.micro"
}
```

**Step 2.3: Terraform Execution**
output "cluster_endpoint" { value = module.eks.cluster_endpoint }
output "database_endpoint" { value = module.db.db_instance_endpoint }
output "redis_endpoint" { value = module.redis.cache_nodes[0].address }

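Step 2.3 consumes per-environment values from tfvars files. An illustrative `environments/production.tfvars` covering the required variables above (every value here is a placeholder, not from this guide):

```hcl
# environments/production.tfvars -- illustrative values only
environment          = "production"
project_name         = "myapp"        # hypothetical project name
vpc_cidr             = "10.0.0.0/16"
availability_zones   = ["us-west-2a", "us-west-2b", "us-west-2c"]
private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnet_cidrs  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
node_desired_size    = 3
node_max_size        = 5
db_instance_class    = "db.t3.medium"
db_name              = "myapp"        # hypothetical database name
db_username          = "app_admin"    # hypothetical username
```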
**Step 2.3: Terraform Execution**

```bash
# Initialize Terraform
terraform init

# Select workspace (environment)
terraform workspace select production || terraform workspace new production

# Plan changes
terraform plan -var-file=environments/production.tfvars

# Apply changes (requires approval)
terraform apply -var-file=environments/production.tfvars

# View outputs
terraform output
```
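The sequence above can be wrapped in a small helper that prints the exact commands for a given environment, so they can be reviewed before being piped to `sh`. A sketch; the function name is illustrative, and the environment names mirror the tfvars files in this guide:

```shell
# tf_commands: emit the Terraform command sequence for one environment.
# Review the output, then run it by hand or pipe it to sh.
tf_commands() {
  env_name="$1"
  case "$env_name" in
    dev|staging|production) ;;   # known workspaces in this guide
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac
  echo "terraform init -input=false"
  echo "terraform workspace select $env_name || terraform workspace new $env_name"
  echo "terraform plan -var-file=environments/$env_name.tfvars"
}

# Example: tf_commands production | sh
```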

---

Phase 3: Monitoring and Observability


Implement comprehensive monitoring, logging, and alerting.
**Step 3.1: Prometheus Configuration**

Create `monitoring/prometheus-config.yaml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes nodes
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Application pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
```
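For the `kubernetes-pods` job above, workloads opt in to scraping via pod annotations. A minimal sketch; the app name, image, port, and path are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                   # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels: {app: example-app}
  template:
    metadata:
      labels: {app: example-app}
      annotations:
        prometheus.io/scrape: "true"    # matched by the keep rule
        prometheus.io/path: "/metrics"  # rewrites __metrics_path__
        prometheus.io/port: "8080"      # rewrites the scrape address
    spec:
      containers:
        - name: app
          image: example-app:latest     # placeholder image
          ports:
            - containerPort: 8080
```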

**Step 3.2: Alert Rules**

Create `monitoring/alert-rules.yml`:

```yaml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "{{ $labels.service }} has error rate {{ $value | humanizePercentage }}"

      # High response time
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "{{ $labels.instance }} memory usage is {{ $value }}%"

      # Disk space running out
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running low"
          description: "{{ $labels.instance }} disk {{ $labels.mountpoint }} has {{ $value }}% free"

```
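Rules files like the one above can be unit-tested with `promtool` before rollout (`promtool check rules` performs a syntax-only check; `promtool test rules` evaluates them against synthetic series). A sketch; the file paths and series values are illustrative:

```yaml
# tests/alert-rules-test.yml -- run with: promtool test rules tests/alert-rules-test.yml
rule_files:
  - ../monitoring/alert-rules.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # ~9% error ratio, which should trip the 5% HighErrorRate threshold
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+10x10'
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: warning
```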

Collaboration Patterns


With Architect (or architect-role-skill)


  • Review ARCHITECTURE.md for infrastructure requirements
  • Validate deployment strategy aligns with design
  • Confirm scalability and availability targets

With Builder (or builder-role-skill)


  • Coordinate on build process and artifacts
  • Ensure application exposes health check endpoints
  • Verify environment variable requirements
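The health-check endpoints agreed with the builder are what the deployment's probes point at. A minimal sketch; the paths and port are assumptions, not from this guide:

```yaml
# Container probes in the application Deployment
livenessProbe:
  httpGet:
    path: /healthz    # hypothetical liveness endpoint
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready      # hypothetical readiness endpoint
    port: 8080
  periodSeconds: 5
```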

With Validator (or validator-role-skill)


  • Integrate test execution into CI/CD pipeline
  • Configure automated security scanning
  • Implement quality gates

With Scribe (or scribe-role-skill)


  • Collaborate on deployment documentation
  • Document infrastructure architecture
  • Create runbooks for operations


Examples


Example 1: GitHub Actions CI/CD Pipeline


Task: Create complete CI/CD pipeline for Node.js application

Deliverables


  • GitHub Actions workflow with lint, build, test, security scan
  • Docker multi-stage build configuration
  • Kubernetes deployment manifests
  • Blue-green deployment strategy for production
Result: Automated pipeline with 8-minute build-to-production time
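A workflow delivering those stages has roughly this shape (a sketch; job names, the Node version, and the deploy step are illustrative placeholders):

```yaml
# .github/workflows/ci.yml -- illustrative skeleton
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint                    # lint stage
      - run: npm test                        # test stage
      - run: npm audit --audit-level=high    # basic security scan

  deploy:
    needs: build-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A real pipeline would build/push the Docker image and roll out
      # the Kubernetes manifests here (e.g. kubectl apply / helm upgrade).
      - run: echo "deploy placeholder"
```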

Example 2: AWS Infrastructure with Terraform


Task: Provision production infrastructure on AWS

Infrastructure Created


  • VPC with public/private subnets across 3 AZs
  • EKS cluster with managed node groups (auto-scaling 2-5 nodes)
  • RDS PostgreSQL with multi-AZ failover
  • ElastiCache Redis for caching
  • All with proper security groups and IAM roles
Result: Fully automated infrastructure provisioning in 15 minutes

Example 3: Monitoring Stack Setup


Task: Implement comprehensive monitoring and alerting

Monitoring Implemented


  • Prometheus for metrics collection
  • Grafana dashboards for visualization
  • Alert rules for critical metrics (errors, latency, resource usage)
  • PagerDuty integration for on-call notifications
Result: Full observability with <2 minute alert response time
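The PagerDuty hookup is wired up in Alertmanager rather than Prometheus itself. A sketch; the receiver name is illustrative and the routing key is a placeholder secret:

```yaml
# alertmanager.yml -- illustrative PagerDuty routing
route:
  receiver: pagerduty-oncall
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder, keep in a secret store
```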

---

Resources


Templates


  • resources/github-actions-template.yml - GitHub Actions workflow template
  • resources/terraform-aws-template.tf - AWS infrastructure template
  • resources/prometheus-config-template.yaml - Prometheus configuration template
  • resources/grafana-dashboard-template.json - Grafana dashboard template

Scripts


  • scripts/terraform-init.sh - Terraform initialization script
  • scripts/deploy-monitoring.sh - Monitoring stack deployment
  • scripts/backup-database.sh - Database backup automation

Version: 1.0.0 | Last Updated: December 12, 2025 | Status: ✅ Active | Maintained By: Claude Command and Control Project