
DevOps Role Skill


Description


Create CI/CD pipelines, automate build and deployment processes, implement monitoring and observability, and manage infrastructure across all environments. This skill implements professional DevOps practices including pipeline automation, infrastructure as code, environment management, and comprehensive monitoring.

When to Use This Skill


  • Creating CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  • Implementing infrastructure as code (Terraform, CloudFormation)
  • Setting up environment configurations (dev, staging, production)
  • Implementing monitoring and observability (Prometheus, Grafana)
  • Automating deployment processes
  • Managing containerization and orchestration (Docker, Kubernetes)
  • Implementing security scanning and compliance checks

When NOT to Use This Skill


  • For application code development (use builder-role-skill)
  • For system architecture design (use architect-role-skill)
  • For code testing and validation (use validator-role-skill)
  • For documentation writing (use scribe-role-skill)

Prerequisites


  • Access to CI/CD platform (GitHub Actions, GitLab, Jenkins)
  • Cloud provider credentials (AWS, GCP, Azure)
  • Infrastructure as Code tools installed (Terraform, Ansible)
  • Container registry access
  • Kubernetes cluster access (if using K8s)
  • Monitoring tools configured

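Before starting, a small preflight sketch can report which of these CLI tools are absent. The tool list below is illustrative; extend it with your cloud CLI (aws, gcloud, az), helm, ansible, and so on:

```shell
# Report which of the required CLI tools are not installed.
# The tool list is an assumption; extend it for your own stack.
missing_tools() {
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || printf '%s ' "$t"
  done
}

missing=$(missing_tools git docker kubectl terraform)
if [ -n "$missing" ]; then
  echo "Missing: $missing"
else
  echo "All prerequisite CLIs found"
fi
```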

Workflow


Phase 1: CI/CD Pipeline Creation

Implement automated build, test, and deployment workflows.

**Step 1.1: Requirements Analysis**

Load context files:

- DEVELOPMENT_PLAN.md (deployment strategy)
- ARCHITECTURE.md (system components)
- README.md (project overview)
- Security requirements

**Step 1.2: Create Pipeline Configuration**

GitHub Actions Example


Create `.github/workflows/ci-cd.yml`:

```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '18.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ============================================
  # LINT & FORMAT CHECK
  # ============================================
  lint:
    name: Lint and Format Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Check formatting
        run: npm run format:check

  # ============================================
  # BUILD
  # ============================================
  build:
    name: Build Application
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run build

      - name: Upload build artifacts
        uses: actions/upload-artifact@v4
        with:
          name: build-artifacts
          path: dist/
          retention-days: 7

  # ============================================
  # TEST
  # ============================================
  test:
    name: Run Tests
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        test-type: [unit, integration]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run ${{ matrix.test-type }} tests
        run: npm run test:${{ matrix.test-type }}

      - name: Generate coverage report
        if: matrix.test-type == 'unit'
        run: npm run coverage

      - name: Upload coverage to Codecov
        if: matrix.test-type == 'unit'
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/coverage-final.json
          flags: unittests

  # ============================================
  # SECURITY SCAN
  # ============================================
  security:
    name: Security Scanning
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v4

      - name: Run dependency audit
        run: npm audit --audit-level=moderate

      - name: Run Snyk security scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

  # ============================================
  # BUILD DOCKER IMAGE
  # ============================================
  docker:
    name: Build and Push Docker Image
    runs-on: ubuntu-latest
    needs: [build, test, security]
    if: github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            # format=long so the tag matches the full github.sha used at deploy time
            type=sha,prefix={{branch}}-,format=long
            type=semver,pattern={{version}}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ============================================
  # DEPLOY TO STAGING
  # ============================================
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: docker
    if: github.ref == 'refs/heads/develop'
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}

      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:develop-${{ github.sha }}
          kubectl rollout status deployment/app

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run smoke tests
        run: |
          npm run test:smoke -- --url=https://staging.example.com

  # ============================================
  # DEPLOY TO PRODUCTION
  # ============================================
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: docker
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://www.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_PROD }}

      - name: Deploy to Kubernetes (Blue-Green)
        run: |
          # Deploy green version
          kubectl apply -f k8s/green-deployment.yaml
          kubectl set image deployment/app-green \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:main-${{ github.sha }}

          # Wait for rollout
          kubectl rollout status deployment/app-green

          # Run health checks
          kubectl run health-check --rm -i --restart=Never \
            --image=curlimages/curl -- \
            curl http://app-green-service/health

          # Switch traffic to green
          kubectl patch service app-service \
            -p '{"spec":{"selector":{"version":"green"}}}'

          # Delete old blue deployment
          kubectl delete deployment app-blue || true

          # Rename green to blue for next deploy
          kubectl label deployment app-green version=blue --overwrite

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run production smoke tests
        run: |
          npm run test:smoke -- --url=https://www.example.com

      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment successful!",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Production Deployment Complete* :rocket:\nCommit: ${{ github.sha }}\nAuthor: ${{ github.actor }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```
**Step 1.3: Create Dockerfile**

Create `Dockerfile`:

```dockerfile
# ============================================
# Build stage
# ============================================
# Multi-stage build for optimal image size
FROM node:18-alpine AS builder
WORKDIR /app

# Copy package files
COPY package*.json ./

# Install all dependencies (dev dependencies are needed for the build)
RUN npm ci

# Copy source code
COPY . .

# Build application
RUN npm run build

# Drop dev dependencies so only production modules are copied below
RUN npm prune --omit=dev

# ============================================
# Production image
# ============================================
FROM node:18-alpine

# Install dumb-init for proper signal handling
RUN apk add --no-cache dumb-init

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

WORKDIR /app

# Copy built artifacts and dependencies
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package.json ./

# Switch to non-root user
USER nodejs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# Start application
CMD ["node", "dist/main.js"]
```

---

Phase 2: Infrastructure as Code


Manage infrastructure using declarative configuration.

**Step 2.1: Terraform Configuration**

Create `terraform/main.tf`:

```hcl
# ============================================
# Provider Configuration
# ============================================
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-bucket"
    key            = "app/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "MyApp"
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}

# ============================================
# VPC and Networking
# ============================================
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "${var.project_name}-${var.environment}-vpc"
  cidr = var.vpc_cidr

  azs             = var.availability_zones
  private_subnets = var.private_subnet_cidrs
  public_subnets  = var.public_subnet_cidrs

  enable_nat_gateway = true
  enable_vpn_gateway = false

  enable_dns_hostnames = true
  enable_dns_support   = true
}

# ============================================
# EKS Cluster
# ============================================
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = "${var.project_name}-${var.environment}"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      desired_size = var.node_desired_size
      min_size     = var.node_min_size
      max_size     = var.node_max_size

      instance_types = var.node_instance_types
      capacity_type  = "ON_DEMAND"

      labels = {
        role = "general"
      }

      tags = {
        NodeGroup = "general"
      }
    }
  }

  # Cluster access entry
  enable_cluster_creator_admin_permissions = true
}

# ============================================
# RDS Database
# ============================================
module "db" {
  source = "terraform-aws-modules/rds/aws"

  identifier = "${var.project_name}-${var.environment}-db"

  engine               = "postgres"
  engine_version       = "15.4"
  family               = "postgres15"
  major_engine_version = "15"
  instance_class       = var.db_instance_class

  allocated_storage     = var.db_allocated_storage
  max_allocated_storage = var.db_max_allocated_storage

  db_name  = var.db_name
  username = var.db_username
  port     = 5432

  multi_az               = var.environment == "production"
  db_subnet_group_name   = module.vpc.database_subnet_group
  vpc_security_group_ids = [aws_security_group.database.id]

  backup_retention_period = var.environment == "production" ? 30 : 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  deletion_protection = var.environment == "production"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name = "${var.project_name}-${var.environment}-db"
  }
}

# ============================================
# ElastiCache Redis
# ============================================
module "redis" {
  source = "terraform-aws-modules/elasticache/aws"

  cluster_id             = "${var.project_name}-${var.environment}-redis"
  engine                 = "redis"
  engine_version         = "7.0"
  node_type              = var.redis_node_type
  num_cache_nodes        = 1
  parameter_group_family = "redis7"

  subnet_ids         = module.vpc.private_subnets
  security_group_ids = [aws_security_group.redis.id]

  snapshot_retention_limit = var.environment == "production" ? 5 : 1
  snapshot_window          = "05:00-06:00"
  maintenance_window       = "sun:06:00-sun:07:00"
}

# ============================================
# Security Groups
# ============================================
resource "aws_security_group" "database" {
  name_prefix = "${var.project_name}-${var.environment}-db-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [module.eks.node_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "redis" {
  name_prefix = "${var.project_name}-${var.environment}-redis-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [module.eks.node_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# ============================================
# Outputs
# ============================================
output "cluster_endpoint" {
  value = module.eks.cluster_endpoint
}

output "database_endpoint" {
  value = module.db.db_instance_endpoint
}

output "redis_endpoint" {
  value = module.redis.cache_nodes[0].address
}
```
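The S3 backend in `main.tf` assumes the state bucket and lock table already exist; they must be created once, outside this configuration (for example with a local backend). A sketch, reusing the names from the backend block:

```hcl
# One-time bootstrap for the remote state backend (illustrative)
resource "aws_s3_bucket" "tf_state" {
  bucket = "terraform-state-bucket"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```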

**Step 2.2: Variables and Environments**

Create `terraform/variables.tf`:

```hcl
variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-west-2"
}

variable "project_name" {
  description = "Project name"
  type        = string
}

variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
}

variable "private_subnet_cidrs" {
  description = "Private subnet CIDR blocks"
  type        = list(string)
}

variable "public_subnet_cidrs" {
  description = "Public subnet CIDR blocks"
  type        = list(string)
}

variable "node_desired_size" {
  description = "Desired number of EKS nodes"
  type        = number
  default     = 2
}

variable "node_min_size" {
  description = "Minimum number of EKS nodes"
  type        = number
  default     = 1
}

variable "node_max_size" {
  description = "Maximum number of EKS nodes"
  type        = number
  default     = 5
}

variable "node_instance_types" {
  description = "EKS node instance types"
  type        = list(string)
  default     = ["t3.medium"]
}

variable "db_instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.micro"
}

variable "db_allocated_storage" {
  description = "RDS allocated storage (GB)"
  type        = number
  default     = 20
}

variable "db_max_allocated_storage" {
  description = "RDS maximum allocated storage (GB)"
  type        = number
  default     = 100
}

variable "db_name" {
  description = "Database name"
  type        = string
}

variable "db_username" {
  description = "Database username"
  type        = string
}

variable "redis_node_type" {
  description = "ElastiCache Redis node type"
  type        = string
  default     = "cache.t3.micro"
}
```

**Step 2.3: Terraform Execution**
output "cluster_endpoint" { value = module.eks.cluster_endpoint }
output "database_endpoint" { value = module.db.db_instance_endpoint }
output "redis_endpoint" { value = module.redis.cache_nodes[0].address }

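Step 2.3 consumes per-environment values from tfvars files. An illustrative `environments/production.tfvars` covering the required variables above (every value here is a placeholder, not from this guide):

```hcl
# environments/production.tfvars -- illustrative values only
environment          = "production"
project_name         = "myapp"        # hypothetical project name
vpc_cidr             = "10.0.0.0/16"
availability_zones   = ["us-west-2a", "us-west-2b", "us-west-2c"]
private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnet_cidrs  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
node_desired_size    = 3
node_max_size        = 5
db_instance_class    = "db.t3.medium"
db_name              = "myapp"        # hypothetical database name
db_username          = "app_admin"    # hypothetical username
```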
**Step 2.3: Terraform Execution**

```bash
# Initialize Terraform
terraform init

# Select workspace (environment)
terraform workspace select production || terraform workspace new production

# Plan changes
terraform plan -var-file=environments/production.tfvars

# Apply changes (requires approval)
terraform apply -var-file=environments/production.tfvars

# View outputs
terraform output
```
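The sequence above can be wrapped in a small helper that prints the exact commands for a given environment, so they can be reviewed before being piped to `sh`. A sketch; the function name is illustrative, and the environment names mirror the tfvars files in this guide:

```shell
# tf_commands: emit the Terraform command sequence for one environment.
# Review the output, then run it by hand or pipe it to sh.
tf_commands() {
  env_name="$1"
  case "$env_name" in
    dev|staging|production) ;;   # known workspaces in this guide
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac
  echo "terraform init -input=false"
  echo "terraform workspace select $env_name || terraform workspace new $env_name"
  echo "terraform plan -var-file=environments/$env_name.tfvars"
}

# Example: tf_commands production | sh
```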

---

Phase 3: Monitoring and Observability


Implement comprehensive monitoring, logging, and alerting.
**Step 3.1: Prometheus Configuration**

Create `monitoring/prometheus-config.yaml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes nodes
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Application pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
```
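For the `kubernetes-pods` job above, workloads opt in to scraping via pod annotations. A minimal sketch; the app name, image, port, and path are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                   # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels: {app: example-app}
  template:
    metadata:
      labels: {app: example-app}
      annotations:
        prometheus.io/scrape: "true"    # matched by the keep rule
        prometheus.io/path: "/metrics"  # rewrites __metrics_path__
        prometheus.io/port: "8080"      # rewrites the scrape address
    spec:
      containers:
        - name: app
          image: example-app:latest     # placeholder image
          ports:
            - containerPort: 8080
```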

**Step 3.2: Alert Rules**

Create `monitoring/alert-rules.yml`:

```yaml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "{{ $labels.service }} has error rate {{ $value | humanizePercentage }}"

      # High response time
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "{{ $labels.instance }} memory usage is {{ $value }}%"

      # Disk space running out
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running low"
          description: "{{ $labels.instance }} disk {{ $labels.mountpoint }} has {{ $value }}% free"

```
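Rules files like the one above can be unit-tested with `promtool` before rollout (`promtool check rules` performs a syntax-only check; `promtool test rules` evaluates them against synthetic series). A sketch; the file paths and series values are illustrative:

```yaml
# tests/alert-rules-test.yml -- run with: promtool test rules tests/alert-rules-test.yml
rule_files:
  - ../monitoring/alert-rules.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # ~9% error ratio, which should trip the 5% HighErrorRate threshold
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+10x10'
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: warning
```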

Collaboration Patterns


With Architect (or architect-role-skill)


  • Review ARCHITECTURE.md for infrastructure requirements
  • Validate deployment strategy aligns with design
  • Confirm scalability and availability targets

With Builder (or builder-role-skill)


  • Coordinate on build process and artifacts
  • Ensure application exposes health check endpoints
  • Verify environment variable requirements
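The health-check endpoints agreed with the builder are what the deployment's probes point at. A minimal sketch; the paths and port are assumptions, not from this guide:

```yaml
# Container probes in the application Deployment
livenessProbe:
  httpGet:
    path: /healthz    # hypothetical liveness endpoint
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready      # hypothetical readiness endpoint
    port: 8080
  periodSeconds: 5
```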

With Validator (or validator-role-skill)


  • Integrate test execution into CI/CD pipeline
  • Configure automated security scanning
  • Implement quality gates

With Scribe (or scribe-role-skill)


  • Collaborate on deployment documentation
  • Document infrastructure architecture
  • Create runbooks for operations


Examples


Example 1: GitHub Actions CI/CD Pipeline


Task: Create complete CI/CD pipeline for Node.js application

Deliverables


  • GitHub Actions workflow with lint, build, test, security scan
  • Docker multi-stage build configuration
  • Kubernetes deployment manifests
  • Blue-green deployment strategy for production
Result: Automated pipeline with 8-minute build-to-production time
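A workflow delivering those stages has roughly this shape (a sketch; job names, the Node version, and the deploy step are illustrative placeholders):

```yaml
# .github/workflows/ci.yml -- illustrative skeleton
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint                    # lint stage
      - run: npm test                        # test stage
      - run: npm audit --audit-level=high    # basic security scan

  deploy:
    needs: build-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A real pipeline would build/push the Docker image and roll out
      # the Kubernetes manifests here (e.g. kubectl apply / helm upgrade).
      - run: echo "deploy placeholder"
```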

Example 2: AWS Infrastructure with Terraform


Task: Provision production infrastructure on AWS

Infrastructure Created


  • VPC with public/private subnets across 3 AZs
  • EKS cluster with managed node groups (auto-scaling 2-5 nodes)
  • RDS PostgreSQL with multi-AZ failover
  • ElastiCache Redis for caching
  • All with proper security groups and IAM roles
Result: Fully automated infrastructure provisioning in 15 minutes

Example 3: Monitoring Stack Setup


Task: Implement comprehensive monitoring and alerting

Monitoring Implemented


  • Prometheus for metrics collection
  • Grafana dashboards for visualization
  • Alert rules for critical metrics (errors, latency, resource usage)
  • PagerDuty integration for on-call notifications
Result: Full observability with <2 minute alert response time
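The PagerDuty hookup is wired up in Alertmanager rather than Prometheus itself. A sketch; the receiver name is illustrative and the routing key is a placeholder secret:

```yaml
# alertmanager.yml -- illustrative PagerDuty routing
route:
  receiver: pagerduty-oncall
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder, keep in a secret store
```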

---

Resources


Templates


  • resources/github-actions-template.yml - GitHub Actions workflow template
  • resources/terraform-aws-template.tf - AWS infrastructure template
  • resources/prometheus-config-template.yaml - Prometheus configuration template
  • resources/grafana-dashboard-template.json - Grafana dashboard template

Scripts


  • scripts/terraform-init.sh - Terraform initialization script
  • scripts/deploy-monitoring.sh - Monitoring stack deployment
  • scripts/backup-database.sh - Database backup automation

Version: 1.0.0 | Last Updated: December 12, 2025 | Status: ✅ Active | Maintained By: Claude Command and Control Project