devops-role-skill
DevOps Role Skill
Description
Create CI/CD pipelines, automate build and deployment processes, implement monitoring and observability, and manage infrastructure across all environments. This skill implements professional DevOps practices including pipeline automation, infrastructure as code, environment management, and comprehensive monitoring.
When to Use This Skill
- Creating CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
- Implementing infrastructure as code (Terraform, CloudFormation)
- Setting up environment configurations (dev, staging, production)
- Implementing monitoring and observability (Prometheus, Grafana)
- Automating deployment processes
- Managing containerization and orchestration (Docker, Kubernetes)
- Implementing security scanning and compliance checks
When NOT to Use This Skill
- For application code development (use builder-role-skill)
- For system architecture design (use architect-role-skill)
- For code testing and validation (use validator-role-skill)
- For documentation writing (use scribe-role-skill)
Prerequisites
- Access to CI/CD platform (GitHub Actions, GitLab, Jenkins)
- Cloud provider credentials (AWS, GCP, Azure)
- Infrastructure as Code tools installed (Terraform, Ansible)
- Container registry access
- Kubernetes cluster access (if using K8s)
- Monitoring tools configured
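A quick way to confirm the tooling prerequisites before starting is a preflight loop over the expected CLIs. A minimal sketch; the tool list is illustrative and should be trimmed to match your stack:

```shell
# Preflight check: report which of the core CLIs are installed.
for tool in git docker kubectl terraform; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
  fi
done
```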
Workflow
Phase 1: CI/CD Pipeline Creation
Implement automated build, test, and deployment workflows.
**Step 1.1: Requirements Analysis**
Load context files:
- DEVELOPMENT_PLAN.md (deployment strategy)
- ARCHITECTURE.md (system components)
- README.md (project overview)
- Security requirements

**Step 1.2: Create Pipeline Configuration**
GitHub Actions Example

Create `.github/workflows/ci-cd.yml`:

```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '18.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ============================================
  # LINT & FORMAT CHECK
  # ============================================
  lint:
    name: Lint and Format Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run linter
        run: npm run lint
      - name: Check formatting
        run: npm run format:check

  # ============================================
  # BUILD
  # ============================================
  build:
    name: Build Application
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Build application
        run: npm run build
      - name: Upload build artifacts
        uses: actions/upload-artifact@v4
        with:
          name: build-artifacts
          path: dist/
          retention-days: 7

  # ============================================
  # TEST
  # ============================================
  test:
    name: Run Tests
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        test-type: [unit, integration]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run ${{ matrix.test-type }} tests
        run: npm run test:${{ matrix.test-type }}
      - name: Generate coverage report
        if: matrix.test-type == 'unit'
        run: npm run coverage
      - name: Upload coverage to Codecov
        if: matrix.test-type == 'unit'
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/coverage-final.json
          flags: unittests

  # ============================================
  # SECURITY SCAN
  # ============================================
  security:
    name: Security Scanning
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v4
      - name: Run dependency audit
        run: npm audit --audit-level=moderate
      - name: Run Snyk security scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'
      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

  # ============================================
  # BUILD DOCKER IMAGE
  # ============================================
  docker:
    name: Build and Push Docker Image
    runs-on: ubuntu-latest
    needs: [build, test, security]
    if: github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=sha,prefix={{branch}}-
            type=semver,pattern={{version}}
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ============================================
  # DEPLOY TO STAGING
  # ============================================
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: docker
    if: github.ref == 'refs/heads/develop'
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}
      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/app \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:develop-${{ github.sha }}
          kubectl rollout status deployment/app
      - name: Run smoke tests
        run: |
          npm run test:smoke -- --url=https://staging.example.com

  # ============================================
  # DEPLOY TO PRODUCTION
  # ============================================
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: docker
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://www.example.com
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_PROD }}
      - name: Deploy to Kubernetes (Blue-Green)
        run: |
          # Deploy green version
          kubectl apply -f k8s/green-deployment.yaml
          kubectl set image deployment/app-green \
            app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:main-${{ github.sha }}
          # Wait for rollout
          kubectl rollout status deployment/app-green
          # Run health checks
          kubectl run health-check --rm -i --restart=Never \
            --image=curlimages/curl -- \
            curl http://app-green-service/health
          # Switch traffic to green
          kubectl patch service app-service \
            -p '{"spec":{"selector":{"version":"green"}}}'
          # Delete old blue deployment
          kubectl delete deployment app-blue || true
          # Rename green to blue for next deploy
          kubectl label deployment app-green version=blue --overwrite
      - name: Run production smoke tests
        run: |
          npm run test:smoke -- --url=https://www.example.com
      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment successful!",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Production Deployment Complete* :rocket:\nCommit: ${{ github.sha }}\nAuthor: ${{ github.actor }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

**Step 1.3: Create Dockerfile**
```dockerfile
# Multi-stage build for optimal image size
FROM node:18-alpine AS builder
WORKDIR /app

# Copy package files
COPY package*.json ./

# Install all dependencies (dev dependencies are needed for the build step)
RUN npm ci

# Copy source code
COPY . .

# Build application
RUN npm run build

# Drop dev dependencies so only runtime packages reach the final image
RUN npm prune --production

# ============================================
# Production image
# ============================================
FROM node:18-alpine

# Install dumb-init for proper signal handling
RUN apk add --no-cache dumb-init

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
WORKDIR /app

# Copy built artifacts and dependencies
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package.json ./

# Switch to non-root user
USER nodejs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# Start application
CMD ["node", "dist/main.js"]
```
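Because the builder stage runs `COPY . .`, a `.dockerignore` keeps local artifacts out of the build context and out of the image layers. A minimal sketch; the entries are illustrative:

```
node_modules
dist
coverage
.git
.env
```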
---

Phase 2: Infrastructure as Code
Manage infrastructure using declarative configuration.
**Step 2.1: Terraform Configuration**

Create `terraform/main.tf`:
undefined============================================
============================================
Provider Configuration
Provider Configuration
============================================
============================================
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "terraform-state-bucket"
key = "app/terraform.tfstate"
region = "us-west-2"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = "MyApp"
Environment = var.environment
ManagedBy = "Terraform"
}
}
}
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "terraform-state-bucket"
key = "app/terraform.tfstate"
region = "us-west-2"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = "MyApp"
Environment = var.environment
ManagedBy = "Terraform"
}
}
}
============================================
============================================
VPC and Networking
VPC and Networking
============================================
============================================
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "${var.project_name}-${var.environment}-vpc"
cidr = var.vpc_cidr
azs = var.availability_zones
private_subnets = var.private_subnet_cidrs
public_subnets = var.public_subnet_cidrs
enable_nat_gateway = true
enable_vpn_gateway = false
enable_dns_hostnames = true
enable_dns_support = true
}
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "${var.project_name}-${var.environment}-vpc"
cidr = var.vpc_cidr
azs = var.availability_zones
private_subnets = var.private_subnet_cidrs
public_subnets = var.public_subnet_cidrs
enable_nat_gateway = true
enable_vpn_gateway = false
enable_dns_hostnames = true
enable_dns_support = true
}
============================================
============================================
EKS Cluster
EKS Cluster
============================================
============================================
module "eks" {
source = "terraform-aws-modules/eks/aws"
cluster_name = "${var.project_name}-${var.environment}"
cluster_version = "1.28"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
eks_managed_node_groups = {
general = {
desired_size = var.node_desired_size
min_size = var.node_min_size
max_size = var.node_max_size
instance_types = var.node_instance_types
capacity_type = "ON_DEMAND"
labels = {
role = "general"
}
tags = {
NodeGroup = "general"
}
}}
Cluster access entry
enable_cluster_creator_admin_permissions = true
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
cluster_name = "${var.project_name}-${var.environment}"
cluster_version = "1.28"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
eks_managed_node_groups = {
general = {
desired_size = var.node_desired_size
min_size = var.node_min_size
max_size = var.node_max_size
instance_types = var.node_instance_types
capacity_type = "ON_DEMAND"
labels = {
role = "general"
}
tags = {
NodeGroup = "general"
}
}}
Cluster access entry
enable_cluster_creator_admin_permissions = true
}
============================================
============================================
RDS Database
RDS Database
============================================
============================================
module "db" {
source = "terraform-aws-modules/rds/aws"
identifier = "${var.project_name}-${var.environment}-db"
engine = "postgres"
engine_version = "15.4"
family = "postgres15"
major_engine_version = "15"
instance_class = var.db_instance_class
allocated_storage = var.db_allocated_storage
max_allocated_storage = var.db_max_allocated_storage
db_name = var.db_name
username = var.db_username
port = 5432
multi_az = var.environment == "production"
db_subnet_group_name = module.vpc.database_subnet_group
vpc_security_group_ids = [aws_security_group.database.id]
backup_retention_period = var.environment == "production" ? 30 : 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
deletion_protection = var.environment == "production"
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
tags = {
Name = "${var.project_name}-${var.environment}-db"
}
}
module "db" {
source = "terraform-aws-modules/rds/aws"
identifier = "${var.project_name}-${var.environment}-db"
engine = "postgres"
engine_version = "15.4"
family = "postgres15"
major_engine_version = "15"
instance_class = var.db_instance_class
allocated_storage = var.db_allocated_storage
max_allocated_storage = var.db_max_allocated_storage
db_name = var.db_name
username = var.db_username
port = 5432
multi_az = var.environment == "production"
db_subnet_group_name = module.vpc.database_subnet_group
vpc_security_group_ids = [aws_security_group.database.id]
backup_retention_period = var.environment == "production" ? 30 : 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
deletion_protection = var.environment == "production"
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
tags = {
Name = "${var.project_name}-${var.environment}-db"
}
}
============================================
============================================
ElastiCache Redis
ElastiCache Redis
============================================
============================================
module "redis" {
source = "terraform-aws-modules/elasticache/aws"
cluster_id = "${var.project_name}-${var.environment}-redis"
engine = "redis"
engine_version = "7.0"
node_type = var.redis_node_type
num_cache_nodes = 1
parameter_group_family = "redis7"
subnet_ids = module.vpc.private_subnets
security_group_ids = [aws_security_group.redis.id]
snapshot_retention_limit = var.environment == "production" ? 5 : 1
snapshot_window = "05:00-06:00"
maintenance_window = "sun:06:00-sun:07:00"
}
module "redis" {
source = "terraform-aws-modules/elasticache/aws"
cluster_id = "${var.project_name}-${var.environment}-redis"
engine = "redis"
engine_version = "7.0"
node_type = var.redis_node_type
num_cache_nodes = 1
parameter_group_family = "redis7"
subnet_ids = module.vpc.private_subnets
security_group_ids = [aws_security_group.redis.id]
snapshot_retention_limit = var.environment == "production" ? 5 : 1
snapshot_window = "05:00-06:00"
maintenance_window = "sun:06:00-sun:07:00"
}
============================================
============================================
Security Groups
Security Groups
============================================
============================================
resource "aws_security_group" "database" {
name_prefix = "${var.project_name}-${var.environment}-db-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [module.eks.node_security_group_id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "redis" {
name_prefix = "${var.project_name}-${var.environment}-redis-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 6379
to_port = 6379
protocol = "tcp"
security_groups = [module.eks.node_security_group_id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "database" {
name_prefix = "${var.project_name}-${var.environment}-db-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [module.eks.node_security_group_id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "redis" {
name_prefix = "${var.project_name}-${var.environment}-redis-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 6379
to_port = 6379
protocol = "tcp"
security_groups = [module.eks.node_security_group_id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
============================================
============================================
Outputs
Outputs
============================================
============================================
output "cluster_endpoint" {
value = module.eks.cluster_endpoint
}
output "database_endpoint" {
value = module.db.db_instance_endpoint
}
output "redis_endpoint" {
value = module.redis.cache_nodes[0].address
}
**Step 2.2: Variables and Environments**
Create `terraform/variables.tf`:
```hcl
variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-west-2"
}

variable "project_name" {
  description = "Project name"
  type        = string
}

variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
}

variable "private_subnet_cidrs" {
  description = "Private subnet CIDR blocks"
  type        = list(string)
}

variable "public_subnet_cidrs" {
  description = "Public subnet CIDR blocks"
  type        = list(string)
}

variable "node_desired_size" {
  description = "Desired number of EKS nodes"
  type        = number
  default     = 2
}

variable "node_min_size" {
  description = "Minimum number of EKS nodes"
  type        = number
  default     = 1
}

variable "node_max_size" {
  description = "Maximum number of EKS nodes"
  type        = number
  default     = 5
}

variable "node_instance_types" {
  description = "EKS node instance types"
  type        = list(string)
  default     = ["t3.medium"]
}

variable "db_instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.micro"
}

variable "db_allocated_storage" {
  description = "RDS allocated storage (GB)"
  type        = number
  default     = 20
}

variable "db_max_allocated_storage" {
  description = "RDS maximum allocated storage (GB)"
  type        = number
  default     = 100
}

variable "db_name" {
  description = "Database name"
  type        = string
}

variable "db_username" {
  description = "Database username"
  type        = string
}

variable "redis_node_type" {
  description = "ElastiCache Redis node type"
  type        = string
  default     = "cache.t3.micro"
}
```

**Step 2.3: Terraform Execution**
```bash
# Initialize Terraform
terraform init

# Select workspace (environment)
terraform workspace select production || terraform workspace new production

# Plan changes
terraform plan -var-file=environments/production.tfvars

# Apply changes (requires approval)
terraform apply -var-file=environments/production.tfvars

# View outputs
terraform output
```
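The `-var-file` flag above expects one tfvars file per environment. A minimal `environments/production.tfvars` sketch; every value here is illustrative and must match your account and the variables defined in Step 2.2:

```hcl
environment          = "production"
project_name         = "myapp"
vpc_cidr             = "10.0.0.0/16"
availability_zones   = ["us-west-2a", "us-west-2b", "us-west-2c"]
private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnet_cidrs  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
node_desired_size    = 3
node_instance_types  = ["t3.large"]
db_instance_class    = "db.t3.medium"
db_name              = "appdb"
db_username          = "app_admin"
```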
---

Phase 3: Monitoring and Observability
Implement comprehensive monitoring, logging, and alerting.
Step 3.1: Prometheus Configuration
Create :
monitoring/prometheus-config.yamlyaml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
environment: 'prod'实施全面的监控、日志与告警系统。
步骤3.1:Prometheus配置
创建:
monitoring/prometheus-config.yamlyaml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
environment: 'prod'Alerting configuration
Alerting configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
Load rules
Load rules
rule_files:
- /etc/prometheus/rules/*.yml
rule_files:
- /etc/prometheus/rules/*.yml
Scrape configurations
Scrape configurations
scrape_configs:
Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] action: keep regex: default;kubernetes;https
Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs:
- action: labelmap regex: _meta_kubernetes_node_label(.+)
Application pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: metrics_path regex: (.+)
- source_labels: [address, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 target_label: address
- action: labelmap regex: _meta_kubernetes_pod_label(.+)
- source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name
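The `__address__` rewrite in the `kubernetes-pods` job is easy to misread, so here is a small sketch that reproduces it with an ordinary regex engine. Prometheus joins the source label values with `;` before matching; the host and port values below are invented for illustration.

```python
import re

# Relabel rule from the kubernetes-pods job:
#   source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
#   regex: ([^:]+)(?::\d+)?;(\d+)    replacement: $1:$2
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def relabel_address(address: str, annotation_port: str) -> str:
    """Mimic the __address__ rewrite: keep the host, swap in the annotated port."""
    joined = f"{address};{annotation_port}"  # Prometheus joins source labels with ';'
    return pattern.sub(r"\1:\2", joined)

# Pod discovered at 10.0.0.7:8080 but annotated with prometheus.io/port: "9102"
print(relabel_address("10.0.0.7:8080", "9102"))  # 10.0.0.7:9102
# An address without an explicit port also works (the port group is optional)
print(relabel_address("10.0.0.7", "9102"))       # 10.0.0.7:9102
```

This is why a pod only needs the `prometheus.io/port` annotation: the rule drops whatever port service discovery reported and substitutes the annotated one.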
**Step 3.2: Alert Rules**
Create `monitoring/alert-rules.yml`:
```yaml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "{{ $labels.service }} has error rate {{ $value | humanizePercentage }}"

      # High response time
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "{{ $labels.instance }} memory usage is {{ $value }}%"

      # Disk space running out
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running low"
          description: "{{ $labels.instance }} disk {{ $labels.mountpoint }} has {{ $value }}% free"
```
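To make the HighErrorRate expression concrete, here is a sketch of the arithmetic it performs: per-service 5xx request rate divided by total request rate, compared against the 5% threshold. The service names and rates are invented for illustration.

```python
from collections import defaultdict

def error_rates(rates: dict[tuple[str, str], float]) -> dict[str, float]:
    """5xx-error ratio per service, from (service, status) -> requests/sec rates."""
    total: dict[str, float] = defaultdict(float)
    errors: dict[str, float] = defaultdict(float)
    for (service, status), rate in rates.items():
        total[service] += rate
        if status.startswith("5"):  # mirrors the status=~"5.." matcher
            errors[service] += rate
    return {svc: errors[svc] / total[svc] for svc in total}

sample = {
    ("checkout", "200"): 95.0,
    ("checkout", "500"): 5.0,   # exactly 5% -> at the threshold, does NOT fire (> 0.05)
    ("search", "200"): 88.0,
    ("search", "502"): 12.0,    # 12% -> would fire once sustained for the 5m 'for' window
}
firing = [svc for svc, r in error_rates(sample).items() if r > 0.05]
print(firing)  # ['search']
```

Note the strict `>` comparison and the `for: 5m` hold-down: a single scrape spike above 5% does not page anyone; the ratio must stay above the threshold for five minutes.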
Collaboration Patterns
With Architect (or architect-role-skill)
- Review ARCHITECTURE.md for infrastructure requirements
- Validate deployment strategy aligns with design
- Confirm scalability and availability targets

With Builder (or builder-role-skill)
- Coordinate on build process and artifacts
- Ensure application exposes health check endpoints
- Verify environment variable requirements

With Validator (or validator-role-skill)
- Integrate test execution into CI/CD pipeline
- Configure automated security scanning
- Implement quality gates

With Scribe (or scribe-role-skill)
- Collaborate on deployment documentation
- Document infrastructure architecture
- Create runbooks for operations
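The Builder hand-off above calls for health check endpoints, which Kubernetes probes and the `up`-based ServiceDown alert both depend on. A minimal sketch of such an endpoint, assuming a `/healthz` path and port 8080 (both project-specific choices, not requirements of this skill):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answers liveness/readiness probes with a small JSON payload."""

    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep high-frequency probe traffic out of the logs

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

In a real service the handler would also check downstream dependencies (database, cache) before reporting healthy; the sketch only shows the contract the pipeline and probes expect.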
Examples

Example 1: GitHub Actions CI/CD Pipeline

Task: Create a complete CI/CD pipeline for a Node.js application

Deliverables:
- GitHub Actions workflow with lint, build, test, and security scan stages
- Docker multi-stage build configuration
- Kubernetes deployment manifests
- Blue-green deployment strategy for production

Result: Automated pipeline with 8-minute build-to-production time

Example 2: AWS Infrastructure with Terraform

Task: Provision production infrastructure on AWS

Infrastructure Created:
- VPC with public/private subnets across 3 AZs
- EKS cluster with managed node groups (auto-scaling 2-5 nodes)
- RDS PostgreSQL with multi-AZ failover
- ElastiCache Redis for caching
- All with proper security groups and IAM roles

Result: Fully automated infrastructure provisioning in 15 minutes

Example 3: Monitoring Stack Setup

Task: Implement comprehensive monitoring and alerting

Monitoring Implemented:
- Prometheus for metrics collection
- Grafana dashboards for visualization
- Alert rules for critical metrics (errors, latency, resource usage)
- PagerDuty integration for on-call notifications

Result: Full observability with <2 minute alert response time
---

Resources

Templates
- `resources/github-actions-template.yml` - GitHub Actions workflow template
- `resources/terraform-aws-template.tf` - AWS infrastructure template
- `resources/prometheus-config-template.yaml` - Prometheus configuration template
- `resources/grafana-dashboard-template.json` - Grafana dashboard template

Scripts
- `scripts/terraform-init.sh` - Terraform initialization script
- `scripts/deploy-monitoring.sh` - Monitoring stack deployment script
- `scripts/backup-database.sh` - Database backup automation script
References
参考资料
- Agent Skills vs. Multi-Agent
- Terraform Best Practices
- Kubernetes Production Best Practices
- Prometheus Monitoring Guide
Version: 1.0.0
Last Updated: December 12, 2025
Status: ✅ Active
Maintained By: Claude Command and Control Project