ci-cd-best-practices


CI/CD Best Practices

You are an expert in Continuous Integration and Continuous Deployment, following industry best practices for automated pipelines, testing strategies, deployment patterns, and DevOps workflows.

Core Principles

  • Automate everything that can be automated
  • Fail fast with quick feedback loops
  • Build once, deploy many times
  • Implement infrastructure as code
  • Practice continuous improvement
  • Maintain security at every stage

Pipeline Design

Pipeline Stages

A typical CI/CD pipeline includes these stages:
Build -> Test -> Security -> Deploy (Staging) -> Deploy (Production)
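The examples in this guide use GitLab CI syntax. In GitLab CI, any stage other than the built-in defaults must be declared before a job can reference it. Below is a minimal sketch of a `stages:` list matching the flow above. It assumes the stage names used by the jobs in this guide, including the `notify` stage from the Alerting section.

```yaml
# Stage order for the pipeline above. GitLab's default stages are only
# .pre, build, test, deploy and .post, so "security" and "notify" must be
# declared here before any job can use them.
stages:
  - build
  - test
  - security
  - deploy
  - notify
```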

1. Build Stage

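```yaml
build:
  stage: build
  script:
    # Cache npm's download directory: `npm ci` deletes node_modules before
    # installing, so caching node_modules itself would be wasted work
    - npm ci --cache .npm --prefer-offline
    - npm run build
  artifacts:
    paths:
      - dist/
    expire_in: 1 day
  cache:
    key: ${CI_COMMIT_REF_SLUG}
    paths:
      - .npm/
```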
Best practices:
  • Use dependency caching to speed up builds
  • Generate build artifacts for downstream stages
  • Pin dependency versions for reproducibility
  • Use multi-stage Docker builds for smaller images
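The multi-stage Docker bullet and the "build once, deploy many times" principle fit together. You build the image a single time, tag it with the commit SHA, and have every later environment deploy that exact tag. This sketch assumes a multi-stage `Dockerfile` at the repository root and a runner that permits Docker-in-Docker. The image versions are placeholders.

```yaml
build:image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"   # TLS between the job and the dind service
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login "$CI_REGISTRY" -u "$CI_REGISTRY_USER" --password-stdin
    # One immutable tag per commit; staging and production deploy this same image
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
```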

2. Test Stage

```yaml
test:
  stage: test
  parallel:
    matrix:
      - TEST_TYPE: [unit, integration, e2e]
  script:
    - npm ci --cache .npm --prefer-offline   # each job needs its own install
    - npm run test:${TEST_TYPE}
  coverage: '/Coverage: \d+\.\d+%/'
  artifacts:
    when: always    # upload reports even when tests fail
    reports:
      junit: test-results.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml
```
Testing layers:
  • Unit tests: Fast, isolated, run on every commit
  • Integration tests: Test component interactions
  • End-to-end tests: Validate user workflows
  • Performance tests: Check for regressions
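One way to apply these layers is to vary when each one runs. In this sketch, unit tests run in every pipeline, while the slower end-to-end suite runs only on the default branch. It assumes `test:unit` and `test:e2e` npm scripts like the ones the matrix job above calls.

```yaml
test:unit:
  stage: test
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run test:unit              # no rules: runs in every pipeline

test:e2e:
  stage: test
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run test:e2e
  rules:
    # Only pipelines for the default branch (e.g. main)
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```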

3. Security Stage

```yaml
security:
  stage: security
  parallel:
    matrix:
      - SCAN_TYPE: [sast, dependency, secrets]
  script:
    - ./security-scan.sh ${SCAN_TYPE}
  allow_failure: false
```
Security scanning types:
  • SAST: Static Application Security Testing
  • DAST: Dynamic Application Security Testing
  • Dependency scanning: Check for vulnerable packages
  • Secret detection: Find leaked credentials
  • Container scanning: Analyze Docker images
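GitLab ships maintained job templates for several of these scan types, and they can replace a hand-written `security-scan.sh`. This sketch assumes a GitLab-hosted pipeline and template paths as found in recent GitLab releases. Some scanners require a paid tier. DAST is left out because it needs a running target URL. The `CS_IMAGE` override assumes an earlier job pushed `$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA`.

```yaml
include:
  - template: Security/SAST.gitlab-ci.yml
  - template: Security/Secret-Detection.gitlab-ci.yml
  - template: Security/Dependency-Scanning.gitlab-ci.yml
  - template: Security/Container-Scanning.gitlab-ci.yml

# The template jobs run in the "test" stage by default, so that stage
# must appear in the stages: list.
container_scanning:
  variables:
    CS_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA   # image to scan (assumed tag)
```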

4. Deploy Stage

```yaml
deploy:staging:
  stage: deploy
  environment:
    name: staging
    url: https://staging.example.com
  script:
    - ./deploy.sh staging
  rules:
    - if: $CI_COMMIT_BRANCH == "develop"

deploy:production:
  stage: deploy
  environment:
    name: production
    url: https://example.com
  script:
    - ./deploy.sh production
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
```

Deployment Strategies

Blue-Green Deployment

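Maintain two identical production environments (blue and green). Deploy to the idle one, verify it, then switch traffic over:
```yaml
deploy:blue-green:
  stage: deploy
  script:
    - ./deploy-to-inactive.sh       # deploy to the environment not serving traffic
    - ./run-smoke-tests.sh          # verify it before it receives traffic
    - ./switch-traffic.sh           # point the router or load balancer at it
    # Retire the old side only once the new one is stable; until then it is the rollback target
    - ./cleanup-old-environment.sh
```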
Benefits:
  • Zero-downtime deployments
  • Easy rollback by switching traffic back
  • Full testing in production-like environment
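On Kubernetes, one common way to switch traffic is a Service whose label selector picks the active color. This sketch uses hypothetical names throughout:
- Deployments `app-blue` and `app-green` whose Pods carry a `color` label.
- A Service named `app`.
- Manifests stored under `k8s/`.

```yaml
deploy:blue-green:
  stage: deploy
  variables:
    TARGET_COLOR: green   # hypothetical: the side currently not receiving traffic
  script:
    - kubectl apply -f "k8s/deployment-$TARGET_COLOR.yaml"
    - kubectl rollout status "deployment/app-$TARGET_COLOR" --timeout=5m
    - ./run-smoke-tests.sh
    # Repoint the Service selector; rolling back is the same command with the old color
    - 'kubectl patch service app --type=merge -p "{\"spec\":{\"selector\":{\"color\":\"$TARGET_COLOR\"}}}"'
```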

Canary Deployment

Gradually roll out to a subset of users:
```yaml
deploy:canary:
  stage: deploy
  script:
    - ./deploy-canary.sh --percentage=5
    - ./monitor-metrics.sh --duration=30m   # a non-zero exit stops the job here
    - ./deploy-canary.sh --percentage=25
    - ./monitor-metrics.sh --duration=30m
    - ./deploy-canary.sh --percentage=100
```
Canary stages:
  1. Deploy to 5% of traffic
  2. Monitor error rates and latency
  3. Gradually increase if metrics are healthy
  4. Full rollout or rollback based on data
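With the ingress-nginx controller, you can set the traffic split through canary annotations on a second Ingress. This sketch assumes three things: ingress-nginx is installed, a primary Ingress already exists for the same host, and a hypothetical `app-canary` Service fronts the new version. A script like `deploy-canary.sh` could then raise `canary-weight` from 5 to 25 to 100.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # percent of requests sent to the canary
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-canary
                port:
                  number: 80
```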

Rolling Deployment

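Update instances incrementally (this example assumes the Deployment's container is named `app`):
```yaml
deploy:rolling:
  stage: deploy
  script:
    # Roll out a new image version. Unlike `kubectl rollout restart`, which
    # re-creates Pods from the unchanged spec, this changes what is deployed.
    - kubectl set image deployment/app app="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
    # Kubernetes does not revert a stalled rollout by itself, so undo and fail here
    - 'kubectl rollout status deployment/app --timeout=5m || { kubectl rollout undo deployment/app; exit 1; }'
```
Configuration:
  • Set `maxUnavailable` and `maxSurge` in the Deployment's rolling-update strategy to bound how many Pods are replaced at once
  • Readiness probes decide when a new Pod counts as available, which sets the rollout pace
  • Handle rollback in the pipeline: Kubernetes only marks a stalled rollout as failed after `progressDeadlineSeconds` and does not revert it automatically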
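The rollout settings above live in the Deployment manifest. This sketch uses a placeholder image, port and probe path, and names the container `app`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 4
  progressDeadlineSeconds: 300    # report the rollout as failed after 5 min without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                 # at most 1 Pod above the desired count during rollout
      maxUnavailable: 0           # never drop below the desired count
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0.0   # placeholder
          ports:
            - containerPort: 8080
          readinessProbe:                         # gates when a new Pod counts as available
            httpGet:
              path: /healthz                      # hypothetical endpoint
              port: 8080
            periodSeconds: 5
```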

Feature Flags

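Decouple deployment from release:
```javascript
// Feature flag implementation (React component; featureFlags stands for
// whatever flag client the application uses)
function Checkout() {
  if (featureFlags.isEnabled('new-checkout')) {
    return <NewCheckout />;
  }
  return <LegacyCheckout />;
}
```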
Benefits:
  • Deploy disabled features to production
  • Gradual feature rollout
  • A/B testing capabilities
  • Quick feature disable without deployment

Environment Management

Environment Hierarchy

Development -> Testing -> Staging -> Production
Each environment should:
  • Mirror production as closely as possible
  • Have isolated data and secrets
  • Use infrastructure as code

Environment Variables

```yaml
variables:
  # Global variables
  APP_NAME: my-app

# Environment-specific
.staging:
  variables:
    ENV: staging
    API_URL: https://api.staging.example.com

.production:
  variables:
    ENV: production
    API_URL: https://api.example.com
```
Best practices:
  • Never hardcode secrets
  • Use secret management (Vault, AWS Secrets Manager)
  • Separate configuration from code
  • Document all required variables
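The hidden `.staging` and `.production` jobs act as templates: nothing runs until a job pulls one in with `extends`. This sketch shows the deploy jobs inheriting them. It assumes the `deploy.sh` script from the Deploy Stage section accepts the environment name.

```yaml
deploy:staging:
  extends: .staging          # inherits ENV=staging and the staging API_URL
  stage: deploy
  script:
    - ./deploy.sh "$ENV"
  rules:
    - if: $CI_COMMIT_BRANCH == "develop"

deploy:production:
  extends: .production       # inherits ENV=production and the production API_URL
  stage: deploy
  script:
    - ./deploy.sh "$ENV"
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
```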

Infrastructure as Code

```hcl
# Terraform example (assumes aws_ecs_cluster.main and
# aws_ecs_task_definition.app are defined elsewhere in the configuration)
variable "app_name" {
  type = string
}

variable "environment" {
  type = string
}

resource "aws_ecs_service" "app" {
  name            = var.app_name
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.environment == "production" ? 3 : 1

  # Rolling-update bounds: start new tasks before stopping old ones
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
}
```

Testing Strategies

Test Pyramid

```
        /\
       /  \      E2E Tests (Few)
      /----\
     /      \    Integration Tests (Some)
    /--------\
   /          \  Unit Tests (Many)
  --------------
```

Test Parallelization

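```yaml
test:
  parallel: 4
  script:
    # GitLab sets CI_NODE_INDEX (1-based) and CI_NODE_TOTAL in each of the 4 jobs;
    # assumes a test runner that accepts --shard=index/total, as Jest 28+ does
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
```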

Test Data Management

  • Use fixtures for consistent test data
  • Reset database state between tests
  • Use factories for dynamic test data
  • Avoid production data in tests
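In GitLab CI, a service container gives each job its own throwaway database. Tests then start from a clean state and never touch production data. This sketch assumes PostgreSQL and two hypothetical npm scripts, `db:migrate` and `db:seed`, which build the schema and load fixtures.

```yaml
test:integration:
  stage: test
  image: node:20
  services:
    - postgres:16                       # reachable from the job at host "postgres"
  variables:
    POSTGRES_DB: app_test
    POSTGRES_USER: runner
    POSTGRES_HOST_AUTH_METHOD: trust    # acceptable only for a disposable test database
    DATABASE_URL: postgres://runner@postgres:5432/app_test
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run db:migrate                # hypothetical: create the schema
    - npm run db:seed                   # hypothetical: load fixture data
    - npm run test:integration
```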

Flaky Test Handling

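```yaml
test:
  retry:
    max: 2
    # Retry only infrastructure failures; an ordinary test failure
    # (script_failure) is not retried, so flaky tests stay visible
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
```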
Strategies:
  • Quarantine flaky tests
  • Add retry logic for known issues
  • Investigate and fix root causes
  • Track flaky test metrics
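To quarantine flaky tests without hiding them, run them in a separate job that is allowed to fail but still publishes results. This sketch assumes a hypothetical `test:quarantine` npm script that runs only the tests marked as flaky and writes a JUnit report.

```yaml
test:quarantine:
  stage: test
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run test:quarantine          # hypothetical: runs only quarantined tests
  allow_failure: true                  # failures show as warnings, not pipeline failures
  artifacts:
    when: always
    reports:
      junit: quarantine-results.xml    # keeps flaky results visible for tracking
```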

Monitoring and Observability

Pipeline Metrics

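Track these metrics (the four DORA metrics):
  • Lead time for changes: time from commit to running in production
  • Deployment frequency: how often you deploy to production
  • Change failure rate: percentage of deployments that cause a failure in production
  • Mean time to recovery: time to restore service after a failure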

Health Checks

```yaml
deploy:
  stage: deploy
  script:
    - ./deploy.sh
    - ./wait-for-healthy.sh --timeout=300
    - ./run-smoke-tests.sh
```
Implement:
  • Readiness probes
  • Liveness probes
  • Startup probes
  • Smoke tests post-deployment
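In Kubernetes, the three probe types are set per container in the Pod spec. This fragment uses a placeholder image and port. The `/healthz` and `/ready` endpoints are hypothetical.

```yaml
containers:
  - name: app
    image: registry.example.com/app:1.0.0   # placeholder
    ports:
      - containerPort: 8080
    startupProbe:            # other probes wait until this succeeds (up to 30 x 5 s)
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:           # the container is restarted if this keeps failing
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:          # the Pod receives no Service traffic until this succeeds
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```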

Alerting

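```yaml
# Both jobs need a notify stage declared in the stages: list
notify:failure:
  stage: notify
  script:
    - ./send-alert.sh --channel=deployments --status=failed
  when: on_failure   # runs only if a job in an earlier stage failed

notify:success:
  stage: notify
  script:
    - ./send-notification.sh --channel=deployments --status=success
  when: on_success   # runs only if all jobs in earlier stages succeeded
```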

Security in CI/CD

Secrets Management

```yaml
# Use CI/CD secret variables (DEPLOY_KEY assumed to be a masked, protected variable)
deploy:
  stage: deploy
  script:
    - echo "$DEPLOY_KEY" | base64 -d > deploy_key
    - chmod 600 deploy_key
    - ./deploy.sh
  after_script:
    # after_script runs even when script fails, so the key file is always removed
    - rm -f deploy_key
```
Best practices:
  • Rotate secrets regularly
  • Use short-lived credentials
  • Audit secret access
  • Never log secrets
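For short-lived credentials, GitLab can issue a per-job OIDC token that a cloud provider exchanges for temporary keys. No long-lived secret then needs to be stored in CI variables. This sketch targets AWS and makes three assumptions:
- An IAM role trusts your GitLab instance as an OIDC provider.
- The role's ARN is stored in a hypothetical `ROLE_ARN` variable.
- The audience value is a placeholder.

```yaml
deploy:aws:
  stage: deploy
  image:
    name: amazon/aws-cli:latest
    entrypoint: [""]
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.example.com   # placeholder audience configured on the AWS side
  script:
    # Exchange the job's OIDC token for credentials that expire after one hour
    - >
      export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s"
      $(aws sts assume-role-with-web-identity
      --role-arn "$ROLE_ARN"
      --role-session-name "gitlab-$CI_JOB_ID"
      --web-identity-token "$GITLAB_OIDC_TOKEN"
      --duration-seconds 3600
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'
      --output text))
    - aws sts get-caller-identity       # confirm which role the job assumed
```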

Pipeline Security

```yaml
# Production deploys are manual and blocking; restricting who may trigger
# them is configured separately through protected environments and branches
deploy:production:
  stage: deploy
  script:
    - ./deploy.sh production
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
      allow_failure: false
  environment:
    name: production
    deployment_tier: production
```
Controls:
  • Branch protection rules
  • Required approvals
  • Audit logging
  • Signed commits

Dependency Security

```yaml
dependency_check:
  stage: security
  script:
    - npm audit --audit-level=high   # fails on high or critical advisories
    - ./check-licenses.sh
  allow_failure: false
```

Optimization Techniques

Caching

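```yaml
cache:
  key:
    files:
      - package-lock.json   # key changes only when the lockfile changes
  paths:
    - .npm/                 # npm's download cache; pair with `npm ci --cache .npm`
  policy: pull-push
```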
Cache strategies:
  • Cache dependencies between runs
  • Use content-based cache keys
  • Separate cache per branch
  • Clean stale caches periodically
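You can combine content-based keys with per-branch separation, and jobs that only read the cache can skip uploading it. This sketch uses hypothetical job names and npm's `.npm/` download cache.

```yaml
install:
  stage: build
  script:
    - npm ci --cache .npm --prefer-offline
  cache:
    key:
      files:
        - package-lock.json
      prefix: $CI_COMMIT_REF_SLUG   # per-branch cache, refreshed when the lockfile changes
    paths:
      - .npm/
    policy: pull-push               # download, then upload the updated cache

test:lint:
  stage: test
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run lint                  # hypothetical lint script
  cache:
    key:
      files:
        - package-lock.json
      prefix: $CI_COMMIT_REF_SLUG
    paths:
      - .npm/
    policy: pull                    # read-only: skips the upload at job end
```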

Parallelization

```yaml
stages:
  - build
  - test
  - deploy

# Run tests in parallel: jobs in the same stage run concurrently
# (given enough available runners)
test:unit:
  stage: test
  script: npm run test:unit

test:integration:
  stage: test
  script: npm run test:integration

test:e2e:
  stage: test
  script: npm run test:e2e
```
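Stages still run in sequence, so by default a test job waits for the whole build stage. GitLab's `needs` keyword lets a job start as soon as the specific jobs it depends on have finished. This sketch uses a hypothetical `lint` job and the `build` job from the Build Stage section.

```yaml
lint:
  stage: test
  needs: []                 # no dependencies: starts as soon as the pipeline starts
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run lint          # hypothetical lint script

test:e2e:
  stage: test
  needs: ["build"]          # waits only for build and downloads its dist/ artifact
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run test:e2e
```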

Artifact Management

```yaml
build:
  artifacts:
    paths:
      - dist/
    expire_in: 1 week
    when: on_success
```
Best practices:
  • Set appropriate expiration
  • Only store necessary artifacts
  • Use artifact compression
  • Clean up old artifacts
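To store only the artifacts you need, GitLab's `artifacts:exclude` can drop files that a matched path would otherwise include. This sketch assumes, hypothetically, that later stages never read the source maps under `dist/`.

```yaml
build:
  artifacts:
    paths:
      - dist/
    exclude:
      - dist/**/*.map    # hypothetical: source maps are not needed downstream
    expire_in: 1 week
    when: on_success
```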

Rollback Strategies

Automatic Rollback

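```yaml
deploy:
  stage: deploy
  script:
    - ./deploy.sh
    # Roll back on a failed health check, then still fail the job so the
    # pipeline reports the failed deploy instead of passing
    - './health-check.sh || { ./rollback.sh; exit 1; }'
```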

Manual Rollback

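```yaml
rollback:
  stage: deploy
  when: manual
  script:
    # Capture the script's output: a child script cannot set variables in this shell
    - export PREVIOUS_VERSION="$(./get-previous-version.sh)"
    - ./deploy.sh --version="$PREVIOUS_VERSION"
```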
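On Kubernetes, a Deployment keeps its own revision history, so a manual rollback job can use that history instead of a separate version lookup. This sketch assumes a Deployment named `app`. Use `kubectl rollout undo deployment/app --to-revision=N` to target a specific revision instead of the previous one.

```yaml
rollback:k8s:
  stage: deploy
  when: manual
  script:
    - kubectl rollout history deployment/app      # list recorded revisions
    - kubectl rollout undo deployment/app         # revert to the previous revision
    - kubectl rollout status deployment/app --timeout=5m
```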

Database Rollbacks

  • Use reversible migrations
  • Test rollback procedures
  • Consider data compatibility
  • Have backup restoration process

Documentation

Pipeline Documentation

Document in your repository:
  • Pipeline stages and their purpose
  • Required environment variables
  • Deployment procedures
  • Troubleshooting guides
  • Rollback procedures

Runbooks

Create runbooks for:
  • Deployment failures
  • Rollback procedures
  • Environment setup
  • Incident response

Continuous Improvement

Metrics to Track

  • Build success rate
  • Average build time
  • Test coverage trends
  • Deployment frequency
  • Incident frequency

Regular Reviews

  • Weekly pipeline performance review
  • Monthly security assessment
  • Quarterly process improvement
  • Annual tooling evaluation