Shipping and Launch

Overview

Ship with confidence. The goal is not just to deploy — it's to deploy safely, with monitoring in place, a rollback plan ready, and a clear understanding of what success looks like. Every launch should be reversible, observable, and incremental.

When to Use

  • Deploying a feature to production for the first time
  • Releasing a significant change to users
  • Migrating data or infrastructure
  • Opening a beta or early access program
  • Any deployment that carries risk (all of them)

The Pre-Launch Checklist

Code Quality

  • All tests pass (unit, integration, e2e)
  • Build succeeds with no warnings
  • Lint and type checking pass
  • Code reviewed and approved
  • No TODO comments that should be resolved before launch
  • No `console.log` debugging statements in production code
  • Error handling covers expected failure modes

Security

  • No secrets in code or version control
  • `npm audit` shows no critical or high vulnerabilities
  • Input validation on all user-facing endpoints
  • Authentication and authorization checks in place
  • Security headers configured (CSP, HSTS, etc.)
  • Rate limiting on authentication endpoints
  • CORS configured to specific origins (not wildcard)
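Two of the items above (security headers and origin-restricted CORS) can be expressed as plain data plus a small check, independent of any web framework. A minimal sketch; the header values and the allowed origin are illustrative starting points, not a universal policy:

```typescript
// Security headers from the checklist; values are illustrative
// starting points, tighten them for your application.
const securityHeaders: Record<string, string> = {
  "Content-Security-Policy": "default-src 'self'",
  "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
  "X-Content-Type-Options": "nosniff",
  "Referrer-Policy": "strict-origin-when-cross-origin",
};

// CORS allowlist: specific origins only, never a wildcard.
// The origin below is a hypothetical example.
const allowedOrigins = new Set(["https://app.example.com"]);

function corsOriginAllowed(origin: string): boolean {
  return allowedOrigins.has(origin);
}
```

Apply the headers and the origin check in whatever response hook your framework provides; keeping them as data makes them easy to audit before launch.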

Performance

  • Core Web Vitals within "Good" thresholds
  • No N+1 queries in critical paths
  • Images optimized (compression, responsive sizes, lazy loading)
  • Bundle size within budget
  • Database queries have appropriate indexes
  • Caching configured for static assets and repeated queries

Accessibility

  • Keyboard navigation works for all interactive elements
  • Screen reader can convey page content and structure
  • Color contrast meets WCAG 2.1 AA (4.5:1 for text)
  • Focus management correct for modals and dynamic content
  • Error messages are descriptive and associated with form fields
  • No accessibility warnings in axe-core or Lighthouse

Infrastructure

  • Environment variables set in production
  • Database migrations applied (or ready to apply)
  • DNS and SSL configured
  • CDN configured for static assets
  • Logging and error reporting configured
  • Health check endpoint exists and responds
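The health check item above can be as small as a liveness endpoint. A sketch using Node's built-in `http` module; the `/healthz` path and response shape are conventions, not requirements:

```typescript
import { createServer } from "http";

// Minimal liveness endpoint: returns 200 with a small JSON body so a
// load balancer (and the post-launch checklist) can verify the process
// is up. A readiness variant would also ping the database here.
const server = createServer((req, res) => {
  if (req.url === "/healthz") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: "ok", uptime: process.uptime() }));
    return;
  }
  res.writeHead(404);
  res.end();
});
```

Keep the liveness check dependency-free; a separate readiness endpoint can check downstream services without making the process look dead when a dependency flaps.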

Documentation

  • README updated with any new setup requirements
  • API documentation current
  • ADRs written for any architectural decisions
  • Changelog updated
  • User-facing documentation updated (if applicable)

Feature Flag Strategy

Ship behind feature flags to decouple deployment from release:
```typescript
// Feature flag check
const flags = await getFeatureFlags(userId);

if (flags.taskSharing) {
  // New feature: task sharing
  return <TaskSharingPanel task={task} />;
}

// Default: existing behavior
return null;
```
Feature flag lifecycle:

```
1. DEPLOY with flag OFF     → Code is in production but inactive
2. ENABLE for team/beta     → Internal testing in production environment
3. GRADUAL ROLLOUT          → 5% → 25% → 50% → 100% of users
4. MONITOR at each stage    → Watch error rates, performance, user feedback
5. CLEAN UP                 → Remove flag and dead code path after full rollout
```
Rules:
  • Every feature flag has an owner and an expiration date
  • Clean up flags within 2 weeks of full rollout
  • Don't nest feature flags (creates exponential combinations)
  • Test both flag states (on and off) in CI
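The gradual-rollout percentages are typically implemented by hashing a stable user ID into a fixed bucket, so a user enabled at 5% stays enabled at 25% and beyond. A minimal sketch; the hash choice and function names are assumptions, not any specific flag library's API:

```typescript
import { createHash } from "crypto";

// Deterministically map (flag, user) to a bucket in [0, 100).
// The same user always lands in the same bucket, so raising the
// rollout percentage only ever adds users, never flips them off.
function rolloutBucket(flagName: string, userId: string): number {
  const digest = createHash("sha256").update(`${flagName}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket is below the percentage.
function isEnabled(flagName: string, userId: string, percent: number): boolean {
  return rolloutBucket(flagName, userId) < percent;
}
```

Salting the hash with the flag name keeps rollouts independent: the 5% of users who see one flag first are not the same 5% for every flag.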

Staged Rollout

The Rollout Sequence

```
1. DEPLOY to staging
   └── Full test suite in staging environment
   └── Manual smoke test of critical flows

2. DEPLOY to production (feature flag OFF)
   └── Verify deployment succeeded (health check)
   └── Check error monitoring (no new errors)

3. ENABLE for team (flag ON for internal users)
   └── Team uses the feature in production
   └── 24-hour monitoring window

4. CANARY rollout (flag ON for 5% of users)
   └── Monitor error rates, latency, user behavior
   └── Compare metrics: canary vs. baseline
   └── 24-48 hour monitoring window

5. GRADUAL increase (25% → 50% → 100%)
   └── Same monitoring at each step
   └── Ability to roll back to previous percentage at any point

6. FULL rollout (flag ON for all users)
   └── Monitor for 1 week
   └── Clean up feature flag
```

When to Roll Back

Roll back immediately if:
  • Error rate increases by more than 2x baseline
  • P95 latency increases by more than 50%
  • User-reported issues spike
  • Data integrity issues detected
  • Security vulnerability discovered
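The first two triggers are mechanical enough to automate as an alert or a gate in the rollout tooling. A sketch of the comparison against a baseline; the `Metrics` shape is an assumption for illustration:

```typescript
interface Metrics {
  errorRate: number; // errors per request, e.g. 0.01 = 1%
  p95LatencyMs: number; // 95th-percentile response time
}

// Mirror the thresholds above: more than 2x the baseline error rate,
// or p95 latency more than 50% above baseline.
function shouldRollBack(baseline: Metrics, current: Metrics): boolean {
  return (
    current.errorRate > 2 * baseline.errorRate ||
    current.p95LatencyMs > 1.5 * baseline.p95LatencyMs
  );
}
```

The remaining triggers (user reports, data integrity, security) need human judgment, which is why someone should be watching the deploy rather than relying on alerts alone.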

Monitoring and Observability

What to Monitor

```
Application metrics:
├── Error rate (total and by endpoint)
├── Response time (p50, p95, p99)
├── Request volume
├── Active users
└── Key business metrics (conversion, engagement)

Infrastructure metrics:
├── CPU and memory utilization
├── Database connection pool usage
├── Disk space
├── Network latency
└── Queue depth (if applicable)

Client metrics:
├── Core Web Vitals (LCP, INP, CLS)
├── JavaScript errors
├── API error rates from client perspective
└── Page load time
```
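The p50/p95/p99 figures above are computed from recorded latency samples. A nearest-rank percentile sketch for small sample sets; real monitoring systems typically use histograms or sketches rather than sorting raw samples, but the definition is the same:

```typescript
// Nearest-rank percentile: sort the samples, take the value at
// rank ceil(p/100 * n). Throws on an empty sample set.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Note how a single slow outlier moves p99 (and often p95) while leaving p50 untouched, which is why the checklist tracks all three.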

Error Reporting

```typescript
// Set up error boundary with reporting
class ErrorBoundary extends React.Component<
  React.PropsWithChildren,
  { hasError: boolean }
> {
  state = { hasError: false };

  static getDerivedStateFromError() {
    // Switch to the fallback UI on the next render
    return { hasError: true };
  }

  componentDidCatch(error: Error, info: React.ErrorInfo) {
    // Report to error tracking service
    reportError(error, {
      componentStack: info.componentStack,
      userId: getCurrentUser()?.id,
      page: window.location.pathname,
    });
  }

  render() {
    if (this.state.hasError) {
      return <ErrorFallback onRetry={() => this.setState({ hasError: false })} />;
    }
    return this.props.children;
  }
}

// Server-side error reporting
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  reportError(err, {
    method: req.method,
    url: req.url,
    userId: req.user?.id,
  });

  // Don't expose internals to users
  res.status(500).json({
    error: { code: 'INTERNAL_ERROR', message: 'Something went wrong' },
  });
});
```

Post-Launch Verification

In the first hour after launch:
1. Check health endpoint returns 200
2. Check error monitoring dashboard (no new error types)
3. Check latency dashboard (no regression)
4. Test the critical user flow manually
5. Verify logs are flowing and readable
6. Confirm rollback mechanism works (dry run if possible)

Rollback Strategy

Every deployment needs a rollback plan before it happens:

```markdown
## Rollback Plan for [Feature/Release]

### Trigger Conditions

- Error rate > 2x baseline
- P95 latency > [X]ms
- User reports of [specific issue]

### Rollback Steps

1. Disable feature flag (if applicable) OR
2. Deploy previous version: `git revert <commit> && git push`
3. Verify rollback: health check, error monitoring
4. Communicate: notify team of rollback

### Database Considerations

- Migration [X] has a rollback: `npx prisma migrate rollback`
- Data inserted by new feature: [preserved / cleaned up]

### Time to Rollback

- Feature flag: < 1 minute
- Redeploy previous version: < 5 minutes
- Database rollback: < 15 minutes
```

Common Rationalizations

| Rationalization | Reality |
| --- | --- |
| "It works in staging, it'll work in production" | Production has different data, traffic patterns, and edge cases. Monitor after deploy. |
| "We don't need feature flags for this" | Every feature benefits from a kill switch. Even "simple" changes can break things. |
| "Monitoring is overhead" | Not having monitoring means you discover problems from user complaints instead of dashboards. |
| "We'll add monitoring later" | Add it before launch. You can't debug what you can't see. |
| "Rolling back is admitting failure" | Rolling back is responsible engineering. Shipping a broken feature is the failure. |

Red Flags

  • Deploying without a rollback plan
  • No monitoring or error reporting in production
  • Big-bang releases (everything at once, no staging)
  • Feature flags with no expiration or owner
  • No one monitoring the deploy for the first hour
  • Production environment configuration done by memory, not code
  • "It's Friday afternoon, let's ship it"

Verification

Before deploying:
  • Pre-launch checklist completed (all sections green)
  • Feature flag configured (if applicable)
  • Rollback plan documented
  • Monitoring dashboards set up
  • Team notified of deployment
After deploying:
  • Health check returns 200
  • Error rate is normal
  • Latency is normal
  • Critical user flow works
  • Logs are flowing
  • Rollback tested or verified ready