Shipping and Launch

Overview

Ship with confidence. The goal is not just to deploy — it's to deploy safely, with monitoring in place, a rollback plan ready, and a clear understanding of what success looks like. Every launch should be reversible, observable, and incremental.

When to Use

  • Deploying a feature to production for the first time
  • Releasing a significant change to users
  • Migrating data or infrastructure
  • Opening a beta or early access program
  • Any deployment that carries risk (all of them)

The Pre-Launch Checklist

Code Quality

  • All tests pass (unit, integration, e2e)
  • Build succeeds with no warnings
  • Lint and type checking pass
  • Code reviewed and approved
  • No TODO comments that should be resolved before launch
  • No `console.log` debugging statements in production code
  • Error handling covers expected failure modes

Security

  • No secrets in code or version control
  • `npm audit` shows no critical or high vulnerabilities
  • Input validation on all user-facing endpoints
  • Authentication and authorization checks in place
  • Security headers configured (CSP, HSTS, etc.)
  • Rate limiting on authentication endpoints
  • CORS configured to specific origins (not wildcard)
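Two of the items above (security headers and origin-restricted CORS) can be expressed as plain data plus a small check, independent of any web framework. A minimal sketch; the header values and the allowed origin are illustrative starting points, not a universal policy:

```typescript
// Security headers from the checklist; values are illustrative
// starting points, tighten them for your application.
const securityHeaders: Record<string, string> = {
  "Content-Security-Policy": "default-src 'self'",
  "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
  "X-Content-Type-Options": "nosniff",
  "Referrer-Policy": "strict-origin-when-cross-origin",
};

// CORS allowlist: specific origins only, never a wildcard.
// The origin below is a hypothetical example.
const allowedOrigins = new Set(["https://app.example.com"]);

function corsOriginAllowed(origin: string): boolean {
  return allowedOrigins.has(origin);
}
```

Apply the headers and the origin check in whatever response hook your framework provides; keeping them as data makes them easy to audit before launch.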

Performance

  • Core Web Vitals within "Good" thresholds
  • No N+1 queries in critical paths
  • Images optimized (compression, responsive sizes, lazy loading)
  • Bundle size within budget
  • Database queries have appropriate indexes
  • Caching configured for static assets and repeated queries

Accessibility

  • Keyboard navigation works for all interactive elements
  • Screen reader can convey page content and structure
  • Color contrast meets WCAG 2.1 AA (4.5:1 for text)
  • Focus management correct for modals and dynamic content
  • Error messages are descriptive and associated with form fields
  • No accessibility warnings in axe-core or Lighthouse

Infrastructure

  • Environment variables set in production
  • Database migrations applied (or ready to apply)
  • DNS and SSL configured
  • CDN configured for static assets
  • Logging and error reporting configured
  • Health check endpoint exists and responds
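The health check item above can be as small as a liveness endpoint. A sketch using Node's built-in `http` module; the `/healthz` path and response shape are conventions, not requirements:

```typescript
import { createServer } from "http";

// Minimal liveness endpoint: returns 200 with a small JSON body so a
// load balancer (and the post-launch checklist) can verify the process
// is up. A readiness variant would also ping the database here.
const server = createServer((req, res) => {
  if (req.url === "/healthz") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: "ok", uptime: process.uptime() }));
    return;
  }
  res.writeHead(404);
  res.end();
});
```

Keep the liveness check dependency-free; a separate readiness endpoint can check downstream services without making the process look dead when a dependency flaps.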

Documentation

  • README updated with any new setup requirements
  • API documentation current
  • ADRs written for any architectural decisions
  • Changelog updated
  • User-facing documentation updated (if applicable)

Feature Flag Strategy

Ship behind feature flags to decouple deployment from release:
```typescript
// Feature flag check
const flags = await getFeatureFlags(userId);

if (flags.taskSharing) {
  // New feature: task sharing
  return <TaskSharingPanel task={task} />;
}

// Default: existing behavior
return null;
```
Feature flag lifecycle:

```
1. DEPLOY with flag OFF     → Code is in production but inactive
2. ENABLE for team/beta     → Internal testing in production environment
3. GRADUAL ROLLOUT          → 5% → 25% → 50% → 100% of users
4. MONITOR at each stage    → Watch error rates, performance, user feedback
5. CLEAN UP                 → Remove flag and dead code path after full rollout
```
Rules:
  • Every feature flag has an owner and an expiration date
  • Clean up flags within 2 weeks of full rollout
  • Don't nest feature flags (creates exponential combinations)
  • Test both flag states (on and off) in CI
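The gradual-rollout percentages are typically implemented by hashing a stable user ID into a fixed bucket, so a user enabled at 5% stays enabled at 25% and beyond. A minimal sketch; the hash choice and function names are assumptions, not any specific flag library's API:

```typescript
import { createHash } from "crypto";

// Deterministically map (flag, user) to a bucket in [0, 100).
// The same user always lands in the same bucket, so raising the
// rollout percentage only ever adds users, never flips them off.
function rolloutBucket(flagName: string, userId: string): number {
  const digest = createHash("sha256").update(`${flagName}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket is below the percentage.
function isEnabled(flagName: string, userId: string, percent: number): boolean {
  return rolloutBucket(flagName, userId) < percent;
}
```

Salting the hash with the flag name keeps rollouts independent: the 5% of users who see one flag first are not the same 5% for every flag.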

Staged Rollout

The Rollout Sequence

```
1. DEPLOY to staging
   └── Full test suite in staging environment
   └── Manual smoke test of critical flows

2. DEPLOY to production (feature flag OFF)
   └── Verify deployment succeeded (health check)
   └── Check error monitoring (no new errors)

3. ENABLE for team (flag ON for internal users)
   └── Team uses the feature in production
   └── 24-hour monitoring window

4. CANARY rollout (flag ON for 5% of users)
   └── Monitor error rates, latency, user behavior
   └── Compare metrics: canary vs. baseline
   └── 24-48 hour monitoring window

5. GRADUAL increase (25% → 50% → 100%)
   └── Same monitoring at each step
   └── Ability to roll back to previous percentage at any point

6. FULL rollout (flag ON for all users)
   └── Monitor for 1 week
   └── Clean up feature flag
```

When to Roll Back

Roll back immediately if:
  • Error rate increases by more than 2x baseline
  • P95 latency increases by more than 50%
  • User-reported issues spike
  • Data integrity issues detected
  • Security vulnerability discovered
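The first two triggers are mechanical enough to automate as an alert or a gate in the rollout tooling. A sketch of the comparison against a baseline; the `Metrics` shape is an assumption for illustration:

```typescript
interface Metrics {
  errorRate: number; // errors per request, e.g. 0.01 = 1%
  p95LatencyMs: number; // 95th-percentile response time
}

// Mirror the thresholds above: more than 2x the baseline error rate,
// or p95 latency more than 50% above baseline.
function shouldRollBack(baseline: Metrics, current: Metrics): boolean {
  return (
    current.errorRate > 2 * baseline.errorRate ||
    current.p95LatencyMs > 1.5 * baseline.p95LatencyMs
  );
}
```

The remaining triggers (user reports, data integrity, security) need human judgment, which is why someone should be watching the deploy rather than relying on alerts alone.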

Monitoring and Observability

What to Monitor

```
Application metrics:
├── Error rate (total and by endpoint)
├── Response time (p50, p95, p99)
├── Request volume
├── Active users
└── Key business metrics (conversion, engagement)

Infrastructure metrics:
├── CPU and memory utilization
├── Database connection pool usage
├── Disk space
├── Network latency
└── Queue depth (if applicable)

Client metrics:
├── Core Web Vitals (LCP, INP, CLS)
├── JavaScript errors
├── API error rates from client perspective
└── Page load time
```
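The p50/p95/p99 figures above are computed from recorded latency samples. A nearest-rank percentile sketch for small sample sets; real monitoring systems typically use histograms or sketches rather than sorting raw samples, but the definition is the same:

```typescript
// Nearest-rank percentile: sort the samples, take the value at
// rank ceil(p/100 * n). Throws on an empty sample set.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Note how a single slow outlier moves p99 (and often p95) while leaving p50 untouched, which is why the checklist tracks all three.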

Error Reporting

```typescript
// Set up error boundary with reporting
class ErrorBoundary extends React.Component<
  React.PropsWithChildren,
  { hasError: boolean }
> {
  state = { hasError: false };

  static getDerivedStateFromError() {
    // Switch to the fallback UI on the next render
    return { hasError: true };
  }

  componentDidCatch(error: Error, info: React.ErrorInfo) {
    // Report to error tracking service
    reportError(error, {
      componentStack: info.componentStack,
      userId: getCurrentUser()?.id,
      page: window.location.pathname,
    });
  }

  render() {
    if (this.state.hasError) {
      return <ErrorFallback onRetry={() => this.setState({ hasError: false })} />;
    }
    return this.props.children;
  }
}

// Server-side error reporting
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  reportError(err, {
    method: req.method,
    url: req.url,
    userId: req.user?.id,
  });

  // Don't expose internals to users
  res.status(500).json({
    error: { code: 'INTERNAL_ERROR', message: 'Something went wrong' },
  });
});
```

Post-Launch Verification

In the first hour after launch:
1. Check health endpoint returns 200
2. Check error monitoring dashboard (no new error types)
3. Check latency dashboard (no regression)
4. Test the critical user flow manually
5. Verify logs are flowing and readable
6. Confirm rollback mechanism works (dry run if possible)

Rollback Strategy

Every deployment needs a rollback plan before it happens:

```markdown
## Rollback Plan for [Feature/Release]

### Trigger Conditions

- Error rate > 2x baseline
- P95 latency > [X]ms
- User reports of [specific issue]

### Rollback Steps

1. Disable feature flag (if applicable) OR
2. Deploy previous version: `git revert <commit> && git push`
3. Verify rollback: health check, error monitoring
4. Communicate: notify team of rollback

### Database Considerations

- Migration [X] has a rollback: `npx prisma migrate rollback`
- Data inserted by new feature: [preserved / cleaned up]

### Time to Rollback

- Feature flag: < 1 minute
- Redeploy previous version: < 5 minutes
- Database rollback: < 15 minutes
```

Common Rationalizations

| Rationalization | Reality |
| --- | --- |
| "It works in staging, it'll work in production" | Production has different data, traffic patterns, and edge cases. Monitor after deploy. |
| "We don't need feature flags for this" | Every feature benefits from a kill switch. Even "simple" changes can break things. |
| "Monitoring is overhead" | Not having monitoring means you discover problems from user complaints instead of dashboards. |
| "We'll add monitoring later" | Add it before launch. You can't debug what you can't see. |
| "Rolling back is admitting failure" | Rolling back is responsible engineering. Shipping a broken feature is the failure. |

Red Flags

  • Deploying without a rollback plan
  • No monitoring or error reporting in production
  • Big-bang releases (everything at once, no staging)
  • Feature flags with no expiration or owner
  • No one monitoring the deploy for the first hour
  • Production environment configuration done by memory, not code
  • "It's Friday afternoon, let's ship it"

Verification

Before deploying:
  • Pre-launch checklist completed (all sections green)
  • Feature flag configured (if applicable)
  • Rollback plan documented
  • Monitoring dashboards set up
  • Team notified of deployment
After deploying:
  • Health check returns 200
  • Error rate is normal
  • Latency is normal
  • Critical user flow works
  • Logs are flowing
  • Rollback tested or verified ready