exp-driven-dev

Experimentation-Driven Development

When This Skill Activates

Claude uses this skill when:
  • Building new features that affect core metrics
  • Implementing A/B testing infrastructure
  • Making data-driven decisions
  • Setting up feature flags for gradual rollouts
  • Choosing which metrics to track

Core Frameworks

1. Experiment Design (Source: Ronny Kohavi, Microsoft/Netflix)

The HITS Framework:
H - Hypothesis:
"We believe that [change] will cause [metric] to [increase/decrease] because [reason]"
I - Implementation:
  • Feature flag setup
  • Treatment vs control
  • Sample size calculation
T - Test:
  • Run until the planned sample size is reached, then check statistical significance
  • Monitor guardrail metrics
  • Watch for unexpected effects
S - Ship or Stop:
  • Ship if positive
  • Stop if negative
  • Iterate if inconclusive
Example:
markdown
Hypothesis:
"We believe that adding social proof ('X people bought this') 
will increase conversion rate by 10% 
because it reduces purchase anxiety."

Implementation:
- Control: No social proof
- Treatment: Show "X people bought"
- Sample size: 10,000 users per variant
- Duration: 2 weeks

Test:
- Primary metric: Conversion rate
- Guardrails: Cart abandonment, return rate

Ship or Stop:
- If conversion +5% or more → Ship
- If conversion -2% or less → Stop
- If inconclusive → Iterate and retest
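
The Ship or Stop rule in the example can be sketched in code. This is a minimal illustration, not a definitive implementation: it assumes the +5% and -2% thresholds are relative lifts, that both ship and stop also require significance under a pooled two-proportion z-test at the 95% level (|z| > 1.96), and that the control rate is non-zero. The names `zStatistic` and `decide` and the conversion counts are hypothetical.

```javascript
// Two-proportion z-statistic using a pooled standard error.
function zStatistic(control, treatment) {
  const p1 = control.conversions / control.users;
  const p2 = treatment.conversions / treatment.users;
  const pooled =
    (control.conversions + treatment.conversions) /
    (control.users + treatment.users);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.users + 1 / treatment.users)
  );
  return se === 0 ? 0 : (p2 - p1) / se;
}

// Thresholds mirror the example: ship at +5% relative lift, stop at -2%.
// Assumption: both outcomes must also be statistically significant.
function decide(control, treatment, options = {}) {
  const { shipLift = 0.05, stopLift = -0.02, zCritical = 1.96 } = options;
  const p1 = control.conversions / control.users;
  const p2 = treatment.conversions / treatment.users;
  const relativeLift = (p2 - p1) / p1;
  const significant = Math.abs(zStatistic(control, treatment)) > zCritical;

  if (significant && relativeLift >= shipLift) return 'ship';
  if (significant && relativeLift <= stopLift) return 'stop';
  return 'iterate';
}

// Made-up counts: 10% vs 11% conversion with 10,000 users per variant.
const control = { users: 10000, conversions: 1000 };
const treatment = { users: 10000, conversions: 1100 };
console.log(decide(control, treatment)); // 'ship'
```

With these counts the z-statistic is about 2.3, so the 10% relative lift clears both the significance check and the ship threshold.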

2. Metric Selection

Primary Metric:
  • ONE metric you're trying to move
  • Directly tied to business value
  • Clear success threshold
Guardrail Metrics:
  • Metrics that shouldn't degrade
  • Prevent gaming the system
  • Ensure quality maintained
Example:
Feature: Streamlined checkout

Primary Metric:
✅ Purchase completion rate (+10%)

Guardrail Metrics:
⚠️ Cart abandonment (don't increase)
⚠️ Return rate (don't increase)
⚠️ Support tickets (don't increase)
⚠️ Load time (stay <2s)
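
A guardrail check for the checkout example might look like the sketch below. The metric keys, the `violatedGuardrails` helper, the `tolerance` parameter, and all numbers are illustrative assumptions. A production check would normally test whether a degradation is statistically significant rather than compare raw values.

```javascript
// Guardrail rules taken from the checkout example; metric keys are hypothetical.
const GUARDRAILS = [
  { metric: 'cartAbandonmentRate', rule: 'no-increase' },
  { metric: 'returnRate', rule: 'no-increase' },
  { metric: 'supportTickets', rule: 'no-increase' },
  { metric: 'loadTimeSeconds', rule: 'max', limit: 2 },
];

// Returns the names of the guardrails that the treatment violates.
// `tolerance` is the relative increase treated as noise (assumed default: 0).
function violatedGuardrails(control, treatment, tolerance = 0) {
  return GUARDRAILS.filter(({ metric, rule, limit }) => {
    if (rule === 'max') return treatment[metric] >= limit;
    return treatment[metric] > control[metric] * (1 + tolerance);
  }).map(({ metric }) => metric);
}

// Made-up numbers for illustration only.
const controlMetrics = {
  cartAbandonmentRate: 0.68, returnRate: 0.05, supportTickets: 120, loadTimeSeconds: 1.4,
};
const treatmentMetrics = {
  cartAbandonmentRate: 0.66, returnRate: 0.05, supportTickets: 118, loadTimeSeconds: 2.3,
};
console.log(violatedGuardrails(controlMetrics, treatmentMetrics)); // [ 'loadTimeSeconds' ]
```

Here the primary metric could improve and the experiment would still be blocked, because load time breaches the 2-second limit.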

3. Statistical Significance

The Math:
Minimum sample size = f(Effect size, Confidence, Power)

Typical settings:
- Confidence: 95% (significance level α = 0.05)
- Power: 80% (an 80% chance of detecting a real effect of the target size)
- Effect size: Minimum detectable change

Example:
- Baseline conversion: 10%
- Minimum detectable effect: +1 percentage point (10% → 11%)
- Required: ~15,000 users per variant
Common Mistakes:
  • ❌ Stopping test early (peeking bias)
  • ❌ Running too short (seasonal effects)
  • ❌ Too many variants (dilutes sample)
  • ❌ Changing test mid-flight
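
The sample-size example above can be reproduced with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: a two-sided test, equal-sized variants, and hard-coded z-values for the typical 95% / 80% settings. The function name `sampleSizePerVariant` is hypothetical.

```javascript
// Standard normal quantiles for the typical settings.
const Z_ALPHA_95 = 1.959964; // two-sided test, alpha = 0.05
const Z_POWER_80 = 0.841621; // power = 0.80

// n per variant = (zAlpha + zPower)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
function sampleSizePerVariant(
  baselineRate,
  absoluteLift,
  zAlpha = Z_ALPHA_95,
  zPower = Z_POWER_80
) {
  const p1 = baselineRate;
  const p2 = baselineRate + absoluteLift;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zPower) ** 2 * variance) / (p2 - p1) ** 2);
}

// Baseline 10%, minimum detectable effect +1 percentage point.
console.log(sampleSizePerVariant(0.10, 0.01)); // roughly 14,700
```

The result rounds to the ~15,000 users per variant quoted above. Halving the detectable effect roughly quadruples the required sample, which is why the minimum detectable effect should be chosen before the test starts.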

4. Feature Flag Architecture

Implementation:
javascript
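// Hypothetical rollout table (percent of users per flag); normally served by a flag service.
const ROLLOUTS = { 'new-checkout': 10 };

function getFeatureRollout(feature) {
  return ROLLOUTS[feature] ?? 0; // unknown flags stay off
}

// 32-bit FNV-1a hash, one possible choice: deterministic, so assignment is sticky per user.
function hashString(input) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Feature flag pattern
// newCheckoutExperience / oldCheckoutExperience are the application's own render paths.
function checkoutFlow(user) {
  if (isFeatureEnabled(user, 'new-checkout')) {
    return newCheckoutExperience();
  } else {
    return oldCheckoutExperience();
  }
}

// Gradual rollout
function isFeatureEnabled(user, feature) {
  const rolloutPercent = getFeatureRollout(feature);
  // Salt with the feature name so the same users are not first in line for every flag.
  const userBucket = hashString(`${feature}:${user.id}`) % 100;
  return userBucket < rolloutPercent;
}

// Experiment assignment (50/50 split)
// track() stands in for the application's analytics client.
function assignExperiment(user, experiment) {
  const bucket = hashString(`${experiment}:${user.id}`) % 100;
  const variant = bucket < 50 ? 'control' : 'treatment';
  track('experiment_assigned', {
    userId: user.id,
    experiment: experiment,
    variant: variant
  });
  return variant;
}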

Decision Tree: Should We Experiment?

NEW FEATURE
├─ Affects core metrics? ──────YES──→ EXPERIMENT REQUIRED
│  NO ↓
├─ Risky change? ──────────────YES──→ EXPERIMENT RECOMMENDED
│  NO ↓
├─ Uncertain impact? ──────────YES──→ EXPERIMENT USEFUL
│  NO ↓
├─ Easy to A/B test? ─────────YES──→ WHY NOT EXPERIMENT?
│  NO ↓
└─ SHIP WITHOUT TEST
   (But still feature flag for rollback)
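
The decision tree above reads as a chain of checks in priority order. The sketch below encodes it directly; the function name `experimentRecommendation` and the boolean field names are hypothetical.

```javascript
// Walks the decision tree top to bottom and returns the first matching outcome.
function experimentRecommendation(feature) {
  const { affectsCoreMetrics, risky, uncertainImpact, easyToTest } = feature;
  if (affectsCoreMetrics) return 'EXPERIMENT REQUIRED';
  if (risky) return 'EXPERIMENT RECOMMENDED';
  if (uncertainImpact) return 'EXPERIMENT USEFUL';
  if (easyToTest) return 'WHY NOT EXPERIMENT?';
  return 'SHIP WITHOUT TEST'; // still ship behind a feature flag for rollback
}

console.log(
  experimentRecommendation({
    affectsCoreMetrics: false,
    risky: true,
    uncertainImpact: true,
    easyToTest: false,
  })
); // 'EXPERIMENT RECOMMENDED'
```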

Action Templates

Template 1: Experiment Spec

Experiment: [Name]

Hypothesis

We believe: [change]
Will cause: [metric] to [increase/decrease]
Because: [reasoning]

Variants

Control (50%)

[Current experience]

Treatment (50%)

[New experience]

Metrics

Primary Metric

  • What: [metric name]
  • Current: [baseline]
  • Target: [goal]
  • Success: [threshold]

Guardrail Metrics

  • Metric 1: [name] - Don't decrease
  • Metric 2: [name] - Don't increase
  • Metric 3: [name] - Maintain

Sample Size

  • Users needed: [X per variant]
  • Duration: [Y days]
  • Confidence: 95%
  • Power: 80%

Implementation

javascript
if (experiment('feature-name') === 'treatment') {
  // New experience
} else {
  // Old experience
}

Success Criteria

  • Primary metric improved by [X]%
  • No guardrail degradation
  • Statistical significance reached
  • No unexpected negative effects

Decision

  • If positive: Ship to 100%
  • If negative: Rollback, iterate
  • If inconclusive: Extend or redesign

Template 2: Feature Flag Implementation

typescript
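// features.ts
interface FeatureConfig {
  rollout: number;      // percentage of users, 0-100
  enabled: boolean;     // kill switch: false turns the flag off for everyone
  description: string;
}

export const FEATURES: Record<string, FeatureConfig> = {
  'new-checkout': {
    rollout: 10,  // 10% of users
    enabled: true,
    description: 'New streamlined checkout flow'
  },
  'ai-recommendations': {
    rollout: 0,  // Not live yet
    enabled: false,
    description: 'AI-powered product recommendations'
  }
};

// feature-flags.ts
import { FEATURES } from './features';

// 32-bit FNV-1a hash, one possible choice: deterministic, so assignment is sticky per user.
function consistentHash(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

export function isEnabled(userId: string, feature: string): boolean {
  const config = FEATURES[feature];
  if (!config || !config.enabled) return false;

  // Salt with the feature name so the same users are not first in line for every flag.
  const bucket = consistentHash(`${feature}:${userId}`) % 100;
  return bucket < config.rollout;
}

// usage in code (inside a React component, in a .tsx file)
if (isEnabled(user.id, 'new-checkout')) {
  return <NewCheckout />;
} else {
  return <OldCheckout />;
}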

Template 3: Experiment Dashboard

Experiment Dashboard

Active Experiments

Experiment 1: [Name]

  • Status: Running
  • Started: [date]
  • Progress: [X]% sample size reached
  • Primary metric: [current result]
  • Guardrails: ✅ All healthy

Experiment 2: [Name]

  • Status: Complete
  • Result: Treatment won (+15% conversion)
  • Decision: Ship to 100%
  • Shipped: [date]

Key Metrics

Experiment Velocity

  • Experiments launched: [X per month]
  • Win rate: [Y]%
  • Average duration: [Z] days

Impact

  • Revenue impact: +$[X]
  • Conversion improvement: +[Y]%
  • User satisfaction: +[Z] NPS

Learnings

  • [Key insight 1]
  • [Key insight 2]
  • [Key insight 3]

Quick Reference

🧪 Experiment Checklist

Before Starting:
  • Hypothesis written (believe → cause → because)
  • Primary metric defined
  • Guardrails identified
  • Sample size calculated
  • Feature flag implemented
  • Tracking instrumented
During Experiment:
  • Don't peek early (wait for significance)
  • Monitor guardrails daily
  • Watch for unexpected effects
  • Log any external factors (holidays, outages)
After Experiment:
  • Statistical significance reached
  • Guardrails not degraded
  • Decision made (ship/stop/iterate)
  • Learning documented

Real-World Examples

Example 1: Netflix Experimentation

Volume: 250+ experiments running at once
Approach: Everything is an experiment
Culture: "Strong opinions, weakly held - let data decide"
Example Test:
  • Hypothesis: Bigger thumbnails increase engagement
  • Result: No improvement, actually hurt browse time
  • Decision: Rollback
  • Learning: Testing first avoided the cost of shipping a change that hurt engagement

Example 2: Airbnb's Experiments

Test: New search ranking algorithm
Primary: Bookings per search
Guardrails:
  • Search quality (ratings of bookings)
  • Host earnings (don't concentrate bookings)
  • Guest satisfaction
Result: +3% bookings, all guardrails healthy → Ship

Example 3: Stripe's Feature Flags

Approach: Every feature behind a flag
Benefits:
  • Instant rollback (flip flag)
  • Gradual rollout (1% → 5% → 25% → 100%)
  • Test in production safely
Example:
javascript
if (experiments.isEnabled('instant-payouts')) {
  return <InstantPayouts />;
}
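
The staged rollout (1% → 5% → 25% → 100%) combined with instant rollback can be sketched as a small helper. This illustrates the pattern and is not Stripe's implementation: the `nextRolloutPercent` name and the rule that an unhealthy guardrail drops the flag to 0% are assumptions.

```javascript
// Rollout stages from the example above.
const ROLLOUT_STAGES = [1, 5, 25, 100];

// Returns the next rollout percentage, or 0 to roll back when guardrails degrade.
function nextRolloutPercent(currentPercent, guardrailsHealthy) {
  if (!guardrailsHealthy) return 0; // instant rollback: flag off for everyone
  const next = ROLLOUT_STAGES.find((stage) => stage > currentPercent);
  return next ?? 100; // already at the last stage
}

console.log(nextRolloutPercent(5, true));   // 25
console.log(nextRolloutPercent(25, false)); // 0
```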

Common Pitfalls

❌ Mistake 1: Peeking Too Early

Problem: Stopping the test as soon as the result looks significant, before the planned sample size is reached
Fix: Calculate the sample size upfront and do not stop until it is reached
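
To see why peeking matters, the sketch below simulates A/A tests, in which both variants share the same true conversion rate, so every "significant" result is a false positive. It compares checking at 10 interim looks against checking once at the end. All parameters, the seeded generator, and the function names are illustrative assumptions, and `usersPerArm` should be divisible by `looks`.

```javascript
// Seeded PRNG (mulberry32) so runs are reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z-statistic for two arms of equal size n.
function zStat(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
  return se === 0 ? 0 : (convB - convA) / n / se;
}

function simulatePeeking(options = {}) {
  const { tests = 1000, usersPerArm = 1000, looks = 10, rate = 0.1, seed = 42 } = options;
  const rand = mulberry32(seed);
  const step = usersPerArm / looks;
  let peekingHits = 0;
  let finalOnlyHits = 0;

  for (let t = 0; t < tests; t++) {
    let convA = 0;
    let convB = 0;
    let significantAtAnyLook = false;
    let significantAtEnd = false;

    for (let n = 1; n <= usersPerArm; n++) {
      if (rand() < rate) convA++;
      if (rand() < rate) convB++;
      if (n % step === 0) {
        const significant = Math.abs(zStat(convA, convB, n)) > 1.96;
        if (significant) significantAtAnyLook = true;
        if (n === usersPerArm) significantAtEnd = significant;
      }
    }
    if (significantAtAnyLook) peekingHits++;
    if (significantAtEnd) finalOnlyHits++;
  }

  return {
    peekingRate: peekingHits / tests,     // stop at the first significant look
    finalOnlyRate: finalOnlyHits / tests, // look once, at the planned sample size
  };
}

console.log(simulatePeeking());
```

The final-only rate should land near the nominal 5%, while the peeking rate should come out noticeably higher, because each extra look is another chance for noise to cross the threshold.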

❌ Mistake 2: No Guardrails

Problem: Gaming the metric (increase clicks but hurt quality)
Fix: Always define guardrails

❌ Mistake 3: Too Many Variants

Problem: Not enough users per variant
Fix: Limit to 2-3 variants max

❌ Mistake 4: Ignoring External Factors

Problem: Holiday spike looks like treatment effect
Fix: Note external events, extend duration

Related Skills

  • metrics-frameworks - For choosing right metrics
  • growth-embedded - For growth experiments
  • ship-decisions - For when to ship vs test more
  • strategic-build - For deciding what to test

Key Quotes

Ronny Kohavi:
"The best way to predict the future is to run an experiment."
Netflix Culture:
"Strong opinions, weakly held. Let data be the tie-breaker."
Airbnb:
"We trust our intuition to generate hypotheses, and we trust data to make decisions."

Further Learning

  • references/experiment-design-guide.md - Complete methodology
  • references/statistical-significance.md - Sample size calculations
  • references/feature-flags-implementation.md - Code examples
  • references/guardrail-metrics.md - Choosing guardrails