exp-driven-dev

Experimentation-Driven Development

When This Skill Activates

Claude uses this skill when:

Building new features that affect core metrics
Implementing A/B testing infrastructure
Making data-driven decisions
Setting up feature flags for gradual rollouts
Choosing which metrics to track

Core Frameworks

1. Experiment Design (Source: Ronny Kohavi, Microsoft/Netflix)

The HITS Framework:

H - Hypothesis:

"We believe that [change] will cause [metric] to [increase/decrease] because [reason]"

I - Implementation:

Feature flag setup
Treatment vs control
Sample size calculation

T - Test:

Run until the planned sample size is reached, then test for statistical significance
Monitor guardrail metrics
Watch for unexpected effects

S - Ship or Stop:

Ship if positive
Stop if negative
Iterate if inconclusive

Example:

markdown

Hypothesis:
"We believe that adding social proof ('X people bought this') 
will increase conversion rate by 10% 
because it reduces purchase anxiety."

Implementation:
- Control: No social proof
- Treatment: Show "X people bought"
- Sample size: 10,000 users per variant
- Duration: 2 weeks

Test:
- Primary metric: Conversion rate
- Guardrails: Cart abandonment, return rate

Ship or Stop:
- If conversion rises by 5% or more → Ship
- If conversion falls by 2% or more → Stop
- If inconclusive → Iterate and retest
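The "Ship or Stop" step can be sketched as a small decision helper. This is a minimal illustration, not a definitive implementation: `zStatistic`, `shipOrStop`, and the `users`/`conversions` field names are hypothetical, and it assumes a two-sided two-proportion z-test at 95% confidence (critical value 1.96) run once, after the planned sample size has been reached.

```javascript
// Critical z value for a two-sided test at 95% confidence.
const Z_CRITICAL_95 = 1.96;

// Two-proportion z statistic using a pooled standard error.
// Each argument is { users, conversions } for one variant.
function zStatistic(control, treatment) {
  const p1 = control.conversions / control.users;
  const p2 = treatment.conversions / treatment.users;
  const pooled =
    (control.conversions + treatment.conversions) /
    (control.users + treatment.users);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.users + 1 / treatment.users)
  );
  return (p2 - p1) / standardError;
}

// Hypothetical helper: maps the test outcome to ship / stop / iterate.
function shipOrStop(control, treatment, guardrailsHealthy) {
  if (!guardrailsHealthy) return 'stop';       // a broken guardrail overrides a win
  const z = zStatistic(control, treatment);
  if (z >= Z_CRITICAL_95) return 'ship';       // significantly positive
  if (z <= -Z_CRITICAL_95) return 'stop';      // significantly negative
  return 'iterate';                            // inconclusive
}
```

A team could layer practical thresholds (such as the +5% / -2% cut-offs in the example) on top of the significance check.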

2. Metric Selection

Primary Metric:

ONE metric you're trying to move
Directly tied to business value
Clear success threshold

Guardrail Metrics:

Metrics that shouldn't degrade
Prevent gaming the system
Ensure quality maintained

Example:

Feature: Streamlined checkout

Primary Metric:
✅ Purchase completion rate (+10%)

Guardrail Metrics:
⚠️ Cart abandonment (don't increase)
⚠️ Return rate (don't increase)
⚠️ Support tickets (don't increase)
⚠️ Load time (stay <2s)
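The guardrails in this example could be encoded as data and checked automatically. The sketch below is an assumption-laden illustration: the metric keys, the `mustNot`/`max`/`tolerance` fields, and `checkGuardrails` are all hypothetical names, and it compares raw values, whereas a real check would also account for statistical noise.

```javascript
// Each rule names a metric and the condition it must satisfy.
// Keys and limits mirror the checkout example; they are illustrative only.
const guardrails = [
  { metric: 'cartAbandonmentRate', mustNot: 'increase' },
  { metric: 'returnRate', mustNot: 'increase' },
  { metric: 'supportTickets', mustNot: 'increase' },
  { metric: 'loadTimeSeconds', max: 2 },
];

// Returns the names of violated guardrails (empty array means all healthy).
function checkGuardrails(controlMetrics, treatmentMetrics, rules) {
  const violations = [];
  for (const rule of rules) {
    const before = controlMetrics[rule.metric];
    const after = treatmentMetrics[rule.metric];
    const tolerance = rule.tolerance || 0; // allowed drift, defaults to none
    if (rule.max !== undefined) {
      // Absolute ceiling, e.g. load time must stay under 2 seconds.
      if (after >= rule.max) violations.push(rule.metric);
    } else if (rule.mustNot === 'increase' && after > before + tolerance) {
      violations.push(rule.metric);
    } else if (rule.mustNot === 'decrease' && after < before - tolerance) {
      violations.push(rule.metric);
    }
  }
  return violations;
}
```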

3. Statistical Significance

The Math:

Minimum sample size = f(Effect size, Confidence, Power)

Typical settings:
- Confidence: 95% (significance level α = 0.05)
- Power: 80% (an 80% chance of detecting a real effect of the chosen size)
- Effect size: Minimum detectable change

Example:
- Baseline conversion: 10%
- Minimum detectable effect: +1 percentage point (to 11%)
- Required: ~15,000 users per variant
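The figure above can be reproduced with the standard normal-approximation formula for comparing two proportions, n = (z_α/2 + z_β)² · (p₁(1−p₁) + p₂(1−p₂)) / (p₂ − p₁)². The sketch below assumes a two-sided test and hardcodes the z values for the typical settings (1.96 for 95% confidence, 0.8416 for 80% power); `sampleSizePerVariant` is a hypothetical name.

```javascript
// z values for the typical settings (two-sided test assumed).
const Z_ALPHA = 1.96;   // 95% confidence, alpha = 0.05
const Z_BETA = 0.8416;  // 80% power

// Users needed in EACH variant to detect a change from baselineRate to targetRate.
function sampleSizePerVariant(baselineRate, targetRate) {
  const variance =
    baselineRate * (1 - baselineRate) + targetRate * (1 - targetRate);
  const delta = targetRate - baselineRate;
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / delta ** 2);
}

// Baseline 10%, minimum detectable effect +1 percentage point:
// roughly 14,750 users per variant, in line with the ~15,000 quoted above.
console.log(sampleSizePerVariant(0.10, 0.11));
```

Halving the detectable effect roughly quadruples the required sample, which is why the minimum detectable change should be fixed before the test starts.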

Common Mistakes:

❌ Stopping test early (peeking bias)
❌ Running too short (seasonal effects)
❌ Too many variants (dilutes sample)
❌ Changing test mid-flight
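Two of these mistakes (running too short, too many variants) come down to planning the duration up front. A rough sketch, assuming traffic is split evenly and that rounding up to whole weeks is enough to cover weekday/weekend cycles; `estimateDurationDays` and the `dailyEligibleUsers` parameter are hypothetical.

```javascript
// Estimate how long a test must run to reach the planned sample size.
function estimateDurationDays(usersPerVariant, variantCount, dailyEligibleUsers) {
  const totalUsers = usersPerVariant * variantCount;
  const rawDays = Math.ceil(totalUsers / dailyEligibleUsers);
  // Round up to whole weeks so every day of the week is represented equally.
  return Math.ceil(rawDays / 7) * 7;
}

console.log(estimateDurationDays(15000, 2, 3000)); // 10 raw days → 14
console.log(estimateDurationDays(15000, 3, 3000)); // 15 raw days → 21
```

Adding a third variant in this example pushes the test from two weeks to three, which illustrates how extra variants dilute the sample.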

4. Feature Flag Architecture

Implementation:

javascript

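// Feature flag pattern: route each user to the new or the old experience
function checkoutFlow(user) {
  if (isFeatureEnabled(user, 'new-checkout')) {
    return newCheckoutExperience();
  } else {
    return oldCheckoutExperience();
  }
}

// Gradual rollout: enable the feature for a stable percentage of users.
// Assumes getFeatureRollout returns 0-100 and hashUserId returns a
// non-negative integer that is stable for a given input.
function isFeatureEnabled(user, feature) {
  const rolloutPercent = getFeatureRollout(feature);
  // Salt with the feature name so each feature buckets users independently;
  // hashing the user id alone would put the same users first in every rollout.
  const userBucket = hashUserId(`${feature}:${user.id}`) % 100;
  return userBucket < rolloutPercent;
}

// Experiment assignment: consistentHash is assumed to map (user, experiment)
// to a variant name deterministically, so a user always sees the same variant.
function assignExperiment(user, experiment) {
  const variant = consistentHash(user.id, experiment);
  // Log the exposure so results can be analyzed per variant
  track('experiment_assigned', {
    userId: user.id,
    experiment: experiment,
    variant: variant
  });
  return variant;
}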
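The hashing helpers in the code above are left undefined. One possible sketch, using only built-in JavaScript: `hash32`, `bucketFor`, and `variantFor` are hypothetical names, and the hash (an FNV-1a style loop over UTF-16 code units followed by the MurmurHash3 32-bit finalizer) is an assumed choice, not something the source prescribes. A production system might prefer a vetted hashing library.

```javascript
// Deterministic 32-bit string hash: FNV-1a style mixing, then the
// MurmurHash3 finalizer so that low bits depend on the whole input.
function hash32(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // force an unsigned result
}

// Bucket 0-99 for percentage rollouts; the salt is the feature name.
function bucketFor(userId, salt) {
  return hash32(`${salt}:${userId}`) % 100;
}

// Stable variant assignment for an experiment.
function variantFor(userId, experiment, variants = ['control', 'treatment']) {
  return variants[hash32(`${experiment}:${userId}`) % variants.length];
}

// The same user always lands in the same bucket and variant.
console.log(bucketFor('user-42', 'new-checkout') === bucketFor('user-42', 'new-checkout')); // true
console.log(variantFor('user-42', 'social-proof'));
```

Because the result depends only on the user id and the salt, no assignment table needs to be stored, and raising the rollout percentage only ever adds users.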

Decision Tree: Should We Experiment?

NEW FEATURE
│
├─ Affects core metrics? ──────YES──→ EXPERIMENT REQUIRED
│  NO ↓
│
├─ Risky change? ──────────────YES──→ EXPERIMENT RECOMMENDED
│  NO ↓
│
├─ Uncertain impact? ──────────YES──→ EXPERIMENT USEFUL
│  NO ↓
│
├─ Easy to A/B test? ──────────YES──→ WHY NOT EXPERIMENT?
│  NO ↓
│
└─ SHIP WITHOUT TEST ←────────────────┘
   (But still feature flag for rollback)
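The same tree reads naturally as a chain of early returns. A minimal sketch; the property names on `change` are hypothetical.

```javascript
// Walks the decision tree top to bottom; the first matching question wins.
function experimentRecommendation(change) {
  if (change.affectsCoreMetrics) return 'EXPERIMENT REQUIRED';
  if (change.isRisky) return 'EXPERIMENT RECOMMENDED';
  if (change.hasUncertainImpact) return 'EXPERIMENT USEFUL';
  if (change.isEasyToTest) return 'WHY NOT EXPERIMENT?';
  return 'SHIP WITHOUT TEST'; // still behind a feature flag for rollback
}

console.log(experimentRecommendation({ isRisky: true })); // EXPERIMENT RECOMMENDED
```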

Action Templates

Template 1: Experiment Spec

markdown

Experiment: [Name]

Hypothesis

We believe: [change]
Will cause: [metric] to [increase/decrease]
Because: [reasoning]

Variants

Control (50%)

[Current experience]

Treatment (50%)

[New experience]

Metrics

Primary Metric

What: [metric name]
Current: [baseline]
Target: [goal]
Success: [threshold]

Guardrail Metrics

Metric 1: [name] - Don't decrease
Metric 2: [name] - Don't increase
Metric 3: [name] - Maintain

Sample Size

Users needed: [X per variant]
Duration: [Y days]
Confidence: 95%
Power: 80%

Implementation

javascript

if (experiment('feature-name') === 'treatment') {
  // New experience
} else {
  // Old experience
}

Success Criteria

Decision

If positive: Ship to 100%
If negative: Rollback, iterate
If inconclusive: Extend or redesign

Template 2: Feature Flag Implementation

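typescript

// features.ts
interface FeatureConfig {
  rollout: number;      // percentage of users (0-100) who get the feature
  enabled: boolean;     // master kill switch, checked before the rollout
  description: string;
}

// Typed as a Record so that lookups by an arbitrary string key compile
export const FEATURES: Record<string, FeatureConfig> = {
  'new-checkout': {
    rollout: 10,  // 10% of users
    enabled: true,
    description: 'New streamlined checkout flow'
  },
  'ai-recommendations': {
    rollout: 0,  // Not live yet
    enabled: false,
    description: 'AI-powered product recommendations'
  }
};

// feature-flags.ts
export function isEnabled(userId: string, feature: string): boolean {
  const config = FEATURES[feature];
  if (!config || !config.enabled) return false;

  // consistentHash is assumed to return a stable non-negative integer.
  // Salting with the feature name buckets users independently per flag.
  const bucket = consistentHash(`${feature}:${userId}`) % 100;
  return bucket < config.rollout;
}

// usage in code (inside a React component)
if (isEnabled(user.id, 'new-checkout')) {
  return <NewCheckout />;
} else {
  return <OldCheckout />;
}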

Template 3: Experiment Dashboard

markdown

Experiment Dashboard

Active Experiments

Experiment 1: [Name]

Status: Running
Started: [date]
Progress: [X]% sample size reached
Primary metric: [current result]
Guardrails: ✅ All healthy

Experiment 2: [Name]

Status: Complete
Result: Treatment won (+15% conversion)
Decision: Ship to 100%
Shipped: [date]

Key Metrics

Experiment Velocity

Experiments launched: [X per month]
Win rate: [Y]%
Average duration: [Z] days

Impact

Revenue impact: +$[X]
Conversion improvement: +[Y]%
User satisfaction: +[Z] NPS

Learnings

[Key insight 1]
[Key insight 2]
[Key insight 3]

Quick Reference

🧪 Experiment Checklist

Real-World Examples

Example 1: Netflix Experimentation

Volume: 250+ experiments running at once
Approach: Everything is an experiment
Culture: "Strong opinions, weakly held - let data decide"

Example Test:

Hypothesis: Bigger thumbnails increase engagement
Result: No improvement, actually hurt browse time
Decision: Rollback
Learning: Saved $$ by not shipping

Example 2: Airbnb's Experiments

Test: New search ranking algorithm
Primary: Bookings per search
Guardrails:

Search quality (ratings of bookings)
Host earnings (don't concentrate bookings)
Guest satisfaction

Result: +3% bookings, all guardrails healthy → Ship

Example 3: Stripe's Feature Flags

Approach: Every feature behind a flag
Benefits:

Instant rollback (flip flag)
Gradual rollout (1% → 5% → 25% → 100%)
Test in production safely

Example:

javascript

if (experiments.isEnabled('instant-payouts')) {
  return <InstantPayouts />;
}
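The gradual rollout stages listed above could be driven by a small helper like the one below. This is a hypothetical sketch of the general pattern, not Stripe's actual code; `ROLLOUT_STAGES` and `nextRolloutPercent` are invented names.

```javascript
// Rollout stages from the list above: 1% → 5% → 25% → 100%.
const ROLLOUT_STAGES = [1, 5, 25, 100];

// Returns the next rollout percentage, or 0 to roll back instantly.
function nextRolloutPercent(currentPercent, guardrailsHealthy) {
  if (!guardrailsHealthy) return 0; // flip the flag off
  const next = ROLLOUT_STAGES.find((stage) => stage > currentPercent);
  return next === undefined ? 100 : next; // already fully rolled out
}

console.log(nextRolloutPercent(5, true));   // 25
console.log(nextRolloutPercent(25, false)); // 0
```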

Common Pitfalls

❌ Mistake 1: Peeking Too Early

Problem: Checking results repeatedly and stopping the test as soon as they look significant, before the planned sample size is reached
Fix: Calculate sample size upfront, wait for it

❌ Mistake 2: No Guardrails

Problem: Gaming the metric (increase clicks but hurt quality)
Fix: Always define guardrails

❌ Mistake 3: Too Many Variants

Problem: Not enough users per variant
Fix: Limit to 2-3 variants max

❌ Mistake 4: Ignoring External Factors

Problem: Holiday spike looks like treatment effect
Fix: Note external events, extend duration

Related Skills

Key Quotes

Ronny Kohavi:

"The best way to predict the future is to run an experiment."

Netflix Culture:

"Strong opinions, weakly held. Let data be the tie-breaker."

Airbnb:

"We trust our intuition to generate hypotheses, and we trust data to make decisions."

Further Learning

references/experiment-design-guide.md - Complete methodology
references/statistical-significance.md - Sample size calculations
references/feature-flags-implementation.md - Code examples
references/guardrail-metrics.md - Choosing guardrails
