exp-driven-dev
Experimentation-Driven Development
When This Skill Activates
Claude uses this skill when:
- Building new features that affect core metrics
- Implementing A/B testing infrastructure
- Making data-driven decisions
- Setting up feature flags for gradual rollouts
- Choosing which metrics to track
Core Frameworks
1. Experiment Design (Source: Ronny Kohavi, Microsoft/Netflix)
The HITS Framework:
H - Hypothesis:
"We believe that [change] will cause [metric] to [increase/decrease] because [reason]"
I - Implementation:
- Feature flag setup
- Treatment vs control
- Sample size calculation
T - Test:
- Run for statistical significance
- Monitor guardrail metrics
- Watch for unexpected effects
S - Ship or Stop:
- Ship if positive
- Stop if negative
- Iterate if inconclusive
Example:
Hypothesis:
"We believe that adding social proof ('X people bought this')
will increase conversion rate by 10%
because it reduces purchase anxiety."
Implementation:
- Control: No social proof
- Treatment: Show "X people bought"
- Sample size: 10,000 users per variant
- Duration: 2 weeks
Test:
- Primary metric: Conversion rate
- Guardrails: Cart abandonment, return rate
Ship or Stop:
- If conversion +5% or more → Ship
- If conversion -2% or less → Stop
- If inconclusive → Iterate and retest
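The Ship or Stop thresholds in the example can be written as a small decision function. This is a minimal sketch rather than a prescribed implementation: the names `relativeLift` and `decide`, the `significant` flag, and the `guardrailsHealthy` flag are assumptions added for illustration. Only the +5% and -2% thresholds come from the example.

```javascript
// Relative change of the treatment rate versus the control rate.
function relativeLift(controlRate, treatmentRate) {
  return (treatmentRate - controlRate) / controlRate;
}

// Hypothetical decision rule mirroring the example thresholds.
// Assumption: a degraded guardrail stops the experiment regardless of lift.
function decide({ lift, significant, guardrailsHealthy = true }, thresholds = {}) {
  const { shipAt = 0.05, stopAt = -0.02 } = thresholds;
  if (!guardrailsHealthy) return "stop";
  if (significant && lift >= shipAt) return "ship";
  if (significant && lift <= stopAt) return "stop";
  return "iterate"; // inconclusive: iterate and retest
}

// Illustrative numbers only: conversion moves from 4.0% to 4.3%.
const lift = relativeLift(0.040, 0.043);
console.log(decide({ lift, significant: true })); // "ship" (lift is about +7.5%)
```

Keeping the thresholds as parameters makes it easy to agree on them before the test starts, which is the point of writing the Ship or Stop rule into the spec.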
2. Metric Selection
Primary Metric:
- ONE metric you're trying to move
- Directly tied to business value
- Clear success threshold
Guardrail Metrics:
- Metrics that shouldn't degrade
- Prevent gaming the system
- Ensure quality maintained
Example:
Feature: Streamlined checkout
Primary Metric:
✅ Purchase completion rate (+10%)
Guardrail Metrics:
⚠️ Cart abandonment (don't increase)
⚠️ Return rate (don't increase)
⚠️ Support tickets (don't increase)
⚠️ Load time (stay <2s)
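The guardrail list in this example can be expressed as data and checked mechanically. The sketch below is one possible shape, not part of the original skill: `checkGuardrails`, the `mustNot` and `limit` fields, and all metric values are hypothetical. It compares point estimates only, so a real check would also need to account for statistical noise.

```javascript
// Flags each guardrail whose treatment value moved in the harmful direction.
function checkGuardrails(guardrails, control, treatment) {
  return guardrails.map(({ metric, mustNot, limit }) => {
    const before = control[metric];
    const after = treatment[metric];
    let degraded;
    if (mustNot === "increase") degraded = after > before;
    else if (mustNot === "decrease") degraded = after < before;
    else if (mustNot === "exceed") degraded = after >= limit; // absolute ceiling
    else throw new Error(`Unknown guardrail rule: ${mustNot}`);
    return { metric, before, after, degraded };
  });
}

// Guardrails from the streamlined-checkout example.
const guardrails = [
  { metric: "cartAbandonment", mustNot: "increase" },
  { metric: "returnRate", mustNot: "increase" },
  { metric: "supportTickets", mustNot: "increase" },
  { metric: "loadTimeSeconds", mustNot: "exceed", limit: 2 },
];

// Made-up measurements for illustration.
const control = { cartAbandonment: 0.68, returnRate: 0.05, supportTickets: 120, loadTimeSeconds: 1.4 };
const treatment = { cartAbandonment: 0.64, returnRate: 0.05, supportTickets: 131, loadTimeSeconds: 1.6 };

const failing = checkGuardrails(guardrails, control, treatment)
  .filter((result) => result.degraded)
  .map((result) => result.metric);
console.log(failing); // ["supportTickets"]
```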
3. Statistical Significance
The Math:
Minimum sample size = f(Effect size, Confidence, Power)
Typical settings:
- Confidence: 95% (significance level α = 0.05)
- Power: 80% (a real effect of at least the minimum detectable size is detected 80% of the time)
- Effect size: Minimum detectable change
Example:
- Baseline conversion: 10%
- Minimum detectable effect: +1 percentage point (10% → 11%)
- Required: ~15,000 users per variant
Common Mistakes:
- ❌ Stopping test early (peeking bias)
- ❌ Running too short (seasonal effects)
- ❌ Too many variants (dilutes sample)
- ❌ Changing test mid-flight
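The sample-size example above (10% baseline, 11% target, 95% confidence, 80% power) can be reproduced with the standard normal-approximation formula for comparing two proportions: n = (z_α/2 + z_β)² · (p1(1 − p1) + p2(1 − p2)) / (p2 − p1)². The source does not say which formula it used, so treat this as an assumption. The function name and the small lookup tables of z-scores are illustrative.

```javascript
// Standard normal quantiles for a few common settings (two-sided confidence, power).
const Z_TWO_SIDED = { 90: 1.6449, 95: 1.96, 99: 2.5758 };
const Z_POWER = { 80: 0.8416, 90: 1.2816 };

// Users needed per variant to detect a change from baselineRate to targetRate.
function sampleSizePerVariant(baselineRate, targetRate, confidencePct = 95, powerPct = 80) {
  const zAlpha = Z_TWO_SIDED[confidencePct];
  const zBeta = Z_POWER[powerPct];
  if (zAlpha === undefined || zBeta === undefined) {
    throw new Error("This sketch only supports the tabulated confidence and power levels");
  }
  const delta = targetRate - baselineRate;
  if (delta === 0) throw new Error("targetRate must differ from baselineRate");
  // Sum of the two binomial variances, one per variant.
  const variance = baselineRate * (1 - baselineRate) + targetRate * (1 - targetRate);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / delta ** 2);
}

// Roughly 14,700 to 14,800 per variant, in line with the ~15,000 quoted above.
console.log(sampleSizePerVariant(0.10, 0.11));
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why the effect size should be fixed before the test starts.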
4. Feature Flag Architecture
Implementation:
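```javascript
// Helpers such as getFeatureRollout, hashUserId, consistentHash, track,
// newCheckoutExperience and oldCheckoutExperience are assumed to exist elsewhere.

// Feature flag pattern: branch between the new and the old experience
function checkoutFlow(user) {
  if (isFeatureEnabled(user, 'new-checkout')) {
    return newCheckoutExperience();
  } else {
    return oldCheckoutExperience();
  }
}

// Gradual rollout: a user is included when their bucket (0-99)
// falls below the configured rollout percentage
function isFeatureEnabled(user, feature) {
  const rolloutPercent = getFeatureRollout(feature);
  const userBucket = hashUserId(user.id) % 100;
  return userBucket < rolloutPercent;
}

// Experiment assignment: pick a variant deterministically and log the exposure
function assignExperiment(user, experiment) {
  const variant = consistentHash(user.id, experiment);
  track('experiment_assigned', {
    userId: user.id,
    experiment: experiment,
    variant: variant
  });
  return variant;
}
```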
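The snippet above leaves `hashUserId` and `consistentHash` undefined. One possible way to fill them in is a small deterministic string hash. The sketch below uses an FNV-1a-style 32-bit hash over UTF-16 code units, chosen here for brevity and not specified by the source. The names `fnv1a32`, `bucketFor`, and `assignVariant` are hypothetical. Salting the hash with the feature or experiment name is an added assumption, so that different flags do not always pick the same users first.

```javascript
// FNV-1a-style 32-bit hash (operates on UTF-16 code units, not bytes).
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Maps a user to a stable bucket in [0, buckets) for a given salt.
function bucketFor(userId, salt, buckets = 100) {
  return fnv1a32(`${salt}:${userId}`) % buckets;
}

// Gradual rollout: enabled when the user's bucket is below the percentage.
function isFeatureEnabled(userId, feature, rolloutPercent) {
  return bucketFor(userId, feature) < rolloutPercent;
}

// Experiment assignment: the same user always gets the same variant.
function assignVariant(userId, experiment, variants = ["control", "treatment"]) {
  return variants[bucketFor(userId, experiment, variants.length)];
}

console.log(isFeatureEnabled("user-42", "new-checkout", 10));
console.log(assignVariant("user-42", "checkout-social-proof"));
```

Because the bucket depends only on the salt and the user ID, raising the rollout from 10% to 25% keeps every user who was already enabled.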
Decision Tree: Should We Experiment?
NEW FEATURE
│
├─ Affects core metrics? ──────YES──→ EXPERIMENT REQUIRED
│ NO ↓
│
├─ Risky change? ──────────────YES──→ EXPERIMENT RECOMMENDED
│ NO ↓
│
├─ Uncertain impact? ──────────YES──→ EXPERIMENT USEFUL
│ NO ↓
│
├─ Easy to A/B test? ─────────YES──→ WHY NOT EXPERIMENT?
│ NO ↓
│
└─ SHIP WITHOUT TEST ←────────────────┘
(But still feature flag for rollback)
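The decision tree reads top to bottom as a chain of checks, which can be sketched as a function. The function name and the field names on the input object are hypothetical. The returned strings are the outcomes from the tree.

```javascript
// Walks the decision tree in order; the first matching question decides.
function experimentRecommendation({ affectsCoreMetrics, risky, uncertainImpact, easyToTest }) {
  if (affectsCoreMetrics) return "EXPERIMENT REQUIRED";
  if (risky) return "EXPERIMENT RECOMMENDED";
  if (uncertainImpact) return "EXPERIMENT USEFUL";
  if (easyToTest) return "WHY NOT EXPERIMENT?";
  // Nothing above applies: ship, but keep a feature flag for rollback.
  return "SHIP WITHOUT TEST";
}

console.log(
  experimentRecommendation({
    affectsCoreMetrics: false,
    risky: true,
    uncertainImpact: true,
    easyToTest: false,
  })
); // "EXPERIMENT RECOMMENDED"
```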
Action Templates
Template 1: Experiment Spec
Experiment: [Name]
Hypothesis
We believe: [change]
Will cause: [metric] to [increase/decrease]
Because: [reasoning]
Variants
Control (50%)
[Current experience]
Treatment (50%)
[New experience]
Metrics
Primary Metric
- What: [metric name]
- Current: [baseline]
- Target: [goal]
- Success: [threshold]
Guardrail Metrics
- Metric 1: [name] - Don't decrease
- Metric 2: [name] - Don't increase
- Metric 3: [name] - Maintain
Sample Size
- Users needed: [X per variant]
- Duration: [Y days]
- Confidence: 95%
- Power: 80%
Implementation
```javascript
if (experiment('feature-name') === 'treatment') {
  // New experience
} else {
  // Old experience
}
```
Success Criteria
- Primary metric improved by [X]%
- No guardrail degradation
- Statistical significance reached
- No unexpected negative effects
Decision
- If positive: Ship to 100%
- If negative: Rollback, iterate
- If inconclusive: Extend or redesign
Template 2: Feature Flag Implementation
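```typescript
// features.ts
interface FeatureConfig {
  rollout: number;     // percentage of users (0-100) who get the feature
  enabled: boolean;    // master switch; false turns the feature off for everyone
  description: string;
}

export const FEATURES: Record<string, FeatureConfig> = {
  'new-checkout': {
    rollout: 10, // 10% of users
    enabled: true,
    description: 'New streamlined checkout flow'
  },
  'ai-recommendations': {
    rollout: 0, // Not live yet
    enabled: false,
    description: 'AI-powered product recommendations'
  }
};

// feature-flags.ts
import { FEATURES } from './features';

// consistentHash is assumed to be defined elsewhere and to return
// a non-negative integer that is stable for a given userId.
export function isEnabled(userId: string, feature: string): boolean {
  const config = FEATURES[feature];
  if (!config || !config.enabled) return false;
  const bucket = consistentHash(userId) % 100;
  return bucket < config.rollout;
}

// usage in code (inside a component's render function)
if (isEnabled(user.id, 'new-checkout')) {
  return <NewCheckout />;
} else {
  return <OldCheckout />;
}
```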
Template 3: Experiment Dashboard
Experiment Dashboard
Active Experiments
Experiment 1: [Name]
- Status: Running
- Started: [date]
- Progress: [X]% sample size reached
- Primary metric: [current result]
- Guardrails: ✅ All healthy
Experiment 2: [Name]
- Status: Complete
- Result: Treatment won (+15% conversion)
- Decision: Ship to 100%
- Shipped: [date]
Key Metrics
Experiment Velocity
- Experiments launched: [X per month]
- Win rate: [Y]%
- Average duration: [Z] days
Impact
- Revenue impact: +$[X]
- Conversion improvement: +[Y]%
- User satisfaction: +[Z] NPS
Learnings
- [Key insight 1]
- [Key insight 2]
- [Key insight 3]
Quick Reference
🧪 Experiment Checklist
Before Starting:
- Hypothesis written (believe → cause → because)
- Primary metric defined
- Guardrails identified
- Sample size calculated
- Feature flag implemented
- Tracking instrumented
During Experiment:
- Don't stop early on a promising interim result (wait for the planned sample size)
- Monitor guardrails daily
- Watch for unexpected effects
- Log any external factors (holidays, outages)
After Experiment:
- Statistical significance reached
- Guardrails not degraded
- Decision made (ship/stop/iterate)
- Learning documented
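The "Statistical significance reached" item can be checked with a two-proportion z-test on the conversion counts. The source does not name a specific test, so this choice is an assumption. The function names and the numbers are illustrative. The normal CDF uses the Abramowitz and Stegun polynomial approximation of the error function, which is accurate enough for a sketch.

```javascript
// Error function via the Abramowitz-Stegun 7.1.26 approximation.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Two-sided z-test for the difference between two conversion rates.
function twoProportionZTest(controlConversions, controlUsers, treatmentConversions, treatmentUsers) {
  const pControl = controlConversions / controlUsers;
  const pTreatment = treatmentConversions / treatmentUsers;
  const pooled = (controlConversions + treatmentConversions) / (controlUsers + treatmentUsers);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / controlUsers + 1 / treatmentUsers));
  const z = (pTreatment - pControl) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, pValue, significant: pValue < 0.05 };
}

// Illustrative counts: 10.0% vs 11.0% conversion with 10,000 users per variant.
console.log(twoProportionZTest(1000, 10000, 1100, 10000));
```

The p-value is only meaningful when it is read once, at the planned sample size; checking it daily and stopping at the first value under 0.05 inflates the false-positive rate.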
Real-World Examples
Example 1: Netflix Experimentation
Volume: 250+ experiments running at once
Approach: Everything is an experiment
Culture: "Strong opinions, weakly held - let data decide"
Example Test:
- Hypothesis: Bigger thumbnails increase engagement
- Result: No improvement, actually hurt browse time
- Decision: Rollback
- Learning: The test saved money by preventing a harmful change from shipping
Example 2: Airbnb's Experiments
Test: New search ranking algorithm
Primary: Bookings per search
Guardrails:
- Search quality (ratings of bookings)
- Host earnings (don't concentrate bookings)
- Guest satisfaction
Result: +3% bookings, all guardrails healthy → Ship
Example 3: Stripe's Feature Flags
Approach: Every feature behind flag
Benefits:
- Instant rollback (flip flag)
- Gradual rollout (1% → 5% → 25% → 100%)
- Test in production safely
Example:
```javascript
if (experiments.isEnabled('instant-payouts')) {
  return <InstantPayouts />;
}
```
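The gradual rollout and instant rollback benefits can be combined into one small helper that picks the next rollout percentage. This is a hypothetical sketch: the stage list comes from the bullet above, while `nextRolloutPercent` and the rule of dropping to 0% on an unhealthy guardrail are assumptions, not a description of any company's actual tooling.

```javascript
// Rollout stages from the example: 1% → 5% → 25% → 100%.
const ROLLOUT_STAGES = [1, 5, 25, 100];

// Returns the percentage to move to next.
// Assumption: any unhealthy guardrail triggers a full rollback (flag off).
function nextRolloutPercent(currentPercent, guardrailsHealthy) {
  if (!guardrailsHealthy) return 0;
  const next = ROLLOUT_STAGES.find((stage) => stage > currentPercent);
  return next === undefined ? 100 : next; // already fully rolled out
}

console.log(nextRolloutPercent(5, true));   // 25
console.log(nextRolloutPercent(25, false)); // 0
```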
Common Pitfalls
❌ Mistake 1: Peeking Too Early
Problem: Checking results repeatedly and stopping as soon as they look significant, which inflates the false-positive rate
Fix: Calculate sample size upfront, wait for it
❌ Mistake 2: No Guardrails
Problem: Gaming the metric (increase clicks but hurt quality)
Fix: Always define guardrails
❌ Mistake 3: Too Many Variants
Problem: Not enough users per variant
Fix: Limit to 2-3 variants max
❌ Mistake 4: Ignoring External Factors
Problem: Holiday spike looks like treatment effect
Fix: Note external events, extend duration
Related Skills
- metrics-frameworks - For choosing right metrics
- growth-embedded - For growth experiments
- ship-decisions - For when to ship vs test more
- strategic-build - For deciding what to test
Key Quotes
Ronny Kohavi:
"The best way to predict the future is to run an experiment."
Netflix Culture:
"Strong opinions, weakly held. Let data be the tie-breaker."
Airbnb:
"We trust our intuition to generate hypotheses, and we trust data to make decisions."
Further Learning
- references/experiment-design-guide.md - Complete methodology
- references/statistical-significance.md - Sample size calculations
- references/feature-flags-implementation.md - Code examples
- references/guardrail-metrics.md - Choosing guardrails