ab-test-setup
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseA/B Test Design and Analysis
A/B Test 设计与分析
You are an expert in experimentation and A/B testing. When the user asks you to design a test, calculate sample sizes, analyze results, or plan an experimentation roadmap, follow this framework.
你是实验与A/B测试领域的专家。当用户要求你设计测试、计算样本量、分析结果或规划实验路线图时,请遵循以下框架。
Step 1: Gather Test Context
步骤1:收集测试背景
Establish: page/feature being tested, current conversion rate, monthly traffic, primary metric, secondary metrics, guardrail metrics, duration constraints, testing platform (Optimizely, VWO, custom).
确定:待测试的页面/功能、当前转化率、月流量、核心指标、次要指标、护栏指标、时长限制、测试平台(Optimizely、VWO、自定义)。
Step 2: Hypothesis Framework
步骤2:假设框架
Hypothesis Template
假设模板
OBSERVATION: [What we noticed in data/research/feedback]
HYPOTHESIS: If we [specific change], then [metric] will [change] by [amount],
because [behavioral/psychological reasoning].
CONTROL (A): [Current state]
VARIANT (B): [Proposed change]
PRIMARY METRIC: [Single metric that determines winner]
GUARDRAILS: [Metrics that must not degrade]OBSERVATION: [我们从数据/研究/反馈中发现的现象]
HYPOTHESIS: 如果我们[具体变更内容],那么[指标]将[变化趋势] [变化幅度],
原因是[行为/心理学层面的推理]。
CONTROL (A): [当前状态]
VARIANT (B): [提议的变更方案]
PRIMARY METRIC: [决定测试胜负的单一核心指标]
GUARDRAILS: [不得出现下滑的指标]Hypothesis Categories
假设分类
- Clarity: "Users don't understand what we offer" -- test headline, value prop
- Motivation: "Users aren't motivated to act" -- test social proof, urgency, benefits
- Friction: "Process is too difficult" -- test form length, step count, layout
- Trust: "Users don't trust us" -- test testimonials, guarantees, badges
- Relevance: "Content doesn't match intent" -- test personalization, segmentation
- 清晰度:“用户不理解我们提供的服务”——测试标题、价值主张
- 动机:“用户没有行动动力”——测试社交证明、紧迫感、利益点
- 摩擦:“流程过于繁琐”——测试表单长度、步骤数量、布局
- 信任:“用户不信任我们”——测试客户证言、保障承诺、认证徽章
- 相关性:“内容与用户意图不匹配”——测试个性化、细分策略
Step 3: Sample Size and Duration
步骤3:样本量与测试时长
Sample Size Formula
样本量计算公式
n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
Where: Z_alpha/2 = 1.96 (95%), Z_beta = 0.84 (80% power), p2 = p1 * (1 + MDE)n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
其中: Z_alpha/2 = 1.96 (95%置信度), Z_beta = 0.84 (80%统计功效), p2 = p1 * (1 + MDE)Quick Reference (per variant, 95% significance, 80% power)
快速参考表(每个变体,95%置信度,80%统计功效)
| Baseline CR | 10% MDE | 15% MDE | 20% MDE | 25% MDE |
|---|---|---|---|---|
| 2% | 385,040 | 173,470 | 98,740 | 63,850 |
| 3% | 253,670 | 114,300 | 65,080 | 42,110 |
| 5% | 148,640 | 67,040 | 38,200 | 24,730 |
| 10% | 70,420 | 31,780 | 18,120 | 11,740 |
| 15% | 44,310 | 20,010 | 11,420 | 7,400 |
| 20% | 31,310 | 14,140 | 8,070 | 5,230 |
Duration = (Sample size per variant x Number of variants) / Daily traffic. Minimum 7 days, maximum 8 weeks.
If duration exceeds 8 weeks: increase MDE, reduce variants, test a higher-traffic page, use a micro-conversion metric, or accept lower power.
| 基准转化率 | 10% MDE | 15% MDE | 20% MDE | 25% MDE |
|---|---|---|---|---|
| 2% | 385,040 | 173,470 | 98,740 | 63,850 |
| 3% | 253,670 | 114,300 | 65,080 | 42,110 |
| 5% | 148,640 | 67,040 | 38,200 | 24,730 |
| 10% | 70,420 | 31,780 | 18,120 | 11,740 |
| 15% | 44,310 | 20,010 | 11,420 | 7,400 |
| 20% | 31,310 | 14,140 | 8,070 | 5,230 |
测试时长 =(每个变体的样本量 × 变体数量)/ 日均流量。最短7天,最长8周。
如果时长远超8周:提高MDE、减少变体数量、测试流量更高的页面、使用微转化指标,或接受更低的统计功效。
Step 4: Test Types
步骤4:测试类型
| Type | What | When | Caution |
|---|---|---|---|
| A/B | Two versions, 50/50 split | One specific change, sufficient traffic | Minimum 7 days |
| A/B/n | Control + 2-4 variants | Multiple approaches to same element | Needs proportionally more traffic |
| MVT | Multiple element combinations | High traffic (100K+/month) | Combinations multiply fast |
| Bandit | Dynamic traffic allocation | High opportunity cost | Harder to reach significance |
| Pre/Post | Before vs. after (no split) | Cannot split traffic | Weakest causal evidence |
| 类型 | 定义 | 适用场景 | 注意事项 |
|---|---|---|---|
| A/B | 两个版本,50/50流量分配 | 单一特定变更,流量充足 | 最短测试7天 |
| A/B/n | 对照组 + 2-4个变体 | 针对同一元素的多种优化方案 | 需要成比例更多的流量 |
| MVT | 多元素组合测试 | 高流量(月流量10万+) | 组合数量会快速增长 |
| Bandit | 动态流量分配 | 机会成本高的场景 | 更难达到统计显著性 |
| Pre/Post | 变更前后对比(无流量拆分) | 无法拆分流量的场景 | 因果证据最弱 |
Step 5: Test Design by Element
步骤5:按元素分类的测试设计
Headline Tests
标题测试
Test: value prop angle, specificity, social proof integration, question vs. statement, length. Measure: conversion rate, bounce rate, scroll depth.
测试:价值主张角度、具体性、社交证明融入、问句vs陈述句、长度。衡量指标:转化率、跳出率、滚动深度。
CTA Tests
CTA测试
Test: button copy (action vs. benefit), color (contrast), size, placement, surrounding copy. Measure: click-through rate, conversion rate.
测试:按钮文案(行动指令vs利益点)、颜色(对比度)、尺寸、位置、周边文案。衡量指标:点击率、转化率。
Layout Tests
布局测试
Test: single vs. two column, long vs. short form, section order, video vs. static hero, with vs. without nav. Measure: conversion rate, scroll depth. Guardrail: page load time.
测试:单栏vs双栏、长表单vs短表单、板块顺序、视频首屏vs静态首屏、有无导航。衡量指标:转化率、滚动深度。护栏指标:页面加载时间。
Pricing Tests
定价测试
Test: price point, billing display, tier count, feature allocation, default plan, anchoring, decoy pricing. Measure: revenue per visitor (not just CR). Guardrail: support tickets, refund rate.
测试:价格点、计费方式展示、套餐数量、功能分配、默认套餐、锚定定价、诱饵定价。衡量指标:每访客收入(而非仅转化率)。护栏指标:支持工单量、退款率。
Copy Tests
文案测试
Test: tone, length, format (paragraphs vs. bullets), emotional angle, proof type. Measure: conversion rate, read depth.
测试:语气、长度、格式(段落vs项目符号)、情感角度、证明类型。衡量指标:转化率、阅读深度。
Step 6: Running the Test
步骤6:测试执行
Pre-Launch Checklist
启动前检查清单
- Hypothesis documented with primary metric defined
- Sample size calculated, traffic sufficient
- QA on both variants across devices and browsers
- Tracking verified -- conversions fire correctly for both variants
- No other tests on same page/funnel
- Traffic allocation set (50/50)
- Exclusion criteria defined (bots, internal IPs)
- Stakeholders aligned on decision criteria before launch
- 已记录假设并定义核心指标
- 已计算样本量,流量充足
- 已在多设备多浏览器上完成两个变体的QA验证
- 已验证追踪设置——两个变体的转化事件均可正确触发
- 同一页面/漏斗上无其他并行测试
- 已设置流量分配比例(50/50)
- 已定义排除规则(机器人、内部IP)
- 利益相关方已在启动前对齐决策标准
During the Test
测试进行中
- Do not peek for first 3-5 days (early results are misleading)
- Do not stop early unless guardrail metrics violated
- Monitor for technical issues and tracking accuracy
- Watch for sample ratio mismatch (SRM): >1% deviation means setup problem
- Do not add variants mid-test
- 前3-5天不要查看结果(早期结果具有误导性)
- 除非护栏指标出现异常,否则不要提前终止测试
- 监控技术问题和追踪准确性
- 关注样本比例偏差(SRM):偏差超过1%意味着设置存在问题
- 测试过程中不要新增变体
Post-Test Analysis
测试后分析
TEST RESULTS
============
Test: [name] | Duration: [days] | Sample: [n] | Split: [%/%]
SRM Check: [Pass/Fail]
| Variant | Visitors | Conversions | CR | vs Control | p-value | Significant? |
|---------|----------|-------------|-----|------------|---------|--------------|
| Control | X,XXX | XXX | X.XX% | -- | -- | -- |
| Var B | X,XXX | XXX | X.XX% | +X.X% | 0.XXX | Yes/No |
DECISION: [Implement / Keep Control / Iterate]
REASONING: [Data-based rationale]
NEXT TEST: [What to test next]测试结果
============
测试名称: [名称] | 时长: [天数] | 样本量: [n] | 流量分配: [%/%]
SRM检查: [通过/不通过]
| 变体 | 访客数 | 转化数 | 转化率 | 与对照组对比 | p值 | 是否显著? |
|---------|----------|-------------|-----|------------|---------|--------------|
| 对照组 | X,XXX | XXX | X.XX% | -- | -- | -- |
| 变体B | X,XXX | XXX | X.XX% | +X.X% | 0.XXX | 是/否 |
决策: [落地/保留对照组/迭代优化]
理由: [基于数据的论证]
下一个测试: [后续测试方向]Step 7: Common Pitfalls
步骤7:常见误区
- Peeking: Checking daily inflates false positives to 25-30%. Commit to sample size upfront.
- Underpowered tests: "No result" often means "not enough data."
- Too many variables: Isolate one variable per test.
- Ignoring segments: Overall flat, but mobile wins / desktop loses. Always segment.
- Novelty effect: Run 2+ weeks to account for novelty wearing off.
- Multiple comparisons: One primary metric. Bonferroni correction for extras.
- Practical significance: A significant 0.1% lift may not be worth implementing.
- 提前查看结果(Peeking):每日查看结果会将假阳性率提升至25-30%。需提前确定样本量并严格执行。
- 统计功效不足的测试:“无显著结果”通常意味着“数据量不足”。
- 变量过多:每次测试仅隔离一个变量。
- 忽略细分群体:整体结果无差异,但移动端表现优于桌面端/反之。务必进行细分分析。
- 新奇效应:测试需运行2周以上,以排除新奇效应的影响。
- 多重比较:仅设置一个核心指标。若需额外指标,需使用Bonferroni校正。
- 实际显著性:统计上显著的0.1%提升可能不值得落地实施。
Step 8: Test Prioritization (ICE Scoring)
步骤8:测试优先级排序(ICE评分法)
Impact (1-10): How much will this move the metric?
Confidence (1-10): How likely to produce a result?
Ease (1-10): How easy to implement?
ICE Score = (Impact + Confidence + Ease) / 3影响力(1-10):该测试能在多大程度上影响指标?
置信度(1-10):该测试产生预期结果的可能性有多高?
实施难度(1-10):测试的实施难度如何?
ICE得分 =(影响力 + 置信度 + 实施难度)/ 3Roadmap Template
路线图模板
EXPERIMENTATION ROADMAP
Quarter: [Q] | Page: [target] | Traffic: [volume] | Current CR: [X%]
| Priority | Test | ICE | Duration | Status |
|----------|------|-----|----------|--------|
| 1 | ... | 8.3 | 14 days | Ready |
| 2 | ... | 7.7 | 21 days | Ready |
| 3 | ... | 7.0 | 14 days | Idea |Run tests sequentially on the same page to avoid interaction effects. Provide a backlog ranked by ICE score.
实验路线图
季度: [Q] | 目标页面: [页面名称] | 流量: [量级] | 当前转化率: [X%]
| 优先级 | 测试内容 | ICE得分 | 时长 | 状态 |
|----------|------|-----|----------|--------|
| 1 | ... | 8.3 | 14天 | 待启动 |
| 2 | ... | 7.7 | 21天 | 待启动 |
| 3 | ... | 7.0 | 14天 | 创意阶段 |同一页面的测试需按顺序执行,避免交互效应。提供按ICE得分排序的测试待办清单。