mkt-ab-testing

Marketing A/B Testing

Framework

IRON LAW: One Variable at a Time

If you change the headline AND the image AND the CTA simultaneously,
you cannot know which change caused the result. Test ONE variable per
experiment. If you need to test multiple changes, use sequential tests
or multivariate testing (MVT) with sufficient traffic.
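
As a rough illustration of why MVT needs "sufficient traffic": in a full-factorial design every combination of variants becomes its own cell, so the required traffic grows with the product of the variant counts. The function names and the per-cell sample size below are illustrative assumptions, not part of this skill.

```python
from math import prod


def mvt_cells(variants_per_element: dict[str, int]) -> int:
    """Number of cells in a full-factorial multivariate test."""
    return prod(variants_per_element.values())


def mvt_total_sample(variants_per_element: dict[str, int], n_per_cell: int) -> int:
    """Total visitors needed if every cell needs n_per_cell visitors."""
    return mvt_cells(variants_per_element) * n_per_cell


# Hypothetical example: 2 headlines x 2 images x 2 CTAs,
# with an assumed 5,000 visitors required per cell.
elements = {"headline": 2, "image": 2, "cta": 2}
print(mvt_cells(elements))               # 8
print(mvt_total_sample(elements, 5000))  # 40000
```

A simple A/B test on one variable has two cells, so the same per-cell requirement would need a quarter of the traffic in this example.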

What to Test (by Impact)

| Element | Expected Lift | Traffic Needed | Priority |
|---|---|---|---|
| Offer/Pricing | 10-50% | Medium | Highest |
| Headline/Subject line | 5-30% | Low | High |
| CTA (text, color, placement) | 5-20% | Low | High |
| Page layout | 5-15% | Medium | Medium |
| Image/Video | 3-15% | Medium | Medium |
| Form fields | 5-25% (reduction = higher CVR) | Low | Medium |
| Social proof placement | 3-10% | Medium | Lower |

Test Design

  1. Hypothesis: "Changing [variable] from [A] to [B] will increase [metric] by [X%] because [reasoning]"
  2. Primary metric: ONE metric that determines winner (conversion rate, revenue per visitor, signup rate)
  3. Guardrail metrics: Metrics that must NOT degrade (bounce rate, page load time, revenue per user)
  4. Traffic split: 50/50 between control and variant (standard)
  5. Sample size: Calculate before starting (see stat-ab-testing for formula)
  6. Duration: Minimum 1-2 full business weeks (capture day-of-week effects)
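
Steps 5 and 6 can be sketched as follows. This uses the standard normal-approximation sample-size formula for comparing two proportions, which may differ in detail from the formula in stat-ab-testing. The function names, the 80% power default, and the traffic figure are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # expected variant conversion rate
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def duration_weeks(n_per_variant: int, daily_visitors: int, min_weeks: int = 1) -> int:
    """Whole weeks needed for a 50/50 split, never fewer than min_weeks."""
    days = ceil(2 * n_per_variant / daily_visitors)
    return max(min_weeks, ceil(days / 7))


# Hypothetical inputs: 5% baseline conversion, 20% relative lift, 2,000 visitors/day.
n = sample_size_per_variant(baseline=0.05, relative_lift=0.20)
print(n, duration_weeks(n, daily_visitors=2000))
```

Rounding the duration up to whole weeks keeps the day-of-week mix balanced between the first and last days of the test.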

Common Marketing Tests

| Test | Control (A) | Variant (B) | Metric |
|---|---|---|---|
| Email subject | "Your weekly update" | "3 trends you missed this week" | Open rate |
| Landing page CTA | "Sign Up" | "Start Free Trial" | Click rate |
| Pricing page | Show 3 plans | Show 2 plans + "most popular" badge | Conversion rate |
| Ad creative | Product photo | Lifestyle photo with product | CTR → conversion |
| Form length | 8 fields | 4 fields | Form completion rate |

Analysis & Decision

| Result | Decision | Action |
|---|---|---|
| B wins, p < 0.05, meaningful lift | Ship B | Deploy variant, start next test |
| B wins, p < 0.05, tiny lift (<1%) | Don't ship | Lift not worth the change risk |
| No significant difference | Keep A | A is the known quantity; test something else |
| B wins on primary but loses on guardrail | Investigate | May need to redesign variant |
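
The decision rules above could be encoded roughly as below, paired with a pooled two-proportion z-test. This is a sketch, not the logic of scripts/ab_test.py. It assumes the "<1%" threshold means relative lift, that a significant loss for B is treated the same as "Keep A", and that guardrails are checked before lift size. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no conversions (or all conversions) in both arms
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value: float, relative_lift: float, guardrail_ok: bool,
           alpha: float = 0.05, min_lift: float = 0.01) -> str:
    """Map a test result onto the decision table."""
    if p_value >= alpha or relative_lift <= 0:
        return "Keep A"       # not significant, or B did not win (assumption)
    if not guardrail_ok:
        return "Investigate"  # primary metric won but a guardrail degraded
    if relative_lift < min_lift:
        return "Don't ship"   # significant, but too small to justify the change
    return "Ship B"


# Hypothetical data: 500/10,000 conversions for A, 600/10,000 for B.
p = two_proportion_p_value(500, 10_000, 600, 10_000)
lift = (600 / 10_000) / (500 / 10_000) - 1
print(decide(p, lift, guardrail_ok=True))  # Ship B
```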

Output Format

A/B Test Plan: {Test Name}

Hypothesis

Changing {variable} from {A} to {B} will increase {metric} by {X%} because {reasoning}.

Design

  • Primary metric: {metric}
  • Guardrail: {metric(s)}
  • Split: 50/50
  • Sample size: {N per variant}
  • Duration: {days/weeks}

Results

| Metric | Control | Variant | Diff | CI (95%) | Significant? |
|---|---|---|---|---|---|
| {primary} | {value} | {value} | {±%} | [{lower}, {upper}] | Y/N |

Decision

{Ship / Don't ship / Extend} — {rationale}
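
One way to fill the Results row of this template is sketched below. It assumes Diff is the absolute difference in conversion rate (percentage points) and uses an unpooled Wald interval. The template does not say which definition is intended, and `results_row` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def results_row(name: str, conv_a: int, n_a: int, conv_b: int, n_b: int,
                confidence: float = 0.95) -> str:
    """Format one Results row: rates, absolute diff, Wald CI, significance flag."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower, upper = diff - z * se, diff + z * se
    # Significant when the interval excludes zero.
    significant = "Y" if lower > 0 or upper < 0 else "N"
    return (f"| {name} | {p_a:.2%} | {p_b:.2%} | {diff:+.2%} "
            f"| [{lower:+.2%}, {upper:+.2%}] | {significant} |")


# Hypothetical data: 500/10,000 conversions for control, 600/10,000 for variant.
print(results_row("Conversion rate", 500, 10_000, 600, 10_000))
```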

Gotchas

  • Don't stop early because it "looks good": Repeatedly peeking at results and stopping the first time you see significance inflates the false positive rate well above the nominal 5%, to 30% or more with frequent peeking. Run to the planned sample size.
  • Day-of-week effects: Monday visitors behave differently from Saturday visitors. Always run tests for at least 1-2 complete weeks.
  • Novelty effect: A new design may get a temporary lift from curiosity. Wait 2+ weeks to see if the effect sustains.
  • Winner's curse: The estimated lift from a test is often larger than the true lift due to statistical noise. Expect the actual impact after deployment to be smaller.
  • Don't test everything — test what matters: Running 20 small tests on button colors while ignoring the pricing page is misallocating effort. Test high-impact elements first.
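
The early-stopping gotcha can be checked with a small A/A simulation, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. This is a sketch with assumed parameters (10 looks, 100 visitors per arm per look, 10% conversion rate). The exact inflation depends on how often you peek.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_sims: int = 500, looks: int = 10, batch: int = 100,
                     rate: float = 0.10, alpha: float = 0.05, seed: int = 0):
    """Return (false positive rate with peeking, false positive rate at planned end)."""
    rng = random.Random(seed)
    stopped_early = 0  # tests significant at ANY look
    fixed_horizon = 0  # tests significant at the final look only
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        hit_any = False
        p = 1.0
        for _ in range(looks):
            # Add one batch of visitors to each arm, then "peek".
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = p_value(conv_a, n, conv_b, n)
            hit_any = hit_any or p < alpha
        stopped_early += hit_any
        fixed_horizon += p < alpha
    return stopped_early / n_sims, fixed_horizon / n_sims


peeking_rate, planned_rate = simulate_peeking()
print(f"peeking: {peeking_rate:.1%}, planned sample size: {planned_rate:.1%}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the fixed-horizon rate. The gap between the two is the cost of stopping early.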

Scripts

| Script | Description | Usage |
|---|---|---|
| `scripts/ab_test.py` | Two-proportion z-test with effect size and sample-size planning | `python scripts/ab_test.py --help` |

Run `python scripts/ab_test.py --verify` to execute built-in sanity tests.

References

  • For statistical methodology (sample size, p-values), see the stat-ab-testing skill
  • For multivariate testing design, see references/mvt-design.md