# A/B Test Analysis
Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.
## Context
You are analyzing A/B test results for $ARGUMENTS.
If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.
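When a data file is available, a minimal stdlib-only loader can aggregate per-variant counts before any statistics are run. This is a sketch, not a fixed interface: the column names `variant` and `converted` are assumptions about the export format, so adjust them to match the actual file.

```python
import csv
from collections import Counter

def load_results(path):
    """Aggregate user and conversion counts per variant from a CSV export.

    Assumes columns named 'variant' and 'converted' (1/0) -- these names
    are illustrative; adapt them to the real export's schema.
    """
    users, conversions = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            users[row["variant"]] += 1           # one row per user
            conversions[row["variant"]] += int(row["converted"])
    return users, conversions
```

The resulting per-variant counts feed directly into the significance calculations in the instructions below.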
## Instructions
1. **Understand the experiment:**
   - What was the hypothesis?
   - What was changed (the variant)?
   - What is the primary metric? Any guardrail metrics?
   - How long did the test run?
   - What is the traffic split?
2. **Validate the test setup:**
   - **Sample size**: Is the sample large enough for the expected effect size?
     - Use the formula: n = 2 × (Z_α/2 + Z_β)² × p(1-p) / MDE² per group
     - Flag if the test is underpowered (<80% power)
   - **Duration**: Did the test run for at least 1-2 full business cycles?
   - **Randomization**: Any evidence of sample ratio mismatch (SRM)?
   - **Novelty/primacy effects**: Was there enough time to wash out initial behavior changes?
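The power calculation and the SRM check can both be sketched with the standard library alone. The function names here are illustrative, and the sample-size formula includes the power term Z_β alongside the significance term Z_α/2, since the check above targets 80% power.

```python
import math
from statistics import NormalDist

def required_sample_size(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-arm sample size to detect an absolute lift of `mde_abs`
    over a baseline conversion rate `p_baseline` (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # power term
    n = 2 * (z_alpha + z_beta) ** 2 * p_baseline * (1 - p_baseline) / mde_abs ** 2
    return math.ceil(n)

def srm_p_value(n_control, n_variant, expected_ratio=0.5):
    """Chi-squared (1 dof) test for sample ratio mismatch.
    A p-value below ~0.001 suggests broken randomization."""
    total = n_control + n_variant
    exp_c = total * expected_ratio
    exp_v = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # For 1 dof, the chi-squared survival function is 2 * (1 - Phi(sqrt(x)))
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

# e.g. 10% baseline, aiming to detect a 2-point absolute lift
print(required_sample_size(0.10, 0.02))  # → 3532 per arm
```

For a 50/50 split, `srm_p_value(5200, 4800)` comes out far below 0.001, which would flag the test for an SRM investigation before any lift is interpreted.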
3. **Calculate statistical significance:**
   - **Conversion rate** for control and variant
   - **Relative lift**: (variant - control) / control × 100
   - **p-value**: Using a two-tailed z-test or chi-squared test
   - **Confidence interval**: 95% CI for the difference
   - **Statistical significance**: Is p < 0.05?
   - **Practical significance**: Is the lift meaningful for the business?

   If the user provides raw data, generate and run a Python script to calculate these.
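A minimal sketch of these calculations, assuming conversion counts and sample sizes for each arm. It uses the pooled standard error for the z-test p-value and the unpooled standard error for the confidence interval, a common convention; the function name is illustrative.

```python
import math
from statistics import NormalDist

def analyze_ab(conversions_c, n_c, conversions_v, n_v, alpha=0.05):
    """Two-tailed z-test for two proportions, plus relative lift and 95% CI."""
    p_c, p_v = conversions_c / n_c, conversions_v / n_v
    lift_pct = (p_v - p_c) / p_c * 100
    # Pooled proportion and SE for the hypothesis test
    p_pool = (conversions_c + conversions_v) / (n_c + n_v)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled SE for the CI on the absolute difference
    se_diff = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci_95 = (p_v - p_c - z_crit * se_diff, p_v - p_c + z_crit * se_diff)
    return {"lift_pct": lift_pct, "p_value": p_value,
            "ci_95": ci_95, "significant": p_value < alpha}
```

For example, `analyze_ab(500, 10000, 560, 10000)` shows a +12% relative lift whose p-value lands just above 0.05, which is exactly the "positive trend, not yet significant" case in the decision table below.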
4. **Check guardrail metrics:**
   - Did any guardrail metrics (revenue, engagement, page load time) degrade?
   - A winning primary metric with degraded guardrails may not be a true win
5. **Interpret results:**

   | Outcome | Recommendation |
   |---|---|
   | Significant positive lift, no guardrail issues | Ship it — roll out to 100% |
   | Significant positive lift, guardrail concerns | Investigate — understand trade-offs before shipping |
   | Not significant, positive trend | Extend the test — need more data or larger effect |
   | Not significant, flat | Stop the test — no meaningful difference detected |
   | Significant negative lift | Don't ship — revert to control, analyze why |
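The decision table can be encoded as a small helper so the recommendation is derived mechanically from the computed statistics. The threshold, argument names, and return strings are illustrative.

```python
def recommend(p_value, lift_pct, guardrails_ok, alpha=0.05):
    """Map a test outcome to a recommendation following the decision table."""
    if p_value < alpha:                      # statistically significant
        if lift_pct > 0:
            return "Ship it" if guardrails_ok else "Investigate"
        return "Don't ship"                  # significant negative lift
    # Not significant: distinguish a positive trend from a flat result
    return "Extend the test" if lift_pct > 0 else "Stop the test"
```

A judgment call the function cannot make for you: "practical significance" still needs a human check, since a statistically significant lift can be too small to matter for the business.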
6. **Provide the analysis summary:**

   ```markdown
   ## A/B Test Results: [Test Name]

   **Hypothesis**: [What we expected]
   **Duration**: [X days] | **Sample**: [N control / M variant]

   | Metric | Control | Variant | Lift | p-value | Significant? |
   |---|---|---|---|---|---|
   | [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
   | [Guardrail] | ... | ... | ... | ... | ... |

   **Recommendation**: [Ship / Extend / Stop / Investigate]
   **Reasoning**: [Why]
   **Next steps**: [What to do]
   ```
Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.