algo-risk-benford

Benford's Law Analysis

Overview

Benford's Law predicts that in naturally occurring datasets, the leading digit d (1 through 9) appears with probability P(d) = log₁₀(1 + 1/d). Digit 1 leads about 30.1% of the time, digit 9 only about 4.6%. Deviations from this distribution may indicate data fabrication or manipulation. The analysis runs in O(n) time for n records.
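As a quick check on the formula, the expected first-digit distribution can be tabulated directly. This is a minimal sketch; the names `benford_probability` and `EXPECTED` are illustrative and not part of the skill itself.

```python
import math


def benford_probability(d: int) -> float:
    """Expected probability that the leading digit equals d (1-9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be in 1..9")
    return math.log10(1 + 1 / d)


# Expected share for each leading digit; the nine values sum to 1.
EXPECTED = {d: benford_probability(d) for d in range(1, 10)}

for d, p in EXPECTED.items():
    print(f"{d}: {p:.1%}")
```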

When to Use

Trigger conditions:
  • Auditing financial data (expenses, invoices, tax returns) for manipulation
  • Screening large datasets for data integrity issues
  • Detecting fabricated or artificially rounded numbers
When NOT to use:
  • For assigned/sequential numbers (zip codes, phone numbers, IDs)
  • For datasets with constrained ranges (e.g., human ages, percentages)
  • For small datasets (< 500 records — insufficient statistical power)

Algorithm

IRON LAW: Benford's Law Applies to NATURALLY OCCURRING Data Spanning Orders of Magnitude
Data that doesn't span multiple orders of magnitude (e.g., temperatures
in Celsius, human heights) will NOT follow Benford's Law. Deviation from
Benford's in such data is EXPECTED, not suspicious. Always verify the
data type is appropriate before concluding fraud.

Phase 1: Input Validation

Extract leading digits from the dataset. Filter: take the absolute value of negatives, then remove zeros and values below 10. Verify that the dataset spans multiple orders of magnitude. Gate: 500+ records, and the data spans at least 2 orders of magnitude.
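The filter and gate could be sketched as follows. The function names are hypothetical, and measuring the span as log₁₀(max/min) is an assumption; the text only requires "at least 2 orders of magnitude" without fixing a formula.

```python
import math

MIN_RECORDS = 500             # gate from Phase 1
MIN_ORDERS_OF_MAGNITUDE = 2   # gate from Phase 1


def prepare_values(raw_values):
    """Take absolute values, then drop zeros, non-finite values, and anything below 10."""
    cleaned = []
    for v in raw_values:
        v = abs(float(v))
        if math.isfinite(v) and v >= 10:
            cleaned.append(v)
    return cleaned


def orders_of_magnitude(values):
    """Span of the data, measured here (an assumption) as log10(max / min)."""
    return math.log10(max(values) / min(values))


def passes_gate(values):
    """True when the cleaned data is large enough and wide enough to analyze."""
    if len(values) < MIN_RECORDS:
        return False
    return orders_of_magnitude(values) >= MIN_ORDERS_OF_MAGNITUDE
```

Running `passes_gate(prepare_values(amounts))` before any test keeps constrained-range data from being misread as suspicious.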

Phase 2: Core Algorithm

  1. Extract first digit of each number
  2. Count frequency of each digit (1-9)
  3. Compare observed frequencies against Benford's expected: P(d) = log₁₀(1 + 1/d)
  4. Statistical tests: chi-squared test, MAD (Mean Absolute Deviation), KS test
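Steps 1 through 4 can be sketched with the standard library alone. Assumptions in this sketch: MAD is computed on proportions (so the Phase 3 thresholds apply directly), the chi-squared statistic is returned without a p-value, and the KS test is omitted. All function names are illustrative.

```python
import math
from collections import Counter

EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value) -> int:
    """First significant digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(value):.15e}"[0])


def digit_frequencies(values):
    """Observed proportion of each leading digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


def chi_squared_statistic(observed, n):
    """Pearson chi-squared statistic against Benford's expected counts (df = 8)."""
    return sum(
        n * (observed[d] - EXPECTED[d]) ** 2 / EXPECTED[d] for d in range(1, 10)
    )


def mean_absolute_deviation(observed):
    """Mean absolute difference between observed and expected proportions."""
    return sum(abs(observed[d] - EXPECTED[d]) for d in range(1, 10)) / 9


# Usage: powers of 2 are a well-known sequence that tracks Benford's Law closely.
sample = [2 ** k for k in range(1000)]
freqs = digit_frequencies(sample)
print(round(mean_absolute_deviation(freqs), 4))
print(round(chi_squared_statistic(freqs, len(sample)), 2))
```

For a p-value, the statistic can be handed to a statistics library; with 8 degrees of freedom the 5% critical value is about 15.51.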

Phase 3: Verification

MAD thresholds: < 0.006 (close conformity), 0.006-0.012 (acceptable), 0.012-0.015 (marginal), > 0.015 (non-conforming). Flag specific digits with large deviations. Gate: MAD computed, non-conforming digits identified.
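The thresholds translate into a small classifier. In this sketch the label strings other than "marginal", the handling of exact boundary values, and the `tolerance` parameter for flagging digits are all assumptions; the text does not define how large a "large deviation" is.

```python
def classify_mad(mad: float) -> str:
    """Map a first-digit MAD (on proportions, not percentages) to a conformity label."""
    if mad < 0.006:
        return "close"
    if mad < 0.012:
        return "acceptable"
    if mad <= 0.015:
        return "marginal"
    return "non-conforming"


def flag_digits(observed, expected, tolerance=0.02):
    """Digits whose observed share differs from the expected share by more than tolerance.

    tolerance is a hypothetical cutoff (2 percentage points) chosen for illustration.
    """
    return [
        d for d in sorted(expected)
        if abs(observed.get(d, 0.0) - expected[d]) > tolerance
    ]


print(classify_mad(0.013))
print(flag_digits({1: 0.251, 2: 0.180}, {1: 0.301, 2: 0.176}))
```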

Phase 4: Output

Return conformity assessment with digit-level analysis.

Output Format

```json
{
  "conformity": "marginal",
  "mad": 0.013,
  "chi_squared": {"statistic": 18.5, "p_value": 0.018, "df": 8},
  "digit_analysis": [{"digit": 1, "observed_pct": 25.1, "expected_pct": 30.1, "deviation": -5.0}],
  "metadata": {"records": 5000, "dataset": "Q4 expense reports"}
}
```
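One way to assemble a report in this shape is sketched below. `build_report` is a hypothetical helper; it assumes the MAD, conformity label, and chi-squared fields were computed upstream, and that `deviation` means observed minus expected in percentage points, which matches the sample values (25.1 - 30.1 = -5.0).

```python
import json
import math


def build_report(observed, mad, conformity, chi_squared, records, dataset):
    """Assemble a report dict from precomputed results.

    observed maps each leading digit (1-9) to its observed proportion.
    """
    digit_analysis = []
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        digit_analysis.append({
            "digit": d,
            "observed_pct": round(100 * observed[d], 1),
            "expected_pct": round(100 * expected, 1),
            # Percentage points: observed minus expected.
            "deviation": round(100 * (observed[d] - expected), 1),
        })
    return {
        "conformity": conformity,
        "mad": round(mad, 3),
        "chi_squared": chi_squared,
        "digit_analysis": digit_analysis,
        "metadata": {"records": records, "dataset": dataset},
    }


# Usage with made-up inputs: digit 1 is under-represented, the rest match Benford.
observed = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
observed[1] = 0.251
report = build_report(
    observed, 0.013, "marginal",
    {"statistic": 18.5, "p_value": 0.018, "df": 8},
    5000, "Q4 expense reports",
)
print(json.dumps(report["digit_analysis"][0]))
```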

Examples

Sample I/O

Input: 1000 invoice amounts from a company's AP ledger
Expected: Leading digits 1 through 9 should appear with frequencies near 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1%, 4.6%. MAD < 0.012 for legitimate data.

Edge Cases

| Input | Expected | Why |
|---|---|---|
| All amounts $90-$99 | Digit 9 dominates | Constrained range; Benford's Law doesn't apply |
| Round number spike (digit 1, 5) | Flag for review | May indicate round-number estimation or threshold manipulation |
| Government budget data | Typically conforms well | Large naturally-occurring financial datasets fit Benford's Law |
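The first edge case is easy to reproduce. The sketch below uses synthetic amounts between $90.00 and $99.90 (made-up data, not from any real ledger) and shows why the Phase 1 gate should reject them before any conformity verdict is drawn.

```python
import math
from collections import Counter

EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Synthetic constrained-range data: 90.0, 90.1, ..., 99.9
amounts = [90 + i / 10 for i in range(100)]

# Span measured as log10(max / min), an assumed definition of "orders of magnitude".
span = math.log10(max(amounts) / min(amounts))

counts = Counter(int(f"{a:.15e}"[0]) for a in amounts)
observed = {d: counts.get(d, 0) / len(amounts) for d in range(1, 10)}
mad = sum(abs(observed[d] - EXPECTED[d]) for d in range(1, 10)) / 9

print(f"span: {span:.3f} orders of magnitude")
print(f"share of leading digit 9: {observed[9]:.0%}")
print(f"MAD: {mad:.3f}")
```

The MAD lands far above the 0.015 non-conformity threshold, yet the span is well under 2 orders of magnitude, so the deviation is expected rather than suspicious.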

Gotchas

  • Not proof of fraud: Non-conformity is a RED FLAG, not evidence. Many legitimate processes produce non-Benford distributions. Always investigate further.
  • Second-digit test: First digit test catches gross fabrication. Second-digit analysis catches more subtle manipulation (e.g., rounding to approval thresholds).
  • Combining datasets: Mixing datasets from different processes may artificially create or destroy Benford conformity. Analyze homogeneous datasets.
  • Approval thresholds: If expenses over $5,000 require VP approval, expect a spike of amounts just below $5,000 (the $4,9xx range: leading digit 4, second digit 9). This is a behavioral pattern that second-digit analysis flags as an excess of 9s.
  • Sample size matters: Chi-squared test is sensitive to sample size. With 100K+ records, even trivial deviations become statistically significant. Use MAD as primary metric.
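For the second-digit test mentioned above, the expected distribution follows from the same logarithmic law by summing over all possible first digits: P(d₂) = Σ log₁₀(1 + 1/(10·d₁ + d₂)) for d₁ = 1..9. A minimal sketch, with illustrative function names:

```python
import math


def second_digit_probability(d2: int) -> float:
    """Expected probability that the second significant digit equals d2 (0-9)."""
    if not 0 <= d2 <= 9:
        raise ValueError("second digit must be in 0..9")
    return sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))


def second_digit(value) -> int:
    """Second significant digit of a nonzero number, e.g. 9 for 4950."""
    return int(f"{abs(value):.15e}"[2])


for d2 in range(10):
    print(f"{d2}: {second_digit_probability(d2):.2%}")
```

The second-digit distribution is much flatter than the first-digit one (roughly 12% for 0 down to 8.5% for 9), so a cluster of amounts such as $4,9xx stands out as a surplus of second-digit 9s.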

References

  • For second and third digit extensions, see
    references/higher-digit-tests.md
  • For case studies in fraud detection, see
    references/fraud-case-studies.md