Understanding Benford's Law
Most people assume that the digits 1 through 9 occur with equal likelihood as leading digits, roughly 11% each. Reality differs sharply. In authentic datasets spanning tax filings, election results, atomic weights, or river lengths, smaller digits dominate. This happens because data that span several orders of magnitude tend to be spread roughly evenly on a logarithmic scale, and on that scale the interval from 1 to 2 is much wider than the interval from 9 to 10: about 30% of each decade versus under 5%.
Benford observed this pattern across diverse domains in the 1930s, testing it on:
- Physical constants and mathematical tables
- US population figures and street addresses
- Molecular weights and death rates
- Surface areas of geographic features
The law also applies to mathematical sequences like Fibonacci numbers. Datasets that violate Benford's distribution often signal data entry errors, measurement bias, or deliberate manipulation—though deviation alone does not prove fraud.
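The Fibonacci claim is easy to check directly. The following is a minimal sketch, not a formal test: it tallies the leading digits of the first 1,000 Fibonacci numbers. The cutoff of 1,000 and the helper name `fibonacci` are arbitrary choices made for illustration.

```python
from collections import Counter

def fibonacci(n):
    """Yield the first n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    a, b = 1, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

N = 1000  # arbitrary sample size chosen for this illustration

# The leading digit of a positive integer is the first character
# of its decimal representation.
counts = Counter(int(str(f)[0]) for f in fibonacci(N))
shares = {d: counts[d] / N for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: {shares[d]:.3f}")
```

Python integers have arbitrary precision, so the very large later terms are handled exactly rather than being rounded as floats would be.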
The Benford's Law Formula
The theoretical probability that digit d (where d ranges from 1 to 9) appears as a leading digit follows this logarithmic relationship:
P(d) = log₁₀(d + 1) − log₁₀(d)
or equivalently:
P(d) = log₁₀(1 + 1/d)
where:
- P(d): Probability of digit d appearing as the leading digit
- d: The leading digit (integer from 1 to 9)
- log₁₀: Base-10 logarithm
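As a quick illustration of the formula, the sketch below evaluates P(d) for every leading digit. The function name `benford_probability` is introduced here for illustration only and does not refer to any particular library.

```python
import math

def benford_probability(d: int) -> float:
    """Expected share of numbers whose leading digit is d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Theoretical distribution for all nine leading digits.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The probabilities fall from about 30.1% for a leading 1 to about 4.6% for a leading 9. They sum to exactly 1, because the differences log₁₀(d + 1) − log₁₀(d) telescope to log₁₀(10) − log₁₀(1).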
Applying Benford's Law in Practice
Testing whether your data follows Benford's law involves three main steps:
- Count occurrences: For each number in your dataset, identify its leading digit (the first non-zero digit) and tally how many times each digit 1–9 appears.
- Calculate frequencies: Divide the count for each digit by your total sample size to get observed relative frequencies.
- Compare and visualize: Plot your observed frequencies against Benford's theoretical distribution. Significant deviations suggest either non-compliance or potential irregularities.
This tool accepts either raw numbers (up to 50) or pre-counted digit frequencies. The calculator generates comparative visualizations so you can assess goodness-of-fit at a glance.
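For readers who would rather run the three steps themselves, here is a minimal sketch using only the Python standard library. The helper names `leading_digit` and `observed_frequencies` are hypothetical, and the sample data (the first 100 powers of 2) is purely illustrative; substitute your own numbers.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Optional

def leading_digit(value) -> Optional[int]:
    """Return the first non-zero digit of a number, or None if there is none."""
    # Scanning the text form skips signs, leading zeros and the decimal point:
    # -0.00452 -> '-0.00452' -> 4
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    return None

def observed_frequencies(values: Iterable) -> Dict[int, float]:
    """Steps 1 and 2: tally leading digits, then divide by the sample size."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("no values with a non-zero leading digit")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Illustrative input only: the first 100 powers of 2.
sample = [2 ** k for k in range(1, 101)]
observed = observed_frequencies(sample)

# Step 3: put observed and theoretical frequencies side by side.
print("digit  observed  Benford")
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d:>5}  {observed[d]:>8.3f}  {expected:>7.3f}")
```

Reading the digit from the text form avoids floating-point rounding that can creep in when the leading digit is extracted with logarithms. The printed table can be fed to any plotting library for a visual comparison.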
When Data Deviates from Benford's Law
Not all datasets should follow Benford's law. Several categories naturally produce different leading-digit distributions:
- Constrained ranges: Invoice amounts limited to £5,000–£9,999 can never have a leading digit from 1 to 4
- Rounded numbers: Data rounded to nearest 10 or 100 loses logarithmic properties
- Small samples: With fewer than about 100 observations, random noise can mask the underlying pattern
- Assigned identifiers: Account numbers, ZIP codes, or sequential IDs do not follow natural distributions
- Manufactured data: Intentionally fabricated datasets often show too many mid-range digits (5, 6, 7) due to human bias toward uniform distribution
Forensic analysts use Benford's law as an initial screening tool, but deviations always warrant investigation rather than assumption of misconduct.
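Two of the categories above can be illustrated with a short sketch. Both datasets are stylized, hypothetical examples: invoice amounts covering every whole value from 5,000 to 9,999 once, and sequential IDs from 1 to 999.

```python
from collections import Counter

def leading_digit_counts(numbers):
    """Tally the first digit of each positive integer."""
    return Counter(int(str(n)[0]) for n in numbers)

# Hypothetical constrained range: every whole amount from 5,000 to 9,999.
invoices = leading_digit_counts(range(5000, 10000))

# Hypothetical assigned identifiers: sequential IDs 1 to 999.
ids = leading_digit_counts(range(1, 1000))

print("invoices:", dict(sorted(invoices.items())))
print("ids:     ", dict(sorted(ids.items())))
```

In the invoice set, the digits 1 to 4 never lead. In the ID set, every digit leads exactly 111 numbers, a uniform distribution far from the roughly 30% share that Benford's law predicts for a leading 1. Neither result indicates anything wrong with the data; both follow from how the numbers were generated.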
Key Considerations When Testing Benford's Law
Avoid common pitfalls when applying Benford's law to your data.
- Sufficient sample size matters: Datasets with fewer than roughly 50 to 100 observations may show apparent non-compliance due to random fluctuation alone. Aim for at least 100–200 numbers to obtain stable frequency estimates. Statistical tests (chi-squared, Kolmogorov–Smirnov) become more reliable with larger samples.
- Pre-filter your data appropriately: Remove negative signs, currency symbols, and leading zeros before identifying the leading digit. Exclude any numbers that were assigned rather than measured, such as account IDs or license plates. Similarly, exclude data bounded by arbitrary thresholds, because a minimum or maximum value makes some leading digits rare or impossible.
- Use statistical tests for final decisions: Visual comparison alone is insufficient for high-stakes conclusions. Perform a chi-squared goodness-of-fit test or Kolmogorov–Smirnov test to determine whether observed frequencies differ significantly from Benford's predictions. Both tests have limitations: chi-squared flags even trivial deviations in very large samples, and Kolmogorov–Smirnov is designed for continuous rather than discrete distributions. Consult a statistician for borderline cases.
- Context beats rules: Benford's law is a heuristic, not a law of nature. Many legitimate datasets deviate, including accounting records in narrow ranges, truncated measurements, and datasets from heavily regulated domains. Always investigate the source and nature of your data before concluding that non-compliance indicates fraud.
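A chi-squared goodness-of-fit check against Benford's distribution can be sketched in a few lines of standard-library Python. The function name `benford_chi_squared` and both sets of digit counts are hypothetical. The constant 15.507 is the 5% critical value of the chi-squared distribution with 8 degrees of freedom (nine digit categories minus one).

```python
import math
from typing import Dict

# Upper 5% critical value of chi-squared with 8 degrees of freedom.
CHI2_CRITICAL_5PCT_DF8 = 15.507

def benford_chi_squared(counts: Dict[int, int]) -> float:
    """Chi-squared statistic comparing leading-digit counts with Benford's law."""
    total = sum(counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Hypothetical digit counts, made up for illustration (1,000 and 900 values).
benford_like = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}
uniform_like = {d: 100 for d in range(1, 10)}

for name, counts in [("benford_like", benford_like), ("uniform_like", uniform_like)]:
    stat = benford_chi_squared(counts)
    verdict = "reject" if stat > CHI2_CRITICAL_5PCT_DF8 else "do not reject"
    print(f"{name}: chi2 = {stat:.2f} -> {verdict} Benford at the 5% level")
```

A statistic above the critical value means the deviation is statistically significant, not that the data are fraudulent. If SciPy is available, `scipy.stats.chisquare` computes the same statistic and also returns a p-value.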