What Is an A/B Test?
An A/B test, also called a split test, compares two independent groups on a single outcome; when that outcome is a yes/no rate, it is usually analyzed with a two-proportion Z-test. Did users click your redesigned button more often? Are patients who received the new drug more likely to recover? The test answers whether the gap between group rates is genuine or attributable to sampling variation.
In statistical terms, you begin with a null hypothesis: no true difference exists between the populations. The A/B test either provides evidence strong enough to reject that hypothesis or fails to do so. The strength of your evidence depends on three factors:
- The magnitude of the observed difference
- The size of each sample
- The variability within each group
Larger samples and bigger differences both increase your confidence that a real effect exists rather than a fluke result.
Understanding Statistical Significance
Statistical significance answers a precise question: if no real difference existed, how likely is it we'd observe a gap this large purely by chance? When an outcome is statistically significant, the probability of seeing such an extreme result under the null hypothesis is very low—typically below 5% (at 95% confidence).
Consider a coin you suspect is biased. You flip it 100 times and get 60 heads instead of the expected 50. Is the coin unfair, or did you simply experience normal randomness? A significance test calculates the probability of getting 60 or more heads with a fair coin. If that probability is less than 5%, you reject the fair-coin hypothesis and conclude the evidence points toward bias.
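The coin example can be checked directly with the binomial distribution. The following is a minimal sketch, not part of any calculator: the function name `binomial_upper_tail` is made up for illustration, and it computes the one-sided probability described above (60 or more heads).

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One-sided tail: chance of 60 or more heads in 100 flips of a fair coin.
p_one_sided = binomial_upper_tail(100, 60)
print(f"P(60+ heads in 100 fair flips) = {p_one_sided:.4f}")
```

The one-sided probability comes out just under 3%, so the fair-coin hypothesis is rejected at the 5% level. Note that a two-sided question (a result at least this far from 50 in either direction) doubles the probability, which pushes it slightly above 5%. The choice between a one-sided and a two-sided test should be made before looking at the data.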
The same logic applies to A/B tests. A higher conversion rate in variant B might reflect a genuinely better design, or it might be luck. Significance testing quantifies your confidence in the former explanation.
A/B Test Formulas
The A/B test uses conversion rates from both groups to calculate a Z-score, which measures how many standard errors the observed difference lies from zero. A higher absolute Z-score indicates stronger evidence against the null hypothesis.
p₁ = t₁ ÷ n₁
p₂ = t₂ ÷ n₂
p = (t₁ + t₂) ÷ (n₁ + n₂)
Z = (p₁ − p₂) ÷ √[p × (1 − p) × (1/n₁ + 1/n₂)]
- p₁: Conversion rate for group 1, calculated as positive outcomes divided by total sample size
- p₂: Conversion rate for group 2, calculated as positive outcomes divided by total sample size
- p: Pooled conversion rate across both groups combined
- Z: The test statistic indicating how many standard errors the difference spans; larger magnitude means stronger significance
- t₁, t₂: Number of positive results (conversions, successes) in each group
- n₁, n₂: Total sample size for each group
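The formulas above translate almost line for line into code. This is a minimal sketch using only the standard library; the function name `two_proportion_z` and the sample counts are hypothetical, chosen purely for illustration.

```python
from math import sqrt


def two_proportion_z(t1: int, n1: int, t2: int, n2: int) -> float:
    """Pooled two-proportion Z-score for t1/n1 versus t2/n2."""
    p1 = t1 / n1                      # conversion rate, group 1
    p2 = t2 / n2                      # conversion rate, group 2
    p = (t1 + t2) / (n1 + n2)         # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error under the null
    if se == 0:
        # Happens when no one (or everyone) converted in both groups.
        raise ValueError("Z-score is undefined when the pooled rate is 0 or 1")
    return (p1 - p2) / se


# Hypothetical data: 100 of 1,000 users converted in group 1, 130 of 1,000 in group 2.
z = two_proportion_z(100, 1000, 130, 1000)
print(f"Z = {z:.3f}")
```

The sign of Z only reflects which group is listed first: a negative value here means group 2 converted at the higher rate.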
Interpreting Your Results
After entering your data and selecting a confidence level, the calculator returns a Z-score and tells you whether the difference is statistically significant. A confidence level of 95% is standard in business and social science experiments, meaning you're willing to accept a 5% chance of a false positive (rejecting the null hypothesis when it's actually true).
Common confidence levels and their corresponding two-tailed critical Z-values:
- 90% confidence: critical Z ≈ 1.645 (5% in each tail)
- 95% confidence: critical Z ≈ 1.96 (2.5% in each tail)
- 98% confidence: critical Z ≈ 2.326 (1% in each tail)
- 99% confidence: critical Z ≈ 2.576 (0.5% in each tail)
If the absolute value of your calculated Z-score exceeds the critical value, the difference is statistically significant at that confidence level. Otherwise, you lack sufficient evidence to reject the null hypothesis.
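The critical values and the decision rule can be reproduced with Python's standard-library `statistics.NormalDist`. This is a sketch assuming a two-tailed test; the helper names `critical_z`, `is_significant`, and `two_sided_p_value` are illustrative and not taken from any calculator.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_z(confidence: float) -> float:
    """Two-tailed critical Z-value for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


def is_significant(z: float, confidence: float = 0.95) -> bool:
    """True when |z| exceeds the two-tailed critical value."""
    return abs(z) > critical_z(confidence)


def two_sided_p_value(z: float) -> float:
    """Probability of a Z-score at least this extreme under the null hypothesis."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: critical Z = {critical_z(level):.3f}")
```

Comparing the absolute Z-score with the critical value is equivalent to comparing the two-sided p-value with the significance level (1 minus the confidence level).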
Common Pitfalls and Practical Considerations
A/B tests are powerful but require careful setup to yield trustworthy results.
- Sample size matters enormously: The Z-test relies on the normal approximation, which needs enough data in each group. A common rule of thumb is at least 30 observations per group, though 50+ is more reliable; another common guideline is at least 10 conversions and 10 non-conversions per group. Tiny samples produce unreliable Z-scores and may not follow the normal distribution the test assumes. Underpowered studies frequently miss real effects.
- Unequal group sizes weaken conclusions: While the test can accommodate different sample sizes, roughly equal groups maximize your statistical power for a given total sample. If one variant reaches 1,000 users while the other has only 50, the smaller group's rate is estimated so imprecisely that your ability to detect a genuine difference is severely compromised.
- Your data must be randomly selected: If you only test your new feature on Friday nights, or measure conversions only from returning customers, your sample won't represent the full population. Non-random sampling introduces bias and invalidates the significance calculation, no matter how high your Z-score appears.
- Statistical significance is not practical significance: A 0.5% conversion rate increase can be statistically significant with millions of users but have negligible business impact. Always pair statistical testing with practical judgment about whether the finding justifies the cost and effort of deployment.
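The last point is easy to see numerically. The sketch below applies the pooled Z formula to the same observed rates at two sample sizes. All figures are hypothetical (10.0% versus 10.5% conversion), and `two_proportion_z` is an illustrative helper rather than an established API.

```python
from math import sqrt


def two_proportion_z(t1: int, n1: int, t2: int, n2: int) -> float:
    """Pooled two-proportion Z-score for t1/n1 versus t2/n2."""
    p = (t1 + t2) / (n1 + n2)                       # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))      # standard error under the null
    return (t1 / n1 - t2 / n2) / se


CRITICAL_Z_95 = 1.96  # two-tailed critical value at 95% confidence

# Same observed rates (10.0% vs 10.5%), very different sample sizes.
z_small = two_proportion_z(100, 1_000, 105, 1_000)
z_large = two_proportion_z(100_000, 1_000_000, 105_000, 1_000_000)

for label, z in (("1,000 per group", z_small), ("1,000,000 per group", z_large)):
    verdict = "significant" if abs(z) > CRITICAL_Z_95 else "not significant"
    print(f"{label}: Z = {z:.2f} ({verdict} at 95%)")
```

The identical half-point gap is indistinguishable from noise with a thousand users per group and overwhelmingly significant with a million. Whether it is worth acting on is a separate business question that the Z-score cannot answer.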