What Is an A/B Test?

An A/B test, also called a split test, compares two independent groups on a single outcome. When that outcome is a yes/no rate such as conversion or recovery, the standard analysis is the two-proportion Z-test. Did users click your redesigned button more often? Are patients who received the new drug more likely to recover? The test answers whether the gap between group rates is genuine or attributable to sampling variation.

In statistical terms, you begin with a null hypothesis: no true difference exists between the populations. The A/B test either provides evidence strong enough to reject that hypothesis or fails to do so. The strength of your evidence depends on three factors:

  • The magnitude of the observed difference
  • The size of each sample
  • The variability within each group

Larger samples and bigger differences both strengthen the evidence that an effect is real rather than a fluke.

Understanding Statistical Significance

Statistical significance answers a precise question: if no real difference existed, how likely is it we'd observe a gap this large purely by chance? When an outcome is statistically significant, the probability of seeing such an extreme result under the null hypothesis is very low—typically below 5% (at 95% confidence).

Consider a coin you suspect is biased. You flip it 100 times and get 60 heads instead of the expected 50. Is the coin unfair, or did you simply experience normal randomness? A significance test calculates the probability of getting 60 or more heads with a fair coin. If that probability is less than 5%, you reject the fair-coin hypothesis and conclude the evidence points toward bias.
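The coin example can be checked directly. The sketch below computes the exact binomial probability of 60 or more heads in 100 fair flips; the function name `prob_at_least` is an illustrative choice, not part of any calculator described here.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Exact binomial probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One-sided question: how likely are 60 or more heads from a fair coin?
p_value = prob_at_least(60, 100)
print(f"P(60+ heads | fair coin) = {p_value:.4f}")
```

The result is about 2.8%, which is below the 5% threshold. Note that this is a one-sided probability (bias toward heads only). If you also count results at least as extreme in the other direction (40 or fewer heads), the probability roughly doubles and no longer falls below 5%, so the choice between a one-sided and a two-sided question should be made before looking at the data.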

The same logic applies to A/B tests. A higher conversion rate in variant B might reflect a genuinely better design, or it might be luck. Significance testing quantifies how surprising the observed gap would be if luck were the only explanation.

A/B Test Formulas

The A/B test uses conversion rates from both groups to calculate a Z-score, which measures how many standard errors the observed difference lies from zero. A higher absolute Z-score indicates stronger evidence against the null hypothesis.

p₁ = t₁ ÷ n₁

p₂ = t₂ ÷ n₂

p = (t₁ + t₂) ÷ (n₁ + n₂)

Z = (p₁ − p₂) ÷ √[p × (1 − p) × (1/n₁ + 1/n₂)]

  • p₁ — Conversion rate for group 1, calculated as positive outcomes divided by total sample size
  • p₂ — Conversion rate for group 2, calculated as positive outcomes divided by total sample size
  • p — Pooled conversion rate across both groups combined
  • Z — The test statistic indicating how many standard errors the difference spans; larger magnitude means stronger significance
  • t₁, t₂ — Number of positive results (conversions, successes) in each group
  • n₁, n₂ — Total sample size for each group
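As a minimal sketch, the formulas above translate to a few lines of Python. The function name `two_proportion_z` and the example counts (100 of 1,000 versus 130 of 1,000) are made up for illustration.

```python
from math import sqrt


def two_proportion_z(t1: int, n1: int, t2: int, n2: int) -> float:
    """Z-score for the difference between two conversion rates (pooled variance)."""
    p1 = t1 / n1                      # conversion rate, group 1
    p2 = t2 / n2                      # conversion rate, group 2
    pooled = (t1 + t2) / (n1 + n2)    # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("Z is undefined when the pooled rate is 0 or 1")
    return (p1 - p2) / se


# Hypothetical data: 100/1000 conversions in group 1, 130/1000 in group 2.
z = two_proportion_z(100, 1000, 130, 1000)
print(f"Z = {z:.3f}")
```

With these made-up counts the Z-score comes out near −2.1: group 2 converts better, and the gap is roughly two standard errors wide.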

Interpreting Your Results

After entering your data and selecting a confidence level, the calculator returns a Z-score and tells you whether the difference is statistically significant. A confidence level of 95% is standard in business and social science experiments, meaning you're willing to accept a 5% chance of a false positive (rejecting the null hypothesis when it's actually true).

Common confidence levels and their corresponding critical Z-values for a two-sided test:

  • 90% confidence: critical Z ≈ 1.645 (10% significance level, 5% in each tail)
  • 95% confidence: critical Z ≈ 1.96 (5% significance level, 2.5% in each tail)
  • 98% confidence: critical Z ≈ 2.326 (2% significance level, 1% in each tail)
  • 99% confidence: critical Z ≈ 2.576 (1% significance level, 0.5% in each tail)

If the absolute value of your calculated Z-score exceeds the critical value, the difference is statistically significant at that confidence level. Otherwise, you lack sufficient evidence to reject the null hypothesis.
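The critical values and the decision rule can be reproduced with Python's standard library. This is a sketch assuming a two-sided test; the helper names `critical_z`, `is_significant`, and `two_sided_p_value` are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_z(confidence: float) -> float:
    """Two-sided critical Z for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return STD_NORMAL.inv_cdf(1 - alpha / 2)


def is_significant(z: float, confidence: float = 0.95) -> bool:
    """Compare |Z| with the critical value; the sign of Z does not matter."""
    return abs(z) > critical_z(confidence)


def two_sided_p_value(z: float) -> float:
    """Probability of a |Z| at least this large if the null hypothesis is true."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: critical Z = {critical_z(level):.3f}")
```

Reporting the p-value alongside the yes/no verdict is often more informative, because it shows how close the result is to the threshold.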

Common Pitfalls and Practical Considerations

A/B tests are powerful but require careful setup to yield trustworthy results.

  1. Sample size matters enormously: the normal approximation behind the Z-test needs enough data in each group. A common rule of thumb is at least 30 observations per group, with at least 5 to 10 conversions and 5 to 10 non-conversions in each. With tiny samples the Z-score may not follow the normal distribution the test assumes, and underpowered studies frequently miss real effects.
  2. Unequal group sizes weaken conclusions: the test can accommodate different sample sizes, but for a fixed total, roughly equal groups maximize your statistical power. If one variant reaches 1,000 users while the other has only 50, the rate in the smaller group is estimated so imprecisely that your ability to detect a genuine difference is severely compromised.
  3. Your data must come from random assignment: if you only test your new feature on Friday nights, or measure conversions only from returning customers, your sample won't represent the full population. Non-random sampling or assignment introduces bias and invalidates the significance calculation, no matter how high your Z-score appears.
  4. Statistical significance is not practical significance: a 0.5% conversion rate increase can be statistically significant with millions of users but have negligible business impact. Always pair statistical testing with practical judgment about whether the finding justifies the cost and effort of deployment.

Frequently Asked Questions

How many samples do I need for a reliable A/B test?

Statistical tests require sufficient sample size to detect real effects without false alarms. For A/B tests, a common rule of thumb is at least 30 observations per group so that the normal approximation holds, with 50–100 per group being more comfortable. That is only a floor for the test to be valid, not a guarantee that it can detect your effect. The exact requirement depends on the effect size you wish to detect: smaller differences require larger samples, and a small lift on a low base rate can take thousands of observations per group. Many practitioners use power analysis beforehand to determine how many observations they'll need, specifying a desired effect size and acceptable error rates.
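A power analysis of the kind mentioned above can be sketched with a standard textbook approximation for comparing two proportions with equal group sizes. The function name, the default 5% alpha and 80% power, and the example rates are assumptions chosen for illustration; dedicated power-analysis tools may use slightly different formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided two-proportion Z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # false-positive threshold
    z_beta = NormalDist().inv_cdf(power)           # controls false negatives
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical goal: detect a lift from a 10% to a 12% conversion rate.
n_needed = sample_size_per_group(0.10, 0.12)
print(f"About {n_needed} users per group")
```

For this made-up scenario the answer is in the thousands per group, far above the 30-observation floor, and halving the lift you want to detect roughly quadruples the requirement.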

What's the difference between statistical and practical significance?

An outcome can be statistically significant, meaning unlikely under random chance alone, yet have negligible real-world value. With a massive sample, even a 0.1% improvement can yield a significant Z-score. Practical significance asks a different question: whether the difference matters to your business or field. A 15% uplift in conversion rate, once confirmed as statistically significant, is usually worth acting on; a 0.01% improvement might be statistically notable but trivial in decision-making terms.
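To illustrate, the sketch below applies the pooled Z-score formula to the same tiny lift (10.0% versus 10.1%) at two sample sizes. All counts are invented for the example.

```python
from math import sqrt


def two_proportion_z(t1: int, n1: int, t2: int, n2: int) -> float:
    """Pooled two-proportion Z-score."""
    pooled = (t1 + t2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (t1 / n1 - t2 / n2) / se


# Same 0.1-percentage-point lift, very different sample sizes (hypothetical).
z_small = two_proportion_z(1_000, 10_000, 1_010, 10_000)
z_large = two_proportion_z(1_000_000, 10_000_000, 1_010_000, 10_000_000)

print(f"10 thousand per group: |Z| = {abs(z_small):.2f}")
print(f"10 million per group:  |Z| = {abs(z_large):.2f}")
```

The lift is identical in both cases; only the sample size changes. The small test is nowhere near the 1.96 threshold, while the large one clears it easily, which says nothing about whether a 0.1-point lift is worth shipping.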

Can I peek at results before my sample size is complete?

Peeking at interim results inflates the false positive rate because you're effectively running multiple tests instead of one. Each time you check, you introduce another chance to observe an extreme value by luck alone. If you plan sequential testing, use sequential analysis methods that adjust critical values, rather than stopping whenever you see significance. Proper practice dictates deciding sample size upfront and completing it before analyzing results.
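A small simulation makes the inflation visible. The sketch below runs many A/A experiments (both groups share the same true rate, so every "significant" result is a false positive) and compares stopping at the first significant peek with testing once at the end. The parameters (1,000 experiments, 10 looks, a 10% base rate, a fixed seed) are arbitrary choices for illustration.

```python
import random
from math import sqrt


def z_score(t1: int, n1: int, t2: int, n2: int) -> float:
    """Pooled two-proportion Z-score; returns 0.0 if the standard error is 0."""
    pooled = (t1 + t2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (t1 / n1 - t2 / n2) / se


def simulate(num_experiments=1000, n_per_group=1000, looks=10,
             rate=0.10, z_crit=1.96, seed=42):
    """Return (false-positive rate with peeking, rate with one final test)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    peek_hits = final_hits = 0
    for _ in range(num_experiments):
        t1 = t2 = 0
        flagged = False
        for look in range(1, looks + 1):
            # Add one batch of users to each group; both have the same true rate.
            t1 += sum(rng.random() < rate for _ in range(step))
            t2 += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = abs(z_score(t1, n, t2, n)) > z_crit
            flagged = flagged or significant
        peek_hits += flagged        # significant at any of the looks
        final_hits += significant   # significant at the last look only
    return peek_hits / num_experiments, final_hits / num_experiments


peek_rate, final_rate = simulate()
print(f"False positives with peeking: {peek_rate:.1%}")
print(f"False positives, single test: {final_rate:.1%}")
```

The single final test should land near the nominal 5%, while stopping at the first significant peek produces false positives noticeably more often, even though no real difference exists in any of the simulated experiments.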

What confidence level should I choose?

The 95% confidence level (5% alpha) is the scientific and business standard, balancing stringency with practicality. Use 90% only when the work is exploratory and a higher false positive rate is tolerable. Choose 99% when errors are costly, as in medical trials. Higher confidence demands larger sample sizes to achieve significance. Your choice reflects how much risk of a wrong conclusion you can afford; more critical decisions warrant stricter standards.

Why does sample size difference matter if the calculator accepts both?

While the formula accommodates unequal group sizes, a 100-vs-1000 split wastes statistical power. The standard error depends on 1/n₁ + 1/n₂, which is dominated by the smaller group, so the extra observations in the larger group add little precision. For a fixed total, the test is most sensitive when sizes are comparable. Aim for balanced group sizes when possible, or account for unequal sizes when planning sample collection.
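The cost of imbalance shows up in the standard error from the Z-score formula. This sketch compares a 100-vs-1000 split with a balanced split of the same 1,100 users, assuming a hypothetical pooled conversion rate of 10%.

```python
from math import sqrt


def standard_error(pooled_rate: float, n1: int, n2: int) -> float:
    """Denominator of the pooled two-proportion Z-score."""
    return sqrt(pooled_rate * (1 - pooled_rate) * (1 / n1 + 1 / n2))


# Same total of 1,100 users, split two different ways.
unbalanced = standard_error(0.10, 100, 1000)
balanced = standard_error(0.10, 550, 550)

print(f"100 vs 1000: SE = {unbalanced:.4f}")
print(f"550 vs 550:  SE = {balanced:.4f}")
print(f"Ratio: {unbalanced / balanced:.2f}")
```

A smaller standard error means the same observed difference produces a larger Z-score, so the balanced split can detect effects that the lopsided split would miss.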

What does a negative Z-score mean?

The Z-score's sign indicates which group has the higher rate. A negative Z-score means group 2 has a higher conversion rate than group 1; positive means group 1 leads. For a two-sided significance test, only the absolute value matters: a Z of −2.0 is exactly as significant as +2.0. Both indicate the difference is about two standard errors away from zero, just past the 95% threshold of 1.96.
