Understanding P-Values
A p-value is fundamentally a conditional probability: given that the null hypothesis holds, what is the chance of observing a test statistic as extreme as (or more extreme than) the one you calculated from your sample?
The p-value does not tell you the probability that the null hypothesis is true. Instead, it measures how compatible your data are with the null hypothesis: the smaller the p-value, the rarer a result at least as extreme as yours would be if the null hypothesis were true and the sampling were repeated.
The interpretation hinges on the significance level you choose (commonly α = 0.05):
- p-value < α: Reject the null hypothesis. The data provide evidence against it.
- p-value ≥ α: Fail to reject the null hypothesis. The data are consistent with it.
This framework applies uniformly across all test distributions, though the calculation method differs.
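A small simulation can make the role of α concrete. The sketch below assumes a two-tailed Z-test on samples drawn from a population where the null hypothesis really is true (mean 0, known standard deviation 1). The function names, sample size, trial count, and seed are illustrative choices, not part of the original text. Under these assumptions, the share of samples with p-value < α should land close to α itself.

```python
import random
from statistics import NormalDist, fmean


def two_tailed_z_p(sample, mu0, sigma):
    """Two-tailed Z-test p-value for a sample mean with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    lower = NormalDist().cdf(z)
    return 2 * min(lower, 1 - lower)


def false_positive_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Fraction of rejections when the null hypothesis is actually true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # Draw from N(0, 1), so the null (mu = 0) holds by construction.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_tailed_z_p(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"rejection rate under a true null: {rate:.3f}")
```

The printed rate is an estimate and varies with the seed and the number of trials.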
P-Value Calculation Formulas
The p-value depends on both your test statistic and the type of hypothesis test. Let cdf denote the cumulative distribution function of your chosen distribution.
Left-tailed test: p-value = cdf(x)
Right-tailed test: p-value = 1 − cdf(x)
Two-tailed test: p-value = 2 × min{cdf(x), 1 − cdf(x)}
where:
- x: Your test statistic (Z-score, t-score, χ², or F-value)
- cdf(x): Cumulative distribution function evaluated at x, specific to your distribution
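The three formulas above can be sketched as a single Python function. This is a minimal illustration: the name `p_value` and the `tail` argument are hypothetical, and the standard normal cdf from Python's `statistics` module stands in for whichever distribution your test uses. Any other function mapping x to cdf(x) could be passed instead.

```python
from statistics import NormalDist
from typing import Callable


def p_value(x: float, cdf: Callable[[float], float], tail: str = "two") -> float:
    """Return the p-value for test statistic x, given the null distribution's cdf."""
    lower = cdf(x)        # probability of a value <= x under the null
    upper = 1.0 - lower   # probability of a value > x under the null
    if tail == "left":
        return lower
    if tail == "right":
        return upper
    if tail == "two":
        return 2.0 * min(lower, upper)
    raise ValueError(f"unknown tail: {tail!r}")


std_normal_cdf = NormalDist().cdf  # standard normal, as used in a Z-test

z = 1.96
for tail in ("left", "right", "two"):
    print(f"{tail:>5}-tailed p-value: {p_value(z, std_normal_cdf, tail):.4f}")
```

For Z = 1.96 the two-tailed result comes out very close to 0.05, which is why 1.96 is the familiar critical value at that significance level.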
Selecting the Right Distribution
Choose your distribution based on what you know about your data and test:
- Z-test (Normal Distribution): Use when testing a population mean with known population standard deviation, or for large samples (n > 30).
- t-test (Student's t-Distribution): Use for small samples or when the population standard deviation is unknown. Specify degrees of freedom (typically n − 1 for one-sample tests).
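- Chi-squared Test: Use for goodness-of-fit tests and for testing proportions or independence in categorical data. Requires degrees of freedom: for goodness-of-fit, the number of categories minus the number of constraints (k − 1 when only the total count is fixed); for independence, (rows − 1) × (columns − 1).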
- F-test (Fisher–Snedecor Distribution): Use when comparing variances across groups or in regression analysis. Requires two degrees-of-freedom parameters: numerator and denominator.
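Degrees of freedom are the parameter most often mis-specified for the tests listed above. The helpers below are a sketch of the standard textbook rules. The function names are hypothetical, and the F-test case assumes a one-way ANOVA layout, which is only one of the settings in which an F-test arises.

```python
def t_df_one_sample(n: int) -> int:
    """One-sample t-test: n - 1."""
    return n - 1


def chi2_df_goodness_of_fit(categories: int, estimated_params: int = 0) -> int:
    """Goodness-of-fit: categories - 1, minus any parameters estimated from the data."""
    return categories - 1 - estimated_params


def chi2_df_independence(rows: int, cols: int) -> int:
    """Test of independence on a rows x cols contingency table."""
    return (rows - 1) * (cols - 1)


def f_df_one_way_anova(groups: int, total_n: int) -> tuple[int, int]:
    """One-way ANOVA (assumed layout): (numerator df, denominator df)."""
    return groups - 1, total_n - groups


print(t_df_one_sample(25))
print(chi2_df_goodness_of_fit(6))
print(chi2_df_independence(3, 4))
print(f_df_one_way_anova(3, 30))
```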
Common Pitfalls When Interpreting P-Values
Misunderstanding p-values is endemic in statistical practice. Avoid these frequent mistakes.
- P-value ≠ Probability of Null Hypothesis — A p-value is not the probability your null hypothesis is true. It's the probability of seeing your data (or more extreme) if the null were true. A small p-value is evidence against the null, not proof that an alternative is true.
- One-Tailed vs Two-Tailed Tests — Using the wrong tail direction inflates your false positive rate. A two-tailed test splits α equally between both extremes; one-tailed tests concentrate it in one direction. Choose your tail structure before analyzing, not after seeing results.
- Multiple Testing Compounds Error — Running many statistical tests without correction inflates the overall error rate. If you perform 20 independent tests at α = 0.05, you expect ~1 false positive by chance. Use corrections like Bonferroni when testing multiple hypotheses.
- P-Value < 0.05 Does Not Guarantee Replication — Statistical significance at p < 0.05 does not ensure your finding will replicate. With low statistical power or publication bias, significant results often fail to reproduce. Report effect sizes and confidence intervals alongside p-values.
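The multiple-testing arithmetic in the list above can be checked directly. The sketch below assumes independent tests in which every null hypothesis is true. The function names are illustrative, and Bonferroni is shown in its simplest form (compare each p-value against α divided by the number of tests).

```python
def expected_false_positives(m: int, alpha: float) -> float:
    """Expected number of false positives across m tests when all nulls are true."""
    return m * alpha


def familywise_error_rate(m: int, alpha: float) -> float:
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(m: int, alpha: float) -> float:
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / m


m, alpha = 20, 0.05
per_test = bonferroni_threshold(m, alpha)

print(f"expected false positives: {expected_false_positives(m, alpha):.2f}")
print(f"familywise error rate:    {familywise_error_rate(m, alpha):.3f}")
print(f"Bonferroni threshold:     {per_test:.4f}")
print(f"corrected familywise:     {familywise_error_rate(m, per_test):.3f}")
```

Without correction, the chance of at least one false positive across 20 tests is roughly 64%. With the Bonferroni threshold it drops back to just under 5%.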
Worked Example: Z-Test P-Value
Suppose a factory claims lightbulbs last 1,000 hours on average. You test 100 bulbs and find a mean lifetime of 980 hours with a known population standard deviation of 50 hours. Your null hypothesis is that μ = 1,000; your alternative is that μ ≠ 1,000 (two-tailed).
First, calculate the Z-score:
Z = (980 − 1000) ÷ (50 ÷ √100) = −20 ÷ 5 = −4
For a two-tailed test, the p-value is 2 × Φ(−4) ≈ 2 × 0.00003 ≈ 0.00006. This tiny p-value (well below 0.05) provides strong evidence to reject the factory's claim.
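The same calculation can be reproduced in a few lines of Python. This is a sketch using only the standard library; the variable names are illustrative, and `NormalDist().cdf` plays the role of Φ.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 1000.0    # mean lifetime claimed by the factory (null hypothesis)
xbar = 980.0    # observed sample mean
sigma = 50.0    # known population standard deviation
n = 100         # number of bulbs tested

# Z-score: distance of the sample mean from mu0 in standard-error units.
z = (xbar - mu0) / (sigma / sqrt(n))

# Two-tailed p-value: 2 * min{cdf(z), 1 - cdf(z)} with the standard normal cdf.
lower = NormalDist().cdf(z)
p = 2 * min(lower, 1 - lower)

print(f"Z = {z:.1f}, p-value = {p:.2e}")
```

Carrying more digits than the hand calculation gives a p-value of about 0.000063, consistent with the rounded figure of 0.00006 above.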