What is R-Squared?
The coefficient of determination, denoted R², measures the fraction of total variation in a dependent variable that a regression model explains. In simple linear regression Y ~ aX + b, it answers the question: what share of the variation in Y is accounted for by its linear relationship with X?
An R² of 0.75 means the model accounts for 75% of the variance; the remaining 25% stems from unmeasured factors, measurement error, or genuine randomness. Unlike correlation (which ranges from −1 to +1), R² for a least-squares fit with an intercept always lies between 0 and 1. In simple (bivariate) linear regression it equals the square of the Pearson correlation coefficient.
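The link between R² and the Pearson correlation can be checked numerically. The sketch below is illustrative only: the data are made up, the helper names pearson_r and r_squared_simple are hypothetical, and R² is computed as 1 − SSE/SST, the formula given in the formula section of this article.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def r_squared_simple(xs, ys):
    """R² of the least-squares line y = a*x + b, computed as 1 - SSE/SST."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # least-squares slope
    b = mean_y - a * mean_x    # least-squares intercept
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Made-up, downward-trending data (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [9.0, 7.5, 6.1, 3.9, 2.5]

r = pearson_r(xs, ys)
r2 = r_squared_simple(xs, ys)
print(f"r = {r:.4f}, r squared = {r * r:.4f}, R² = {r2:.4f}")
```

The correlation here is negative because the trend slopes downward, while R² stays between 0 and 1 and matches r squared up to floating-point rounding.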
R² is fundamental to:
- Model selection—comparing competing regression specifications
- Predictive inference—determining whether forecasts are reliable
- Publishing—journals often require R² as evidence of model fit
- Quality assurance—identifying when a fitted line is misleading
The R-Squared Formula
R² is computed from three sums of squares. First, calculate the mean of your y-values (ȳ), then fit your regression line to obtain predicted values (ŷᵢ). The formula then becomes:
R² = 1 − (SSE / SST)
or equivalently, for a least-squares fit that includes an intercept (so that SST = SSR + SSE),
R² = SSR / SST
where:
SSE = Σ(yᵢ − ŷᵢ)²
SST = Σ(yᵢ − ȳ)²
SSR = Σ(ŷᵢ − ȳ)²
- SSE: Sum of squared errors (residual sum of squares); measures divergence between observed and predicted y-values
- SST: Total sum of squares; measures total variance in the y-variable around its mean
- SSR: Sum of squares due to regression; captures variance explained by the fitted model
- ȳ: Mean of all observed y-values
- ŷᵢ: Fitted (predicted) y-value for the i-th observation
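As a minimal sketch, the three sums of squares can be computed directly from observed and fitted values. The function name sums_of_squares is hypothetical; the observed and fitted values are the ones used in this article's worked example, and the fitted values are assumed to come from a least-squares line with an intercept, which is what makes the two forms of R² agree.

```python
def sums_of_squares(y, y_hat):
    """Return (SSE, SST, SSR) for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)                                 # mean of observed values
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained variation
    return sse, sst, ssr

# Observed and fitted values from the article's worked example.
y = [1, 4, 4]
y_hat = [1.5, 3.0, 4.5]

sse, sst, ssr = sums_of_squares(y, y_hat)
r2_from_sse = 1 - sse / sst
r2_from_ssr = ssr / sst

print(sse, sst, ssr)             # 1.5 6.0 4.5
print(r2_from_sse, r2_from_ssr)  # 0.75 0.75
```

If the fitted values come from a model without an intercept, or from predictions on new data, SSE + SSR need not equal SST and the two forms can disagree; 1 − SSE/SST is the safer one to compute in that case.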
Interpreting R² Values
R² = 1.0: Perfect fit. All observations lie exactly on the regression line; prediction is error-free.
R² = 0.9 to 0.99: Excellent fit. The model explains 90–99% of variation. Rare in real-world data; suggests a near-functional relationship between the variables, though not necessarily a causal one.
R² = 0.7 to 0.9: Strong fit. Common in physics, engineering, and controlled experiments. The model is predictively useful, though unexplained variation remains.
R² = 0.5 to 0.7: Moderate fit. Typical in social sciences and observational studies. The model captures meaningful patterns but substantial noise persists.
R² = 0.3 to 0.5: Weak fit. The model has limited predictive power; consider alternative specifications or additional variables.
R² < 0.3: Poor fit. The independent variable(s) explain less than 30% of variation. Re-examine model assumptions and data quality.
Context matters: a low R² does not disqualify a model if its coefficients are statistically significant and theoretically sound.
Worked Example
Suppose you have three data points: (0, 1), (2, 4), (4, 4).
Step 1: Calculate ȳ = (1 + 4 + 4) / 3 = 3
Step 2: Fit the line Y ~ 0.75X + 1.5 using least squares regression.
Step 3: Compute predicted values:
- ŷ₁ = 0.75(0) + 1.5 = 1.5
- ŷ₂ = 0.75(2) + 1.5 = 3.0
- ŷ₃ = 0.75(4) + 1.5 = 4.5
Step 4: Calculate SST = (1 − 3)² + (4 − 3)² + (4 − 3)² = 4 + 1 + 1 = 6
Step 5: Calculate SSE = (1 − 1.5)² + (4 − 3)² + (4 − 4.5)² = 0.25 + 1 + 0.25 = 1.5
Step 6: R² = 1 − (1.5 / 6) = 1 − 0.25 = 0.75
The model explains 75% of the variance in y.
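The steps above can be reproduced in a few lines of code. This is a sketch rather than a reference implementation: the function names fit_line and r_squared are hypothetical, and the slope and intercept use the standard closed-form least-squares expressions (slope = Sxy / Sxx, intercept = ȳ − slope · x̄), which the steps above apply without stating.

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(ys, fitted):
    """R² = 1 - SSE/SST for observed values ys and fitted values."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# The three data points from the worked example.
xs = [0, 2, 4]
ys = [1, 4, 4]

slope, intercept = fit_line(xs, ys)              # Step 2
fitted = [slope * x + intercept for x in xs]     # Step 3
r2 = r_squared(ys, fitted)                       # Steps 4 to 6

print(slope, intercept)  # 0.75 1.5
print(fitted)            # [1.5, 3.0, 4.5]
print(r2)                # 0.75
```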
Common Pitfalls and Caveats
R² is a powerful diagnostic, but several limitations deserve attention.
- R² never decreases when variables are added: Adding predictors to a least-squares model mechanically raises R² (or at best leaves it unchanged), even if they have no real relationship to the outcome. Use adjusted R² or information criteria (AIC, BIC) when comparing models with different numbers of variables. Adjusted R² penalises complexity and provides a fairer comparison.
- High R² does not imply causation: A strong fit indicates predictive association, not causality. Two variables may both be driven by a hidden confounder. Always ground your interpretation in theory and experimental design, not statistics alone.
- Unweighted R² can mislead when observations differ in reliability: When observations have different precisions or are based on different sample sizes, an unweighted fit treats them all equally. If heteroscedasticity is suspected, use weighted least squares and report both weighted and unweighted R².
- Outliers and leverage can distort R²: A single extreme point can dramatically shift the regression line and inflate or deflate R². Always inspect residual plots and identify influential observations using leverage and Cook's distance diagnostics.
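To make the first caveat concrete, here is a minimal sketch of adjusted R². The article does not state the formula; the code assumes the standard definition, adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p the number of predictors excluding the intercept. The function name and the R² values for the two competing models are hypothetical.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R² for n observations and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than fitted parameters")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The article's worked example: R² = 0.75 from 3 points and 1 predictor.
adj_example = adjusted_r_squared(0.75, n=3, p=1)
print(adj_example)  # 0.5

# Hypothetical comparison: a third predictor nudges R² from 0.600 to 0.605
# on 30 observations. Plain R² goes up, adjusted R² goes down.
adj_small = adjusted_r_squared(0.600, n=30, p=2)
adj_large = adjusted_r_squared(0.605, n=30, p=3)
print(f"{adj_small:.4f} vs {adj_large:.4f}")
```

With so few points, the worked example's adjusted R² drops well below its raw R², which is a reminder that a three-point fit says little about how the model would perform on new data.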