Understanding Linear Regression and Residuals
Linear regression models the relationship between two variables as a straight line: y = a × x + b. The slope a quantifies how much the dependent variable y changes for each unit increase in the independent variable x, while the intercept b is the value of y when x equals zero.
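The slope and intercept can be estimated from data. The sketch below assumes ordinary least squares, which the text does not name explicitly, and uses a hypothetical helper `fit_line` with made-up data chosen to lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the least-squares line passes through (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


# Illustrative data only: points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(f"slope a = {a}, intercept b = {b}")  # slope a = 2.0, intercept b = 1.0
```

Because the points are perfectly linear, the fit recovers the slope and intercept exactly; with noisy data the same function returns the line that minimizes the squared vertical deviations.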
A residual is the signed vertical distance between a data point's actual value and the value the regression line predicts. It is positive when the observed value exceeds the prediction and negative when the prediction overshoots the observation. Understanding these deviations is essential because they reveal whether your model captures the true relationship or overlooks important patterns.
For example, if your fitted model predicts ŷ = 6 but you observe y = 7, the residual is +1. A single discrepancy like this seems small, but when examined across all observations, residual patterns expose model weaknesses such as non-linearity, outliers, or heteroscedasticity.
The Residual Formula
Each residual is calculated as the observed value minus the model's prediction:
e = y − ŷ
where:
- e is the residual for a single observation
- y is the observed (actual) value of the dependent variable
- ŷ is the predicted value from the fitted regression line
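As a minimal sketch of the formula e = y − ŷ, the snippet below computes one residual per observation. The function name `residuals`, the line y = 2x, and the data are hypothetical; the last point reproduces the ŷ = 6, y = 7 example.

```python
def residuals(xs, ys, a, b):
    """Return e = y - y_hat for each point, where y_hat = a * x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


# Hypothetical fitted line y = 2x and made-up observations.
a, b = 2.0, 0.0
xs = [1, 2, 3]
ys = [2.5, 3.5, 7.0]

res = residuals(xs, ys, a, b)
# Predictions are 2, 4, 6, so the residuals are 0.5, -0.5, 1.0.
print(res)  # [0.5, -0.5, 1.0]
```

The sign carries the direction of the miss: the second point lies below the line, so its residual is negative.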
Sum of Squared Residuals and Model Evaluation
To assess overall model fit, residuals are squared and summed:
SSR = Σ(e²) = Σ(y − ŷ)²
Squaring serves two purposes: it prevents positive and negative residuals from cancelling each other out when summed, and it penalizes large errors more heavily than small ones. For a given dataset, a lower sum of squared residuals (SSR) indicates a tighter fit, whereas a higher SSR signals poorer predictive performance.
This metric is fundamental in regression diagnostics and model comparison. When choosing between candidate models or assessing whether linear regression is appropriate for your data, SSR often guides the decision. Combined with the total sum of squares (SST, the squared deviations of y from its mean), it gives the coefficient of determination, R² = 1 − SSR/SST; combined with the number of observations and parameters, it gives measures such as adjusted R² and the residual standard error.
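The sketch below computes SSR and then R² from it. R² also needs the total sum of squares, written here as SST, a symbol the text above does not define; the function names and the data are illustrative, with predictions taken from the least-squares line y = 0.6x + 2.2 fitted to x = 1..5.

```python
def sum_squared_residuals(ys, preds):
    """SSR: sum of (observed - predicted) squared."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def r_squared(ys, preds):
    """R^2 = 1 - SSR / SST, where SST is the variation of y around its mean."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sum_squared_residuals(ys, preds) / sst


# Made-up observations and the predictions of y = 0.6x + 2.2 at x = 1..5.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
preds = [2.8, 3.4, 4.0, 4.6, 5.2]

ssr = sum_squared_residuals(ys, preds)
r2 = r_squared(ys, preds)
print(round(ssr, 6), round(r2, 6))  # 2.4 0.6
```

Here the line explains about 60% of the variation in y. SSR alone depends on the scale and size of the dataset, which is why a ratio such as R² is easier to compare across problems.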
Interpreting Residual Plots
A residual plot displays residuals on the vertical axis against predicted values on the horizontal axis. This visual tool immediately reveals whether your linear model is well-suited to the data.
In a well-fitted model, residuals scatter randomly around zero with no pattern; they should look like noise. If residuals form a curve, cone, or cluster, the model likely violates key assumptions. For instance, a curved pattern suggests non-linearity, while a spread that widens as predicted values increase indicates heteroscedasticity (unequal variance). Isolated points far from the zero line may be outliers deserving investigation.
Residual plots are more informative than a single SSR value because they expose where and why predictions fail. A model with moderate SSR but a clear systematic pattern in its residuals is often inferior to one with slightly higher SSR but randomly scattered deviations.
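A systematic pattern can be seen even without drawing a plot. The sketch below fits a least-squares line (an assumption, as is the helper `fit_line`) to made-up data generated from y = x² and lists the residuals in order of increasing prediction.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Illustrative curved data: y = x^2, which a straight line cannot follow.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)              # a = 6.0, b = -5.0
preds = [a * x + b for x in xs]      # increasing, since the slope is positive
res = [y - p for y, p in zip(ys, preds)]

for p, e in zip(preds, res):
    print(f"predicted {p:6.1f}   residual {e:+5.1f}")
# Residuals in order: +5, 0, -3, -4, -3, 0, +5
print("sum of residuals:", sum(res))  # 0.0
```

The residuals are positive at both ends and negative in the middle, the U-shape that signals non-linearity. They also sum to zero, which is why a plain sum hides the problem and the pattern itself has to be inspected.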
Practical Residual Analysis Tips
Avoid common pitfalls when evaluating residuals and regression models.
- Don't confuse SSR with model utility — A low sum of squared residuals does not by itself make a good model. Always inspect the residual plot and check whether the residuals look random. A slightly higher SSR with random scatter is often preferable to a lower SSR with systematic bias.
- Watch for non-linear relationships — Linear regression forces a straight-line fit regardless of the underlying relationship. If your residual plot curves upward or downward, your data likely contains a non-linear pattern. Consider transforming variables or switching to polynomial or non-parametric models.
- Outliers distort residual metrics — A single extreme observation inflates SSR disproportionately and may mislead your model evaluation. Always identify and investigate outliers. Sometimes they are data errors; other times they reveal genuine but rare phenomena that warrant separate analysis.
- Ensure you have adequate data — Residual analysis is most reliable with at least 20–30 observations. With fewer points, random noise can mimic patterns, and a single outlier has outsized influence. Collect more data when possible before drawing conclusions about model adequacy.
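The influence of a single outlier on a small sample can be sketched as follows. The example assumes a least-squares fit; the helper `fit_and_ssr` and both datasets are made up, and the second dataset differs from the first only in its last value.

```python
def fit_and_ssr(xs, ys):
    """Fit y = a * x + b by least squares; return (a, b, residuals, SSR)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    res = [y - (a * x + b) for x, y in zip(xs, ys)]
    return a, b, res, sum(e ** 2 for e in res)


xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]         # exactly y = 2x
with_outlier = [2, 4, 6, 8, 30]  # same data, last value replaced

a1, b1, res1, ssr1 = fit_and_ssr(xs, clean)
a2, b2, res2, ssr2 = fit_and_ssr(xs, with_outlier)

print(a1, ssr1)  # 2.0 0.0
print(a2, ssr2)  # 6.0 160.0
print(res2)      # [4.0, 0.0, -4.0, -8.0, 8.0]
```

One changed value triples the slope and raises SSR from 0 to 160. The outlier also drags the line toward itself, so its own residual (+8) is no larger in magnitude than that of its neighbor (−8); with only five points, the residuals alone do not make the culprit obvious.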