What is Linear Regression?
Linear regression is a foundational statistical method for modelling the relationship between an independent variable (predictor) and a dependent variable (outcome). Rather than listing individual observations, it produces a single equation—a straight line—that summarizes how one quantity changes with another.
Consider a practical scenario: a biologist measures plant heights at different nitrogen fertilizer levels. Linear regression distils 50 observations into one simple formula, revealing how height responds to nutrient addition. The result is interpretable, testable, and useful for prediction.
The method assumes a linear relationship exists. When that assumption holds, it offers clarity and computational speed compared to more complex models. It also serves as a building block for advanced techniques like multiple regression and logistic regression.
The Linear Regression Equation
The fitted regression line takes the form:
y = a × x + b
where:
- y is the dependent variable (the value you want to predict)
- x is the independent variable (the predictor or explanatory variable)
- a is the slope—the steepness and direction of the line
- b is the y-intercept—the value of y when x is zero
Once you provide data points, the calculator solves for a and b using the least-squares method, which minimizes the sum of the squared vertical distances (residuals) between the observations and the fitted line.
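The least-squares criterion can be written out explicitly. The block below is a sketch of the standard objective and the two conditions that follow from it. The index i, the function name S, and the subscripted x_i, y_i are notational assumptions and do not appear in the text above.

```latex
% Sum of squared residuals for the line y = a x + b
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2

% Setting both partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial a} = 0
  \;\Longrightarrow\;
  a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i

\frac{\partial S}{\partial b} = 0
  \;\Longrightarrow\;
  a \sum_{i=1}^{n} x_i + b\, n = \sum_{i=1}^{n} y_i
```

Solving this pair of linear equations for a and b yields closed-form expressions for the slope and intercept.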
How to Calculate Linear Regression Parameters
The least-squares estimators for slope and intercept are obtained by minimizing the sum of squared residuals with respect to a and b. In summation notation, the formulas are:
a = (n × Σ(xy) − Σ(x) × Σ(y)) ÷ (n × Σ(x²) − (Σ(x))²)
b = (Σ(y) − a × Σ(x)) ÷ n
where:
- n is the total number of data points
- Σ(x) is the sum of all x values
- Σ(y) is the sum of all y values
- Σ(xy) is the sum of each x multiplied by its corresponding y
- Σ(x²) is the sum of each x value squared
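The summation formulas above translate almost line for line into code. The following is a minimal Python sketch, not the calculator's actual implementation. The function name fit_line and the sample data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = a*x + b using the summation formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    sum_x = sum(xs)                               # Σ(x)
    sum_y = sum(ys)                               # Σ(y)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # Σ(xy)
    sum_x2 = sum(x * x for x in xs)               # Σ(x²)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        # Every x is the same, so the slope is undefined.
        raise ValueError("all x values are identical")

    a = (n * sum_xy - sum_x * sum_y) / denom      # slope
    b = (sum_y - a * sum_x) / n                   # intercept
    return a, b


# Illustrative data that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

The guard on the denominator matters in practice: if every x value is the same, the formula for a divides by zero and no unique line exists.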
Interpreting the Slope and Goodness of Fit
The slope a tells you the magnitude and direction of the relationship:
- If a > 0, y increases as x increases (positive correlation).
- If a < 0, y decreases as x increases (negative correlation).
- If a ≈ 0, y barely changes as x changes, so there is little or no linear relationship.
The slope's unit is important: it is expressed in units of y per unit of x. If x is in metres and y is in kilograms, a slope of 2.5 means that y increases by 2.5 kilograms, on average, for each additional metre.
The R² value (coefficient of determination) represents the fraction of the variance in y that is explained by x. For a least-squares line with an intercept, it ranges from 0 to 1. An R² of 0.92 means the model accounts for 92% of the variation in y; an R² of 0.30 means it accounts for only 30%, which suggests that other factors or a non-linear relationship may be at play.
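One common way to compute R² is as one minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below assumes that definition. The function name r_squared and the four data points are hypothetical, and the slope and intercept passed in are the least-squares values for that data.

```python
def r_squared(xs, ys, a, b):
    """Fraction of the variance in ys explained by the line y = a*x + b."""
    mean_y = sum(ys) / len(ys)
    # Residual sum of squares: variation left over after the fit.
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: variation of y around its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all y values are identical; R² is undefined")
    return 1 - ss_res / ss_tot


# Illustrative data; the least-squares line for it is y = 1.9x + 0.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
r2 = r_squared(xs, ys, a=1.9, b=0.0)
print(round(r2, 3))  # 0.963
```

If a and b do not come from a least-squares fit with an intercept, this quantity can fall below 0, so the 0 to 1 range should not be taken for granted for arbitrary lines.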
Common Pitfalls and Best Practices
Avoid these frequent mistakes when applying linear regression to your data.
- Don't assume linearity without inspection — Always visualize your data first. If the scatter plot shows a curved, clustered, or scattered pattern, a linear model may mislead you. Non-linear transformations or polynomial regression might be more appropriate.
- Beware of outliers — A single extreme point can distort both the slope and intercept significantly. Use robust methods or investigate outliers before fitting. Sometimes they represent measurement errors; other times they reveal genuine extreme events worth studying separately.
- Remember that correlation is not causation — A strong linear fit between two variables does not imply that one causes the other. Confounding variables or mere coincidence can produce high R² values. Domain knowledge and experimental design matter more than statistics alone.
- Ensure sufficient data span — Regression works best when x values are spread across a wide range. Clustering all observations in a narrow band reduces precision and makes extrapolation risky. Ideally, collect data across the full range of practical interest.
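The outlier warning above is easy to demonstrate numerically. The Python sketch below fits the same small data set with and without one extreme point. The helper fit_line and all data values are hypothetical, made up for illustration rather than taken from a real experiment.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept from the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b


# Five points that lie exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
clean_a, clean_b = fit_line(xs, ys)

# The same points plus one extreme observation at (6, 30).
out_a, out_b = fit_line(xs + [6], ys + [30])

print(f"without outlier: slope={clean_a:.2f}, intercept={clean_b:.2f}")
print(f"with outlier:    slope={out_a:.2f}, intercept={out_b:.2f}")
```

In this example a single added point more than doubles the slope and pulls the intercept well below zero, which is why it helps to plot the data and inspect extreme values before trusting the fit.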