Understanding Polynomial Regression
Polynomial regression is a form of statistical modelling that describes the relationship between a dependent variable and one or more independent variables using a polynomial function. Unlike simple linear regression, which assumes a straight-line relationship, polynomial regression accommodates curved, non-linear patterns in data.
The core idea rests on the assumption that your data follows a polynomial equation. For a single independent variable, this equation takes the form y = a₀ + a₁x + a₂x² + ... + aₙxⁿ, where each power of the variable contributes to the overall fit. This flexibility makes polynomial models invaluable across disciplines: engineers use them to model stress-strain curves in materials testing, economists apply them to track diminishing returns in production functions, and environmental scientists employ them to analyse pollutant concentration gradients.
The degree of your polynomial determines its complexity. A degree-1 polynomial is simply a straight line. Degree-2 produces a parabola. Degree-3 creates an S-shaped curve. Higher degrees permit increasingly complex oscillations, though with diminishing practical utility and increased risk of overfitting to noise rather than capturing true underlying patterns.
The Polynomial Regression Equation
A polynomial regression model of degree n is defined by the equation below, where y is the dependent variable, x is the independent variable, and a₀, a₁, ..., aₙ are the coefficients determined from your data:
y = a₀ + a₁x + a₂x² + a₃x³ + ... + aₙxⁿ
- y: The dependent variable (predicted value)
- x: The independent variable (input value)
- a₀, a₁, ..., aₙ: Regression coefficients computed from your dataset
- n: The degree of the polynomial (1 for linear, 2 for quadratic, 3 for cubic, etc.)
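As a small illustration of the equation, the Python sketch below evaluates a polynomial from a list of coefficients using Horner's rule. The function name `evaluate_polynomial` and the example coefficients are assumptions chosen for this example, not part of any particular calculator.

```python
def evaluate_polynomial(coefficients, x):
    """Evaluate a0 + a1*x + ... + an*x**n, with coefficients ordered a0 first."""
    result = 0.0
    # Horner's rule: start from the highest-degree coefficient and work down
    for a in reversed(coefficients):
        result = result * x + a
    return result


# Illustrative quadratic y = 1 + 2x + 3x^2 (degree n = 2)
coefficients = [1.0, 2.0, 3.0]
y_at_2 = evaluate_polynomial(coefficients, 2.0)  # 1 + 4 + 12 = 17
print(y_at_2)
```

Horner's rule avoids computing each power of x separately, which keeps the evaluation to n multiplications and n additions.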
The Least-Squares Method
Finding the best polynomial fit requires determining which coefficients minimise the overall prediction error. The least-squares method achieves this by finding coefficients that minimise the sum of squared residuals—the vertical distances between each observed data point and the polynomial curve.
Mathematically, for N data points, the method finds coefficients that minimise:
Σ(yᵢ − (a₀ + a₁xᵢ + a₂xᵢ² + ... + aₙxᵢⁿ))²
Setting the partial derivative of this sum with respect to each coefficient to zero leads to a system of n+1 linear equations (the normal equations) that can be solved simultaneously. Provided the data contain at least n+1 distinct x values, the solution is a unique set of coefficients that gives the optimal polynomial fit under the least-squares criterion. Modern calculators solve these systems numerically, but the underlying principle remains: minimise the squared errors to obtain the best-fitting curve.
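The following Python sketch shows one way the normal equations could be built and solved by hand, using Gaussian elimination with partial pivoting. It is a minimal illustration under stated assumptions: the function name `fit_polynomial`, the singularity tolerance, and the sample data are all made up for this example, and real tools may well use a different solver.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a0, ..., an] obtained from the normal equations."""
    m = degree + 1
    # Normal equations: sum_k a_k * S[j + k] = T[j] for j = 0..n,
    # where S[p] = sum(x_i**p) and T[j] = sum(y_i * x_i**j).
    power_sums = [sum(x ** p for x in xs) for p in range(2 * degree + 1)]
    rhs = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    # Augmented matrix [A | b]
    aug = [[power_sums[j + k] for k in range(m)] + [rhs[j]] for j in range(m)]

    # Forward elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # tolerance is an arbitrary choice
            raise ValueError("singular system: too few distinct x values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, m):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, m + 1):
                aug[row][k] -= factor * aug[col][k]

    # Back substitution
    coeffs = [0.0] * m
    for row in range(m - 1, -1, -1):
        tail = sum(aug[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (aug[row][m] - tail) / aug[row][row]
    return coeffs


# Points generated from y = 1 + 2x + 3x^2, so the fit should recover roughly [1, 2, 3]
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 6.0, 17.0, 34.0, 57.0]
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs)
```

For high degrees or widely spread x values the normal equations become ill-conditioned, which is why numerical libraries typically solve the same least-squares problem through QR or singular value decomposition instead.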
Linear vs. Polynomial Regression: Clarifying the Terminology
A common source of confusion: why is polynomial regression called "linear" regression when it clearly models curves?
The answer lies in mathematical terminology. Polynomial regression is linear in its coefficients—the equation is a linear combination of the unknown parameters a₀, a₁, ..., aₙ. However, because the equation contains powers of x, the relationship between the input variable and output is non-linear. You can fit parabolas, cubic functions, and complex curves, all while using the mathematical framework of linear regression.
This distinction matters because it allows statisticians to use standard linear algebra techniques, such as matrix inversion and matrix factorisations like QR or singular value decomposition, to solve polynomial problems efficiently, despite the non-linear appearance of the final fitted curve.
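To make the "linear in its coefficients" point concrete, the sketch below treats the powers of x as separate input columns and hands them to an ordinary linear least-squares solver. It assumes NumPy is available; the sample data (generated from y = 2 − x + 0.5x²) and variable names are illustrative only.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([6.0, 3.5, 2.0, 1.5, 2.0])  # generated from y = 2 - x + 0.5x^2
degree = 2

# Each column is a power of x: [1, x, x^2]. The model is X @ a, which is linear in a.
X = np.vander(x, degree + 1, increasing=True)

# Ordinary linear least squares on the transformed columns
a, residual_sum, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(a)  # coefficients ordered a0, a1, a2
```

The solver never needs to know that the columns are powers of the same variable. It only sees a linear model with three inputs, which is why the same machinery handles any polynomial degree.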
Key Considerations for Successful Polynomial Fitting
Avoid these common pitfalls when applying polynomial regression to your data:
- Overfitting with high-degree polynomials — A polynomial whose degree is at least one less than the number of data points (with distinct x values) will pass through every point exactly, but will likely perform poorly on new data. A degree-4 polynomial fitted to exactly 5 points has no freedom to smooth noise or measurement error. Always validate your model on hold-out data.
- Insufficient data points — For a degree-n polynomial, you need at least n+1 data points with distinct x values to solve the system of equations. With exactly n+1 points, the fit is mathematically perfect but unvalidated. Aim for significantly more points, ideally 10–20 times the degree, to obtain a robust, generalisable model.
- Extrapolation beyond your data range — Polynomials can behave wildly outside the range of your input data, particularly high-degree ones. A cubic that fits temperatures across a calendar year will produce nonsensical predictions for years before or after your observation period. Restrict predictions to the domain of your original measurements.
- Ignoring residual patterns — After fitting, examine a plot of residuals (observed minus predicted values) against the independent variable. If residuals show a systematic pattern, your chosen polynomial degree may be inappropriate, or the relationship may be governed by omitted variables. A well-fitted model produces randomly scattered residuals.
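The pitfalls above can be seen in a small experiment. The sketch below, which assumes NumPy is installed, fits a degree-1 and a degree-4 polynomial to five made-up measurements scattered around the line y = x, then compares their residuals and their predictions outside the observed range. The data values and variable names are invented for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Five made-up measurements scattered around the line y = x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

linear = Polynomial.fit(x, y, deg=1)   # 2 coefficients, 5 points: noise is averaged out
quartic = Polynomial.fit(x, y, deg=4)  # 5 coefficients, 5 points: exact interpolation

# Residuals = observed minus predicted
linear_residuals = y - linear(x)
quartic_residuals = y - quartic(x)     # numerically zero, yet the fit is unvalidated

# Extrapolate to x = 8, well outside the observed range [0, 4]
x_new = 8.0
linear_prediction = linear(x_new)
quartic_prediction = quartic(x_new)

print("max |residual|, degree 1:", np.max(np.abs(linear_residuals)))
print("max |residual|, degree 4:", np.max(np.abs(quartic_residuals)))
print("prediction at x = 8, degree 1:", linear_prediction)
print("prediction at x = 8, degree 4:", quartic_prediction)
```

With this data the degree-4 curve reproduces all five points, yet its prediction at x = 8 departs sharply from the trend the linear fit follows. Zero residuals on the fitted points therefore say nothing about how the model behaves on new or out-of-range inputs.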