Understanding Correlation Coefficients
A correlation coefficient is a numerical summary of the linear or rank-based relationship between two variables. All of the coefficients covered here range from −1 to +1: values near zero indicate weak or no association, and values near ±1 indicate strong association.
- Positive correlation: As one variable increases, the other tends to increase.
- Negative correlation: As one variable increases, the other tends to decrease.
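- Zero correlation: No systematic linear (or, for rank-based coefficients, monotonic) relationship exists between the variables. A curved relationship, such as a U-shape, can still produce a coefficient near zero.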
The choice of coefficient depends on your data type. Pearson works best with continuous, normally distributed variables. Spearman and Kendall tau suit ranked or non-normal data. Matthews correlation is designed specifically for binary classification performance.
Pearson Correlation Formula
Pearson correlation measures the strength of linear association between two continuous variables. It is calculated as the covariance between the variables divided by the product of their standard deviations:
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]
- r: Pearson correlation coefficient (ranges from −1 to +1)
- xᵢ, yᵢ: Individual data points for variables X and Y
- x̄, ȳ: Mean (average) values of X and Y respectively
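As an illustration, the formula above can be translated term by term into a few lines of Python. This is a minimal sketch using only the standard library; the function name pearson_r and the sample values (hours, scores) are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations from the means: (x_i - x̄) and (y_i - ȳ)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator

# Hypothetical sample data, invented for this example
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, scores), 4))  # 0.7746
```

Here the numerator is 6 and the denominator is √(10 × 6) ≈ 7.746, which gives r ≈ 0.77, a fairly strong positive linear association.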
Spearman and Kendall Rank Correlations
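When the relationship cannot be assumed linear, or the data cannot be assumed normally distributed, rank-based methods provide robust alternatives. Spearman's rho is the Pearson formula applied to ranked data, while Kendall's tau compares the number of concordant and discordant pairs. The Kendall formula shown here is the tau-a form, which makes no adjustment for tied values.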
Spearman ρ = Cov(rank(X), rank(Y)) / [SD(rank(X)) × SD(rank(Y))]
Kendall τ = (C − D) / [n(n−1)/2]
- rank(X), rank(Y): Ranks of observations from lowest to highest
- C, D: Number of concordant and discordant pairs
- n: Total number of observations
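Both rank formulas can be sketched in plain Python. The code below is one possible reading of them, not a reference implementation. It assumes tied values share their average rank (a common convention that the formulas above do not spell out) and that tied pairs count as neither concordant nor discordant. All function names and the sample data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation (assumes neither input is constant)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

def kendall_tau(xs, ys):
    """Kendall's tau as (C - D) / [n(n-1)/2]; tied pairs count as neither."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical sample data, invented for this example
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(round(spearman_rho(x, y), 4))  # 0.8
print(round(kendall_tau(x, y), 4))   # 0.6
```

In this example 8 of the 10 pairs are concordant and 2 are discordant, so τ = (8 − 2) / 10 = 0.6. Kendall's tau is often somewhat smaller in magnitude than Spearman's rho on the same data, as it is here.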
Matthews Correlation Coefficient
Matthews correlation (MCC) evaluates binary classifier performance using a 2×2 confusion matrix. It balances all four outcome categories and is particularly useful in machine learning when class imbalance exists.
MCC = (TP × TN − FP × FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
- TP: True positives, correctly predicted positive cases
- TN: True negatives, correctly predicted negative cases
- FP: False positives, negative cases predicted as positive
- FN: False negatives, positive cases predicted as negative
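The MCC formula maps directly onto the four confusion-matrix counts. The sketch below uses a hypothetical function name, mcc, and invented counts. Returning 0 when the denominator is zero is a widely used convention, but it is an assumption here rather than something the formula itself defines.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption: treat the undefined case (an empty row or column) as 0
        return 0.0
    return numerator / denominator

# Hypothetical counts for a reasonably balanced test set
print(round(mcc(tp=6, tn=3, fp=1, fn=2), 4))  # 0.4781

# Imbalanced case: 95 positives, 5 negatives, classifier predicts "positive" for all
accuracy = (95 + 0) / 100
print(accuracy)                       # 0.95
print(mcc(tp=95, tn=0, fp=5, fn=0))   # 0.0
```

The second case illustrates the class-imbalance point: accuracy looks excellent at 0.95, while MCC reports that the classifier carries no information about the negative class.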
Common Pitfalls and Key Considerations
How strong a correlation appears, and whether it is valid, depends on how you calculate and interpret it.
- Correlation ≠ Causation: A strong correlation between two variables does not imply that one causes the other. Confounding variables, reverse causality, or pure coincidence can produce high correlations. Always investigate underlying mechanisms before drawing causal conclusions.
- Sample Size Matters: Small samples produce unstable correlation estimates with wide confidence intervals. Results from 5–10 observations are unreliable; as a rule of thumb, aim for at least 30 pairs for meaningful inference. Larger samples improve the precision of the estimate.
- Outliers Distort Pearson Correlation: Extreme values can dramatically shift Pearson's r. If your data contains outliers, consider Spearman's rho or Kendall's tau instead, which ignore magnitude and rely only on rank order. Always visualize your data with a scatter plot first.
- Choose the Right Coefficient for Your Data Type: Pearson assumes continuous, roughly normal data. Ranked or ordinal data call for Spearman or Kendall. Binary classification results call for Matthews correlation. Using the wrong method can yield misleading conclusions.
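The outlier pitfall is easy to see in a small experiment. The sketch below compares Pearson's r with a rank-based (Spearman-style) coefficient on the same data, before and after one extreme value is introduced. The data are invented, the helper names are hypothetical, and the ranking helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation (assumes neither input is constant)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def simple_ranks(values):
    """Ranks 1..n from lowest to highest; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson_r(simple_ranks(xs), simple_ranks(ys))

# Hypothetical data: a perfectly linear series, then the same series
# with the last value replaced by an extreme outlier
x = [1, 2, 3, 4, 5, 6]
y_clean = [1, 2, 3, 4, 5, 6]
y_outlier = [1, 2, 3, 4, 5, 100]

for label, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    print(label, round(pearson_r(x, y), 2), round(spearman_rho(x, y), 2))
```

On the clean series both coefficients equal 1. With the outlier, Pearson's r falls well below 1 even though y still increases with x at every step, while the rank-based coefficient stays at 1 because the ordering of the values has not changed.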