Understanding Confusion Matrix Components
A confusion matrix breaks down classification predictions into four categories based on whether predictions match reality. True positives (TP) are cases correctly predicted as positive. False positives (FP) are cases incorrectly predicted as positive when they were actually negative. False negatives (FN) are cases incorrectly predicted as negative when they were actually positive. True negatives (TN) are cases correctly predicted as negative.
Arranging these four values in a 2×2 matrix reveals patterns in model behaviour. A model may excel at identifying positives but struggle with negatives, or vice versa. The relative sizes of these counts directly influence all downstream metrics, making them the foundation of performance assessment.
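As a minimal sketch of how the four counts arise, the following Python tallies them from paired lists of actual and predicted labels. The function name `confusion_counts`, the sample labels, and the choice of `1` as the positive class are illustrative assumptions, not something prescribed above. The printed layout puts actual classes in rows and predicted classes in columns, which is one common convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for two equal-length lists of binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical labels, chosen only for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3

# 2x2 layout: rows are actual classes, columns are predicted classes.
print("                 pred positive  pred negative")
print(f"actual positive  {tp:>13}  {fn:>13}")
print(f"actual negative  {fp:>13}  {tn:>13}")
```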
Key Performance Metrics Explained
Accuracy measures the proportion of correct predictions overall: how often the model got it right across all cases. It works well when classes are balanced, but can be misleading with imbalanced data.
Precision answers: of all cases predicted positive, how many were actually positive? High precision means few false alarms, which is critical wherever a false positive is expensive, such as a spam filter that would otherwise block legitimate mail.
Recall (also called sensitivity) answers: of all truly positive cases, how many did we catch? High recall matters when missing positives is costly, such as fraud detection.
F1 score balances precision and recall into a single metric, useful when you want both qualities without favouring one over the other.
Specificity measures the true negative rate: the fraction of actual negatives that the model correctly identifies. The false positive rate and false negative rate are the complements of specificity and recall. They give the fractions of actual negatives and actual positives, respectively, that the model misclassifies.
Confusion Matrix Formulas
The following equations derive performance metrics from the four confusion matrix counts:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
True Positive Rate (Recall, Sensitivity) = TP / (TP + FN)
True Negative Rate (Specificity) = TN / (TN + FP)
False Positive Rate = FP / (FP + TN)
False Negative Rate = FN / (FN + TP)
Matthews Correlation Coefficient = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
- TP: Count of cases correctly predicted as positive
- TN: Count of cases correctly predicted as negative
- FP: Count of cases incorrectly predicted as positive
- FN: Count of cases incorrectly predicted as negative
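The formulas above translate almost line for line into code. The sketch below is one possible implementation, not a reference one: the names `metrics` and `safe_div`, the sample counts, and the choice to return NaN when a denominator is zero are all assumptions made for illustration.

```python
import math


def safe_div(numerator, denominator):
    """Divide, returning NaN when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else float("nan")


def metrics(tp, tn, fp, fn):
    """Compute the listed metrics from the four confusion matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,  # same quantity as the true positive rate
        "f1": safe_div(2 * precision * recall, precision + recall),
        "true_negative_rate": safe_div(tn, tn + fp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, chosen only for illustration.
for name, value in metrics(tp=40, tn=45, fp=5, fn=10).items():
    print(f"{name:>20}: {value:.4f}")
```

Returning NaN instead of raising keeps undefined cases visible, for example precision when the model never predicts the positive class. Whether an undefined metric should be reported as NaN, as 0, or as an error depends on the application.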
Interpreting Results with a Practical Example
Suppose a medical diagnostic model is tested on 200 patients, giving 80 TP, 30 TN, 20 FP, and 70 FN. Accuracy is (80 + 30) / 200 = 0.55, or 55%, which looks moderate but is poor here: 150 of the 200 patients are actually positive, so a model that labelled everyone positive would score 75%. Precision is 80 / (80 + 20) = 0.80, meaning 80% of patients flagged as having the disease actually have it. Recall is 80 / (80 + 70) ≈ 0.53, so the model catches only about half of the truly sick patients and misses 70 of them, which is alarming in a medical context. The F1 score, 2 × 0.80 × 0.53 / (0.80 + 0.53) ≈ 0.64, sits between the two and reflects this imbalance. The example shows why a single aggregate figure can mislead: the precision looks respectable, yet the high number of false negatives makes this model unsuitable for clinical use.
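The arithmetic in this example can be checked in a few lines of Python. The counts are the ones given above. The always-positive baseline is an added point of comparison, not part of the original example, and the variable names are arbitrary.

```python
# Counts from the worked example: 200 patients in total.
tp, tn, fp, fn = 80, 30, 20, 70
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Added comparison (an assumption for illustration): a model that labels
# every patient positive is correct on all actual positives and no negatives.
always_positive_accuracy = (tp + fn) / total

print(f"accuracy  = {accuracy:.2f}")   # 0.55
print(f"precision = {precision:.2f}")  # 0.80
print(f"recall    = {recall:.2f}")     # 0.53
print(f"F1        = {f1:.2f}")         # 0.64
print(f"always-positive accuracy = {always_positive_accuracy:.2f}")  # 0.75
```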
Common Pitfalls When Using Confusion Matrices
Avoid these frequent mistakes when assessing classification model performance.
- Ignoring class imbalance — When one class dominates (e.g., 95% negative cases), accuracy alone can be deceptive. A naive model predicting everything as negative achieves 95% accuracy despite being useless. Always examine precision, recall, and F1 score when classes are imbalanced.
- Confusing precision and recall — Precision focuses on false alarms among positive predictions, while recall measures missed true positives. A spam filter needs high precision to avoid blocking legitimate mail; a cancer detector needs high recall to avoid missing patients. Know which metric matters for your use case.
- Forgetting the cost of errors — Not all mistakes carry equal weight. A false negative in fraud detection costs more than a false positive, while the reverse may be true for email classification. Adjust your threshold and metric priorities based on real business consequences.
- Assuming one test set is enough — A confusion matrix from a single hold-out test set can be misleading due to random variation. Cross-validation over multiple folds provides a more robust picture of model reliability and generalization.
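The class imbalance pitfall is easy to reproduce. The sketch below builds a hypothetical data set of 1,000 cases with 95% negatives and scores a model that predicts negative every time. The data set size and variable names are assumptions made for illustration.

```python
# Hypothetical imbalanced data: 50 positives (1) and 950 negatives (0).
y_true = [1] * 50 + [0] * 950

# A naive "model" that predicts negative for every case.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # defined here because positives exist in y_true

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=0 TN=950 FP=0 FN=50
print(f"accuracy = {accuracy:.2f}")        # 0.95
print(f"recall   = {recall:.2f}")          # 0.00
# Precision is undefined (0 / 0) because the model never predicts positive.
```

Accuracy reports 95% while recall shows that not a single positive case was found, which is why the per-class metrics are worth checking alongside it.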