
Classification Metrics
Aryan · Mar 2
Imagine you have developed a logistic regression model designed for detecting spam emails. This model outputs a probability score between 0 and 1, indicating the likelihood that an email is spam. For example, a score of 0.50 suggests a 50% chance of being spam, while a score of 0.75 indicates a 75% likelihood.
Setting a Classification Threshold
To effectively filter spam, you must convert the model's probability output into clear categories: "spam" or "not spam". This is done by setting a classification threshold.
Emails with a predicted probability higher than the threshold are classified as spam (positive class).
Emails with a predicted probability lower than the threshold are classified as not spam (negative class).
For example, if the threshold is 0.5, any email with a score above 0.5 will be marked as spam. If you raise the threshold to 0.95, only emails with a probability of 95% or higher will be classified as spam.
Special Case: If a score exactly matches the threshold (like 0.5), how it's classified depends on the tool used. For instance, the Keras library defaults to predicting the negative class in such cases, but other tools might behave differently.
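As a minimal sketch of what this thresholding step looks like in code (the probability values below are invented for illustration):

```python
import numpy as np

# Hypothetical spam probabilities produced by a trained model
probs = np.array([0.12, 0.50, 0.63, 0.97, 0.75])

# Classify as spam (1) when the score is strictly above the threshold;
# a score exactly equal to the threshold falls into the negative class here,
# mirroring the tie-breaking behaviour described above.
threshold = 0.5
labels_default = (probs > threshold).astype(int)  # [0, 0, 1, 1, 1]

# Raising the threshold makes the classifier more conservative
labels_strict = (probs > 0.95).astype(int)        # [0, 0, 0, 1, 0]

print(labels_default, labels_strict)
```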
Choosing the Right Threshold
While setting the threshold at 0.5 might seem logical, it isn’t always ideal. If your dataset is imbalanced (e.g., only 0.01% of emails are spam) or if incorrectly classifying a legitimate email as spam has severe consequences, a higher threshold can help reduce false positives (at the cost of letting more spam through).
Confusion Matrix
A confusion matrix is a table that evaluates the performance of a classification model by comparing predictions with actual outcomes. It consists of four categories:
True Positive (TP): Model correctly predicts a positive outcome.
True Negative (TN): Model correctly predicts a negative outcome.
False Positive (FP): Model incorrectly predicts a positive outcome (Type I error).
False Negative (FN): Model incorrectly predicts a negative outcome (Type II error).
The confusion matrix helps identify model weaknesses and areas for improvement.

Why Do We Need a Confusion Matrix?
The confusion matrix provides a detailed breakdown of correct and incorrect predictions. It helps calculate performance metrics like accuracy, precision, recall, and F1-score.
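For example, scikit-learn's confusion_matrix tabulates these four counts from true and predicted labels; the short label lists below are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# 1 = spam (positive class), 0 = not spam (negative class); toy labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```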
Type I and Type II Errors
Type I Error (False Positive): Incorrectly classifies a negative instance as positive.
Type II Error (False Negative): Incorrectly classifies a positive instance as negative.
Confusion Matrix for Binary Classification
For a dog image recognition example, the 2×2 confusion matrix has the following layout (rows are actual classes, columns are predicted classes):

                    Predicted: Dog         Predicted: Not Dog
Actual: Dog         True Positive (TP)     False Negative (FN)
Actual: Not Dog     False Positive (FP)    True Negative (TN)
Confusion Matrix for Multi-Class Classification
For multi-class classification, the confusion matrix expands based on the number of classes.
A worked numeric example (a 3×3 Cat/Dog/Rabbit matrix) appears in the Precision, Recall, and F1 Score sections below.
Accuracy
Accuracy score is a metric used to evaluate the performance of a classification model. It measures the ratio of correctly predicted instances to the total number of instances in a dataset. The formula for accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where:
TP (True Positive): Correctly predicted positive instances.
TN (True Negative): Correctly predicted negative instances.
FP (False Positive): Incorrectly predicted positive instances.
FN (False Negative): Incorrectly predicted negative instances.
Accuracy provides a straightforward measure of overall model performance but may not always be the best metric, especially for imbalanced datasets.
Accuracy in Binary Classification
Binary classification deals with two classes, such as spam vs. not spam in email classification.
Example: Spam Email Classification
Consider a dataset of 1000 emails where:
700 are not spam (negative class)
300 are spam (positive class)
A model makes the following predictions:
TP = 250 (Correctly classified spam emails)
TN = 600 (Correctly classified non-spam emails)
FP = 100 (Non-spam emails incorrectly classified as spam)
FN = 50 (Spam emails incorrectly classified as non-spam)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= (250 + 600) / (250 + 600 + 100 + 50)
= 850 / 1000
= 0.85 (or 85%)
Accuracy is useful when both classes are balanced, but in highly imbalanced datasets, other metrics like precision, recall, or F1-score may be more informative.
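A quick sketch that reproduces the calculation above from the four counts of this example, and cross-checks it by rebuilding label arrays for scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Counts from the spam example above
tp, tn, fp, fn = 250, 600, 100, 50

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85, i.e. 85%

# Same value via scikit-learn, reconstructing label arrays from the counts
y_true = [1] * (tp + fn) + [0] * (tn + fp)
y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
print(accuracy_score(y_true, y_pred))  # 0.85
```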
Accuracy in Multi-Class Classification
Multi-class classification involves more than two classes, such as classifying handwritten digits (0-9) or categorizing news articles into multiple topics.
Example: Handwritten Digit Recognition (0-9)
A model is trained to classify images of handwritten digits (0-9). The dataset has 10,000 test images evenly distributed across the 10 classes. If the model correctly classifies 9,000 images, the accuracy is:
Accuracy = 9,000 / 10,000 = 0.90 (or 90%)
In multi-class classification, accuracy is calculated in the same way but may not reflect class-wise performance. For example, a model may perform well on some classes but poorly on others. In such cases, class-wise precision, recall, and the confusion matrix provide deeper insights.
Limitations of Accuracy in Multiclass Scenarios
When dealing with imbalanced datasets (where some classes have significantly more samples than others), accuracy can be misleading. A model might perform well on the majority class while performing poorly on the minority classes, yet still show a high overall accuracy.
Alternative Metrics for Multiclass Classification
To get a more comprehensive evaluation of your model's performance, consider these metrics (a short code sketch follows this list):
Precision (per class): Measures the proportion of true positives among all predicted positives for a class.
Recall (per class): Measures the proportion of true positives detected out of all actual instances of that class.
F1-Score: Harmonic mean of precision and recall, offering a balance between the two.
Macro Average: Calculates metrics independently for each class and then takes the average, treating all classes equally.
Weighted Average: Similar to macro but accounts for class imbalance by weighting each class’s contribution by its support (number of true instances).
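The difference between macro and weighted averaging is easiest to see on an imbalanced toy example; the labels below are invented purely for illustration:

```python
from sklearn.metrics import f1_score

# Imbalanced toy data: class 0 dominates, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Macro: average the per-class F1 scores, treating both classes equally
print(f1_score(y_true, y_pred, average="macro"))     # 0.6875
# Weighted: average the per-class F1 scores, weighted by class support
print(f1_score(y_true, y_pred, average="weighted"))  # 0.80
```

Because the rare class is handled poorly, the macro average drops noticeably while the weighted average stays high, dominated by the majority class.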
What Is a Good Accuracy Score?
The definition of a "good" accuracy score depends on the problem you are trying to solve and the real-world consequences of errors made by the model.
Context-Dependent Accuracy Requirements
High-Stakes Applications (Critical Systems)
In cases where even a small error can have severe consequences, such as self-driving cars, the model must have near-perfect accuracy (close to 100%).
For example, if a self-driving car has an accuracy of 97%, it might still misinterpret important situations, potentially leading to accidents. Hence, deploying such a model would be unsafe.
Low-Stakes Applications (Non-Critical Systems)
In scenarios where minor errors are acceptable and won’t cause significant harm, a lower accuracy score can be sufficient.
For instance, a food delivery app like Swiggy that predicts whether a customer will order food can be deployed even with an accuracy of 80%. A wrong prediction here won’t have serious consequences—it might only result in a missed sales opportunity.
Key Takeaway
The acceptable level of accuracy depends on:
The importance of the decision (life-critical vs. business optimization)
The cost of false positives and false negatives
The balance between accuracy and other metrics like precision, recall, or F1-score, especially for imbalanced datasets
In short:
Critical systems demand very high accuracy (near 100%)
Non-critical systems can perform well with moderate accuracy (70-90%), depending on business needs.
What Is the Problem with Accuracy?
While accuracy is a commonly used metric to evaluate machine learning models, it has significant limitations—especially when it doesn't reveal the type of mistakes a model is making.
Why Is This a Problem?
Accuracy simply measures the proportion of correct predictions out of the total predictions made. However, it doesn’t indicate what kind of errors the model is making—whether they are false positives or false negatives—and this distinction can be critical depending on the problem.
Example Scenarios
Self-Driving Car (Critical System)
A model with 97% accuracy might still misclassify stop signs as speed limit signs just 3% of the time, which could lead to severe accidents.
Here, false negatives (failing to detect a stop sign) are much more dangerous than false positives.
Swiggy Customer Prediction (Non-Critical System)
A model predicting whether a customer will order food might have 80% accuracy.
If most of the mistakes are false positives (predicting the customer will order when they won’t), the business might waste resources on unnecessary marketing campaigns.
If the mistakes are false negatives (predicting the customer won’t order when they would), the company might lose potential sales.
The Core Issue
Accuracy fails to answer:
What kind of errors is the model making?
Are these errors equally harmful, or does one type of mistake cost more than the other?
This is why, especially in imbalanced datasets or high-risk applications, relying solely on accuracy can be misleading. Metrics like precision, recall, and the confusion matrix provide deeper insights into the types of errors being made.
Precision
Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It answers the question:
"Of all the positive predictions made by the model, how many were actually correct?"
The formula for Precision is:
Precision = TP / (TP + FP)
Where:
True Positives (TP): Correctly predicted positive cases
False Positives (FP): Incorrectly predicted positive cases (actually negative)
A high precision indicates that the model makes fewer false positive errors.
Importance of Precision
Useful when False Positives are costly
Example: In spam detection, low precision means many non-spam emails are incorrectly classified as spam, which is undesirable.
Inverse relationship with Recall
Increasing precision often decreases recall and vice versa (Precision-Recall tradeoff).
Calculating Precision for Binary Classification
Consider a spam detection model with TP = 80 (spam emails correctly flagged as spam), FP = 10 (legitimate emails incorrectly flagged as spam), and FN = 20 (spam emails missed).
Step 1: Calculate Precision
Precision = TP / (TP + FP)
= 80 / (80 + 10)
= 80 / 90
= 0.89 (or 89%)
Interpretation: Out of all emails classified as spam, 89% were actually spam.
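As a quick check of the arithmetic (counts taken from the example above):

```python
tp, fp = 80, 10

precision = tp / (tp + fp)
print(round(precision, 2))  # 0.89
```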
Calculating Precision for Multi-Class Classification
For multi-class classification, precision is calculated per class using a one-vs-all approach.
Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits) with the following confusion matrix (rows are actual classes, columns are predicted classes):

                  Predicted Cat   Predicted Dog   Predicted Rabbit
Actual Cat              40              5                 5
Actual Dog              10             35                 5
Actual Rabbit            5              5                40
Step 1: Calculate Precision per Class
For Cat
Precision = TP / (TP + FP)
= 40 / (40 + 15)
= 40 / 55
= 0.727 (or 72.7%)
For Dog
Precision = TP / (TP + FP)
= 35 / (35 + 10)
= 35 / 45
= 0.778 (or 77.8%)
For Rabbit
Precision = TP / (TP + FP)
= 40 / (40 + 10)
= 40 / 50
= 0.80 (or 80%)
Step 2: Compute Macro and Weighted Precision
Macro Precision (Unweighted Average) = (Precision_Cat + Precision_Dog + Precision_Rabbit) / 3
= (0.727 + 0.778 + 0.80) / 3
= 0.768
Weighted Precision (Weighted by Support) = [(0.727 x 50) + (0.778 x 50) + (0.80 x 50)] / 150
= (36.35 + 38.90 + 40.00) / 150
= 115.25 / 150
= 0.768
Since every class has the same support here (50 actual instances), the weighted average matches the macro average.
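These per-class, macro, and weighted values can be read directly off the confusion matrix with NumPy; the sketch below assumes the matrix shown above, with rows as actual classes and columns as predicted classes:

```python
import numpy as np

# Rows: actual Cat, Dog, Rabbit; columns: predicted Cat, Dog, Rabbit
cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])

tp = np.diag(cm)             # [40, 35, 40]
predicted = cm.sum(axis=0)   # column sums: TP + FP per class
support = cm.sum(axis=1)     # row sums: actual instances per class

precision = tp / predicted                                    # [0.727, 0.778, 0.800]
macro_precision = precision.mean()                            # ~0.768
weighted_precision = np.average(precision, weights=support)   # ~0.768 (equal supports)

print(precision, macro_precision, weighted_precision)
```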
Recall
Recall, also known as Sensitivity or True Positive Rate (TPR), measures the ability of a classification model to correctly identify all relevant instances of a particular class. It is defined as:
Recall = TP / (TP + FN)
Where:
True Positives (TP): Correctly predicted positive cases
False Negatives (FN): Actual positive cases that were incorrectly classified as negative
A high recall means that very few positive instances were missed by the model.
Importance of Recall
Crucial in scenarios where missing a positive case is costly (e.g., medical diagnoses, fraud detection).
Inversely related to Precision; improving recall often decreases precision and vice versa.
Considered alongside precision (the Precision-Recall trade-off) when tuning a model for the best overall performance.
Calculating Recall for Binary Classification
Consider a binary classification problem for spam detection where:
Positive Class (1) → Spam Email
Negative Class (0) → Not Spam Email
Suppose the model produces TP = 80 and FN = 20. Using the formula:
Recall = TP/(TP+FN)
= 80/(80+20)
= 80/100
= 0.80 (or 80%)
Interpretation: The model correctly identifies 80% of spam emails but misses 20% of them.
Calculating Recall for Multi-Class Classification
For multi-class classification, recall is calculated per class using a one-vs-all approach.
Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits), using the same confusion matrix shown in the Precision section.
Recall Calculation per Class
Recall for Cat
Recall = TP/(TP+FN)
= 40/(40+5+5)
= 40/50
= 0.80 (or 80%)
Recall for Dog
Recall = TP/(TP+FN)
= 35/(35+10+5)
= 35/50
= 0.70 (or 70%)
Recall for Rabbit
Recall = TP/(TP+FN)
= 40/(40+5+5)
= 40/50
= 0.80 (or 80%)
Macro and Weighted Recall
Macro Recall (Unweighted Average):
Macro Recall = (Recall_Cat + Recall_Dog + Recall_Rabbit) / 3
= (0.80 + 0.70 + 0.80) / 3
= 0.77
Weighted Recall (Weighted by Support):
Weighted Recall = [(0.80 x 50) + (0.70 x 50) + (0.80 x 50)] / 150
= (40 + 35 + 40) / 150
= 115 / 150
= 0.77
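The same confusion matrix gives recall per class from the row sums; this sketch reuses the Cat/Dog/Rabbit matrix from the Precision section:

```python
import numpy as np

cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])

tp = np.diag(cm)
support = cm.sum(axis=1)     # actual instances per class

recall = tp / support                                   # [0.80, 0.70, 0.80]
macro_recall = recall.mean()                            # ~0.77
weighted_recall = np.average(recall, weights=support)   # ~0.77

print(recall, macro_recall, weighted_recall)
```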
F1 Score
The F1 Score is the harmonic mean of Precision and Recall. It summarizes a model’s performance in a single number that balances both metrics, which is especially useful when the dataset is imbalanced.
The formula for F1 Score is:
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)
Where:
Precision = TP / (TP + FP) (How many predicted positives are actually positive?)
Recall = TP / (TP + FN) (How many actual positives were correctly predicted?)
The F1 Score ranges between 0 and 1, where:
1 means perfect precision and recall.
0 means the model failed completely.
Importance of F1 Score
Handles Class Imbalance: F1 Score is preferred over Accuracy when classes are imbalanced.
Balances Precision and Recall: Useful when both false positives and false negatives matter and neither metric should be favored on its own.
Prevents Misleading Interpretations: High accuracy in imbalanced datasets can be misleading; F1 Score gives a more balanced view.
Calculating F1 Score for Binary Classification
Consider the same spam detection model as before, with TP = 80, FP = 10, and FN = 20.
Step 1: Calculate Precision and Recall
Precision = TP / (TP + FP) = 80 / (80 + 10) = 80/90 = 0.89
Recall = TP / (TP + FN) = 80/ (80 + 20) = 80/100 = 0.80
Step 2: Calculate F1 Score
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)
= 2 x (0.89 x 0.80) / (0.89 + 0.80)
= 2 x (0.712) / (1.69)
= 1.424 / 1.69
= 0.84 (or 84%)
Interpretation: The model achieves an F1 Score of 84%, indicating a strong balance between precision and recall.
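The same result in code, starting from the raw counts of the spam example (TP = 80, FP = 10, FN = 20):

```python
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # ~0.89
recall = tp / (tp + fn)      # 0.80
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))          # ~0.84
```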
Calculating F1 Score for Multi-Class Classification
For multi-class classification, the F1 Score is calculated per class using a one-vs-all approach and then averaged using Macro F1 or Weighted F1.
Consider the same three-class problem (classifying animals as Cats, Dogs, or Rabbits) and the same confusion matrix used in the Precision and Recall sections.
Step 1: Calculate Precision, Recall and F1 per Class
For Cat
Precision = TP / (TP + FP) = 40 / (40 + 15) = 40 / 55 = 0.727
Recall = TP / (TP + FN) = 40 / (40 + 5 + 5) = 40/50 = 0.80
F1 Score = 2 x (0.727 x 0.80) / (0.727 + 0.80) = 0.761
For Dog
Precision = TP / (TP + FP) = 35 / (35 + 10) = 35 / 45 = 0.778
Recall = TP / (TP + FN) = 35 / (35 + 10 + 5) = 35/50 = 0.70
F1 Score = 2 x (0.778 x 0.70) / (0.778 + 0.70) = 0.737
For Rabbit
Precision = TP / (TP + FP) = 40 / (40 + 10) = 40 / 50 = 0.80
Recall = TP / (TP + FN) = 40 / (40 + 5 + 5) = 40/50 = 0.80
F1 Score = 2 x (0.80 x 0.80) / (0.80 + 0.80) = 0.80
Step 2: Compute Macro and Weighted F1 Score
Macro F1 Score (Unweighted Average) = (F1_Cat + F1_Dog + F1_Rabbit) / 3
= (0.761 + 0.737 + 0.80) / 3
= 0.766
Weighted F1 Score (Weighted by Class Size) = [(0.761 x 50) + (0.737 x 50) + (0.80 x 50)] / 150
= (38.05 + 36.85 + 40) / 150
= 114.9 / 150
= 0.766
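As a final cross-check, one can rebuild label arrays from the Cat/Dog/Rabbit confusion matrix and let scikit-learn recompute every per-class and averaged metric; this is a sketch assuming the class order Cat, Dog, Rabbit:

```python
import numpy as np
from sklearn.metrics import classification_report

classes = ["Cat", "Dog", "Rabbit"]
cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])

# Expand the matrix back into (true, predicted) label pairs
y_true, y_pred = [], []
for i, actual in enumerate(classes):
    for j, predicted in enumerate(classes):
        y_true += [actual] * cm[i, j]
        y_pred += [predicted] * cm[i, j]

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=3))
```

The report should reproduce the hand-worked numbers above: per-class precision of roughly 0.727, 0.778, and 0.800, recall of 0.80, 0.70, and 0.80, and macro/weighted averages around 0.77.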

