
Classification Metrics
Aryan · Mar 2
Imagine you have developed a logistic regression model designed for detecting spam emails. This model outputs a probability score between 0 and 1, indicating the likelihood that an email is spam. For example, a score of 0.50 suggests a 50% chance of being spam, while a score of 0.75 indicates a 75% likelihood.
Setting a Classification Threshold
To effectively filter spam, you must convert the model's probability output into clear categories: "spam" or "not spam". This is done by setting a classification threshold.
Emails with a predicted probability higher than the threshold are classified as spam (positive class).
Emails with a predicted probability lower than the threshold are classified as not spam (negative class).
For example, if the threshold is 0.5, any email with a score above 0.5 will be marked as spam. If you raise the threshold to 0.95, only emails with a probability of 95% or higher will be classified as spam.
Special Case: If a score exactly matches the threshold (like 0.5), how it's classified depends on the tool used. For instance, the Keras library defaults to predicting the negative class in such cases, but other tools might behave differently.
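As a minimal sketch of what this thresholding step looks like in code (the probability values below are invented for illustration):

```python
import numpy as np

# Hypothetical spam probabilities produced by a trained model
probs = np.array([0.12, 0.50, 0.63, 0.97, 0.75])

# Classify as spam (1) when the score is strictly above the threshold;
# a score exactly equal to the threshold falls into the negative class here,
# mirroring the tie-breaking behaviour described above.
threshold = 0.5
labels_default = (probs > threshold).astype(int)  # [0, 0, 1, 1, 1]

# Raising the threshold makes the classifier more conservative
labels_strict = (probs > 0.95).astype(int)        # [0, 0, 0, 1, 0]

print(labels_default, labels_strict)
```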
Choosing the Right Threshold
While setting the threshold at 0.5 might seem logical, it isn’t always ideal. If your dataset is imbalanced (e.g., only 0.01% of emails are spam) or if incorrectly classifying a legitimate email as spam has severe consequences, a higher threshold can help reduce false positives (at the cost of letting more spam through).
Confusion Matrix
A confusion matrix is a table that evaluates the performance of a classification model by comparing predictions with actual outcomes. It consists of four categories:
True Positive (TP): Model correctly predicts a positive outcome.
True Negative (TN): Model correctly predicts a negative outcome.
False Positive (FP): Model incorrectly predicts a positive outcome (Type I error).
False Negative (FN): Model incorrectly predicts a negative outcome (Type II error).
The confusion matrix helps identify model weaknesses and areas for improvement.

Why Do We Need a Confusion Matrix?
The confusion matrix provides a detailed breakdown of correct and incorrect predictions. It helps calculate performance metrics like accuracy, precision, recall, and F1-score.
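For example, scikit-learn's confusion_matrix tabulates these four counts from true and predicted labels; the short label lists below are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# 1 = spam (positive class), 0 = not spam (negative class); toy labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```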
Type I and Type II Errors
Type I Error (False Positive): Incorrectly classifies a negative instance as positive.
Type II Error (False Negative): Incorrectly classifies a positive instance as negative.
Confusion Matrix for Binary Classification
For a dog image recognition example, the 2×2 confusion matrix has the following layout (rows are actual classes, columns are predicted classes):

                    Predicted: Dog         Predicted: Not Dog
Actual: Dog         True Positive (TP)     False Negative (FN)
Actual: Not Dog     False Positive (FP)    True Negative (TN)
Confusion Matrix for Multi-Class Classification
For multi-class classification, the confusion matrix expands based on the number of classes.
A worked numeric example (a 3×3 Cat/Dog/Rabbit matrix) appears in the Precision, Recall, and F1 Score sections below.
Accuracy
Accuracy score is a metric used to evaluate the performance of a classification model. It measures the ratio of correctly predicted instances to the total number of instances in a dataset. The formula for accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where:
TP (True Positive): Correctly predicted positive instances.
TN (True Negative): Correctly predicted negative instances.
FP (False Positive): Incorrectly predicted positive instances.
FN (False Negative): Incorrectly predicted negative instances.
Accuracy provides a straightforward measure of overall model performance but may not always be the best metric, especially for imbalanced datasets.
Accuracy in Binary Classification
Binary classification deals with two classes, such as spam vs. not spam in email classification.
Example: Spam Email Classification
Consider a dataset of 1000 emails where:
700 are not spam (negative class)
300 are spam (positive class)
A model makes the following predictions:
TP = 250 (Correctly classified spam emails)
TN = 600 (Correctly classified non-spam emails)
FP = 100 (Non-spam emails incorrectly classified as spam)
FN = 50 (Spam emails incorrectly classified as non-spam)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= (250 + 600) / (250 + 600 + 100 + 50)
= 850 / 1000
= 0.85 (or 85%)
Accuracy is useful when both classes are balanced, but in highly imbalanced datasets, other metrics like precision, recall, or F1-score may be more informative.
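A quick sketch that reproduces the calculation above from the four counts of this example, and cross-checks it by rebuilding label arrays for scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Counts from the spam example above
tp, tn, fp, fn = 250, 600, 100, 50

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85, i.e. 85%

# Same value via scikit-learn, reconstructing label arrays from the counts
y_true = [1] * (tp + fn) + [0] * (tn + fp)
y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
print(accuracy_score(y_true, y_pred))  # 0.85
```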
Accuracy in Multi-Class Classification
Multi-class classification involves more than two classes, such as classifying handwritten digits (0-9) or categorizing news articles into multiple topics.
Example: Handwritten Digit Recognition (0-9)
A model is trained to classify images of handwritten digits (0-9). The dataset has 10,000 test images evenly distributed across the 10 classes. If the model correctly classifies 9,000 images, the accuracy is:
Accuracy = 9,000 / 10,000 = 0.90 (or 90%)
In multi-class classification, accuracy is calculated in the same way but may not reflect class-wise performance. For example, a model may perform well on some classes but poorly on others. In such cases, class-wise precision, recall, and the confusion matrix provide deeper insights.
Limitations of Accuracy in Multiclass Scenarios
When dealing with imbalanced datasets (where some classes have significantly more samples than others), accuracy can be misleading. A model might perform well on the majority class while performing poorly on the minority classes, yet still show a high overall accuracy.
Alternative Metrics for Multiclass Classification
To get a more comprehensive evaluation of your model's performance, consider these metrics (a short code sketch follows this list):
Precision (per class): Measures the proportion of true positives among all predicted positives for a class.
Recall (per class): Measures the proportion of true positives detected out of all actual instances of that class.
F1-Score: Harmonic mean of precision and recall, offering a balance between the two.
Macro Average: Calculates metrics independently for each class and then takes the average, treating all classes equally.
Weighted Average: Similar to macro but accounts for class imbalance by weighting each class’s contribution by its support (number of true instances).
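The difference between macro and weighted averaging is easiest to see on an imbalanced toy example; the labels below are invented purely for illustration:

```python
from sklearn.metrics import f1_score

# Imbalanced toy data: class 0 dominates, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Macro: average the per-class F1 scores, treating both classes equally
print(f1_score(y_true, y_pred, average="macro"))     # 0.6875
# Weighted: average the per-class F1 scores, weighted by class support
print(f1_score(y_true, y_pred, average="weighted"))  # 0.80
```

Because the rare class is handled poorly, the macro average drops noticeably while the weighted average stays high, dominated by the majority class.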
What Is a Good Accuracy Score?
The definition of a "good" accuracy score depends on the problem you are trying to solve and the real-world consequences of errors made by the model.
Context-Dependent Accuracy Requirements
High-Stakes Applications (Critical Systems)
In cases where even a small error can have severe consequences, such as self-driving cars, the model must have near-perfect accuracy (close to 100%).
For example, if a self-driving car has an accuracy of 97%, it might still misinterpret important situations, potentially leading to accidents. Hence, deploying such a model would be unsafe.
Low-Stakes Applications (Non-Critical Systems)
In scenarios where minor errors are acceptable and won’t cause significant harm, a lower accuracy score can be sufficient.
For instance, a food delivery app like Swiggy that predicts whether a customer will order food can be deployed even with an accuracy of 80%. A wrong prediction here won’t have serious consequences—it might only result in a missed sales opportunity.
Key Takeaway
The acceptable level of accuracy depends on:
The importance of the decision (life-critical vs. business optimization)
The cost of false positives and false negatives
The balance between accuracy and other metrics like precision, recall, or F1-score, especially for imbalanced datasets
In short:
Critical systems demand very high accuracy (near 100%)
Non-critical systems can perform well with moderate accuracy (70-90%), depending on business needs.
What Is the Problem with Accuracy?
While accuracy is a commonly used metric to evaluate machine learning models, it has significant limitations—especially when it doesn't reveal the type of mistakes a model is making.
Why Is This a Problem?
Accuracy simply measures the proportion of correct predictions out of the total predictions made. However, it doesn’t indicate what kind of errors the model is making—whether they are false positives or false negatives—and this distinction can be critical depending on the problem.
Example Scenarios
Self-Driving Car (Critical System)
A model with 97% accuracy might still misclassify stop signs as speed limit signs just 3% of the time, which could lead to severe accidents.
Here, false negatives (failing to detect a stop sign) are much more dangerous than false positives.
Swiggy Customer Prediction (Non-Critical System)
A model predicting whether a customer will order food might have 80% accuracy.
If most of the mistakes are false positives (predicting the customer will order when they won’t), the business might waste resources on unnecessary marketing campaigns.
If the mistakes are false negatives (predicting the customer won’t order when they would), the company might lose potential sales.
The Core Issue
Accuracy fails to answer:
What kind of errors is the model making?
Are these errors equally harmful, or does one type of mistake cost more than the other?
This is why, especially in imbalanced datasets or high-risk applications, relying solely on accuracy can be misleading. Metrics like precision, recall, and the confusion matrix provide deeper insights into the types of errors being made.
Precision
Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It answers the question:
"Of all the positive predictions made by the model, how many were actually correct?"
The formula for Precision is:
Precision = TP / (TP + FP)
Where:
True Positives (TP): Correctly predicted positive cases
False Positives (FP): Incorrectly predicted positive cases (actually negative)
A high precision indicates that the model makes fewer false positive errors.
Importance of Precision
Useful when False Positives are costly
Example: In spam detection, low precision means many non-spam emails are incorrectly classified as spam, which is undesirable.
Inverse relationship with Recall
Increasing precision often decreases recall and vice versa (Precision-Recall tradeoff).
Calculating Precision for Binary Classification
Consider a spam detection model with TP = 80 (spam emails correctly flagged as spam), FP = 10 (legitimate emails incorrectly flagged as spam), and FN = 20 (spam emails missed).
Step 1: Calculate Precision
Precision = TP / (TP + FP)
= 80 / (80 + 10)
= 80 / 90
= 0.89 (or 89%)
Interpretation: Out of all emails classified as spam, 89% were actually spam.
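As a quick check of the arithmetic (counts taken from the example above):

```python
tp, fp = 80, 10

precision = tp / (tp + fp)
print(round(precision, 2))  # 0.89
```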
Calculating Precision for Multi-Class Classification
For multi-class classification, precision is calculated per class using a one-vs-all approach.
Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits) with the following confusion matrix (rows are actual classes, columns are predicted classes):

                  Predicted Cat   Predicted Dog   Predicted Rabbit
Actual Cat              40              5                 5
Actual Dog              10             35                 5
Actual Rabbit            5              5                40
Step 1: Calculate Precision per Class
For Cat
Precision = TP / (TP + FP)
= 40 / (40 + 15)
= 40 / 55
= 0.727 (or 72.7%)
For Dog
Precision = TP / (TP + FP)
= 35 / (35 + 10)
= 35 / 45
= 0.778 (or 77.8%)
For Rabbit
Precision = TP / (TP + FP)
= 40 / (40 + 10)
= 40 / 50
= 0.80 (or 80%)
Step 2: Compute Macro and Weighted Precision
Macro Precision (Unweighted Average) = (Precision_Cat + Precision_Dog + Precision_Rabbit) / 3
= (0.727 + 0.778 + 0.80) / 3
= 0.768
Weighted Precision (Weighted by Support) = [(0.727 x 50) + (0.778 x 50) + (0.80 x 50)] / 150
= (36.35 + 38.90 + 40.00) / 150
= 115.25 / 150
= 0.768
Since every class has the same support here (50 actual instances), the weighted average matches the macro average.
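These per-class, macro, and weighted values can be read directly off the confusion matrix with NumPy; the sketch below assumes the matrix shown above, with rows as actual classes and columns as predicted classes:

```python
import numpy as np

# Rows: actual Cat, Dog, Rabbit; columns: predicted Cat, Dog, Rabbit
cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])

tp = np.diag(cm)             # [40, 35, 40]
predicted = cm.sum(axis=0)   # column sums: TP + FP per class
support = cm.sum(axis=1)     # row sums: actual instances per class

precision = tp / predicted                                    # [0.727, 0.778, 0.800]
macro_precision = precision.mean()                            # ~0.768
weighted_precision = np.average(precision, weights=support)   # ~0.768 (equal supports)

print(precision, macro_precision, weighted_precision)
```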
Recall
Recall, also known as Sensitivity or True Positive Rate (TPR), measures the ability of a classification model to correctly identify all relevant instances of a particular class. It is defined as:
Recall = TP / (TP + FN)
Where:
True Positives (TP): Correctly predicted positive cases
False Negatives (FN): Actual positive cases that were incorrectly classified as negative
A high recall means that very few positive instances were missed by the model.
Importance of Recall
Crucial in scenarios where missing a positive case is costly (e.g., medical diagnoses, fraud detection).
Inversely related to Precision; improving recall often decreases precision and vice versa.
Considered alongside precision (the Precision-Recall trade-off) when tuning a model for the best overall performance.
Calculating Recall for Binary Classification
Consider a binary classification problem for spam detection where:
Positive Class (1) → Spam Email
Negative Class (0) → Not Spam Email
Suppose the model produces TP = 80 and FN = 20. Using the formula:
Recall = TP/(TP+FN)
= 80/(80+20)
= 80/100
= 0.80 (or 80%)
Interpretation: The model correctly identifies 80% of spam emails but misses 20% of them.
Calculating Recall for Multi-Class Classification
For multi-class classification, recall is calculated per class using a one-vs-all approach.
Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits), using the same confusion matrix shown in the Precision section.
Recall Calculation per Class
Recall for Cat
Recall = TP/(TP+FN)
= 40/(40+5+5)
= 40/50
= 0.80 (or 80%)
Recall for Dog
Recall = TP/(TP+FN)
= 35/(35+10+5)
= 35/50
= 0.70 (or 70%)
Recall for Rabbit
Recall = TP/(TP+FN)
= 40/(40+5+5)
= 40/50
= 0.80 (or 80%)
Macro and Weighted Recall
Macro Recall (Unweighted Average):
Macro Recall = (Recall_Cat + Recall_Dog + Recall_Rabbit) / 3
= (0.80 + 0.70 + 0.80) / 3
= 0.77
Weighted Recall (Weighted by Support):
Weighted Recall = [(0.80 x 50) + (0.70 x 50) + (0.80 x 50)] / 150
= (40 + 35 + 40) / 150
= 115 / 150
= 0.77
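The same confusion matrix gives recall per class from the row sums; this sketch reuses the Cat/Dog/Rabbit matrix from the Precision section:

```python
import numpy as np

cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])

tp = np.diag(cm)
support = cm.sum(axis=1)     # actual instances per class

recall = tp / support                                   # [0.80, 0.70, 0.80]
macro_recall = recall.mean()                            # ~0.77
weighted_recall = np.average(recall, weights=support)   # ~0.77

print(recall, macro_recall, weighted_recall)
```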
F1 Score
The F1 Score is the harmonic mean of Precision and Recall. It summarizes a model’s performance in a single number that balances both metrics, which is especially useful when the dataset is imbalanced.
The formula for F1 Score is:
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)
Where:
Precision = TP / (TP + FP) (How many predicted positives are actually positive?)
Recall = TP / (TP + FN) (How many actual positives were correctly predicted?)
The F1 Score ranges between 0 and 1, where:
1 means perfect precision and recall.
0 means the model failed completely.
Importance of F1 Score
Handles Class Imbalance: F1 Score is preferred over Accuracy when classes are imbalanced.
Balances Precision and Recall: Useful when both false positives and false negatives matter and neither metric should be favored on its own.
Prevents Misleading Interpretations: High accuracy in imbalanced datasets can be misleading; F1 Score gives a more balanced view.
Calculating F1 Score for Binary Classification
Consider the same spam detection model as before, with TP = 80, FP = 10, and FN = 20.
Step 1: Calculate Precision and Recall
Precision = TP / (TP + FP) = 80 / (80 + 10) = 80/90 = 0.89
Recall = TP / (TP + FN) = 80/ (80 + 20) = 80/100 = 0.80
Step 2: Calculate F1 Score
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)
= 2 x (0.89 x 0.80) / (0.89 + 0.80)
= 2 x (0.712) / (1.69)
= 1.424 / 1.69
= 0.84 (or 84%)
Interpretation: The model achieves an F1 Score of 84%, indicating a strong balance between precision and recall.
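The same result in code, starting from the raw counts of the spam example (TP = 80, FP = 10, FN = 20):

```python
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # ~0.89
recall = tp / (tp + fn)      # 0.80
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))          # ~0.84
```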
Calculating F1 Score for Multi-Class Classification
For multi-class classification, the F1 Score is calculated per class using a one-vs-all approach and then averaged using Macro F1 or Weighted F1.
Consider the same three-class problem (classifying animals as Cats, Dogs, or Rabbits) and the same confusion matrix used in the Precision and Recall sections.
Step 1: Calculate Precision, Recall and F1 per Class
For Cat
Precision = TP / (TP + FP) = 40 / (40 + 15) = 40 / 55 = 0.727
Recall = TP / (TP + FN) = 40 / (40 + 5 + 5) = 40/50 = 0.80
F1 Score = 2 x (0.727 x 0.80) / (0.727 + 0.80) = 0.761
For Dog
Precision = TP / (TP + FP) = 35 / (35 + 10) = 35 / 45 = 0.778
Recall = TP / (TP + FN) = 35 / (35 + 10 + 5) = 35/50 = 0.70
F1 Score = 2 x (0.778 x 0.70) / (0.778 + 0.70) = 0.737
For Rabbit
Precision = TP / (TP + FP) = 40 / (40 + 10) = 40 / 50 = 0.80
Recall = TP / (TP + FN) = 40 / (40 + 5 + 5) = 40/50 = 0.80
F1 Score = 2 x (0.80 x 0.80) / (0.80 + 0.80) = 0.80
Step 2: Compute Macro and Weighted F1 Score
Macro F1 Score (Unweighted Average) = (F1_Cat + F1_Dog + F1_Rabbit) / 3
= (0.761 + 0.737 + 0.80) / 3
= 0.766
Weighted F1 Score (Weighted by Class Size) = [(0.761 x 50) + (0.737 x 50) + (0.80 x 50)] / 150
= (38.05 + 36.85 + 40) / 150
= 114.9 / 150
= 0.766
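As a final cross-check, one can rebuild label arrays from the Cat/Dog/Rabbit confusion matrix and let scikit-learn recompute every per-class and averaged metric; this is a sketch assuming the class order Cat, Dog, Rabbit:

```python
import numpy as np
from sklearn.metrics import classification_report

classes = ["Cat", "Dog", "Rabbit"]
cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])

# Expand the matrix back into (true, predicted) label pairs
y_true, y_pred = [], []
for i, actual in enumerate(classes):
    for j, predicted in enumerate(classes):
        y_true += [actual] * cm[i, j]
        y_pred += [predicted] * cm[i, j]

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=3))
```

The report should reproduce the hand-worked numbers above: per-class precision of roughly 0.727, 0.778, and 0.800, recall of 0.80, 0.70, and 0.80, and macro/weighted averages around 0.77.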

