
Classification Metrics

  • Writer: Aryan
  • Mar 2
  • 11 min read

Imagine you have developed a logistic regression model designed for detecting spam emails. This model outputs a probability score between 0 and 1, indicating the likelihood that an email is spam. For example, a score of 0.50 suggests a 50% chance of being spam, while a score of 0.75 indicates a 75% likelihood.

 

Setting a Classification Threshold


To effectively filter spam, you must convert the model's probability output into clear categories: "spam" or "not spam". This is done by setting a classification threshold.

  • Emails with a predicted probability higher than the threshold are classified as spam (positive class).

  • Emails with a predicted probability lower than the threshold are classified as not spam (negative class).

For example, if the threshold is 0.5, any email with a score above 0.5 will be marked as spam. If you raise the threshold to 0.95, only emails with a score above 0.95 will be classified as spam.

Special Case: If a score exactly matches the threshold (like 0.5), how it's classified depends on the tool used. For instance, the Keras library defaults to predicting the negative class in such cases, but other tools might behave differently.
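
Here is a minimal sketch of how this thresholding step might look in code (the probability scores below are made up purely for illustration):

```python
import numpy as np

# Hypothetical probability scores produced by a spam classifier
scores = np.array([0.12, 0.50, 0.63, 0.97, 0.41])

threshold = 0.5
# Scores strictly above the threshold become spam (1); everything else,
# including an exact tie at 0.50, is treated as not spam (0).
labels = (scores > threshold).astype(int)
print(labels)  # [0 0 1 1 0]
```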

Choosing the Right Threshold

While setting the threshold at 0.5 might seem logical, it isn’t always ideal. If your dataset is imbalanced (e.g., only 0.01% of emails are spam) or if incorrectly classifying a legitimate email as spam has severe consequences, a higher threshold could help reduce errors.


CONFUSION MATRIX

 

A confusion matrix is a table that evaluates the performance of a classification model by comparing predictions with actual outcomes. It consists of four categories:

  • True Positive (TP): Model correctly predicts a positive outcome.

  • True Negative (TN): Model correctly predicts a negative outcome.

  • False Positive (FP): Model incorrectly predicts a positive outcome (Type I error).

  • False Negative (FN): Model incorrectly predicts a negative outcome (Type II error).

The confusion matrix helps identify model weaknesses and areas for improvement.


Why Do We Need a Confusion Matrix?


The confusion matrix provides a detailed breakdown of correct and incorrect predictions. It helps calculate performance metrics like accuracy, precision, recall, and F1-score.

 

Type I and Type II Errors

  • Type I Error (False Positive): Incorrectly classifies a negative instance as positive.

  • Type II Error (False Negative): Incorrectly classifies a positive instance as negative.


Confusion Matrix for Binary Classification

 

A 2×2 confusion matrix example for Dog image recognition:

 

                | Predicted Dog        | Predicted Not Dog
Actual Dog      | True Positive (TP)   | False Negative (FN)
Actual Not Dog  | False Positive (FP)  | True Negative (TN)

Example: Confusion Matrix for Dog Image Recognition

Index | Actual  | Predicted | Result
1     | Dog     | Dog       | TP
2     | Dog     | Not Dog   | FN
3     | Dog     | Dog       | TP
4     | Not Dog | Not Dog   | TN
5     | Dog     | Dog       | TP
6     | Not Dog | Dog       | FP
7     | Dog     | Dog       | TP
8     | Dog     | Dog       | TP
9     | Not Dog | Not Dog   | TN
10    | Not Dog | Not Dog   | TN
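
As a quick check, the same table can be reproduced with scikit-learn's confusion_matrix; in this sketch the two label lists simply transcribe the ten rows above:

```python
from sklearn.metrics import confusion_matrix

# The ten (actual, predicted) pairs from the table above
actual    = ["Dog", "Dog", "Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Not Dog"]
predicted = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Dog", "Dog", "Not Dog", "Not Dog"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted, labels=["Dog", "Not Dog"])
print(cm)
# [[5 1]     -> TP = 5, FN = 1
#  [1 3]]    -> FP = 1, TN = 3
```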

Confusion Matrix for Multi-Class Classification

 

For multi-class classification, the confusion matrix expands based on the number of classes.

 

                | Predicted Cat | Predicted Dog | Predicted Horse
Actual Cat      | TP            | FN            | FN
Actual Dog      | FN            | TP            | FN
Actual Horse    | FN            | FN            | TP

Each diagonal cell is a true positive for its class. From a one-vs-all perspective, every off-diagonal cell is a false negative for the actual (row) class and, at the same time, a false positive for the predicted (column) class.

 Example with Numbers

 

                | Predicted Cat | Predicted Dog | Predicted Horse
Actual Cat      | 8             | 1             | 1
Actual Dog      | 2             | 10            | 0
Actual Horse    | 0             | 2             | 8

Accuracy

 

Accuracy score is a metric used to evaluate the performance of a classification model. It measures the ratio of correctly predicted instances to the total number of instances in a dataset. The formula for accuracy is:

                                      Accuracy = (TP + TN) / (TP + TN + FP + FN)

where :

  • TP (True Positive): Correctly predicted positive instances.

  • TN (True Negative): Correctly predicted negative instances.

  • FP (False Positive): Incorrectly predicted positive instances.

  • FN (False Negative): Incorrectly predicted negative instances.

Accuracy provides a straightforward measure of overall model performance but may not always be the best metric, especially for imbalanced datasets.


Accuracy in Binary Classification

 

Binary classification deals with two classes, such as spam vs. not spam in email classification.

Example: Spam Email Classification

Consider a dataset of 1000 emails where:

  • 700 are not spam (negative class)

  • 300 are spam (positive class)

A model makes the following predictions:

  • TP = 250 (Correctly classified spam emails)

  • TN = 600 (Correctly classified non-spam emails)

  • FP = 100 (Non-spam emails incorrectly classified as spam)

  • FN = 50 (Spam emails incorrectly classified as non-spam)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

                 = (250 + 600) / (250 + 600 + 100 + 50)

                 = 850 / 1000

                 = 0.85 (or 85%)
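
A short sketch that plugs these counts into the formula (plain Python, nothing assumed beyond the numbers above):

```python
# Counts from the spam example above
TP, TN, FP, FN = 250, 600, 100, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.85
```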

Accuracy is useful when both classes are balanced, but in highly imbalanced datasets, other metrics like precision, recall, or F1-score may be more informative.

 

Accuracy in Multi-Class Classification

 

Multi-class classification involves more than two classes, such as classifying handwritten digits (0-9) or categorizing news articles into multiple topics.

Example: Handwritten Digit Recognition (0-9)

A model is trained to classify images of handwritten digits (0-9). The dataset has 10,000 test images evenly distributed across the 10 classes. If the model correctly classifies 9,000 images, the accuracy is:

Accuracy = 9,000 / 10,000 = 0.90 (or 90%)

In multi-class classification, accuracy is calculated in the same way but may not reflect class-wise performance. For example, a model may perform well on some classes but poorly on others. In such cases, class-wise precision, recall, and the confusion matrix provide deeper insights.

 

Limitations of Accuracy in Multiclass Scenarios

 

When dealing with imbalanced datasets (where some classes have significantly more samples than others), accuracy can be misleading. A model might perform well on the majority class while performing poorly on the minority classes, yet still show a high overall accuracy.


Alternative Metrics for Multiclass Classification

 

To get a more comprehensive evaluation of your model's performance, consider these metrics:

  • Precision (per class): Measures the proportion of true positives among all predicted positives for a class.

  • Recall (per class): Measures the proportion of true positives detected out of all actual instances of that class.

  • F1-Score: Harmonic mean of precision and recall, offering a balance between the two.

  • Macro Average: Calculates metrics independently for each class and then takes the average, treating all classes equally.

  • Weighted Average: Similar to macro but accounts for class imbalance by weighting each class’s contribution by its support (number of true instances).
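
In scikit-learn, all of these come out of a single call to classification_report. The sketch below uses a tiny, hypothetical label set just to show the shape of the output:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels for a three-class problem
y_true = ["cat", "cat", "cat", "dog", "dog", "horse", "horse", "horse"]
y_pred = ["cat", "dog", "cat", "dog", "dog", "horse", "cat", "horse"]

# Prints per-class precision, recall, F1, and support,
# plus the macro and weighted averages discussed above.
print(classification_report(y_true, y_pred, digits=3))
```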

 

What Is a Good Accuracy Score?

 

The definition of a "good" accuracy score depends on the problem you are trying to solve and the real-world consequences of errors made by the model.


Context-Dependent Accuracy Requirements

 

  1. High-Stakes Applications (Critical Systems)

    • In cases where even a small error can have severe consequences, such as self-driving cars, the model must have near-perfect accuracy (close to 100%).

    • For example, if a self-driving car has an accuracy of 97%, it might still misinterpret important situations, potentially leading to accidents. Hence, deploying such a model would be unsafe.


  2. Low-Stakes Applications (Non-Critical Systems)


    • In scenarios where minor errors are acceptable and won’t cause significant harm, a lower accuracy score can be sufficient.

    • For instance, a food delivery app like Swiggy that predicts whether a customer will order food can be deployed even with an accuracy of 80%. A wrong prediction here won’t have serious consequences—it might only result in a missed sales opportunity.


Key Takeaway

 

The acceptable level of accuracy depends on:

  • The importance of the decision (life-critical vs. business optimization)

  • The cost of false positives and false negatives

  • The balance between accuracy and other metrics like precision, recall, or F1-score, especially for imbalanced datasets

In short:

  • Critical systems demand very high accuracy (near 100%)

  • Non-critical systems can perform well with moderate accuracy (70-90%), depending on business needs.


What Is the Problem with Accuracy?

 

While accuracy is a commonly used metric to evaluate machine learning models, it has significant limitations—especially when it doesn't reveal the type of mistakes a model is making.

 

Why Is This a Problem?

 

Accuracy simply measures the proportion of correct predictions out of the total predictions made. However, it doesn’t indicate what kind of errors the model is making—whether they are false positives or false negatives—and this distinction can be critical depending on the problem.


Example Scenarios

 

  1. Self-Driving Car (Critical System)

    • A model with 97% accuracy might still misclassify stop signs as speed limit signs just 3% of the time, which could lead to severe accidents.

    • Here, false negatives (failing to detect a stop sign) are much more dangerous than false positives.


  2. Swiggy Customer Prediction (Non-Critical System)

    • A model predicting whether a customer will order food might have 80% accuracy.

    • If most of the mistakes are false positives (predicting the customer will order when they won’t), the business might waste resources on unnecessary marketing campaigns.

    • If the mistakes are false negatives (predicting the customer won’t order when they would), the company might lose potential sales.


The Core Issue

Accuracy fails to answer:

  • What kind of errors is the model making?

  • Are these errors equally harmful, or does one type of mistake cost more than the other?

This is why, especially in imbalanced datasets or high-risk applications, relying solely on accuracy can be misleading. Metrics like precision, recall, and the confusion matrix provide deeper insights into the types of errors being made.


Precision

 

Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It answers the question:

"Of all the positive predictions made by the model, how many were actually correct?"

The formula for Precision is:

                                      Precision =  TP / (TP + FP)

Where:

  • True Positives (TP): Correctly predicted positive cases

  • False Positives (FP): Incorrectly predicted positive cases (actually negative)

A high precision indicates that the model makes fewer false positive errors.

Importance of Precision

  • Useful when False Positives are costly

    • Example: In spam detection, low precision means many non-spam emails are incorrectly classified as spam, which is undesirable.

  • Inverse relationship with Recall

    • Increasing precision often decreases recall and vice versa (Precision-Recall tradeoff).


Calculating Precision for Binary Classification

 

Consider a spam detection model with the following confusion matrix:

Actual / Predicted | Spam (1) | Not Spam (0)
Spam (1)           | 80 (TP)  | 20 (FN)
Not Spam (0)       | 10 (FP)  | 90 (TN)

 

Step 1: Calculate Precision

 

Precision = TP / (TP + FP)

                 = 80 / (80 + 10)

                 = 80 / 90

                 = 0.89 (or 89%)

Interpretation: Out of all emails classified as spam, 89% were actually spam.
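
The same number can be reproduced with scikit-learn's precision_score. In this sketch, the label arrays are rebuilt to match the confusion matrix above (80 TP, 20 FN, 10 FP, 90 TN):

```python
import numpy as np
from sklearn.metrics import precision_score

# 100 actual spam emails followed by 100 actual non-spam emails (1 = spam)
y_true = np.array([1] * 100 + [0] * 100)
# Spam rows: 80 caught, 20 missed; non-spam rows: 10 false alarms, 90 correct
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 10 + [0] * 90)

print(precision_score(y_true, y_pred))  # 0.888... ≈ 0.89
```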

 

Calculating Precision for Multi-Class Classification

 

For multi-class classification, precision is calculated per class using a one-vs-all approach.

Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits) with the following confusion matrix:

 

Actual / Predicted | Cat | Dog | Rabbit
Cat                | 40  | 5   | 5
Dog                | 10  | 35  | 5
Rabbit             | 5   | 5   | 40

 

Step 1: Calculate Precision per Class

 

For Cat

        Precision = TP / (TP + FP)

                         = 40 / (40 + 15)

                         = 40 / 55

                         = 0.727 (or 72.7%)

For Dog

        Precision = TP / (TP + FP)

                         = 35 / (35 + 10)

                         = 35 / 45

                         = 0.778 (or 77.8%)

For Rabbit

        Precision = TP / (TP + FP)

                         = 40 / (40 + 10)

                         = 40 / 50

                         = 0.80 (or 80%)

 

Step 2: Compute Macro and Weighted Precision

 

Macro Precision (Unweighted Average) = (Precision_Cat + Precision_Dog + Precision_Rabbit) / 3

                                                                       = (0.727 + 0.778 + 0.80) / 3

                                                                       = 0.768

Weighted Precision (Weighted by Class Support) = [(0.727 x 50) + (0.778 x 50) + (0.80 x 50)] / 150

                                                                                 = (36.35 + 38.90 + 40.00) / 150

                                                                                 = 115.25 / 150

                                                                                 = 0.768

Here the weights are the class supports (the number of actual instances of each class, which is 50 for Cat, Dog, and Rabbit). Because the classes are perfectly balanced, the weighted average coincides with the macro average.
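
A sketch of the same calculation with scikit-learn, where the label arrays are reconstructed from the confusion matrix above; note that precision_score with average="weighted" weights each class by its support (50 actual instances per class here):

```python
import numpy as np
from sklearn.metrics import precision_score

classes = np.array(["Cat", "Dog", "Rabbit"])
cm = np.array([[40,  5,  5],     # actual Cat
               [10, 35,  5],     # actual Dog
               [ 5,  5, 40]])    # actual Rabbit

# Expand the confusion matrix back into per-sample labels
y_true = np.repeat(classes, cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(classes, row) for row in cm])

print(precision_score(y_true, y_pred, average=None, labels=classes))  # [0.727 0.778 0.8]
print(precision_score(y_true, y_pred, average="macro"))               # ≈ 0.768
print(precision_score(y_true, y_pred, average="weighted"))            # ≈ 0.768
```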


Recall

 

Recall, also known as Sensitivity or True Positive Rate (TPR), measures the ability of a classification model to correctly identify all relevant instances of a particular class. It is defined as:

Recall = TP / (TP + FN)

Where:

  • True Positives (TP): Correctly predicted positive cases

  • False Negatives (FN): Actual positive cases that were incorrectly classified as negative

A high recall means that very few positive instances were missed by the model.

 

Importance of Recall


  • Crucial in scenarios where missing a positive case is costly (e.g., medical diagnoses, fraud detection).

  • Inversely related to Precision; improving recall often decreases precision and vice versa.

  • Used with the Precision-Recall trade-off to find an optimal model performance.

 

Calculating Recall for Binary Classification

 

Consider a binary classification problem for spam detection where:

  • Positive Class (1) → Spam Email

  • Negative Class (0) → Not Spam Email

Actual / Predicted | Spam (1) | Not Spam (0)
Spam (1)           | 80 (TP)  | 20 (FN)
Not Spam (0)       | 10 (FP)  | 90 (TN)

Using the formula :

 

Recall = TP/(TP+FN)

           = 80/(80+20)

           = 80/100

           = 0.80 (or 80%)


Interpretation: The model correctly identifies 80% of spam emails but misses 20% of them.
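
A sketch with scikit-learn's recall_score, reusing the same style of reconstructed labels (80 TP, 20 FN, 10 FP, 90 TN):

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1] * 100 + [0] * 100)                       # 100 spam, 100 not spam
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 10 + [0] * 90)   # matches the confusion matrix

print(recall_score(y_true, y_pred))  # 0.8
```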

 

Calculating Recall for Multi-Class Classification

 

For multi-class classification, recall is calculated per class using a one-vs-all approach.

Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits). The confusion matrix is:

Actual / Predicted | Cat | Dog | Rabbit
Cat                | 40  | 5   | 5
Dog                | 10  | 35  | 5
Rabbit             | 5   | 5   | 40

 

Recall Calculation per Class


  • Recall for Cat

        Recall = TP/(TP+FN)

                   = 40/(40+5+5)

                   = 40/50

                   = 0.80 (or 80%)

     

  • Recall for Dog

        Recall = TP/(TP+FN)

                   = 35/(35+10+5)

                   = 35/50

                   = 0.70 (or 70%)

  • Recall for Rabbit

        Recall = TP/(TP+FN)

                   = 40/(40+5+5)

                   = 40/50

                   = 0.80 (or 80%)

 

Macro and Weighted Recall


  1. Macro Recall (Unweighted Average):

    Macro Recall = (Recall_Cat + Recall_Dog + Recall_Rabbit) / 3

                        = (0.80 + 0.70 + 0.80) / 3

                        = 0.77

  2. Weighted Recall (Weighted by class size):

  Weighted Recall = [(0.80 x 50) + (0.70 x 50) + (0.80 x 50)] / 150

                             = 115 / 150

                             = 0.77
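
The same per-class, macro, and weighted recall values can be checked with scikit-learn (a sketch, reconstructing labels from the confusion matrix above):

```python
import numpy as np
from sklearn.metrics import recall_score

classes = np.array(["Cat", "Dog", "Rabbit"])
cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])
y_true = np.repeat(classes, cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(classes, row) for row in cm])

print(recall_score(y_true, y_pred, average=None, labels=classes))  # [0.8 0.7 0.8]
print(recall_score(y_true, y_pred, average="macro"))               # ≈ 0.77
print(recall_score(y_true, y_pred, average="weighted"))            # ≈ 0.77
```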


F1 Score

 

The F1 Score is the harmonic mean of Precision and Recall. It is a measure of a model’s accuracy that balances both Precision and Recall, especially when the dataset is imbalanced.

 

The formula for F1 Score is:

 

F1 Score = 2 x (Precision x Recall) / (Precision + Recall)

 

Where:

  • Precision = TP / (TP + FP) (How many predicted positives are actually positive?)

  • Recall = TP / (TP + FN) (How many actual positives were correctly predicted?)

 

The F1 Score ranges between 0 and 1, where:

  • 1 means perfect precision and recall.

  • 0 means the model failed completely.

 

Importance of F1 Score

  • Handles Class Imbalance: F1 Score is preferred over Accuracy when classes are imbalanced.

  • Balances Precision and Recall: Useful when one metric is more important than the other.

  • Prevents Misleading Interpretations: High accuracy in imbalanced datasets can be misleading; F1 Score gives a more balanced view.

 

Calculating F1 Score for Binary Classification

 

Consider a spam detection model with the following confusion matrix:

Actual / Predicted | Spam (1) | Not Spam (0)
Spam (1)           | 80 (TP)  | 20 (FN)
Not Spam (0)       | 10 (FP)  | 90 (TN)

 

Step 1: Calculate Precision and Recall

 

Precision = TP / (TP + FP) = 80 / (80 + 10) = 80/90 = 0.89

Recall = TP / (TP + FN) = 80/ (80 + 20) = 80/100 = 0.80

 

Step 2: Calculate F1 Score

 

F1 Score = 2 x (Precision x Recall) / (Precision + Recall)

                = 2 x (0.89 x 0.80) / (0.89 + 0.80)

                = 2 x (0.712) / (1.69)

                = 1.424 / 1.69

                = 0.84 (or 84%)

 

Interpretation: With an F1 Score of 0.84 (84%), the model maintains a strong balance between precision and recall.
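
A sketch with scikit-learn's f1_score on the same reconstructed labels (using the unrounded precision of 80/90, which gives ≈ 0.842 rather than the rounded 0.84 above):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1] * 100 + [0] * 100)                       # 100 spam, 100 not spam
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 10 + [0] * 90)   # matches the confusion matrix

print(f1_score(y_true, y_pred))  # ≈ 0.842
```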

 

Calculating F1 Score for Multi-Class Classification

 

For multi-class classification, the F1 Score is calculated per class using a one-vs-all approach and then averaged using Macro F1 or Weighted F1.

Consider a three-class classification problem (e.g., classifying animals as Cats, Dogs, or Rabbits) with the following confusion matrix:

 

Actual / Predicted | Cat | Dog | Rabbit
Cat                | 40  | 5   | 5
Dog                | 10  | 35  | 5
Rabbit             | 5   | 5   | 40

 

Step 1: Calculate Precision, Recall, and F1 per Class

 

For Cat

   Precision = TP / (TP + FP) = 40 / (40 + 15) = 40 / 55 = 0.727

   Recall = TP / (TP + FN) = 40 / (40 + 5 + 5) = 40/50 = 0.80

   F1 Score = 2 x (0.727 x 0.80) / (0.727 + 0.80) = 0.761

 

For Dog

   Precision = TP / (TP + FP) = 35 / (35 + 10) = 35 / 45 = 0.778

   Recall = TP / (TP + FN) = 35 / (35 + 10 + 5) = 35/50 = 0.70

   F1 Score = 2 x (0.778 x 0.70) / (0.778 + 0.70) = 0.737

 

For Rabbit

   Precision = TP / (TP + FP) = 40 / (40 + 10) = 40 / 50 = 0.80

   Recall = TP / (TP + FN) = 40 / (40 + 5 + 5) = 40/50 = 0.80

   F1 Score = 2 x (0.80 x 0.80) / (0.80 + 0.80) = 0.80

 

Step 2: Compute Macro and Weighted F1 Score

Macro F1 Score (Unweighted Average) = (F1_Cat + F1_Dog + F1_Rabbit) / 3

                                                                       = (0.761 + 0.737 + 0.80) / 3

                                                                       = 0.766

Weighted F1 Score (Weighted by Class Size) = [(0.761 x 50) + (0.737 x 50) + (0.80 x 50)] / 150

                                                                               = (38.05 + 36.85 + 40) / 150

                                                                               = 114.9 / 150

                                                                               = 0.766
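
Finally, the per-class, macro, and weighted F1 scores can be verified with scikit-learn (a sketch, again reconstructing labels from the confusion matrix):

```python
import numpy as np
from sklearn.metrics import f1_score

classes = np.array(["Cat", "Dog", "Rabbit"])
cm = np.array([[40,  5,  5],
               [10, 35,  5],
               [ 5,  5, 40]])
y_true = np.repeat(classes, cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(classes, row) for row in cm])

print(f1_score(y_true, y_pred, average=None, labels=classes))  # [≈0.762 ≈0.737 0.8]
print(f1_score(y_true, y_pred, average="macro"))               # ≈ 0.766
print(f1_score(y_true, y_pred, average="weighted"))            # ≈ 0.766
```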

