
Loss Functions in Deep Learning: A Complete Guide to MSE, MAE, Cross-Entropy & More

Writer: Aryan

What is a Loss Function?

 

A loss function is a method of evaluating how well your algorithm is modeling your dataset.

It helps measure the performance of a model — how far the predicted outputs are from the actual values.

If the loss value is high, it means the algorithm is performing poorly.

If it’s low, it means the algorithm is doing well.

Mathematically, the loss function depends on the parameters of the model — and whenever we change those parameters, the value of the loss also changes.


Example: Linear Regression

 

In Linear Regression, the line equation is:

y = mx + b

Here, m (slope) and b (intercept) are the parameters of our model.

We use a loss function such as Mean Squared Error (MSE):

L(m, b) = (1/n) · Σᵢ (yᵢ − (m·xᵢ + b))²

This loss is a function of m and b.

As we adjust their values, the loss changes.

Our goal is to find those values of m and b for which the loss L(m,b) is minimum.

That means our model’s predictions are as close as possible to the real data points.
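To make this concrete, here is a small NumPy sketch (the x and y points are made up purely for illustration) showing that the MSE loss is just a function of m and b, and that better parameter guesses give a smaller loss:

```python
import numpy as np

# Toy data points (hypothetical, roughly following y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse_loss(m, b):
    """L(m, b): mean squared error of the line y = m*x + b on the toy data."""
    y_pred = m * x + b
    return np.mean((y - y_pred) ** 2)

# The loss changes as the parameters change
print(mse_loss(1.0, 0.0))  # a poor guess -> larger loss (~7.97)
print(mse_loss(2.0, 0.0))  # closer to the data -> smaller loss (~0.02)
```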


Why is the Loss Function Important?

 

There’s a simple rule in machine learning:

“You can’t improve what you can’t measure.”

This perfectly explains why the loss function is essential.

At the start, the model begins with random parameters (random points or lines).

We calculate the loss (error) to see how wrong the model is.

Then we tweak the parameters to reduce this loss, and we keep repeating this process until we find the minimum loss — that’s when the model performs best.

In short, the loss function acts as the eyes of the algorithm — it guides the learning process toward the best possible parameters so that the model makes fewer mistakes.

Different problems use different types of loss functions depending on the task (e.g., regression, classification, etc.).


Loss Function in Deep Learning

 

We have the following dataset of students containing CGPA, IQ, and package offered (in LPA):

| CGPA | IQ | Package (LPA) |
| --- | --- | --- |
| 7.1 | 83 | 3.2 |
| 8.5 | 91 | 4.5 |
| 6.3 | 102 | 6.1 |
| 5.1 | 87 | 2.7 |
| ... | ... | ... |

 Understanding How Loss Works

Let’s take one student’s data and feed it into a neural network.

At the beginning, our network starts with random values of weights and biases.

We perform forward propagation and predict an output — say, the model predicts:

ŷ = 3.7

while the actual value is:

y = 3.2

Now, we compute the loss using the Mean Squared Error (MSE) function:

L = (y − ŷ)² = (3.2 − 3.7)² = 0.25

This number (0.25) tells us how wrong our model’s prediction was.

Based on this loss value, we perform backpropagation —

we go back and adjust the weights and biases using Gradient Descent to minimize the loss.

We then take the next student’s data, predict again, compute the loss, and update the weights and biases once more.

This process continues for all students across multiple epochs.

The goal is simple:

Minimize L ⇒ Find the best weights and biases.

When the loss reaches its minimum, the corresponding weights and biases are considered optimal.

So, in deep learning — just like in machine learning — minimizing the loss function is a key objective.
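As a rough illustration of this loop, here is a minimal PyTorch sketch using the four rows from the table above; the network size, learning rate, and epoch count are arbitrary choices for illustration, not a prescription:

```python
import torch
import torch.nn as nn

# CGPA and IQ as inputs, package (LPA) as the target, taken from the table above
X = torch.tensor([[7.1, 83.0], [8.5, 91.0], [6.3, 102.0], [5.1, 87.0]])
y = torch.tensor([[3.2], [4.5], [6.1], [2.7]])

# A tiny arbitrary network: 2 inputs -> 4 hidden units -> 1 output
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(100):              # multiple passes over the data
    for x_i, y_i in zip(X, y):        # one student at a time
        y_hat = model(x_i)            # forward propagation
        loss = loss_fn(y_hat, y_i)    # how wrong is this prediction?
        optimizer.zero_grad()
        loss.backward()               # backpropagation
        optimizer.step()              # gradient descent update
```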


Common Loss Functions in Deep Learning

Different problems require different loss functions.

Here’s a quick overview:

| Task Type | Common Loss Functions | Use Case |
| --- | --- | --- |
| Regression | Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber Loss | Used to measure prediction error between actual and predicted continuous values. |
| Binary Classification | Binary Cross-Entropy | Used when predicting two possible outcomes (e.g., yes/no, 0/1). |
| Multiclass Classification | Categorical Cross-Entropy, Hinge Loss | Used for multi-class problems where the model predicts one class out of many. |
| Autoencoders | Kullback–Leibler Divergence (KL Divergence) | Measures the difference between two probability distributions (input vs. reconstructed). |
| GANs | Discriminator Loss, Minimax Loss | Used to train generator–discriminator pairs in Generative Adversarial Networks. |
| Object Detection | Focal Loss | Helps handle class imbalance by focusing more on hard-to-classify examples. |
| Embeddings | Triplet Loss | Encourages similar samples to be close and dissimilar samples to be far in embedding space. |

Custom Loss Functions

Depending on the problem, we can also design our own loss functions.

For example, in frameworks like Keras or PyTorch, we can easily implement custom losses to suit a specific goal.
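For instance, a custom loss in Keras can be just a Python function of y_true and y_pred. The sketch below is a hypothetical example (the name weighted_mse and the 2x weighting are made up for illustration); PyTorch accepts plain functions or nn.Module subclasses in the same spirit:

```python
import tensorflow as tf

def weighted_mse(y_true, y_pred):
    # Hypothetical custom loss: penalize under-prediction twice as heavily as over-prediction
    error = y_true - y_pred
    weight = tf.where(error > 0, 2.0, 1.0)
    return tf.reduce_mean(weight * tf.square(error))

# Any Keras model can then use it directly:
# model.compile(optimizer="adam", loss=weighted_mse)
```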

The choice of loss function defines how your model learns —

choosing the right one is crucial for achieving the best performance.

 

Loss Function vs Cost Function — They Are Different

Let’s take a neural network example with the following dataset:

| CGPA | IQ | Package (yᵢ) | Prediction (ŷᵢ) |
| --- | --- | --- | --- |
| 6.3 | 100 | 6.3 | 6.1 |
| 7.1 | 91 | 4.1 | 4.0 |
| 8.5 | 83 | 3.5 | 3.7 |
| 9.2 | 102 | 7.2 | 7.0 |

 

Understanding the Difference

We have to predict the package (in LPA) for these four students using a neural network.

The loss function is calculated for a single training example.

For instance, if we take the first student, the true and predicted values are:

y₁ = 6.3, ŷ₁ = 6.1

Then the loss for this example (using Mean Squared Error) is:

L = (y₁ − ŷ₁)² = (6.3 − 6.1)² = 0.04

Similarly, we calculate the loss for every other student individually.


When we compute the average loss over the entire training dataset,

we call it the Cost Function (or Objective Function).

It is defined as:

C = (1/n) · Σᵢ Lᵢ = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

For our dataset:

C = [(6.3 − 6.1)² + (4.1 − 4.0)² + (3.5 − 3.7)² + (7.2 − 7.0)²] / 4 = (0.04 + 0.01 + 0.04 + 0.04) / 4 ≈ 0.0325

Key Takeaway

  • Loss Function → Error for a single training example.

  • Cost Function → Average loss over the entire dataset or batch.

So while they are related, they are not the same —

the loss function gives individual errors, and the cost function provides a global view of model performance.
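A few lines of Python make the distinction explicit for the four students above: each squared difference is a loss, and their average is the cost.

```python
y_true = [6.3, 4.1, 3.5, 7.2]
y_pred = [6.1, 4.0, 3.7, 7.0]

# Loss function: error for each individual student
losses = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
print(losses)               # approximately [0.04, 0.01, 0.04, 0.04]

# Cost function: average loss over the whole batch/dataset
cost = sum(losses) / len(losses)
print(cost)                 # ~0.0325
```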

 

Loss Functions

A loss function helps us measure how well a model is performing by calculating the difference between the actual and predicted values.

The smaller the loss value, the better the model is performing.

 

Mean Squared Error (MSE)


Definition

Mean Squared Error (MSE) — also known as Squared Loss or L2 Loss — calculates the average of the squared differences between the actual and predicted values.

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

Where:

  • yᵢ = actual value

  • ŷᵢ​ = predicted value

  • n = total number of samples


Conceptual Intuition

MSE measures how far predictions are from the true values and penalizes large errors more heavily because of the square term.

When we train a model, it starts with random parameters. Using MSE, we:

  1. Calculate the loss (how wrong the model is).

  2. Use backpropagation to update weights and biases to reduce this loss.

  3. Repeat this process until we reach the minimum loss value.

The goal is to find the parameters (weights and biases) that minimize the MSE — meaning the model’s predictions are as close as possible to the true outputs.
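Here is a minimal NumPy sketch of MSE, using made-up numbers to show how a single large error dominates the average:

```python
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 4.0, 5.0], [3.1, 3.9, 5.2]))   # small errors -> small loss (~0.02)
print(mse([3.0, 4.0, 5.0], [3.1, 3.9, 9.0]))   # one large error -> loss jumps (~5.34)
```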

 

Why Do We Square the Error?

We square the error for two main reasons:

  1. To remove negative signs — otherwise, positive and negative errors would cancel out.

  2. To penalize larger errors more strongly — because squaring amplifies big differences.

For example:

  • Error = 1 → 1² = 1

  • Error = 2 → 2² = 4

  • Error = 4 → 4² = 16

Hence, far-away points (outliers) have a much stronger influence, causing larger weight adjustments, while smaller errors lead to gentler updates.


Mathematical Properties

  • Differentiable: Easy to optimize using gradient descent.

  • Convex: Has a single global minimum → ensures stable convergence.

  • Quadratic: Grows faster with larger errors.

 

Advantages

-> Simple and easy to interpret

-> Differentiable and suitable for gradient-based optimization

-> Ensures a single global minimum due to convexity

Disadvantages

-> Sensitive to outliers (large errors dominate the loss)

-> The error unit gets squared (not directly comparable to target scale)

 

When to Use MSE

  • When outliers are not a major concern

  • When you want to penalize large errors more strongly

  • When the output layer of your neural network uses a linear activation (important in regression models)


Mean Absolute Error (MAE)

 

Definition

Mean Absolute Error (MAE) — also known as L1 Loss — calculates the average of the absolute differences between actual and predicted values.

MAE = (1/n) · Σᵢ |yᵢ − ŷᵢ|

Here, instead of squaring, we take the absolute value of the difference.

 

Conceptual Intuition

MAE measures how far predictions are from the actual values, treating all errors equally — whether small or large.

Unlike MSE, it does not over-penalize large errors, which makes it more robust to outliers.

However, because the absolute function is not differentiable at zero, optimization becomes slightly trickier (gradient is undefined exactly at 0), and learning can be slower.
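For comparison, here is the same kind of sketch for MAE, with the same made-up numbers used in the MSE section; the outlier now contributes only linearly:

```python
import numpy as np

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

print(mae([3.0, 4.0, 5.0], [3.1, 3.9, 5.2]))   # ~0.13
print(mae([3.0, 4.0, 5.0], [3.1, 3.9, 9.0]))   # ~1.40 (the outlier is counted only linearly)
```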

 

Advantages

-> Robust to outliers (less affected by extreme values)

-> Error scale remains the same as the original target variable

Disadvantages

-> Not differentiable at zero — can make gradient-based optimization slower

-> Converges slower compared to MSE

 

When to Use MAE

  • When outliers are present in data

  • When you want equal importance for all errors

  • For tasks where robustness is preferred over sensitivity


Huber Loss

 

Definition

The Huber Loss is a hybrid between Mean Squared Error (MSE) and Mean Absolute Error (MAE).

It behaves like MSE when the error is small and like MAE when the error is large (to reduce sensitivity to outliers).

L_δ(y, ŷ) = ½ · (y − ŷ)²            if |y − ŷ| ≤ δ
L_δ(y, ŷ) = δ · |y − ŷ| − ½ · δ²    otherwise

Where:

  • y = actual value

  • ŷᵢ​ = predicted value

  • δ = threshold that decides when to switch from MSE to MAE behavior

 

Conceptual Understanding

  • When the difference between actual and predicted value is small, we use the squared loss part → acts like MSE (smooth and differentiable).

  • When the difference is large, we use the absolute loss part → acts like MAE (less sensitive to outliers).

This makes Huber Loss particularly useful when a meaningful share of your data points are outliers (say, 20–30% of the dataset).

It provides a balance between the stability of MSE and the robustness of MAE.
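A small NumPy sketch of the piecewise definition above; δ = 1.0 is just a common default, not a universally correct choice:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                        # MSE-like part for small errors
    linear = delta * np.abs(error) - 0.5 * delta ** 2  # MAE-like part for large errors
    return np.mean(np.where(small, squared, linear))

# ~1.17: the error of 4.0 contributes 3.5 (linear) instead of 8.0 (squared)
print(huber([3.0, 4.0, 5.0], [3.1, 3.9, 9.0]))
```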

 

Advantages

-> More robust to outliers than MSE

-> Differentiable everywhere (unlike MAE)

-> Combines the benefits of both MSE and MAE

Disadvantages

-> Requires setting a tuning parameter δ (not always easy to choose)

-> Slightly slower to compute than pure MSE

 

When to Use

Use Huber Loss when:

  • Your dataset contains some outliers, but you don’t want to ignore them completely.

  • You want a smooth loss curve for gradient descent (unlike MAE).


Binary Cross-Entropy (Log Loss)

 

Definition

Binary Cross-Entropy (BCE) — also known as Log Loss — is the most common loss function used in binary classification problems (e.g., spam/not spam, 0/1, yes/no).

L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

and for all n samples:

L = −(1/n) · Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]

Where:

  • yᵢ​​ = true label (0 or 1)

  • ŷᵢ​ = predicted probability (between 0 and 1)

 

Conceptual Intuition

Binary Cross-Entropy measures how close the predicted probabilities are to the actual class labels.

  • If the true label is 1, only the first term −y log(ŷ) matters.

  • If the true label is 0, only the second term −(1−y)log(1−ŷ) matters.

It heavily penalizes wrong confident predictions — for example, predicting 0.99 when the true label is 0 gives a very high loss.
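A NumPy sketch of the per-sample BCE formula; the small epsilon clip is only there to avoid log(0), something deep learning libraries handle for you internally:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    y_true = np.asarray(y_true, float)
    y_prob = np.clip(np.asarray(y_prob, float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and right -> low loss (~0.11)
print(binary_cross_entropy([0], [0.99]))         # confident but wrong -> very high loss (~4.6)
```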

 

How It Works in a Neural Network

  1. Forward pass: The model outputs raw scores (logits).

  2. Sigmoid activation: Converts logits into probabilities between 0 and 1.

  3. Loss calculation: BCE compares predicted probabilities with actual labels.

  4. Backward pass: The gradient of BCE guides how much each weight and bias should adjust to minimize the loss.


Advantages

-> Works well for binary classification tasks

-> Provides probabilistic interpretation of predictions

-> Penalizes confident but wrong predictions strongly

Disadvantages

-> Requires sigmoid activation in the output layer

-> Can be sensitive to imbalanced datasets (needs proper class weighting)

 

When to Use

  • In binary classification problems (output labels 0 or 1)

  • When the output layer uses a sigmoid activation function

  • Examples: spam detection, disease prediction, sentiment analysis

 

Categorical Cross-Entropy Loss


What is Categorical Cross-Entropy ?

 

Categorical Cross-Entropy (often called CCE) is one of the most widely used loss functions for multi-class classification problems.

In simple terms, it tells us how far off our predicted probabilities are from the actual class label.

You’ll mostly see it used in Softmax regression and deep learning models that predict one label out of several possible classes.

The mathematical formula looks like this:

L = −Σⱼ yⱼ · log(ŷⱼ)   (sum over all k classes)

where:

  • k → total number of classes

  • yⱼ​ → true value (1 for the correct class, 0 for others)

  • ŷⱼ​ → predicted probability for each class

 

How It Works (Step by Step)

When you’re doing multi-class classification, the final layer of your neural network usually has:

  • One neuron for each class.

  • A Softmax activation function, which turns raw model outputs (called logits) into probabilities.

The Softmax function looks like this:

ŷⱼ = e^(zⱼ) / (e^(z₁) + e^(z₂) + … + e^(zₖ)),   where z₁ … zₖ are the logits

Softmax ensures two important things:

  1. Every output is between 0 and 1, and

  2. The total of all outputs equals 1 — so they form a valid probability distribution.

For example, if your model produces logits [2.1, 1.5, 0.3], Softmax converts them to probabilities of roughly [0.58, 0.32, 0.10].

That means the model thinks there is about a 58% chance the input belongs to the first class.
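Here is a short sketch of that conversion (subtracting the maximum logit before exponentiating is a standard numerical-stability trick):

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, float)
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax([2.1, 1.5, 0.3]))   # ~[0.58, 0.32, 0.10], and the three values sum to 1
```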

 

Forward Pass Example

Let’s see how this loss is calculated in practice.

Imagine a classification task with three classes: “yes”, “no”, and “maybe”.

We one-hot encode these labels as:

  • yes → [1, 0, 0]

  • no → [0, 1, 0]

  • maybe → [0, 0, 1]

Suppose the model predicts: [0.2, 0.3, 0.5]

and the true label is [1, 0, 0].

Plugging these into the formula:

L = −[1 · log(0.2) + 0 · log(0.3) + 0 · log(0.5)] = −log(0.2) ≈ 1.61

That’s the loss for this one example.

If the model had predicted a higher probability for the true class (say 0.9 instead of 0.2), the loss would be much smaller.

This is how the model learns — it adjusts its parameters during training to reduce this loss over time.
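The same calculation in a few lines of NumPy, with a second call showing how the loss shrinks when the true class receives a higher probability:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    y_true = np.asarray(y_true, float)
    y_prob = np.clip(np.asarray(y_prob, float), eps, 1.0)
    return -np.sum(y_true * np.log(y_prob))

print(categorical_cross_entropy([1, 0, 0], [0.2, 0.3, 0.5]))     # -log(0.2) ~ 1.61
print(categorical_cross_entropy([1, 0, 0], [0.9, 0.05, 0.05]))   # -log(0.9) ~ 0.11
```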


Why Cross-Entropy Works So Well

Cross-Entropy measures how close the predicted probability distribution is to the actual one.

If the model assigns a high probability to the correct class, the loss is small.

If it’s unsure or confident about the wrong class, the loss becomes large.

You can think of it as measuring how surprised the model is when it sees the correct label.

Less surprise → smaller loss → better predictions.

 

Sparse Categorical Cross-Entropy

Now, what if your labels aren’t one-hot encoded?

For example, instead of [1, 0, 0], you have labels like 0, 1, or 2.

In that case, we use Sparse Categorical Cross-Entropy (SCCE).

It calculates the same type of loss but skips the one-hot encoding step.


The Formulas

Categorical Cross-Entropy (CCE)

L = −Σⱼ yⱼ · log(ŷⱼ)

Here, labels are one-hot encoded, and only the correct class contributes to the sum.

 

Sparse Categorical Cross-Entropy (SCCE)

L = −log(ŷ_c),   where c is the integer index of the true class

Here, labels are integers (for example, 2 means “class 3”).

This formula directly picks the probability of the correct class and takes the negative log of it.

Both produce the same numerical loss — the difference is only in how the labels are represented.
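In Keras, for example, the choice shows up only in the loss name you pass to compile() and in the format of the labels. The toy model and random data below are placeholders for illustration:

```python
import numpy as np
import tensorflow as tf

# Arbitrary toy model: 4 input features, 3 output classes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

X = np.random.rand(6, 4).astype("float32")

# One-hot labels -> categorical cross-entropy
y_onehot = tf.keras.utils.to_categorical([0, 1, 2, 0, 1, 2], num_classes=3)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, y_onehot, epochs=1, verbose=0)

# Integer labels -> sparse categorical cross-entropy (no one-hot step needed)
y_int = np.array([0, 1, 2, 0, 1, 2])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y_int, epochs=1, verbose=0)
```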

 

Example

Let’s take the following prediction:

[0.1, 0.4, 0.5]

| True Label | Type | Formula | Result |
| --- | --- | --- | --- |
| [1, 0, 0] (CCE) | One-hot | −log(0.1) | 2.30 |
| 0 (SCCE) | Integer | −log(0.1) | 2.30 |
| 1 (SCCE) | Integer | −log(0.4) | 0.92 |

Both losses are identical when referring to the same true class — the difference is only in whether the label is [1, 0, 0] or just 0.

 

Key Differences Between CCE and SCCE

| | Categorical Cross-Entropy (CCE) | Sparse Categorical Cross-Entropy (SCCE) |
| --- | --- | --- |
| Label format | One-hot vectors (e.g., [1, 0, 0]) | Integer class indices (e.g., 0, 1, 2) |
| Per-sample formula | L = −Σⱼ yⱼ · log(ŷⱼ) | L = −log(ŷ_c) |
| Preprocessing | Requires one-hot encoding of labels | No encoding step (lighter when there are many classes) |

Both measure how much “penalty” to apply when the model predicts wrong.

SCCE just skips the one-hot encoding part and directly uses the label index.

 

Mathematical Form (Across a Dataset)

When calculated over the entire dataset:

L = −(1/n) · Σᵢ Σⱼ yᵢⱼ · log(ŷᵢⱼ)

For Sparse CCE, this simplifies to:

L = −(1/n) · Σᵢ log(ŷᵢ[cᵢ]),   where cᵢ is the true class index of sample i

Key Notes to Remember

  • Categorical Cross-Entropy → For multi-class problems with one-hot encoded labels.

  • Sparse Categorical Cross-Entropy → For integer-encoded labels (saves time and memory).

  • Binary Cross-Entropy → For binary classification.

Use SCCE when:

  • You have a large number of classes, and

  • You want to avoid converting labels into one-hot form.

 

Common Loss Function Guide

| Task Type | Recommended Loss Function |
| --- | --- |
| Regression (clean data) | Mean Squared Error (MSE) |
| Regression (with outliers) | Mean Absolute Error (MAE) |
| Binary Classification | Binary Cross-Entropy |
| Multi-class (few classes) | Categorical Cross-Entropy |
| Multi-class (many classes) | Sparse Categorical Cross-Entropy |
| Regression with some outliers (middle ground between MSE and MAE) | Huber Loss |

  • MSE / MAE → For continuous outputs.

  • Cross-Entropy → For probability-based outputs.

  • Sparse Cross-Entropy → Same idea as cross-entropy, just simpler integer labels.

 

You can think of Cross-Entropy Loss as a measure of how surprised the model is when it learns what the correct label was.

  • If it was already confident and right → little surprise → low loss.

  • If it was confident but wrong → big surprise → high loss.

During training, your model keeps adjusting itself to minimize this surprise —

which means it’s getting better at being confident and correct at the same time.

 

Both CCE and SCCE help your model learn to trust the right answer more strongly.

They’re just two different ways of writing the same idea — one for one-hot labels and one for integer labels.



