
Loss Functions in Deep Learning: A Complete Guide to MSE, MAE, Cross-Entropy & More

Writer: Aryan

What is a Loss Function?

 

A loss function is a method of evaluating how well your algorithm is modeling your dataset.

It helps measure the performance of a model — how far the predicted outputs are from the actual values.

If the loss value is high, it means the algorithm is performing poorly.

If it’s low, it means the algorithm is doing well.

Mathematically, the loss function depends on the parameters of the model — and whenever we change those parameters, the value of the loss also changes.


Example: Linear Regression

 

In Linear Regression, the line equation is:

y = mx + b

Here, m (slope) and b (intercept) are the parameters of our model.

We use a loss function such as Mean Squared Error (MSE):

L(m, b) = (1/n) · Σᵢ (yᵢ − (m·xᵢ + b))²

This loss is a function of m and b.

As we adjust their values, the loss changes.

Our goal is to find those values of m and b for which the loss L(m,b) is minimum.

That means our model’s predictions are as close as possible to the real data points.
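To make this concrete, here is a small NumPy sketch (the x and y points are made up purely for illustration) showing that the MSE loss is just a function of m and b, and that better parameter guesses give a smaller loss:

```python
import numpy as np

# Toy data points (hypothetical, roughly following y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse_loss(m, b):
    """L(m, b): mean squared error of the line y = m*x + b on the toy data."""
    y_pred = m * x + b
    return np.mean((y - y_pred) ** 2)

# The loss changes as the parameters change
print(mse_loss(1.0, 0.0))  # a poor guess -> larger loss (~7.97)
print(mse_loss(2.0, 0.0))  # closer to the data -> smaller loss (~0.02)
```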


Why is the Loss Function Important?

 

There’s a simple rule in machine learning:

“You can’t improve what you can’t measure.”

This perfectly explains why the loss function is essential.

At the start, the model begins with random parameters (random points or lines).

We calculate the loss (error) to see how wrong the model is.

Then we tweak the parameters to reduce this loss, and we keep repeating this process until we find the minimum loss — that’s when the model performs best.

In short, the loss function acts as the eyes of the algorithm — it guides the learning process toward the best possible parameters so that the model makes fewer mistakes.

Different problems use different types of loss functions depending on the task (e.g., regression, classification, etc.).


Loss Function in Deep Learning

 

We have the following dataset of students containing CGPA, IQ, and package offered (in LPA):

| CGPA | IQ | Package (LPA) |
| --- | --- | --- |
| 7.1 | 83 | 3.2 |
| 8.5 | 91 | 4.5 |
| 6.3 | 102 | 6.1 |
| 5.1 | 87 | 2.7 |
| ... | ... | ... |

 Understanding How Loss Works

Let’s take one student’s data and feed it into a neural network.

At the beginning, our network starts with random values of weights and biases.

We perform forward propagation and predict an output — say, the model predicts:

ŷ = 3.7

while the actual value is:

y = 3.2

Now, we compute the loss using the Mean Squared Error (MSE) function:

L = (y − ŷ)² = (3.2 − 3.7)² = 0.25

This number (0.25) tells us how wrong our model’s prediction was.

Based on this loss value, we perform backpropagation —

we go back and adjust the weights and biases using Gradient Descent to minimize the loss.

We then take the next student’s data, predict again, compute the loss, and update the weights and biases once more.

This process continues for all students across multiple epochs.

The goal is simple:

Minimize L ⇒ Find the best weights and biases.

When the loss reaches its minimum, the corresponding weights and biases are considered optimal.

So, in deep learning — just like in machine learning — minimizing the loss function is a key objective.
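As a rough illustration of this loop, here is a minimal PyTorch sketch using the four rows from the table above; the network size, learning rate, and epoch count are arbitrary choices for illustration, not a prescription:

```python
import torch
import torch.nn as nn

# CGPA and IQ as inputs, package (LPA) as the target, taken from the table above
X = torch.tensor([[7.1, 83.0], [8.5, 91.0], [6.3, 102.0], [5.1, 87.0]])
y = torch.tensor([[3.2], [4.5], [6.1], [2.7]])

# A tiny arbitrary network: 2 inputs -> 4 hidden units -> 1 output
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(100):              # multiple passes over the data
    for x_i, y_i in zip(X, y):        # one student at a time
        y_hat = model(x_i)            # forward propagation
        loss = loss_fn(y_hat, y_i)    # how wrong is this prediction?
        optimizer.zero_grad()
        loss.backward()               # backpropagation
        optimizer.step()              # gradient descent update
```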


Common Loss Functions in Deep Learning

Different problems require different loss functions.

Here’s a quick overview:

| Task Type | Common Loss Functions | Use Case |
| --- | --- | --- |
| Regression | Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber Loss | Used to measure prediction error between actual and predicted continuous values. |
| Binary Classification | Binary Cross-Entropy | Used when predicting two possible outcomes (e.g., yes/no, 0/1). |
| Multiclass Classification | Categorical Cross-Entropy, Hinge Loss | Used for multi-class problems where the model predicts one class out of many. |
| Autoencoders | Kullback–Leibler Divergence (KL Divergence) | Measures the difference between two probability distributions (input vs. reconstructed). |
| GANs | Discriminator Loss, Minimax Loss | Used to train generator–discriminator pairs in Generative Adversarial Networks. |
| Object Detection | Focal Loss | Helps handle class imbalance by focusing more on hard-to-classify examples. |
| Embeddings | Triplet Loss | Encourages similar samples to be close and dissimilar samples to be far in embedding space. |

Custom Loss Functions

Depending on the problem, we can also design our own loss functions.

For example, in frameworks like Keras or PyTorch, we can easily implement custom losses to suit a specific goal.
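For instance, a custom loss in Keras can be just a Python function of y_true and y_pred. The sketch below is a hypothetical example (the name weighted_mse and the 2x weighting are made up for illustration); PyTorch accepts plain functions or nn.Module subclasses in the same spirit:

```python
import tensorflow as tf

def weighted_mse(y_true, y_pred):
    # Hypothetical custom loss: penalize under-prediction twice as heavily as over-prediction
    error = y_true - y_pred
    weight = tf.where(error > 0, 2.0, 1.0)
    return tf.reduce_mean(weight * tf.square(error))

# Any Keras model can then use it directly:
# model.compile(optimizer="adam", loss=weighted_mse)
```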

The choice of loss function defines how your model learns —

choosing the right one is crucial for achieving the best performance.

 

Loss Function vs Cost Function — They Are Different

Let’s take a neural network example with the following dataset:

| CGPA | IQ | Package (yᵢ) | Prediction (ŷᵢ) |
| --- | --- | --- | --- |
| 6.3 | 100 | 6.3 | 6.1 |
| 7.1 | 91 | 4.1 | 4.0 |
| 8.5 | 83 | 3.5 | 3.7 |
| 9.2 | 102 | 7.2 | 7.0 |

 

Understanding the Difference

We have to predict the package (in LPA) for these four students using a neural network.

The loss function is calculated for a single training example.

For instance, if we take the first student, the true and predicted values are:

y₁ = 6.3, ŷ₁ = 6.1

Then the loss for this example (using Mean Squared Error) is:

L = (y₁ − ŷ₁)² = (6.3 − 6.1)² = 0.04

Similarly, we calculate the loss for every other student individually.


When we compute the average loss over the entire training dataset,

we call it the Cost Function (or Objective Function).

It is defined as:

C = (1/n) · Σᵢ Lᵢ = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

For our dataset:

C = [(6.3 − 6.1)² + (4.1 − 4.0)² + (3.5 − 3.7)² + (7.2 − 7.0)²] / 4 = (0.04 + 0.01 + 0.04 + 0.04) / 4 ≈ 0.0325

Key Takeaway

  • Loss Function → Error for a single training example.

  • Cost Function → Average loss over the entire dataset or batch.

So while they are related, they are not the same —

the loss function gives individual errors, and the cost function provides a global view of model performance.
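A few lines of Python make the distinction explicit for the four students above: each squared difference is a loss, and their average is the cost.

```python
y_true = [6.3, 4.1, 3.5, 7.2]
y_pred = [6.1, 4.0, 3.7, 7.0]

# Loss function: error for each individual student
losses = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
print(losses)               # approximately [0.04, 0.01, 0.04, 0.04]

# Cost function: average loss over the whole batch/dataset
cost = sum(losses) / len(losses)
print(cost)                 # ~0.0325
```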

 

Loss Functions

A loss function helps us measure how well a model is performing by calculating the difference between the actual and predicted values.

The smaller the loss value, the better the model is performing.

 

Mean Squared Error (MSE)


Definition

Mean Squared Error (MSE) — also known as Squared Loss or L2 Loss — calculates the average of the squared differences between the actual and predicted values.

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

Where:

  • yᵢ = actual value

  • ŷᵢ​ = predicted value

  • n = total number of samples


Conceptual Intuition

MSE measures how far predictions are from the true values and penalizes large errors more heavily because of the square term.

When we train a model, it starts with random parameters. Using MSE, we:

  1. Calculate the loss (how wrong the model is).

  2. Use backpropagation to update weights and biases to reduce this loss.

  3. Repeat this process until we reach the minimum loss value.

The goal is to find the parameters (weights and biases) that minimize the MSE — meaning the model’s predictions are as close as possible to the true outputs.
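Here is a minimal NumPy sketch of MSE, using made-up numbers to show how a single large error dominates the average:

```python
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 4.0, 5.0], [3.1, 3.9, 5.2]))   # small errors -> small loss (~0.02)
print(mse([3.0, 4.0, 5.0], [3.1, 3.9, 9.0]))   # one large error -> loss jumps (~5.34)
```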

 

Why Do We Square the Error?

We square the error for two main reasons:

  1. To remove negative signs — otherwise, positive and negative errors would cancel out.

  2. To penalize larger errors more strongly — because squaring amplifies big differences.

For example:

  • Error = 1 → 1² = 1

  • Error = 2 → 2² = 4

  • Error = 4 → 4² = 16

Hence, far-away points (outliers) have a much stronger influence, causing larger weight adjustments, while smaller errors lead to gentler updates.


Mathematical Properties

  • Differentiable: Easy to optimize using gradient descent.

  • Convex: Has a single global minimum → ensures stable convergence.

  • Quadratic: Grows faster with larger errors.

 

Advantages

-> Simple and easy to interpret

-> Differentiable and suitable for gradient-based optimization

-> Ensures a single global minimum due to convexity

Disadvantages

-> Sensitive to outliers (large errors dominate the loss)

-> The error unit gets squared (not directly comparable to target scale)

 

When to Use MSE

  • When outliers are not a major concern

  • When you want to penalize large errors more strongly

  • When the output layer of your neural network uses a linear activation (important in regression models)


Mean Absolute Error (MAE)

 

Definition

Mean Absolute Error (MAE) — also known as L1 Loss — calculates the average of the absolute differences between actual and predicted values.

MAE = (1/n) · Σᵢ |yᵢ − ŷᵢ|

Here, instead of squaring, we take the absolute value of the difference.

 

Conceptual Intuition

MAE measures how far predictions are from the actual values, treating all errors equally — whether small or large.

Unlike MSE, it does not over-penalize large errors, which makes it more robust to outliers.

However, because the absolute function is not differentiable at zero, optimization becomes slightly trickier (gradient is undefined exactly at 0), and learning can be slower.
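For comparison, here is the same kind of sketch for MAE, with the same made-up numbers used in the MSE section; the outlier now contributes only linearly:

```python
import numpy as np

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

print(mae([3.0, 4.0, 5.0], [3.1, 3.9, 5.2]))   # ~0.13
print(mae([3.0, 4.0, 5.0], [3.1, 3.9, 9.0]))   # ~1.40 (the outlier is counted only linearly)
```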

 

Advantages

-> Robust to outliers (less affected by extreme values)

-> Error scale remains the same as the original target variable

Disadvantages

-> Not differentiable at zero — can make gradient-based optimization slower

-> Converges slower compared to MSE

 

When to Use MAE

  • When outliers are present in data

  • When you want equal importance for all errors

  • For tasks where robustness is preferred over sensitivity


Huber Loss

 

Definition

The Huber Loss is a hybrid between Mean Squared Error (MSE) and Mean Absolute Error (MAE).

It behaves like MSE when the error is small and like MAE when the error is large (to reduce sensitivity to outliers).

L_δ(y, ŷ) = ½ · (y − ŷ)²            if |y − ŷ| ≤ δ
L_δ(y, ŷ) = δ · |y − ŷ| − ½ · δ²    otherwise

Where:

  • y = actual value

  • ŷᵢ​ = predicted value

  • δ = threshold that decides when to switch from MSE to MAE behavior

 

Conceptual Understanding

  • When the difference between actual and predicted value is small, we use the squared loss part → acts like MSE (smooth and differentiable).

  • When the difference is large, we use the absolute loss part → acts like MAE (less sensitive to outliers).

This makes Huber Loss particularly useful when a meaningful share of your data points are outliers (say, 20–30% of the dataset).

It provides a balance between the stability of MSE and the robustness of MAE.
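A small NumPy sketch of the piecewise definition above; δ = 1.0 is just a common default, not a universally correct choice:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                        # MSE-like part for small errors
    linear = delta * np.abs(error) - 0.5 * delta ** 2  # MAE-like part for large errors
    return np.mean(np.where(small, squared, linear))

# ~1.17: the error of 4.0 contributes 3.5 (linear) instead of 8.0 (squared)
print(huber([3.0, 4.0, 5.0], [3.1, 3.9, 9.0]))
```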

 

Advantages

-> More robust to outliers than MSE

-> Differentiable everywhere (unlike MAE)

-> Combines the benefits of both MSE and MAE

Disadvantages

-> Requires setting a tuning parameter δ (not always easy to choose)

-> Slightly slower to compute than pure MSE

 

When to Use

Use Huber Loss when:

  • Your dataset contains some outliers, but you don’t want to ignore them completely.

  • You want a smooth loss curve for gradient descent (unlike MAE).


Binary Cross-Entropy (Log Loss)

 

Definition

Binary Cross-Entropy (BCE) — also known as Log Loss — is the most common loss function used in binary classification problems (e.g., spam/not spam, 0/1, yes/no).

L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

and for all n samples:

L = −(1/n) · Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]

Where:

  • yᵢ​​ = true label (0 or 1)

  • ŷᵢ​ = predicted probability (between 0 and 1)

 

Conceptual Intuition

Binary Cross-Entropy measures how close the predicted probabilities are to the actual class labels.

  • If the true label is 1, only the first term −y log(ŷ) matters.

  • If the true label is 0, only the second term −(1−y)log(1−ŷ) matters.

It heavily penalizes wrong confident predictions — for example, predicting 0.99 when the true label is 0 gives a very high loss.
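A NumPy sketch of the per-sample BCE formula; the small epsilon clip is only there to avoid log(0), something deep learning libraries handle for you internally:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    y_true = np.asarray(y_true, float)
    y_prob = np.clip(np.asarray(y_prob, float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and right -> low loss (~0.11)
print(binary_cross_entropy([0], [0.99]))         # confident but wrong -> very high loss (~4.6)
```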

 

How It Works in a Neural Network

  1. Forward pass: The model outputs raw scores (logits).

  2. Sigmoid activation: Converts logits into probabilities between 0 and 1.

  3. Loss calculation: BCE compares predicted probabilities with actual labels.

  4. Backward pass: The gradient of BCE guides how much each weight and bias should adjust to minimize the loss.


Advantages

-> Works well for binary classification tasks

-> Provides probabilistic interpretation of predictions

-> Penalizes confident but wrong predictions strongly

Disadvantages

-> Requires sigmoid activation in the output layer

-> Can be sensitive to imbalanced datasets (needs proper class weighting)

 

When to Use

  • In binary classification problems (output labels 0 or 1)

  • When the output layer uses a sigmoid activation function

  • Examples: spam detection, disease prediction, sentiment analysis

 

Categorical Cross-Entropy Loss


What is Categorical Cross-Entropy ?

 

Categorical Cross-Entropy (often called CCE) is one of the most widely used loss functions for multi-class classification problems.

In simple terms, it tells us how far off our predicted probabilities are from the actual class label.

You’ll mostly see it used in Softmax regression and deep learning models that predict one label out of several possible classes.

The mathematical formula looks like this:

L = −Σⱼ yⱼ · log(ŷⱼ)   (sum over all k classes)

where:

  • k → total number of classes

  • yⱼ​ → true value (1 for the correct class, 0 for others)

  • ŷⱼ​ → predicted probability for each class

 

How It Works (Step by Step)

When you’re doing multi-class classification, the final layer of your neural network usually has:

  • One neuron for each class.

  • A Softmax activation function, which turns raw model outputs (called logits) into probabilities.

The Softmax function looks like this:

ŷⱼ = e^(zⱼ) / (e^(z₁) + e^(z₂) + … + e^(zₖ)),   where z₁ … zₖ are the logits

Softmax ensures two important things:

  1. Every output is between 0 and 1, and

  2. The total of all outputs equals 1 — so they form a valid probability distribution.

For example, if your model produces logits [2.1, 1.5, 0.3], Softmax converts them to probabilities of roughly [0.58, 0.32, 0.10].

That means the model thinks there is about a 58% chance the input belongs to the first class.
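Here is a short sketch of that conversion (subtracting the maximum logit before exponentiating is a standard numerical-stability trick):

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, float)
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax([2.1, 1.5, 0.3]))   # ~[0.58, 0.32, 0.10], and the three values sum to 1
```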

 

Forward Pass Example

Let’s see how this loss is calculated in practice.

Imagine a classification task with three classes: “yes”, “no”, and “maybe”.

We one-hot encode these labels as:

  • yes → [1, 0, 0]

  • no → [0, 1, 0]

  • maybe → [0, 0, 1]

Suppose the model predicts: [0.2, 0.3, 0.5]

and the true label is [1, 0, 0].

Plugging these into the formula:

L = −[1 · log(0.2) + 0 · log(0.3) + 0 · log(0.5)] = −log(0.2) ≈ 1.61

That’s the loss for this one example.

If the model had predicted a higher probability for the true class (say 0.9 instead of 0.2), the loss would be much smaller.

This is how the model learns — it adjusts its parameters during training to reduce this loss over time.
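The same calculation in a few lines of NumPy, with a second call showing how the loss shrinks when the true class receives a higher probability:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    y_true = np.asarray(y_true, float)
    y_prob = np.clip(np.asarray(y_prob, float), eps, 1.0)
    return -np.sum(y_true * np.log(y_prob))

print(categorical_cross_entropy([1, 0, 0], [0.2, 0.3, 0.5]))     # -log(0.2) ~ 1.61
print(categorical_cross_entropy([1, 0, 0], [0.9, 0.05, 0.05]))   # -log(0.9) ~ 0.11
```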


Why Cross-Entropy Works So Well

Cross-Entropy measures how close the predicted probability distribution is to the actual one.

If the model assigns a high probability to the correct class, the loss is small.

If it’s unsure or confident about the wrong class, the loss becomes large.

You can think of it as measuring how surprised the model is when it sees the correct label.

Less surprise → smaller loss → better predictions.

 

Sparse Categorical Cross-Entropy

Now, what if your labels aren’t one-hot encoded?

For example, instead of [1, 0, 0], you have labels like 0, 1, or 2.

In that case, we use Sparse Categorical Cross-Entropy (SCCE).

It calculates the same type of loss but skips the one-hot encoding step.


The Formulas

Categorical Cross-Entropy (CCE)

L = −Σⱼ yⱼ · log(ŷⱼ)

Here, labels are one-hot encoded, and only the correct class contributes to the sum.

 

Sparse Categorical Cross-Entropy (SCCE)

L = −log(ŷ_c),   where c is the integer index of the true class

Here, labels are integers (for example, 2 means “class 3”).

This formula directly picks the probability of the correct class and takes the negative log of it.

Both produce the same numerical loss — the difference is only in how the labels are represented.
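In Keras, for example, the choice shows up only in the loss name you pass to compile() and in the format of the labels. The toy model and random data below are placeholders for illustration:

```python
import numpy as np
import tensorflow as tf

# Arbitrary toy model: 4 input features, 3 output classes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

X = np.random.rand(6, 4).astype("float32")

# One-hot labels -> categorical cross-entropy
y_onehot = tf.keras.utils.to_categorical([0, 1, 2, 0, 1, 2], num_classes=3)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, y_onehot, epochs=1, verbose=0)

# Integer labels -> sparse categorical cross-entropy (no one-hot step needed)
y_int = np.array([0, 1, 2, 0, 1, 2])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y_int, epochs=1, verbose=0)
```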

 

Example

Let’s take the following prediction:

[0.1, 0.4, 0.5]

| True Label | Type | Formula | Result |
| --- | --- | --- | --- |
| [1, 0, 0] (CCE) | One-hot | −log(0.1) | 2.30 |
| 0 (SCCE) | Integer | −log(0.1) | 2.30 |
| 1 (SCCE) | Integer | −log(0.4) | 0.92 |

Both losses are identical when referring to the same true class — the difference is only in whether the label is [1, 0, 0] or just 0.

 

Key Differences Between CCE and SCCE

| | Categorical Cross-Entropy (CCE) | Sparse Categorical Cross-Entropy (SCCE) |
| --- | --- | --- |
| Label format | One-hot vectors (e.g., [1, 0, 0]) | Integer class indices (e.g., 0, 1, 2) |
| Per-sample formula | L = −Σⱼ yⱼ · log(ŷⱼ) | L = −log(ŷ_c) |
| Preprocessing | Requires one-hot encoding of labels | No encoding step (lighter when there are many classes) |

Both measure how much “penalty” to apply when the model predicts wrong.

SCCE just skips the one-hot encoding part and directly uses the label index.

 

Mathematical Form (Across a Dataset)

When calculated over the entire dataset:

L = −(1/n) · Σᵢ Σⱼ yᵢⱼ · log(ŷᵢⱼ)

For Sparse CCE, this simplifies to:

L = −(1/n) · Σᵢ log(ŷᵢ[cᵢ]),   where cᵢ is the true class index of sample i

Key Notes to Remember

  • Categorical Cross-Entropy → For multi-class problems with one-hot encoded labels.

  • Sparse Categorical Cross-Entropy → For integer-encoded labels (saves time and memory).

  • Binary Cross-Entropy → For binary classification.

Use SCCE when:

  • You have a large number of classes, and

  • You want to avoid converting labels into one-hot form.

 

Common Loss Function Guide

| Task Type | Recommended Loss Function |
| --- | --- |
| Regression (clean data) | Mean Squared Error (MSE) |
| Regression (with outliers) | Mean Absolute Error (MAE) |
| Binary Classification | Binary Cross-Entropy |
| Multi-class (few classes) | Categorical Cross-Entropy |
| Multi-class (many classes) | Sparse Categorical Cross-Entropy |
| Regression with some outliers (middle ground between MSE and MAE) | Huber Loss |

  • MSE / MAE → For continuous outputs.

  • Cross-Entropy → For probability-based outputs.

  • Sparse Cross-Entropy → Same idea as cross-entropy, just simpler integer labels.

 

You can think of Cross-Entropy Loss as a measure of how surprised the model is when it learns what the correct label was.

  • If it was already confident and right → little surprise → low loss.

  • If it was confident but wrong → big surprise → high loss.

During training, your model keeps adjusting itself to minimize this surprise —

which means it’s getting better at being confident and correct at the same time.

 

Both CCE and SCCE help your model learn to trust the right answer more strongly.

They’re just two different ways of writing the same idea — one for one-hot labels and one for integer labels.



