
LOGISTIC REGRESSION - 2
- Aryan

- Apr 16
- 9 min read
GRADIENT DESCENT

Here is the basic procedure :

Suppose we have a dataset, and we start with a random decision boundary (a line). This line is defined by coefficients (parameters) like
β₀ , β₁ , β₂
Initially, these coefficient values are randomly assigned.
Then we start updating these coefficient values using gradient descent. The update rules are :
βⱼ = βⱼ − η · ∂L/∂βⱼ   (for each coefficient βⱼ)
where η is the learning rate and L is the log-loss. Working out this gradient for logistic regression gives βⱼ = βⱼ + η Σᵢ (yᵢ − ŷᵢ) xᵢⱼ , where ŷᵢ is the predicted probability for the i-th sample.
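A minimal sketch of these updates in Python with NumPy, assuming a binary 0/1 target y and a feature matrix X whose first column is all 1s (for β₀); the function and variable names are illustrative :

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    # X : (n_samples, n_features + 1) with a leading column of 1s for the intercept
    beta = np.random.randn(X.shape[1]) * 0.01   # random initial coefficients
    for _ in range(epochs):
        y_hat = sigmoid(X @ beta)                # predicted probabilities
        grad = X.T @ (y_hat - y) / len(y)        # gradient of the log-loss w.r.t. beta
        beta -= lr * grad                        # gradient descent update
    return beta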
What is multi-class classification ?

Multi-class classification refers to classification tasks that involve more than two classes, where each input is assigned to exactly one of the possible classes.
Examples :
Digit Recognition
Predict which digit (0–9) is in an image → 10 classes
(e.g., MNIST dataset)
Animal Classification
Classify an image as cat, dog, or rabbit → 3 classes
Sentiment Analysis
Predict whether a review is positive, neutral, or negative → 3 classes
How Logistic Regression Handles Multi-class Classification Problems
Logistic regression can be extended to handle multi-class classification using two main techniques :
One-vs-Rest (OvR), also known as One-vs-All (OvA)
Multinomial Logistic Regression, also known as Softmax Regression
OVR (One-vs-Rest) Approach
Suppose we have a dataset where the input features are cgpa and iq, and the target/output feature is placement, which can take one of three values: Yes, No, or Opt Out.
In the One-vs-Rest (OvR) approach, for each unique class, we train a separate logistic regression model. So, if we have 3 classes, we train 3 logistic regression models — one for each class.
Steps :
Split the output into separate binary classification problems — one for each class :
Model 1: "Yes" vs (No + Opt Out)
Model 2: "No" vs (Yes + Opt Out)
Model 3: "Opt Out" vs (Yes + No)
Apply One-Hot Encoding to the output column. This converts the categorical target variable into multiple binary columns, one for each class (a short code sketch is shown below).
The transformed data is then used to train the three logistic regression models, each one predicting whether a sample belongs to its respective class or not.
In the OvR approach, the dataset is split into three binary classification problems based on the target variable placement :
Class 1 (Placement = Yes) : Dataset includes samples where placement = Yes, and the others (No, Opt Out) are considered negative examples.
Class 2 (Placement = No) : Dataset includes samples where placement = No, and the others (Yes, Opt Out) are considered negative examples.
Class 3 (Placement = Opt Out) : Dataset includes samples where placement = Opt Out, and the others (Yes, No) are considered negative examples.
Thus, the multiclass classification problem is transformed into three binary classification problems.
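As a rough illustration of this one-hot split in Python, using pandas (the column and label names are assumed to match the example) :

import pandas as pd

df = pd.DataFrame({
    "cgpa": [8.1, 6.2, 7.4, 5.9],
    "iq":   [110, 95, 102, 88],
    "placement": ["Yes", "No", "Opt Out", "Yes"],
})
# One 0/1 column per class; each column becomes the target of one OvR model
targets = pd.get_dummies(df["placement"]).astype(int)
print(targets.columns.tolist())   # ['No', 'Opt Out', 'Yes']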
How the Prediction Works :
Suppose we have a new data point with cgpa = 6.5 and iq = 65, and we want to predict the placement (whether the student gets placed, is not placed, or opts out).
The point is passed through all three trained classifiers, and we get three probabilities :
P(Placed) = 0.6
P(Not Placed) = 0.3
P(Opt Out) = 0.5
Next, we normalize the probabilities to make sure they sum to 1 :
P(Placed) = 0.6 / (0.6 + 0.3 + 0.5) = 0.6 / 1.4 ≈ 0.4286
P(Not Placed) = 0.3 / 1.4 ≈ 0.2143
P(Opt Out) = 0.5 / 1.4 ≈ 0.3571
Now, the sum of the probabilities is 1, and we have the normalized probabilities for each class :
P(Placed) = 0.4286
P(Not Placed) = 0.2143
P(Opt Out) = 0.3571
The class with the highest probability (0.4286 for "Placed") is chosen as the predicted outcome.
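A short sketch of this prediction step in Python, using the three classifier outputs from above :

import numpy as np

raw = np.array([0.6, 0.3, 0.5])             # P(Placed), P(Not Placed), P(Opt Out) from the three models
probs = raw / raw.sum()                      # normalize so the probabilities sum to 1
classes = ["Placed", "Not Placed", "Opt Out"]
print(dict(zip(classes, probs.round(4))))    # {'Placed': 0.4286, 'Not Placed': 0.2143, 'Opt Out': 0.3571}
print("Prediction:", classes[int(np.argmax(probs))])   # Placed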
Advantages and Disadvantages of the OvR Approach
Advantages :
Simplicity: Each classifier is a binary logistic regression model, which makes the approach easy to implement and understand.
Flexibility: The approach can be extended to a large number of classes, allowing us to handle problems with more than three classes by training additional binary classifiers.
Independence of Classifiers: Since each classifier is independent, you can optimize and train them separately, which can make the process more efficient in certain scenarios.
Interpretability: The coefficients of each logistic regression model are interpretable, making it easy to understand how each feature influences the decision for each class.
Disadvantages :
Imbalanced Data: If one class is significantly larger than the others, the classifiers for the smaller classes might perform poorly due to imbalance.
Multiple Predictions: During prediction, we need to compute multiple probabilities (one for each classifier) and then normalize them, which adds extra computational steps.
Overlapping Decision Boundaries: Each binary classifier might misclassify samples that are close to the decision boundary between two classes. As a result, the overall predictions can be less accurate if the classes are not well-separated.
Training Time: As the number of classes increases, the number of binary classifiers grows, which can increase the training time and complexity.
No Global Probability Relationship: Since each classifier is trained independently, there is no direct relationship between the predicted probabilities across the classifiers. This can cause inconsistencies when comparing the probabilities.
Softmax Function
The softmax function converts a vector of real numbers (logits) into a probability distribution. It is commonly used in classification problems where the goal is to assign a probability to each class.

Suppose we have three real-valued scores (also called logits) :
z₁ , z₂ , z₃
The softmax of each score is calculated as :
softmax(zᵢ) = e^(zᵢ) / (e^(z₁) + e^(z₂) + e^(z₃)) , for i = 1, 2, 3
Each output of the softmax function represents the predicted probability of the corresponding class. These probabilities are :
Always positive
Add up to 1, i.e. :
softmax(z₁) + softmax(z₂) + softmax(z₃) = 1
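A minimal, numerically stable softmax in Python (subtracting the maximum logit before exponentiating is a common trick to avoid overflow; the input logits here are made up) :

import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())     # shifting all logits by a constant does not change the output
    return exp_z / exp_z.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())                   # all entries positive, and they sum to 1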
Softmax Approach

Softmax regression (also known as multinomial logistic regression) is a generalization of logistic regression used for multi-class classification problems. While logistic regression is used to separate two classes using a decision boundary (like a line in 2D or a plane in 3D), softmax regression handles more than two classes.
Since softmax regression is still a linear model, it creates linear decision boundaries between classes.
Understanding the Model :
Suppose we have k classes. The softmax regression model will learn k sets of parameters, one for each class. Each set of parameters includes :
a bias term and one weight per input feature, i.e. for class i : βᵢ₀ (bias), βᵢ₁ , βᵢ₂ , … , βᵢₙ
This means that for each class, the model learns a linear function :
zᵢ = βᵢ₀ + βᵢ₁·x₁ + βᵢ₂·x₂ + … + βᵢₙ·xₙ
These scores zᵢ are then passed through the softmax function to convert them into probabilities for each class.
Parameter Count :
For k classes and n features, the model needs to learn :
Total parameters = k × (n + 1)
(where "+1" accounts for the bias term)
Example :
If we have 3 classes and 2 features, then the model will learn 3 × (2 + 1) = 9 parameters
If we increase to 3 features, then 3 × (3 + 1) = 12 parameters
Intuition :
Each class's parameter set defines a linear boundary. The model uses these boundaries to create decision regions, effectively assigning a class to every point in the feature space based on the highest softmax score.
Multi-class Classification Example using Softmax Regression
Suppose we have a dataset with the following features :
cgpa (academic performance)
iq (intelligence score)
placement (the target/output variable with three classes : Yes, No, and Opt Out)
To handle multiple classes in the target variable, we first apply one-hot encoding to convert categorical labels into a numeric format.
One-Hot Encoding of the Target Variable :
Each row will now have a vector of size 3 representing the actual class label using 1s and 0s.
Loss Function: Categorical Cross-Entropy
To train the softmax regression model, we use the cross-entropy loss function. It measures the difference between the true distribution (from one-hot encoding) and the predicted distribution (from softmax outputs).
L = −(1/n) Σᵢ Σₖ yᵢₖ · log(ŷᵢₖ) , summed over all samples i = 1…n and all classes k = 1…K
where yᵢₖ is 1 if sample i belongs to class k (and 0 otherwise), and ŷᵢₖ is the predicted probability that sample i belongs to class k.
Special Case – Binary Classification :
If there are only two classes (i.e., K = 2), the categorical cross-entropy loss simplifies to the binary cross-entropy loss, commonly used in logistic regression.
Understanding Cross-Entropy Loss for One-Hot Encoded Data (Step-by-Step)
Let’s assume, for simplicity, that our dataset contains only one data point, i.e., n = 1 .
In that case, the categorical cross-entropy loss simplifies to :
L = −Σₖ yₖ · log(ŷₖ) = −[ y₁·log(ŷ₁) + y₂·log(ŷ₂) + y₃·log(ŷ₃) ]
Expanding the Loss for a Single Data Point
Assume the true label for the first row is "Yes" (so the one-hot encoding is [1, 0, 0]).
Then the loss becomes :
L = −[ 1·log(ŷ₁) + 0·log(ŷ₂) + 0·log(ŷ₃) ]
Since the one-hot vector is [1, 0, 0], this simplifies to :
L = −log(ŷ₁)
where ŷ₁ is the predicted probability of the class "Yes" for this data point.
Extending to Multiple Data Points
Now assume we have three data points, each with different class labels :
First row → class: Yes
Second row → class: No
Third row → class: Opt
Then the total loss becomes :
L = −(1/3) · [ log(ŷ₁,Yes) + log(ŷ₂,No) + log(ŷ₃,Opt) ]
where ŷ₁,Yes is the predicted probability of "Yes" for the first row, ŷ₂,No the predicted probability of "No" for the second row, and ŷ₃,Opt the predicted probability of "Opt" for the third row.
Out of the total 9 terms (3 data points × 3 classes), only 3 terms remain non-zero—the ones corresponding to the true class labels. All others are multiplied by 0 due to one-hot encoding.
The goal of training is to minimize this loss function. Minimizing the loss corresponds to maximizing the predicted probabilities of the correct classes.
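A short sketch of this computation in Python; the one-hot labels match the three rows above, while the predicted probabilities are made-up placeholders :

import numpy as np

Y = np.array([[1, 0, 0],      # row 1 : true class "Yes"
              [0, 1, 0],      # row 2 : true class "No"
              [0, 0, 1]])     # row 3 : true class "Opt"
Y_hat = np.array([[0.7, 0.2, 0.1],    # hypothetical softmax outputs for the same rows
                  [0.3, 0.5, 0.2],
                  [0.2, 0.2, 0.6]])
# Multiplying by the one-hot labels keeps only the terms for the true classes
loss = -np.sum(Y * np.log(Y_hat)) / len(Y)
print(loss)    # equals -(log 0.7 + log 0.5 + log 0.6) / 3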
How the Softmax Model Computes Predicted Probabilities
Earlier, we saw that our loss function for multi-class classification looks like :
L = −(1/n) Σᵢ Σₖ yᵢₖ · log(ŷᵢₖ)
From Binary to Multi-class Logistic Regression
In binary logistic regression, the predicted probability is computed using the sigmoid function :
ŷ = σ(z) = 1 / (1 + e^(−z)) , where z = β₀ + β₁·x₁ + β₂·x₂
In multi-class classification (softmax regression), instead of the sigmoid, we use the softmax function.
Computing Softmax Scores :
Suppose we have three classes: Yes, No, and Opt. For each class, the model learns its own set of coefficients :
Class "Yes" : β₁₀ , β₁₁ , β₁₂
Class "No" : β₂₀ , β₂₁ , β₂₂
Class "Opt" : β₃₀ , β₃₁ , β₃₂
(the first subscript indexes the class, the second indexes bias / cgpa / iq)
Assume we have a data point with features: cgpa = 8, iq = 80.
We compute the class scores (logits) :
z₁ = β₁₀ + β₁₁·8 + β₁₂·80   (score for "Yes")
z₂ = β₂₀ + β₂₁·8 + β₂₂·80   (score for "No")
z₃ = β₃₀ + β₃₁·8 + β₃₂·80   (score for "Opt")
These scores are then passed into the softmax function to get predicted probabilities :
ŷₖ = e^(zₖ) / (e^(z₁) + e^(z₂) + e^(z₃)) , for k = 1, 2, 3
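Concretely, with made-up coefficient values, the scores and probabilities for cgpa = 8 and iq = 80 can be computed like this :

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

B = np.array([[ 0.5,  0.8,  0.02],    # class "Yes" : beta_10, beta_11, beta_12 (assumed values)
              [-0.3, -0.6, -0.01],    # class "No"  : beta_20, beta_21, beta_22
              [ 0.1, -0.2,  0.00]])   # class "Opt" : beta_30, beta_31, beta_32
x = np.array([1.0, 8.0, 80.0])        # [bias term, cgpa, iq]
z = B @ x                             # one logit per class
print(softmax(z))                     # predicted probabilities for Yes, No, Opt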
Loss Function Depends on the Parameters
Now let’s assume we have 3 data points, with respective true labels: Yes, No, Opt.
Then the loss function becomes :
L = −(1/3) · [ log(ŷ₁,Yes) + log(ŷ₂,No) + log(ŷ₃,Opt) ]
Each predicted probability is a softmax output, so it depends on all nine parameters (β₁₀ through β₃₂).
Goal of Optimization
We aim to find the values of these 9 parameters that minimize the loss function.
This is typically done using gradient descent or other optimization algorithms.
Training Softmax Regression: An Optimization Problem
Softmax regression is essentially an optimization problem. Our objective is to find the best values for the model parameters (i.e., the coefficients for each class) that minimize the loss function (cross-entropy loss).
Step 1 : Initialize Parameters
We begin by randomly initializing the model parameters. For a 3-class classification problem with 2 features (cgpa, iq) and a bias term, we have 9 parameters in total :
β₁₀ , β₁₁ , β₁₂   (class "Yes")
β₂₀ , β₂₁ , β₂₂   (class "No")
β₃₀ , β₃₁ , β₃₂   (class "Opt")
Step 2 : Apply Gradient Descent
We use gradient descent to update the parameters and minimize the loss. A general update rule for any parameter looks like :
βₖⱼ = βₖⱼ − η · ∂L/∂βₖⱼ
where η is the learning rate and L is the categorical cross-entropy loss.
Where :
j ∈ {0,1,2} : refers to the bias, feature 1 (cgpa), and feature 2 (iq)
k ∈ {1,2,3} : refers to the class index
We run this update iteratively for multiple epochs (e.g., 1000 times) until the loss converges.
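A compact sketch of this training loop in Python (NumPy), on a tiny made-up dataset; the gradient of the cross-entropy loss with respect to the weight matrix works out to (Ŷ − Y)ᵀ X / n :

import numpy as np

def softmax(Z):
    expZ = np.exp(Z - Z.max(axis=1, keepdims=True))
    return expZ / expZ.sum(axis=1, keepdims=True)

# Tiny made-up dataset : features are cgpa and iq, labels are one-hot (Yes / No / Opt)
X_raw = np.array([[8.0, 110], [5.5, 90], [7.0, 100], [6.0, 95]])
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
# Standardize the features (keeps gradient descent stable) and prepend a bias column of 1s
X = np.hstack([np.ones((len(X_raw), 1)), (X_raw - X_raw.mean(0)) / X_raw.std(0)])

W = np.zeros((3, 3))                      # 9 parameters : one (bias, cgpa, iq) row per class
lr, epochs = 0.1, 1000
for _ in range(epochs):
    Y_hat = softmax(X @ W.T)              # predicted probabilities, shape (n_samples, 3)
    grad = (Y_hat - Y).T @ X / len(X)     # gradient of the cross-entropy loss w.r.t. W
    W -= lr * grad                        # gradient descent update
Y_hat = softmax(X @ W.T)
print("final loss:", -np.mean(np.sum(Y * np.log(Y_hat), axis=1)))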
Step 3 : Final Model Parameters
After training, we obtain optimized values for all nine parameters (β₁₀ through β₃₂).
These parameter values are such that the loss function is minimized, i.e., the model has learned to best separate the classes.
Step 4 : Making Predictions
Suppose we get a new data point with :
cgpa = 6.5
iq = 65
We compute the scores (logits) for each class :
z₁ = β₁₀ + β₁₁·6.5 + β₁₂·65
z₂ = β₂₀ + β₂₁·6.5 + β₂₂·65
z₃ = β₃₀ + β₃₁·6.5 + β₃₂·65
We then apply the softmax function to convert these scores into probabilities :
ŷₖ = e^(zₖ) / (e^(z₁) + e^(z₂) + e^(z₃)) , for k = 1, 2, 3
These three values represent the probabilities of the new point belonging to each class.
Final Prediction
Whichever class has the highest probability becomes the predicted class for the new data point.
For example, if the probability for "Yes" is the largest of the three, the model predicts that the student will be placed.
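Putting the prediction step together in Python (the weight values below are made-up placeholders standing in for the trained parameters) :

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

classes = ["Yes", "No", "Opt Out"]
W = np.array([[-1.0,  0.9,  0.03],    # assumed trained parameters, one row per class
              [ 1.5, -0.7, -0.02],
              [ 0.2, -0.1,  0.00]])
x_new = np.array([1.0, 6.5, 65.0])    # [bias, cgpa, iq] for the new student
probs = softmax(W @ x_new)
print(dict(zip(classes, probs.round(3))))
print("Predicted class:", classes[int(np.argmax(probs))])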
From Multiclass Cross-Entropy to Binary Cross-Entropy
Let’s now understand how the cross-entropy loss function for multiclass classification reduces to the binary cross-entropy loss when the number of classes k = 2 .
Multiclass Cross-Entropy Loss
The general form of the cross-entropy loss for k classes is :
L = −(1/n) Σᵢ Σₖ yᵢₖ · log(ŷᵢₖ) , summed over samples i = 1…n and classes k = 1…K
Binary Cross-Entropy Case (K = 2)
When the number of classes is 2, say class 0 and class 1, the formula becomes :
Because yᵢ₀ = 1 − yᵢ₁ and ŷᵢ₀ = 1 − ŷᵢ₁, writing yᵢ = yᵢ₁ and ŷᵢ = ŷᵢ₁ gives :
L = −(1/n) Σᵢ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]
For binary classification, we use the sigmoid function and binary cross-entropy loss.
For multiclass classification, we use the softmax function and categorical cross-entropy loss.
In Our Case
We have :
2 input features (cgpa, iq)
3 output classes (Yes, No, Opt-out)
This means :
We use the softmax function for prediction
We use the categorical cross-entropy loss for training
There are 9 parameters to optimize (3 weights + 1 bias for each class)
We minimize the loss using gradient descent to find the optimal parameter values that best separate the classes.
When to use what ?
Use One-vs-Rest (OVR) when :
Classes are Non-Mutually Exclusive: OVR is appropriate if an instance can belong to more than one class, as each classifier provides an independent probability for each class.
Dealing with Imbalanced Data: OVR might perform better when class distribution is highly imbalanced since each class gets a dedicated model.
Use Multinomial Logistic Regression (Softmax Regression) when :
Computational Efficiency is Required: Softmax Regression is generally more efficient for large datasets and a high number of classes.
Classes are Mutually Exclusive: Softmax Regression is a good choice when each instance can only belong to one class. The softmax function provides a set of probabilities that sum to 1, fitting well with mutually exclusive classes.
Interpretability is Important: The probabilities output by Softmax Regression are more interpretable than those from OVR, as they always sum to 1. This can make model predictions easier to explain.
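Both options are available in scikit-learn; a rough sketch (the data here is illustrative, and recent versions of LogisticRegression fit the multinomial/softmax model by default for multi-class targets) :

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[8.0, 110], [5.5, 90], [7.0, 100], [6.0, 95], [6.5, 65]])
y = np.array(["Yes", "No", "Opt Out", "No", "Opt Out"])

# One-vs-Rest : one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# Multinomial (softmax) logistic regression : a single model over all classes
multi = LogisticRegression(max_iter=1000).fit(X, y)

print(ovr.predict([[6.5, 65]]))
print(multi.predict_proba([[6.5, 65]]).round(3))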


