
LOGISTIC REGRESSION - 2

  • Writer: Aryan
  • Apr 16
  • 9 min read

GRADIENT DESCENT


So what we do is :


Suppose we have a dataset, and we start with a random decision boundary (a line). This line is defined by coefficients (parameters) like

β₀ , β₁ , β₂

Initially, these coefficient values are randomly assigned.

 

Then we start updating these coefficient values using gradient descent. The update rules are :

β₀ = β₀ − η · (∂L/∂β₀)

β₁ = β₁ − η · (∂L/∂β₁)

β₂ = β₂ − η · (∂L/∂β₂)

where η is the learning rate and L is the loss (log loss).
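A minimal NumPy sketch of this update loop for binary logistic regression follows; the function name fit_logistic, the learning rate, and the epoch count are illustrative choices, not code from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for the bias beta_0
    beta = np.random.randn(X.shape[1]) * 0.01      # random initial coefficients
    for _ in range(epochs):
        y_hat = sigmoid(X @ beta)                  # predicted probabilities
        grad = X.T @ (y_hat - y) / len(y)          # gradient of the mean log loss w.r.t. beta
        beta -= lr * grad                          # gradient descent update
    return beta
```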

What is multi-class classification ?


Multi-class classification refers to classification tasks that involve more than two classes, where each input is assigned to exactly one of the possible classes.

 

Examples :

  1. Digit Recognition

    Predict which digit (0–9) is in an image → 10 classes

    (e.g., MNIST dataset)

  2. Animal Classification

    Classify an image as cat, dog, or rabbit → 3 classes

  3. Sentiment Analysis

    Predict whether a review is positive, neutral, or negative → 3 classes


How Logistic Regression Handles Multi-class Classification Problems

 

Logistic regression can be extended to handle multi-class classification using two main techniques :

  1. One-vs-Rest (OvR), also known as One-vs-All (OvA)

  2. Multinomial Logistic Regression, also known as Softmax Regression


OVR (One-vs-Rest) Approach

 

Suppose we have a dataset where the input features are cgpa and iq, and the target/output feature is placement, which can take one of three values: Yes, No, or Opt Out.

cgpa | iq | placement
  -  |  - |     Y
  -  |  - |     N
  -  |  - |     O

In the One-vs-Rest (OvR) approach, for each unique class, we train a separate logistic regression model. So, if we have 3 classes, we train 3 logistic regression models — one for each class.

Steps :

  1. Split the output into separate binary classification problems — one for each class :

    • Model 1: "Yes" vs (No + Opt Out)

    • Model 2: "No" vs (Yes + Opt Out)

    • Model 3: "Opt Out" vs (Yes + No)


  2. Apply One-Hot Encoding to the output column. This converts the categorical target variable into multiple binary columns — one for each class :

cgpa | iq | Placement = Y | Placement = N | Placement = O
  -  |  - |       1       |       0       |       0
  -  |  - |       0       |       1       |       0
  -  |  - |       0       |       0       |       1

This is the transformed data used to train the three logistic regression models — each trying to predict whether a sample belongs to its respective class or not.


In the OvR approach, the dataset is split into three binary classification problems based on the target variable placement :

 

  1. Class 1 (Placement = Yes) : Dataset includes samples where placement = Yes, and the others (No, Opt Out) are considered negative examples.

  2. Class 2 (Placement = No) : Dataset includes samples where placement = No, and the others (Yes, Opt Out) are considered negative examples.

  3. Class 3 (Placement = Opt Out) : Dataset includes samples where placement = Opt Out, and the others (Yes, No) are considered negative examples.

 

Thus, the multiclass classification problem is transformed into three binary classification problems.


How the Prediction Works :

 

  • Suppose we have a new data point with cgpa = 6.5 and iq = 65, and we want to predict the placement (whether the student gets placed, is not placed, or opts out).

  • The point is passed through all three trained classifiers, and we get three probabilities :

    • P(Placed) = 0.6

    • P(Not Placed) = 0.3

    • P(Opt Out) = 0.5

  • Next, we normalize the probabilities to make sure they sum to 1 :

Sum = 0.6 + 0.3 + 0.5 = 1.4

P(Placed) = 0.6 / 1.4 ≈ 0.4286
P(Not Placed) = 0.3 / 1.4 ≈ 0.2143
P(Opt Out) = 0.5 / 1.4 ≈ 0.3571

  • Now, the sum of the probabilities is 1, and we have the normalized probabilities for each class :

    • P(Placed) = 0.4286

    • P(Not Placed) = 0.2143

    • P(Opt Out) = 0.3571

  • The class with the highest probability (0.4286 for "Placed") is chosen as the predicted outcome.
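
The same normalization and argmax step can be sketched in a few lines of NumPy; the class names and probabilities below simply mirror the example above.

```python
import numpy as np

scores = np.array([0.6, 0.3, 0.5])       # P(Placed), P(Not Placed), P(Opt Out) from the three classifiers
probs = scores / scores.sum()             # normalize so the probabilities sum to 1
classes = ["Placed", "Not Placed", "Opt Out"]

print(probs)                              # approx. [0.4286, 0.2143, 0.3571]
print(classes[np.argmax(probs)])          # "Placed" is the predicted class
```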


Advantages and Disadvantages of the OvR Approach

 

Advantages :

 

  1. Simplicity: Each classifier is a binary logistic regression model, which makes the approach easy to implement and understand.

  2. Flexibility: The approach can be extended to a large number of classes, allowing us to handle problems with more than three classes by training additional binary classifiers.

  3. Independence of Classifiers: Since each classifier is independent, you can optimize and train them separately, which can make the process more efficient in certain scenarios.

  4. Interpretability: The coefficients of each logistic regression model are interpretable, making it easy to understand how each feature influences the decision for each class.

 

Disadvantages :

 

  1. Imbalanced Data: If one class is significantly larger than the others, the classifiers for the smaller classes might perform poorly due to imbalance.

  2. Multiple Predictions: During prediction, we need to compute multiple probabilities (one for each classifier) and then normalize them, which adds extra computational steps.

  3. Overlapping Decision Boundaries: Each binary classifier might misclassify samples that are close to the decision boundary between two classes. As a result, the overall predictions can be less accurate if the classes are not well-separated.

  4. Training Time: As the number of classes increases, the number of binary classifiers grows, which can increase the training time and complexity.

  5. No Global Probability Relationship: Since each classifier is trained independently, there is no direct relationship between the predicted probabilities across the classifiers. This can cause inconsistencies when comparing the probabilities.


Softmax Function

The softmax function is used to convert a vector of real numbers (logits) into a probability distribution. It is commonly used in classification problems where the goal is to assign probabilities to each class.


Suppose we have three real-valued scores (also called logits) :

z₁ , z₂ , z₃

The softmax of each score is calculated as :

softmax(zᵢ) = e^(zᵢ) / (e^(z₁) + e^(z₂) + e^(z₃)),  for i = 1, 2, 3

Each output of the softmax function represents the predicted probability of the corresponding class. These probabilities are :

  • Always positive

  • Add up to 1, i.e. :

softmax(z₁) + softmax(z₂) + softmax(z₃) = 1
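
A small NumPy sketch of the softmax function; the max-subtraction trick for numerical stability is an implementation detail not discussed above.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability (does not change the result)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])   # z1, z2, z3 (illustrative values)
probs = softmax(logits)
print(probs, probs.sum())             # all positive, and they sum to 1
```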

Softmax Approach


Softmax regression (also known as multinomial logistic regression) is a generalization of logistic regression used for multi-class classification problems. While logistic regression is used to separate two classes using a decision boundary (like a line in 2D or a plane in 3D), softmax regression handles more than two classes.

Since softmax regression is still a linear model, it creates linear decision boundaries between classes.

 

Understanding the Model :

 

Suppose we have k classes. The softmax regression model will learn k sets of parameters, one for each class. Each set of parameters includes :

βₖ₀ (the bias term) and βₖ₁ , βₖ₂ , … , βₖₙ (one weight per input feature)

This means that for each class, the model learns a linear function :

zₖ = βₖ₀ + βₖ₁·x₁ + βₖ₂·x₂ + … + βₖₙ·xₙ

These scores zₖ are then passed through the softmax function to convert them into probabilities for each class.

 

Parameter Count :

  • For k classes and n features, the model needs to learn :

    Total parameters = k × (n + 1)

    (where "+1" accounts for the bias term)

  • Example :

    • If we have 3 classes and 2 features, then the model will learn 3 × (2 + 1) = 9 parameters

    • If we increase to 3 features, then 3 × (3 + 1) = 12 parameters
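
As a quick sanity check, the parameters can be stored as a k × (n + 1) weight matrix; this is just an illustrative sketch of the count.

```python
import numpy as np

k, n = 3, 2                   # 3 classes, 2 features (cgpa, iq)
W = np.zeros((k, n + 1))      # one row per class, +1 column for the bias term
print(W.size)                 # 9 parameters, i.e. k * (n + 1)
```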


Intuition :

 

Each class's parameter set defines a linear boundary. The model uses these boundaries to create decision regions, effectively assigning a class to every point in the feature space based on the highest softmax score.


Multi-class Classification Example using Softmax Regression

 

Suppose we have a dataset with the following features :

 

  • cgpa (academic performance)

  • iq (intelligence score)

  • placement (target/output variable with three classes : Yes, No, and O for "Opt-out")

 

To handle multiple classes in the target variable, we first apply one-hot encoding to convert categorical labels into a numeric format.

 

One-Hot Encoding of the Target Variable :

cgpa | iq | Placement = Yes | Placement = No | Placement = O
  -  |  - |        1        |        0       |       0
  -  |  - |        0        |        1       |       0
  -  |  - |        0        |        0       |       1

Each row will now have a vector of size 3 representing the actual class label using 1s and 0s.
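
A minimal pandas sketch of this one-hot encoding step; the feature values and column names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cgpa": [8.0, 5.2, 7.1],
    "iq": [80, 60, 72],
    "placement": ["Yes", "No", "Opt Out"],
})

# One-hot encode the target column into one binary column per class
one_hot = pd.get_dummies(df["placement"], prefix="Placement", dtype=int)
df = pd.concat([df.drop(columns="placement"), one_hot], axis=1)
print(df)
```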


Loss Function: Categorical Cross-Entropy

 

To train the softmax regression model, we use the cross-entropy loss function. It measures the difference between the true distribution (from one-hot encoding) and the predicted distribution (from softmax outputs).

L = − (1/n) · Σᵢ Σₖ yᵢₖ · log(ŷᵢₖ),  for i = 1 … n and k = 1 … K

where n is the number of data points, K is the number of classes, yᵢₖ is 1 if data point i belongs to class k (and 0 otherwise), and ŷᵢₖ is the softmax probability the model assigns to class k for data point i.

Special Case – Binary Classification :

 

If there are only two classes (i.e., K = 2), the categorical cross-entropy loss simplifies to the binary cross-entropy loss, commonly used in logistic regression.


Understanding Cross-Entropy Loss for One-Hot Encoded Data (Step-by-Step)

 

Let’s assume, for simplicity, that our dataset contains only one data point, i.e., n = 1 .

In that case, the categorical cross-entropy loss simplifies to :

L = − Σₖ yₖ · log(ŷₖ),  for k = 1 … K

Expanding the Loss for a Single Data Point

 

Assume the true label for the first row is "Yes" (so the one-hot encoding is [1, 0, 0]).

Then the loss becomes :

L = − [ 1 · log ŷ(Yes) + 0 · log ŷ(No) + 0 · log ŷ(Opt) ]

Since the one-hot vector is [1, 0, 0], this simplifies to :

L = − log ŷ(Yes)

Extending to Multiple Data Points

 

Now assume we have three data points, each with different class labels :


  1. First row → class: Yes

  2. Second row → class: No

  3. Third row → class: Opt

 

Then the total loss becomes :

L = − [ log ŷ₁(Yes) + log ŷ₂(No) + log ŷ₃(Opt) ]

where ŷᵢ(c) denotes the predicted probability of class c for data point i.

Out of the total 9 terms (3 data points × 3 classes), only 3 terms remain non-zero—the ones corresponding to the true class labels. All others are multiplied by 0 due to one-hot encoding.

 

The goal of training is to minimize this loss function. Minimizing the loss corresponds to maximizing the predicted probabilities of the correct classes.
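
A short NumPy sketch of this computation, using made-up softmax outputs for the three rows, shows how only the true-class terms contribute to the loss.

```python
import numpy as np

y_true = np.array([[1, 0, 0],    # row 1: Yes
                   [0, 1, 0],    # row 2: No
                   [0, 0, 1]])   # row 3: Opt
y_pred = np.array([[0.7, 0.2, 0.1],     # illustrative softmax outputs
                   [0.2, 0.6, 0.2],
                   [0.1, 0.3, 0.6]])

# Categorical cross-entropy: only the true-class terms survive the multiplication by 0/1
loss = -np.sum(y_true * np.log(y_pred)) / len(y_true)
print(loss)
```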


How the Softmax Model Computes Predicted Probabilities

 

Earlier, we saw that our loss function for multi-class classification looks like :

L = − (1/n) · Σᵢ Σₖ yᵢₖ · log(ŷᵢₖ)

From Binary to Multi-class Logistic Regression

 

In binary logistic regression, the predicted probability is computed using the sigmoid function :

ŷ = σ(z) = 1 / (1 + e^(−z)),  where z = β₀ + β₁x₁ + β₂x₂

In multi-class classification (softmax regression), instead of the sigmoid, we use the softmax function.

 

Computing Softmax Scores :

 

Suppose we have three classes: Yes, No, and Opt. For each class, the model learns its own set of coefficients :

Yes : β₁₀ , β₁₁ , β₁₂
No : β₂₀ , β₂₁ , β₂₂
Opt : β₃₀ , β₃₁ , β₃₂

Assume we have a data point with features: cgpa = 8, iq = 80.

We compute the class scores (logits) :

z_Yes = β₁₀ + β₁₁ · 8 + β₁₂ · 80
z_No = β₂₀ + β₂₁ · 8 + β₂₂ · 80
z_Opt = β₃₀ + β₃₁ · 8 + β₃₂ · 80

These scores are then passed into the softmax function to get predicted probabilities :

ŷ(Yes) = e^(z_Yes) / (e^(z_Yes) + e^(z_No) + e^(z_Opt)), and similarly for ŷ(No) and ŷ(Opt).

Loss Function Depends on the Parameters

 

Now let’s assume we have 3 data points, with respective true labels: Yes, No, Opt.

Then the loss function becomes :

L = − [ log ŷ₁(Yes) + log ŷ₂(No) + log ŷ₃(Opt) ]

Each predicted probability is a softmax output that depends on all 9 parameters (β₁₀ … β₃₂), so the loss is a function of these parameters.

Goal of Optimization

 

We aim to find the values of these 9 parameters that minimize the loss function.

This is typically done using gradient descent or other optimization algorithms.


Training Softmax Regression: An Optimization Problem

 

Softmax regression is essentially an optimization problem. Our objective is to find the best values for the model parameters (i.e., the coefficients for each class) that minimize the loss function (cross-entropy loss).

 

Step 1 : Initialize Parameters

 

We begin by randomly initializing the model parameters. For a 3-class classification problem with 2 features (cgpa, iq) and a bias term, we have 9 parameters in total :

β₁₀ , β₁₁ , β₁₂ (class Yes)
β₂₀ , β₂₁ , β₂₂ (class No)
β₃₀ , β₃₁ , β₃₂ (class Opt Out)

Step 2 : Apply Gradient Descent

 

We use gradient descent to update the parameters and minimize the loss. A general update rule for any parameter looks like :

βₖⱼ = βₖⱼ − η · (∂L/∂βₖⱼ)    (η is the learning rate; L is the categorical cross-entropy loss)

Where :

  • j ∈ {0,1,2} : refers to the bias, feature 1 (cgpa), and feature 2 (iq)

  • k ∈ {1,2,3} : refers to the class index

We run this update iteratively for multiple epochs (e.g., 1000 times) until the loss converges.
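
Putting the pieces together, here is a minimal NumPy sketch of training softmax regression with gradient descent; the learning rate, epoch count, and initialization are illustrative assumptions, not values from the post.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)            # numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, lr=0.1, epochs=1000):
    """X: (samples, features) e.g. [cgpa, iq]; Y: one-hot targets of shape (samples, classes)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # bias column
    W = np.random.randn(X.shape[1], Y.shape[1]) * 0.01   # (n + 1) x k parameters
    for _ in range(epochs):
        Y_hat = softmax(X @ W)                      # predicted probabilities
        grad = X.T @ (Y_hat - Y) / len(Y)           # gradient of the cross-entropy loss
        W -= lr * grad                              # gradient descent update
    return W
```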


Step 3 : Final Model Parameters

 

After training, we obtain optimized values for the parameters :

β₁₀ , β₁₁ , β₁₂ , β₂₀ , β₂₁ , β₂₂ , β₃₀ , β₃₁ , β₃₂ (now fixed at the values found by gradient descent)

These parameter values are such that the loss function is minimized, i.e., the model has learned to best separate the classes.


Step 4 : Making Predictions

 

Suppose we get a new data point with :

  • cgpa = 6.5

  • iq = 65

We compute the scores (logits) for each class :

z_Yes = β₁₀ + β₁₁ · 6.5 + β₁₂ · 65
z_No = β₂₀ + β₂₁ · 6.5 + β₂₂ · 65
z_Opt = β₃₀ + β₃₁ · 6.5 + β₃₂ · 65

We then apply the softmax function to convert these scores into probabilities :

P(Yes) = e^(z_Yes) / (e^(z_Yes) + e^(z_No) + e^(z_Opt)), and similarly for P(No) and P(Opt Out).

These three values represent the probabilities of the new point belonging to each class.

 

Final Prediction

 

Whichever class has the highest probability becomes the predicted class for the new data point.

For example, if the softmax output assigns its highest probability to "Yes", the model predicts that the student will be placed.
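
A small NumPy sketch of this prediction step; the trained parameter values below are purely illustrative placeholders, not numbers from the article.

```python
import numpy as np

# Illustrative "trained" parameters: rows are [bias, cgpa weight, iq weight], columns are classes
W = np.array([[ 0.50, -0.20, -0.30],    # beta_10, beta_20, beta_30
              [ 0.80, -0.40, -0.40],    # beta_11, beta_21, beta_31
              [ 0.02, -0.01, -0.01]])   # beta_12, beta_22, beta_32

x_new = np.array([1.0, 6.5, 65.0])      # [bias term, cgpa, iq]
z = x_new @ W                            # one score (logit) per class
probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax

classes = ["Yes", "No", "Opt Out"]
print(dict(zip(classes, probs.round(3))))
print("Predicted:", classes[np.argmax(probs)])
```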

From Multiclass Cross-Entropy to Binary Cross-Entropy

 

Let’s now understand how the cross-entropy loss function for multiclass classification reduces to the binary cross-entropy loss when the number of classes k = 2 .

 

Multiclass Cross-Entropy Loss

 

The general form of the cross-entropy loss for k classes is :

L = − (1/n) · Σᵢ Σₖ yᵢₖ · log(ŷᵢₖ),  for k = 1 … K

Binary Cross-Entropy Case (K = 2)

 

When the number of classes is 2, say class 0 and class 1, the formula becomes :

L = − (1/n) · Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]
  • For binary classification, we use the sigmoid function and binary cross-entropy loss.

  • For multiclass classification, we use the softmax function and categorical cross-entropy loss.


In Our Case

 

We have :

  • 2 input features (cgpa, iq)

  • 3 output classes (Yes, No, Opt-out)

This means :

  • We use the softmax function for prediction

  • We use the categorical cross-entropy loss for training

  • There are 9 parameters to optimize (2 weights + 1 bias for each of the 3 classes)

We minimize the loss using gradient descent to find the optimal parameter values that best separate the classes.


When to use what ?

 

Use One-vs-Rest (OVR) when :

 

  1. Classes are Non-Mutually Exclusive: OVR is appropriate if an instance can belong to more than one class, as each classifier provides an independent probability for each class.

  2. Dealing with Imbalanced Data: OVR might perform better when class distribution is highly imbalanced since each class gets a dedicated model.

 

Use Multinomial Logistic Regression (Softmax Regression) when :

 

  1. Computational Efficiency is Required: Softmax Regression is generally more efficient for large datasets and a high number of classes.

  2. Classes are Mutually Exclusive: Softmax Regression is a good choice when each instance can only belong to one class. The softmax function provides a set of probabilities that sum to 1, fitting well with mutually exclusive classes.

  3. Interpretability is Important: The probabilities output by Softmax Regression are more interpretable than those from OVR, as they always sum to 1. This can make model predictions easier to explain.
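
For completeness, here is a hedged scikit-learn sketch of both strategies; the synthetic dataset and settings are illustrative assumptions, and recent scikit-learn versions use the multinomial (softmax) formulation by default for multi-class LogisticRegression.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Illustrative 3-class dataset with 2 features (standing in for cgpa and iq)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=42)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)   # one binary model per class
softmax_reg = LogisticRegression().fit(X, y)                # multinomial (softmax) formulation

print(ovr.predict_proba(X[:1]))
print(softmax_reg.predict_proba(X[:1]))
```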

