
LOGISTIC REGRESSION - 1

  • Writer: Aryan
  • Apr 14
  • 11 min read

INTRODUCTION TO LOGISTIC REGRESSION


Logistic regression is a fundamental statistical and machine learning technique used for solving binary classification problems. Despite its name containing "regression," it's actually a classification algorithm that predicts the probability of an instance belonging to a particular class. Unlike linear regression which predicts continuous values, logistic regression predicts discrete class labels by modeling the probability that a given input belongs to a specific category.


SOME BASIC GEOMETRY

 

  1.  Every Line Has a Positive Side and a Negative Side

 

In the figure, the line represents the equation:

                                     4x+3y+5=0

The blue region corresponds to the positive side of the line, where:

                                    4x+3y+5>0

The green region corresponds to the negative side of the line, where:

                                    4x+3y+5<0

The line itself (shown as the boundary between the two regions) represents the set of points satisfying:

                                    4x+3y+5=0

Any linear equation in two variables (x and y) divides the 2D plane into two regions—called half-planes—based on whether the expression evaluates to a positive or negative value.

[Figure: the line 4x + 3y + 5 = 0 dividing the plane into a positive (blue) half-plane and a negative (green) half-plane]

  2. How to Determine if a Given Point Lies on a Given Line


Suppose we are given the equation of a line :

                                             4x + 3y + 5 = 0

Now, let’s say we have a point (5,2) and we want to check whether this point lies on the line.

To do this, substitute the x- and y-coordinates of the point into the equation :

                                4(5) + 3(2) + 5 = 20 + 6 + 5 = 31 ≠ 0

Since the result is not zero, the point (5,2) does not lie on the line.

A point (x,y) lies on the line Ax + By + C = 0 if and only if substituting its coordinates into the equation results in a value of zero.


  3. How to Determine if a Given Point Lies on the Positive or Negative Side of a Line


Suppose we are given a line represented by the equation :

                                               Ax + By + C = 0

and a point (x₁ , y₁). To determine whether this point lies on the positive side or the negative side of the line, substitute the coordinates into the left-hand side of the equation :

                                                Ax₁+By₁+C

  • If the result is greater than 0, the point lies on the positive side of the line.

  • If the result is less than 0, the point lies on the negative side.

  • If the result is equal to 0, the point lies on the line.


Example 1 :

 

Given the line :

                4x+3y+5=0

Let’s test the point (2,2) :

              4(2) + 3(2) + 5 = 8 + 6 + 5 = 19 > 0

So, the point (2,2) lies on the positive side of the line.


Example 2 :

 

Now test the point (−2,−2) :

                 4(−2)+3(−2)+5 = −8−6+5 = −9 < 0

Therefore, the point (−2,−2) lies on the negative side of the line.


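A minimal Python sketch of this rule (the function name side_of_line is my own; the line and the test points come from the examples above):

```python
def side_of_line(A, B, C, x, y):
    """Return which side of the line Ax + By + C = 0 the point (x, y) lies on."""
    value = A * x + B * y + C
    if value > 0:
        return "positive side"
    elif value < 0:
        return "negative side"
    return "on the line"

# Line: 4x + 3y + 5 = 0
print(side_of_line(4, 3, 5, 5, 2))    # 4(5) + 3(2) + 5 = 31  -> positive side (not on the line)
print(side_of_line(4, 3, 5, 2, 2))    # 19 > 0  -> positive side
print(side_of_line(4, 3, 5, -2, -2))  # -9 < 0  -> negative side
```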

THE PROBLEM

 

Logistic Regression is a classification algorithm. Given two classes, it learns a decision boundary (a line in 2D) that separates the data points belonging to different classes.

Let’s say we have data for four students. Each student has two features: IQ and CGPA.

  • If a student got placed, they are marked with a green dot.

  • If a student did not get placed, they are marked with a red dot.

We want to train a logistic regression model to predict if a student will get placed, given their IQ and CGPA.

[Figure: four students plotted by IQ and CGPA; green dots are placed students, red dots are not placed]

Comparing Two Models

 

We trained two different models. Each model draws a different decision boundary:

  • Model 1: Separates green and red dots correctly (0 misclassifications).

  • Model 2: Misclassifies two red dots as green.

Clearly, Model 1 is better because it has fewer misclassifications. So we might be tempted to always select the model with the least misclassification count.

 

Let the equation of the decision boundary be :

 

                                        Ax + By + C = 0

 

To classify any point (xᵢ , yᵢ) , compute :

 

                                       z = Axᵢ + Byᵢ + C

Then:

  • If the point is green (should be positive) and z > 0 :  correct

  • If the point is red (should be negative) and z < 0 : correct

  • Otherwise:  misclassification
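As a sketch of this naïve selection rule, the helper below counts misclassifications for a candidate boundary Ax + By + C = 0. The boundary coefficients and data points are made-up placeholders, not the actual models from the figure:

```python
def count_misclassifications(A, B, C, points):
    """points: list of (x, y, label) where label is 'green' (expected on the
    positive side) or 'red' (expected on the negative side)."""
    errors = 0
    for x, y, label in points:
        z = A * x + B * y + C
        correct = (label == "green" and z > 0) or (label == "red" and z < 0)
        if not correct:
            errors += 1
    return errors

# Hypothetical (CGPA, IQ) points and a hypothetical boundary, for illustration only
data = [(8, 80, "green"), (9, 90, "green"), (6, 60, "red"), (4, 40, "red")]
print(count_misclassifications(1, 0.1, -13, data))  # 0 misclassifications
```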


But There’s a Problem…


[Figure: two models whose decision boundaries differ, yet both classify every training point correctly]

What if two or more models result in zero misclassification ? For example :

  • Model 1 and Model 2 both classify all training points correctly.

  • Yet, the boundaries they draw are very different.

Now, which model should we choose ?

We can't use misclassification count anymore because it’s the same for both.

This exposes a flaw in the naïve approach. Just using the count of misclassified points is not enough.


New Problem Formulation



As we know, there is a problem where we have two models—both with zero misclassification. Because of this, it's difficult to determine which model is better, as both seem equally good in terms of classification accuracy.

However, classification shouldn't be treated as strictly black and white. The current logic assumes that any point above a decision boundary (the line) lies in the positive region and is therefore classified as positive, while any point below the line is in the negative region and classified as negative. This binary perspective is overly simplistic and rigid.

A more nuanced approach would be to assign a score or value to each point instead of labeling them directly as positive or negative. For example, we can assign a value between 0 and 1 to each point based on how far it is from the decision boundary.

  • Points that are farther from the line in the positive region can be assigned a higher score, indicating stronger confidence in the positive classification.

  • Points closer to the line in the positive region get a lower score, indicating lower confidence.

  • Similarly, for the negative region, points closer to the line can be assigned higher (less negative) values, while points farther from the line receive lower values (more negative), reflecting stronger confidence in the negative class.

This highlights that distance from the decision boundary matters, and classification should consider this.

In summary, this approach—where we use a continuous score based on distance from the boundary—is more informative than binary classification. Once these scores are assigned, classification can be performed more meaningfully based on thresholds or probability-like interpretations.
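One way to make this idea concrete is to compute each point's signed distance from the line and squash it into (0, 1). The sketch below does exactly that, using a sigmoid-style squashing (which the rest of the post builds toward); the function names and test points are illustrative, not taken from the original figures:

```python
import math

def signed_distance(A, B, C, x, y):
    """Signed perpendicular distance of (x, y) from the line Ax + By + C = 0."""
    return (A * x + B * y + C) / math.hypot(A, B)

def confidence_score(A, B, C, x, y):
    """Map the signed distance to a score in (0, 1): 0.5 on the line,
    closer to 1 far on the positive side, closer to 0 far on the negative side."""
    d = signed_distance(A, B, C, x, y)
    return 1 / (1 + math.exp(-d))

# Scores for points at increasing distances from 4x + 3y + 5 = 0
for point in [(2, 2), (10, 10), (-2, -2), (-10, -10)]:
    print(point, round(confidence_score(4, 3, 5, *point), 3))
```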


Understanding Logistic Regression: How 0 and 1 Are Predicted

 

When working on a classification problem, let's say we have training data and we apply logistic regression. The goal is to train the model so that it can make predictions for every data point. Each prediction is either 0 or 1 — where 1 might represent a "positive" class (e.g., green), and 0 represents a "negative" class (e.g., red).

 

Example Dataset

 

Suppose we have the following dataset : 

IQ     CGPA    Placed
60     6       0
40     4       0
80     8       1
90     9       1

 Now, we want to make a prediction for the fourth data point.

Let’s assume our logistic regression model has learned the following linear equation:

                                               3X + 5Y + 6 = 0

To make a prediction, we plug in X = CGPA = 9 and Y = IQ = 90 :

                                         3(9) + 5(90) + 6 = 27 + 450 + 6 = 483

Since the result is greater than 0, we predict the label as 1 (placed/green). If the result were less than 0, we would predict 0 (not placed/red).
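The same prediction, written as a small Python sketch (using the example line 3X + 5Y + 6 = 0 with X = CGPA and Y = IQ; the function name is my own):

```python
def predict_step(cgpa, iq, coef_x=3, coef_y=5, intercept=6):
    """Predict 1 (placed) if the point falls on the positive side of
    3X + 5Y + 6 = 0, otherwise 0 (not placed)."""
    z = coef_x * cgpa + coef_y * iq + intercept
    return 1 if z > 0 else 0

print(predict_step(9, 90))  # 3(9) + 5(90) + 6 = 483 > 0  -> predicts 1 (placed)
```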


Mathematical Interpretation in Machine Learning

 

The equation can be written in the standard linear model form :

                                               z = β₀ + β₁X₁ + β₂X₂

Where :

  • β₀ is the intercept,

  • β₁, β₂ are the weights (coefficients),

  • X₁ = CGPA, X₂ = IQ,

  • z is the linear combination of inputs.

 

If z < 0 , we label the point as 0; if z > 0 , we label it as 1.

 

This kind of thresholding is done using a step function :

  • If z is positive → output = 1

  • If z is negative → output = 0

Graphically, this looks like a sharp jump from 0 to 1, which is why it's called a step function.


Why Not Use the Step Function ?

 

While the step function can classify, it doesn't give us a probabilistic or continuous output. Ideally, we want a model that:

  • Outputs values close to 1 for points far from the decision boundary (more confident predictions).

  • Outputs values close to 0.5 for points near the decision boundary (less confident predictions).

So instead of a step function, we use the sigmoid function to convert the value of z into a probability between 0 and 1.

[Figure: the sigmoid curve, which smoothly maps z to a value between 0 and 1]

The sigmoid function smoothly maps any real-valued number into the (0, 1) range. This allows logistic regression to estimate the probability that a given input belongs to class 1.

  1. We compute a linear combination :

                    z = β₀ + β₁X₁ + β₂X₂

  2. Instead of directly thresholding z with a step function, we pass it through the sigmoid :

                    ŷ = σ(z) = 1 / (1 + e^(−z))

  3. We interpret the output ŷ as the probability of belonging to class 1.

  4. We can then use a threshold (e.g., 0.5) to decide whether to classify it as 0 or 1.
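A sketch of these four steps in Python; the coefficient values β₀ = −20, β₁ = 1.5, β₂ = 0.1 are placeholders chosen only for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(x1, x2, b0, b1, b2):
    """Steps 1-3: linear combination, then sigmoid, interpreted as P(class = 1)."""
    z = b0 + b1 * x1 + b2 * x2
    return sigmoid(z)

def predict_label(x1, x2, b0, b1, b2, threshold=0.5):
    """Step 4: threshold the probability to get a 0/1 label."""
    return 1 if predict_proba(x1, x2, b0, b1, b2) >= threshold else 0

# Hypothetical coefficients; X1 = CGPA, X2 = IQ
p = predict_proba(9, 90, b0=-20, b1=1.5, b2=0.1)
print(round(p, 3), predict_label(9, 90, b0=-20, b1=1.5, b2=0.1))
```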


Sigmoid Function


[Figure: graph of the sigmoid function, an S-shaped curve rising smoothly from 0 to 1]

Equation of the Sigmoid Function :

                                    σ(x) = 1 / (1 + e^(−x))

Step Function (for Comparison) :

                                    f(x) = 1 if x > 0, and f(x) = 0 if x < 0

Understanding the Behavior :


  • In the sigmoid function :

    • When the value of X is very large, the output Y approaches 1.

    • When X = 0 , the output Y = 0.5 .

    • When X is very small or approaches negative infinity, the output Y approaches 0.

    • This smooth transition is the key behavior of the sigmoid function.

  • In the step function :

    • No matter how large or small the value of X :

      • If X is positive, the output Y=1 .

      • If X is negative, the output Y=0 .

    • It's a sharp, binary decision — either 0 or 1 — with no smooth transition.
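A quick numerical comparison of the two functions (the sample x values are arbitrary; the step function's value at exactly 0 is a matter of convention):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def step(x):
    return 1 if x > 0 else 0  # value at exactly 0 is a convention; here it returns 0

for x in [-10, -2, -0.5, 0, 0.5, 2, 10]:
    print(f"x = {x:6.1f}   step = {step(x)}   sigmoid = {sigmoid(x):.4f}")
```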


Why Use Sigmoid Instead of Step Function ?

 

  • The step function is too rigid — it classifies values as either 0 or 1 with no sense of uncertainty.

  • The sigmoid function provides a smooth and continuous output between 0 and 1.

  • This makes it more suitable for machine learning models, especially when we want to interpret outputs probabilistically (e.g., “this input has a 70% chance of belonging to class 1”).


Let our equation of the line be :

                                        z = β₀ + β₁X₁ + β₂X₂

First, we calculate the value of z.

Instead of passing this value to a step function, we pass it to the sigmoid function :

                                          σ(z)

This gives us a value between 0 and 1 depending on the value of z . If z=0 , the output is 0.5. This means that any point on the line will have a sigmoid output of 0.5.

So, our point X lying on the line will also output 0.5 from the sigmoid.

[Figure: points 1-4 at increasing distances above and below the blue decision boundary]

Now, consider the case where :

  • Point 3 lies just above the blue line — its z value is slightly greater than 0, so the sigmoid output could be around 0.6.

  • Point 4, further from the line, could have a sigmoid output around 0.7.

As we continue moving further in the positive direction, the sigmoid output approaches 1.

Conversely, if we move below the decision boundary into the negative side, sigmoid values decrease :

  • At Point 1, we might get an output of 0.3.

  • At Point 2, which is farther, the output might be 0.2.

  • As we go further negative (towards infinity), the output approaches 0.

[Figure: the classification region shaded as a gradient of probabilities around the green decision boundary]

Essentially, our whole classification region gets converted into a gradient of probabilities.

The green decision boundary (our equation of the line) corresponds to a sigmoid value of 0.5.

As we move towards positive infinity, the sigmoid approaches 1.

As we move towards negative infinity, the sigmoid approaches 0.

So, if a point lies at positive infinity (far on the positive side), we are nearly certain it belongs to the positive class. Similarly, a point far in the negative side is almost certainly negative.


This gives us a probabilistic interpretation of classification :

  • If a point has a probability of 0.6 of being positive, it has a 0.4 probability of being negative.

  • If a point has a 0.7 probability of being positive, the probability of it being negative is 0.3.

This smooth gradient transforms our entire classification region into a probability landscape :

  • On the line → probabilities are equal (0.5 positive, 0.5 negative)

  • Farther in the positive direction → probability of positive increases

  • Farther in the negative direction → probability of negative increases
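A tiny sketch of this probability landscape: for a few illustrative signed values z (0 on the boundary, positive above it, negative below it), the sigmoid gives the probability of the positive class and its complement gives the probability of the negative class:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Illustrative signed values: 0 = on the boundary, >0 above it, <0 below it
for z in [0.0, 0.4, 0.8, 5.0, -0.4, -0.8, -5.0]:
    p_pos = sigmoid(z)
    print(f"z = {z:5.1f}   P(positive) = {p_pos:.2f}   P(negative) = {1 - p_pos:.2f}")
```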


From Step Function to Sigmoid Function

 

In the step function, if a point lies in the positive region of the line, we classify it as positive (1).

If it lies on the negative side, we classify it as negative (0).

But this is a rigid approach—it doesn't capture uncertainty.

With the sigmoid function, we now get a probability score for every point:

  • The farther a point is in the positive region, the higher its chance of being 1.

  • The farther a point is in the negative region, the higher its chance of being 0.

This turns our classification from black-and-white to a shaded gradient, allowing for a more nuanced, probabilistic understanding of each prediction.

 

Maximum Likelihood in Logistic Regression



When trying to classify data points using logistic regression, we want to find the best decision boundary (line) that separates the classes most accurately. But there are many possible lines—so how do we choose the best one?

We use a loss function.

The idea is to:

  • Define a loss function that tells us how well a line (or model) performs.

  • Minimize this loss function to find the best parameters (and hence the best line).

In logistic regression, the loss function used is based on maximum likelihood estimation (MLE).


What is Maximum Likelihood ?

 

Maximum Likelihood Estimation is a method that finds the parameters (i.e., the line) that make the observed data most probable.

Instead of predicting exact labels (0 or 1), logistic regression gives probabilities. For each data point:

  • If it’s a positive class (e.g., green), we want its predicted probability to be close to 1.

  • If it’s a negative class (e.g., red), we want its predicted probability to be close to 0.

 

Comparing Two Models Using Likelihood


[Figure: Model 1 and Model 2, two candidate decision boundaries over the same four points]

Suppose we have two models (Model 1 and Model 2), each with a different decision boundary. We calculate how likely each model is to produce the observed data by multiplying the predicted probabilities for all the points.

Let’s walk through an example :

 

Model 1 Predictions :

Point    True Class    Predicted Probability (of green)    Contribution to Likelihood
1        Red           0.4 (→ red: 0.6)                    0.6
2        Red           0.3 (→ red: 0.7)                    0.7
3        Green         0.6                                 0.6
4        Green         0.8                                 0.8

Likelihood (Model 1) = 0.6 × 0.7 × 0.6 × 0.8 = 0.2016

 

Model 2 Predictions :

Point    True Class    Predicted Probability (of green)    Contribution to Likelihood
1        Red           0.4 (→ red: 0.6)                    0.6
2        Red           0.6 (→ red: 0.4)                    0.4
3        Green         0.3                                 0.3
4        Green         0.6                                 0.6

Likelihood (Model 2) = 0.6 × 0.4 × 0.3 × 0.6 = 0.0432

 

  • Model 1 has a higher likelihood than Model 2.

  • Therefore, Model 1 is a better fit according to maximum likelihood estimation.

  • In logistic regression, the line that gives maximum likelihood becomes our final decision boundary.
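The two likelihood calculations above, reproduced as a short sketch:

```python
def likelihood(predictions):
    """predictions: list of (true_class, p_green) pairs, where p_green is the
    model's predicted probability of the green class. A red point therefore
    contributes 1 - p_green to the product."""
    result = 1.0
    for true_class, p_green in predictions:
        result *= p_green if true_class == "green" else 1 - p_green
    return result

model_1 = [("red", 0.4), ("red", 0.3), ("green", 0.6), ("green", 0.8)]
model_2 = [("red", 0.4), ("red", 0.6), ("green", 0.3), ("green", 0.6)]

print(round(likelihood(model_1), 4))  # 0.6 * 0.7 * 0.6 * 0.8 = 0.2016
print(round(likelihood(model_2), 4))  # 0.6 * 0.4 * 0.3 * 0.6 = 0.0432
```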


Log Loss (Binary Cross-Entropy)

 

Let’s say we have four data points, and a logistic regression model is used to predict the probabilities of those points belonging to the green (class = 1) or red (class = 0) classes.

Suppose the model predicts the following probabilities of belonging to the green class (class 1) :

  • For a green point : 0.8

  • For another green point : 0.6

  • For a red point : 0.6

  • For another red point : 0.7

The joint likelihood (i.e., the product of the probabilities assigned to each point's true class, using 1 − Ŷ for the red points) would be :

ML = 0.8 × 0.6 × (1−0.6) × (1−0.7) = 0.8 × 0.6 × 0.4 × 0.3 = 0.0576

This is a small number — and in real-world datasets with thousands of samples, the product of so many probabilities (each between 0 and 1) becomes extremely small, approaching zero, which causes numerical underflow in computations.


Using Log to Avoid Underflow

 

To avoid underflow, instead of multiplying probabilities, we take the logarithm of the likelihood. The logarithm converts the product into a sum :

            log(ML) = log(p₁ × p₂ × ⋯ × pₙ) = log(p₁) + log(p₂) + ⋯ + log(pₙ)

where pᵢ is the probability assigned to the i-th point's true class.

However, since log values of numbers between 0 and 1 are negative, to make the loss value positive (since we aim to minimize it), we take the negative log-likelihood :

            −log(ML) = −[ log(p₁) + log(p₂) + ⋯ + log(pₙ) ]

So, instead of maximizing the likelihood (which was our original objective in Maximum Likelihood Estimation), we equivalently minimize the negative log-likelihood. This transformation makes the objective numerically stable and computationally feasible.
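A small sketch of why this matters numerically: multiplying a few thousand probabilities underflows to zero in floating point, while the (negative) sum of their logs stays in a comfortable range. The probabilities here are randomly generated placeholders:

```python
import math
import random

random.seed(0)
# A few thousand made-up probabilities, each between 0.1 and 0.9
probs = [random.uniform(0.1, 0.9) for _ in range(5000)]

product = 1.0
for p in probs:
    product *= p
print(product)  # underflows to 0.0

neg_log_likelihood = -sum(math.log(p) for p in probs)
print(neg_log_likelihood)  # a perfectly ordinary positive number
```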


Why We Cannot Use Only Ŷᵢ in All Cases

 

We might be tempted to write :

            L = −(1/n) Σᵢ₌₁ⁿ log(Ŷᵢ)

But this is incorrect. Why ?

 

Because Ŷᵢ represents the predicted probability of the positive class (e.g., green). If a data point actually belongs to the negative class (e.g., red), using Ŷᵢ would incorrectly measure the probability of the wrong class. So we must account for the true label Yᵢ .


Unified Expression for All Data Points

 

To correctly handle both classes (0 and 1), we use this combined expression for each data point :

            Yᵢ log(Ŷᵢ) + (1 − Yᵢ) log(1 − Ŷᵢ)

This works as follows :

  • If Yᵢ = 1 (green), the second term vanishes and the expression reduces to log(Ŷᵢ).

  • If Yᵢ = 0 (red), the first term vanishes and the expression reduces to log(1 − Ŷᵢ).

Applying this to all four points, we compute :

            −[ log(0.8) + log(0.6) + log(1 − 0.6) + log(1 − 0.7) ] = −[ log(0.8) + log(0.6) + log(0.4) + log(0.3) ]

Log Loss Formula (Binary Cross-Entropy)

 

Now, generalizing for all n data points :

            L = −(1/n) Σᵢ₌₁ⁿ [ Yᵢ log(Ŷᵢ) + (1 − Yᵢ) log(1 − Ŷᵢ) ]

Where :

  • Yᵢ ∈ {0, 1} is the true label of the i-th data point,

  • Ŷᵢ = σ(zᵢ) is the predicted probability that the i-th point belongs to class 1, with zᵢ = β₀ + β₁X₁ + β₂X₂ evaluated at that point,

  • n is the number of data points.

This function L is called :

  • Log Loss

  • Binary Cross-Entropy Loss

  • It is the standard loss function for binary classification tasks using logistic regression.

  • Minimizing Log Loss is equivalent to maximizing the likelihood of the observed data under the model.

We need to find the values of β₀, β₁, β₂ that minimize the value of our loss function.
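A minimal log-loss implementation, checked against the four-point example above (green points with Ŷ = 0.8 and 0.6, red points with Ŷ = 0.6 and 0.7):

```python
import math

def log_loss(y_true, y_pred):
    """Binary cross-entropy: -(1/n) * sum of y*log(p) + (1-y)*log(1-p)."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

y_true = [1, 1, 0, 0]          # green, green, red, red
y_pred = [0.8, 0.6, 0.6, 0.7]  # predicted probability of the green class
print(round(log_loss(y_true, y_pred), 4))  # ≈ 0.7136
```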

[Figure: the same training data with several candidate decision boundaries (blue lines), each corresponding to a different set of coefficients]

Suppose we have data like the one shown above. When we apply logistic regression to this data, any of the blue lines could represent a possible logistic regression decision boundary. Each line corresponds to a specific set of coefficients β₀, β₁, β₂. However, we aim to find the line (i.e., the set of coefficients) that minimizes the loss function. That line will be our optimal logistic regression model.
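As a sketch of this final step, the snippet below fits a logistic regression model to the toy IQ/CGPA dataset, assuming scikit-learn is available. Note that scikit-learn adds L2 regularization by default, so the coefficients it finds minimize a slightly penalized version of the log loss rather than the pure loss described above:

```python
from sklearn.linear_model import LogisticRegression

X = [[6, 60], [4, 40], [8, 80], [9, 90]]  # (CGPA, IQ)
y = [0, 0, 1, 1]                          # placed or not

model = LogisticRegression()
model.fit(X, y)

# The learned intercept (β₀) and weights (β₁, β₂) define the decision boundary
print("intercept:", model.intercept_, "coefficients:", model.coef_)

# Predicted probability of being placed for each student
print(model.predict_proba(X)[:, 1])
```

Under the hood, the solver is doing exactly what this section describes: searching over β₀, β₁, β₂ for the values that make the loss as small as possible.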

