Gradient Boosting For Classification - 2

Aryan
Jun 25
4 min read

Gradient Boosting Classification : Mathematical Formulation

Let’s see how gradient boosting works mathematically in the case of classification tasks. We can use the same algorithm as shown above for both regression and classification, by simply changing the loss function L(y , f(x)) . When we change the loss function and use the classification-specific loss (log loss), this algorithm works for classification .

Step 1: Choosing the Loss Function

First, we have to decide which loss function to use. For binary classification, we use log loss.

The formula of log loss is :

Step 2 : Converting from Probability Space to Log-Odds Space

This is currently expressed in terms of probability space. But we need to convert it into log-odds space, because boosting models typically operate there.

So, we start manipulating :

Now group and rearrange terms :

Since :

So the expression becomes :

Step 3: Expressing pᵢ in terms of log-odds

Now we convert the predicted probability back into log-odds form :

Then :

So we update the loss function :

Step 4: Final Simplification

Now simplify it :

Conclusion

So, the final loss function now operates in log-odds space :

Now, we can easily use this in gradient boosting since all steps are defined over log-odds, which aligns with the additive nature of boosting models (especially when using trees that predict in the log-odds domain) .

Step 1: Initializing the First Model f₀(x)

Now that we have our log-loss function formulated in terms of log-odds, let’s move to Step 1 of the gradient boosting algorithm.

In gradient boosting, we add models stage by stage. In Step 1, we must initialize our first model f₀(x) . This means we need to find the value of γ that minimizes the loss function :

Starting with the loss function :

At this point, our loss is in terms of log(oddᵢ) , but we want to generalize it to γ , since f₀(x) = γ is constant for all samples in the first step.

So, we rewrite the loss function as :

Now differentiate with respect to γ and set derivative to zero to minimize :

Split and rearrange :

Step 2(a) : Compute Residuals rᵢₘ

Once we've initialized the model with f₀(x) , we begin adding models stage by stage. In each stage mmm, for each data point i, we compute the residual :

This derivative tells us how much the loss would change with respect to the model’s prediction. These residuals are what the next decision tree will try to fit .

Derivation for Binary Log Loss

Final Result

For each data point, the residual is simply the difference between actual label and predicted probability :

This makes perfect sense intuitively:

If prediction is too high, residual is negative → pull it down
If prediction is too low, residual is positive → pull it up

These residuals now act as pseudo-targets for the next tree to learn from .

Step 2(b): Fitting the First Regression Tree on Residuals

"Now the residuals we calculated earlier—it's time to make a regression tree out of them."

After computing the initial prediction f₀(x) , we calculate the residuals—the errors between the true labels and the predictions (in log-odds or probability form).

What We Have :

Features (X) :
- cgpa, iq (input features)
Target (y) :
- Residuals r₁ , computed from the initial model f₀(x)

These residuals indicate how far off our initial model was from the actual label .

What We Do:

We now train a regression tree (not a classifier !) to predict the residuals.

Input: cgpa, iq
Target: Residuals r₁

This regression tree becomes our first correction model :

f₁(x)

Here’s the decision tree trained on residuals :

The tree learns to approximate the residuals based on the features—trying to minimize the squared error.

Why This Matters:

This is the heart of boosting. Instead of trying to get the correct answer directly, we build a model that learns how wrong we were, and use that to correct ourselves.

Each subsequent model will repeat this :

Predict residuals → Train regression tree → Add corrections

Step 2(c) : Calculating γ for Each Leaf Node (Part 1)

Once we've trained the regression tree, the next task is to calculate the output value for each leaf node, denoted by :

Hessian (Denominator)

Next, we need the second derivative :

Final Explanation in Words

This gives the leaf output value γ for node 2.
Similarly, we repeat the same computation for every leaf node in the regression tree.
The computed γ for each node becomes the prediction/output that will be added to the ensemble model at the current boosting stage.

Step 2(d) : Updating the Model with the Tree Output

Now that we've trained our first regression tree on the residuals, it’s time to update our model.

Boosting Step – Model Update Rule

After fitting the regression tree to the residuals, we add its prediction to the current model’s output to improve it. This forms the basis of how boosting works :

f₁(x) = f₀(x) + output from decision tree

This updated model f₁(x) is now a better approximation of the true log-odds than f₀(x) , because it's been "boosted" by correcting its earlier mistakes .

Intuition

You’re not replacing the model. You’re stacking improvements .

Each tree tries to fix what the previous ensemble couldn’t .

Each step is a course correction, like GPS adjusting your route as you go off track .

That’s the essence of boosting: update → correct → repeat → converge .

Gradient Boosting For Classification - 2

Recent Posts

© 2025 Aryan Upadhyay |