
XGBoost Optimizations

  • Writer: Aryan
  • Sep 12
  • 8 min read

In this blog, we will learn how XGBoost handles high-dimensional data and why it is so fast on very large datasets.

We will explore:

  • How XGBoost is different from other boosting algorithms.

  • Its internal optimizations.

  • Why it was designed to be fast.


Exact Greedy Split Finding

 

To understand why XGBoost is fast, we first need to see why other boosting algorithms are slow.

Let’s take an example. Suppose we have two features (cgpa, package) and five students’ data:

| cgpa | package | pred | res1 |
|------|---------|------|------|
| 5    | 5       | 11   | 6    |
| 6    | 13      | 11   | -2   |
| 7    | 14      | 11   | -3   |
| 8    | 9       | 11   | 2    |
| 9    | 14      | 11   | -3   |

Step 1: Base Model

  • We start with the first model, which simply predicts the average of the output column (package).

  • This becomes our base prediction.

  • Then we calculate pseudo residuals.

Step 2: Add Decision Tree

  • Next, we add Model 2, which is a decision tree.

  • We first find the root node.

  • Then we search for the best splitting criteria.

Step 3: Splitting Process

  • Example: splitting on the cgpa feature.

  • For each possible split, we:

    1. Calculate similarity score.

    2. Calculate gain.

  • We repeat this for all possible splits across all features.


Complexity

  • If we have 1 feature with n values, we need to test (n – 1) splits.

  • If we have m features and n rows, we need to do this m × (n – 1) times.

Example: With 100 columns and 100,000 rows, we need to evaluate:

100 × (100,000 − 1) = 9,999,900 splits!

This makes training very slow on large datasets.

 

Why Is It Slow?

  • For every node, we test all possible splits.

  • At each split, we compute gain and check which is best.

  • This exhaustive testing guarantees the best split (highest accuracy), but it takes too long.

This method is called the Exact Greedy Algorithm.
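
To make this concrete, here is a minimal Python sketch of exact greedy split finding for one node (regression setting, with the regularization term λ set to 0). The function names and toy data are illustrative only; this is not XGBoost's actual implementation.

```python
import numpy as np

def similarity(residuals):
    # Similarity score with lambda set to 0: (sum of residuals)^2 / number of residuals
    return residuals.sum() ** 2 / len(residuals)

def exact_greedy_split(X, residuals):
    """Try every midpoint of every feature and return the split with the highest gain."""
    n_features = X.shape[1]
    best = (None, None, -np.inf)                 # (feature index, threshold, gain)
    parent_score = similarity(residuals)
    for f in range(n_features):
        values = np.unique(X[:, f])              # sorted unique values of this feature
        for lo, hi in zip(values[:-1], values[1:]):
            threshold = (lo + hi) / 2            # candidate split point
            left = X[:, f] < threshold
            gain = (similarity(residuals[left])
                    + similarity(residuals[~left])
                    - parent_score)
            if gain > best[2]:
                best = (f, threshold, gain)
    return best

# Toy data from the table above: cgpa and the residuals after the base (mean) prediction
X = np.array([[5.0], [6.0], [7.0], [8.0], [9.0]])
residuals = np.array([6.0, -2.0, -3.0, 2.0, -3.0])
print(exact_greedy_split(X, residuals))          # evaluates all (n - 1) = 4 candidate splits
```

The nested loops are exactly the m × (n − 1) work described above; every candidate split touches all the rows, which is what makes this approach slow at scale. (The residuals follow the table's convention, ŷ − y; the sign does not matter because the similarity score squares the sum.)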

 

Trade-Off

  • Advantage → Finds the absolute best split, leading to maximum accuracy.

  • Disadvantage → On large datasets, it becomes extremely slow.

This is the root cause of the slowness in traditional boosting algorithms.

And this is the intuition behind why XGBoost needed new optimizations.


Approximate Method for Split Finding

 

XGBoost also provides a technique called the Approximate Method, which makes the algorithm much faster.

Suppose we have the following dataset with two features (cgpa, package):

| cgpa | package |
|------|---------|
| 4    | 5       |
| 5    | 13      |
| 6    | 14      |
| 7    | 9       |
| 8    | 15      |
| 9    | 4       |

Exact Greedy vs. Approximate

 

In the Exact Greedy Algorithm, we try every possible splitting point.

For example, with the cgpa values above, the possible splits are:

4.5,5.5,6.5,7.5,8.5

That’s 5 possible splits.

But in the Approximate Method, we do not test every possible split. Instead, we use quantiles.

 

Using Quantiles

 

Suppose we divide the data into 3 quantiles (for illustration):

  • (4, 5)

  • (6, 7)

  • (8, 9)

Now, instead of testing all splits, we only check splits between quantile boundaries:

  • Between (4, 5) and (6, 7) → 5.5

  • Between (6, 7) and (8, 9) → 7.5

So, instead of 5 splits, we only test 2 splits (5.5 and 7.5).
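
As a rough illustration (toy Python, not XGBoost's internal code), the two candidate sets can be generated like this:

```python
import numpy as np

cgpa = np.sort(np.array([4, 5, 6, 7, 8, 9], dtype=float))

# Exact greedy: every midpoint between consecutive sorted values -> 5 candidates
exact_candidates = (cgpa[:-1] + cgpa[1:]) / 2
print(exact_candidates)                      # [4.5 5.5 6.5 7.5 8.5]

# Approximate: split the sorted values into 3 equal buckets and only keep
# the boundaries between neighbouring buckets -> 2 candidates
buckets = np.array_split(cgpa, 3)            # [4, 5], [6, 7], [8, 9]
approx_candidates = [(a[-1] + b[0]) / 2 for a, b in zip(buckets[:-1], buckets[1:])]
print(approx_candidates)                     # [5.5, 7.5]
```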


Why It Speeds Up

 

  • Using greedy method → 5 splitting points.

  • Using approximate method → only 2 splitting points.

For every feature, the number of splitting points is greatly reduced, which makes the algorithm much faster.

If we add another feature (say iq), it will also be split using quantiles, further reducing the number of splits to check.

 

Trade-Off

  • Advantage → Faster training (since we test fewer splits).

  • Disadvantage → May lead to a suboptimal solution, because we are not testing all possible splits.

If we want more accuracy, we can increase the number of quantiles.

  • More quantiles → Closer to greedy method (better performance).

  • But more quantiles → More splits to test (slower speed), as the example below illustrates.
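
In the xgboost library, this trade-off is exposed directly through the tree_method and max_bin parameters (these parameter names come from the library's documented API; the specific values below are only illustrative):

```python
import xgboost as xgb

# Exhaustive split search (exact greedy): most accurate, slowest on big data
exact_model = xgb.XGBRegressor(tree_method="exact")

# Histogram / quantile-based split search: far fewer candidate splits, much faster.
# Raising max_bin adds more quantile buckets -> closer to exact, but slower.
fast_model = xgb.XGBRegressor(tree_method="hist", max_bin=64)
```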


Key Idea

The Approximate Method reduces computation by splitting only at quantile boundaries.

  • It is faster than the greedy method.

  • But may give a less-than-optimal split (trade-off between speed and accuracy).

 

 

Step-by-Step Working of the Approximate Method in XGBoost

 

Let’s go step by step to understand how the Approximate Split Finding Algorithm works.


Dataset Example:

| cgpa | package |
|------|---------|
| 4    | 5       |
| 5    | 13      |
| 6    | 14      |
| 7    | 9       |
| 8    | 15      |
| 9    | 4       |

 Here, gᵢ ​and hᵢ ​are calculated for every point, while Gᵢ and Hᵢ ​are the aggregated values for quantiles.


Step 1 → Decide Number of Quantiles

We divide the dataset into buckets (quantiles).

Suppose we choose 3 buckets.

 

Step 2 → Calculate Gradient & Hessian for Each Data Point

For regression, the formulas are:

gᵢ = ŷᵢ - yᵢ   ,   hᵢ = 1
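
These formulas correspond to the usual (half) squared-error objective; as a quick check (a standard derivation, not spelled out in the original post), differentiating that loss with respect to the prediction gives exactly these values:

```latex
l(y_i, \hat{y}_i) = \tfrac{1}{2}\,(\hat{y}_i - y_i)^2
\;\Rightarrow\;
g_i = \frac{\partial l}{\partial \hat{y}_i} = \hat{y}_i - y_i,
\qquad
h_i = \frac{\partial^2 l}{\partial \hat{y}_i^2} = 1
```

Applying this to the toy data gives the table below.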

| cgpa | package | pred | gᵢ | hᵢ |
|------|---------|------|----|----|
| 4    | 5       | 10   | 5  | 1  |
| 5    | 13      | 10   | -3 | 1  |
| 6    | 14      | 10   | -4 | 1  |
| 7    | 9       | 10   | 1  | 1  |
| 8    | 15      | 10   | -5 | 1  |
| 9    | 4       | 10   | 6  | 1  |

Step 3 → Build Histogram of Quantiles

We group the data into 3 buckets:

  • Bucket 1 → (4, 5) → G₁ = 5+(−3) = 2 , H₁ = 2

  • Bucket 2 → (6, 7) → G₂ = −4+1 = −3 , H₂ = 2

  • Bucket 3 → (8, 9) → G₃ = −5+6 = 1 , H₃​ = 2

Now we have aggregated values:

  • G₁ , G₂ , G₃​

  • H₁ , H₂ , H₃​

This histogram representation allows us to store only quantile-level values, instead of all data points.
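
A minimal Python sketch of Steps 2 and 3 on the toy data above (an illustration of the idea, not XGBoost's histogram code):

```python
import numpy as np

cgpa    = np.array([4, 5, 6, 7, 8, 9], dtype=float)
package = np.array([5, 13, 14, 9, 15, 4], dtype=float)

# Step 2: base prediction = mean of the target, then per-point gradient and hessian
pred = np.full_like(package, package.mean())    # 10.0 for every row
g = pred - package                              # [ 5. -3. -4.  1. -5.  6.]
h = np.ones_like(package)                       # squared-error loss -> h_i = 1

# Step 3: rows are already sorted by cgpa; put two rows in each of the 3 buckets
bucket_id = np.repeat(np.arange(3), 2)          # [0 0 1 1 2 2]
G = np.array([g[bucket_id == b].sum() for b in range(3)])   # [ 2. -3.  1.]
H = np.array([h[bucket_id == b].sum() for b in range(3)])   # [ 2.  2.  2.]
print(G, H)
```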

 

Step 4 → Try Different Splits

From earlier, our possible splits are 5.5 and 7.5.

Note: For illustration purposes, we ignore the regularization terms (λ,γ).


Split 1: cgpa < 5.5

  • Left node → (4, 5) → Bucket 1

  • Right node → (6, 7, 8, 9) → Buckets 2 + 3

Similarity Scores (with λ = 0):

  • Left node → G₁² / H₁ = 2² / 2 = 2

  • Right node → (G₂ + G₃)² / (H₂ + H₃) = (−3 + 1)² / 4 = 1

  • Gain = 2 + 1 − 0 = 3 (the root's similarity is (G₁ + G₂ + G₃)² / (H₁ + H₂ + H₃) = 0² / 6 = 0)

Split 2: cgpa < 7.5

  • Left node → (4, 5, 6, 7) → Buckets 1 + 2

  • Right node → (8, 9) → Bucket 3

Similarity Scores (with λ = 0):

  • Left node → (G₁ + G₂)² / (H₁ + H₂) = (−1)² / 4 = 0.25

  • Right node → G₃² / H₃ = 1² / 2 = 0.5

  • Gain = 0.25 + 0.5 − 0 = 0.75

We pick the split with the highest gain; here that is cgpa < 5.5 (gain 3 vs. 0.75).
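
Continuing the same toy example, here is a short sketch of how both candidate splits can be scored purely from the bucket aggregates (λ and γ set to 0, as in the text):

```python
# Aggregates from Step 3 (see the sketch above)
G = [2, -3, 1]
H = [2,  2, 2]

def similarity(G_sum, H_sum, lam=0.0):
    # Similarity score = G^2 / (H + lambda)
    return G_sum ** 2 / (H_sum + lam)

parent = similarity(sum(G), sum(H))        # 0.0 for this data

# Split 1: bucket 1 | buckets 2+3  (cgpa < 5.5)
gain_1 = similarity(G[0], H[0]) + similarity(G[1] + G[2], H[1] + H[2]) - parent   # 2 + 1 - 0 = 3

# Split 2: buckets 1+2 | bucket 3  (cgpa < 7.5)
gain_2 = similarity(G[0] + G[1], H[0] + H[1]) + similarity(G[2], H[2]) - parent   # 0.25 + 0.5 - 0 = 0.75

print(gain_1, gain_2)   # the split at cgpa < 5.5 wins
```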

 

Key Optimization → Precalculation & Reuse

  • By aggregating gradients and hessians per quantile, we compute them once and simply reuse those values when testing different splits.

  • For large datasets (e.g., 1 lakh (100,000) rows and 100 quantiles), we calculate the aggregates once and reuse them → saving huge computation time.

 

Step 5 → Parallelization

  • Gradients and Hessians of different quantiles are independent.

  • This means they can be computed in parallel.

  • With multiple CPU cores, each core can handle different quantiles.

  • For very large datasets, we can also use distributed computing.

This is a major optimization — making training much faster.
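
A rough illustration of the idea (a toy Python sketch, not XGBoost's multi-threaded C++ code): each worker aggregates the G and H of its own bucket independently, and the results are simply collected at the end.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

g = np.array([5, -3, -4, 1, -5, 6], dtype=float)
h = np.ones_like(g)
bucket_id = np.repeat(np.arange(3), 2)    # bucket of each (sorted) row

def aggregate(b):
    # Each bucket's (G, H) depends only on its own rows -> safe to run in parallel
    mask = bucket_id == b
    return float(g[mask].sum()), float(h[mask].sum())

with ThreadPoolExecutor(max_workers=3) as pool:
    G_H = list(pool.map(aggregate, range(3)))

print(G_H)    # [(2.0, 2.0), (-3.0, 2.0), (1.0, 2.0)]
```

In the library itself, the number of threads is controlled by the nthread parameter (n_jobs in the scikit-learn wrapper).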


Step 6 → Cache Memory Optimization

  • Normally, gradients and hessians are stored in RAM.

  • But XGBoost keeps frequently used values (the quantile-level Gᵢ, Hᵢ) laid out so that they stay in the CPU cache, which is much faster to access than RAM.

  • This avoids recomputation and speeds up the process.

 

Final Insights

XGBoost achieves high speed because of four key optimizations:

  1. Quantile-based splits → fewer split points.

  2. Precalculated histograms → reuse gradients & hessians.

  3. Parallel & distributed computing → faster execution.

  4. Cache memory usage → faster access to frequently used values.

That’s why the name is XGBoost → eXtreme Gradient Boosting.


Quantile Sketch

 

Step-by-step (high level)

  1. Choose the number of quantiles (buckets).

  2. Calculate gradient gᵢ ​and hessian hᵢ ​for every point.

  3. Aggregate gradients and hessians into quantile bins to build a histogram.

    3(a). Optionally store the aggregated bin values in cache memory.

  4. Try candidate splitting criteria using the quantile-level aggregates.

 

The approximate split-finding algorithm begins by deciding how many quantiles to use and then estimating those quantiles. But what if the dataset is huge and cannot be stored or sorted on a single machine (for example, hundreds of millions of rows or 100+ GB of data)? Computing exact quantiles normally requires sorting the dataset, which is expensive or even infeasible on a single machine. That is where quantile sketches come in.

 

A sketch is a compact data structure / streaming algorithm that summarizes a large data stream or dataset with limited memory. A quantile sketch estimates the data’s quantiles without sorting or storing all points. XGBoost uses a quantile-sketch-based approach to find approximate split candidates efficiently.


Why quantile sketches are useful

 

  • Efficiency — They compute approximate quantiles without sorting the whole dataset, reducing time and CPU work.

  • Scalability — By summarizing the distribution compactly, sketches let XGBoost handle very large datasets.

  • Low memory usage — Sketches avoid keeping every data point in memory; they store a small summary instead.

  • Split selection — Sketches provide effective candidate split points (quantile boundaries) for tree construction.

  • Distributed processing — Sketches computed on different data chunks can be merged. This enables XGBoost to approximate split points when data is distributed across multiple machines.

  • Streaming — Sketches work in streaming scenarios where data arrives online and you don’t have the entire dataset upfront.


How this helps in practice

 

When data is too large for one machine, you can partition it into chunks and run a sketch on each chunk (this is common in distributed computing). The per-chunk sketches are merged to form a global sketch — giving approximate quantiles without ever sorting the full dataset on a single node. This is the key optimization for very large or streaming datasets.
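
As a toy illustration of the merge idea (this is not the GK-style weighted sketch XGBoost actually implements, just the general pattern): each chunk keeps only a small summary of its values, the summaries are merged, and approximate quantiles are read off the merged summary.

```python
import numpy as np

def chunk_summary(values, size=100):
    # Keep only a small, evenly spaced sample of each chunk's sorted values
    values = np.sort(values)
    idx = np.linspace(0, len(values) - 1, num=min(size, len(values))).astype(int)
    return values[idx]

def merged_quantiles(summaries, q):
    # Merge the tiny per-chunk summaries and read approximate quantiles from them
    merged = np.sort(np.concatenate(summaries))
    return np.quantile(merged, q)

# Simulate data spread across 4 machines / chunks (equal-sized chunks assumed)
rng = np.random.default_rng(0)
chunks = [rng.normal(size=250_000) for _ in range(4)]

summaries = [chunk_summary(c) for c in chunks]           # 100 numbers per chunk
print(merged_quantiles(summaries, [0.25, 0.5, 0.75]))    # roughly [-0.67, 0.0, 0.67]
```

The full dataset is never sorted in one place; only the small summaries are moved and merged, which is exactly what makes this approach suitable for distributed and streaming settings.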

 

Final note

 

Quantile sketching is one of XGBoost’s major optimizations (especially for huge datasets). It lets the algorithm find good split candidates quickly and with small memory overhead — enabling the other optimizations (quantile histograms, precomputed G/H, parallel computation, and cache use) to scale in real-world, large-scale settings.


Weighted Quantile Sketch

The Weighted Quantile Sketch is an improvement over the basic quantile sketch.

 

1. Normal Quantile Sketch

Suppose we have 1 crore (10 million) rows and we want to calculate 100 quantiles.

  • A normal quantile sketch divides the dataset into 100 bins, each with the same number of data points → here, 1 lakh (100,000) points per bin.

  • Key idea → All bins contain equal counts of data points.

 

2. Problem with Normal Quantile

Not all data points are equally important.

  • Some points contribute more to the learning process.

  • Treating every point as equal (same weight) may ignore this.

 

3. Weighted Quantile Sketch

Here, every row/data point is assigned a weight. Instead of splitting by equal counts, we split by equal total weight.

  • In classification, we use Hessian values as weights.

  • For the binary logistic objective (log-loss, with pᵢ = σ(fᵢ)), the Hessian is:

                           hᵢ = pᵢ ⋅ (1 − pᵢ)

where pᵢ is the predicted probability.

  • If the model is uncertain about a point (e.g., probability ≈ 0.5), the Hessian is large (close to 0.25).

  • If the model is very confident (probability ≈ 0 or 1), the Hessian is small (close to 0).

  • Thus, Hessians capture how important or uncertain a point is.

So in the weighted quantile sketch, we treat Hessians as weights.
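
As a tiny sketch (toy Python, assuming the logistic objective above), the gradients and Hessians used in the example of the next section can be computed like this:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1, 0.8, 0.3, 0.5])   # predicted probabilities (see the table below)
y = np.array([1,   0,   0,   1,   0,   1  ])   # true labels

g = p - y          # gradient of log-loss w.r.t. the raw score
h = p * (1 - p)    # hessian -> used as the weight of each row
print(g)           # [-0.3  0.2  0.1 -0.2  0.3 -0.5]
print(h)           # [0.21 0.16 0.09 0.16 0.21 0.25]
```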

 

4. Example

Suppose we have this dataset:

| cgpa | placement | prediction | gradient | hessian |
|------|-----------|------------|----------|---------|
| 3    | 1         | 0.7        | -0.3     | 0.21    |
| 4    | 0         | 0.2        | 0.2      | 0.16    |
| 5    | 0         | 0.1        | 0.1      | 0.09    |
| 6    | 1         | 0.8        | -0.2     | 0.16    |
| 4    | 0         | 0.3        | 0.3      | 0.21    |
| 8    | 1         | 0.5        | -0.5     | 0.25    |

 Now, let’s create 3 quantile bins.

  • Normal quantile binning → sort values, then divide into equal counts (e.g., 2-2-2 points per bin).

  • Weighted quantile binning → sort values, but divide based on equal sum of Hessians, not counts.

Sorted by cgpa, with Hessians:

  • cgpa 3 → 0.21

  • cgpa 4 → 0.16

  • cgpa 4 → 0.21

  • cgpa 5 → 0.09

  • cgpa 6 → 0.16

  • cgpa 8 → 0.25

Total Hessian weight = 1.08 → each bin ≈ 0.36.

So bins look like this:

  • Bin 1 → (0.21, 0.16) → total ≈ 0.37

  • Bin 2 → (0.21, 0.09, 0.16) → total ≈ 0.46

  • Bin 3 → (0.25) → total ≈ 0.25

Notice:

  • Bin sizes (number of points) are different.

  • But the sum of weights per bin is roughly balanced (the sketch below reproduces this binning).
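
A minimal sketch of the weighted binning above (toy code, assuming the Hessian values from the table): sort by the feature, accumulate Hessian weight, and start a new bin every time roughly one-third of the total weight has been used.

```python
import numpy as np

# Sorted by cgpa, with the Hessians from the table above
cgpa    = np.array([3, 4, 4, 5, 6, 8], dtype=float)
hessian = np.array([0.21, 0.16, 0.21, 0.09, 0.16, 0.25])

n_bins = 3
target = hessian.sum() / n_bins          # ~0.36 of total weight per bin

bins, current, acc = [], [], 0.0
for value, weight in zip(cgpa, hessian):
    current.append(float(value))
    acc += weight
    if acc >= target and len(bins) < n_bins - 1:
        bins.append(current)             # close this bin once its weight reaches ~target
        current, acc = [], 0.0
bins.append(current)                     # the last bin takes whatever is left

print(bins)   # [[3.0, 4.0], [4.0, 5.0, 6.0], [8.0]] -> weights ~0.37, 0.46, 0.25
```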

 

5. Key Difference

  • Normal quantile → Equal counts per bin.

  • Weighted quantile → Equal weight sum per bin.

This ensures the model focuses more on points with high Hessians (uncertain points).

 

6. Regression vs Classification

  • In regression (squared-error loss), all Hessians = 1 → every point has equal weight → the weighted quantile sketch adds nothing (it reduces to the normal quantile sketch).

  • In classification, Hessians vary → weighted quantile is very useful.

 

7. Improvement Timeline in XGBoost

  1. Reduced number of splits (approximate method).

  2. Compute gradients and Hessians for every point.

  3. Aggregate gradients and Hessians for quantiles.

  4. Cache memory optimization.

  5. Sketch algorithm for quantiles.

  6. Weighted quantile sketch (performance boost, especially for classification).


