
GOSS Explained: How LightGBM Achieves Faster Training Without Sacrificing Accuracy

  • Writer: Aryan
  • Sep 19
  • 4 min read

Introduction to GOSS (Gradient-based One-Side Sampling) in LightGBM

 

GOSS is an intelligent data-sampling technique in LightGBM that speeds up training without sacrificing accuracy.

In gradient-boosting models, there is always a trade-off:

  • Speed – faster training by using a smaller subset of data.

  • Accuracy – better performance by using the full dataset.

Traditional random sampling can accelerate training but often causes information loss, which leads to performance degradation.

GOSS overcomes this by selecting data points intelligently rather than randomly.


Why GOSS Is Important in LightGBM

LightGBM introduces two key innovations:

  1. GOSS (Gradient-based One-Side Sampling) → reduces the number of rows (data points).

  2. EFB (Exclusive Feature Bundling) → reduces the number of columns (features).

Together, they let LightGBM train on very large datasets (e.g., 10 GB) far faster than XGBoost while maintaining high accuracy.

For example, if XGBoost takes 5 hours, LightGBM may finish in about 1 hour.

  • Row reduction → GOSS

  • Feature reduction → EFB

This dual optimization makes LightGBM both faster and more memory-efficient.


Role of Sampling in Boosting

Consider a dataset of 1,000 rows.

Instead of using all the data, we might randomly sample 70 % (700 rows) and train each tree on that subset, repeating the process for every boosting round.

Benefits of random sampling

  • Faster training (less data per tree).

  • Lower risk of overfitting (different trees see different subsets).

Drawback

  • Possible information loss, which can reduce accuracy.

Key innovation of GOSS

Instead of discarding data at random, GOSS speeds up training without degrading performance by selecting points according to their importance.


Core Idea of GOSS

Not all samples are equally valuable during training.

In boosting, the gradient (error/residual) indicates how important a data point is:

  • Large gradient → the model struggles on this point → very important.

  • Small gradient → the model fits this point well → less important.

GOSS therefore:

  • Keeps all points with large gradients (high importance).

  • Randomly samples a subset of small-gradient points (low importance).

This balances speed and accuracy:

  • Focuses on important points to maintain model quality.

  • Uses fewer unimportant points to save time.
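
As a rough illustration of this selection rule, here is a toy NumPy sketch (not LightGBM's internal code): given one gradient per training row and the two fractions, keep all large-|gradient| rows and a random subset of the rest.

```python
import numpy as np

def goss_select(grads, top_rate, other_rate, rng=None):
    """Toy GOSS-style row selection: keep all large-|gradient| rows,
    plus a random subset of the small-|gradient| rows."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(grads)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(np.abs(grads))[::-1]    # rows ranked by |gradient|, largest first
    top_idx = order[:n_top]                    # always kept (the "hard" rows)
    small_idx = rng.choice(order[n_top:], size=n_other, replace=False)  # random sample of "easy" rows

    return np.concatenate([top_idx, small_idx])
```

The point is simply that selection is driven by gradient magnitude, with randomness applied only on the low-gradient side — hence "one-side" sampling.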

 

Why Use Gradients Instead of Hessians

Both gradients and Hessians are available in gradient boosting, but:

  • Hessians are often constant (e.g., always 1 for the squared-error loss used in regression) and offer little information about point importance.

  • Gradients directly reflect how wrong the model is on each data point.

Hence, GOSS relies on gradients as a more reliable importance signal.
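
For instance, with the squared-error loss L = ½(prediction − actual)², the gradient is prediction − actual and so varies row by row, while the second derivative is the constant 1. A quick check, using the CGPA/Package numbers from the worked example later in this post:

```python
import numpy as np

actual = np.array([8.0, 6.0, 5.0, 13.0])     # true Package values
pred   = np.array([6.0, 3.0, 8.0, 25.0])     # current model predictions

grad = pred - actual        # first derivative of 0.5*(pred - actual)**2: differs per row
hess = np.ones_like(pred)   # second derivative: constant 1, so no importance signal

print(grad)   # [-2. -3.  3. 12.]
print(hess)   # [1. 1. 1. 1.]
```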


Example for Intuition

Suppose we predict salaries from CGPA.

After training a weak learner, we compute residuals (gradients) and sort them: 5, 4, 3, 2, 1.

  • Points with gradients 5 and 4 are most important → definitely kept.

  • Points with gradients 2 and 1 are less important → only a random fraction is kept.

This ensures the next model focuses on hard-to-learn examples while preserving overall data distribution.

 

The Distribution Problem

If we kept only the largest-gradient points and ignored the rest, the training set would lose diversity and distort the data distribution.

  • Random subsampling → preserves distribution but may drop important points.

  • Pure gradient selection → keeps important points but creates a biased distribution.

GOSS combines both:

  • Retains all large-gradient points.

  • Randomly samples small-gradient points to maintain distribution.

In short

  • Random sampling → fast but less accurate.

  • Pure gradient sampling → accurate but biased.

  • GOSS → fast and accurate.


Gradient-based One-Side Sampling (GOSS) in LightGBM

 

Motivation

In boosting algorithms, training on large datasets can be slow.

A common way to speed up training is row sampling—using only a subset of data for each tree.

  • Pros: Faster training, reduced overfitting.

  • Cons: Random subsampling may discard important data points, causing information loss and lower accuracy.

LightGBM introduces GOSS (Gradient-based One-Side Sampling) to speed up training without sacrificing model accuracy.


Core Idea

Not all training samples are equally important during boosting.

  • Large-gradient samples → Points where the model predicts poorly → high importance.

  • Small-gradient samples → Points already well-predicted → lower importance.

GOSS keeps all large-gradient samples and randomly selects only a small subset of small-gradient samples.

This ensures:

  • The model focuses on hard-to-learn points.

  • Overall data distribution is preserved by including some easy points.


Step-by-Step Example

Suppose we have a dataset of students with CGPA (input) and Package (output).

CGPA | Package | Prediction | Residual / Gradient
---- | ------- | ---------- | -------------------
8    | 8       | 6          | -2
6    | 6       | 3          | -3
5    | 5       | 8          | 3
4    | 13      | 25         | 12

  1. Compute gradients for all points.

  2. Take the absolute gradients and sort them: 12, 3, 3, 2, …

  3. Choose two parameters:

    • top_rate (a) – Fraction of highest-gradient samples to keep (e.g., 0.3).

    • other_rate (b) – Fraction of low-gradient samples to randomly keep (e.g., 0.15).

For a dataset of 20 samples:

  • Keep a × 20 = 6 highest-gradient points.

  • Randomly select b × 20 = 3 from the remaining 14 points.

  • Total training subset = 9 samples for this tree.
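
A quick arithmetic check of these counts (plain Python, values from the example above):

```python
n, a, b = 20, 0.3, 0.15                   # dataset size, top_rate, other_rate
n_top = int(a * n)                        # 6 highest-|gradient| rows, always kept
n_other = int(b * n)                      # 3 rows drawn at random from the remaining 14
print(n_top, n_other, n_top + n_other)    # 6 3 9
```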


Upsampling Low-Gradient Points

Since we downsampled low-gradient points, we must re-balance their contribution.

Each selected low-gradient sample is given a higher weight so that the subset behaves like the original dataset.

The weight factor is:

(1 − a) / b

For example, with a = 0.3 and b = 0.15:

(1 − 0.3) / 0.15 = 0.7 / 0.15 ≈ 4.67

Thus, each low-gradient sample’s gradient and Hessian are multiplied by roughly 4.67 to preserve the original data distribution.
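
A minimal sketch of this re-weighting, assuming the 9-row subset from above with the 3 low-gradient rows marked by a mask (the gradient values here are illustrative placeholders, not from the table):

```python
import numpy as np

a, b = 0.3, 0.15
weight = (1 - a) / b                          # (1 - 0.3) / 0.15 ≈ 4.67

# Gradients/Hessians of the 9 sampled rows: 6 large-gradient + 3 small-gradient (illustrative values)
grad = np.array([12.0, 9.0, 7.0, 5.0, 4.0, 3.0, 0.6, 0.4, 0.2])
hess = np.ones_like(grad)
small_mask = np.array([False] * 6 + [True] * 3)   # the 3 randomly kept low-gradient rows

# Scale up the low-gradient rows so the subset mimics the full data distribution
grad[small_mask] *= weight
hess[small_mask] *= weight
```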


Iterative Boosting Process

  1. Train a tree on the sampled subset (e.g., 9 points).

  2. Predict on the full dataset (20 points).

  3. Compute new gradients.

  4. Apply GOSS again (using top_rate and other_rate).

  5. Train the next tree.

Repeat until boosting completes.
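
Putting these steps together, a schematic loop might look like the sketch below. It is illustrative only, reusing the goss_select helper sketched earlier in this post and a plain decision tree as the weak learner; it is not LightGBM's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_with_goss(X, y, n_rounds=50, top_rate=0.3, other_rate=0.15, lr=0.1):
    """Schematic GOSS boosting loop for squared-error regression."""
    pred = np.zeros(len(y), dtype=float)
    weight = (1 - top_rate) / other_rate
    trees = []
    for _ in range(n_rounds):
        grad = pred - y                                  # gradients computed on the FULL dataset
        idx = goss_select(grad, top_rate, other_rate)    # GOSS row sampling for this round
        w = np.ones(len(idx))
        w[int(top_rate * len(y)):] = weight              # up-weight the randomly kept low-gradient rows
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X[idx], -grad[idx], sample_weight=w)    # train the next tree on the sampled subset
        pred += lr * tree.predict(X)                     # update predictions on ALL rows
        trees.append(tree)
    return trees
```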

 

Implementation in LightGBM

To enable GOSS in LightGBM, set:

  • boosting_type = "goss"

  • top_rate = a (fraction of high-gradient samples)

  • other_rate = b (fraction of low-gradient samples)
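
For example, with the scikit-learn style API (a usage sketch with synthetic data; the rates would normally be tuned, and recent LightGBM releases also expose GOSS via data_sample_strategy = "goss"):

```python
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset
X, y = make_regression(n_samples=10_000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = lgb.LGBMRegressor(
    boosting_type="goss",   # enable GOSS row sampling
    top_rate=0.3,           # a: fraction of largest-|gradient| rows always kept
    other_rate=0.15,        # b: fraction of remaining rows sampled at random
    n_estimators=200,
    learning_rate=0.1,
)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```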

 

Why GOSS Is Effective

  • Speed: Fewer rows per tree → much faster training.

  • Accuracy: Retains all important samples and re-balances low-gradient ones → no significant loss of performance.

  • Better than random sampling: Avoids the information loss common in purely random subsampling.

 

GOSS is one of the key innovations that makes LightGBM faster and more memory-efficient than XGBoost while maintaining high accuracy.
