Exclusive Feature Bundling (EFB) in LightGBM: Boost Speed & Reduce Memory Usage

Aryan
Sep 21
7 min read

EFB (Exclusive Feature Bundling)

One of the most important aspects of LightGBM is its ability to increase training speed and reduce memory usage.

To achieve this, the data footprint must be reduced—either by lowering the number of rows or columns:

Rows can be reduced using techniques like GOSS (Gradient-based One-Side Sampling), which performs intelligent sampling of instances.
Columns can be reduced using EFB (Exclusive Feature Bundling), which we focus on here.

Suppose a dataset contains 200 features.

EFB can bundle them into roughly 150 features (or fewer) while ensuring no significant loss of performance.

The algorithm reduces the number of features but maintains model accuracy and efficiency.

Purpose of EFB

Exclusive Feature Bundling reduces the dimensionality of a dataset by combining sparse and mutually exclusive features into single bundled features.

This is especially useful when working with datasets containing a large number of features where many do not co-occur.

Mechanism of Action

EFB identifies features that are exclusive—meaning their non-zero values rarely overlap with the non-zero values of other features.

These features are then bundled together, with each original feature occupying a unique value range created by adding small offsets.

Benefits for Machine Learning

By decreasing the number of features:

Computation and memory usage are reduced.
Training speed increases.
The model is less prone to overfitting because of reduced complexity.

Impact on Performance

EFB preserves the informational integrity of the original features.

The machine learning model can still differentiate the contribution of each feature even after bundling.

Application Context

This technique is particularly effective for high-dimensional, sparse datasets (e.g., click-through rate prediction, web-scale data) where many features are mostly zeros.

The Core Idea: Bundling Sparse Features

The EFB technique groups, or "bundles," multiple features into a single new feature. This method is specifically designed for datasets with many sparse features—features that consist mostly of zero values.

The key is to bundle features that are mutually exclusive, meaning they are rarely non-zero at the same time for the same row. Since their non-zero values don't overlap, their information can be safely combined into one feature.

	F0	F1	F2	F3	F4
i0	1	1	0	0	1
i1	0	0	1	1	1
i2	1	2	0	0	2
i3	0	0	2	3	1
i4	2	1	0	0	3
i5	3	3	0	0	1
i6	0	0	3	0	2
i7	1	2	3	4	3
i8	1	0	1	0	0
i9	2	3	0	0	2

For instance, in the table above, look at features F0 and F3. In almost every row, if one feature is non-zero, the other is zero. While not perfectly exclusive , their low number of conflicts makes them excellent candidates for bundling.

This process is considered nearly lossless because it preserves almost all the original information. A minor information loss only occurs when bundling features that are not perfectly exclusive (i.e., both have non-zero values in the same row), which we call a "conflict."

How EFB Works: A Two-Stage Process

The EFB process operates in two main stages to efficiently reduce the number of features:

Stage 1: Finding Feature Bundles The first goal is to identify which sparse features can be grouped together. The algorithm doesn't bundle all features at once. Instead, it strategically decides which features to group. For example, out of five features (F0 to F4), it might create two bundles like (F0, F3) and (F2, F4), leaving F1 separate. This initial grouping is typically handled by a Greedy Bundling algorithm.
Stage 2: Merging the Bundles Once the groups are decided, the next step is to merge the features within each bundle into a single new feature. This is performed by a Merge Exclusive Features algorithm, which cleverly combines the feature values while preserving their original information.

These two stages work together to reduce dimensionality while maintaining model performance.

Stage 1 in Detail: The Greedy Bundling Algorithm

This algorithm determines which features to group together. It follows a data-driven approach based on how often features conflict with each other.

	F0	F1	F2	F3	F4
i0	1	1	0	0	1
i1	0	0	1	1	1
i2	1	2	0	0	2
i3	0	0	2	3	1
i4	2	1	0	0	3
i5	3	3	0	0	1
i6	0	0	3	0	2
i7	1	2	3	4	3
i8	1	0	1	0	0
i9	2	3	0	0	2

Step 1: Count the Conflicts

First, we scan the dataset to count the number of conflicts between every pair of features. A conflict occurs when two features both have non-zero values in the same row. This information is stored in a conflict count matrix.

	F0	F1	F2	F3	F4
F0		6	2	1	6
F1	6		1	1	6
F2	2	1		3	4
F3	1	1	3		3
F4	6	6	4	3
Σ	15	14	10	8	19

Step 2: Build a Conflict Graph

Next, we convert this matrix into a weighted graph. In this graph:

Each node represents a feature.
An edge connects two nodes, and its weight is the conflict count between those two features.

Step 3: Apply the Conflict Threshold (k)

We now introduce a hyperparameter, k, which defines the maximum number of conflicts allowed within a single bundle. This threshold allows us to control the trade-off between bundling more features (higher performance) and losing information (lower accuracy).

For this example, let's set k = 1. We remove all edges from the graph with a weight greater than 1, leaving only the connections that meet our threshold.

Step 4: Form Bundles Greedily

Using the simplified graph, the algorithm greedily creates bundles:

It starts with the first feature, F0, and looks for a feature it can bundle with. It finds F3 (conflict = 1), creating the first bundle: {F0, F3}.
Next, it considers F1. F1 cannot join the first bundle because its conflict with F0 is 6, which is greater than k. Therefore, the algorithm creates a new bundle for F1. It then finds that F2 can join this new bundle (conflict = 1), creating the second bundle: {F1, F2}.
Finally, F4 has no connections in the simplified graph, so it remains in its own bundle: {F4}.

The result of this stage is three final feature groups: {F0, F3}, {F1, F2}, and {F4}. The total number of features has been successfully reduced from five to three.

Stage 2: Merge Exclusive Features (How to Bundle Them ?)

After Stage 1, we have our plan: bundle (F0, F3) and (F1, F2), leaving F4 separate. The second stage addresses how to merge the features within a bundle into a single new feature.

The key is to combine them in a way that preserves the identity of the original features. LightGBM's histogram-based algorithm works by creating bins and finding the best split points. The "Merge Exclusive Features" algorithm constructs the new bundled feature so that the values from the original features fall into different, exclusive bins.

The Offset Method

To keep the information from each feature separate, the algorithm uses an offset technique. It effectively shifts the value range of one feature to begin after the other one ends.

Let's use a simple analogy. Suppose we want to bundle two features:

Feature A, with values in the range [0, 10].
Feature B, with values in the range [0, 20].
Define an Offset: We take the maximum value of the first feature (10) and use it as an offset for the second feature.
Apply the Offset:
- The values for Feature A remain in their original range, [0, 10].
- We add the offset to every value from Feature B. This maps its values to a new, non-overlapping range: [0+10, 20+10], which becomes [10, 30].
Create the Bundle: The new, combined feature now has a total range of [0, 30].

When the model builds a histogram on this new feature, all values for Feature A will be below 10, and all values for Feature B will be above 10. This simple addition of an offset ensures the information from both original features is fully retained and usable for finding the best splits.

Merge Exclusive Features: A Detailed Walkthrough

	j=1	j=2
	F0	F3
i=1	1	0
i=2	0	1
i=3	1	0
i=4	0	3
i=5	2	0
i=6	3	0
i=7	0	0
i=8	1	4
i=9	1	0
i=10	2	0

Let's apply the offset method to our bundle of features F0 and F3.

Step 1: Determine Bins and Offsets

First, the algorithm establishes the value ranges for each feature to calculate the necessary offsets.

It finds the maximum value (or bin) for each feature in the bundle.
- max(F0) = 3
- max(F3) = 4
It creates a binRanges list to store the cumulative sum of these maximums, starting with 0. This list will define our offsets.
- Start with binRanges = [0]
- Add max(F0): binRanges = [0, 3]
- Add max(F3) to the last element: binRanges = [0, 3, 3+4] which becomes [0, 3, 7]

From this, we get our offsets:

Offset for F0 (the 1st feature) = binRanges[0] = 0
Offset for F3 (the 2nd feature) = binRanges[1] = 3

Step 2: Construct the New Feature (newBin)

Now, we create the new feature row by row. The rule is simple: for any non-zero value, we add its feature's offset. new_value = original_value + offset

Row(i)	F0	F3	Calculation Breakdown	Final newBin Value
1	1	0	1+0(offset for F0)	1
2	0	1	1+3(offset for F3)	4
3	1	0	1+0(offset for F0)	1
4	0	3	3+3(offset for F3)	6
5	2	0	2+0(offset for F0)	2
6	3	0	3+0(offset for F0)	3
7	0	0	0	0
8	1	4	(F0:1+0=1) then (F3:4+3=7)	7
9	1	0	1+0 (offset for F0)	1
10	2	0	2+0 (offset for F0)	2

Results and Interpretation

The newBin column is the new feature that will be used to train the model.

As you can see:

Whenever only F0 is non-zero, the newBin value is between 1 and 3.
Whenever only F3 is non-zero, the newBin value is between 4 and 7.

The conflict row (i=8) is a special case. Since both features are non-zero, the algorithm processes them sequentially. The value from F0 is calculated first (1+0=1), but it is then overwritten by the value from F3 (4+3=7). This is a clear example of the minor information loss that occurs during a conflict and highlights why the algorithm's first stage aims to create bundles with as few conflicts as possible.

This new feature is a powerful example of feature engineering, successfully retaining information from two separate columns in one.

Making Predictions with Bundled Features

The same bundling logic used during training must be applied to new data at prediction time. The model, having learned the specific value ranges for each original feature, can correctly interpret the transformed input.

Let's see two examples using our (F0, F3) bundle:

Scenario 1: New data is F0 = 3, F3 = 0
- Transformation: The value is transformed using F0's offset: 3 + 0 = 3.
- Model Action: The model is given the input 3. Since it learned that values in the range (0, 3] correspond to F0, it makes its prediction accordingly.
Scenario 2: New data is F0 = 0, F3 = 3
- Transformation: The value is transformed using F3's offset: 3 + 3 = 6.
- Model Action: The model receives the input 6. It knows this value falls within the range corresponding to F3 and uses that information to make its prediction.

This seamless transformation process at both training and prediction time is what makes Exclusive Feature Bundling a highly effective and efficient optimization technique for sparse datasets.

Exclusive Feature Bundling (EFB) in LightGBM: Boost Speed & Reduce Memory Usage

Recent Posts

© 2025 Aryan Upadhyay |