
Random Forest Part - 1

  • Writer: Aryan
  • May 25
  • 10 min read

Introduction to Random Forest

 

Random Forest is a versatile and widely used machine learning algorithm that belongs to the class of ensemble methods. More specifically, it is a type of bagging (Bootstrap Aggregating) technique. The core idea behind Random Forest is to construct a "forest" of many individual decision trees during training.

The "forest" in "Random Forest" directly refers to the collection of these multiple decision trees. Each tree in the forest is trained independently on a bootstrap sample of the data. After training, the predictions from all the individual decision trees are combined to produce a more robust and accurate final prediction. For classification tasks, this is often done by majority voting, and for regression tasks, by taking the average of the individual tree predictions. Random Forest extends Bagging by adding random feature selection at each node split, ensuring trees are diverse and less prone to overfitting.

 

Bagging (Bootstrap Aggregating)

 

Bagging, short for Bootstrap Aggregating, is a powerful ensemble learning technique used in both classification and regression tasks. Its primary goal is to improve the accuracy and stability of machine learning algorithms by reducing variance and minimizing the risk of overfitting, especially for high-variance models like decision trees.

 

Core Principles of Bagging

 

  1. Bootstrapping (Sampling with Replacement):

    • Multiple subsets of the original training dataset are created by randomly sampling with replacement.

    • Each subset is the same size as the original dataset, but due to replacement, some samples may appear multiple times, while others may not appear at all.

    • This process introduces variability, allowing each model to see a slightly different version of the training data.

  2. Training Base Models in Parallel:

    • An individual model (typically a high-variance learner such as a fully grown decision tree) is trained independently on each bootstrapped subset.

    • Because each model is trained on different data, they learn diverse patterns.

  3. Aggregation of Predictions:

    • Once all models are trained, their predictions are combined (a minimal code sketch follows this list):

      • Classification: Final output is based on majority voting (the class predicted by the most models).

      • Regression: Final output is the average of the individual model predictions.
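
To make the three steps above concrete, here is a minimal from-scratch sketch in Python. It is illustrative rather than a production implementation; the synthetic dataset and names such as n_models are assumptions made just for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a training set (an assumption for this sketch)
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

n_models = 100
rng = np.random.default_rng(0)
models = []

# Steps 1 and 2: bootstrap a sample, then train one base model on it
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))      # row indices drawn with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: aggregate by majority vote (binary 0/1 labels here)
all_preds = np.array([m.predict(X[:5]) for m in models])   # shape (n_models, 5)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)
print(majority_vote)
```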

 

Why Bagging Works

 

  • Reduces Variance:


    Averaging the predictions of multiple models smooths out fluctuations due to specific training data, resulting in a model that generalizes better.

  • Mitigates Overfitting:

    Since each model is trained on a different bootstrap sample, overfitting is less likely compared to a single complex model trained on the full dataset.

 

Illustrative Example: Predicting Customer Churn

 

Imagine you want to predict whether a customer will churn (leave the service) using Bagging with decision trees; a code sketch follows these steps:

  1. Original Dataset:

    • You have 1,000 customer records with features like age, service usage, number of complaints, and churn status.

  2. Bootstrapping the Data:

    • Create 100 different training subsets by randomly sampling 1,000 records (with replacement) from the original data.

  3. Training the Models:

    • Train 100 decision tree models, one on each bootstrapped subset.

  4. Making a Prediction:

    • A new customer's data is input into all 100 trained models.

    • Suppose 70 models predict "Churn" and 30 predict "No Churn".

    • Final prediction: "Churn", as it received the majority vote.
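
The same workflow is available off the shelf via scikit-learn's BaggingClassifier. The sketch below mirrors the churn example, but the data is simulated with make_classification, so the 1,000 "customer records" are placeholders rather than real churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Simulated stand-in for 1,000 customer records with a binary churn label
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=7)

# 100 decision trees, each trained on a bootstrap sample the same size as the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=7)
bagging.fit(X, y)

new_customer = X[:1]                  # pretend this is a previously unseen customer
print(bagging.predict(new_customer))  # aggregated prediction across the 100 trees
```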

 

Bagging vs. Random Forest: A Deeper Dive into Ensemble Mechanics

 

Bagging (Bootstrap Aggregating) and Random Forest are cornerstone ensemble learning techniques in machine learning. While they share foundational principles, Random Forest is a specialized extension of Bagging, designed specifically for Decision Trees, and introduces additional randomness that significantly boosts performance and generalization.

Understanding the mechanics behind these techniques reveals why Random Forest is one of the most widely used algorithms in practice.

 

1. Base Learners

 

Bagging:

  • A general-purpose ensemble method.

  • Can be applied to any base algorithm (e.g., SVMs, k-NN, Decision Trees), but all models in a given ensemble must be of the same type.

  • The core idea is to reduce variance by averaging multiple diverse models trained on different subsets of the data.

Random Forest:

  • A specialized form of Bagging tailored specifically to Decision Trees.

  • It cannot be used with SVMs or other classifiers — only decision trees are used as base learners.

  • Optimized with additional feature-level randomness to enhance decorrelation and robustness.

 

2. Row Sampling (Bootstrapping)

 

Shared by Both Bagging and Random Forest:

  • Each base model is trained on a bootstrap sample — a randomly selected subset of the training data with replacement.

  • The size of the bootstrap sample is typically equal to the original training set.

  • This sampling introduces diversity across models and helps reduce overfitting.

Example:

  • Given a dataset with 1,000 rows, each tree might train on 1,000 rows sampled with replacement (some rows repeated, others omitted), as the short sketch below demonstrates.
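
A quick way to see this in practice is to draw a single bootstrap sample with NumPy and count how many distinct original rows actually appear in it; the seed below is arbitrary, but on average roughly 63% of the rows show up and the rest are left out.

```python
import numpy as np

n_rows = 1000
rng = np.random.default_rng(42)

# One bootstrap sample: 1,000 row indices drawn with replacement
sample_idx = rng.integers(0, n_rows, size=n_rows)

unique_rows = np.unique(sample_idx).size
print(f"distinct rows in the sample: {unique_rows} of {n_rows}")
print(f"rows never drawn (out-of-bag): {n_rows - unique_rows}")
```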

 

3. Feature Sampling — The “Random” in Random Forest

 

This is the key distinction between Bagging and Random Forest.

 

Bagging (Default Behavior):

  • No feature sampling by default.

  • When Bagging is used with Decision Trees, each tree has access to all input features at every split.

  • The entire feature space is used consistently throughout the tree-building process.

 

Random Forest:

  • Performs feature sampling at each split node — this is the defining characteristic.

  • Instead of evaluating all features at a node, Random Forest selects a random subset of features (e.g., √n for classification or n/3 for regression, where n is the total number of features) and only considers them for the best split.

  • This is called node-level feature subsampling, and it happens independently at every split node in every tree.

 

Detailed Example: Feature Sampling

 

Let’s assume a dataset has 5 features: F₁, F₂, F₃, F₄, F₅

Bagging with Tree-Level Feature Sampling (Non-standard but sometimes configured):

  • If you decide to sample 2 features per tree:

    • Tree M₁ might be assigned F₁ and F₂ — it uses only these two for all splits.

    • Tree M₂ might get F₁ and F₅, and so on.

    • The feature selection happens once per tree, before tree construction.

Random Forest (Standard Behavior):

  • Every tree can consider all 5 features.

  • However, at each split node, the algorithm randomly selects a subset (say 2 out of 5) from which to choose the best split:

    • Node 1 → selects {F₁, F₃}, splits on F₃

    • Node 2 → selects {F₄, F₅}, splits on F₅

    • Node 3 → selects {F₁, F₂}, splits on F₁

  • Feature sampling occurs dynamically, node-by-node, introducing a finer granularity of randomness (illustrated in the sketch below).
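
In scikit-learn this node-level behaviour is controlled by the max_features parameter of RandomForestClassifier. The sketch below sets it to 2 out of 5 features to mirror the example; the dataset is synthetic and exists only to make the snippet runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a dataset with 5 features F1..F5
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=1, random_state=0)

# max_features=2: at every split node, each tree evaluates a fresh random subset of 2 features
rf = RandomForestClassifier(n_estimators=100, max_features=2, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))   # training accuracy, just to confirm the forest was fit
```

For classification, scikit-learn's default (max_features="sqrt" in recent versions) corresponds to the √n rule mentioned above.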

 

4. Why Feature Sampling Matters (Impact)

 

Reduced Tree Correlation

  • Without feature sampling, decision trees in Bagging may repeatedly choose dominant features, making them similar.

  • Random Forest prevents this by forcing trees to consider different features at different points, reducing similarity and boosting diversity.

 

Improved Generalization

  • Trees in a Random Forest learn different decision boundaries due to both row and column variability.

  • This leads to lower variance and improved performance on unseen data.

 

Avoids Feature Domination

  • Strong predictors don't monopolize splits.

  • Even weak or less frequent features get chances to influence splits, enhancing the ensemble’s robustness.

 

Example Scenario (Summary Table)

 

Method        | Row Sampling | Feature Sampling         | Source of Diversity
--------------|--------------|--------------------------|-----------------------------------
Bagging       | Yes          | None (uses all features) | Row randomness only
Random Forest | Yes          | Per split node           | Row + column (feature) randomness

 

Key Difference

 

Random Forest = Bagging + Feature Sampling at Each Node

This added randomness makes Random Forest less prone to overfitting and better at generalizing, especially in high-dimensional data.

 

Visual Analogy

 

  • Bagging:

    Imagine training trees on different shuffled copies of a spreadsheet — all columns visible to every tree.

  • Random Forest:

    Now imagine that at each decision point, the spreadsheet blurs out a random set of columns, forcing the model to choose from what's left.

 

Exception Note

 

While standard Bagging doesn’t include feature sampling, some libraries (like sklearn.ensemble.BaggingClassifier) allow enabling it using parameters like max_features. However:

  • This becomes a hybrid model, not traditional Bagging.

  • In contrast, Random Forest always performs feature sampling at each node by design (a sketch of the hybrid configuration follows).
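
For completeness, a rough sketch of that hybrid configuration is shown below; note that max_features on BaggingClassifier samples features once per base estimator (tree-level), not per split, and the specific values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=1, random_state=0)

# Hybrid: bagging with tree-level feature sampling. Each tree sees a random 60% of the
# columns, chosen once before that tree is built (contrast with per-node sampling in RF).
hybrid = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           max_features=0.6, bootstrap_features=False,
                           bootstrap=True, random_state=0)
hybrid.fit(X, y)
print(hybrid.score(X, y))
```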

 

Conclusion

 

  • Bagging reduces variance by aggregating predictions from multiple models trained on bootstrapped datasets.

  • Random Forest takes this further by introducing node-level feature randomness, making each tree more unique.

  • The result? A stronger, more decorrelated, and highly generalizable ensemble model — which is why Random Forest is often the go-to algorithm for many real-world classification and regression problems.

 

How Random Forest Works (Step-by-Step Intuition):

 

Let's illustrate the process with an example (a runnable sketch follows the steps below):

Scenario: You have a dataset with 1000 rows and 5 columns, and you want to perform a classification task using Random Forest, opting for 100 decision trees.

 

  1. Define the Number of Trees (n_estimators): You decide to build 100 individual decision trees (let's call them M₁, M₂, …, M₁₀₀). Each of these will become a "tree" in your "forest."

     

  2. Bootstrap Sampling (Row Randomness):

    For each of the 100 trees, a bootstrap sample of the original dataset is created.

    This means, for M₁, a dataset D₁ is generated by randomly selecting 1000 rows from the original 1000 rows with replacement. (A smaller subset size, such as 500 rows, is also a valid choice, but for bootstrapping the sample size is typically the same as the original dataset's.)

    This process is repeated for M₂ (creating D₂), and so on, up to M₁₀₀ (creating D₁₀₀).

    Because of "sampling with replacement," each Dᵢ will be slightly different from the original dataset, and also different from the others. Some original rows will be duplicated, others left out (these left-out rows are called "out-of-bag" samples and are useful for internal validation).

     

  3. Feature Subsampling at Each Split (Column Randomness):

    • Now, each decision tree (M₁, …, M₁₀₀) is trained on its respective bootstrap sample (D₁, …, D₁₀₀).

    • Crucial step: When an individual tree is growing (i.e., deciding how to split a node), it does not consider all 5 available columns. Instead, it randomly selects a subset of these columns (typically √(total features) for classification, or total features / 3 for regression, though this is a tunable hyperparameter). Let's say it randomly considers 2 or 3 of the 5 columns at each split.

    • This forces the trees to be diverse and prevents strong features from dominating all trees, leading to lower correlation between the individual tree predictions.

     

  4. Independent Training: Each of the 100 decision trees is grown completely independently, using its unique bootstrap sample and random feature subsets for splitting.

     

  5. Aggregation of Predictions:

    • When a new, unseen data point (query point) arrives, it is fed to all 100 trained decision trees.

    • Each tree makes its own prediction (e.g., "0" or "1").

    • For Classification: The final prediction is determined by majority voting. If 60 trees predict "0" and 40 trees predict "1", the Random Forest classifies the new data point as "0" because it received the most votes.

    • For Regression: The final prediction is the average of the predictions from all individual trees.
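
Putting the five steps together with scikit-learn's RandomForestClassifier might look like the sketch below. The 1000 × 5 dataset is synthetic, and oob_score=True is included only because the out-of-bag rows mentioned in step 2 provide a free internal validation estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 1000-row, 5-column classification dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=1, random_state=1)

rf = RandomForestClassifier(
    n_estimators=100,      # step 1: 100 trees M1..M100
    bootstrap=True,        # step 2: each tree gets its own bootstrap sample
    max_features="sqrt",   # step 3: random feature subset at every split
    oob_score=True,        # out-of-bag rows double as an internal validation set
    random_state=1,
)
rf.fit(X, y)               # step 4: trees are grown independently

query_point = X[:1]
print(rf.predict(query_point))  # step 5: aggregated vote across the 100 trees
print(rf.oob_score_)            # accuracy estimated from the out-of-bag samples
```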

 

Summary of Intuition:

Random Forest leverages the power of ensemble learning by building many independent decision trees. It introduces two levels of randomness: sampling data rows with replacement (bootstrapping) and sampling features at each split point. These randomizations ensure that the individual trees are diverse and uncorrelated, which in turn leads to a more stable, accurate, and robust final prediction compared to a single decision tree or even a general Bagging ensemble without feature randomness.

 

Why Does Random Forest Work?

 

Random Forest is powerful because it follows a principle known as the Wisdom of the Crowd — the idea that the aggregated opinion of a group of diverse, independent thinkers is often more accurate than that of an individual expert. This concept applies to machine learning ensembles as well.

 

Wisdom of the Crowd — Simple Examples:

  1. Guessing Game: If 100 people guess the weight of an object, the average of all guesses is often more accurate than most of the individual guesses.

  2. Voting Systems: Diverse juries or committees make better, fairer decisions than individuals acting alone.

 

Similarly, in Random Forest:

  • A single decision tree might make errors due to overfitting or limited data.

  • But a forest of many trees, each trained on different subsets of data and features, will average out errors and provide a more robust, generalized prediction.

 

Why Random Forest Gives Better Results

 

Let’s break it down:

  1. Bootstrap Sampling

    From the original dataset, we create multiple bootstrap samples (random subsets with replacement). Each sample is used to train a separate decision tree.


  2. Training Multiple Trees

    Each tree learns from a different view of the data — both in terms of rows (samples) and columns (features). This introduces variation in how the trees split and make decisions.


  3. Diverse Decision Boundaries

    Every decision tree may create a slightly different, sometimes erratic decision boundary because of its limited view of the training data. This diversity is precisely the ensemble's strength.

  4. Aggregating Predictions (see the sketch after this list)

    When a new data point arrives, each tree makes a prediction:

    • Classification: Majority vote decides the class.

    • Regression: Average of predictions is taken.
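
To see the aggregation explicitly, one can loop over a fitted forest's trees and take a hard majority vote by hand. This is a rough sketch: scikit-learn's own predict averages class probabilities across trees rather than counting hard votes, so the two can differ in edge cases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=1, random_state=3)
rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)

new_point = X[:1]

# Each fitted tree casts a "vote" for the class of the new point
votes = np.array([tree.predict(new_point)[0] for tree in rf.estimators_])
values, counts = np.unique(votes, return_counts=True)
print(dict(zip(values.astype(int), counts)))            # vote tally, e.g. {0: 37, 1: 63}
print("majority vote :", int(values[counts.argmax()]))
print("forest predict:", int(rf.predict(new_point)[0]))
```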


Majority Vote for Robustness

 

Let's say we have a query point that falls near the boundary between two class regions:

  • One decision tree might say the point belongs to the yellow region.

  • Another tree might classify it as purple.

  • A third tree might not even include it in its training data due to sampling.

But across the forest of decision trees, the majority may classify it as purple. This majority rule helps cancel out individual model flaws, leading to more accurate results.

 

Summary

  • Random Forest reduces overfitting by averaging multiple models.

  • It leverages the Wisdom of the Crowd to generalize better.

  • Diversity in training data and features creates independent models.

  • Majority vote or averaging improves reliability and robustness.

 

Bias-Variance Tradeoff

 

In machine learning, two main sources of error affect a model's performance:

  • Bias: Error due to overly simplistic assumptions in the model (underfitting).

  • Variance: Error due to sensitivity to small fluctuations in the training data (overfitting).

 

Key Idea:

Bias and variance typically trade off against each other: reducing one often increases the other.

Ideally, we want both low bias and low variance, but achieving that is challenging.

 

Most machine learning models tend to fall into one of these two categories:

Model Type                     | Bias      | Variance
-------------------------------|-----------|---------------
Linear Regression, Naive Bayes | High Bias | Low Variance
Decision Trees                 | Low Bias  | High Variance
 

How Random Forest Helps

 

Random Forest is designed to tackle the bias-variance tradeoff effectively:

  • Decision Trees alone have low bias but high variance — they overfit the data easily.

  • By using Bagging (Bootstrap Aggregation) and combining many decision trees, Random Forest reduces variance without increasing bias.

 

How it works:

 

  1. Multiple Subsets: Random Forest creates many subsets of the data through bootstrap sampling (with replacement).

  2. Train Many Trees: Each subset trains a separate decision tree.

  3. Aggregate Outputs:

    • For classification: take majority vote.

    • For regression: take mean of all outputs.

  4. Wisdom of the Crowd: Even though individual trees may be noisy, combining their predictions cancels out the noise (variance), while preserving accurate patterns (low bias).

Result: Low Bias + Low Variance, which leads to better generalization on unseen data.
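
One informal way to observe this effect is to compare cross-validated scores of a single decision tree against a forest on the same data. The dataset and fold count below are arbitrary choices for illustration, and the exact numbers will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=5)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=5), X, y, cv=10)
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=5),
                                X, y, cv=10)

# The forest typically shows a higher mean and a smaller spread across folds,
# reflecting the variance reduction that comes from averaging many decorrelated trees.
print(f"tree   mean={tree_scores.mean():.3f}  std={tree_scores.std():.3f}")
print(f"forest mean={forest_scores.mean():.3f}  std={forest_scores.std():.3f}")
```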

 

Summary

  • Bias-Variance tradeoff is a fundamental concept in ML.

  • Random Forest converts a low bias, high variance model (decision tree) into a low bias, low variance model.

  • It does this using:

    • Bootstrap sampling

    • Feature randomness

    • Aggregation of multiple models (wisdom of the crowd)

 

Feature Importance in Random Forest

 

One powerful advantage of using Random Forest is that it can be used as a feature selection tool. After training, a Random Forest model can provide the importance score of each feature — telling us which features contributed the most to the predictions.

 

How Feature Importance Works

 

As we've seen, individual Decision Trees can calculate feature importance based on how much each feature reduces impurity (such as Gini impurity or entropy) at its splits.

Now, let’s understand how this extends to Random Forest:

  1. Suppose we train three decision trees as part of a Random Forest model.

  2. Each of these trees will individually compute feature importances.

  3. For example:

    • Tree 1 says Feature 1 has an importance of 0.1

    • Tree 2 gives Feature 1 an importance of 0.3

    • Tree 3 gives Feature 1 an importance of 0.4

  4. The average importance across all trees is then calculated:

    Final Importance of Feature 1 = (0.1 + 0.3 + 0.4) / 3 ≈ 0.2667

This process is repeated for all features, and we get the final feature importance scores.
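
In scikit-learn, this averaging is essentially what the feature_importances_ attribute of a fitted forest exposes (the forest also normalizes the result so it sums to 1). The check below against a manual mean over the individual trees is a sketch on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=1, random_state=2)
rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

# Importance reported by the forest (already averaged across its trees)
print(rf.feature_importances_)

# Manual average of the per-tree importances -- matches the line above up to floating point
manual = np.mean([tree.feature_importances_ for tree in rf.estimators_], axis=0)
print(manual)
```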

 

Why It's Useful

 

  • Helps identify relevant features in your dataset.

  • Can be used for feature selection, potentially improving model performance and reducing overfitting.

  • Provides insights into which features influence predictions the most.
