
Random Forest Part - 2
- Aryan

- May 25
- 13 min read
Why Ensemble Techniques Work: The "Wisdom of Crowds"
Ensemble methods derive their power from the principle known as the "wisdom of crowds." This core idea posits that while individual models (or experts) may exhibit prediction errors, their collective judgment, when appropriately aggregated, typically yields superior accuracy and reliability compared to any single constituent model.
Consider an ensemble comprising three classification models: M1, M2, and M3. Each model possesses an individual accuracy of 70% (i.e., a 0.7 probability of making a correct prediction). Employing a majority voting strategy for classification, the ensemble's decision is the class that garners the most votes from its member models.
The intuitive benefit lies in error mitigation: even if one model makes an erroneous prediction, the remaining models may still be correct, effectively overriding the single error. The strength of the ensemble hinges on the errors of the individual models being uncorrelated or, at minimum, not perfectly correlated. If all models consistently make the same mistakes, their combination will not enhance accuracy. However, if their error patterns are diverse, their collective decision is significantly more likely to be accurate.
The Crucial Role of Model Diversity and Independence
The assumption of independence among M1, M2, and M3 is fundamental for the "wisdom of crowds" principle to be effective and for its mathematical underpinnings to hold true.
Independence Conditions:
Heterogeneous Ensemble (Different Algorithms): If models employ distinct algorithms (e.g., a Decision Tree, an SVM, and a Logistic Regression), they are inherently likely to learn different patterns and consequently err on different data points. This algorithmic diversity promotes independence.
Homogeneous Ensemble (Different Data): Even when models are of the same type (e.g., multiple Decision Trees in Bagging or Random Forest), independence is fostered by training them on distinct subsets of the data (e.g., bootstrap samples). This ensures each model processes a slightly different perspective of the data, leading to varied error profiles.
Importance of Independence: Models that are highly correlated (i.e., prone to making identical errors on the same data points) will offer minimal accuracy gains when combined, as they will collectively perpetuate the same mistakes. The efficacy of ensemble methods manifests when individual model errors are diverse and largely independent.
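As a concrete illustration, the minimal sketch below builds both kinds of ensemble with scikit-learn on a synthetic dataset (the dataset, estimators, and parameter values are illustrative assumptions, not part of the discussion above): a heterogeneous VotingClassifier that combines three different algorithms, and a homogeneous BaggingClassifier that trains the same tree type on different bootstrap samples.

```python
# Sketch: heterogeneous vs. homogeneous ensembles (scikit-learn assumed; values illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Heterogeneous ensemble: different algorithms tend to err on different points.
hetero = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("svm", SVC(random_state=42)),
    ("logreg", LogisticRegression(max_iter=1000)),
], voting="hard")

# Homogeneous ensemble: same algorithm, diversity comes from bootstrap samples.
# Note: `estimator=` assumes scikit-learn >= 1.2 (older versions use `base_estimator=`).
homo = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                         n_estimators=50, bootstrap=True, random_state=42)

for name, model in [("heterogeneous", hetero), ("homogeneous", homo)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```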
Mathematical Proof: Probability of Correct Prediction in an Ensemble
To mathematically illustrate the ensemble's superior performance, consider the following:
Assumptions:
An ensemble consists of N independent models (for this example, N = 3).
Each model M_i has an individual probability of correct prediction, p = 0.7.
The ensemble employs majority voting for classification.
Scenario for a Single Query Point:
For a new query point, each model M_i will either predict correctly (C) or incorrectly (I).
P(Model Correct) = p = 0.7
P(Model Incorrect) = q = 1−p = 0.3
For the ensemble (majority vote of 3 models) to be correct, a minimum of 2 out of the 3 models must predict correctly. The scenarios resulting in a correct ensemble prediction are:
All 3 models correct: P(C,C,C) = p × p × p = 0.343
Exactly 2 models correct (three possible permutations):
P(C,C,I) = p × p × q = 0.7 × 0.7 × 0.3 = 0.147
P(C,I,C) = p × q × p = 0.7 × 0.3 × 0.7 = 0.147
P(I,C,C) = q × p × p = 0.3 × 0.7 × 0.7 = 0.147
The total probability that the ensemble (majority vote) is correct is the sum of these probabilities:
P(Ensemble Correct) = P(C,C,C) + P(C,C,I) + P(C,I,C) + P(I,C,C)
P(Ensemble Correct) = 0.343 + 0.147 + 0.147 + 0.147
P(Ensemble Correct) = 0.343 + 3 × 0.147
P(Ensemble Correct) = 0.343 + 0.441
P(Ensemble Correct) = 0.784
Conclusion:
The calculated probability of the ensemble being correct is 0.784, or 78.4%. This significantly surpasses the individual model accuracy of 70%.
This mathematical illustration validates the "wisdom of crowds" principle, demonstrating how ensemble techniques, given diverse and reasonably independent base models, consistently achieve superior performance compared to individual models. This phenomenon can be generalized for a larger number of models using the binomial distribution: for N models (N odd), each with accuracy p, the probability of the ensemble being correct under majority voting is
P(Ensemble Correct) = Σ (from k = (N+1)/2 to N) C(N, k) × p^k × (1−p)^(N−k).
As N increases, ensemble accuracy generally improves, provided p > 0.5 and the independence assumption is maintained.
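The short sketch below turns this binomial formula into code, assuming independent models and a strict majority vote with an odd N; the function name and the values passed to it are illustrative.

```python
# Sketch: ensemble accuracy under the independence assumption (majority vote, odd N).
from math import comb

def ensemble_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent models, each correct
    with probability p, produces the correct prediction."""
    k_min = n // 2 + 1  # minimum number of correct models for a strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(ensemble_accuracy(0.7, 3))   # 0.784, matching the calculation above
print(ensemble_accuracy(0.7, 25))  # accuracy keeps improving as N grows (since p > 0.5)
```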
Random Forest Hyperparameters
Optimizing a Random Forest model involves tuning its hyperparameters, which can significantly impact performance, training time, and generalization. These parameters are broadly categorized into those controlling the overall forest structure, those influencing individual tree growth, and miscellaneous settings.
Forest-Level Hyperparameters
These parameters control the overarching characteristics of the entire Random Forest ensemble; a short usage sketch follows the list.
n_estimators:
Description: This is one of the most important hyperparameters. It specifies the number of decision trees in the forest.
Impact: A higher number of trees generally leads to a more stable and accurate model, as it reduces variance. However, increasing n_estimators also increases computational cost and training time. Beyond a certain point, the accuracy gains diminish, while computation costs continue to rise.
max_features:
Description: This parameter controls the maximum number of features (columns) that each individual decision tree considers when looking for the best split at any given node.
Impact: This, together with bootstrap sampling of the rows, is a core source of "randomness" in Random Forest.
If max_features is set too high (e.g., equal to the total number of features), trees become more similar, increasing correlation and potentially overfitting.
If set too low, trees might not find good splits, leading to underfitting.
Common heuristic values include sqrt(n_features) for classification and n_features / 3 for regression, but it's often tuned.
bootstrap:
Description: A boolean (True/False) parameter that determines whether bootstrap samples are used when building trees.
Impact:
True (default): Each tree is trained on a bootstrap sample (random sampling with replacement) of the training data. This introduces randomness, helps reduce variance, and allows for Out-of-Bag (OOB) error estimation.
False: Each tree uses the entire dataset for training. This reduces randomness and may lead to higher variance and increased overfitting, effectively making it more like a "bagged decision tree" ensemble rather than a full Random Forest.
max_samples:
Description: If bootstrap is True, this parameter limits the number of samples (rows) to draw from the original dataset to train each base estimator (tree). Can be an integer or a float.
Impact:
If an integer, it's the absolute number of samples.
If a float (e.g., 0.7), it's the fraction of the total samples.
Allows for further control over the size of the bootstrap samples. Smaller samples can increase tree diversity but might also lead to underfitting if too small.
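The sketch below shows how these forest-level knobs might be set with scikit-learn's RandomForestClassifier on a synthetic dataset; the specific values are illustrative assumptions, not tuned recommendations.

```python
# Sketch: forest-level hyperparameters (values are illustrative, not recommendations).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,     # number of trees in the forest
    max_features="sqrt",  # features considered per split (sqrt heuristic for classification)
    bootstrap=True,       # train each tree on a bootstrap sample
    max_samples=0.7,      # each bootstrap sample draws 70% of the rows
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```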
Tree-Level Hyperparameters
These parameters govern the construction and pruning of individual decision trees within the forest; a short usage sketch follows the list.
criterion:
Description: The function used to measure the quality of a split.
Impact:
gini (default for classification): Measures the impurity of a node. Favors larger partitions.
entropy (for classification): Measures information gain. Favors more balanced partitions.
squared_error (default for regression): Minimizes the mean squared error.
absolute_error (for regression): Minimizes the mean absolute error.
The choice can subtly influence tree structure and overall performance.
max_depth:
Description: The maximum depth (number of levels) of each decision tree.
Impact:
Controls tree complexity. A deeper tree can capture more complex relationships but is more prone to overfitting.
Restricting max_depth helps prevent individual trees from becoming too specialized to the training data.
min_samples_split:
Description: The minimum number of samples (data points) required to split an internal node.
Impact:
If a node has fewer samples than min_samples_split, it will not be split, even if it could lead to a pure leaf.
Higher values prevent trees from learning highly specific patterns from small groups of data, thus reducing overfitting.
min_samples_leaf:
Description: The minimum number of samples required to be at a leaf node.
Impact:
Ensures that each leaf node has a minimum number of samples, preventing the creation of leaf nodes that represent too few training examples.
Similar to min_samples_split, it helps to regularize the tree and prevent overfitting.
min_weight_fraction_leaf:
Description: The minimum weighted fraction of the total number of training samples required to be at a leaf node.
Impact: Similar to min_samples_leaf, but considers sample weights (if provided). Useful when dealing with imbalanced datasets where some samples are more important than others.
max_leaf_nodes:
Description: The maximum number of leaf nodes allowed in a tree.
Impact: Limits the complexity of the tree by restricting the total number of end nodes. Can serve as an alternative to max_depth for controlling tree size.
min_impurity_decrease:
Description: A node will be split if this split results in a decrease of impurity greater than or equal to this value.
Impact: Prunes the tree by requiring a significant reduction in impurity (e.g., Gini impurity or MSE) for a split to occur. Higher values lead to less complex trees and can prevent overfitting.
ccp_alpha (Cost-Complexity Pruning Alpha):
Description: Complexity parameter used for Minimal Cost-Complexity Pruning (also known as weakest link pruning). After a tree is grown, subtrees whose effective complexity (alpha) is smaller than ccp_alpha are pruned away.
Impact: This is a post-pruning technique. A higher ccp_alpha value increases the number of nodes pruned, resulting in smaller, less complex trees. It's an effective way to control overfitting after trees are grown.
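A minimal sketch of these tree-level controls, here on a regression task with scikit-learn's RandomForestRegressor; the dataset and parameter values are assumptions chosen only to show where each knob plugs in.

```python
# Sketch: tree-level hyperparameters that regularize individual trees (values illustrative).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=200,
    criterion="squared_error",   # split quality measure for regression
    max_depth=10,                # cap the depth of every tree
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=4,          # each leaf must hold at least 4 samples
    max_leaf_nodes=256,          # cap total leaves per tree
    min_impurity_decrease=1e-4,  # require a minimum impurity reduction per split
    ccp_alpha=0.0,               # cost-complexity pruning strength (0 = no pruning)
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))  # R^2 on the training data
```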
Miscellaneous Hyperparameters
These settings control the operational aspects of the Random Forest algorithm; a short usage sketch follows the list.
oob_score (Out-of-Bag Score):
Description: A boolean (True/False) parameter. If True, the out-of-bag samples are used to estimate the generalization accuracy.
Impact: Provides an "internal" cross-validation score without needing a separate validation set. For each tree, about one-third of the data is left out of the bootstrap sample; these are the OOB samples. The OOB score is calculated by making predictions on these samples. It's a reliable estimate of the model's performance on unseen data.
n_jobs:
Description: The number of jobs (CPU cores) to run in parallel for fitting and predicting.
Impact:
-1: Uses all available processors.
1: Uses only one processor (no parallelization).
Positive integer: Specifies that number of processors.
Leverages multi-core CPUs to speed up training, especially for a large number of n_estimators.
random_state:
Description: Controls the randomness of the bootstrapping of the samples and the splitting of features at each node.
Impact: Setting an integer value ensures reproducibility of the results. The same random_state will always produce the same random splits and sample selections, making experiments repeatable.
verbose:
Description: Controls the verbosity of the output during training.
Impact:
0: Suppresses all output.
1: Prints a message for each tree.
2: Prints even more detailed messages.
Useful for monitoring the training process, especially for long-running fits.
warm_start:
Description: A boolean (True/False) parameter. If True, successive calls to fit will add more estimators to the ensemble, rather than retraining a whole new forest.
Impact: Useful for incrementally adding more trees to an already trained forest, saving computational time when experimenting with n_estimators.
class_weight:
Description: A dictionary or string that specifies weights associated with classes (for classification tasks).
Impact: Helps handle imbalanced datasets by giving more importance to minority classes, preventing the model from being biased towards the majority class. Can be "balanced" (weights inversely proportional to class frequencies) or "balanced_subsample".
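The sketch below exercises these operational settings with scikit-learn, including the warm_start pattern of growing an already trained forest; the dataset, class imbalance, and values are illustrative assumptions.

```python
# Sketch: operational settings — parallelism, reproducibility, warm_start, class weighting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,                # use all available CPU cores
    random_state=42,          # reproducible bootstraps and feature splits
    verbose=0,                # suppress training output
    warm_start=True,          # allow adding trees on subsequent fit calls
    class_weight="balanced",  # up-weight the minority class
)
rf.fit(X, y)

# Grow the same forest from 100 to 200 trees without retraining from scratch.
rf.n_estimators = 200
rf.fit(X, y)
print(len(rf.estimators_))  # 200
```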
OOB SCORE
"OOB" stands for "out-of-bag." In the context of machine learning, an out-of-bag score is a method of measuring the prediction error of Random Forests, Bagging Classifiers, and other ensemble methods that use bootstrap aggregation (bagging) where sub-samples of the training dataset are used to train individual models.
Here's how it works:
Each tree in the ensemble is trained on a distinct bootstrap sample of the data. By the nature of bootstrap sampling, some samples from the dataset will be left out during the training of each tree. These samples are called "out-of-bag" samples.
The out-of-bag samples can then be used as a validation set. Predictions for each OOB sample are obtained by passing it through the trees that did not see it during training.
These predictions are then compared to the actual values to compute an "out-of-bag score," which serves as an estimate of the prediction error on unseen data.
One of the significant advantages of the out-of-bag validation score is that it allows for estimating the prediction error without requiring a separate validation set. This is particularly useful when the dataset is small, as partitioning it into distinct training and validation sets might leave insufficient samples for effective model learning.
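A minimal scikit-learn sketch of this idea, assuming a synthetic dataset: enabling oob_score=True during training yields an internal performance estimate via the fitted model's oob_score_ attribute, with no separate validation set.

```python
# Sketch: obtaining the OOB score instead of holding out a validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each sample only with trees that never saw it
    bootstrap=True,   # OOB estimation requires bootstrap sampling
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)  # OOB accuracy: an estimate of performance on unseen data
```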
Understanding Out-of-Bag (OOB) Samples in Ensemble Methods
In ensemble techniques like Bagging and Random Forest, a core principle is bootstrapping, which involves drawing random samples with replacement from the original dataset to train individual models. A natural consequence of this process is the creation of Out-of-Bag (OOB) samples.
Consider a dataset with N data points. When a bootstrap sample of size N is drawn with replacement, some original data points will be selected multiple times, while approximately 36.8% of the original data points are statistically expected not to be included in that specific bootstrap sample. These unselected data points are the OOB samples for that particular model.
Illustrative Example:
Suppose an original dataset contains 10 distinct numbers: [10, 15, 9, 6, 7, 8, 1, 12, 18, 20]. If we perform bootstrap sampling to create a training subset (smaller samples of size 5 are shown here for brevity), for instance:
Sample 1: [10, 10, 1, 15, 1] (Note: 10 and 1 are repeated; 6, 7, 8, 9, 12, 18, 20 are missing)
Sample 2: [15, 9, 9, 9, 15] (Note: 9 and 15 are repeated; 1, 6, 7, 8, 10, 12, 18, 20 are missing)
Sample 3: [1, 10, 15, 9, 1] (Note: 1 is repeated; 6, 7, 8, 12, 18, 20 are missing)
For each of these generated samples, the numbers from the original dataset that were not included constitute the OOB samples for the model trained on that specific bootstrap sample.
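The short NumPy sketch below reproduces this idea programmatically, drawing full-size (size N) bootstrap samples from the same 10 numbers and printing the OOB set for each; the random seed and resulting samples are illustrative.

```python
# Sketch: bootstrap samples and their out-of-bag sets for the example dataset.
import numpy as np

data = np.array([10, 15, 9, 6, 7, 8, 1, 12, 18, 20])
rng = np.random.default_rng(0)

for i in range(3):
    idx = rng.integers(0, len(data), size=len(data))  # sample indices with replacement
    in_bag = data[idx]
    oob = np.setdiff1d(data, in_bag)                  # numbers never drawn = OOB samples
    print(f"Sample {i + 1}: {in_bag.tolist()}  OOB: {oob.tolist()}")
```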
OOB Samples in Random Forest Context:
In the context of Random Forest, where hundreds or thousands of decision trees are trained:
Each individual decision tree (M_i) is trained on its own unique bootstrap sample (D_i).
The data points from the original dataset that were not included in D_i form the OOB set for M_i.
The critical advantage of OOB samples is that they have not been seen by the model during its training phase. This makes them genuinely unseen data points for that particular tree.
Utilizing OOB for Model Evaluation:
The most significant utility of OOB samples lies in their ability to provide an unbiased estimate of the model's generalization performance (accuracy for classification, error for regression) without the need for a separate validation set.
The process for OOB evaluation is as follows:
For each data point in the original training set, collect predictions only from the trees for which this data point was an OOB sample.
Aggregate these OOB predictions (e.g., by majority vote for classification or averaging for regression) to get a final OOB prediction for that data point.
Compare these OOB predictions against the true labels of the data points to compute an "OOB score" (e.g., OOB accuracy or OOB error).
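To make these three steps concrete, the sketch below implements OOB evaluation from scratch for a small bagged ensemble of decision trees (a simplified stand-in for a Random Forest); the dataset, tree count, and bookkeeping are assumptions for illustration only.

```python
# Sketch: OOB evaluation from scratch — each point is scored only by trees that never saw it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
n, n_trees = len(X), 50
rng = np.random.default_rng(0)

votes = np.zeros((n, 2))  # per-sample vote counts for the two classes
for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)             # bootstrap indices (with replacement)
    oob_mask = ~np.isin(np.arange(n), idx)       # points this tree never saw
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    preds = tree.predict(X[oob_mask])
    votes[np.flatnonzero(oob_mask), preds] += 1  # collect OOB votes only

scored = votes.sum(axis=1) > 0                   # points with at least one OOB vote
oob_pred = votes[scored].argmax(axis=1)          # majority vote per point
print("OOB accuracy:", (oob_pred == y[scored]).mean())
```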
Statistical Basis:
It is statistically proven that, on average, for a bootstrap sample of size N drawn from a dataset of size N, approximately 63.2% of the distinct original data points will appear at least once in the sample, meaning approximately 36.8% will be left out as OOB samples.
This inherent separation means that OOB samples effectively serve as an internal validation set for the ensemble, offering a reliable and computationally efficient way to assess model performance during training.
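A quick numerical check of this statistic: the probability that a given point is never drawn in N draws with replacement is (1 − 1/N)^N, which approaches e^(−1) ≈ 0.368 as N grows. The sketch below evaluates this for a few values of N.

```python
# Sketch: the fraction of points left out of a bootstrap sample tends to e^(-1) ≈ 0.368.
import math

for n in (10, 100, 1000, 100000):
    print(n, (1 - 1 / n) ** n)   # probability a given point is never drawn

print("limit:", math.exp(-1))    # ≈ 0.368 left out, so ≈ 63.2% in-bag on average
```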
Extremely Randomized Trees (Extra Trees)
Extremely Randomized Trees, commonly known as Extra Trees, is an ensemble machine learning algorithm that builds upon the principles of Random Forest but introduces an additional layer of randomness during the tree construction process.
Key Distinction from Random Forest: Split Point Selection
The primary difference between Extra Trees and Random Forest lies in how the split points for decision tree nodes are determined:
Traditional Decision Trees (and Random Forests): For each feature considered at a node, the algorithm exhaustively searches for the optimal split point that maximizes information gain (or minimizes impurity, like Gini impurity or mean squared error). This involves calculating metrics for various possible thresholds and selecting the best one.
Extra Trees: In contrast, for each feature considered at a node, Extra Trees selects a split point completely at random within the feature's range. From these randomly generated split points across various features, the algorithm then chooses the feature and its random split point that yields the best overall split (i.e., the one that results in the greatest impurity reduction).
This crucial modification adds an "extreme" level of randomness to the tree building process, hence the name "Extremely Randomized Trees."
Impact of Increased Randomness:
The enhanced randomness in Extra Trees has several implications:
Increased Decorrelation: By choosing random split points, individual trees in an Extra Trees ensemble become even more decorrelated from each other compared to Random Forests. This further leverages the "wisdom of crowds" principle, as each tree explores different decision boundaries, leading to a more diverse set of predictions.
Reduced Variance: The greater decorrelation among trees contributes to a further reduction in the overall variance of the ensemble model. This often leads to better generalization performance and can mitigate overfitting more effectively than Random Forests in certain scenarios.
Potential for Higher Bias (Individual Trees): Because individual trees select random split points instead of optimal ones, they might be slightly less accurate (have higher bias) on their respective training subsets.
Faster Training (Potentially): In scenarios with many features or continuous features, the random split point selection can be computationally faster than exhaustively searching for the optimal split, potentially speeding up the training process.
Intuitive Example (Numerical Feature Splitting):
Consider a numerical feature like 'Age' in a dataset where the goal is to predict 'Insurance purchase' (Yes/No).
Standard Decision Tree / Random Forest: When splitting a node based on 'Age', the algorithm would typically sort the unique 'Age' values and evaluate potential split points (e.g., midpoints between adjacent sorted values) to find the threshold that best separates 'Yes' from 'No' instances. For example, it might find that Age > 35 is the optimal split.
Extra Trees: Instead of calculating optimal splits, Extra Trees would randomly pick a value within the range of 'Age' in that node (e.g., Age > 29.5, Age > 41.2, Age > 33.7). It would do this for several features and then select the feature and its random split point that achieves the best (but not necessarily optimal) impurity reduction.
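For a side-by-side feel, the sketch below cross-validates scikit-learn's RandomForestClassifier and ExtraTreesClassifier on the same synthetic dataset; the data and settings are illustrative assumptions, and neither model should be expected to win universally.

```python
# Sketch: comparing Random Forest (optimal splits) with Extra Trees (random split thresholds).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)  # searches for optimal splits
et = ExtraTreesClassifier(n_estimators=200, random_state=0)    # picks split thresholds at random

print("Random Forest:", cross_val_score(rf, X, y, cv=5).mean())
print("Extra Trees:  ", cross_val_score(et, X, y, cv=5).mean())
```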
Performance Considerations:
While the increased randomness in Extra Trees can lead to a slight increase in the bias of individual trees, the significant reduction in variance due to enhanced decorrelation often results in a net gain in overall ensemble performance. This makes Extra Trees a compelling alternative, and sometimes a superior choice, to Random Forest, particularly when dealing with complex, high-dimensional data or when the exact optimal split points are less clearly defined. Whether Extra Trees outperforms Random Forest, or any other algorithm, ultimately depends on the specific characteristics of the dataset and the problem at hand.
ADVANTAGES AND DISADVANTAGES OF RANDOM FOREST
Advantages
Robustness to Overfitting: Random Forests are less prone to overfitting compared to individual decision trees, because they average the results from many different trees, each of which might overfit the data in a different way.
Handling Large Datasets: They can handle large datasets with high dimensionality effectively.
Less Pre-processing: Random Forests can handle both categorical and numerical variables without the need for scaling or normalization. They can also handle missing values.
Variable Importance: They provide insights into which features are most important in the prediction.
Parallelizable: The training of individual trees can be parallelized, as they are independent of each other. This speeds up the training process.
Non-Parametric: Random Forests are non-parametric, meaning they make no assumptions about the form of the transformation from inputs to output. This makes them suitable to model complex, non-linear relationships.
Disadvantages
Model Interpretability: One of the biggest drawbacks of Random Forests is that they lack the interpretability of simpler models like linear regression or decision trees. While you can rank features by their importance, the model as a whole is essentially a black box.
Performance with Unbalanced Data: Random Forests can be biased towards the majority class when dealing with unbalanced datasets. This can sometimes be mitigated by balancing the dataset prior to training.
Predictive Performance: Although Random Forests generally perform well, they may not always provide the best predictive performance. Gradient boosting machines, for instance, often outperform Random Forests. If the relationships within the data are linear, a linear model will likely perform better than a Random Forest.
Inefficiency with Sparse Data: Random Forests might not be the best choice for sparse data (e.g., text data), where linear models or other algorithms might be more suitable.
Parameters Tuning: Although Random Forests require less tuning than some other models, there are still several parameters (like the number of trees, tree depth, etc.) that can affect model performance and need to be optimized.
Difficulty with High Cardinality Features: Random Forests can struggle with high cardinality categorical features (features with a large number of distinct values). These types of features can lead to trees that are biased towards the variables with more levels, and may cause overfitting.
Can't Extrapolate: As regressors, Random Forests cannot predict values outside the range of the target values seen during training, so they may be less accurate than other regression models when extrapolation is required.


