All Posts


K-Means Clustering Explained: Geometric Intuition, Assumptions, Limitations, and Variations
K-Means is a powerful unsupervised machine learning algorithm used to partition a dataset into a pre-determined number of distinct, non-overlapping clusters. It works by iteratively assigning data points to the nearest cluster "centroid" and then updating the centroid's position based on the mean of the assigned points. This guide breaks down the geometric intuition behind K-Means, explores its core assumptions and limitations, and introduces important variations you should know.
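To make the assign-then-update loop concrete, here is a minimal NumPy sketch of plain K-Means. It is illustrative only; the function name, defaults, and stopping rule are my own, not taken from the post.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```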

Aryan
Sep 22


Introduction to Unsupervised Learning: Clustering, Dimensionality Reduction & More
Unsupervised learning is a type of machine learning that uncovers hidden patterns in data without labels. Discover its key types, from clustering and dimensionality reduction to anomaly detection, and see how these techniques are applied in real-world scenarios like customer segmentation and image processing.
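As a small taste of what those techniques look like in practice, a scikit-learn sketch on toy data (the dataset and parameter values are placeholders chosen for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy data standing in for, e.g., customer feature vectors (no labels used).
X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)

# Clustering: group similar points without any labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Dimensionality reduction: compress 10 features down to 2 for visualization.
X_2d = PCA(n_components=2).fit_transform(X)
print(labels[:10], X_2d.shape)
```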

Aryan
Sep 22


Exclusive Feature Bundling (EFB) in LightGBM: Boost Speed & Reduce Memory Usage
Exclusive Feature Bundling (EFB) is a key LightGBM optimization that reduces the number of features by merging sparse, mutually exclusive columns—cutting memory usage and training time without sacrificing accuracy.
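The bundling idea is easy to illustrate outside LightGBM: two features that are never non-zero on the same row can share one column by shifting one of them into a separate value range. A toy NumPy sketch of that offset trick (not LightGBM's internal code):

```python
import numpy as np

# Two sparse features that are never non-zero on the same row
# (e.g., one-hot columns from the same categorical variable).
f1 = np.array([3.0, 0.0, 0.0, 5.0, 0.0])
f2 = np.array([0.0, 2.0, 4.0, 0.0, 0.0])

# Bundle them into one column by shifting f2 into a separate value range,
# so a single feature (and a single histogram) encodes both.
offset = f1.max() + 1.0
bundled = np.where(f1 != 0, f1, np.where(f2 != 0, f2 + offset, 0.0))
print(bundled)  # [ 3.  8. 10.  5.  0.]
```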

Aryan
Sep 21


GOSS Explained: How LightGBM Achieves Faster Training Without Sacrificing Accuracy
Gradient-based One-Side Sampling (GOSS) is a key innovation in LightGBM that accelerates model training without losing accuracy. By focusing on high-gradient (hard-to-learn) data points and selectively sampling low-gradient ones, GOSS strikes the perfect balance between speed and performance, making LightGBM faster and more efficient than traditional boosting methods.
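A toy sketch of the sampling step described above (a simplified illustration, not LightGBM's implementation; the rates and names are my own):

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Toy GOSS: keep the top_rate fraction of rows with the largest |gradient|,
    randomly sample other_rate of the rest, and up-weight the sampled rows so
    the gradient sum stays approximately unbiased."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))      # rows sorted by |gradient|, descending
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    top_idx = order[:n_top]                     # hard-to-learn rows: always kept
    sampled_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    # Low-gradient rows are re-weighted by (1 - top_rate) / other_rate.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, sampled_idx]), weights
```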

Aryan
Sep 19


LightGBM Explained: Objective Function, Split Finding, and Leaf-Wise Growth
Discover how LightGBM optimizes gradient boosting with faster training, memory efficiency, and advanced split finding. Learn its unique leaf-wise growth strategy, objective function, and why it outperforms traditional methods like XGBoost.
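A typical way to use it from Python, with num_leaves as the main capacity knob for leaf-wise growth. The values below are illustrative rather than tuned recommendations, and the snippet assumes a recent lightgbm release where early stopping is passed as a callback:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Leaf-wise growth: num_leaves (not max_depth) is the primary complexity control.
model = lgb.LGBMRegressor(
    objective="regression",
    num_leaves=31,
    learning_rate=0.05,
    n_estimators=500,
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50)])
```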

Aryan
Sep 18


Handling Missing Data in XGBoost
Struggling with missing data? XGBoost simplifies the process by handling it internally using its sparsity-aware split finding algorithm. Learn how it finds the optimal "default direction" for missing values at every tree split by testing which path maximizes information gain. This allows you to train robust models directly on incomplete datasets without manual imputation.
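In practice that means arrays containing NaN can be passed straight to XGBoost. A small sketch on synthetic data (the parameter values are illustrative, and the snippet assumes a recent xgboost release):

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# Knock out ~10% of the values to simulate missing data.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# No imputation needed: at each split, NaNs follow the learned default direction.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X, y)
print(model.predict_proba(X[:5]))
```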

Aryan
Sep 17


XGBoost Optimizations
XGBoost is one of the fastest gradient boosting algorithms, designed for high-dimensional and large-scale datasets. This guide explains its core optimizations—including approximate split finding, quantile sketches, and weighted quantile sketches—that reduce computation time while maintaining high accuracy.
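In the Python API, these optimizations surface mainly through the tree_method parameter. A short sketch, assuming a recent xgboost version (dataset size and max_bin are placeholders):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# 'exact' enumerates every candidate split; 'approx' and 'hist' instead propose
# candidate splits from (weighted) quantile sketches / histograms.
params = {"objective": "binary:logistic", "tree_method": "hist", "max_bin": 256}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=100)
```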

Aryan
Sep 12


XGBoost Regularization
XGBoost is a powerful boosting algorithm, but it can overfit if not controlled. Regularization helps by simplifying trees, pruning unnecessary splits, and balancing bias–variance. This guide explains overfitting, how XGBoost improves on Gradient Boosting, and key parameters like gamma, lambda, max_depth, min_child_weight, learning rate, subsample, and early stopping to build robust models.
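Those knobs map directly onto estimator parameters. The values below are placeholders that show where each one goes, not recommendations, and the snippet assumes a recent xgboost release where early_stopping_rounds is a constructor argument:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# gamma prunes low-gain splits, reg_lambda shrinks leaf weights, and the
# remaining parameters limit tree complexity and per-round influence.
model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    min_child_weight=5,
    gamma=1.0,
    reg_lambda=2.0,
    subsample=0.8,
    early_stopping_rounds=50,
    eval_metric="logloss",
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
```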

Aryan
Sep 5


The Core Math Behind XGBoost
XGBoost isn’t just another boosting algorithm — its strength lies in the mathematics that power its objective function, optimization, and tree-building strategy. In this post, we break down the core math behind XGBoost: from gradients and Hessians to Taylor series approximation, leaf weight derivation, and similarity scores. By the end, you’ll understand how XGBoost balances accuracy with regularization to build powerful predictive models.
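For reference, the central formulas the post builds toward, written in LaTeX (these are the standard results from the XGBoost objective, with g_i and h_i the gradient and Hessian of the loss, T the number of leaves, and w_j the leaf weights):

```latex
% Second-order Taylor approximation of the objective at boosting round t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

% With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i,
% the optimal weight of leaf j and the resulting objective value are
w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad
\mathcal{L}^{*} = -\tfrac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

% Gain of a candidate split into left and right children L and R
\text{Gain} = \tfrac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```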

Aryan
Aug 26


XGBoost for Classification
Master classification with XGBoost using a practical, beginner-friendly example. Understand how the algorithm builds decision trees, calculates log loss, optimizes splits, and uses probabilities to make accurate class predictions. A must-read for aspiring machine learning engineers.
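A minimal end-to-end example in that spirit (the dataset and settings are my own choices for illustration, not the post's worked example):

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Binary classification trained against log loss; predictions come back
# as class probabilities via predict_proba.
model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.1,
                          max_depth=3, eval_metric="logloss")
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
print("log loss:", log_loss(y_te, proba))
```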

Aryan
Aug 16