
Elbow Method and Silhouette Score Explained: Finding the Optimal Number of Clusters in K-Means

  • Writer: Aryan
  • Sep 25
  • 7 min read

THE ELBOW METHOD

 

When we can't just look at our data and see the number of natural groups (especially in high dimensions), we need a way to find the optimal number of clusters, 'k'. The Elbow Method is a popular technique for this.

The method relies on a quantity called inertia, technically known as WCSS (Within-Cluster Sum of Squares).


What is WCSS?

 

WCSS measures how compact our clusters are. For a single cluster, it's the sum of the squared distances between each data point and the cluster's centroid. A lower WCSS means the points are closer to their centroid, indicating a denser cluster.

[Figure: a cluster of five points at distances d₁ through d₅ from its centroid]

If the distances from the centroid to the five points in a cluster are d₁, d₂, d₃, d₄, d₅, the WCSS for that cluster would be:

WCSS = d₁² + d₂² + d₃² + d₄² + d₅²

A smaller WCSS value is generally better as it signifies more compact clusters.
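As a quick check of this definition, here is a minimal NumPy sketch that computes the WCSS of a single cluster directly; the five points are made up for illustration:

import numpy as np

# Five made-up 2-D points forming a single cluster
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 1.9], [1.1, 2.1]])

# The centroid is the mean of the cluster's points
centroid = points.mean(axis=0)

# WCSS: sum of squared distances from each point to the centroid
wcss = np.sum((points - centroid) ** 2)
print(wcss)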


How the Method Works

 

The process is straightforward: we calculate the total WCSS for a range of k values and look for the point where increasing k stops paying off.

  1. Run K-Means for a range of k: We typically test values from 1 to 10 (or more).

  2. Calculate total WCSS for each k:

    • For k=1, there's one big cluster. We calculate its WCSS.

    • For k=2, there are two clusters. We find the WCSS for each cluster and add them together to get the total WCSS for k=2.

  3. Plot the results: We create a line graph with the number of clusters (k) on the x-axis and the total WCSS on the y-axis.

As you increase k, the WCSS will always decrease. This is because the clusters become smaller, and the distance between points and their respective centroids shrinks. If k equals the number of data points, WCSS would be 0, as each point would be its own cluster. But that's not useful. Our goal is to find a balance.
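Putting these steps into code is short. Here is a minimal sketch using scikit-learn, whose KMeans exposes the total WCSS as the inertia_ attribute; the synthetic data and parameter choices are illustrative, not prescriptive:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the total WCSS for this k

# Plot k against total WCSS and look for the elbow
plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Total WCSS (inertia)")
plt.show()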


Finding the "Elbow Point"

 

The graph of k vs. WCSS typically forms a shape that looks like an arm, and we're looking for the "elbow."

[Figure: line plot of total WCSS versus k, forming an arm-like curve]

The intuition is to choose the k at the elbow. This is the point where the rate of decrease in WCSS sharply shifts.

  • Before the elbow (e.g., from k=1 to 3), WCSS drops rapidly. This means adding another cluster significantly improves the model by making the groupings more compact.

  • After the elbow (e.g., from k=3 to 4 and beyond), the WCSS decreases much more slowly. This suggests we've hit a point of diminishing returns, where adding more clusters doesn't provide much benefit.

[Figure: the elbow plot with the elbow marked at k=3]

Therefore, the elbow point (in this case, k=3) represents the optimal trade-off between minimizing WCSS and not having too many clusters.

 

Why does reducing WCSS benefit us?

 

Reducing the WCSS (Within-Cluster Sum of Squares) is beneficial because it's a measure of cluster compactness.

In statistics, WCSS is analogous to variance. Variance measures how spread out a set of data is. Therefore, when we reduce the WCSS, we are essentially reducing the variance within each cluster. This forces the data points in a cluster to be closer to their centroid, creating tighter, more meaningful groups where the members are highly similar to one another. The Elbow Method helps us find the sweet spot where the clusters are sufficiently compact without creating too many of them.
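This analogy can be checked numerically: for a single cluster, the WCSS equals the number of points times the (biased) per-feature variance, summed over the features. A small sketch with random illustrative data:

import numpy as np

points = np.random.default_rng(0).normal(size=(50, 2))
centroid = points.mean(axis=0)

wcss = np.sum((points - centroid) ** 2)
# n times the biased variance of each feature, summed over features
via_variance = points.shape[0] * points.var(axis=0).sum()

print(np.isclose(wcss, via_variance))  # True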


Why is it called "Inertia"?

 

The term "inertia" is an analogy from physics. In physics, inertia is an object's resistance to change in its motion.

In clustering, you can think of WCSS as a measure of a cluster's internal coherence or how "spread out" it is. A cluster with a high WCSS is very spread out, and its points "resist" being represented by a single centroid. The K-Means algorithm tries to minimize this total inertia, resulting in clusters that are as tight and internally consistent as possible.

 

Why do we use the square of the distance?

 

Using the square of the Euclidean distance, rather than just the distance itself, has a few key advantages:

  1. Penalizes Larger Errors More: Squaring the distance gives more weight to points that are far from the centroid. This heavily penalizes outliers and pushes the algorithm to find centroids that minimize these large distances, resulting in tighter, more spherical clusters.

  2. Mathematical Convenience: The squared distance is mathematically easier to work with. Minimizing the sum of squared distances has a closed-form solution: the arithmetic mean of the cluster's points, which is exactly how K-Means updates its centroids. Without the square root of the standard Euclidean distance formula, the derivatives are simpler and the underlying optimization is more stable.
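The second point can be illustrated numerically: among candidate centers, the arithmetic mean always yields the smallest sum of squared distances. A quick sketch with random illustrative data:

import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(100, 2))

def sum_sq_dist(center):
    return np.sum((points - center) ** 2)

mean = points.mean(axis=0)

# No perturbed candidate should beat the mean
for _ in range(5):
    candidate = mean + rng.normal(scale=0.1, size=2)
    assert sum_sq_dist(mean) <= sum_sq_dist(candidate)
print("The mean minimizes the sum of squared distances.")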


Limitations of the Elbow Method

 

  1. Subjectivity in Identifying the Elbow

    • The biggest challenge is the subjective nature of finding the "elbow" point.

    • In datasets where the WCSS decreases gradually, it may be unclear where the elbow occurs.

  2. Not Suitable for All Datasets

    • Works best when data is naturally clustered.

    • For irregularly shaped or overlapping clusters, the elbow may not be distinct, causing ambiguity in selecting k.

  3. Performance with High-Dimensional Data

    • With many features, WCSS differences between successive k values may be small, making the elbow harder to identify.

  4. Doesn’t Consider Cluster Quality

    • The method focuses only on variance within clusters.

    • It does not guarantee that the clusters are meaningful, well-separated, or interpretable.

  5. Sensitivity to Scaling

    • Like K-Means itself, the results can be affected by feature scales.

    • Features with larger ranges can dominate the WCSS, potentially leading to a suboptimal k.
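Because of this last limitation, it is common to standardize features before running K-Means. A minimal sketch using scikit-learn's StandardScaler; the data is illustrative, with one feature on a much larger scale than the other:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative data: the second feature has a far larger range
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=200), rng.normal(scale=1000.0, size=200)])

# Rescale each feature to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)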


The Silhouette Score

 

The Silhouette Score is a metric used to quantify the quality of results from a clustering algorithm. It is applicable to many different clustering methods.

In supervised learning, we have ground truth labels to evaluate a model's performance. In unsupervised learning, however, we do not have these labels. This absence of a "ground truth" makes it challenging to assess how well the clusters have been formed. The Silhouette Score addresses this problem by providing a quantitative measure for the quality of the clustering.

 

To evaluate the quality of a clustering algorithm, we assess two key factors: the compactness of the clusters and their distance from each other. These concepts are formally known as cohesion and separation.

  • Cohesion refers to how tightly packed the data points are within a cluster (also called intra-cluster distance).

  • Separation refers to how distinct or far apart different clusters are from each other (also called inter-cluster distance).

An important distinction is that some evaluation techniques, such as the Elbow Method, rely solely on inertia, which only measures cohesion. This can be a limitation, as it doesn't account for the separation between clusters.


Cohesion

 

  • Definition: Cohesion measures the degree to which data points within the same cluster are close to each other.

  • Ideal Scenario: High cohesion is desirable. It indicates that the points in a cluster are similar and form a dense grouping.

  • Measurement: Cohesion is often quantified by the Within-Cluster Sum of Squares (WCSS), also known as inertia. This is the sum of the squared distances between each data point and its cluster's centroid.

 

Separation

 

  • Definition: Separation measures how distinct or well-differentiated a cluster is from other clusters.

  • Ideal Scenario: High separation is desirable. It means that clusters are far apart, indicating the algorithm has successfully distinguished between different groups in the data.

  • Measurement: Separation can be measured by the distance between cluster centroids. More advanced metrics, such as the Silhouette Score, are powerful because they evaluate clustering quality by considering both cohesion and separation.
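Both quantities are easy to read off a fitted K-Means model. In the sketch below, cohesion is taken as the model's inertia_ and separation as the smallest distance between any pair of centroids; the data and choice of k are illustrative:

from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

cohesion = km.inertia_                          # total WCSS: lower means tighter clusters
separation = pdist(km.cluster_centers_).min()   # closest pair of centroids: higher is better
print(cohesion, separation)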

 

Silhouette score

 

The Silhouette Score provides a measure of how well each data point fits into its assigned cluster. It considers both the tightness of the cluster (cohesion) and its distance from other clusters (separation). The score for a single point ranges from -1 to +1.

 

The Formula

 

The Silhouette Score (Sᵢ) for a single data point i is calculated using the following formula:

Sᵢ = (bᵢ − aᵢ) / max(aᵢ, bᵢ)

Where:

  • aᵢ (Cohesion): The average distance from the data point i to all other data points in the same cluster. A small aᵢ value means the point is well-matched with its own cluster.

  • bᵢ (Separation): The minimum average distance from the data point i to all data points in any other single cluster. This is the average distance to the "next nearest" cluster. A large bᵢ value means the point is far away from other clusters.
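To make the formula concrete, here is a sketch that computes aᵢ, bᵢ, and Sᵢ for one point by hand and checks the result against scikit-learn's silhouette_samples; the toy data and the choice of point are arbitrary:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

i = 0  # an arbitrary point
same = labels == labels[i]
same[i] = False  # exclude the point itself from its own cluster

# a_i: average distance to the other points in the same cluster
a_i = np.linalg.norm(X[same] - X[i], axis=1).mean()

# b_i: smallest average distance to the points of any other cluster
b_i = min(np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
          for c in set(labels) if c != labels[i])

s_i = (b_i - a_i) / max(a_i, b_i)
print(np.isclose(s_i, silhouette_samples(X, labels)[i]))  # True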


Interpretation of the Score

 

The resulting value, Sᵢ, tells us about the quality of clustering for that specific point:

  • Score Close to +1: Indicates the data point is far from neighboring clusters and very close to the points in its own cluster. This is the ideal scenario.

  • Score Close to 0: Indicates the data point is on or very close to the decision boundary between two clusters.

  • Score Close to -1: Indicates the data point has likely been assigned to the wrong cluster.
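In practice, points in this last category are easy to flag programmatically with scikit-learn's silhouette_samples; the data below is synthetic and deliberately noisy so that some negative scores appear:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = silhouette_samples(X, labels)
suspect = np.where(scores < 0)[0]  # points that may be in the wrong cluster
print(f"{len(suspect)} points have a negative silhouette score")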

 

Visualizing Clustering Performance with Silhouette Plots

 

To determine the optimal number of clusters (k), we can't just rely on a single number; we need to visually inspect the quality of the clusters. The Silhouette Plot is an excellent tool for this.

First, let's consider the unlabeled data we want to cluster.

[Figure: scatter plot of the unlabeled data]

How to Read a Silhouette Plot

 

A Silhouette Plot provides a visual representation of how well each data point fits within its cluster. Let's break down the plot for k=2.

[Figure: silhouette plot for k=2]

  • X-axis: This is the Silhouette Score for each point, ranging from -1 to +1.

  • Y-axis: This groups the individual points by their assigned cluster (e.g., Cluster 0, Cluster 1).

  • The "Blades": Each colored region represents a cluster.

    • The thickness (height) of a blade corresponds to the number of data points in that cluster.

    • The width of each horizontal line in the blade is the silhouette score of a single point. The points in each cluster are sorted by their score in descending order.

  • The Red Dashed Line: This line represents the average Silhouette Score for all data points across all clusters. A higher average score generally indicates a better clustering result.
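A plot like this can be drawn by hand from silhouette_samples. The sketch below shows one way to do it with matplotlib; the layout details (sorting, gaps between blades) follow a common convention and are not the exact plot above:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
scores = silhouette_samples(X, labels)

y_lower = 10
for c in range(k):
    blade = np.sort(scores[labels == c])[::-1]  # one blade: this cluster's scores, descending
    y_upper = y_lower + len(blade)
    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, blade)
    y_lower = y_upper + 10                      # gap between blades

plt.axvline(silhouette_score(X, labels), color="red", linestyle="--")  # average score
plt.xlabel("Silhouette score")
plt.ylabel("Cluster")
plt.show()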


Using the Plot to Find the Optimal 'k'

 

By generating and comparing Silhouette Plots for different values of k, we can choose the best one. We analyze the plots to see which value of k results in the most well-defined clusters.

[Figure: silhouette plots for k = 2, 3, 4, and 5]

When analyzing these plots, we look for:

  1. High Average Score: The red line (average score) should be as close to 1 as possible.

  2. Well-formed Blades: The blades for each cluster should be wide, indicating that most points have a high silhouette score.

  3. Similar Thickness: The blades should have a relatively uniform thickness, suggesting the clusters are of a balanced size.

  4. Few Negative Scores: There should be very few points with scores less than 0.

Analysis of the Example:

  • For k=2, the plot looks good, with a decent average score.

  • For k=3, the average score is even higher. All three blades extend past the average-score line and have very few low-scoring points.

  • For k=4 and k=5, the average score begins to decrease. We also see that some clusters are thinner and less uniform, suggesting that we are splitting natural clusters into smaller, less optimal ones.

Based on this visual evidence, k=3 is the best choice for this dataset. It yields the highest average Silhouette Score and results in clusters that are dense and well-separated.
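The same comparison can be made numerically by looping over k and tracking the average score; a minimal sketch on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 6):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: average silhouette = {score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

print("Best k by average silhouette:", best_k)

Since this synthetic data is generated with three blobs, k=3 should come out on top, matching the visual analysis above.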

