
NAÏVE BAYES Part 2

  • Writer: Aryan
  • Mar 16
  • 9 min read

Updated: Jun 20

UNDERFLOW

 

Underflow is a condition that can occur in computing when a number gets so close to zero that the computer can no longer store it accurately using floating-point representation. It happens when a calculated result has a smaller absolute value than the smallest nonzero value the computer can represent.

Most computers use a form of representation called floating-point to represent real numbers. This representation has a certain precision limit, and it can only represent numbers between a certain minimum and maximum value. If a number is too close to zero (but not zero), it might be smaller than the smallest representable positive number in the machine’s floating-point representation. When an operation on such small numbers is performed, the machine might round the result to zero, leading to a loss of precision.

Underflow can be a problem in certain domains, such as machine learning, where calculations often involve probabilities. Probabilities are positive numbers that can be very close to zero. When multiplying many small probabilities together, the result can underflow. One common way to avoid this is to perform the calculations in the log domain, where products of probabilities become sums of log-probabilities, thereby maintaining numerical precision.
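As a minimal Python sketch of this effect (the probability values here are made up purely for illustration), multiplying many small probabilities underflows to zero, while summing their logarithms does not:

import math

# Hypothetical example: 1000 independent probabilities of 1e-5 each.
probs = [1e-5] * 1000

# Direct product underflows: 1e-5000 is far below the smallest positive double.
product = 1.0
for p in probs:
    product *= p
print(product)        # prints 0.0 -- underflow

# Log domain: the product becomes a sum of logs, which stays representable.
log_product = sum(math.log(p) for p in probs)
print(log_product)    # roughly -11512.9, no underflow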


LAPLACE ADDITIVE SMOOTHING

 

Consider a dataset with two columns: review and sentiment. The review column contains words such as w₁, w₂, and w₃, and the sentiment column indicates whether the review is positive (+ve) or negative (-ve). We represent the reviews using a binary bag-of-words model, where each word wᵢ is a feature, and its value is 1 if the word appears in the review and 0 otherwise. The dataset is as follows:

Review    w₁    w₂    w₃    Sentiment

r₁        1     1     1     -ve

r₂        1     0     1     +ve

r₃        1     1     0     -ve

Additionally, for each sentiment class we have the count of reviews in which each word appears (i.e., the count of wᵢ = 1 per class):

Sentiment    w₁    w₂    w₃

+ve          1     0     1

-ve          2     2     1

Suppose we receive a new review r₄ that contains the word w₁ (twice) and the word w₃, but not w₂. In the binary bag-of-words model we only care about the presence of words, so r₄ is represented as {w₁ = 1, w₂ = 0, w₃ = 1}. Our task is to determine the sentiment of r₄ using Naive Bayes by calculating the posterior probabilities P(+ve | r₄) and P(−ve | r₄).


Using the Naive Bayes formula (and dropping the common evidence term P(r₄)), the posteriors are proportional to:

 

P(+ve | r₄) ∝ P(+ve) · P(w₁ = 1 | +ve) · P(w₂ = 0 | +ve) · P(w₃ = 1 | +ve)

 

P(-ve | r₄) ∝ P(-ve) · P(w₁ = 1 | -ve) · P(w₂ = 0 | -ve) · P(w₃ = 1 | -ve)


First, we compute the prior probabilities:

 

  • Total reviews = 3 (since there are r₁, r₂, r₃).

  • Number of +ve reviews = 1, so P(+ve) = 1/3

  • Number of −ve reviews = 2, so P(−ve) = 2/3


Next, we calculate the conditional probabilities using the counts from the table:

  • P(w₁ = 1 | +ve) = 1/1 = 1

  • P(w₂ = 0 | +ve) = 1/1 = 1

  • P(w₃ = 1 | +ve) = 1/1 = 1

  • P(w₁ = 1 | -ve) = 2/2 = 1

  • P(w₂ = 0 | -ve) = 0/2 = 0

  • P(w₃ = 1 | -ve) = 1/2


Now, substitute these into the Naive Bayes equations:

 

P(+ve | r₄) ∝ P(+ve) · P(w₁ = 1 | +ve) · P(w₂ = 0 | +ve) · P(w₃ = 1 | +ve) = (1/3) · 1 · 1 · 1 = 1/3

 

P(-ve | r₄) ∝ P(-ve) · P(w₁ = 1 | -ve) · P(w₂ = 0 | -ve) · P(w₃ = 1 | -ve) = (2/3) · 1 · 0 · (1/2) = 0


The probability P(−ve | r₄) becomes zero because P(w₂ = 0 | −ve) = 0. This zero probability causes the entire product to be zero, making it impossible to compare P(+ve | r₄) and P(−ve | r₄) meaningfully. This issue arises because w₂ is never absent in the −ve class in the training data, leading to a zero probability for w₂ = 0.

 

This zero-probability problem is a significant issue in Naive Bayes. When a feature value (e.g., w₂= 0) has not been observed in a particular class, the conditional probability becomes zero, and the entire product becomes zero, regardless of the other probabilities. This can lead to incorrect classifications, especially in small datasets where certain feature-class combinations may not appear.
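As a tiny sketch of how a single zero estimate wipes out the whole product (using the unsmoothed −ve class estimates computed above):

# Prior and unsmoothed conditional estimates for the -ve class.
p_neg_prior = 2 / 3
conditionals_neg = [2 / 2, 0 / 2, 1 / 2]   # P(w1=1|-ve), P(w2=0|-ve), P(w3=1|-ve)

score_neg = p_neg_prior
for p in conditionals_neg:
    score_neg *= p

print(score_neg)   # 0.0 -- the single zero term dominates the other factors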


Solution: Laplace Additive Smoothing

 

To address the zero-probability issue, we use Laplace additive smoothing. Laplace smoothing adds a small constant (typically 1) to the numerator and adjusts the denominator accordingly to account for all possible feature values. For a binary feature like wᵢ (which can be 0 or 1), there are 2 possible values. The smoothed probability is calculated as:

P(wᵢ = v | class) = (count(wᵢ = v, class) + α) / (count(class) + α · n)

Where:

  • count(wᵢ = v, class) is the number of reviews in the given class in which wᵢ = v.

  • count(class) is the total number of reviews in the given class.

  • α is the smoothing parameter (typically α = 1).

  • n is the number of possible values the feature can take (for a binary feature, n = 2).
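As a quick sketch, the smoothed estimate can be written as a small helper function (the function and argument names are our own, not from any library):

def smoothed_prob(count_value_in_class, count_class, alpha=1, n_values=2):
    # Laplace additive smoothing: add alpha to the numerator and
    # alpha * n_values to the denominator so no estimate can be zero.
    return (count_value_in_class + alpha) / (count_class + alpha * n_values)

# From the tables above: P(w2 = 0 | -ve) is 0/2 = 0 without smoothing,
# but (0 + 1) / (2 + 1 * 2) = 1/4 with alpha = 1.
print(smoothed_prob(0, 2))   # 0.25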


Using α = 1, let’s recompute the conditional probabilities:


  • P(w₁ = 1 | +ve) = (1 + 1)/(1 + 1·2) = 2/3

  • P(w₂ = 0 | +ve) = (1 + 1)/(1 + 1·2) = 2/3

  • P(w₃ = 1 | +ve) = (1 + 1)/(1 + 1·2) = 2/3

  • P(w₁ = 1 | -ve) = (2 + 1)/(2 + 1·2) = 3/4

  • P(w₂ = 0 | -ve) = (0 + 1)/(2 + 1·2) = 1/4

  • P(w₃ = 1 | -ve) = (1 + 1)/(2 + 1·2) = 2/4 = 1/2


Now, recompute the Naive Bayes probabilities:


P(+ve | r₄) ∝ (1/3) · (2/3) · (2/3) · (2/3) = (1/3) · (8/27) = 8/81 ≈ 0.099

P(−ve | r₄) ∝ (2/3) · (3/4) · (1/4) · (1/2) = (2/3) · (3/32) = 1/16 ≈ 0.063


Since 8/81 > 1/16, we classify r₄ as +ve.

With Laplace smoothing, none of the probabilities are zero, allowing us to compare P(+ ve ∣ r₄) and P(−ve ∣ r₄) and make a meaningful classification decision.
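To sanity-check the worked example, here is a minimal sketch using scikit-learn's BernoulliNB, whose alpha parameter is exactly this Laplace smoothing constant (the arrays and label strings below are our own encoding of the toy table):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary bag-of-words features [w1, w2, w3] for reviews r1, r2, r3.
X = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
y = np.array(["-ve", "+ve", "-ve"])

# alpha=1.0 reproduces the Laplace-smoothed estimates computed above.
clf = BernoulliNB(alpha=1.0)
clf.fit(X, y)

r4 = np.array([[1, 0, 1]])
print(clf.predict(r4))          # predicted sentiment for r4
print(clf.predict_proba(r4))    # normalized posteriors, ordered as clf.classes_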


BIAS-VARIANCE TRADE-OFF

 

Introduction to Laplace Smoothing and the Role of α

 

In the previous discussion, we applied Laplace additive smoothing to address the issue of zero probabilities in Naive Bayes classification. By adding a smoothing parameter α to the numerator and α⋅n to the denominator (where n is the number of possible feature values), we ensured that probabilities never become zero. However, the choice of α impacts the model’s performance beyond just avoiding zero probabilities. Specifically, the value of α influences the bias and variance of the model, leading to a trade-off that we need to understand and manage.

P(Fₖ = v | c) = (count(Fₖ = v, c) + α) / (count(c) + α · n)

The value of α affects each P(Fₖ | c), which in turn influences the overall prediction. Let’s explore how changing α impacts the bias and variance of the model.


Impact of Small α: High Variance

 

Consider a dataset with features F₁, F₂, F₃, … and a binary output class {Yes, No}. Suppose the training dataset has 1000 rows: 500 rows labeled "Yes" and 500 rows labeled "No". Assume that the feature F₁ is binary (n = 2), and its distribution is as follows:

 

  • In the "Yes" class, F₁ = 1 in 1 out of 500 rows (P(F₁ = 1 | Yes) = 1/500).

  • In the "No" class, F₁ = 1 in 250 out of 500 rows (P(F₁ = 1 | No) = 250/500 = 0.5).

 

Without smoothing (α = 0), these probabilities are directly used. However, they are highly sensitive to the training data. For example, if a different training dataset had slightly different counts, say F₁ = 1 in 3 out of 500 "Yes" rows (P(F₁ = 1 | Yes) = 3/500), the estimated probability changes significantly (from 1/500 = 0.002 to 3/500 = 0.006, a three-fold change). This large variation in estimated probabilities across different training datasets indicates high variance. A small α (close to 0) means the model is overfitting to the training data, as the probabilities are heavily influenced by small changes in the counts.


Impact of Large α: High Bias

 

Now, let’s examine the effect of a large α. Suppose we increase α to 1000 and recompute the probabilities for F₁:

P(F₁ = 1 | Yes) = (1 + 1000)/(500 + 1000 · 2) = 1001/2500 ≈ 0.4

P(F₁ = 1 | No) = (250 + 1000)/(500 + 1000 · 2) = 1250/2500 = 0.5

As α becomes very large, all probabilities approach 1/n ​, regardless of the actual counts in the data. In this case, P(F₁ = 1 ∣ Yes) ≈ P(F₁ = 1 ∣ No) ≈ 0.5 , meaning the model assumes that F₁ is equally likely to be 1 in both classes. This effectively ignores the training data, leading to high bias. The model underfits because it oversimplifies the relationship between features and classes, assuming a uniform distribution over feature values.
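A small sketch of this effect, sweeping α for the hypothetical counts above (1 out of 500 "Yes" rows and 250 out of 500 "No" rows with F₁ = 1):

def smoothed(count, class_total, alpha, n_values=2):
    # Laplace-smoothed estimate of P(F1 = 1 | class).
    return (count + alpha) / (class_total + alpha * n_values)

for alpha in [0, 1, 10, 100, 1000, 100000]:
    p_yes = smoothed(1, 500, alpha)      # P(F1 = 1 | Yes)
    p_no = smoothed(250, 500, alpha)     # P(F1 = 1 | No)
    print(f"alpha={alpha:>6}: P(F1=1|Yes)={p_yes:.4f}  P(F1=1|No)={p_no:.4f}")

# As alpha grows, both estimates are pulled toward 1/n = 0.5, washing out the
# difference between the classes (high bias); as alpha -> 0 the estimates
# track the raw counts exactly (high variance).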


The Bias-Variance Trade-Off

 

The value of α controls the bias-variance trade-off in Naive Bayes:

  • Small α (e.g., α → 0): The model has low bias because the probabilities closely reflect the training data (e.g., P(F₁ = 1 | Yes) = 1/500). However, it has high variance because small changes in the training data lead to large changes in the probabilities, causing overfitting.

  • Large α (e.g., α → ∞): The model has high bias because the probabilities are forced toward 1/n, ignoring the training data (e.g., P(F₁ = 1 | Yes) ≈ 0.5). However, it has low variance because the probabilities are stable across different training datasets, reducing sensitivity to data variations and leading to underfitting.



Application to Classification

 

Consider a query point with features F₁, F₂, …, Fₖ. The Naive Bayes classifier computes:

 

P(Yes | F₁ , F₂ , … , Fₖ) ∝ P(Yes) · P(F₁ | Yes) · P(F₂ | Yes) · … · P(Fₖ | Yes)

 

P(No | F₁ , F₂ , … , Fₖ) ∝ P(No) · P(F₁ | No) · P(F₂ | No) · … · P(Fₖ | No)

 

If P(F₁ | Yes) and P(F₁ | No) vary significantly across training datasets due to high variance (small α), the predictions will be inconsistent. Conversely, if α is too large, the probabilities for "Yes" and "No" become similar, and the model may always predict the class with the higher prior probability (e.g., if P(Yes) > P(No), it will always predict "Yes"). By tuning α, we can control this trade-off and improve the model’s generalization to new data.
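In practice, α is usually chosen by cross-validation. A minimal sketch with scikit-learn's GridSearchCV (the random X and y below are stand-ins for a real feature matrix and label vector):

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV

# Stand-in data: 1000 samples with 5 binary features and a binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Search a range of alpha values with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0]}
search = GridSearchCV(BernoulliNB(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # alpha with the best cross-validated score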


TYPES OF NAÏVE BAYES

 

GAUSSIAN NAÏVE BAYES

 

When to Use Gaussian Naive Bayes

 

Gaussian Naive Bayes is a variant of the Naive Bayes algorithm used when the input features are continuous (numerical) rather than categorical. It assumes that the continuous features follow a Gaussian (normal) distribution within each class. For example, in a dataset with features like GPA and IQ, and a binary output class (e.g., "Placed" or "Not Placed"), we should use Gaussian Naive Bayes if we believe the features are approximately normally distributed within each class.

 

Problem Setup


Consider a dataset with 1000 rows, where the input features are GPA and IQ, and the output is whether a student is "Placed" (Y) or "Not Placed" (N). The dataset is balanced, with 500 rows labeled "Placed" and 500 rows labeled "Not Placed". We split the data into two parts for analysis:


  • Placed (Y): 500 rows of students who were placed.

  • Not Placed (N): 500 rows of students who were not placed.

 

Our task is to determine whether a new student with a GPA of 8.1 and an IQ of 81 is likely to be placed. Using the Naive Bayes framework, we need to compute the posterior probabilities:

 

P(Y | GPA = 8.1, IQ = 81) ∝ P(Y) · P(GPA = 8.1 | Y) · P(IQ = 81 | Y)

P(N | GPA = 8.1, IQ = 81) ∝ P(N) · P(GPA = 8.1 | N) · P(IQ = 81 | N)


The prior probabilities are:

  • P(Y) = 500/1000 = 1/2  since 500 out of 1000 students are placed.

  • P(N) = 500/1000 = 1/2 since 500 out of 1000 students are not placed.


The challenge lies in computing the conditional probabilities P(GPA = 8.1 | Y), P(IQ = 81 | Y), P(GPA = 8.1 | N), and P(IQ = 81 | N). In a categorical Naive Bayes approach (e.g., Multinomial or Bernoulli), if a specific value like GPA = 8.1 is not present in the training data for a given class, the probability might be zero, leading to the zero-probability problem. However, Gaussian Naive Bayes handles continuous features differently, avoiding this issue.


How Gaussian Naive Bayes Works

 

Gaussian Naive Bayes assumes that the continuous features (GPA and IQ) follow a Gaussian distribution within each class. For a feature X (e.g., GPA or IQ) in class c (e.g., Y or N), the probability density is modeled using the Gaussian probability density function (PDF):

P(X = x | c) = (1 / (σ_c √(2π))) · exp(−(x − μ_c)² / (2σ_c²)), where μ_c and σ_c are the mean and standard deviation of feature X estimated from the training rows of class c.

Estimating the mean and standard deviation of GPA and IQ separately within the "Placed" and "Not Placed" groups, we substitute x = 8.1 and x = 81 into the corresponding Gaussian PDFs to obtain P(GPA = 8.1 | Y), P(IQ = 81 | Y), P(GPA = 8.1 | N), and P(IQ = 81 | N), and multiply each set by its class prior.

We compare the two values and classify the student into the class with the higher posterior probability. Note that since the features are continuous, Gaussian Naive Bayes can compute a probability density for any value of GPA or IQ, even if that exact value (e.g., GPA = 8.1) is not present in the training data.
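A minimal sketch of this computation, assuming small placeholder arrays in place of the 500 training rows per class (the numbers are illustrative only):

import numpy as np
from scipy.stats import norm

# Placeholder training data per class (stand-ins for the 500 rows each).
gpa_placed = np.array([8.0, 8.5, 9.0, 7.8])
iq_placed = np.array([85, 90, 110, 95])
gpa_not = np.array([6.0, 6.5, 7.0, 5.8])
iq_not = np.array([70, 75, 80, 65])

def likelihood(x, values):
    # Gaussian PDF at x, with mean and std estimated from the class's data.
    return norm.pdf(x, loc=values.mean(), scale=values.std(ddof=1))

prior_y = prior_n = 0.5   # 500 placed and 500 not placed out of 1000

# Posterior scores (proportional to the true posteriors).
score_y = prior_y * likelihood(8.1, gpa_placed) * likelihood(81, iq_placed)
score_n = prior_n * likelihood(8.1, gpa_not) * likelihood(81, iq_not)

print("Placed" if score_y > score_n else "Not Placed")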


Practical Considerations

 

While the Gaussian assumption simplifies the computation, it may not always hold in practice. If the features do not follow a normal distribution (e.g., GPA might be skewed), the model’s performance could suffer. In such cases, you might consider:

  • Transforming the features to make them more Gaussian (e.g., using a log or Box-Cox transformation).

  • Using a different variant of Naive Bayes, such as Kernel Naive Bayes, which uses kernel density estimation instead of assuming a Gaussian distribution.

  • Checking the distribution of the features using histograms or statistical tests (e.g., the Shapiro-Wilk test) to verify the Gaussian assumption; a quick sketch of such a check follows below.
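A brief sketch of such a check with SciPy's Shapiro-Wilk test, plus a log transform, on a hypothetical skewed GPA sample:

import numpy as np
from scipy.stats import shapiro

# Hypothetical right-skewed GPA values for one class.
gpa = np.array([5.2, 5.5, 5.6, 5.8, 6.0, 6.1, 6.3, 6.8, 7.5, 9.8])

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed.
stat, p_value = shapiro(gpa)
print(f"raw GPA: W={stat:.3f}, p={p_value:.3f}")

# A log transform can reduce the skew before fitting Gaussian Naive Bayes.
stat_log, p_log = shapiro(np.log(gpa))
print(f"log GPA: W={stat_log:.3f}, p={p_log:.3f}")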


CATEGORICAL NAÏVE BAYES

 

Categorical Naive Bayes is a variant of the Naive Bayes algorithm designed to handle datasets where the input features are categorical (e.g., "Sunny", "Hot", "High", "True") rather than numerical. It assumes that the features are independent given the class (the Naive Bayes assumption) and computes the probability of each class based on the frequency of feature values in the training data.

 

Dataset and Problem Setup

 

Consider a dataset with the following features: Outlook, Temperature, Humidity, and Windy, and a binary output class PlayTennis ("Yes" or "No"). The dataset contains 14 rows:

[Table: the 14-row PlayTennis dataset with features Outlook, Temperature, Humidity, Windy, and the class label PlayTennis]
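As a sketch of how this looks in code, here is scikit-learn's CategoricalNB fitted on the classic 14-row PlayTennis (weather) dataset; we assume the post uses this standard textbook version of the table, and the query row is just an example:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# The classic PlayTennis dataset: [Outlook, Temperature, Humidity, Windy] -> PlayTennis.
rows = [
    ["Sunny", "Hot", "High", "False", "No"],
    ["Sunny", "Hot", "High", "True", "No"],
    ["Overcast", "Hot", "High", "False", "Yes"],
    ["Rainy", "Mild", "High", "False", "Yes"],
    ["Rainy", "Cool", "Normal", "False", "Yes"],
    ["Rainy", "Cool", "Normal", "True", "No"],
    ["Overcast", "Cool", "Normal", "True", "Yes"],
    ["Sunny", "Mild", "High", "False", "No"],
    ["Sunny", "Cool", "Normal", "False", "Yes"],
    ["Rainy", "Mild", "Normal", "False", "Yes"],
    ["Sunny", "Mild", "Normal", "True", "Yes"],
    ["Overcast", "Mild", "High", "True", "Yes"],
    ["Overcast", "Hot", "Normal", "False", "Yes"],
    ["Rainy", "Mild", "High", "True", "No"],
]
X_raw = [r[:4] for r in rows]
y = [r[4] for r in rows]

# CategoricalNB expects integer-encoded categories, so encode the strings first.
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)

clf = CategoricalNB(alpha=1.0)   # alpha is the same Laplace smoothing parameter
clf.fit(X, y)

# Example query: Outlook=Sunny, Temperature=Cool, Humidity=High, Windy=True.
query = encoder.transform([["Sunny", "Cool", "High", "True"]])
print(clf.predict(query), clf.predict_proba(query))

With alpha = 1.0, CategoricalNB applies the same Laplace additive smoothing discussed earlier to every categorical feature.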

