
NAÏVE BAYES Part - 1
- Aryan

- Mar 15
- 10 min read
Updated: Jun 12

Naïve Bayes is a powerful probabilistic machine learning algorithm used primarily for classification tasks. It is based on Bayes’ Theorem, which describes how to update the probability of a hypothesis given new evidence. The algorithm is called "naïve" because it assumes that all features in the dataset are independent of one another, which is rarely true in real-world scenarios. Despite this strong assumption, Naïve Bayes performs remarkably well in many practical applications such as spam filtering, sentiment analysis, document classification, and fraud detection.
Intuition Behind Naïve Bayes Classifier Using Play Tennis Dataset

Consider a dataset called "Play Tennis," which records whether a person decides to play tennis based on weather conditions over 14 days. This is a binary classification problem where the outcome is either "Yes" (the person plays tennis) or "No" (the person does not play tennis). The decision is influenced by four weather features: Outlook, Temperature, Humidity, and Windy. For example:
Day 1: Outlook = Sunny, Temperature = Hot, Humidity = High, Windy = False → Outcome = Yes (plays tennis).
Day 2: Outlook = Sunny, Temperature = Hot, Humidity = High, Windy = True → Outcome = No (does not play tennis).
This dataset spans 14 days, and the goal is to use it to train a Naïve Bayes classifier. Once trained, the classifier will predict whether the person will play tennis given a new set of weather conditions. For instance, we are given a new query point with the conditions: {Outlook = Sunny, Temperature = Cool, Humidity = Normal, Windy = True}, and we need to predict the outcome.
Naïve Bayes and Bayes' Theorem
The Naïve Bayes classifier is based on Bayes' Theorem, which is mathematically expressed as:
P(A ∣ B) = [P(B ∣ A) ⋅ P(A)] / P(B)
Here:
P(A∣B) : Posterior probability (e.g., probability of "Yes" given weather conditions).
P(B ∣ A) : Likelihood (e.g., probability of weather conditions given "Yes").
P(A) : Prior probability (e.g., probability of "Yes").
P(B) : Evidence (e.g., probability of weather conditions).
In our case, we define:
W : Weather conditions (e.g., {Sunny, Cool, Normal, True}).
P(Yes ∣ W) : Probability of playing tennis given weather W.
P(No ∣ W) : Probability of not playing tennis given weather W.
Using Bayes' Theorem, we calculate:
P(Yes ∣ W) = [P(W ∣ Yes) ⋅ P(Yes)] / P(W)
P(No ∣ W) = [P(W ∣ No) ⋅ P(No)] / P(W)
Since P(W) (the denominator) is the same for both probabilities and acts as a normalizing constant, we can ignore it when comparing P(Yes ∣ W) and P(No ∣ W) . Thus, we simplify to:
P(Yes ∣ W) ∝ P(W ∣ Yes)⋅P(Yes)
P(No ∣ W) ∝ P(W ∣ No)⋅P(No)
The classifier will predict "Yes" if P(Yes ∣ W) > P(No ∣ W) , and "No" otherwise.
Calculating Probabilities
Step 1: Prior Probabilities
From the 14-day dataset, suppose:
Number of "Yes" days = 9.
Number of "No" days = 5.
Then: P(Yes) = 9/14 ≈ 0.643
P(No) = 5/14 ≈ 0.357
Step 2: Likelihood Probabilities
The challenge lies in calculating P(W ∣ Yes) and P(W ∣ No), where W = {Sunny, Cool, Normal, True}. Using Bayes' Theorem directly, we would compute:
P(W ∣ Yes) = P(Sunny ∩ Cool ∩ Normal ∩ True ∣ Yes)
P(W ∣ No) = P(Sunny ∩ Cool ∩ Normal ∩ True ∣ No)
This requires counting the days when all four conditions occur simultaneously:
P(W ∣ Yes) : Probability of days with W = {Sunny, Cool, Normal, True} when the outcome is "Yes".
P(W ∣ No) : Probability of days with W = {Sunny, Cool, Normal, True} when the outcome is "No".
However, if no day in the dataset exactly matches W = {Sunny, Cool, Normal, True} for either "Yes" or "No," then:
P(W ∣ Yes) = 0
P(W ∣ No) = 0
This results in: P(Yes ∣ W) = 0
P(No ∣ W) = 0
When both probabilities are zero, the model fails to make a prediction. This is a limitation of standard Bayes' Theorem: it requires exact matches in the dataset, which may not always exist, especially with small datasets or multiple features.
Naïve Bayes Solution
The Naïve Bayes classifier overcomes this issue by assuming conditional independence between features. Instead of calculating the joint probability
P(Sunny ∩ Cool ∩ Normal ∩ True ∣ Yes) , it breaks it down into the product of individual probabilities:
P(W ∣ Yes) = P(Sunny ∣ Yes) ⋅ P(Cool ∣ Yes) ⋅ P(Normal ∣ Yes) ⋅ P(True ∣ Yes)
P(W ∣ No) = P(Sunny ∣ No) ⋅ P(Cool ∣ No) ⋅ P(Normal ∣ No) ⋅ P(True ∣ No)
Example Calculation
Assume the following counts from the dataset:
"Yes" days = 9:
Sunny = 2, Cool = 3, Normal = 6, True = 3.
"No" days = 5:
Sunny = 3, Cool = 1, Normal = 1, True = 2.
Then:
P(Sunny ∣ Yes) = 2/9 ≈ 0.222
P(Cool ∣ Yes) = 3/9 ≈ 0.333
P(Normal ∣ Yes) = 6/9 ≈ 0.667
P(True ∣ Yes) = 3/9 ≈ 0.333
P(Sunny ∣ No) = 3/5 = 0.6
P(Cool ∣ No) = 1/5 = 0.2
P(Normal ∣ No) = 1/5 = 0.2
P(True ∣ No) = 2/5 = 0.4
Now compute:
P(W ∣ Yes) = 0.222 × 0.333 × 0.667 × 0.333 ≈ 0.0164
P(Yes ∣ W) ∝ P(W ∣ Yes) ⋅ P(Yes) = 0.0164 × 0.643 ≈ 0.0105
P(W ∣ No) = 0.6 × 0.2 × 0.2 × 0.4 = 0.0096
P(No ∣ W) ∝ P(W ∣ No) ⋅ P(No) = 0.0096 × 0.357 ≈ 0.0034
Step 3: Prediction
Since P(Yes ∣ W) ≈ 0.0105 > P(No ∣ W) ≈ 0.0034, the Naïve Bayes classifier predicts "Yes": the person will play tennis under the conditions W = {Sunny, Cool, Normal, True}.
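To make the arithmetic concrete, here is a minimal Python sketch that hard-codes the counts assumed above and reproduces the comparison. It is an illustration of this specific calculation, not a general implementation.

# Priors and conditional probabilities taken from the assumed counts above.
priors = {"Yes": 9 / 14, "No": 5 / 14}
likelihoods = {
    "Yes": {"Sunny": 2 / 9, "Cool": 3 / 9, "Normal": 6 / 9, "True": 3 / 9},
    "No":  {"Sunny": 3 / 5, "Cool": 1 / 5, "Normal": 1 / 5, "True": 2 / 5},
}

query = ["Sunny", "Cool", "Normal", "True"]

scores = {}
for cls, prior in priors.items():
    score = prior
    for value in query:
        score *= likelihoods[cls][value]  # naive independence: multiply per-feature likelihoods
    scores[cls] = score

print(scores)                       # roughly {'Yes': 0.0106, 'No': 0.0034}
print(max(scores, key=scores.get))  # 'Yes'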
MATHEMATICAL FORMULATION
For a query point with features X = (x₁, x₂, ..., xₙ) and a class y, applying Bayes' Theorem together with the conditional independence assumption gives:
P(y ∣ x₁, ..., xₙ) = [P(y) ⋅ P(x₁ ∣ y) ⋅ P(x₂ ∣ y) ⋅ ... ⋅ P(xₙ ∣ y)] / P(x₁, ..., xₙ)
Since the denominator P(x₁, ..., xₙ) does not depend on the class, it can be dropped when comparing classes:
P(y ∣ x₁, ..., xₙ) ∝ P(y) ⋅ P(x₁ ∣ y) ⋅ P(x₂ ∣ y) ⋅ ... ⋅ P(xₙ ∣ y)
Decision Rule (Maximum A Posteriori - MAP Estimation)
ŷ = argmax over all classes y of [ P(y) ⋅ P(x₁ ∣ y) ⋅ P(x₂ ∣ y) ⋅ ... ⋅ P(xₙ ∣ y) ]
The Naïve Bayes classifier predicts the class with the highest posterior probability, making it an efficient and widely used approach for multiclass classification.
What Happens in Naïve Bayes During Training and Testing?

Overview
The Naïve Bayes algorithm operates in two primary phases: training and testing. Below is a detailed explanation of what occurs during each phase, based on the provided dataset.
Training Phase
During the training phase, the Naïve Bayes algorithm calculates probabilities in advance for each feature and class combination. The dataset includes two classes in the "Play Tennis" column: "Yes" and "No." For each input feature (e.g., Outlook, Temperature, Humidity, Windy), the algorithm determines the number of categories and computes the corresponding probabilities.
Class Probabilities:
The prior probabilities of the classes "Yes" and "No" are calculated based on their frequency in the dataset.
Example: P(Yes) and P(No).
Feature Probabilities:
For each input feature, the algorithm identifies the categories and calculates conditional probabilities for both "Yes" and "No" classes.
Outlook: Categories are Sunny, Overcast, and Rainy. The algorithm calculates 6 probabilities:
P(Sunny | Yes), P(Sunny | No), P(Overcast | Yes), P(Overcast | No), P(Rainy | Yes), P(Rainy | No).
Temperature: Categories are Hot, Mild, and Cool. The algorithm calculates 6 probabilities:
P(Hot | Yes), P(Hot | No), P(Mild | Yes), P(Mild | No), P(Cool | Yes), P(Cool | No).
Humidity: Categories are High and Normal. The algorithm calculates 4 probabilities:
P(High | Yes), P(High | No), P(Normal | Yes), P(Normal | No).
Windy: Categories are True and False. The algorithm calculates 4 probabilities:
P(True | Yes), P(True | No), P(False | Yes), P(False | No).
Storage:
These precomputed probabilities are stored in a dictionary or similar data structure for efficient retrieval during the testing phase.
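As a rough sketch of this training bookkeeping, the snippet below builds the prior and conditional-probability dictionaries from a list of (features, label) rows. The handful of rows shown are made-up placeholders, not the article's full 14-day table.

from collections import Counter, defaultdict

# Placeholder rows for illustration only (not the full 14-day dataset).
rows = [
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Windy": "False"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Windy": "False"}, "Yes"),
    ({"Outlook": "Rainy",    "Temperature": "Mild", "Humidity": "Normal", "Windy": "True"},  "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Windy": "True"},  "Yes"),
]

# Prior probabilities P(class).
class_counts = Counter(label for _, label in rows)
priors = {cls: count / len(rows) for cls, count in class_counts.items()}

# Conditional probabilities: cond_prob[class][feature][value] = P(value | class).
cond_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in rows:
    for feature, value in features.items():
        cond_counts[label][feature][value] += 1

cond_prob = {
    cls: {feature: {value: count / class_counts[cls] for value, count in value_counts.items()}
          for feature, value_counts in feature_counts.items()}
    for cls, feature_counts in cond_counts.items()
}

print(priors)
print(cond_prob["Yes"]["Outlook"])  # the P(Outlook value | Yes) entries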
Testing Phase
During the testing phase, the algorithm uses the precomputed probabilities to classify new data points. For example, consider a query point (Sunny, Hot, High, False). The steps are as follows:
Probability Calculation:
The algorithm fetches the relevant probabilities from the dictionary (e.g., P(Sunny | Yes), P(Hot | Yes), P(High | Yes), P(False | Yes), and their "No" counterparts).
It then computes the posterior probability for the query point belonging to the "Yes" class and the "No" class using the Naïve Bayes formula:
P(Yes | Sunny, Hot, High, False) ∝ P(Yes) × P(Sunny | Yes) × P(Hot | Yes) × P(High | Yes) × P(False | Yes).
P(No |Sunny, Hot, High, False) ∝ P(No) × P(Sunny | No) × P(Hot | No) × P(High | No) × P(False | No).
Comparison and Prediction:
The algorithm compares the computed probabilities for "Yes" and "No".
The class with the higher probability is assigned as the predicted outcome (e.g., "Yes" or "No").
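Prediction is then just a lookup in those dictionaries followed by a product and a comparison. A minimal sketch is below; the probability tables are made-up placeholder numbers used only to demonstrate the lookup, not values computed from the dataset.

# Placeholder probability tables (illustrative numbers only).
priors = {"Yes": 0.64, "No": 0.36}
cond_prob = {
    "Yes": {"Outlook": {"Sunny": 0.22}, "Temperature": {"Hot": 0.22},
            "Humidity": {"High": 0.33}, "Windy": {"False": 0.67}},
    "No":  {"Outlook": {"Sunny": 0.60}, "Temperature": {"Hot": 0.40},
            "Humidity": {"High": 0.80}, "Windy": {"False": 0.40}},
}

def predict(query, priors, cond_prob):
    # Score each class: prior times the product of the fetched conditional probabilities.
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in query.items():
            score *= cond_prob[cls][feature][value]
        scores[cls] = score
    return max(scores, key=scores.get), scores

query = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Windy": "False"}
label, scores = predict(query, priors, cond_prob)
print(label, scores)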
How Naïve Bayes Handles Numerical Data
Consider a dataset that contains a numerical feature (age) and a categorical target variable (married: Y/N).
Problem Statement
Suppose a new query point is received, such as age = 55, and we need to determine whether the person is married (Y) or not married (N).
Using Bayes' Theorem and dropping the common denominator P(55), the posterior probabilities are proportional to:
P(Y ∣ 55) ∝ P(55 ∣ Y) ⋅ P(Y)
P(N ∣ 55) ∝ P(55 ∣ N) ⋅ P(N)
Where:
P(55 ∣ Y) is the likelihood of age being 55 given that the person is married.
P(55 ∣ N) is the likelihood of age being 55 given that the person is not married.
P(Y) and P(N) are the prior probabilities of being married or not married, respectively.
Challenge with Numerical Data
If the dataset does not contain an exact age of 55, the probabilities P(55 ∣ Y) or P(55∣N) would be zero, leading to incorrect results.
To address this, we assume that the age column follows a Gaussian (Normal) distribution and estimate probabilities using a probability density function (PDF) instead of discrete counts.
Gaussian Naïve Bayes Approach
The Gaussian Naïve Bayes method adapts the algorithm to numerical data by modeling the likelihood of continuous variables using a Gaussian distribution. The steps are as follows:
Compute the Mean (μ) and Standard Deviation (σ) of Age
Calculate these statistical measures separately for the married (Y) and not married (N) groups based on the training data.
Use the Gaussian Probability Density Function (PDF)
Estimate the likelihoods P(55∣Y) and P(55∣N) using the Gaussian PDF:
P(x ∣ class) = (1 / (σ √(2π))) ⋅ e^( −(x − μ)² / (2σ²) )
Here, x is the query age (e.g., 55), μ is the mean, and σ is the standard deviation of the respective group (Y or N).
Substitute the Query Age into the Formula
Plug the query age (55) into the PDF formula and compute the probability density.
Example: If the calculated probability density is 0.62, then P(55∣Y)=0.62 .
Compute Probabilities for Other Values
Repeat the process to estimate probability densities for other relevant ages or features in the dataset, if applicable.
Compute Final Probabilities and Classify
Calculate the final posterior probabilities for married (Y) and not married (N) by combining the likelihoods with prior probabilities:
P(Y ∣ 55) ∝ P(55 ∣ Y) ⋅ P(Y)
P(N ∣ 55) ∝ P(55 ∣ N) ⋅ P(N)
Choose the class (married or not married) with the highest probability as the predicted outcome.
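The steps above can be sketched in a few lines of Python. The ages below are made-up placeholder values, and the Gaussian PDF is written out explicitly rather than taken from a library.

import math

# Made-up ages for each class (placeholders, not real data).
ages_married     = [32, 45, 51, 60, 58, 47]
ages_not_married = [22, 25, 28, 31, 36, 24]

def gaussian_pdf(x, mu, sigma):
    # Gaussian probability density evaluated at x.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mean_std(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

mu_y, sigma_y = mean_std(ages_married)
mu_n, sigma_n = mean_std(ages_not_married)

# Priors from the (made-up) group sizes.
total = len(ages_married) + len(ages_not_married)
p_y, p_n = len(ages_married) / total, len(ages_not_married) / total

x = 55  # query age
score_y = gaussian_pdf(x, mu_y, sigma_y) * p_y  # proportional to P(Y | 55)
score_n = gaussian_pdf(x, mu_n, sigma_n) * p_n  # proportional to P(N | 55)

print("married" if score_y > score_n else "not married", score_y, score_n)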
What if Data is Not Gaussian?
Data Transformation:
Apply transformations to make the data more normally distributed.
Common transformations include logarithm, square root, and reciprocal transformations.
Alternative Distributions:
If the data follows a non-normal distribution (e.g., exponential, Poisson), modify the Naïve Bayes algorithm to assume that specific distribution when calculating likelihoods.
Discretization:
Convert continuous data into categorical data by binning the values.
Binning methods include equal width bins, equal frequency bins, and k-means clustering.
After binning, use Multinomial or Bernoulli Naïve Bayes methods (a short sketch follows this list).
Kernel Density Estimation (KDE):
A non-parametric method to estimate the probability density function when the distribution is unknown.
Use Other Models:
If the above methods don't work, use a classification algorithm that doesn’t assume normality, such as Decision Trees, Random Forests, or Support Vector Machines (SVMs).
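As one example of the discretization route mentioned above, the sketch below bins two made-up continuous features with scikit-learn's KBinsDiscretizer and then fits a Multinomial Naïve Bayes model on the one-hot-encoded bins. It assumes scikit-learn is installed; the data and parameters are purely illustrative.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer

# Made-up continuous features and binary labels, for illustration only.
X = np.array([[2.1, 30], [2.9, 45], [3.4, 52], [1.8, 25],
              [3.9, 61], [2.5, 38], [3.1, 49], [1.5, 22]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Discretize each column into equal-frequency bins, one-hot encoded.
binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X)

# Multinomial Naive Bayes on the binned (now categorical-style) features.
model = MultinomialNB()
model.fit(X_binned, y)

print(model.predict(binner.transform(np.array([[3.0, 50]]))))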
Naïve Bayes on Text Data
Naïve Bayes is a popular algorithm for text classification tasks, such as sentiment analysis. Suppose we have a sentiment analysis dataset with two columns:
One column contains movie reviews (text data).
The other column contains sentiment labels (e.g., positive or negative).
Our goal is to determine the sentiment based on the review using the Naïve Bayes classifier.
Steps to Implement Naïve Bayes for Sentiment Analysis
Text Preprocessing
Remove HTML tags, special characters, and stopwords.
Convert all text to lowercase.
Perform tokenization and, if necessary, stemming or lemmatization.
Vectorization (Feature Extraction)
Convert the cleaned text into numerical form using techniques such as:
One-Hot Encoding
Bag of Words (BoW)
Term Frequency-Inverse Document Frequency (TF-IDF)
Word Embeddings (e.g., Word2Vec, GloVe)
Model Training with Naïve Bayes
Apply the Naïve Bayes algorithm (e.g., Multinomial Naïve Bayes) on the vectorized data.
Train the model and evaluate its performance using appropriate metrics such as accuracy, precision, recall, and F1-score.
By following these steps, we can effectively classify the sentiment of movie reviews using Naïve Bayes.
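A compact scikit-learn sketch of this pipeline is shown below, using TF-IDF vectorization and Multinomial Naïve Bayes on a few made-up reviews. It assumes scikit-learn is installed; a real project would use a proper training set and an evaluation split.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up reviews and labels, for illustration only.
reviews = [
    "loved the movie, great acting and a wonderful story",
    "terrible plot, boring and way too long",
    "an absolute delight, would watch again",
    "awful film, a complete waste of time",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF vectorization followed by Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["what a wonderful, great film"]))  # likely ['positive'] on this toy data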
Numerical Stability and Underflow in Computing
In early computing, representing decimal numbers in memory posed significant challenges. Computers store numbers in binary format using floating-point representation, which has limited precision. This limitation makes it difficult to accurately store and manipulate very small decimal numbers, such as 0.000001. For example, suppose we are comparing the radii of two cells, recorded as 0.000001 and 0.000003. Due to the finite precision of floating-point representation, a computer might treat these numbers as zero if they fall below the machine’s precision threshold (a phenomenon known as underflow). This can lead to incorrect comparisons or calculations, as the computer may fail to distinguish between the two values.
Underflow is a common issue in computations involving very small numbers, and it can also affect algorithms like Naive Bayes, where probabilities are often multiplied together.
Naive Bayes and Underflow in Probability Calculations
Consider a dataset where we aim to predict whether a student belongs to the "Yes" class or the "No" class based on two input columns: GPA and IQ. We receive a new query point with a GPA of 8.1 and an IQ of 81, and we need to calculate the probabilities of the student belonging to each class using the Naive Bayes algorithm. The probabilities are given as follows:
P(Yes ∣ 8.1,81) = P(Yes) × P(8.1∣Yes) × P(81∣Yes)
P(No ∣ 8.1,81) = P(No) × P(8.1∣No) × P(81 ∣ No)
Here, P(8.1∣Yes) represents the conditional probability of a GPA of 8.1 given the "Yes" class, and P(81∣Yes) represents the conditional probability of an IQ of 81 given the "Yes" class. The same applies to the "No" class. In practice, a dataset may involve 100 or 250 such conditional probabilities (e.g., due to many features), each lying between 0 and 1 (e.g., 0.3, 0.6, 0.4).
Every additional factor makes the running product smaller, and the result can quickly fall below the smallest value a floating-point number can represent. When that happens, the computer rounds it to zero (underflow), both class scores can become zero, and the comparison breaks down.
Solution: Using Log Probabilities to Avoid Underflow
To address underflow, we use the logarithmic transformation of probabilities. The logarithm leverages the property that
log(a⋅b) = log(a) + log(b). Applying this to the Naive Bayes probabilities:
P(Yes ∣ 8.1,81) = P(Yes) ⋅ P(8.1∣Yes) ⋅ P(81∣Yes)
Taking the logarithm, this becomes:
log (P(Yes∣8.1,81)) = log (P(Yes)) + log (P(8.1∣Yes)) + log (P(81∣Yes))
Similarly, for the "No" class:
log (P(No∣8.1,81)) = log (P(No)) + log (P(8.1∣No)) + log (P(81∣No))
Instead of multiplying many small probabilities (which risks underflow), we compute the logarithms of each probability and add them. Since probabilities are between 0 and 1, their logarithms are negative (e.g., log (0.5) ≈ −0.693 ). The class with the highest log probability corresponds to the class with the highest probability, as the logarithm is a monotonically increasing function.
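A quick way to see the effect in Python, using made-up probability values: multiplying many of them underflows to zero, while summing their logarithms stays a well-behaved negative number.

import math

# Many small per-feature probabilities (made-up values for illustration).
probs = [0.3, 0.6, 0.4] * 300  # 900 factors

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the long product of values below 1 underflows

log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -789.3, a finite negative number that is safe to compare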
Example Interpretation
Suppose after calculating the log probabilities for 250 features, we obtain:
log (P(Yes ∣ data)) = −153
log (P(No ∣ data)) = −135
Since logarithms of probabilities between 0 and 1 are negative, a larger (less negative) log probability indicates a higher probability. Here, −135 > −153 , meaning P(No ∣ data) > P(Yes ∣ data). Therefore, the query point is classified as belonging to the "No" class, as it has the higher log probability.

