
NAÏVE BAYES Part - 3
- Aryan

- Mar 17
- 9 min read
BERNOULLI NAÏVE BAYES
Introduction to Bernoulli Naive Bayes
Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm designed for classification tasks where the input features are binary (0 or 1), following a Bernoulli distribution. Each feature represents the presence (1) or absence (0) of a particular attribute, and the algorithm assumes that the features are conditionally independent given the class label. It is particularly well suited to datasets where each feature has only two possible outcomes, such as binary-valued feature vectors. If the input data is not already binary (e.g., word counts in text), implementations such as scikit-learn's BernoulliNB can binarize it with a threshold parameter, converting counts into presence/absence indicators (0s and 1s).
This approach works best on datasets where each feature follows a Bernoulli distribution, i.e., takes only two values (e.g., present or absent). Such data can be generated from a binary bag-of-words representation, where each document is represented by the presence or absence of specific words.
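For instance, scikit-learn's BernoulliNB exposes a binarize threshold for exactly this purpose. A minimal sketch (the count matrix below is made up purely for illustration):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word-count matrix: rows = documents, columns = words.
X_counts = np.array([[2, 0, 1],
                     [0, 3, 0],
                     [1, 1, 0],
                     [0, 0, 2]])
y = np.array([1, 0, 1, 0])

# binarize=0.0 maps every count greater than 0 to 1 (presence) and leaves 0 as 0 (absence),
# so the model works with presence/absence rather than raw frequencies.
clf = BernoulliNB(alpha=1.0, binarize=0.0)
clf.fit(X_counts, y)
print(clf.predict(np.array([[1, 0, 2]])))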
Dataset and Problem Setup
Consider a dataset of text documents with a binary class label indicating whether the document contains "China" (Yes = 1) or not (No = 0). The dataset includes four training documents and one test document. The original data is:

Document     Text                                   Contains "China"?
d1           Chinese Beijing Chinese                Yes
d2           Chinese Chinese Shanghai               Yes
d3           Chinese Macao                          Yes
d4           Tokyo Japan Chinese                    No
d5 (test)    Chinese Chinese Chinese Tokyo Japan    ?
We convert this into a binary bag-of-words representation, where each row indicates the presence (1) or absence (0) of the words {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}:

Document     Chinese  Beijing  Shanghai  Macao  Tokyo  Japan  Y (China?)
d1           1        1        0         0      0      0      1
d2           1        0        1         0      0      0      1
d3           1        0        0         1      0      0      1
d4           1        0        0         0      1      1      0
d5 (test)    1        0        0         0      1      1      ?

Our task is to predict whether the test document d5 (with features {Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1}) contains "China" (i.e., Y = 1) or not (i.e., Y = 0).

Step 1: Count Feature Occurrences in Each Class
Calculate the counts for each word value in each class:
Class Y = 1 (Yes, 3 documents):
Chinese = 1: 3 (all 3 documents have Chinese).
Beijing = 1: 1 (d1 has Beijing).
Shanghai = 1: 1 (d2 has Shanghai).
Macao = 1: 1 (d3 has Macao).
Tokyo = 1: 0 (no document has Tokyo).
Japan = 1: 0 (no document has Japan).
Beijing = 0: 2 (d2,d3 have no Beijing).
Shanghai = 0: 2 (d1,d3 have no Shanghai).
Macao = 0: 2 (d1,d2 have no Macao).
Tokyo = 0: 3 (all 3 documents have no Tokyo).
Japan = 0: 3 (all 3 documents have no Japan).
Class Y = 0 (No, 1 document):
Chinese = 1: 1 (d4 has Chinese).
Beijing = 1: 0 (d4 has no Beijing).
Shanghai = 1: 0 (d4 has no Shanghai).
Macao = 1: 0 (d4 has no Macao).
Tokyo = 1: 1 (d4 has Tokyo).
Japan = 1: 1 (d4 has Japan).
Beijing = 0: 1 (d4 has no Beijing).
Shanghai = 0: 1 (d4 has no Shanghai).
Macao = 0: 1 (d4 has no Macao).
Tokyo = 0: 0 (d4 has Tokyo).
Japan = 0: 0 (d4 has Japan).
Step 2: Compute Prior and Conditional Probabilities with Laplace Smoothing (α = 1)
The class priors are P(Y = 1) = 3/4 and P(Y = 0) = 1/4. Applying Laplace smoothing, P(xᵢ = 1 ∣ Y = c) = (count(xᵢ = 1, Y = c) + α) / (count(Y = c) + 2α), and P(xᵢ = 0 ∣ Y = c) = 1 − P(xᵢ = 1 ∣ Y = c):
Class Y = 1 (3 documents):
P(Chinese = 1 ∣ Y = 1) = (3 + 1)/(3 + 2) = 4/5
P(Beijing = 1 ∣ Y = 1) = (1 + 1)/(3 + 2) = 2/5
P(Shanghai = 1 ∣ Y = 1) = (1 + 1)/(3 + 2) = 2/5
P(Macao = 1 ∣ Y = 1) = (1 + 1)/(3 + 2) = 2/5
P(Tokyo = 1 ∣ Y = 1) = (0 + 1)/(3 + 2) = 1/5
P(Japan = 1 ∣ Y = 1) = (0 + 1)/(3 + 2) = 1/5
Class Y = 0 (1 document):
P(Chinese = 1 ∣ Y = 0) = (1 + 1)/(1 + 2) = 2/3
P(Beijing = 1 ∣ Y = 0) = (0 + 1)/(1 + 2) = 1/3
P(Shanghai = 1 ∣ Y = 0) = (0 + 1)/(1 + 2) = 1/3
P(Macao = 1 ∣ Y = 0) = (0 + 1)/(1 + 2) = 1/3
P(Tokyo = 1 ∣ Y = 0) = (1 + 1)/(1 + 2) = 2/3
P(Japan = 1 ∣ Y = 0) = (1 + 1)/(1 + 2) = 2/3
Step 3: Compute Posterior Probabilities
In Bernoulli Naive Bayes, the likelihood is the product of the probabilities of the observed feature values (1 or 0) for the query point d5 = {Chinese = 1, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1}:
P(Y = 1 ∣ d5) ∝ P(Y = 1) · P(Chinese = 1 ∣ Y = 1) · P(Beijing = 0 ∣ Y = 1) · P(Shanghai = 0 ∣ Y = 1) · P(Macao = 0 ∣ Y = 1) · P(Tokyo = 1 ∣ Y = 1) · P(Japan = 1 ∣ Y = 1) = (3/4)·(4/5)·(3/5)·(3/5)·(3/5)·(1/5)·(1/5) ≈ 0.0052
P(Y = 0 ∣ d5) ∝ P(Y = 0) · P(Chinese = 1 ∣ Y = 0) · P(Beijing = 0 ∣ Y = 0) · P(Shanghai = 0 ∣ Y = 0) · P(Macao = 0 ∣ Y = 0) · P(Tokyo = 1 ∣ Y = 0) · P(Japan = 1 ∣ Y = 0) = (1/4)·(2/3)·(2/3)·(2/3)·(2/3)·(2/3)·(2/3) ≈ 0.0220
Step 4: Classify the Query Point
Since P(Y = 0 ∣ d5) ≈ 0.0220 > P(Y = 1 ∣ d5) ≈ 0.0052, we predict that d5 does not contain "China" (i.e., Y = 0).
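As a quick check, the arithmetic above can be reproduced in a few lines of Python (a minimal sketch that simply re-multiplies the smoothed probabilities from Step 2 for d5):

# Smoothed probabilities P(word = 1 | class) from Step 2.
p1 = {'Chinese': 4/5, 'Beijing': 2/5, 'Shanghai': 2/5, 'Macao': 2/5, 'Tokyo': 1/5, 'Japan': 1/5}  # Y = 1
p0 = {'Chinese': 2/3, 'Beijing': 1/3, 'Shanghai': 1/3, 'Macao': 1/3, 'Tokyo': 2/3, 'Japan': 2/3}  # Y = 0
prior = {1: 3/4, 0: 1/4}

# Binary features of the query document d5.
d5 = {'Chinese': 1, 'Beijing': 0, 'Shanghai': 0, 'Macao': 0, 'Tokyo': 1, 'Japan': 1}

def score(params, prior_prob):
    s = prior_prob
    for word, present in d5.items():
        # Use P(word = 1 | class) if the word is present, otherwise P(word = 0 | class).
        s *= params[word] if present else 1 - params[word]
    return s

print(score(p1, prior[1]))  # ≈ 0.0052
print(score(p0, prior[0]))  # ≈ 0.0220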
Why Laplace Smoothing Is Necessary
Without Laplace smoothing (α = 0), if a feature value (e.g., Tokyo = 1) is not observed in a class, the probability would be zero, making the entire product zero. Laplace smoothing adds α to the numerator and α⋅2 to the denominator, ensuring all probabilities are non-zero. This prevents the model from assigning zero probability to unseen feature combinations, improving its robustness, especially with small datasets .
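For comparison, here is a minimal scikit-learn sketch of the same worked example; alpha=1.0 is the Laplace smoothing term described above, and the binary table from earlier is entered directly as the training matrix:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary bag-of-words matrix: columns = [Chinese, Beijing, Shanghai, Macao, Tokyo, Japan].
X_train = np.array([[1, 1, 0, 0, 0, 0],   # d1
                    [1, 0, 1, 0, 0, 0],   # d2
                    [1, 0, 0, 1, 0, 0],   # d3
                    [1, 0, 0, 0, 1, 1]])  # d4
y_train = np.array([1, 1, 1, 0])
d5 = np.array([[1, 0, 0, 0, 1, 1]])

clf = BernoulliNB(alpha=1.0)       # alpha = 1 is the Laplace smoothing term
clf.fit(X_train, y_train)

print(clf.predict(d5))             # expected: [0], matching the manual calculation
print(clf.predict_proba(d5))       # posteriors normalized to sum to 1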
MULTINOMIAL NAÏVE BAYES
Introduction to Multinomial Naive Bayes
Multinomial Naive Bayes is a variant of the Naive Bayes algorithm specifically designed for classification tasks involving discrete features, such as text classification where the features represent word counts or frequencies within documents. It assumes that the features follow a multinomial distribution given the class label, making it ideal for datasets where the input features are counts or frequencies of categorical variables (e.g., the frequency of words in a document). This makes Multinomial Naive Bayes particularly effective for tasks like spam detection, sentiment analysis, or document classification.
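In practice, the count features are usually produced by a vectorizer. A minimal sketch with a made-up toy spam corpus (the sentences and labels are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 1 = spam, 0 = not spam.
docs = ["free prize money now",
        "meeting schedule attached",
        "win money free offer",
        "project status update"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()            # each document becomes a vector of word counts
X = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)            # Laplace smoothing
clf.fit(X, labels)

test = vectorizer.transform(["free money offer now"])
print(clf.predict(test))                  # expected: [1] (spam)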
Dataset and Problem Setup
Consider a dataset for text classification, where the goal is to predict whether a document contains references to "China" (class Y = 1) or not (class Y = 0). The dataset includes four training documents and one test document, with features representing the frequency of specific words in the vocabulary {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}. The original data is:

Document     Text                                   Contains "China"?
d1           Chinese Beijing Chinese                Yes
d2           Chinese Chinese Shanghai               Yes
d3           Chinese Macao                          Yes
d4           Tokyo Japan Chinese                    No
d5 (test)    Chinese Chinese Chinese Tokyo Japan    ?
First, we transform this dataset into a bag-of-words representation, where each document is represented by the frequency of words from the vocabulary:

Document     Chinese  Beijing  Shanghai  Macao  Tokyo  Japan  Y (China?)
d1           2        1        0         0      0      0      1
d2           2        0        1         0      0      0      1
d3           1        0        0         1      0      0      1
d4           1        0        0         0      1      1      0
d5 (test)    3        0        0         0      1      1      ?

Our task is to predict whether the test document d5 (with word frequencies {Chinese = 3, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1}) contains references to "China" (Y = 1) or not (Y = 0).

Step 1: Count Words in Each Class
First, calculate the total word counts in each class:
Class Y = 1 (Yes, 3 documents):
Total words = (2+1+0+0+0+0) + (2+0+1+0+0+0) + (1+0+0+1+0+0) = 8
Word counts:
Chinese: 5 (2 in d1, 2 in d2, 1 in d3).
Beijing: 1 (1 in d1).
Shanghai: 1 (1 in d2).
Macao: 1 (1 in d3).
Tokyo: 0.
Japan: 0.
Class Y = 0 (No, 1 document):
Total words = (1+0+0+0+1+1) = 3.
Word counts:
Chinese: 1 (1 in d4).
Beijing: 0.
Shanghai: 0.
Macao: 0.
Tokyo: 1 (1 in d4).
Japan: 1 (1 in d4).
Step 2: Compute Prior and Conditional Probabilities with Laplace Smoothing (α = 1, V = 6)
The class priors are P(Y = 1) = 3/4 and P(Y = 0) = 1/4. Applying Laplace smoothing, P(wᵢ ∣ Y = c) = (count(wᵢ, Y = c) + α) / (total words in class c + α·V), where V = 6 is the vocabulary size:
Class Y = 1 (8 words in total):
P(Chinese ∣ Y = 1) = (5 + 1)/(8 + 6) = 6/14 = 3/7
P(Beijing ∣ Y = 1) = (1 + 1)/(8 + 6) = 2/14 = 1/7
P(Shanghai ∣ Y = 1) = (1 + 1)/(8 + 6) = 2/14 = 1/7
P(Macao ∣ Y = 1) = (1 + 1)/(8 + 6) = 2/14 = 1/7
P(Tokyo ∣ Y = 1) = (0 + 1)/(8 + 6) = 1/14
P(Japan ∣ Y = 1) = (0 + 1)/(8 + 6) = 1/14
Class Y = 0 (3 words in total):
P(Chinese ∣ Y = 0) = (1 + 1)/(3 + 6) = 2/9
P(Beijing ∣ Y = 0) = (0 + 1)/(3 + 6) = 1/9
P(Shanghai ∣ Y = 0) = (0 + 1)/(3 + 6) = 1/9
P(Macao ∣ Y = 0) = (0 + 1)/(3 + 6) = 1/9
P(Tokyo ∣ Y = 0) = (1 + 1)/(3 + 6) = 2/9
P(Japan ∣ Y = 0) = (1 + 1)/(3 + 6) = 2/9
Step 3: Compute Posterior Probabilities
For Multinomial Naive Bayes, the likelihood for a document is the product of the conditional probabilities raised to the power of the word frequencies. For d5 = {Chinese = 3, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1}:
P(Y = 1 ∣ d5) ∝ P(Y = 1) · P(Chinese ∣ Y = 1)³ · P(Tokyo ∣ Y = 1) · P(Japan ∣ Y = 1) = (3/4)·(3/7)³·(1/14)·(1/14) ≈ 0.0003
P(Y = 0 ∣ d5) ∝ P(Y = 0) · P(Chinese ∣ Y = 0)³ · P(Tokyo ∣ Y = 0) · P(Japan ∣ Y = 0) = (1/4)·(2/9)³·(2/9)·(2/9) ≈ 0.0001
Step 4: Classify the Query Point
Since P(Y = 1 ∣ d5) ≈ 0.0003 > P(Y = 0 ∣ d5) ≈ 0.0001, we predict that d5 contains references to "China" (i.e., Y = 1).
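As before, the arithmetic can be checked directly (a small sketch that re-multiplies the smoothed probabilities from Step 2, each raised to the word's frequency in d5):

from math import prod

# Smoothed probabilities P(word | class) from Step 2.
p1 = {'Chinese': 3/7, 'Beijing': 1/7, 'Shanghai': 1/7, 'Macao': 1/7, 'Tokyo': 1/14, 'Japan': 1/14}  # Y = 1
p0 = {'Chinese': 2/9, 'Beijing': 1/9, 'Shanghai': 1/9, 'Macao': 1/9, 'Tokyo': 2/9, 'Japan': 2/9}    # Y = 0
prior = {1: 3/4, 0: 1/4}

# Word frequencies in the query document d5.
d5 = {'Chinese': 3, 'Beijing': 0, 'Shanghai': 0, 'Macao': 0, 'Tokyo': 1, 'Japan': 1}

# Each conditional probability is raised to the power of the word's count in d5.
score_1 = prior[1] * prod(p1[w] ** c for w, c in d5.items())
score_0 = prior[0] * prod(p0[w] ** c for w, c in d5.items())
print(round(score_1, 4), round(score_0, 4))  # ≈ 0.0003 and ≈ 0.0001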
Why Laplace Smoothing Is Necessary
Without Laplace smoothing (α = 0), if a word (e.g., Tokyo) does not appear in a class, the probability P(Tokyo ∣ Y = 1) would be zero, making the entire product zero for any document containing that word. Laplace smoothing adds α to the numerator and α⋅V to the denominator, ensuring all probabilities are non-zero, which allows the model to handle unseen words effectively .
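And a minimal scikit-learn sketch of the same example, with alpha=1.0 providing the Laplace smoothing discussed above:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Word-frequency matrix: columns = [Chinese, Beijing, Shanghai, Macao, Tokyo, Japan].
X_train = np.array([[2, 1, 0, 0, 0, 0],   # d1
                    [2, 0, 1, 0, 0, 0],   # d2
                    [1, 0, 0, 1, 0, 0],   # d3
                    [1, 0, 0, 0, 1, 1]])  # d4
y_train = np.array([1, 1, 1, 0])
d5 = np.array([[3, 0, 0, 0, 1, 1]])

clf = MultinomialNB(alpha=1.0)     # alpha = 1 is the Laplace smoothing term
clf.fit(X_train, y_train)

print(clf.predict(d5))             # expected: [1], matching the manual calculation
print(clf.predict_proba(d5))       # posteriors normalized to sum to 1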
COMPLEMENT NAÏVE BAYES
Introduction to Complement Naive Bayes
Complement Naive Bayes (CNB) is a variant of the Naive Bayes algorithm designed to improve classification performance, especially in datasets with imbalanced class distributions. Unlike standard Naive Bayes (e.g., Multinomial or Bernoulli), which estimates probabilities based on the frequency of features within each class, CNB computes probabilities using the complement of each class. This approach reweights the contributions of features, giving more emphasis to those that are underrepresented in the target class, thereby addressing the bias toward the majority class often seen in traditional Naive Bayes.
CNB is particularly useful for text classification tasks, such as sentiment analysis or spam detection, where the dataset may have a dominant class (e.g., non-spam emails outnumbering spam emails). By focusing on the complement, CNB reduces the impact of frequent but less discriminative features and improves the model’s ability to handle skewed data .


Advantages of Complement Naive Bayes
Handles Imbalanced Data: By focusing on the complement, CNB reduces bias toward the majority class, making it more effective than standard Naive Bayes for imbalanced datasets.
Robust to Feature Frequency: It downweights frequent but less discriminative features, improving classification accuracy.
Efficient for Text: CNB is particularly well-suited for text classification tasks where word frequencies vary widely across classes .
Example
Dataset:
Suppose we have a text classification dataset to predict whether a document contains "Positive" sentiment (class 1) or "Negative" sentiment (class 0). The dataset is imbalanced, with 3 "Positive" documents and 1 "Negative" document. The features are binary indicators of word presence (1 or 0) for the words {Happy, Sad, Good, Bad}.
Query Document:
Predict the sentiment of a new document with features {Happy = 1, Sad = 0, Good = 0, Bad = 1}.


Practical Considerations
Imbalanced Datasets: CNB is most effective when the class distribution is skewed.
Feature Selection: Removing irrelevant or uninformative features beforehand can further improve CNB's accuracy.
Implementation: CNB is available in scikit-learn as ComplementNB.
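A minimal scikit-learn sketch of this example. The individual training documents are not listed above, so the feature matrix below is hypothetical; it only matches the stated class balance (3 positive, 1 negative) and the vocabulary {Happy, Sad, Good, Bad}:

import numpy as np
from sklearn.naive_bayes import ComplementNB

# Hypothetical binary feature matrix: columns = [Happy, Sad, Good, Bad].
X_train = np.array([[1, 0, 1, 0],   # Positive
                    [1, 0, 0, 0],   # Positive
                    [0, 0, 1, 0],   # Positive
                    [0, 1, 0, 1]])  # Negative
y_train = np.array([1, 1, 1, 0])

query = np.array([[1, 0, 0, 1]])    # {Happy = 1, Sad = 0, Good = 0, Bad = 1}

clf = ComplementNB(alpha=1.0)
clf.fit(X_train, y_train)
print(clf.predict(query))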
OUT OF CORE NAÏVE BAYES
Introduction to Out-of-Core Learning
Out-of-core learning refers to a machine learning approach where the training data is too large to fit into the system’s RAM (Random Access Memory) at once. This is a common challenge with large datasets, such as those exceeding 3 GB or 4 GB, which can overwhelm the memory capacity of a typical system. Traditional machine learning algorithms, including Naive Bayes, often assume that the entire dataset can be loaded into memory during training (e.g., when calling the fit method with features X and labels y ). However, when the dataset size exceeds the available RAM, this approach becomes infeasible, leading to memory errors or significant performance degradation.
Out-of-core Naive Bayes addresses this issue by processing the data in smaller, manageable chunks rather than loading the entire dataset at once. This allows the model to train incrementally, updating its parameters as each chunk is processed, and enables the handling of very large datasets efficiently .
How Out-of-Core Naive Bayes Works
To implement out-of-core learning with Naive Bayes, the dataset is divided into smaller chunks that can fit into memory. For example, suppose we have a 50 GB dataset, and our system’s RAM can handle chunks of 5 GB at a time. We can split the dataset into 10 chunks, each of 5 GB. The training process proceeds as follows:
Initialize the Model: Start with an empty Naive Bayes model (e.g., Multinomial, Bernoulli, or Gaussian, depending on the data type).
Process Chunks Sequentially:
Load the first 5 GB chunk into memory.
Use this chunk to partially train the model by updating the necessary statistics (e.g., prior probabilities P(y) and conditional probabilities P(xᵢ ∣ y)).
Discard the chunk from memory and load the next 5 GB chunk.
Update the model’s statistics with the new chunk, aggregating the counts or parameters from the previous chunks.
Repeat this process until all 10 chunks have been processed.
Finalize the Model: After processing all chunks, the model’s parameters (e.g., probabilities) are computed based on the aggregated statistics from all chunks.
Naive Bayes is particularly well-suited for out-of-core learning because its training process involves calculating probabilities that can be updated incrementally. For example:
Prior Probabilities: P(y) is the fraction of instances in each class. With each chunk, we can count the number of instances per class and accumulate these counts across chunks.
Conditional Probabilities: For categorical features (e.g., in Multinomial or Bernoulli Naive Bayes), P(xᵢ ∣ y) is based on frequency counts, which can be updated additively. For continuous features (e.g., in Gaussian Naive Bayes), the mean and variance can be updated incrementally (see the sketch below).
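Here is a minimal sketch of that idea using scikit-learn's partial_fit. The get_chunks helper is hypothetical; it stands in for whatever code reads successive pieces of the large dataset from disk:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def get_chunks(n_chunks=10, chunk_size=1_000, n_features=50, seed=0):
    # Hypothetical chunk loader: here it yields random count data for illustration,
    # but in practice it would read successive pieces of a large file from disk.
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.integers(0, 5, size=(chunk_size, n_features))
        y = rng.integers(0, 2, size=chunk_size)
        yield X, y

clf = MultinomialNB(alpha=1.0)
all_classes = np.array([0, 1])        # partial_fit needs the full set of labels up front

for X_chunk, y_chunk in get_chunks():
    # Each call adds this chunk's class counts and feature counts to the running
    # totals; previously processed chunks do not need to stay in memory.
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)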
Model Updates and Deployment
Once the model is trained on the entire dataset using the chunk-based approach, it can be deployed for inference (e.g., on a server). When new data arrives, the model can be updated incrementally without retraining from scratch:
Daily Updates: If new data is received daily, load the new data (or process it in chunks if it is large) and update the model’s statistics (e.g., increment the counts for prior and conditional probabilities).
Making Predictions: Use the updated model to make predictions on new data points as needed.
This incremental update capability ensures that the model remains current without requiring the entire dataset to be reloaded into memory, making out-of-core Naive Bayes efficient for large-scale, dynamic applications.
Practical Considerations
Chunk Size: The chunk size (e.g., 5 GB) should be chosen based on the available RAM and system constraints. Smaller chunks reduce memory usage but may increase I/O overhead due to frequent disk access.
Data Format: Store the dataset in a format that supports efficient chunk-based reading, such as CSV, HDF5, or parquet files.
Implementation: Libraries like scikit-learn provide support for out-of-core learning in Naive Bayes through methods like partial_fit, which allow incremental training on chunks of data.
Performance Trade-offs: While out-of-core learning enables handling large datasets, it may be slower than in-memory training due to disk I/O operations. However, this trade-off is acceptable when memory constraints make in-memory training impossible.
Example Workflow
Suppose we have a 50 GB dataset for text classification (e.g., spam detection) with millions of documents and a vocabulary of thousands of words, using Multinomial Naive Bayes:
Split the dataset into 10 chunks of 5 GB each.
Initialize a Multinomial Naive Bayes model.
For each chunk:
Load the chunk into memory.
Update the word counts for each class (spam or not spam) and the total document counts.
Discard the chunk and load the next one.
After processing all chunks, compute the final probabilities (e.g., P(spam), P(wordᵢ ∣ spam)).
Deploy the model on a server.
When new emails arrive daily, load the new data, update the counts, and recompute the probabilities to keep the model up-to-date.
Use the updated model to classify incoming emails as spam or not.
This approach ensures that Naive Bayes can scale to very large datasets while maintaining the ability to adapt to new data over time.
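A rough end-to-end sketch of this workflow, assuming the data lives in a CSV with text and label columns (the file name, column names, chunk size, and hashing dimensionality are all assumptions for illustration). HashingVectorizer is used because it needs no in-memory vocabulary, and alternate_sign=False keeps the features non-negative as MultinomialNB requires:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

CSV_PATH = "emails.csv"        # hypothetical file with "text" and "label" (1 = spam) columns
CHUNK_ROWS = 100_000           # tune so one chunk fits comfortably in RAM

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB(alpha=1.0)
classes = np.array([0, 1])

# Initial out-of-core training pass over the large file, chunk by chunk.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_ROWS):
    X = vectorizer.transform(chunk["text"])
    clf.partial_fit(X, chunk["label"], classes=classes)

# Daily update: fold a new batch of labelled emails into the same model.
def update_with_new_data(new_df):
    X_new = vectorizer.transform(new_df["text"])
    clf.partial_fit(X_new, new_df["label"])   # classes is only required on the first call

# Classify incoming emails with the current model.
def classify(texts):
    return clf.predict(vectorizer.transform(texts))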

