Naive Bayes For Machine Learning

Naive Bayes is a family of machine learning classifiers based on Bayes' theorem, together with an assumption of independence between features. So before moving to the formula for Naive Bayes, it is important to know about Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

This tells us the probability of A happening given that B has happened. Here P(A | B) is the posterior, P(A) is the prior, P(B | A) is the likelihood and P(B) is the evidence.

Probabilistic interpretation of the Naive Bayes algorithm:

The basic Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. Let x = (x1, x2, …, xn) represent n features (independent variables) and let Ck denote the k-th of the possible classes (outcomes). The classifier assigns a probability P(Ck | x1, …, xn) to each of these classes. By Bayes' theorem,

$$P(C_k \mid x) = \frac{P(C_k)\, P(x \mid C_k)}{P(x)} = \frac{P(C_k, x)}{P(x)} = \frac{P(C_k, x_1, x_2, \ldots, x_n)}{P(x)}$$

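In practice, we are interested only in the numerator of that fraction, because the denominator does not depend on Ck and the values of the features xi are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model P(Ck, x1, …, xn). Using the chain rule,

$$P(C_k, x_1, \ldots, x_n) = P(x_1 \mid x_2, \ldots, x_n, C_k)\, P(x_2 \mid x_3, \ldots, x_n, C_k) \cdots P(x_{n-1} \mid x_n, C_k)\, P(x_n \mid C_k)\, P(C_k)$$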

Now we bring in the conditional independence assumption, that is, each feature xi is independent of every other feature given the class Ck. So,

$$P(x_i \mid x_{i+1}, \ldots, x_n, C_k) = P(x_i \mid C_k)$$

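Our joint model can then be written as

$$P(C_k \mid x_1, \ldots, x_n) \propto P(C_k, x_1, \ldots, x_n) = P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$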

Now we need to create a classifier from this model. For this, we compute the probability of the given set of inputs for every possible value of the class variable y and pick the class with the maximum probability (the maximum a posteriori, or MAP, rule):

$$\hat{y} = \arg\max_{k}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

Consider the following dataset

In the learning phase, we compute the table of likelihoods.

We also calculate the class priors P(Class-play=Yes) and P(Class-play=No).

Given a new set of weather conditions, we can classify whether we will play football or not, based on the likelihoods from the learning phase.

Suppose we have Outlook=Sunny, Temperature=Mild, Humidity=High and Wind=Weak. Based on this, we can predict whether we will play or not. With the MAP rule, we calculate the posterior probabilities.

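The MAP rule multiplies the class prior by the likelihoods. The likelihoods below are counted over 9 Yes rows and 5 No rows, so the priors are P(Class-play=Yes) = 9/14 and P(Class-play=No) = 5/14.

P(Class-play=Yes|x) ∝ P(Class-play=Yes) * P(Sunny|Class-play=Yes) * P(Mild|Class-play=Yes) * P(High|Class-play=Yes) * P(Weak|Class-play=Yes) = 9/14 * 2/9 * 4/9 * 3/9 * 6/9 = 9/14 * 0.0219 ≈ 0.0141

P(Class-play=No|x) ∝ P(Class-play=No) * P(Sunny|Class-play=No) * P(Mild|Class-play=No) * P(High|Class-play=No) * P(Weak|Class-play=No) = 5/14 * 3/5 * 2/5 * 4/5 * 2/5 = 5/14 * 0.0768 ≈ 0.0274

Since P(Class-play=No|x) is greater than P(Class-play=Yes|x), we classify the new instance as No. That means that, for the given conditions, they will not play football.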
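The weather calculation above can be reproduced in a few lines of Python. This is only a minimal sketch: the likelihoods are the fractions quoted in the example, while the priors of 9/14 and 5/14 are an assumption read off the likelihood denominators, and the names `map_scores`, `likelihoods` and `priors` are made up for illustration.

```python
from fractions import Fraction as F

# Likelihoods P(feature value | class) as quoted in the worked example.
likelihoods = {
    "Yes": [F(2, 9), F(4, 9), F(3, 9), F(6, 9)],  # Sunny, Mild, High, Weak
    "No": [F(3, 5), F(2, 5), F(4, 5), F(2, 5)],
}

# Assumption: priors estimated from 9 "Yes" and 5 "No" rows out of 14.
priors = {"Yes": F(9, 14), "No": F(5, 14)}


def map_scores(priors, likelihoods):
    """Return the unnormalised posterior P(C) * prod_i P(x_i | C) per class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for p in likelihoods[label]:
            score *= p
        scores[label] = score
    return scores


scores = map_scores(priors, likelihoods)
prediction = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers proper posterior probabilities.
total = sum(scores.values())
posteriors = {label: float(s / total) for label, s in scores.items()}
print(prediction, posteriors)
```

Using exact fractions keeps the arithmetic easy to check by hand; for real data one would normally work with floats or log probabilities instead.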

Naive Bayes On Text Data

Suppose we are building a classifier that says whether a text is about sports or not.

Let this be our training data. Our model will learn from it. Now, which tag does the sentence ‘a very close game’ belong to?

This is where Naive Bayes comes into play. We can calculate the probabilities P(Sports | a very close game) and P(Not Sports | a very close game) and determine which one is higher. Before that, we will preprocess the data.

Data preprocessing: Before applying Naive Bayes, we can preprocess the data. We can use various NLP techniques like lemmatization, stemming and removing stop words, which I covered in another blog post (https://medium.com/@arunm8489/getting-started-with-natural-language-processing-6e593e349675). Here I am just counting word frequencies, as our data is simple and small.

P(sports | a very close game) ∝ P(a|sports) * P(very|sports) * P(close|sports) * P(game|sports) * P(sports)

P(not sports | a very close game) ∝ P(a|not sports) * P(very|not sports) * P(close|not sports) * P(game|not sports) * P(not sports)

The common denominator P(a very close game) is dropped because it is the same for both classes.

First we will calculate prior probabilities.

P(sports) = 3/5

P(not sports) = 2/5

Next, calculating P(game|sports) means counting how many times the word “game” appears in the texts tagged sports, divided by the total number of words in those texts (11). Therefore, P(game | sports) = 2/11.

Similarly, P(a|sports) = 2/11.

Consider P(close | sports). The word ‘close’ does not appear in any sports text in our training data, so P(close | sports) = 0. We cannot leave it at 0, because then the whole product P(sports | a very close game) becomes 0. The solution to this problem is Laplace smoothing.

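In statistics, Laplace (additive) smoothing is a technique used to smooth categorical data. The smoothed estimate is

$$\hat{P}(x_i \mid y) = \frac{N_{x_i, y} + \alpha}{N_y + \alpha d}$$

where N_{xi,y} is the number of times the value xi occurs in class y, N_y is the total count for class y, d is the number of distinct values that xi can take and alpha is the “pseudocount”.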

In our example, d is the number of possible words (d = 14), and we take the pseudocount as 1. For example, P(close|sports) = (0 + 1)/(11 + 14) = 0.04. Then

P(sports | a very close game) ∝ P(a|sports) * P(very|sports) * P(close|sports) * P(game|sports) * P(sports) = 0.12 * 0.08 * 0.04 * 0.12 * (3/5) = 0.0000276

P(not sports | a very close game) ∝ P(a|not sports) * P(very|not sports) * P(close|not sports) * P(game|not sports) * P(not sports) = 0.0870 * 0.0435 * 0.0870 * 0.0435 * (2/5) = 0.00000572

From this, clearly P(sports | a very close game) > P(not sports | a very close game).

So the text ‘a very close game’ falls under the tag sports.
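The same text classification can be sketched in Python. The word counts below are an assumption: they are inferred from the smoothed probabilities quoted above (for example 0.12 = (2 + 1)/(11 + 14) and 0.0870 ≈ 2/(9 + 14)), not copied from the training table, and the names `smoothed` and `score` are illustrative.

```python
# Word counts per class. Assumption: inferred from the smoothed
# probabilities in the text, e.g. 0.12 = (2 + 1) / (11 + 14).
counts = {
    "sports": {"a": 2, "very": 1, "close": 0, "game": 2},
    "not sports": {"a": 1, "very": 0, "close": 1, "game": 0},
}
total_words = {"sports": 11, "not sports": 9}  # words per class (9 is inferred)
priors = {"sports": 3 / 5, "not sports": 2 / 5}
VOCAB_SIZE = 14  # d: number of distinct words in the training data
ALPHA = 1        # Laplace pseudocount


def smoothed(word, label):
    """Laplace-smoothed estimate of P(word | label)."""
    return (counts[label].get(word, 0) + ALPHA) / (
        total_words[label] + ALPHA * VOCAB_SIZE
    )


def score(sentence, label):
    """Unnormalised posterior: prior times the product of word likelihoods."""
    result = priors[label]
    for word in sentence.lower().split():
        result *= smoothed(word, label)
    return result


sentence = "a very close game"
scores = {label: score(sentence, label) for label in priors}
print(max(scores, key=scores.get), scores)
```

Because of the pseudocount, a word that never occurs in a class still gets a small non-zero probability, so a single unseen word no longer wipes out the whole product.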

Naive Bayes For Continuously Distributed Data

The method that we discussed above is for discrete data. In the case of continuous data (numerical features), we need to make some assumptions about the distribution of the values of each feature. The different Naive Bayes classifiers differ mainly in the assumptions they make about the distribution of P(xi | y). One such classifier is the Gaussian Naive Bayes classifier.

Gaussian Naive Bayes classifier

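In Gaussian Naive Bayes, we assume that the continuous values associated with each feature follow a Gaussian distribution (also known as a normal distribution) within each class.

The conditional probability is given by

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

where μ_y and σ_y² are the mean and variance of feature xi for class y, estimated from the training data.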
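As a rough illustration of the Gaussian likelihood, here is a small Python sketch. The function names `gaussian_likelihood` and `fit_gaussian` and the sample values are hypothetical, and the variance is the plain population estimate; a real implementation might use a different estimator or add a small constant for numerical stability.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Density of N(mean, variance) at x, used as P(x_i | y)."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)


def fit_gaussian(values):
    """Estimate the mean and population variance of one feature for one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# Hypothetical values of one feature observed for a single class.
observed = [4.0, 5.0, 6.0]
mean, variance = fit_gaussian(observed)

# Likelihood of a new value under the fitted per-class Gaussian.
print(gaussian_likelihood(5.5, mean, variance))
```

In a full classifier, one such mean and variance pair would be fitted for every feature and every class, and the resulting densities would take the place of the counted likelihoods used in the discrete examples.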

Other popular Naive Bayes classifiers are Multinomial Naive Bayes and Bernoulli Naive Bayes.

Other features:

  • When dealing with high-dimensional data, use log probabilities instead of raw probabilities. Multiplying many probabilities smaller than 1 produces extremely small values that can underflow floating-point precision and cause errors; summing log probabilities avoids this.
  • Naive Bayes performs best when the conditional independence assumption between features holds.
  • Naive Bayes is highly interpretable.
  • The run-time and space complexity of Naive Bayes are low.
  • It can easily overfit if we don’t apply Laplace smoothing. The best alpha value can be determined by techniques like k-fold cross-validation.
  • Naive Bayes is widely used when we have categorical features. Some applications are spam filtering and text classification.
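The point about log probabilities is easy to see with a small experiment. The sketch below uses a made-up setup of 1000 features that each contribute a likelihood of 0.01; the numbers are purely illustrative.

```python
import math

# Hypothetical setup: 1000 features, each contributing a likelihood of 0.01.
likelihoods = [0.01] * 1000

# Multiplying directly underflows: 1e-2000 is far below the smallest float.
product = 1.0
for p in likelihoods:
    product *= p

# Summing logarithms keeps the score in a comfortable numeric range.
log_score = sum(math.log(p) for p in likelihoods)

print(product)    # 0.0
print(log_score)  # roughly -4605.17
```

Since the logarithm is monotonic, picking the class with the largest sum of log probabilities gives the same answer as picking the largest product, without the underflow.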
