This section outlines the notation used throughout this document.

  • $D_i$ - Document $i$.
  • $F_j$ - Feature $j$ (word in a text document).
  • $F_{i,j}$ - Feature $j$ in Document $i$.
  • $C$ - Class label.
  • $C_i$ - Class label for Document $i$.
  • $\{c_1, \ldots , c_n\}$ - Values representing specific labels.
  • $P(C=c_i| \ldots )$ - The conditional probability that the random variable $C$ takes the specific value $c_i$.
  • $P(c_i| \ldots )$ - Shorthand for the conditional probability above.
  • $P(C| \ldots )$ - The conditional probability distribution over all values of $C$.
  • $P(C_i| \ldots )$ - The conditional probability distribution over class labels for Document $i$.

Derivation of Naive Bayes for Classification

First, what we're looking for is $\hat c = argmax_c P(C_i = c| \underline{D_i})$, where $\underline{D_i}$ is the given feature vector for document $i$. In other words, we have a document and its feature vector (that is, the words in the document), and we want to know the probability that the random variable $C_i$ takes on a specific label given this feature vector. In English: what is the probability that this document belongs to class $c$?

Now that we know what we want, here is the derivation:

Using Bayes' Theorem:

$\hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}$

Note that with $a$ a positive constant (here $P(\underline{D_i})$, which does not depend on $c$):

$argmax_x \frac{f(x)}{a} = argmax_x f(x)$


$argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)$

(Note: See below for explanation of Bayesian Networks and the naive bayes assumption, which we make at this point).

By the multiplication rule,

$P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},C_i = c)$

Since $\underline{D_i}$ is just the feature vector $(F_{i,1}, \ldots, F_{i,n})$, the naive Bayes assumption (that the features are conditionally independent given the class) lets us factor the joint probability:

$P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,1}|C_i) \cdots P(F_{i,n}|C_i)$

Now, the product of conditionals on the right-hand side of this last equation can be written in shorthand as

$\prod_{j=1}^n P(F_{i,j}|C_i = c)$

so we now have

$P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$

leading to

$\hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$
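The decision rule above can be sketched in Python. The class priors and per-word likelihoods below are made-up toy numbers, not estimates from any real corpus:

```python
# Sketch of the Naive Bayes decision rule derived above.
# priors approximates P(C = c); likelihoods approximates P(F_j | C = c).
# All numbers are hypothetical.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham":  {"free": 0.02, "meeting": 0.20},
}

def classify(words):
    """Return argmax_c P(C = c) * prod_j P(F_j | C = c)."""
    best_c, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for w in words:
            # Tiny floor for words unseen in training, to avoid zeroing out.
            p *= likelihoods[c].get(w, 1e-6)
        if p > best_p:
            best_c, best_p = c, p
    return best_c

print(classify(["free"]))     # "spam" under these toy numbers
print(classify(["meeting"]))  # "ham" under these toy numbers
```

The unseen-word floor stands in for smoothing, which a real implementation would handle more carefully.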

The product of many small probabilities quickly underflows standard floating-point representation, so it is convenient to work with logs; since $\log$ is monotonic, the argmax is unchanged.

$\hat c = argmax_c \left[ \log P(C_i = c) + \sum_{j=1}^n \log P(F_{i,j}|C_i = c) \right]$
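A quick illustration of the underflow problem the log transform avoids; the probability value and the number of factors are arbitrary:

```python
import math

# Multiplying many small likelihoods collapses to 0.0 in floating point,
# while the equivalent sum of logs stays comfortably representable.
p = 1e-5
product = 1.0
log_sum = 0.0
for _ in range(100):
    product *= p
    log_sum += math.log(p)

print(product)   # 0.0 -- underflowed (true value is 1e-500)
print(log_sum)   # about -1151.3, still fine
```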

The MLE (maximum likelihood estimate) for the first term, $P(C_i = c)$, is the number of documents with label $c$ divided by the total number of documents in the training set.
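A minimal sketch of this prior estimate, using a hypothetical list of training labels:

```python
from collections import Counter

# MLE of the class prior P(C = c): documents labeled c / total documents.
# The label list is a hypothetical training set.
labels = ["spam", "ham", "ham", "spam", "ham"]
counts = Counter(labels)
class_priors = {c: n / len(labels) for c, n in counts.items()}
print(class_priors)   # {'spam': 0.4, 'ham': 0.6}
```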

The MLE for the term in the summation …

Probability Theory

Bayes' Theorem relies on some probability theory, which is covered here.

Sample Spaces and Events

An experiment is any action or process whose outcome is uncertain. The set of all possible outcomes of an experiment is called its sample space; an event is a set of outcomes drawn from the sample space.

Sample Space

Consider the experiment of flipping a coin. This experiment is subject to uncertainty, because the outcome could be one of two possibilities. Formally, the sample space $S$ for this experiment is $S = \{H,T\}$; the only possible outcomes are a head or tail facing upward.


An event is an outcome of (or a result of) an experiment. In the coin-flipping experiment above, the sample space $S = \{H,T\}$ contains two simple events, $H$ and $T$: heads and tails respectively.
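As a quick sketch, the sample space and an event can be modeled as Python sets:

```python
# A sample space as a set of outcomes; an event is a subset of it.
S = {"H", "T"}      # sample space for one coin flip
heads = {"H"}       # the event "heads"
print(heads <= S)   # True: every event is a subset of the sample space
```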

Some Set Theory

Conditional Probability

Conditional probability allows us to use prior knowledge about how one random variable affects another. For example, a lecturer, Stan, is late to class 5% of the time when the weather is good, but 25% of the time when the weather is bad. Let $A$ be the event that Stan is late and $B$ the event of inclement weather. Then $P(A|B)$ is read as the probability that Stan will be late given that the weather is bad; that is, the probability of $A$ given $B$.

In this case, $P(A|B) = 0.25$ and $P(A|B^{'}) = 0.05$, where $B^{'}$ is the complement of $B$; that is, the weather is not bad.

Formally, conditional probability is defined as follows:

:$P(A|B) = \frac{P(A \cap B)}{P(B)} \!$
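A small numeric sketch of this definition, reusing the Stan example; the 20% chance of bad weather and the joint probability are assumed numbers, not given in the text:

```python
# Conditional probability from the definition P(A|B) = P(A and B) / P(B).
# Hypothetically, suppose bad weather occurs 20% of the time and
# "late and bad weather" occurs 5% of the time.
p_B = 0.20          # assumed P(bad weather)
p_A_and_B = 0.05    # assumed P(late and bad weather)
p_A_given_B = p_A_and_B / p_B
print(round(p_A_given_B, 2))  # 0.25, matching Stan's lateness in bad weather
```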

Let $A_i$ be the $i^{th}$ class that we are interested in, and $B$ be a feature of a document. The formula is then interpreted as follows: the probability of class $A_i$ given feature $B$ is the probability that $A_i$ and $B$ occur together, divided by the probability of $B$.

The Law of Total Probability

Let $A_1, \ldots , A_k$ be mutually exclusive and exhaustive events. Then for any other event $B$, $P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)$.
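A short sketch of the law with three illustrative events $A_1, A_2, A_3$; the probabilities are arbitrary:

```python
# Law of total probability: P(B) = sum_i P(B|A_i) P(A_i),
# where the A_i are mutually exclusive and exhaustive.
priors = [0.5, 0.3, 0.2]        # P(A_1), P(A_2), P(A_3); must sum to 1
conditionals = [0.1, 0.4, 0.8]  # P(B | A_i)
p_B = sum(c * p for c, p in zip(conditionals, priors))
print(round(p_B, 2))            # 0.1*0.5 + 0.4*0.3 + 0.8*0.2 = 0.33
```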

Bayes' Theorem

:$P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} \!$

:$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)} \!$
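Applying Bayes' theorem to the Stan example: given that Stan is late, how likely is it that the weather was bad? The lateness rates come from the text; the 20% prior probability of bad weather is an assumption:

```python
# Bayes' theorem with the denominator expanded via total probability.
p_late_given_bad  = 0.25   # P(A | B), from the text
p_late_given_good = 0.05   # P(A | B'), from the text
p_bad = 0.20               # assumed prior P(B)

# P(A) = P(A|B)P(B) + P(A|B')P(B')
p_late = p_late_given_bad * p_bad + p_late_given_good * (1 - p_bad)
p_bad_given_late = p_late_given_bad * p_bad / p_late
print(round(p_bad_given_late, 3))   # 0.556
```

Knowing that Stan is late raises the probability of bad weather from the 20% prior to about 56%.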


nlp/naive-bayes.txt · Last modified: 2015/04/23 15:39 by ryancha