


This section outlines the notation used throughout this document.

  * $D_i$ - Document $i$.
  * $F_j$ - Feature $j$ (a word in a text document).
  * $F_{i,j}$ - Feature $j$ in document $i$.
  * $C$ - Class label.
  * $C_i$ - Class label for document $i$.
  * $\{c_1, \ldots , c_n\}$ - Values representing specific labels.
  * $P(C=c_i| \ldots )$ - A conditional probability.
  * $P(c_i| \ldots )$ - Shorthand for the same conditional probability.
  * $P(C| \ldots )$ - A conditional probability distribution.
  * $P(C_i| \ldots )$ - The conditional probability distribution over document $i$'s label.

===Derivation of Naive Bayes for Classification===

First, what we're looking for is $\hat c = argmax_c P(C_i = c| \underline{D_i})$, where $\underline{D_i}$ is the feature vector for document $i$, which is given. In other words, we have a document and its feature vector (that is, the words in the document), and we want to know the probability that the random variable $C_i$ takes on a specific value or label given that feature vector. In English: what is the probability that this document belongs to class $c$?

Now that we know what we want, here is the derivation:


Using Bayes' Theorem:

$\hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}$

Note that, for any positive constant $a$:

$argmax_x \frac{f(x)}{a} = argmax_x f(x)$

Therefore:

$argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)$
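This invariance is easy to sanity-check numerically; here is a minimal sketch in which the scores are made up and `a` stands in for the constant $P(\underline{D_i})$:

```python
# Dividing every score by a positive constant does not change the argmax.
# Scores are made up for illustration; `a` plays the role of P(D_i).
scores = {"spam": 0.6, "ham": 0.3, "other": 0.1}
a = 7.5

argmax_plain = max(scores, key=lambda c: scores[c])
argmax_scaled = max(scores, key=lambda c: scores[c] / a)

print(argmax_plain, argmax_scaled)  # spam spam
```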

(Note: see below for an explanation of Bayesian networks and the naive Bayes assumption, which we make at this point.)


By the multiplication rule:

$P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},C_i = c)$

Because of the naive Bayes assumption (each feature is conditionally independent of the others given the class),

$P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,1}|C_i) \cdots P(F_{i,n}|C_i)$

Now, the second part of the right-hand side of this last equation can be written in shorthand as

$\prod_{j=1}^n P(F_{i,j}|C_i = c)$

so we now have

$P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$

leading to

$\hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$

Because a product of many small probabilities quickly underflows floating-point representation, it is useful to take logs; since $\log$ is monotonic, this does not change the argmax.

$\hat c = argmax_c \log(P(C_i = c)) + \sum_{j=1}^n \log(P(F_{i,j}|C_i = c))$
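The underflow problem is easy to demonstrate; a sketch with made-up per-feature probabilities:

```python
import math

# 2000 features, each with a small made-up conditional probability.
probs = [1e-3] * 2000

# The direct product underflows IEEE double precision to exactly 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0

# The equivalent sum of logs stays comfortably representable.
log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -13815.5
```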

The MLE (maximum likelihood estimator) for the first term, $P(C_i = c)$, is computed by dividing the number of training documents with label $c$ by the total number of documents in the training set.
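As a sketch, the prior estimate computed from a hypothetical list of training labels:

```python
from collections import Counter

# Hypothetical training labels, one per document.
labels = ["sports", "politics", "sports", "sports", "politics"]

# MLE of the class prior: count of label c over total document count.
counts = Counter(labels)
prior = {c: n / len(labels) for c, n in counts.items()}
print(prior)  # {'sports': 0.6, 'politics': 0.4}
```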

The MLE for the term in the summation ...


=====Sample Space=====

Consider the experiment of flipping a coin. This experiment is subject to uncertainty because the outcome could be either of two possibilities. Formally, the sample space $S$ for this experiment is $S = \{H,T\}$; the only possible outcomes are a head or a tail facing upward.

=====Events=====

An event is an outcome of (or a result of) an experiment. In the coin-flipping experiment above, the sample space $S = \{H,T\}$ contains two events, $H$ and $T$, heads and tails respectively.

=====Some Set Theory=====

===Conditional Probability===

Conditional probability allows us to use prior knowledge about how one random variable affects another. For example, a lecturer, Stan, is late to class 5% of the time when the weather is good; when the weather is bad, he is late 25% of the time. Let $A$ be the event that Stan is late, and $B$ the event of inclement weather. Then $P(A|B)$ is read as the probability that Stan will be late given that the weather is bad; that is, the probability of $A$ given $B$.

In this case, $P(A|B) = 0.25$ and $P(A|B^{'}) = 0.05$, where $B^{'}$ is the complement of $B$; that is, the weather is not bad.

Formally, conditional probability is defined as follows:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

This formula is interpreted as follows: the probability of $A$ given $B$ is the probability that $A$ and $B$ occur together, rescaled by the probability of $B$.

In what follows, let $A_i$ be the $i^{th}$ class that we are interested in, and $B$ be a feature of a document.
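The definition can be checked against the Stan example with made-up counts: suppose that out of 100 days, 40 have bad weather and Stan is late on 10 of those.

```python
# Made-up counts consistent with the Stan example.
total_days = 100
bad_weather_days = 40   # days in event B
late_and_bad_days = 10  # days in event A ∩ B

p_b = bad_weather_days / total_days        # P(B)
p_a_and_b = late_and_bad_days / total_days  # P(A ∩ B)

# P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.25, matching the 25% figure in the example
```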

===The Law of Total Probability===

Let $A_1, \ldots , A_k$ be mutually exclusive and exhaustive events. Then for any other event $B$, $P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)$.
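A sketch using the Stan example, where the lateness rates come from the example and the prior over weather is made up:

```python
# Mutually exclusive, exhaustive weather events with a made-up prior.
p_weather = {"bad": 0.3, "good": 0.7}
# Stan's lateness rates from the conditional probability example.
p_late_given = {"bad": 0.25, "good": 0.05}

# P(late) = sum over weather w of P(late|w) * P(w)
p_late = sum(p_late_given[w] * p_weather[w] for w in p_weather)
print(p_late)  # 0.25*0.3 + 0.05*0.7 ≈ 0.11
```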

===Bayes' Theorem===

$P(A|B) = \frac{P(B | A)\, P(A)}{P(B)}$

$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)}$
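Putting the pieces together on the Stan example (the lateness rates come from that example; the weather prior is made up), the second form of Bayes' Theorem gives the probability that the weather is bad given that Stan is late:

```python
p_bad, p_good = 0.3, 0.7              # made-up prior over weather
p_late_bad, p_late_good = 0.25, 0.05  # lateness rates from the example

# Bayes' Theorem with the denominator expanded by total probability.
numerator = p_late_bad * p_bad
denominator = numerator + p_late_good * p_good
p_bad_given_late = numerator / denominator
print(round(p_bad_given_late, 3))  # 0.682
```

Note that observing a single "feature" (Stan being late) more than doubles the probability of bad weather relative to its prior of 0.3, which is exactly the kind of update the naive Bayes classifier performs for each word in a document.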
