This assignment is designed to:

- provide practice with essential ideas for using the expectation maximization (EM) algorithm
- help you become more fluent with the terminology and the techniques
- prepare you for the next assignment dedicated to using EM for document clustering

This is primarily a mathematical homework assignment. Show your work. Be clear and concise. Where code is required, include your code. Type your assignment.

Work through these exercises as soon as possible for their instructional content. If you have general questions about the assignment, please post on the Google Group. Finish early, and earn the early bonus.

[5 points]

- I have one coin (coin 0) in my pocket: Coin 0 has probability $\lambda$ of heads.
- I toss the coin 10 times, and see the following sequence: HHTTHHHTHH (7 heads out of 10)
- What would you guess $\lambda$ to be?
- Why?

- I have three coins (0, 1, 2) in my pocket:
- Coin 0 has probability $\lambda$ of heads
- Coin 1 has probability $\beta_1$ of heads
- Coin 2 has probability $\beta_2$ of heads

- I now run several trials. For each trial I do the following:
- First I toss coin 0. The result is the value of random variable $Y$.
- If coin 0 turns up heads, then:
- I toss coin 1. The result is the value of random variable $X_1$.
- I toss coin 1 a second time yielding the value of random variable $X_2$.
- I toss coin 1 a third time yielding the value of random variable $X_3$.

- If coin 0 turns up tails, then:
- I toss coin 2. The result is the value of random variable $X_1$.
- I toss coin 2 a second time yielding the value of random variable $X_2$.
- I toss coin 2 a third time yielding the value of random variable $X_3$.

- Group the three random variables $\langle X_1,X_2,X_3 \rangle$ together as $X$.
- Note that $\langle X_1,X_2,X_3 \rangle$ represent flip tokens, by analogy with word tokens.

- Consider two scenarios:
- Scenario 1: I tell you whether coin 0 came up heads or tails, and I tell you what the outcomes from flipping coin 1 or 2 were.
- Scenario 2: I don’t tell you whether coin 0 came up heads or tails or whether coin 1 or 2 was tossed subsequently, but I do tell you what the outcomes from flipping coin 1 and/or coin 2 were.

- The problem to solve: given observed coin flips, estimate the values for $\lambda, \beta_1$, and $\beta_2$.

[5 points]

- In scenario #1, you make the following complete observations: $(H, \langle HHH \rangle),(T, \langle TTT \rangle),(H, \langle HHH \rangle),(T, \langle TTT \rangle),(H, \langle HHH \rangle)$
- Note that the notation $(H, \langle HHH \rangle)$ is short-hand for the joint event $(Y=H, X=\langle HHH \rangle) = (Y=H, X_1=H, X_2=H, X_3=H)$

- Compute the maximum likelihood estimates for $\lambda, \beta_1, \beta_2$.

- In scenario #2, you see a sequence of observations. (e.g., $\langle HHH \rangle,\langle TTT \rangle,\langle HHH \rangle,\langle TTT \rangle,\langle HHH \rangle$).
- Note that the notation $\langle HHH \rangle$ is short-hand for the event $X_1=H, X_2=H, X_3=H$

- To compute the maximum likelihood estimates for $\lambda, \beta_1, \beta_2$, you will need to employ the EM algorithm to estimate those parameters, so the following problems will help you think and work through the steps.

[20 points]

- Draw the directed graphical model for the situation described above (under “Problem Description”) consistent with scenario #2.
- Refer to the parameters collectively as $\Theta = \{\lambda, \beta_1, \beta_2\}$.
- Specify the conditional probability table – symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ – for the random variable $Y$ corresponding to the outcome for flipping coin 0: $P(Y|\Theta)$
- Specify the conditional probability table – symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ – for the random variables $X$ corresponding to the outcomes for flipping coins 1 or 2: $P(X|Y,\Theta)$

[20 points]

- What is the probability $P(X=\langle THT \rangle,Y=H|\Theta)$ symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ ?
- What is the probability $P(X=\langle THT \rangle,Y=T|\Theta)$ symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ ?
- What is the probability $P(X=\langle THT \rangle|\Theta)$ symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ ?
- What is the probability $P(Y=H|X=\langle THT \rangle, \Theta)$ symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ ?
- Hint: in each case, ask yourself what kind of query the expression represents, and use the methods with which you are already familiar.

[30 points]

- Let the partially observed data be $(\underline{e}_1, \underline{e}_2, \underline{e}_3, \underline{e}_4, \underline{e}_5) = (\langle HHH \rangle,\langle TTT \rangle,\langle HHH \rangle,\langle TTT \rangle,\langle HHH \rangle)$. (No labels)
- You will compute the maximum likelihood estimates for parameters $\lambda, \beta_1, \beta_2$ using the Expectation Maximization algorithm.
- Construct a table formatted as shown here to capture your results.
- Show your work separately. If you show the careful work for E-step #1 and M-step #1 and you are confident about how to proceed with the next iteration, then it would be reasonable to be more terse in showing work for the subsequent steps.

Iteration | $\lambda$ | $\beta_1$ | $\beta_2$ | $P(Y=H | \underline{e}_1, \Theta)$ | $P(Y=H | \underline{e}_2, \Theta)$ | $P(Y=H | \underline{e}_3, \Theta)$ | $P(Y=H | \underline{e}_4, \Theta)$ | $P(Y=H | \underline{e}_5, \Theta)$ |
---|---|---|---|---|---|---|---|---|

0 | ||||||||

1 | ||||||||

2 | ||||||||

3 |

- Let the initial values of the parameters be as follows:
- $\lambda = 0.3$
- $\beta_1 = 0.3$
- $\beta_2 = 0.6$

- Fill in row 0 using those parameter values.

- Fill in the rest of row 0 as follows:
- E-step #1:
- Compute the fractional (posterior) counts for each of the five instances in the partially observed data set. (shown in the table)
- Total up your fractional counts for each type & cluster-id pair and each cluster-id, as shown on slide #20 in Lecture #18. (not shown in the table)
- Note: Because our model approach in this assignment is token-centric and the discussion Lecture #18 is type-centric (i.e., uses type-count vectors), we need to adapt the math on the slide for our token-centric approach as follows:
- $\forall c, \forall j: C(t_j,c) = \sum_{i=1}^N [P(c|\underline{x}_i, \Theta) (\sum_{k=1}^{|x_i|} {\bf 1}(x_{i,k}==t_j))]$
- where $(\sum_{k=1}^{|x_i|} {\bf 1}(x_{i,k}==t_j))$ is the total count of flip type $t_j$ occuring in flip token vector $\underline{x}$
- where ${\bf 1}(x_{i,k}==t_j)$ denotes an indicator function (like the C ternary operator) which returns 1 when the specified condition is true and 0 when it is not.
- where $C(\cdot)$ is our notation for counting the number of occurrences of certain types of events.

- Note: Whereas the slide uses $c$ to denote a cluster id, in the context of this assignment we are using $Y=y$ to do so.

- Fill in row 1 as follows:
- M-step #1: Compute new estimates of the three parameters from the fractional counts computed in E-step #1.

- Fill in the rest of row 1 as follows:
- E-step #2:
- Compute the fractional (posterior) counts for each of the five instances in the partially observed data set using the parameters computed in M-step #1.
- Total up your fractional counts for each type & cluster-id pair and each cluster-id.

- Fill in row 2 as follows:
- M-step #2: Compute new estimates of the three parameters from the fractional counts computed in E-step #2.

- Fill in the rest of row 2 as follows:
- E-step #3:
- Compute the fractional (posterior) counts for each of the five instances in the partially observed data set using the parameters computed in M-step #2.
- Total up your fractional counts for each type & cluster-id pair and each cluster-id.

- Fill in row 3 as follows:
- M-step #3: Compute new estimates of the three parameters from the fractional counts computed in E-step #3.

- Fill in the rest of row 3 as follows:
- E-step #4:
- Compute the fractional (posterior) counts for each of the five instances in the partially observed data set using the parameters computed in M-step #3.

[5 points]

- Write an expression – symbolically in terms of $\lambda, \beta_1$, and $\beta_2$ – for the log-likelihood of the observed data.

[10 points]

- Add a new column labeled $L(\Theta)$, the log-likelihood of the observed data according to the current parameters $\Theta$, to your table from Question #5.
- Compute the log-likelihood of the observed data for each row in the table.

[5 points]

- To which cluster (mixture component) would you assign each of the observed triples? Cluster 1 (when $Y=H$) or Cluster 2 (when $Y=T$)?
- Why?

Submit a .pdf document through Learning Suite.