To compute the ARI, start by building the contingency table (similar to a confusion matrix) for the two clusterings. Say our data is the set of items (e.g., documents) $\{A, B, C, D, E, F\}$, that the gold-standard (GS) clustering is $(\{A, D\}, \{B,C\}, \{E,F\})$---where $\{A,D\}$ is the first cluster, $\{B,C\}$ is the second, and so on---and that the EM clustering is $(\{A, B\}, \{E,F\}, \{C,D\})$. The contingency table is then filled in by calculating the size of the intersection of each EM cluster with each GS cluster:


^ EM' 3 | $|\emptyset| = 0$ | $|\{C,D\}| = 2$ | $|\emptyset| = 0$ | 2 |
^ Column Sums | 2 | 2 | 2 | 6 |
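
The table-filling step can also be sketched in a few lines of Python. This is only an illustrative sketch: the function name ''contingency_table'' and the variables ''gs'' and ''em'' are hypothetical, and the clusters are assumed to be plain Python sets. It uses the GS and EM clusterings given at the start of this section, with one row per EM cluster and one column per GS cluster.

```python
from typing import Hashable, List, Sequence, Set


def contingency_table(
    em: Sequence[Set[Hashable]], gs: Sequence[Set[Hashable]]
) -> List[List[int]]:
    """Return a table whose cell (i, j) is the size of EM_i intersected with GS_j."""
    return [[len(e & g) for g in gs] for e in em]


# Example clusterings from the start of this section (assumed to be sets of labels).
gs = [{"A", "D"}, {"B", "C"}, {"E", "F"}]
em = [{"A", "B"}, {"E", "F"}, {"C", "D"}]

table = contingency_table(em, gs)

# Marginals: one sum per EM cluster (row) and one per GS cluster (column).
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
total = sum(row_sums)

for i, (row, row_sum) in enumerate(zip(table, row_sums), start=1):
    print(f"EM {i}:", row, "| row sum:", row_sum)
print("Column sums:", col_sums, "| total:", total)
```

Because both clusterings partition the same six items, the row sums and the column sums each add up to the total number of items, which is a quick sanity check on any table built this way.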

=== Acknowledgements ===

Thanks to Mike Brodie and Abraham Frandsen for help in developing this tutorial in a Google Group discussion.