• 1. Gibbs Sampling on Mixture of Multinomials
    • EM alone
    • CEM
    • k-means (not model based)
    • GS
    • GS + EM starting from the MAP likelihood sample
    • Other things to try:
      • send email queries to figure out how to normalize the likelihood and to get Minimum Bayes Risk worked out, so that we can add another summary technique.
      • don't plot all metrics on same scale; perhaps differences are more visible when zoomed in.
      • work on the test for significance (related to #2)
      • see to what degree these results hold up on the other data sets
      • see to what degree these results hold up on smaller splits (approx. 5,000 docs max.), possibly with longer chains
      • validate with HBC implementation of Gibbs sampling
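As a concrete reference for item 1, a collapsed Gibbs sampler for the mixture of multinomials can be sketched as follows. This is an illustrative sketch, not our actual implementation: the function names, the symmetric Dirichlet hyperparameters `alpha` (cluster prior) and `beta` (word prior), and the data layout are all assumptions.

```python
import numpy as np
from math import lgamma

def log_pred(doc, ck, beta):
    """Dirichlet-multinomial log predictive of a doc's word counts
    given the counts ck already assigned to a cluster."""
    W = len(ck)
    tot, n = ck.sum(), doc.sum()
    lp = lgamma(tot + W * beta) - lgamma(tot + n + W * beta)
    for w in np.nonzero(doc)[0]:
        lp += lgamma(ck[w] + doc[w] + beta) - lgamma(ck[w] + beta)
    return lp

def gibbs_sweep(X, z, counts, sizes, alpha, beta, rng):
    """One collapsed Gibbs sweep. X: (D, W) integer count matrix;
    z: current assignments; counts/sizes: per-cluster sufficient stats."""
    K = len(sizes)
    for d in range(len(X)):
        # remove document d from its current cluster
        k_old = z[d]
        sizes[k_old] -= 1
        counts[k_old] -= X[d]
        # resample its cluster from the conditional posterior
        logp = np.array([np.log(sizes[k] + alpha) + log_pred(X[d], counts[k], beta)
                         for k in range(K)])
        p = np.exp(logp - logp.max())
        p /= p.sum()
        k_new = rng.choice(K, p=p)
        # add it back under the new assignment
        z[d] = k_new
        sizes[k_new] += 1
        counts[k_new] += X[d]
    return z
```

Storing `z` after each sweep gives the sample chain that the summarization techniques below (MAP sample, last sample, per-document marginal modes) would operate on.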

Worth doing, perhaps for a subsequent paper:

  • compare with Gibbs + EM
  • compare with Variational EM using HBC

  • 2. Metric Survey, Algorithm survey (see below)
    • Comparing clustering metrics and algorithms
    • Correlations among metrics
    • Across many data sets


  • 3. Publicizing as new data set
    • web data
    • human labeled
    • plentiful


  • 4. Pick best/fastest clustering algorithm
    • experiments on desktop PC data
    • experiments on “virtual user”
      • 1 Enron user's email
      • 20 Newsgroups specific to that user's interests in the timeframe of the Enron corpus (harvest Google groups or Usenet corpus)
      • use user's interests to crawl web data
      • IM? PDFs? .doc, .ppt, .xls?

Current Experiments

  • Reuters Reduced: goal to assess clustering metrics in comparison to results from super-reduced
  • XX On Reuters super-reduced: find mode of joint posterior, compare with modes of per-document marginal posteriors
    • In one run consisting of 11,000 samples there were no duplicate samples from the joint.
  • Pick k using EM and held-out data set.
  • Implement the V-measure of Rosenberg & Hirschberg, based on their EMNLP 2007 paper.
  • Implement BIC?
  • Summarize clusters from Gibbs sampling with following techniques (use error bars to reflect confidence intervals around metrics):
    • modes of per-document marginal posteriors
    • last sample
    • MAP sample
    • random sample
  • Identifiability / Label-Switching Drift Map
    • diagnose using cluster vs. cluster KL-div. heat map
    • see Steyvers & Griffiths 2006 Fig. 8 for idea
    • try at different lags within the same run.
  • Scalability:
    • Reuters Full: can we scale?
    • Reuters All: can we really scale?
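For the label-switching heat map diagnostic above, the core computation is a pairwise KL-divergence matrix between the clusters' word distributions at two samples (or two lags of the same run). A numpy sketch, where `kl_matrix` is a hypothetical helper name:

```python
import numpy as np

def kl_matrix(P, Q, eps=1e-12):
    """Pairwise KL(P_i || Q_j) between rows of two stacks of word
    distributions. P, Q: (K, W) arrays whose rows sum to 1.
    Returns a (K, K) matrix suitable for a heat map."""
    P = np.clip(P, eps, None)
    Q = np.clip(Q, eps, None)
    # KL(p || q) = sum p log p - sum p log q
    ent = np.sum(P * np.log(P), axis=1)   # (K,) negative entropies
    cross = P @ np.log(Q).T               # (K, K) cross terms
    return ent[:, None] - cross
```

A near-diagonal heat map (after permuting columns by row-wise argmin) suggests the labels stayed put between lags; off-diagonal minima indicate label switching, as in Steyvers & Griffiths 2006 Fig. 8.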

Cluster Survey

We’re doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey, to answer one question: which clustering algorithm has all of the following properties:

  • scores high on all/most metrics
  • works well on many data sets
  • runs quickly and in reasonable space


Metric List

  • Enumerate and describe all known clustering metrics; in particular, look into
    • Cluster edit distance
    • Compactness
    • BIC?
    • V-metric
    • Q0, Q2
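For the V-metric entry, here is a from-scratch sketch of the V-measure (Rosenberg & Hirschberg, EMNLP 2007): the harmonic mean of homogeneity (1 − H(C|K)/H(C)) and completeness (1 − H(K|C)/H(K)). Function name and the `beta` weighting default are assumptions.

```python
import numpy as np
from collections import Counter

def v_measure(labels, clusters, beta=1.0):
    """V-measure between gold labels and cluster assignments."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    pc, pk = Counter(labels), Counter(clusters)
    # entropies of the two partitions
    H_C = -sum(v / n * np.log(v / n) for v in pc.values())
    H_K = -sum(v / n * np.log(v / n) for v in pk.values())
    # conditional entropies from the joint contingency counts
    H_C_given_K = -sum(v / n * np.log(v / pk[k]) for (c, k), v in joint.items())
    H_K_given_C = -sum(v / n * np.log(v / pc[c]) for (c, k), v in joint.items())
    h = 1.0 if H_C == 0 else 1.0 - H_C_given_K / H_C
    c = 1.0 if H_K == 0 else 1.0 - H_K_given_C / H_K
    if h + c == 0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)
```

A useful sanity check: a single all-encompassing cluster is perfectly complete but maximally inhomogeneous, so its V-measure is 0.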

Comparing Metrics

  • Multidimensional scaling
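One way to start on the MDS comparison: treat each metric as a point, build a dissimilarity matrix (e.g. 1 − correlation of metric scores across runs), and embed it with classical (Torgerson) MDS, which needs only numpy. This is a generic sketch, not a committed design:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: embed n points so that Euclidean
    distances approximate the (n, n) dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # double-centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # centered Gram matrix
    w, V = np.linalg.eigh(B)                # eigenvalues ascending
    idx = np.argsort(w)[::-1][:dim]         # keep the top-dim components
    L = np.sqrt(np.clip(w[idx], 0, None))
    return V[:, idx] * L
```

Plotting the resulting 2-D coordinates gives a quick visual of which metrics cluster together.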

Possible Data Sets

  • Look at new dataset crawled by Michael
  • Check PennTags for more labeled data.
  • Also for more labeled data: Library of Congress Subject Labels
  • 20 Newsgroups at \\entropy\data
  • Think about how to use Wikipedia data for clustering and classification

Coding Tasks

  • XX Debugging EM:
    • XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
    • XX Test the assumption that a seeded Random object will produce the same stream.
    • XX Verify your log distributions everywhere.
    • XX Verify that documents are getting loaded in the same order.
    • Initialize model parameters and NOT posteriors (?)
  • More efficient datum representation as vectors of counts.
    • Write Vector class (done 9/22)
    • Replace Labeled Datums with vectors in clusterer tester, clusterers
  • Rewrite code for explicit E-step, M-step division (95% done; blocked on a bug)
  • Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB.
  • Implement K-means
    • Easy K-means using TF-IDF vectors.
    • K-means using “hard” version of EM and multinomial count vectors.
    • Bisecting K-means
    • Read about constrained K-means (Wagstaff & Klein)
    • Read about Fast K-means using the Triangle inequality, UCSD
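The "easy K-means using TF-IDF vectors" item could start from plain Lloyd's algorithm; a numpy sketch, where the function name, the random-rows initialization, and the fixed iteration count are all assumptions:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's k-means on the row vectors of X (e.g. TF-IDF vectors)."""
    rng = np.random.default_rng(seed)
    # initialize centers at k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)  # (n, k)
        z = d2.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(z == j):
                centers[j] = X[z == j].mean(axis=0)
    return z, centers
```

The "hard EM" variant on multinomial count vectors would swap the Euclidean assignment step for an argmax over per-cluster multinomial log-likelihoods.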


  • Assimilate all of Michael Goulding's changes in appropriate way
  • From Michael:
    • Adjusted Rand index = -28752.55 / 1508641.19 = -0.019059
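To sanity-check figures like the one above, the adjusted Rand index is short to implement from the contingency counts; a negative value, as in Michael's number, means agreement below chance. A stdlib-only sketch (function name is an assumption):

```python
from collections import Counter
from math import comb

def adjusted_rand(labels, clusters):
    """Adjusted Rand index between two flat partitions of the same items."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    a, b = Counter(labels), Counter(clusters)
    # pair counts within cells, rows, and columns of the contingency table
    sum_ij = sum(comb(v, 2) for v in joint.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total        # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```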

Resources to Investigate

  • Check out NIPS Workshop on “Learning to Compare Examples”
  • Read the Meila and Heckerman paper (link from the CS 598R reading list).
  • Estimate Log with Taylor-series expansion for quick evaluation – ask Hal
    • Augment wiki page on “Log domain calculations” with this info.
  • Read Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
  • Contact Teg to get access to his EM + Gibbs Sampler code as another clusterer to try on our data.
  • Read Fast Clustering Papers and supplement bibliography
  • Read Bekkerman & McCallum paper from 2005 ICML on co-clustering
  • Get Griffiths & Steyvers LDA code – using Gibbs Sampling
  • Read Radford Neal Technical Report “Markov chain sampling methods for Dirichlet process mixture models” from 1998
  • Investigate Multi-view Clustering (like Multi-view Classification) – ref.: Sanjoy Dasgupta and Inderjit Dhillon.
  • “Product of Experts” – ref.: Geoff Hinton
  • Look into prior work on using graph cuts and flows to cluster documents
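The "Log domain calculations" page mentioned above typically centers on the log-sum-exp trick: factor out the maximum so the exponentials never overflow. A minimal sketch:

```python
import numpy as np

def log_sum_exp(log_xs):
    """Stable log(sum(exp(log_xs))) for a 1-D array of log values."""
    m = np.max(log_xs)
    if np.isinf(m):          # all -inf: the sum is zero, log is -inf
        return m
    # exp(log_xs - m) is at most 1, so no overflow
    return m + np.log(np.sum(np.exp(log_xs - m)))
```

This is what makes the E-step posteriors and the sampling distributions in the Gibbs sweeps computable when per-document log-likelihoods are in the hundreds or thousands.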

Ideas to Explore

  • Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters. For 611: prove convergence. Propose how to parallelize.
  • Cheeseman and Stutz's approximation to the marginal log likelihood of the data


  • select best clustering algorithm for keyphrase extraction from news
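The Gauss-Seidel-inspired EM idea above (refresh parameters after every per-document E-step instead of once per batch) might look like the following incremental-EM sweep. Everything here is an illustrative assumption, including the function name, the sufficient-statistics layout, and the add-one smoothing; it is not the formulation to be proved convergent for 611.

```python
import numpy as np

def incremental_em_sweep(X, log_pi, log_phi, resp, ss_pi, ss_phi):
    """One incremental-EM sweep over documents: after re-estimating each
    document's posterior, immediately fold the change into the sufficient
    statistics and refresh the parameters (Gauss-Seidel flavor)."""
    D, W = X.shape
    K = log_pi.shape[0]
    for d in range(D):
        # E-step for one document, in the log domain
        lp = log_pi + X[d] @ log_phi.T      # (K,) unnormalized log posterior
        lp -= lp.max()
        r = np.exp(lp)
        r /= r.sum()
        # swap the old responsibility out of the stats, the new one in
        ss_pi += r - resp[d]
        ss_phi += np.outer(r - resp[d], X[d])
        resp[d] = r
        # M-step right away, with add-one smoothing (an assumption)
        log_pi = np.log(ss_pi + 1) - np.log(ss_pi.sum() + K)
        log_phi = np.log(ss_phi + 1) - np.log(ss_phi.sum(axis=1) + W)[:, None]
    return log_pi, log_phi
```

Because each document's update touches only its own responsibility row, the sweep suggests a natural parallelization by sharding documents and periodically merging the sufficient statistics.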
nlp-private/document-clustering-and-metrics.txt · Last modified: 2015/04/22 14:58 by ryancha