Fall (September - December) 2010 Schedule
Place: CS Department Conference Room (usually)
'''Schedule of Topics and Presenters'''
Potential Future Topics
More MCMC coverage:
Elkan, Classification with a new model: Dirichlet/multinomial compound (multivariate Pólya distribution)
Elkan, Clustering with the same model
Elkan, LDA with the same model
=== Clustering using von Mises-Fisher ===
Modified Mixture of Multinomials that improves classification accuracy
Conditional Structure vs Training
Klein & Manning, Conditional structure versus conditional estimation in NLP models, 2002, EMNLP-ACL
This paper separates conditional parameter estimation, which consistently raises test set accuracy on statistical NLP tasks, from conditional model structures, such as the conditional Markov model used for maximum-entropy tagging, which tend to lower accuracy. Error analysis on part-of-speech tagging shows that the actual tagging errors made by the conditionally structured model derive not only from label bias, but also from other ways in which the independence assumptions of the conditional model structure are unsuited to linguistic sequences. The paper presents new word-sense disambiguation and POS tagging experiments, and integrates apparently conflicting reports from other recent work.
Lee, Y. S., Papineni, K., Roukos, S., Emam, O., & H. Hassan, Language model based arabic word segmentation, 2003, ACL
We approximate Arabic's rich morphology by a model that a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is a state-of-the-art performance and the algorithm can be used for many highly inflected languages provided that one can create a small manually segmented corpus of the language of interest.
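The prefix*-stem-suffix* search described in the abstract can be sketched as follows. This is a minimal illustration with a hypothetical toy morpheme inventory and a unigram score standing in for the paper's trigram language model; the affixes and probabilities are invented for the example:

```python
import math

# Hypothetical tiny morpheme inventory with unigram log-probabilities;
# the paper estimates a trigram LM from a manually segmented corpus.
LOGPROB = {
    "wa": math.log(0.15),     # prefix "and"
    "al": math.log(0.20),     # prefix "the"
    "kitab": math.log(0.05),  # stem "book"
    "ha": math.log(0.10),     # suffix "her/its"
}

def segmentations(word, affixes=("wa", "al", "ha")):
    """Enumerate prefix*-stem-suffix* splits of `word`."""
    def strip(s, side):
        # Yield (affix_list, remainder), stripping zero or more affixes.
        yield [], s
        for a in affixes:
            if side == "pre" and s.startswith(a) and len(s) > len(a):
                for rest, rem in strip(s[len(a):], side):
                    yield [a] + rest, rem
            if side == "suf" and s.endswith(a) and len(s) > len(a):
                for rest, rem in strip(s[:-len(a)], side):
                    yield rest + [a], rem
    for prefixes, core in strip(word, "pre"):
        for suffixes, stem in strip(core, "suf"):
            yield prefixes + [stem] + suffixes

def best_segmentation(word):
    """Pick the morpheme sequence with the highest total log-probability.
    Unknown stems get a flat penalty, a crude stand-in for the paper's
    unsupervised stem-acquisition step."""
    def score(seg):
        return sum(LOGPROB.get(m, math.log(1e-6)) for m in seg)
    return max(segmentations(word), key=score)

print(best_segmentation("waalkitabha"))  # ['wa', 'al', 'kitab', 'ha']
```

The same decomposition generalizes to any morpheme-sequence model: only the scoring function changes when moving from this unigram toy to the paper's trigram LM.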
Hybrid Generative/Discriminative model
Including expert knowledge into models
Constraints in Probabilistic models
Fabio Gagliardi Cozman, Ira Cohen, Marcelo Cesar Cirelo, Semi-Supervised Learning of Mixture Models, 2003
A mathematical analysis of why semi-supervised learning accuracy can decrease when unlabeled examples are added. The parameters learned by semi-supervised methods can be seen as a combination of what would have been learned from the labeled and unlabeled data separately: the two share the same structure but have different parameters. When the statistical model structure is incorrect, the models produced in the limit by labeled and unlabeled learners have contradictory biases. Adding unlabeled examples may still decrease the variance of the parameter estimates, but it can also increase the classification error of the resulting model. Another interesting point is that in some cases with an incorrect model, the true decision boundary may lie between the unlabeled and labeled boundaries, so semi-supervised learning can accidentally increase accuracy even when the model is wrong, by interpolation. A possible research question is whether the accuracy of semi-supervised learners can be improved in the general case (when the optimal decision boundary does not lie between the supervised and unsupervised model boundaries) by extrapolating from one boundary past the other toward the optimal one, instead of interpolating. The paper ends with a big challenge with respect to handling incorrect models: “If we could find an universally robust semisupervised learning method, such a method would indeed be a major accomplishment.”
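The blending of labeled and unlabeled information can be illustrated with a toy EM loop. This is a sketch on invented 1-D data, not the paper's setting: labeled points have their responsibilities clamped to their class, unlabeled points get soft responsibilities, and the fitted means combine both sources:

```python
import math

# Invented 1-D data: two classes centered near -2 and +2.
labeled = [(-2.1, 0), (-1.9, 0), (1.8, 1), (2.2, 1)]   # (x, class)
unlabeled = [-2.5, -1.5, 1.5, 2.5, 0.2, -0.3]

def em_means(labeled, unlabeled, iters=50, sigma=1.0):
    """EM for a two-component Gaussian mixture with known, equal variance.
    Labeled points are clamped; unlabeled points are soft-assigned."""
    mu = [-1.0, 1.0]  # initial component means
    for _ in range(iters):
        # E-step: responsibilities for each point.
        resp = []
        for x, y in labeled:
            resp.append((x, [1.0 if k == y else 0.0 for k in (0, 1)]))
        for x in unlabeled:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            z = sum(w)
            resp.append((x, [wi / z for wi in w]))
        # M-step: responsibility-weighted mean per component.
        mu = [
            sum(r[k] * x for x, r in resp) / sum(r[k] for x, r in resp)
            for k in (0, 1)
        ]
    return mu

mu = em_means(labeled, unlabeled)
```

If the true data were not generated by this two-Gaussian model, the unlabeled points would pull the means toward a different (biased) solution than the labeled points alone, which is exactly the tension the paper analyzes.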
Grammar Semantic bootstrapping
Hendrik Blockeel, Joaquin Vanschoren, Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning, 2007
Ng, A. Y., & Jordan, M. I., On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes, 2002, Advances in Neural Information Processing Systems (NIPS) (p. 14)
We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widely held belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation, which is borne out in repeated experiments, that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
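A minimal sketch of the two learner types being compared, on invented synthetic data rather than the paper's benchmarks: the generative model (Gaussian naive Bayes with equal variances) fits class means in closed form, while the discriminative one (logistic regression) fits a decision boundary by gradient descent:

```python
import math
import random

random.seed(0)

# Invented 1-D data: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1).
def sample(n):
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = random.gauss(-1.0 if y == 0 else 1.0, 1.0)
        data.append((x, y))
    return data

train, test = sample(200), sample(1000)

# Generative: Gaussian naive Bayes -- class means in closed form.
def nb_fit(data):
    return [sum(x for x, y in data if y == c) /
            sum(1 for x, y in data if y == c) for c in (0, 1)]

def nb_predict(means, x):
    # Equal variances and priors: predict the nearest class mean.
    return min((0, 1), key=lambda c: (x - means[c]) ** 2)

# Discriminative: logistic regression -- fit (w, b) by gradient descent.
def lr_fit(data, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

means = nb_fit(train)
w, b = lr_fit(train)
acc_nb = sum(nb_predict(means, x) == y for x, y in test) / len(test)
acc_lr = sum((w * x + b > 0) == (y == 1) for x, y in test) / len(test)
```

Varying the training-set size here (e.g. 10 vs. 200 points) is one way to probe the two regimes the abstract describes: the closed-form generative fit stabilizes from very few examples, while the discriminative fit needs more data but can reach a lower error ceiling when the Gaussian assumption is violated.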
Named entity extraction from noisy OCR data (7/28/2009)