Tasks Preparing for NAACL


  1. Talk to Dr. Lonsdale and Dr. Davies in the Linguistics department about finding a corpus of mid-20th century news to build a language model.
  2. New lower bound based on adding spelling
  3. Think about whether it is better to only allow tokens that have some indication of being right. That would eliminate a lot of tokens from the committed list and possibly lower the error rate, but is that useful?


  1. Add another data set. (Deseret News, Daily Enquirer, fax data set?)
  2. Add another OCR engine. (IrisReader, Adobe, OCRopus current version?)
  3. Add hypotheses based on an edit distance from all of the aligned tokens rather than a “spell checker” method that only considers the alternatives to a single token.
  4. Construct the multiple token list from tokens that appear in the output of more than one OCR engine rather than tokens that merely appear more than once across the corpus. This avoids adding a token to the list only because one OCR engine makes the same mistake repeatedly.
  5. Run a complete set of numbers on the multiple token list, but change its priority in the levels of evidence, include voting, and separate punctuation.
  6. Separate recognition of punctuation from words. Sclite merges them into a single recognized token.
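
A minimal sketch of item 4 above, assuming each engine's output is available as a flat list of tokens. The engine names, the `cross_engine_tokens` helper, and the sample tokens are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def cross_engine_tokens(engine_outputs, min_engines=2):
    """Return tokens seen in the output of at least `min_engines` engines.

    `engine_outputs` maps a (hypothetical) engine name to an iterable of
    tokens. Repeats within a single engine do not count as extra evidence.
    """
    engines_seen = defaultdict(set)
    for engine, tokens in engine_outputs.items():
        for token in tokens:
            engines_seen[token].add(engine)
    return {tok for tok, engines in engines_seen.items()
            if len(engines) >= min_engines}


# Made-up sample data: engine_a repeats the same misreading twice.
outputs = {
    "engine_a": ["Eisenhower", "Eisenhower", "arti1lery", "arti1lery"],
    "engine_b": ["Eisenhower", "artillery"],
    "engine_c": ["Eisenhovver", "artillery"],
}
multi = cross_engine_tokens(outputs)
```

Here the misreading that engine_a repeats stays out of the list because no second engine produces it, while a corpus-wide count would have admitted it.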


  • When a token doesn't appear in the dictionary, explore splitting the token with a space to see if the split tokens appear. Example “artillaryfire” → “artillery fire”. Note that Aspell does this, but it isn't clear what to do with the second half of the token.
  • When a token has a dash, split the token at the dash and see if the pieces appear in the dictionary. Example “anti-tank”.
  • When a dash appears at the end of a word, explore merging the words without the dash.
  • When a token ends with a dash, remove the dash and see if the token is in the dictionary; also remove the dash and merge with the next word to see if the merged token is in the dictionary.
  • Separate punctuation from tokens. Currently Sclite requires punctuation to appear correctly in the hypothesis file. By separating punctuation from tokens in both the reference and hypothesis files, we can credit a token or a punctuation mark that is correct even when the other is incorrect.
    • What do we do with “work's”?
    • Make sure that we only consider punctuation in an appropriate spot, e.g. wo?rk would not be addressed by this.
    • In the meantime, make sure that the dictionary search drops punctuation, since punctuation will never appear in the dictionary.
    • Ultimately punctuation needs to be restored in the text that is saved for the patron.
  • In the commit class, consider “levels of evidence”
    • Finding a word in the dictionary is more convincing than finding a word in the multiple token list.
    • Where does voting fall in the levels of evidence?
  • Train on weights for voting of alternatives
    • Get SVMLite, LibSVM
  • Expand Spell Checking using Aspell
  • When Aspell splits a token, and the same sequence of the next token column is empty, add it. Dr. Ringger considers this to be arbitrary, not principled.
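
The splitting, dash, and punctuation ideas in the list above can be sketched as follows. This is only an illustration under assumptions: the dictionary is a plain set of strings, the function names are hypothetical, and the sample uses the correctly spelled “artilleryfire” so that both halves are real words. It does not call Aspell.

```python
import re


def split_candidates(token, dictionary):
    """Ways to insert one space so that both halves are in `dictionary`."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in dictionary and token[i:] in dictionary]


def dash_pieces_found(token, dictionary):
    """True if the token has a dash and every piece is in `dictionary`."""
    pieces = token.split("-")
    return len(pieces) > 1 and all(p and p in dictionary for p in pieces)


def trailing_dash_candidates(token, next_token, dictionary):
    """For a token ending in a dash: drop the dash, then try merging."""
    if not token.endswith("-"):
        return []
    stem = token[:-1]
    candidates = []
    if stem in dictionary:
        candidates.append(stem)
    if next_token is not None and stem + next_token in dictionary:
        candidates.append(stem + next_token)
    return candidates


def split_punctuation(token):
    """Split a token into (leading punctuation, core, trailing punctuation).

    Only punctuation at the edges is separated; punctuation inside the
    token, as in "wo?rk" or "work's", is left in the core.
    """
    match = re.fullmatch(r"(\W*)(.*?)(\W*)", token, flags=re.DOTALL)
    return match.group(1), match.group(2), match.group(3)


# Toy dictionary, made up for the example.
toy_dictionary = {"artillery", "fire", "anti", "tank", "command", "commander"}
```

With the toy dictionary, `split_candidates("artilleryfire", toy_dictionary)` finds the single split into “artillery” and “fire”, and the core returned by `split_punctuation` is what a dictionary search would look up. The separated punctuation is kept so that it can be restored later.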


  • Add hypotheses to the lattice based on using a spell checker for all tokens that don't appear in the dictionary/gazetteer.
  • New committed tool to deal with levels of evidence (found in the dictionary, found in the spell checking, found in the multiple token list, voting).
  • Voting. If multiple hypotheses match, even if they don't appear in the dictionary/gazetteer, that is strong evidence to accept.
  • Update the dev, dev-test, and test set lists
  • In the commit class, consider “levels of evidence”
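
A rough sketch of the voting idea, assuming the aligned alternatives for one position are available as a list with one entry per OCR engine. The `vote` function and the `min_votes` threshold are hypothetical, and returning nothing on a tie is a choice made for this example only.

```python
from collections import Counter


def vote(alternatives, min_votes=2):
    """Return the alternative that at least `min_votes` engines agree on.

    Empty or missing alternatives are ignored. Returns None when no
    alternative reaches the threshold or when the top two are tied.
    """
    counts = Counter(a for a in alternatives if a)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: voting alone gives no evidence
    token, votes = ranked[0]
    return token if votes >= min_votes else None
```

No dictionary is consulted here, which matches the note that agreement between hypotheses counts as evidence even for tokens missing from the dictionary/gazetteer.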


At each point show the results.

  1. Aligned with Oracle results using sclite to examine all alternatives and determine whether any of them are correct. This is the old baseline and exists.
  2. Using a dictionary and gazetteer, commit to one token sequence. This is the JCDL 2009 result and exists.
  3. Collect evidence from the OCR of recurring tokens.
    • Collect recurring tokens as they occur multiple times in the same file. The commit step considered this evidence as equal in weight to the dictionary and gazetteer. This provided marginal improvement across the entire dev set but significantly worse results for some individual files (200% worse).
    • A better method would be to take the recurring token file and make it less significant than dictionary/gazetteer evidence. Only refer to it when the dictionary/gazetteer look-up failed. This may permit us to use all of the recurring tokens. (One of the problems was that as we improved the recall of true tokens from the recurring set, our precision dropped. When this had the same weight as the dictionary/gazetteer, our overall error rate was affected. If we only use the recurring token list when the dictionary/gazetteer fails, we aren't hurting the dictionary-only results and can only improve them.)
    • Another possible way to improve the recurring token list precision is to require the recurring tokens to appear in more than one OCR engine. This way we help avoid the problem of the OCR engines consistently making the same mistake. At least then more than one OCR engine would need to make the same mistake.
  4. Add hypotheses by taking each sausage, checking tokens for existence in the dictionary/gazetteer. If not found, use the spell checker on the single word to suggest an alternative. For sausages this creates a new alternative within the sausage. A viewer needs to accommodate this!
    • If a token is not found in the dictionary/gazetteer, explore whether dividing the token with a space results in two tokens that are found.
    • Explore whether tokens divided by a single dash (do the OCR engines use different dashes?) have both halves in the dictionary/gazetteer.
    • Explore whether tokens ending in a dash are found in the dictionary/gazetteer when merged with the next token.
    • It seems that we are reacting to specific types of OCR errors. Is this a problem?
  5. Within a sausage explore across aligned alternatives using a multi-input spell checker.
  6. The commit process needs to explore each sausage for “fitness” using levels of evidence of the tokens within the sausage, e.g. found in the 1) dictionary/gazetteer, 2) some type of splitting or merging results in a found token, 3) single spell check alternative, 4) multiple spell check alternative, 5) found in recurring token list, 6) “looks like an English word.”
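
The ordering in item 6 can be sketched as a ranked commit step. This is a sketch under assumptions, not the existing commit class: a sausage is represented as a list of (token, evidence tags) pairs, and the tag names are invented for the example.

```python
# Evidence tags in decreasing order of strength, following item 6.
EVIDENCE_LEVELS = [
    "dictionary",      # 1) found in the dictionary/gazetteer
    "split_or_merge",  # 2) splitting or merging yields a found token
    "single_spell",    # 3) single spell check alternative
    "multi_spell",     # 4) multiple spell check alternative
    "recurring",       # 5) found in the recurring token list
    "looks_english",   # 6) "looks like an English word"
]
RANK = {name: i for i, name in enumerate(EVIDENCE_LEVELS)}


def best_level(evidence):
    """Rank of the strongest tag, or len(EVIDENCE_LEVELS) if there is none."""
    return min((RANK[tag] for tag in evidence), default=len(EVIDENCE_LEVELS))


def commit(sausage):
    """Pick the alternative with the strongest evidence.

    `sausage` is a list of (token, evidence_tags) pairs. On a tie the
    earliest alternative wins, because min() keeps the first minimum.
    """
    return min(sausage, key=lambda alt: best_level(alt[1]))[0]
```

Because recurring-token evidence ranks below dictionary/gazetteer evidence, it only decides the outcome when no alternative has stronger support, which is the behavior proposed under item 3.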

Tasks Preparing for JCDL

  • Implement an alignment cost based on the confusion matrix between transcriptions and OCR output.
    • Need to decide whether we're going to base the confusion matrix and resulting cost on knowing the location of the token and the resulting character or whether to use the alignment information from aligning the transcription with an OCR output. The former seems more natural, but will end up with many points where characters are recognized and there is no letter found in the document (noise). The latter will incorporate noise into the alignment.
    • The confusion matrix costs can also be used in the edit distance calculations when adding hypotheses to the lattice.
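
As an illustration of using confusion-based costs in an edit distance, here is a small weighted Levenshtein sketch. The cost table is made up; in practice the costs would have to be derived from the confusion matrix (for example from confusion counts), and that derivation is not shown.

```python
def weighted_edit_distance(source, target, sub_cost, indel_cost=1.0):
    """Levenshtein distance with substitution costs taken from `sub_cost`.

    `sub_cost(a, b)` returns the cost of reading character `a` as `b`.
    Insertions and deletions use the flat `indel_cost`.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,      # delete a
                dist[i][j - 1] + indel_cost,      # insert b
                dist[i - 1][j - 1] + (0.0 if a == b else sub_cost(a, b)),
            )
    return dist[-1][-1]


# Hypothetical costs: confusions that OCR makes often are cheap.
CONFUSION_COST = {("l", "1"): 0.2, ("1", "l"): 0.2,
                  ("O", "0"): 0.2, ("0", "O"): 0.2}


def sub_cost(a, b):
    """Look up a confusion cost, defaulting to a full-cost substitution."""
    return CONFUSION_COST.get((a, b), 1.0)
```

With these costs, “artillery” is much closer to “arti1lery” than to “artixlery”, so hypotheses that differ only by a common OCR confusion would rank ahead of arbitrary one-character edits.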

Tasks that need to happen

Expand Datasets

  • 19th Century Newspapers: Deseret News, Daily Enquirer
  • 21st Century Fax


  • DocumentLattice does not escape characters that are not legal in XML. That wasn't a problem in the Eisenhower Communiques, but in general this will be a problem.
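
One possible way to handle this, sketched in Python rather than in the project's own code: replace characters that XML 1.0 does not allow, then escape the markup characters. The `xml_safe` name and the choice of U+FFFD as the replacement are assumptions for the example, and how DocumentLattice should actually treat such characters is still open.

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 Char production is not legal in a document.
_ILLEGAL_XML = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_safe(text, replacement="\uFFFD"):
    """Replace illegal XML characters, then escape &, <, and >."""
    return escape(_ILLEGAL_XML.sub(replacement, text))
```

Replacing rather than dropping the illegal characters keeps the token length unchanged, which may matter if character positions are used for alignment.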


  • toSclite classes don't escape “{”, “}”, and “/”, which have special meaning to Sclite. Need to think about what to do, in particular if those characters are found in the corpus. This wasn't a big problem in the Eisenhower Communiques, but in general it will be a problem. Do we need to rewrite Sclite? Can we get the source?

Tasks for Chris Rotz

Tasks for Johnny Williamson

Research in Scholarly Publications

  • Explore publications on language knowledge being incorporated into OCR. This motivates using other knowledge in our method.
nlp-private/the_list.txt · Last modified: 2015/04/23 13:32 by ryancha