Recovering high-quality digital text from modern machine-printed document images using Optical Character Recognition (OCR) is nearly a solved problem. However, recovering high-quality digital text from historical document images is significantly more challenging. Our document recognition project focuses on the latter problem. The work combines multiple OCR hypotheses using multi-sequence alignment methods and uses machine learning to select the best hybrid transcription. Such hypotheses can come from multiple OCR engines or from a single OCR engine run on different inputs. The potential for transcription improvement is substantial.

Furthermore, we are interested in mining patterns from historical texts. The work includes examinations of the impact of errors in document recognition on the performance of various probabilistic topic models.


How Well Does Multiple OCR Error Correction Generalize?
William B. Lund, Eric K. Ringger, Daniel D. Walker
In Proceedings of the 21st Document Recognition and Retrieval Conference (DRR 2014)
As the digitization of historical documents, such as newspapers, becomes more common, the need of the archive patron for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: (1) demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data sets, including a new synthetic training set; (2) enhancing the correction algorithm with novel features; and (3) assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both the complexity of the training process and the learned correction model.
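Since the results on this page are all reported as word error rate (WER), a minimal sketch of how WER is commonly computed may be useful: the word-level edit distance between a reference transcription and an OCR hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration, not the evaluation code used in the papers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is misrecognized.
print(word_error_rate("the quick brown fox", "the quick brovvn fox"))  # 0.25
```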
Why Multiple Document Image Binarizations Improve OCR
William B. Lund, Douglas J. Kennard, Eric K. Ringger
2nd International Workshop on Historical Document Imaging and Processing 2013 (HIP 2013)
Our previous work has shown that the error correction of optical character recognition (OCR) on degraded historical machine-printed documents is improved with the use of multiple information sources and multiple OCR hypotheses, including those from multiple document image binarizations. The contributions of this paper are in demonstrating how diversity among multiple binarizations makes those improvements to OCR accuracy possible. We demonstrate the degree and breadth to which the information required for correction is distributed across multiple binarizations of a given document image. Our analysis reveals that the sources of these corrections are not limited to any single binarization and that the full range of binarizations holds information needed to achieve the best result as measured by the word error rate (WER) of the final OCR decision. Even binarizations with high WERs contribute to improving the final OCR. For the corpus used in this research, fully 2.68% of all tokens are corrected using hypotheses not found in the OCR of the binarized image with the lowest WER. Further, we show that the higher the WER of the OCR overall, the more the corrections are distributed among all binarizations of the document image.
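The idea that correct tokens are spread across binarizations can be illustrated with a small sketch. It measures the fraction of reference tokens that the best single binarization gets wrong but at least one other binarization gets right. This is a simplification: it assumes the OCR outputs are already aligned token-for-token with the reference, and it measures what could be recovered rather than what the paper's correction method actually recovers. The function name and threshold labels are hypothetical.

```python
def recoverable_outside_best(reference, hypotheses, best):
    """Fraction of reference tokens missed by `best` but present elsewhere.

    reference:  list of correct tokens.
    hypotheses: dict mapping a binarization label to a token list that is
                aligned one-to-one with `reference` (an assumption here).
    best:       label of the binarization with the lowest WER.
    """
    recovered = 0
    for i, ref_token in enumerate(reference):
        if hypotheses[best][i] == ref_token:
            continue  # the best binarization already has this token right
        others = (tokens[i] for label, tokens in hypotheses.items() if label != best)
        if any(token == ref_token for token in others):
            recovered += 1
    return recovered / len(reference)


reference = ["the", "old", "mill", "burned"]
hypotheses = {
    "t96": ["the", "o1d", "mill", "bumed"],
    "t128": ["the", "old", "mi11", "bumed"],
    "t160": ["tbe", "old", "mill", "burned"],  # lowest WER, one error
}
print(recoverable_outside_best(reference, hypotheses, best="t160"))  # 0.25
```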
Combining Multiple Thresholding Binarization Values to Improve OCR Output
Bill Lund; Doug Kennard; Eric Ringger
DRR 2013
On noisy historical document images, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple global threshold binarizations of the same image to improve text output. Using a new corpus of grayscale images of 19th-century newspapers for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving 8.41% WER, a 39.1% reduction in error rate relative to the performance of the original OCR engine on this data set.
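A minimal sketch of the first step described above, producing several global-threshold binarizations of one grayscale image, together with the arithmetic behind the reported relative reduction. The image representation (nested lists of 0-255 intensities), the function names, and the threshold values are assumptions for illustration; each binarized image would then be passed to an OCR engine, which is not shown.

```python
def binarize(gray, threshold):
    """Mark pixels darker than `threshold` as foreground (1), others as 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def binarize_at_thresholds(gray, thresholds):
    """Return one binarized copy of the image per global threshold."""
    return {t: binarize(gray, t) for t in thresholds}


def relative_reduction(baseline_wer, new_wer):
    """Error-rate reduction expressed as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# A tiny 2x3 "image"; real pages would be loaded from grayscale scans.
page = [[10, 120, 250],
        [90, 200, 30]]
variants = binarize_at_thresholds(page, [64, 128, 192])
print(variants[64])   # [[1, 0, 0], [0, 0, 1]]
print(variants[128])  # [[1, 1, 0], [1, 0, 1]]

# Reproduces the figure quoted in the abstract: 13.8% -> 8.41% WER.
print(round(100 * relative_reduction(13.8, 8.41), 1))  # 39.1
```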
Evaluating Supervised Topic Models in the Presence of OCR Errors
Daniel Walker; Eric Ringger; Kevin Seppi
The Conference on Document Recognition and Retrieval XX (DRR 2013)
Received best student paper award
Topic discovery using unsupervised topic models degrades as error rates increase in OCR transcriptions of historical document images. Despite the availability of meta-data, analyses by supervised topic models, such as Supervised LDA and Topics over Non-Parametric Time, exhibit similar degradation.
A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods
Dan Walker; Bill Lund; Eric Ringger
DRR 2012
We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset.
Progressive Alignment and Discriminative Error Correction for Multiple OCR Engines
Bill Lund; Dan Walker; Eric Ringger
ICDAR 2011
This paper presents a novel method for improving optical character recognition (OCR). The method employs the progressive alignment of hypotheses from multiple OCR engines followed by final hypothesis selection using maximum entropy classification methods. The maximum entropy models are trained on a synthetic calibration data set. Although progressive alignment is not guaranteed to be optimal, the results are nonetheless strong. Our method shows a 24.6% relative improvement over the word error rate (WER) of the best performing of the five OCR engines employed in this work. Relative to the average WER of all five OCR engines, our method yields a 69.1% relative reduction in the error rate. Furthermore, 52.2% of the documents achieve a new low WER.
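Progressive alignment can be sketched as follows: start with one OCR output, then add each remaining output by aligning it against the columns built so far. This is a simplified illustration under stated assumptions (unit gap and mismatch costs, a token matches a column if it equals any entry in it, sequences added in the order given) and is not the paper's implementation. The maximum entropy selection step is not shown. The `GAP` marker and function names are hypothetical.

```python
GAP = "-"


def column_cost(rows, col, token):
    """0 if `token` equals any non-gap entry of the column, otherwise 1."""
    return 0 if any(r[col] != GAP and r[col] == token for r in rows) else 1


def add_sequence(rows, seq):
    """Align `seq` to the existing alignment `rows` and return the new rows."""
    n, m = len(rows[0]), len(seq)
    # dp[i][j]: minimum cost of aligning the first i columns with the
    # first j tokens of `seq`.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + column_cost(rows, i - 1, seq[j - 1]),
                dp[i - 1][j] + 1,  # gap in the new sequence
                dp[i][j - 1] + 1,  # gap in every existing row
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    new_rows = [[] for _ in rows]
    new_seq = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + column_cost(rows, i - 1, seq[j - 1]):
            for k, row in enumerate(rows):
                new_rows[k].append(row[i - 1])
            new_seq.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            for k, row in enumerate(rows):
                new_rows[k].append(row[i - 1])
            new_seq.append(GAP)
            i -= 1
        else:
            for k in range(len(rows)):
                new_rows[k].append(GAP)
            new_seq.append(seq[j - 1])
            j -= 1
    return [row[::-1] for row in new_rows] + [new_seq[::-1]]


def progressive_align(sequences):
    """Align token sequences one at a time, in the order given."""
    rows = [list(sequences[0])]
    for seq in sequences[1:]:
        rows = add_sequence(rows, list(seq))
    return rows


outputs = [
    ["the", "quick", "brown", "fox"],
    ["the", "qu1ck", "brown", "fox"],
    ["the", "quick", "fox"],
]
for row in progressive_align(outputs):
    print(row)
```

Because each new sequence is aligned only against the alignment built so far, earlier decisions are never revisited, which is why the result is not guaranteed to be globally optimal.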
Error Correction with In-Domain Training Across Multiple OCR System Outputs
Bill Lund; Eric Ringger
ICDAR 2011
This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list on a combination of textual features across the aligned output of multiple OCR engines where in-domain training data is available. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine.
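To make the notion of a decision list over aligned OCR output concrete, here is a hand-written sketch that picks one token from a single aligned column. The paper learns its decision list from in-domain training data; the three rules below, their order, the `GAP` marker, and the lexicon are illustrative assumptions only.

```python
from collections import Counter

GAP = ""  # hypothetical marker for "this engine produced no token here"


def choose_token(column, lexicon):
    """Apply an ordered list of rules to one aligned column of hypotheses."""
    votes = Counter(token for token in column if token != GAP)
    if not votes:
        return GAP
    # Rule 1: all engines that produced a token agree.
    if len(votes) == 1:
        return next(iter(votes))
    # Rule 2: prefer the most frequent hypothesis found in the lexicon.
    for token, _count in votes.most_common():
        if token.lower() in lexicon:
            return token
    # Rule 3: fall back to the most frequent hypothesis.
    return votes.most_common(1)[0][0]


lexicon = {"the", "quick", "brown", "fox"}
columns = [["tbe", "the", "the"], ["qu1ck", "quick", "quiok"], ["fox", "fox", "fox"]]
print([choose_token(col, lexicon) for col in columns])  # ['the', 'quick', 'fox']
```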
Extracting Person Names from Diverse and Noisy OCR Text
Thomas Packer; Joshua Lutes; Aaron Stewart; David Embley; Eric Ringger; Kevin Seppi; Lee Jensen
CIKM 2010 Workshop on the Analysis of Noisy Documents (AND 2010)
We apply four extraction algorithms to various types of noisy OCR data found “in the wild” and focus on full name extraction. We evaluate the extraction quality with respect to hand-labeled test data and improve upon the extraction performance of the individual systems by means of ensemble extraction.
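One simple way to combine several extractors is voting, sketched below: a name is kept if at least `min_votes` extractors propose it. This voting rule is an assumption chosen for illustration; the abstract does not say which ensemble method the paper uses, and the function name and sample names are hypothetical.

```python
from collections import Counter


def ensemble_names(extractions, min_votes=2):
    """Keep names proposed by at least `min_votes` of the extractors.

    extractions: one iterable of extracted full names per extractor.
    """
    # Each extractor votes at most once for a given name.
    votes = Counter(name for names in extractions for name in set(names))
    return {name for name, count in votes.items() if count >= min_votes}


outputs = [
    ["John Smith", "Mary Jones"],
    ["John Smith"],
    ["Mary Jones", "J0hn Smith"],  # an OCR-damaged variant gets one vote
    ["John Smith"],
]
print(sorted(ensemble_names(outputs)))  # ['John Smith', 'Mary Jones']
```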
Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
Dan Walker; Bill Lund; Eric Ringger
EMNLP 2010
We show the effects of OCR errors on both document-level topic analysis (document clustering) and word-level topic analysis (LDA), using both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA.
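Filtering low-frequency words, the mitigation mentioned above, can be sketched in a few lines. The intuition is that many OCR errors produce rare, one-off tokens. The function name and the `min_count` cutoff are assumptions for illustration, not the settings used in the paper.

```python
from collections import Counter


def filter_rare_words(documents, min_count=2):
    """Drop tokens whose corpus-wide frequency is below `min_count`.

    documents: list of token lists, one per document.
    """
    counts = Counter(token for doc in documents for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in documents]


docs = [
    ["the", "mill", "bumed"],  # "bumed" is an OCR error for "burned"
    ["the", "mill", "burned"],
    ["the", "burned"],
]
print(filter_rare_words(docs))
# [['the', 'mill'], ['the', 'mill', 'burned'], ['the', 'burned']]
```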
Improving Optical Character Recognition through Efficient Multiple System Alignment
Bill Lund; Eric Ringger
JCDL 2009
Awarded Best Student Paper of the conference
By aligning the output of multiple OCR engines and taking advantage of the differences between them, the error rate based on the aligned lattice of recognized words is significantly lower than the individual OCR word error rates. Results from a collection of poor quality mid-twentieth century typewritten documents demonstrate an average reduction of 55.0% in the error rate of the lattice of alternatives and a realized word error rate (WER) reduction of 35.8% in a dictionary-based selection process. As an important precursor, an innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice. On average 0.0079% of the state space is explored to identify all optimal alignments of the documents.
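The role of an admissible heuristic in A* alignment can be illustrated on the simplest case, aligning two sequences. The sketch below uses unit edit costs and the difference in remaining lengths as the heuristic, which never overestimates because at least that many gaps are still required. This is not the paper's heuristic, which handles multiple sequences and recovers all optimal alignments; the function name and cost model are assumptions.

```python
import heapq


def astar_alignment_cost(a, b):
    """Return (optimal edit cost, number of states expanded) for aligning a and b."""
    goal = (len(a), len(b))

    def heuristic(i, j):
        # Lower bound on remaining cost: unmatched length difference.
        return abs((len(a) - i) - (len(b) - j))

    best = {(0, 0): 0}
    frontier = [(heuristic(0, 0), 0, 0, 0)]  # (f = g + h, g, i, j)
    expanded = 0
    while frontier:
        _f, g, i, j = heapq.heappop(frontier)
        if g > best.get((i, j), float("inf")):
            continue  # stale queue entry; a cheaper path was found later
        expanded += 1
        if (i, j) == goal:
            return g, expanded
        moves = []
        if i < len(a) and j < len(b):
            moves.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))  # match or substitute
        if i < len(a):
            moves.append((i + 1, j, 1))  # gap in b
        if j < len(b):
            moves.append((i, j + 1, 1))  # gap in a
        for ni, nj, step in moves:
            ng = g + step
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(frontier, (ng + heuristic(ni, nj), ng, ni, nj))
    raise RuntimeError("goal state was not reached")


cost, expanded = astar_alignment_cost(
    ["the", "quick", "fox"], ["the", "qu1ck", "brown", "fox"]
)
print(cost, expanded)
```

Comparing `expanded` with the full grid size, `(len(a) + 1) * (len(b) + 1)`, gives a rough sense of how much of the state space the heuristic lets the search skip.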


Please contact Eric Ringger or Bill Lund.

nlp/historical-document-recognition.txt · Last modified: 2015/05/21 16:40 by plf1