* <strong>DRR 2013</strong>
* On noisy historical document images, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple global threshold binarizations of the same image to improve text output. Using a new corpus of 19th-century newspaper grayscale images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output. From the resulting word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving 8.41% WER, a 39.1% relative reduction in error rate over the original OCR engine on this data set. (A minimal sketch of the multiple-threshold step appears below.)
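The front end of this method is easy to picture: binarize the same grayscale page at several global thresholds and collect one OCR hypothesis per threshold. The sketch below illustrates only that step; the use of OpenCV and pytesseract is an assumption for illustration, not the paper's tooling.

<code python>
# Minimal sketch: OCR one page image under several global binarization
# thresholds. OpenCV and pytesseract are illustrative choices, not the
# paper's actual tooling.
import cv2
import pytesseract

def ocr_over_thresholds(image_path, thresholds=(96, 128, 160, 192)):
    """Return {threshold: OCR text} for one grayscale page image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    hypotheses = {}
    for t in thresholds:
        # Pixels darker than t become black foreground on a white page.
        _, binary = cv2.threshold(gray, t, 255, cv2.THRESH_BINARY)
        hypotheses[t] = pytesseract.image_to_string(binary)
    # Downstream, the paper aligns these outputs into a word lattice
    # and commits to one hypothesis per position.
    return hypotheses
</code>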
  
[http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1568659| Evaluating Supervised Topic Models in the Presence of OCR Errors]
* Received best student paper award
* Topic discovery using unsupervised topic models degrades as error rates increase in OCR transcriptions of historical document images. Despite the availability of meta-data, analyses by supervised topic models, such as Supervised LDA and Topics over Non-Parametric Time, exhibit similar degradation.
  
[http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1284063| A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods]

* <strong>ICDAR 2011</strong>
* This paper presents a novel method for improving optical character recognition (OCR). The method employs the progressive alignment of hypotheses from multiple OCR engines, followed by final hypothesis selection using maximum entropy classification methods. The maximum entropy models are trained on a synthetic calibration data set. Although progressive alignment is not guaranteed to be optimal, the results are nonetheless strong. Our method shows a 24.6% relative improvement over the word error rate (WER) of the best performing of the five OCR engines employed in this work. Relative to the average WER of all five OCR engines, our method yields a 69.1% relative reduction in the error rate. Furthermore, 52.2% of the documents achieve a new low WER. (A sketch of the progressive-alignment step appears below.)
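As a sketch of the progressive-alignment idea under stated assumptions: each engine's token sequence is merged in turn into a growing set of columns, so every column ends up holding one hypothesis per engine (with gaps). The difflib backbone and the closing majority vote are stand-ins invented for the sketch; the paper selects the final hypothesis with maximum entropy models trained on synthetic calibration data.

<code python>
# Sketch: progressively align token sequences from several OCR engines
# into columns of competing hypotheses. difflib and the majority vote
# are stand-ins; the paper trains maximum entropy models on a synthetic
# calibration set for the final selection.
from collections import Counter
from difflib import SequenceMatcher

def merge(columns, n_prev, tokens):
    """Merge one engine's token list into the existing columns."""
    backbone = [col[0] or "" for col in columns]  # crude: engine 1 + gaps
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(
            a=backbone, b=tokens, autojunk=False).get_opcodes():
        if tag in ("equal", "replace"):
            # Pair columns with new tokens position by position.
            for k in range(max(i2 - i1, j2 - j1)):
                col = columns[i1 + k] if i1 + k < i2 else [None] * n_prev
                tok = tokens[j1 + k] if j1 + k < j2 else None
                out.append(col + [tok])
        elif tag == "delete":        # backbone-only tokens: gap for new engine
            out.extend(col + [None] for col in columns[i1:i2])
        else:                        # "insert": new-engine-only tokens
            out.extend([None] * n_prev + [tok] for tok in tokens[j1:j2])
    return out

def progressive_align(outputs):
    """outputs: one token list per OCR engine."""
    columns = [[tok] for tok in outputs[0]]
    for n, tokens in enumerate(outputs[1:], start=1):
        columns = merge(columns, n, tokens)
    return columns

def select(columns):
    """Stand-in for the maxent selector: majority vote per column."""
    picks = (Counter(t for t in col if t).most_common(1) for col in columns)
    return " ".join(p[0][0] for p in picks if p)

outputs = [["the", "qnick", "fox"], ["the", "quick", "fox"],
           ["tho", "quick", "fox"]]
print(select(progressive_align(outputs)))   # -> the quick fox
</code>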
  
[http://www.icdar2011.org/fileup/PDF/4520a658.pdf| Error Correction with In-Domain Training Across Multiple OCR System Outputs]
* <strong>ICDAR 2011</strong>
* This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list over a combination of textual features across the aligned output of multiple OCR engines, where in-domain training data is available. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine. (An illustrative decision-list sketch appears below.)
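The sketch below gives the flavor of a decision list: ordered tests over features of one aligned column of hypotheses, where the first test that fires decides. The specific rules, their order, and the tiny stand-in lexicon are invented for illustration; the paper learns its list from in-domain training data.

<code python>
# Illustrative decision list choosing one word from an aligned column of
# OCR hypotheses (one entry per engine, None = gap). Rules, ordering,
# and lexicon are invented for this sketch, not learned as in the paper.
VOCABULARY = {"the", "quick", "brown", "fox"}   # stand-in lexicon

def choose(column, engine_ranking=(0, 1, 2)):
    candidates = [w for w in column if w]
    # Rule 1: unanimous agreement wins outright.
    if len(set(candidates)) == 1:
        return candidates[0]
    # Rule 2: prefer a hypothesis that is an in-vocabulary word.
    for w in candidates:
        if w.lower() in VOCABULARY:
            return w
    # Rule 3: otherwise trust engines in a fixed ranking.
    for idx in engine_ranking:
        if idx < len(column) and column[idx]:
            return column[idx]
    return ""                        # all gaps: emit nothing

print(choose(["qnick", "quick", None]))   # -> quick (rule 2)
</code>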
  
  
* <strong>CIKM 2010 Workshop on the Analysis of Noisy Documents (AND 2010)</strong>
* We apply four extraction algorithms to various types of noisy OCR data found "in the wild" and focus on full name extraction. We evaluate the extraction quality with respect to hand-labeled test data and improve upon the extraction performance of the individual systems by means of ensemble extraction. (A voting-based sketch appears below.)
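A minimal sketch of the ensemble step, assuming each extractor returns a set of full-name strings: keep a name when at least k of the systems propose it. The threshold k and the lowercase normalization are illustrative assumptions, not the paper's exact combination scheme.

<code python>
# Sketch of ensemble full-name extraction by voting: keep a name when at
# least k of the individual extractors propose it. The threshold and the
# normalization are assumptions made for this sketch.
from collections import Counter

def ensemble_names(system_outputs, k=2):
    """system_outputs: one set of extracted name strings per system."""
    votes = Counter()
    for names in system_outputs:
        votes.update({n.strip().lower() for n in names})
    return {name for name, v in votes.items() if v >= k}

systems = [{"John Smith", "Mary Jones"}, {"John Smith"}, {"J0hn Smith"}]
print(ensemble_names(systems))   # -> {'john smith'}
</code>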
  
[http://nlp.cs.byu.edu/~dan/papers/emnlp_2010.pdf| Evaluating Models of Latent Document Semantics in the Presence of OCR Errors]
* <strong>EMNLP 2010</strong>
* We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. (A sketch of the frequency filter appears below.)
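The low-frequency filter mentioned above is simple to state: drop word types whose corpus count falls below a cutoff before fitting the topic model. A minimal sketch, assuming documents are already tokenized and using an arbitrary placeholder cutoff:

<code python>
# Sketch of the low-frequency word filter: remove rare word types from a
# corpus before topic modeling. The cutoff value is a placeholder.
from collections import Counter

def filter_low_frequency(docs, min_count=5):
    """docs: list of token lists. Returns docs with rare types removed."""
    counts = Counter(tok for doc in docs for tok in doc)
    keep = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok for tok in doc if tok in keep] for doc in docs]
</code>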
  
[http://dl.acm.org/citation.cfm?id=1555437| Improving optical character recognition through efficient multiple system alignment]
* Awarded Best Student Paper of the conference
* By aligning the output of multiple OCR engines and taking advantage of the differences between them, the error rate based on the aligned lattice of recognized words is significantly lower than the individual OCR word error rates. Results from a collection of poor-quality mid-twentieth-century typewritten documents demonstrate an average reduction of 55.0% in the error rate of the lattice of alternatives and a realized word error rate (WER) reduction of 35.8% in a dictionary-based selection process. As an important precursor, an innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice. On average, 0.0079% of the state space is explored to identify all optimal alignments of the documents. (A two-sequence A* sketch appears below.)
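To make the A* formulation concrete, here is a two-sequence sketch: states are positions (i, j) in the two token sequences, moves are match/substitute, delete, and insert, and the heuristic is the remaining-length difference, an admissible lower bound on the remaining edit cost. The paper itself aligns several OCR outputs simultaneously with a stronger admissible heuristic.

<code python>
# Two-sequence sketch of A* alignment. The heuristic below (difference
# of remaining lengths) never overestimates the remaining edit cost, so
# A* remains optimal; the paper's multi-sequence heuristic is stronger.
import heapq

def astar_align(a, b):
    """Minimal edit cost aligning token sequences a and b."""
    def h(i, j):                       # admissible lower bound
        return abs((len(a) - i) - (len(b) - j))

    frontier = [(h(0, 0), 0, 0, 0)]    # (f = g + h, g, i, j)
    best_g = {(0, 0): 0}
    while frontier:
        f, g, i, j = heapq.heappop(frontier)
        if i == len(a) and j == len(b):
            return g                   # goal: both sequences consumed
        # Moves: diagonal (match/substitute), delete from a, insert from b.
        for di, dj, cost in ((1, 1, 0 if a[i:i+1] == b[j:j+1] else 1),
                             (1, 0, 1), (0, 1, 1)):
            ni, nj = i + di, j + dj
            if ni > len(a) or nj > len(b):
                continue
            ng = g + cost
            if ng < best_g.get((ni, nj), float("inf")):
                best_g[(ni, nj)] = ng
                heapq.heappush(frontier, (ng + h(ni, nj), ng, ni, nj))

print(astar_align(list("recognition"), list("recogniti0n")))   # -> 1
</code>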
  
==Questions?==
  
Please contact [http://faculty.cs.byu.edu/~ringger/ Eric Ringger] or [http://www.billlund.com/ Bill Lund].