Thank you for your interest in our synthetic OCR datasets.
We are currently working on publishing a set of synthetic OCR datasets based on three common text analytics datasets: 20 Newsgroups, Reuters 21578, and the Enron e-mail corpus.
We are completing the finishing touches and plan to work with the LDC to publish the datasets.
If you would like to receive updates on the progress of the datasets, or would like to ask questions about them, you may subscribe to our Google Group, which we anticipate will be very low volume (< 1 message/month).
You may obtain a copy of the code used to generate the datasets by cloning our Git repository:
git clone git://nlp.cs.byu.edu/generate_artificial_ocr_data.git