Ohio Death Records Automated Processing Pipeline Documentation

TL;DR

If you just want to run the models on some data you can do so easily with access to the BYU supercomputer and to fslg_handwriting.

SSH into the RHEL 7 nodes of Mary Lou with:

ssh rhel7ssh.rc.byu.edu

This gets around issues of old versions of GLIBC when loading pytorch.

Dependencies are handled through a conda virtual environment, so make sure you have conda installed. Load the virtual environment using:

conda activate /fslgroup/fslg_handwriting/compute/death/env/death_env

Segmenting

Run the following command on a directory of images:

python demo/practice_dir.py configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml <input imgs dir> <output images dir>

Handwriting Recognition

Run the following command on the directory of segmented images output from the last step:

python hw_pred.py sample_config_iam_hwr.yaml <segmented images dir>

Results will be output to stdout.

Overall Workflow

2. Preprocess death record images by deskewing
3. Label images for segmentation
4. Train segmentation model
5. Segment death records and pair with cause of death transcriptions
6. Train HWR model
7. Perform handwriting recognition
8. Post-processing

0. Preliminaries

Data for Segmentation

We use 2 data sources to train a segmenter to identify pertinent lines in the death records. The 2 data sources are:

1. Ohio death records
2. North Carolina death records

Labelme was used to label the images. We label the images with 2 labels:

1. medcert - Medical Certificate of Death portion of death record
2. cod - any line of text pertinent to why a person died

These images and annotations can be found on the BYU supercomputer in COCO format at the following path:

/fslgroup/fslg_handwriting/compute/death/data/segmentation/combined_coco

Data for Handwriting Recognition

We use 3 data sources of transcribed lines of text in order to train a handwriting recognition model for the Ohio death records. The 3 data sources are:

1. IAM dataset
 a) 13353 training images
2. Ohio death records
 a) 1390 training images
 b) 817 validation images
3. North Carolina death records
 a) 7973 training images
 b) 1754 validation images


The Ohio and North Carolina images were obtained by segmenting death records and pairing the resulting lines with the transcriptions collected by student transcribers. These images and annotations can be found, in the format used by Curtis Wigington’s SFR code, on the BYU supercomputer at the following path:

/fslgroup/fslg_handwriting/compute/death/data/transcription/transcribed_iam_combined

We needed to use the IAM dataset for training because of the irregularities present in the death record transcriptions. The main irregularity is whether contributory factors are included in the transcription.
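To see how much of the combined training set each source contributes, the counts listed above can be tallied. This is only an arithmetic sketch; the dictionary and function names are illustrative.

```python
# Training-image counts copied from the list of data sources above.
TRAIN_COUNTS = {
    "IAM": 13353,
    "Ohio": 1390,
    "North Carolina": 7973,
}


def training_shares(counts):
    """Return each source's fraction of the combined training set."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = training_shares(TRAIN_COUNTS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

IAM supplies a little under 60% of the 22716 training lines, and the Ohio records themselves only about 6%.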

Making sure you have the right software dependencies is awful. To ease this, we provide a conda virtual environment that already contains all the appropriate software dependencies. Please don’t install any additional modules while using this environment, or you might break those dependencies. Load the environment with the following command:

conda activate /fslgroup/fslg_handwriting/compute/death/env/death_env
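Before running any step it can help to confirm that the shared environment is the active one. The sketch below assumes conda's usual behaviour of exporting CONDA_PREFIX on activation; the helper name env_is_active is hypothetical and not part of the pipeline.

```python
import os

# Path taken from the activation command above.
DEATH_ENV = "/fslgroup/fslg_handwriting/compute/death/env/death_env"


def env_is_active(expected_prefix, environ=None):
    """Return True when CONDA_PREFIX points at the expected environment."""
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_PREFIX")
    if not active:
        return False
    # normpath removes trailing slashes so equivalent paths compare equal.
    return os.path.normpath(active) == os.path.normpath(expected_prefix)


if __name__ == "__main__":
    print("death_env active:", env_is_active(DEATH_ENV))
```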

2. Preprocess Death Records

All images should be deskewed first. We used ImageMagick to accomplish this. Deskewing is important because it makes labeling lines of text for the segmentation ground truth significantly easier. The following command will deskew an image:

convert <path to image> -deskew 80% <save path for deskewed image>
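For a whole directory, the command above can be wrapped in a short script. This is a minimal sketch, assuming ImageMagick's convert is on the PATH and that the scans are .jpg files; the function names and the file pattern are illustrative, not part of the pipeline.

```python
import subprocess
from pathlib import Path


def deskew_command(src, dst, threshold="80%"):
    """Build the ImageMagick deskew command for one image."""
    return ["convert", str(src), "-deskew", threshold, str(dst)]


def deskew_directory(in_dir, out_dir, pattern="*.jpg"):
    """Deskew every image matching `pattern`; return the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob(pattern)):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if convert fails on an image.
        subprocess.run(deskew_command(src, dst), check=True)
        written.append(dst)
    return written
```

Writing to a separate output directory keeps the original scans untouched, so the threshold can be changed and the step rerun.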

3. Label Images for Segmentation

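Images are labeled using labelme. We label both the Medical Certificate of Death region (medcert) and every line of text pertinent to the cause of death (cod).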

After labeling images, format the dataset in COCO format by running:

cd /fslgroup/fslg_handwriting/compute/death/software/labelme

python examples/instance_segmentation/labelme2coco_nocrowd_instance_all_cod.py --labels <labels text file> <input images dir> <output images dir>

The <labels text file> can be replaced with /fslgroup/fslg_handwriting/compute/death/data/segmentation/labels_nosfr.txt

4. Train Segmentation Model

The Facebook MaskRCNN model is used for segmenting single lines from death records. We use maskrcnn-benchmark, a PyTorch implementation made by Facebook, with the pretrained ResNet-50 architecture. We find that this model provides very good performance (0.969 at IoU=0.50:0.95) with minimal training time (under 15 hours). We believe that the ResNet architecture works better than Start-Follow-Read’s Start-of-Line finder for this task because it has a larger receptive field and because we are only interested in a small number of lines of text instead of every line of text.
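For readers unfamiliar with the IoU=0.50:0.95 notation, it follows the COCO convention of evaluating at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below illustrates the idea with axis-aligned boxes for simplicity (mask IoU is the same ratio computed on pixel counts). It is not the evaluation code used by maskrcnn-benchmark, and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds implied by "IoU=0.50:0.95" (step 0.05).
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Thresholds at which `pred` would count as a match for `truth`."""
    iou = box_iou(pred, truth)
    return [t for t in THRESHOLDS if iou >= t]
```

A predicted line that covers only half of the labeled line passes the 0.50 threshold and none of the stricter ones, which is why a high score over the full range indicates tight localization.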

We use a modified version of the maskrcnn-benchmark library that does not flip images or do random crops.

To train the model, run the following command:

python tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml SOLVER.IMS_PER_BATCH 2 SOLVER.MAX_ITER 100000 TEST.IMS_PER_BATCH 2

The config file configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml contains settings for training the model. We note that we allow a maximum image size of 3000 pixels. Most images should remain about the same size during training and segmentation.

5. Segment Images

Run the following command on a directory of images:

python demo/practice_dir.py configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml <input imgs dir> <output images dir>

6. Train Handwriting Recognition Model

We use just the handwriting recognition (HWR) module from the Start-Follow-Read (SFR) code.

7. Perform Handwriting Recognition

Run the following command on the directory of segmented images output from the last step:

python hw_pred.py sample_config_iam_hwr.yaml <segmented images dir>

Results will be output to stdout.
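Since the predictions only go to stdout, it may be convenient to save them to a file. This is a minimal sketch, assuming hw_pred.py is run from the directory that contains it; run_and_save and hwr_command are hypothetical helper names, and nothing here depends on the format of the printed results.

```python
import subprocess
import sys


def hwr_command(segmented_dir, config="sample_config_iam_hwr.yaml"):
    """Build the recognition command, using the current Python interpreter."""
    return [sys.executable, "hw_pred.py", config, str(segmented_dir)]


def run_and_save(cmd, out_path):
    """Run `cmd`, write its stdout to `out_path`, and return the text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return result.stdout
```

For example, run_and_save(hwr_command("segmented/"), "predictions.txt") would keep a copy of the output for the post-processing step; both paths are placeholders.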

8. Post-processing

Spell checker/text correction/normalization techniques needed
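One possible starting point, not something the pipeline currently does, is to snap each predicted word to the closest entry in a vocabulary of expected terms. The sketch below uses Python's standard difflib for the matching; the vocabulary is a hypothetical placeholder (a real one would be built from the transcriptions), and the 0.8 cutoff is an untuned assumption.

```python
import difflib

# Hypothetical vocabulary; a real one would come from the transcriptions.
VOCABULARY = ["pneumonia", "tuberculosis", "influenza", "nephritis"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace `word` with its closest vocabulary entry, if close enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # Words with no sufficiently close match are returned unchanged.
    return matches[0] if matches else word


def correct_line(line, vocabulary=VOCABULARY):
    """Apply `correct_word` to each whitespace-separated token."""
    return " ".join(correct_word(tok, vocabulary) for tok in line.split())
```

Corrected words come back in the vocabulary's casing, so this also acts as a crude normalization step.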

Deprecated

The following information is kept for historical reasons, as a record of previous attempts at automated document processing of the Ohio death records.

This document contains instructions and insights into the machine learning pipeline used to automatically process the Ohio death records. This includes tasks such as:

1. Scraping data from Family Search
2. Data labeling
3. Segmentation
4. Handwriting recognition
5. Cause of death to ICD code

Our approach is to use machine learning to automatically extract regions of interest from the death records for use in text (machine-printed and handwritten) recognition. First a labeled training dataset is created, then a model is trained. Unlabeled images can then be processed with the trained model. After regions are extracted, we use text recognition trained on handmade transcriptions to automatically transcribe the cause of death. Next, the transcribed cause of death is mapped to an ICD code to bin the different causes of death.

Scraping Data

We have scraped many thousands of death records from Family Search’s website while we wait for Family Search to provide the full set. This is accomplished using Selenium and 34,530 death record IDs that were collected by hand from Family Search’s website. The Selenium script creates a session with Family Search and attempts to download the image for each death record ID. Family Search blocks access after an unknown number of queries, and the block is released after about an hour. If the script is blocked during a download, it sleeps for 60 minutes, creates a new session, and begins downloading again.
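The block-and-retry behaviour described above can be sketched as follows. This is not the actual Selenium script: fetch and new_session are hypothetical callables standing in for the browser automation, and Blocked is a hypothetical exception that fetch would raise when Family Search refuses a request.

```python
import time


class Blocked(Exception):
    """Raised by the (hypothetical) fetch function when access is blocked."""


def download_all(record_ids, fetch, new_session, wait_seconds=60 * 60, sleep=time.sleep):
    """Fetch every record, pausing and opening a new session when blocked."""
    session = new_session()
    images = {}
    for record_id in record_ids:
        while True:
            try:
                images[record_id] = fetch(session, record_id)
                break
            except Blocked:
                sleep(wait_seconds)      # wait out the block (about an hour)
                session = new_session()  # then retry with a fresh session
    return images
```

The sleep function is passed in so the loop can be exercised without waiting. Note that this sketch retries indefinitely if a record stays blocked.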

Data Labeling

Labeling death record forms is accomplished by using the application labelme. It can be downloaded at:

Data Formatting

Once data is labeled it must be converted from the labelme format to the commonly used COCO dataset format. labelme provides a conversion script but it does not work for various reasons outlined below. In the meantime, a custom script has been written which accomplishes the task:

An example of this command is:

python labelme2coco_nocrowd.py --labels labels.txt ohio_death_images_combined/ ohio_coco

The custom script is necessary because the Mask-RCNN implementation that we use (maskrcnn-benchmark) requires segmentation information to be recorded in polygon format for training. Segmentation data stored as a run-length encoding (RLE) cannot currently be used for training. The conversion script supplied with labelme uses RLE; the custom script uses polygons.
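To make the polygon format concrete, the sketch below turns one labelme-style polygon (a list of [x, y] points) into a COCO-style annotation with a polygon segmentation. It is an illustration of the target format under the standard COCO field layout, not the custom conversion script itself; the function names and the ID arguments are hypothetical.

```python
def polygon_area(points):
    """Shoelace formula for a simple polygon given as [[x, y], ...]."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def to_coco_annotation(points, category_id, image_id, ann_id):
    """Convert one polygon into a COCO-style annotation dict."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "iscrowd": 0,
        # Polygon format: a list of flat [x1, y1, x2, y2, ...] lists.
        "segmentation": [[coord for point in points for coord in point]],
        # COCO boxes are [x, y, width, height].
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": polygon_area(points),
    }
```

An RLE annotation would instead store the segmentation as a dict of run lengths, which is the form the training code cannot consume.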

Segmentation

We find maskrcnn-benchmark to be flexible and powerful. It trains quicker than Detectron and is more flexible than Matterport’s TensorFlow Mask-RCNN implementation.

In order to use this network, your data should be formatted into the COCO dataset format, with configuration information noted in the repository’s catalog script here:

A configuration script should also be generated for training. We suggest using the ResNet-50 config file that is provided:

Training

We have found that the ResNet-50 network trains quickly and has low memory usage while providing excellent results. An example training command is:

python tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x_death.yaml SOLVER.IMS_PER_BATCH 4 SOLVER.MAX_ITER 10000 TEST.IMS_PER_BATCH 4

Inference/Segment Extraction

The inference process is accomplished using maskrcnn-benchmark’s prediction code snippet in their README.md

Preliminary Segmentation Results

~1 hour of training, 294 training images (20190227).

With minimal training data we are able to achieve good baseline results. Greater diversity in death record formats would improve results significantly. Below are results from images that the segmenter has never seen.

Handwriting Recognition

Handwriting recognition is accomplished using Start-Follow-Read (SFR) by Curtis Wigington. SFR is composed of three components: a start-of-line (SOL) detector, a line follower (LF), and a handwriting recognition (HWR) module.

Start-of-Line Detection

We replace the provided SOL detector with our own based on MaskRCNN. We originally tried using the provided SOL detector but found that it had difficulties identifying the lines of text we were interested in. We believe that this is due to it being based on the shallow VGG11 network architecture which works when identifying every line of text but does not capture enough context to identify specific lines of text.

We use MaskRCNN with a ResNet-50 backbone and find that this much deeper network is able to identify the lines of interest very well. The MaskRCNN model is the same model that was used to identify the ‘Medical Certificate of Death’ (MCD) region mentioned previously. In this approach, we process the entire image (without excising specific portions of it) to retain contextual information.

When labeling images, we label the MCD and then highlight any line of text that we’re interested in. MaskRCNN performs instance segmentation to identify all SOLs of text that we’re interested in.

Line Follower

We use the built-in LF module that was trained on the ICDAR2017 READ dataset. This module performs adequately but would benefit from additional fine-tuning.

Handwriting Recognition

We use the built-in HWR module that was trained on the ICDAR2017 READ dataset. Performance is quite poor because it is trained using a German dataset. We believe that providing an English language model may be enough to correct the predicted text. If necessary, we will retrain the HWR model using the transcriptions provided from the COD records.

Language Model

A language model is necessary for quality text recognition. SFR’s pretrained model was trained on German data. We use INSERT CORPUS as our corpus to build a language model for English that is specialized for medical vocabulary.
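As a rough illustration of how a language model can help choose between candidate transcriptions, the sketch below builds an add-one smoothed word bigram model and picks the most likely candidate. It is a toy stand-in, not the language model used with SFR; the function names are illustrative, and any corpus passed in is the caller's assumption.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


def best_candidate(candidates, unigrams, bigrams):
    """Pick the candidate transcription the model finds most likely."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))
```

Trained on a handful of cause-of-death phrases, such a model would prefer a candidate made of word pairs it has seen over one containing an unseen misspelling.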

Nuances of Training SFR

The sample config files provided by the SFR repo for building the run environment and for running the actual models may not be optimized for your setup. I modified the environment to use pytorch==0.3.1 (the provided config file has an error here). Updating to the most current version of opencv is also recommended to avoid a bug in the version specified by the repo. I also modified the run config by increasing the batch size, and changed the dataloader objects in the training scripts to use 16 worker threads. This reduced training time from 800+ seconds to ~10 seconds.

Eventually, we updated the code to work with pytorch==1.0 so that we could use it in conjunction with the MaskRCNN library we have.

Current Results

The MaskRCNN model was trained using 625 images for about 2 hours.

Cause of Death to ICD Code

Labeled training for this data comes from

Questions/Future Approaches

1. Do we need to segment the death record first to get the “medical certificate of death”?
   a) It is unclear how much segmenting the image first actually helps
2. How should the images be normalized?
   a) There is a great deal of noise in the images. This may be caused by wear to the microfiche they were stored on. There may be a way to normalize the images in a way that helps to reduce these issues.
3. Could we use active learning to reduce the number of labeled samples required for processing?
4. Start-Follow-Read could be enhanced with a more powerful underlying network
   a) Using ResNet instead of VGG may provide benefits to SFR and could increase the performance of the start-of-line finder when only certain lines are of interest rather than all lines