Deep Learning tool to correct text extracted from a document with OCR.
Final project for BYU LING 581 "NLP".
This code was developed using Python 3.8.10 and may not work correctly on other versions. The scripts are intended to be OS-agnostic, but were developed on Ubuntu 20.04.3 LTS.
Commands listed in this document use bash syntax. Adjustments may be needed to use other command shells.
All scripts are intended to be run with this directory (the repository root directory) as the working directory, but with the `src` directory as the Python path. For example, to run `src/train.py`, use this command:

```bash
PYTHONPATH=src python src/train.py
```

Or, to avoid prefixing `PYTHONPATH=src` to every command, set it once for your session:

```bash
export PYTHONPATH=src
```
Python packages that need to be installed are listed in `requirements.txt` and can be installed with this command:

```bash
pip install -r requirements.txt
```
The following is an outline of the expected workflow.
For help with the usage of any script, call the script with the `-h` flag:

```bash
python my_script.py -h
```
- **XML to Plain Text:** `src/corpus/serbian/to_plain_text.py`
  Convert the Serbian Corpus (srWaC1.1) from XML format to plain text.
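  The general shape of this conversion, assuming well-formed XML with sentence text inside `<s>` elements (the actual srWaC1.1 markup, file names, and the script's behavior may differ), is roughly:

  ```python
  import xml.etree.ElementTree as ET

  # Hypothetical sketch: stream sentence text out of a large XML corpus file.
  with open("corpus.txt", "w", encoding="utf-8") as out:
      for event, elem in ET.iterparse("srWaC1.1.xml", events=("end",)):
          if elem.tag == "s":
              text = "".join(elem.itertext()).strip()
              if text:
                  out.write(text + "\n")
              elem.clear()  # release the element to keep memory use flat
  ```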
- **Collect "Vocabulary":** `src/corpus/all_chars.py`
  Read the plain-text corpus and collect a set of all characters present, then print one copy of each character to a simple text file named `all_chars.txt`.
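  The core idea is a single pass over the corpus, sketched here with a hypothetical corpus path (the actual script's interface may differ):

  ```python
  # Sketch: collect every distinct character in the corpus.
  chars = set()
  with open("corpus.txt", encoding="utf-8") as f:  # hypothetical path
      for line in f:
          chars.update(line.rstrip("\n"))

  # Write one copy of each character per line.
  with open("all_chars.txt", "w", encoding="utf-8") as out:
      for ch in sorted(chars):
          out.write(ch + "\n")
  ```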
- **Create List of "good characters":** done manually
  Adjacent to the `all_chars.txt` file generated by the `all_chars.py` script, manually create a simple text file named `good_chars.txt` that contains the "standard" characters used in the dataset. This should be a subset of the characters from `all_chars.txt`.
- **Create a Messy Corpus:** `src/corpus/make_messy_dataset.py`
  Read the plain-text corpus and create a "messy"/"mutilated" version by randomly editing some characters. Note that this depends on the `good_chars.txt` file from the previous step.
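  A minimal sketch of the random-editing idea, assuming three edit types (deletion, substitution, insertion) drawn with equal probability; the actual script's edit types and rates are not specified here:

  ```python
  import random

  # The "good" characters serve as the pool of replacement characters.
  with open("good_chars.txt", encoding="utf-8") as f:
      good_chars = [line.rstrip("\n") for line in f if line.rstrip("\n")]

  def mutilate(text, edit_prob=0.05):
      """Randomly delete, substitute, or insert characters (sketch only)."""
      out = []
      for ch in text:
          r = random.random()
          if r < edit_prob / 3:
              continue  # deletion: drop this character
          elif r < 2 * edit_prob / 3:
              out.append(random.choice(good_chars))  # substitution
          elif r < edit_prob:
              out.append(ch)
              out.append(random.choice(good_chars))  # insertion
          else:
              out.append(ch)  # leave unchanged
      return "".join(out)
  ```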
- **Index and Split Corpus:** `src/corpus/make_split_csv.py`
  Create a CSV file with byte indices (for use with `seek`) for the start of each line of the plain-text corpus and for the corresponding line of the messy corpus. This script also decides and records which dataset split (train, validation, test) each line belongs to.
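  The byte-index idea, sketched with hypothetical file names and split ratios (the real CSV columns may differ): record each line's starting offset while scanning, so any line can later be fetched with `seek` without reading the whole file.

  ```python
  import csv
  import random

  # Sketch: index both corpora by byte offset and assign each line to a split.
  with open("clean.txt", "rb") as clean, open("messy.txt", "rb") as messy, \
          open("split.csv", "w", newline="") as out:
      writer = csv.writer(out)
      writer.writerow(["clean_offset", "messy_offset", "split"])
      while True:
          clean_off, messy_off = clean.tell(), messy.tell()
          if not clean.readline():
              break  # end of corpus
          messy.readline()  # keep the two files in lock-step
          split = random.choices(["train", "val", "test"], weights=[0.8, 0.1, 0.1])[0]
          writer.writerow([clean_off, messy_off, split])
  ```

  Reading a line back later is then just `f.seek(offset)` followed by `f.readline()`.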
- **Evaluate Control:** `src/baselines/null_corrector.py`
  As a control for other tests, evaluate using a "null corrector", i.e. with no corrections.
- **Train a Baseline:** `src/baselines/dictionary_corrector.py`
  Train a basic dictionary-based corrector that corrects unknown words by replacing them with the nearest known word ("nearest" measured by Damerau-Levenshtein distance). Because this algorithm for correcting sentences is very slow and scales with the vocabulary size, evaluation can take a very long time.
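  A sketch of the correction rule; the real script's distance implementation may differ (this uses the common optimal-string-alignment variant of Damerau-Levenshtein):

  ```python
  def osa_distance(a, b):
      """Levenshtein edits plus adjacent transpositions (optimal string alignment)."""
      d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i in range(len(a) + 1):
          d[i][0] = i
      for j in range(len(b) + 1):
          d[0][j] = j
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,         # deletion
                            d[i][j - 1] + 1,         # insertion
                            d[i - 1][j - 1] + cost)  # substitution
              if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                  d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
      return d[len(a)][len(b)]

  def correct_word(word, vocab):
      # Unknown words are replaced with the nearest known word; this linear
      # scan over the whole vocabulary is why evaluation is slow.
      if word in vocab:
          return word
      return min(vocab, key=lambda known: osa_distance(word, known))
  ```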
- **Find Hyperparameters:** `src/model/tune_hyperparameters.py`
  Run a search across combinations of hyperparameters to find a good setup before committing to a full training run. Save the best hyperparameter configuration to a `hyperparameters.json` file to be loaded later.
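  The overall shape of such a search, sketched as a plain grid search (the search space, strategy, and metric here are assumptions, and `train_and_validate` is a hypothetical helper standing in for a short training run):

  ```python
  import itertools
  import json

  # Hypothetical search space; the real one may differ.
  grid = {
      "hidden_size": [128, 256],
      "num_layers": [1, 2],
      "dropout": [0.0, 0.2],
  }

  best_loss, best_config = float("inf"), None
  for values in itertools.product(*grid.values()):
      config = dict(zip(grid.keys(), values))
      loss = train_and_validate(config)  # hypothetical: returns validation loss
      if loss < best_loss:
          best_loss, best_config = loss, config

  # Persist the winner for train.py to load later.
  with open("hyperparameters.json", "w") as f:
      json.dump(best_config, f, indent=2)
  ```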
- **Train the Model:** `src/model/train.py`
  Initialize a model with the hyperparameters saved in the `hyperparameters.json` file, and train the model. Save TensorBoard logs and checkpoints while training. Note that batch size and learning rate are selected automatically in this script, not in `tune_hyperparameters.py`.
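  If the training loop is built on PyTorch Lightning (an assumption; the script's framework is not stated here), automatic batch-size and learning-rate selection could look like this in Lightning 1.x:

  ```python
  import pytorch_lightning as pl

  # Assumption: MyCorrectorModel is a hypothetical LightningModule exposing
  # `batch_size` and `learning_rate` attributes for the tuner to adjust.
  model = MyCorrectorModel(**hyperparameters)

  trainer = pl.Trainer(
      auto_scale_batch_size="binsearch",  # find the largest batch size that fits in memory
      auto_lr_find=True,                  # run the learning-rate range test
  )
  trainer.tune(model)  # sets model.batch_size and model.learning_rate
  trainer.fit(model)
  ```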
TODO