Deep Learning tool to correct text extracted from a document with OCR.
Final project for BYU LING 581 "NLP".
This code was developed using Python 3.8.10 and may not work correctly on other versions. The scripts are intended to be OS-agnostic, but were developed on Ubuntu 20.04.3 LTS.
Commands listed in this document use bash syntax. Adjustments may be needed to use other command shells.
All scripts are intended to be run with this directory (the repository root directory) as the working directory, but with the `src` directory as the Python path. For example, to run `src/train.py`, use this command:

```bash
PYTHONPATH=src python src/train.py
```

Or, to avoid prefixing `PYTHONPATH=src` to every command, set it once for your session:

```bash
export PYTHONPATH=src
```
Python packages that need to be installed are listed in `requirements.txt` and can be installed with this command:

```bash
pip install -r requirements.txt
```
The following is an outline of the expected workflow.
For help with the usage of any script, call the script with the `-h` flag:

```bash
python my_script.py -h
```
- **XML to Plain Text:** `src/corpus/serbian/to_plain_text.py`
  Convert the Serbian Corpus (srWaC1.1) from XML format to plain text.
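  The general shape of this conversion, assuming well-formed XML with sentence text inside `<s>` elements (the actual srWaC1.1 markup, file names, and the script's behavior may differ), is roughly:

  ```python
  import xml.etree.ElementTree as ET

  # Hypothetical sketch: stream sentence text out of a large XML corpus file.
  with open("corpus.txt", "w", encoding="utf-8") as out:
      for event, elem in ET.iterparse("srWaC1.1.xml", events=("end",)):
          if elem.tag == "s":
              text = "".join(elem.itertext()).strip()
              if text:
                  out.write(text + "\n")
              elem.clear()  # release the element to keep memory use flat
  ```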
- **Collect "Vocabulary":** `src/corpus/all_chars.py`
  Read the plain-text corpus and collect a set of all characters present, then print one copy of each character to a simple text file named `all_chars.txt`.
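  The core idea is a single pass over the corpus, sketched here with a hypothetical corpus path (the actual script's interface may differ):

  ```python
  # Sketch: collect every distinct character in the corpus.
  chars = set()
  with open("corpus.txt", encoding="utf-8") as f:  # hypothetical path
      for line in f:
          chars.update(line.rstrip("\n"))

  # Write one copy of each character per line.
  with open("all_chars.txt", "w", encoding="utf-8") as out:
      for ch in sorted(chars):
          out.write(ch + "\n")
  ```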
- **Create List of "good characters":** done manually
  Adjacent to the `all_chars.txt` file generated by the `all_chars.py` script, manually create a simple text file named `good_chars.txt` that contains the "standard" characters used in the dataset. This should be a subset of the characters from `all_chars.txt`.
- **Create a Messy Corpus:** `src/corpus/make_messy_dataset.py`
  Read the plain-text corpus and create a "messy"/"mutilated" version by randomly editing some characters. Note that this depends on the `good_chars.txt` file from the previous step.
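  A minimal sketch of the random-editing idea, assuming three edit types (deletion, substitution, insertion) drawn with equal probability; the actual script's edit types and rates are not specified here:

  ```python
  import random

  # The "good" characters serve as the pool of replacement characters.
  with open("good_chars.txt", encoding="utf-8") as f:
      good_chars = [line.rstrip("\n") for line in f if line.rstrip("\n")]

  def mutilate(text, edit_prob=0.05):
      """Randomly delete, substitute, or insert characters (sketch only)."""
      out = []
      for ch in text:
          r = random.random()
          if r < edit_prob / 3:
              continue  # deletion: drop this character
          elif r < 2 * edit_prob / 3:
              out.append(random.choice(good_chars))  # substitution
          elif r < edit_prob:
              out.append(ch)
              out.append(random.choice(good_chars))  # insertion
          else:
              out.append(ch)  # leave unchanged
      return "".join(out)
  ```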
- **Index and Split Corpus:** `src/corpus/make_split_csv.py`
  Create a CSV file with byte indices (for use with `seek`) for the start of each line of the plain-text corpus and for the corresponding line of the messy corpus. This script also decides and records which dataset split (train, validation, test) each line belongs to.
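  The byte-index idea, sketched with hypothetical file names and split ratios (the real CSV columns may differ): record each line's starting offset while scanning, so any line can later be fetched with `seek` without reading the whole file.

  ```python
  import csv
  import random

  # Sketch: index both corpora by byte offset and assign each line to a split.
  with open("clean.txt", "rb") as clean, open("messy.txt", "rb") as messy, \
          open("split.csv", "w", newline="") as out:
      writer = csv.writer(out)
      writer.writerow(["clean_offset", "messy_offset", "split"])
      while True:
          clean_off, messy_off = clean.tell(), messy.tell()
          if not clean.readline():
              break  # end of corpus
          messy.readline()  # keep the two files in lock-step
          split = random.choices(["train", "val", "test"], weights=[0.8, 0.1, 0.1])[0]
          writer.writerow([clean_off, messy_off, split])
  ```

  Reading a line back later is then just `f.seek(offset)` followed by `f.readline()`.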
- **Evaluate Control:** `src/baselines/null_corrector.py`
  As a control for other tests, evaluate using a "null corrector", i.e. with no corrections.
- **Train a Baseline:** `src/baselines/dictionary_corrector.py`
  Train a basic dictionary-based corrector that corrects unknown words by replacing them with the nearest known word ("nearest" measured by Damerau-Levenshtein distance). Because this algorithm for correcting sentences is very slow and scales with the vocabulary size, evaluation can take a very long time.
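  A sketch of the correction rule; the real script's distance implementation may differ (this uses the common optimal-string-alignment variant of Damerau-Levenshtein):

  ```python
  def osa_distance(a, b):
      """Levenshtein edits plus adjacent transpositions (optimal string alignment)."""
      d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i in range(len(a) + 1):
          d[i][0] = i
      for j in range(len(b) + 1):
          d[0][j] = j
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,         # deletion
                            d[i][j - 1] + 1,         # insertion
                            d[i - 1][j - 1] + cost)  # substitution
              if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                  d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
      return d[len(a)][len(b)]

  def correct_word(word, vocab):
      # Unknown words are replaced with the nearest known word; this linear
      # scan over the whole vocabulary is why evaluation is slow.
      if word in vocab:
          return word
      return min(vocab, key=lambda known: osa_distance(word, known))
  ```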
- **Find Hyperparameters:** `src/model/tune_hyperparameters.py`
  Run a search across combinations of hyperparameters to find a good setup before committing to a full training run. Save the best hyperparameter configuration to a `hyperparameters.json` file to be loaded later.
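  The overall shape of such a search, sketched as a plain grid search (the search space, strategy, and metric here are assumptions, and `train_and_validate` is a hypothetical helper standing in for a short training run):

  ```python
  import itertools
  import json

  # Hypothetical search space; the real one may differ.
  grid = {
      "hidden_size": [128, 256],
      "num_layers": [1, 2],
      "dropout": [0.0, 0.2],
  }

  best_loss, best_config = float("inf"), None
  for values in itertools.product(*grid.values()):
      config = dict(zip(grid.keys(), values))
      loss = train_and_validate(config)  # hypothetical: returns validation loss
      if loss < best_loss:
          best_loss, best_config = loss, config

  # Persist the winner for train.py to load later.
  with open("hyperparameters.json", "w") as f:
      json.dump(best_config, f, indent=2)
  ```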
- **Train the Model:** `src/model/train.py`
  Initialize a model with the hyperparameters saved in the `hyperparameters.json` file, and train the model. Save TensorBoard logs and checkpoints while training. Note that batch size and learning rate are selected automatically in this script, not in `tune_hyperparameters.py`.
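  If the training loop is built on PyTorch Lightning (an assumption; the script's framework is not stated here), automatic batch-size and learning-rate selection could look like this in Lightning 1.x:

  ```python
  import pytorch_lightning as pl

  # Assumption: MyCorrectorModel is a hypothetical LightningModule exposing
  # `batch_size` and `learning_rate` attributes for the tuner to adjust.
  model = MyCorrectorModel(**hyperparameters)

  trainer = pl.Trainer(
      auto_scale_batch_size="binsearch",  # find the largest batch size that fits in memory
      auto_lr_find=True,                  # run the learning-rate range test
  )
  trainer.tune(model)  # sets model.batch_size and model.learning_rate
  trainer.fit(model)
  ```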
TODO