This repository contains the scripts and models to reproduce the results of the LREC 2022 paper Automatic Normalisation of Early Modern French. See below for citation instructions.
As well as the models trained in the paper (see below for instructions on how to use and retrain them), we also distribute a HuggingFace-compatible model (rbawden/modern_french_normalisation on the HuggingFace Hub). It is a transformer model (equivalent to the one trained in the paper), ported to HuggingFace, fine-tuned, and with more rigorous post-processing (which can be disabled for faster normalisation).
You can use it within your code as follows (if you have transformers>=4.21.0).
from transformers import pipeline
normaliser = pipeline(model="rbawden/modern_french_normalisation", batch_size=32, beam_size=5, cache_file="./cache.pickle", trust_remote_code=True)
list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = normaliser(list_inputs)
print(list_outputs)
>> [{'text': 'Elle haïssait particulièrement le Cardinal de Lorraine;', 'alignment': [([0, 4], [0, 4]), ([4, 5], [4, 5]), ([5, 13], [5, 13]), ([13, 14], [13, 14]), ([14, 30], [14, 30]), ([30, 31], [30, 31]), ([31, 33], [31, 33]), ([33, 34], [33, 34]), ([34, 42], [34, 42]), ([42, 43], [42, 43]), ([43, 45], [43, 45]), ([45, 46], [45, 46]), ([46, 54], [46, 54]), ([54, 55], [54, 55])]}, {'text': "Adieu, j'irai chez vous tantôt vous rendre grâce.", 'alignment': [([0, 5], [0, 5]), ([5, 6], [5, 6]), ([6, 7], [6, 7]), ([7, 9], [7, 9]), ([9, 13], [9, 13]), ([13, 14], [13, 14]), ([14, 18], [14, 18]), ([18, 19], [18, 19]), ([19, 23], [19, 23]), ([23, 24], [23, 24]), ([24, 31], [24, 30]), ([31, 32], [30, 31]), ([32, 36], [31, 35]), ([36, 37], [35, 36]), ([37, 43], [36, 42]), ([43, 44], [42, 43]), ([44, 49], [43, 48]), ([49, 50], [48, 49])]}]
The alignment represents pairs of input-prediction text spans (i.e. which span of the input sentence aligns with which span of the prediction). The indices are spans from one inter-character position to another, e.g. `[0, 4]` indicates a span from position 0 to position 4 (e.g. `Elle` in the first example).
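For example, here is a minimal sketch (reusing the `list_inputs` and `list_outputs` variables from the snippet above) of how the alignment can be used to map each input span to the normalised span it produced:

```python
# Minimal sketch: print each input span next to the normalised span it is aligned to,
# using the output format shown above.
for source, output in zip(list_inputs, list_outputs):
    for (src_start, src_end), (pred_start, pred_end) in output["alignment"]:
        print(repr(source[src_start:src_end]), "->", repr(output["text"][pred_start:pred_end]))
```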
To disable postprocessing (faster but lower-quality normalisation), set the arguments `no_postproc_lex` and `no_post_clean` to True when instantiating the pipeline:
normaliser = pipeline(model="rbawden/modern_french_normalisation", no_postproc_lex=True, no_post_clean=True, batch_size=32, beam_size=5, cache_file="./cache.pickle", trust_remote_code=True)
To use the model on the command line, call the `pipeline.py` file after you have downloaded it locally:
cat INPUT_FILE | python hf-conversion/pipeline.py -k BATCH_SIZE -b BEAM_SIZE > OUTPUT_FILE
Results for this model are shown in the results table below. It performs similarly to the statistical model. The additional postprocessing avoids the hallucinations that can occur with neural models, and improved postprocessing with the lexicon is also included. These postprocessing steps can be deactivated for faster normalisation, but with reduced quality.
- Python 3.7 and the requirements specified in `requirements.txt`
- KenLM (to train language models for SMT)
- Moses (for training and decoding with SMT models)
python3 -m venv modfr_env
source modfr_env/bin/activate
pip install -r requirements.txt
Get dataset splits:
bash data-scripts/get_datasets.sh
The final raw files are found in `data/raw/{train,dev,test}/{train,dev,test}.finalised.{src,trg,meta}`.
What does this do?
- splitting the files into train/dev/test
- filtering out sentences from the train and dev sets that also appear in the test set and contain over 4 tokens
- normalisation of quotes, apostrophes and repeated spaces
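As an illustration of the last step, here is a minimal sketch of this kind of character normalisation (the quote and apostrophe mappings are illustrative assumptions, not the exact rules applied by the data scripts):

```python
import re

def normalise_chars(text):
    """Minimal sketch of quote/apostrophe/space normalisation (assumed mappings,
    not the exact rules applied by data-scripts/get_datasets.sh)."""
    text = text.replace("“", '"').replace("”", '"')  # curly double quotes -> straight quotes
    text = text.replace("’", "'").replace("‘", "'")  # curly apostrophes/single quotes -> straight
    text = re.sub(r" {2,}", " ", text)               # collapse repeated spaces
    return text

print(normalise_chars("Adieu,  i’iray chez vous"))  # Adieu, i'iray chez vous
```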
Subsets of dev and test
Subsets of dev/test are available in the same subfolders (different data selection scenarios that could be used for separate analysis).
- 1-standard: "belles-lettres" sentences taken from the same distribution as train (80%/10%/10% train/dev/test)
- 2-test ("zero-shot"): selected texts, distributed across periods and genres (0%/0%/100% train/dev/test)
- 3-test+train ("few-shot"): selected texts, distributed across periods and genres (10%/0%/90% train/dev/test)
- 4-medecine (medical domain): one document in dev and the other in test (two very different documents; none in train)
- 5-physique (physics/mechanics domain): one document in dev and the other in test (none in train)
Get monolingual normalised data:
TODO download data
python data-scripts/get_monolingual_normalised.py <txt_folder> <toc_folder>
bash data-scripts/process_monolingual.sh # to be updated
bash data-scripts/download_models.sh
Below you can find normalisation commands for each of the methods compared. All methods take text from standard input and output normalised text to standard output. Here, the dev set (`data/raw/dev/dev.finalised.src`) is used as an example.
To use MT approaches, you must first download the models to the main directory:
cd ModFr-Norm
wget http://almanach.inria.fr/files/modfr_norm/mt-models.tar.gz
tar -xzvf mt-models.tar.gz
Find below the different commands used for each of the approaches:
# Rule-based
cat data/raw/dev/dev.finalised.src | \
bash norm-scripts/rule-based.sh \
> outputs/rule-based/dev-1.trg
# SMT: bash norm-scripts/smt_translate.sh <model_folder>
cat data/raw/dev/dev.finalised.src | \
bash norm-scripts/smt_translate.sh final-mt-models/smt/1/model \
> outputs/smt/dev/dev-1.trg
# NMT (LSTM): bash norm-scripts/nmt_translate.sh <model_path>
cat data/raw/dev/dev.finalised.src | \
bash norm-scripts/nmt_translate.sh final-mt-models/lstm/1/checkpoint_bestwordacc_sym.pt \
> outputs/lstm/dev/dev-1.trg
# NMT (Transformer): bash norm-scripts/nmt_translate.sh <model_path>
cat data/raw/dev/dev.finalised.src | \
bash norm-scripts/nmt_translate.sh final-mt-models/transformer/1/checkpoint_bestwordacc_sym.pt \
> outputs/transformer/dev/dev-1.trg
# Post-processing using the contemporary French lexicon, the Le*fff* (Sagot, 2009)
# Can be applied after any of the other approaches
cat outputs/rule-based/dev-1.pred.trg | \
bash norm-scripts/lex-postproc.sh \
> outputs/rule-based+lex/dev-1.pred.trg
For ABA, the alignment-based approach, see the GitHub repository (https://github.com/johnseazer/aba).
bash eval-scripts/bleu.sh <ref_file> <pred_file> fr
bash eval-scripts/chrf.sh <ref_file> <pred_file> fr
python eval-scripts/levenshtein.py <ref_file> <pred_file> -a {ref,pred} (-c <cache_file>)
python eval-scripts/word_acc.py <ref_file> <pred_file> -a {ref,pred,both} (-c <cache_file>)
python eval-scripts/word_acc_oov.py <ref_file> <pred_file> <trg_train_file> -a ref (-c <cache_file>)
where `-a ref` means that the reference is used as the basis for the alignment, `-a pred` that the prediction is used as the basis for the alignment, and `-a both` that the average of the two is calculated. An optional cache file destination (`.pickle` format) can be specified to speed up evaluation when running it several times.
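For example, to compute symmetrised word accuracy for the rule-based dev output used elsewhere in this README (the paths are illustrative):

python eval-scripts/word_acc.py data/raw/dev/dev.finalised.trg outputs/rule-based/dev/dev-1.trg -a both -c outputs/.cache-dev.pickle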
To calculate the average of a metric over several outputs (relevant for different random seeds of the MT approaches):
bash eval-scripts/eval_all.sh <output_folder> <ref_file> (<cache_file>)
where `output_folder` is the folder containing the prediction files to be included in the evaluation (all files ending in `.trg` will be included). E.g.
bash eval-scripts/eval_all.sh outputs/rule-based/dev data/raw/dev/dev.finalised.trg outputs/.cache-dev.pickle
WordAcc (ref) | WordAcc (sym) | WordAcc OOV (ref) | Levenshtein | BLEU | ChrF
---|---|---|---|---|---
89.80 | 89.83 | 65.48 | 2.88 | 74.26 | 90.54
To calculate all evaluation scores, including on subsets of the data (as specified above and in the meta data):
bash eval-scripts/eval_detailed.sh <ref_file> <meta_file> <pred_file> (<cache_file>)
E.g.
bash eval-scripts/eval_detailed.sh data/raw/dev/dev.finalised.trg data/raw/dev/dev.finalised.meta outputs/rule-based/dev/dev-1.trg outputs/.cache-dev.pickle
>> all,bleu=74.2593 all,chrf=90.54 all,lev_char=2.88061409315046 all,wordacc_r2h=89.83100878163974 all,wordacc_h2r=89.76316112406431 all,wordacc_sym=89.79708495285203 1-standard,bleu=73.3256 1-standard,chrf=90.14 1-standard,lev_char=3.003617425214223 1-standard,wordacc_r2h=89.44745130416169 1-standard,wordacc_h2r=89.36994660564454 1-standard,wordacc_sym=89.40869895490312 4-medecine-dev,bleu=83.9066 4-medecine-dev,chrf=94.22 4-medecine-dev,lev_char=1.6903731189445474 4-medecine-dev,wordacc_r2h=93.50119088125213 4-medecine-dev,wordacc_h2r=93.41479972844536 4-medecine-dev,wordacc_sym=93.45799530484874 5-physique-dev,bleu=73.8570 5-physique-dev,chrf=90.85 5-physique-dev,lev_char=2.8046184081231575 5-physique-dev,wordacc_r2h=90.18160047894632 5-physique-dev,wordacc_h2r=90.18710191082803 5-physique-dev,wordacc_sym=90.18435119488717
where `r2h` means that the reference is used as the basis for the alignment, `h2r` that the hypothesis is used as the basis for the alignment, and `sym` that the mean of the two directions is calculated (e.g. in the output above, `all,wordacc_sym` is the mean of `all,wordacc_r2h` and `all,wordacc_h2r`).
Dev set results. The best results presented in the paper are shown in bold. The results for the HuggingFace model are in bold when they are similar to or surpass the best results.
Method | WordAcc (ref) | WordAcc (sym) | WordAcc (ref) OOV | Levenshtein | BLEU | ChrF |
---|---|---|---|---|---|---|
Identity | 73.92 | 73.95 | 47.91 | 7.72 | 42.33 | 74.95 |
Identity+lex | 86.75 | 86.78 | 70.32 | 3.57 | 68.08 | 88.01 |
Rule-based | 89.80 | 89.83 | 65.48 | 2.88 | 74.26 | 90.54 |
Rule-based+lex | 91.69 | 91.71 | 72.26 | 2.33 | 78.91 | 92.45 |
ABA | 95.72 | 95.76 | 75.17 | 1.21 | 89.19 | 96.38 |
ABA+lex | 96.07 | 96.11 | 78.92 | 1.06 | 89.89 | 96.73 |
SMT | 97.61±0.04 | 97.59±0.04 | 77.65±0.16 | 0.63±0.01 | 93.67±0.10 | 98.08±0.03 |
SMT+lex | 97.76±0.04 | 97.75±0.04 | 81.24±0.19 | 0.59±0.01 | 94.11±0.10 | 98.23±0.03 |
LSTM | 97.16±0.10 | 96.97±0.08 | 78.30±0.81 | 1.13±0.09 | 92.98±0.33 | 97.60±0.06 |
LSTM+lex | 97.30±0.14 | 97.11±0.11 | 81.08±0.09 | 1.10±0.09 | 93.36±0.40 | 97.73±0.08 |
Transformer | 96.79±0.05 | 96.58±0.07 | 76.78±0.71 | 1.26±0.04 | 92.17±0.06 | 97.27±0.05 |
Transformer+lex | 96.92±0.09 | 96.70±0.10 | 79.10±0.85 | 1.23±0.05 | 92.51±0.17 | 97.40±0.09 |
HuggingFace | 96.83 | 97.04 | 77.17 | 1.81 | 92.76 | 97.51 |
HuggingFace+lex | 97.53 | 97.56 | 84.25 | 0.68 | 93.36 | 98.10 |
HuggingFace+clean | 96.83 | 97.04 | 77.17 | 1.18 | 92.77 | 97.51 |
HuggingFace+lex+clean | 97.56 | 97.59 | 84.31 | 0.67 | 93.42 | 98.13 |
Test set results. The best results presented in the paper are shown in bold. The results for the HuggingFace model are in bold when they are similar to or surpass the best results.
Method | WordAcc (ref) | WordAcc (sym) | WordAcc (ref) OOV | Levenshtein | BLEU | ChrF |
---|---|---|---|---|---|---|
Identity | 72.72 | 72.73 | 43.00 | 8.15 | 40.25 | 73.77 |
Identity+lex | 86.12 | 86.12 | 64.84 | 3.78 | 66.78 | 87.40 |
Rule-based | 89.06 | 89.05 | 60.22 | 3.08 | 72.47 | 89.94 |
Rule-based+lex | 90.85 | 90.85 | 66.51 | 2.56 | 76.90 | 91.70 |
ABA | 95.13 | 95.14 | 69.50 | 1.35 | 87.70 | 95.84 |
ABA+lex | 95.44 | 95.44 | 73.54 | 1.25 | 88.37 | 96.13 |
SMT | 97.12±0.02 | 97.10±0.02 | 75.64±0.18 | 0.76±0.01 | 92.59±0.05 | 97.71±0.01 |
SMT+lex | 97.26±0.02 | 97.24±0.02 | 78.37±0.20 | 0.73±0.01 | 92.97±0.05 | 97.85±0.01 |
LSTM | 96.52±0.07 | 96.14±0.08 | 76.69±0.70 | 1.66±0.04 | 91.77±0.21 | 96.85±0.08 |
LSTM+lex | 96.63±0.08 | 96.25±0.10 | 78.35±0.79 | 1.64±0.05 | 92.07±0.25 | 96.95±0.10 |
Transformer | 96.27±0.05 | 95.89±0.07 | 75.73±0.38 | 1.81±0.01 | 91.30±0.08 | 96.65±0.05 |
Transformer+lex | 96.39±0.07 | 96.01±0.09 | 77.51±1.00 | 1.78±0.02 | 91.62±0.14 | 96.76±0.08 |
HuggingFace | 96.06 | 96.44 | 76.46 | 1.81 | 91.69 | 96.77 |
HuggingFace+lex | 96.95 | 96.98 | 82.57 | 0.87 | 92.15 | 97.60 |
HuggingFace+clean | 96.06 | 96.44 | 76.46 | 1.81 | 91.69 | 96.77 |
HuggingFace+lex+clean | 97.00 | 97.03 | 82.60 | 0.85 | 92.26 | 97.65 |
It can be useful to obtain an alignment between the predictions and either the source file or the reference file. To do this, we can use a command very similar to the evaluation scripts:
python eval-scripts/align_levenshtein.py <ref_or_src_file> <pred_file> -a {ref,pred} (-c <cache_file>)
python eval-scripts/align_levenshtein.py data/raw/dev/dev.finalised.trg outputs/smt+lex/dev/dev-1.trg -a ref -c outputs/.cache.pickle
The alignment script relies on a non-destructive tokenisation convention whereby a token boundary is marked by two spaces when the tokens are white-space separated in the raw input and by a single space when they are not. This means that the initial text is preserved, despite the tokenisation applied. The chosen tokenisation can be modified (in `eval-scripts/utils.py`). By default, we apply a very simple tokenisation, separating on whitespace and certain punctuation marks.
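For illustration, here is a minimal sketch of such a convention (the punctuation set and the `simple_tokenise` helper are assumptions; the real implementation lives in `eval-scripts/utils.py`):

```python
import re

def simple_tokenise(text):
    """Minimal sketch (assumed rules) of the non-destructive convention:
    original white-space boundaries become double spaces, boundaries
    introduced around punctuation become single spaces."""
    words = []
    for word in text.split(" "):                    # original white-space boundaries
        word = re.sub(r"([,.;!?])", r" \1 ", word)  # single-space boundaries around punctuation (assumed set)
        words.append(re.sub(r" {2,}", " ", word).strip())
    return "  ".join(words)                         # double space = original boundary

print(simple_tokenise("surtout, ses écrits"))  # "surtout ,  ses  écrits"
```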
Here is an example of the alignment output:
If the reference file contains the following (made-up) example sentence:
surtout j'ai choisi davantage ses écrits
and the predicted file contains the following sentence:
sur tout ji choisi d'avantage ses escrits
the alignment script will output:
surtout||||sur▁▁tout j'||||j░ ai||||i choisi davantage||||d'▁avantage ses écrits||||escrits
In this output, different cases arise:
- aligned token is identical: the token is written as it is (e.g. `choisi`);
- aligned token is different: the reference token is written first, followed by the separator `||||` and the corresponding predicted (sub)token(s). Tokenisation mismatches between the reference and prediction are marked on the predicted side as follows:
  - oversplitting: when there is a token boundary on the predicted side that does not correspond to a reference token boundary, it is marked using one or two consecutive `▁` symbols, depending on whether the predicted tokens are white-space separated or not (e.g. (1) `surtout||||sur▁▁tout`, where the two-token predicted sequence `sur tout` is aligned with the reference token `surtout`; and (2) `davantage||||d'▁avantage`, where the two-token predicted sequence `d'avantage`, which is tokenised as `d' avantage`, is aligned with the reference token `davantage`);
  - undersplitting: when there is no token boundary on the predicted side at a place where there is one on the reference side, the subtoken aligned with the first reference token is appended with the symbol `░` (e.g. `j'||||j░ ai||||i` means that the predicted token `ji` is aligned to both reference tokens `j'` and `ai`, the `░` allowing for the correct reconstruction of the single predicted token `ji` from the alignment script output).
This token-level alignment is produced based on a character-level alignment obtained using a dedicated variant of the weighted Levenshtein algorithm, designed to avoid tokenisation and punctuation mismatches unless they are really necessary for a successful alignment:
- by default, the cost of a substitution is 1, whereas the cost of an insertion or a deletion is 0.8;
- the cost of a substitution of a reference white-space character with a non-white-space is prohibitive (1,000,000);
- the cost of a substitution of a reference non-white-space character with a white-space is 30;
- the cost of a substitution involving a punctuation mark (within `,.;-!?'`) is 20;
- the cost of the deletion of a white-space character is prohibitive;
- the cost of the insertion of a white-space character is 2.
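To make these costs concrete, here is a minimal sketch of the cost scheme (for illustration only: the helper names are hypothetical and the actual weights live in the repository's alignment code):

```python
PUNCTUATION = set(",.;-!?'")
PROHIBITIVE = 1_000_000

def substitution_cost(ref_char, pred_char):
    """Sketch of the substitution costs listed above."""
    if ref_char == pred_char:
        return 0
    if ref_char == " ":                                  # reference white-space -> non-white-space
        return PROHIBITIVE
    if pred_char == " ":                                 # reference non-white-space -> white-space
        return 30
    if ref_char in PUNCTUATION or pred_char in PUNCTUATION:
        return 20
    return 1                                             # default substitution

def insertion_cost(char):
    return 2 if char == " " else 0.8                     # default insertion is 0.8

def deletion_cost(char):
    return PROHIBITIVE if char == " " else 0.8           # default deletion is 0.8
```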
To preprocess with all segmentations used in our experiments, run the following script:
bash data-scripts/process_for_mt.sh
This involves:
- preparation of data (+ meta information (decades and years))
- subword segmentation using sentencepiece for the following (joint) vocab sizes:
- char, 500, 1k, 2k, 4k, 8k, 16k, 24k
- binarisation of the data in the fairseq format (for neural models)
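As an illustration of the segmentation step, here is a minimal sentencepiece sketch for one vocabulary size (the options and output names are assumptions; the actual calls are made by data-scripts/process_for_mt.sh):

```python
import sentencepiece as spm

# Train a joint BPE model (vocab size 1000) on the source and target sides of the
# training data (assumed options, for illustration only).
spm.SentencePieceTrainer.train(
    input="data/raw/train/train.finalised.src,data/raw/train/train.finalised.trg",
    model_prefix="bpe_joint_1000",
    vocab_size=1000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_joint_1000.model")
print(sp.encode("Adieu, i'iray chez vous tantoſt", out_type=str))
```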
Train all *n*-gram language model combinations as follows:
bash mt-training-scripts/train_lms.sh
Make sure to change the tool paths in this file first to point to your installation of KenLM.
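As a rough illustration, training one KenLM language model looks like the following (the installation path, input file name and n-gram order here are assumptions; the actual invocations are in mt-training-scripts/train_lms.sh):

```bash
# assumed KenLM installation path, input file and 5-gram order
~/tools/kenlm/build/bin/lmplz -o 5 < train.bpe_joint_1000.trg > lm.bpe_joint_1000.arpa
~/tools/kenlm/build/bin/build_binary lm.bpe_joint_1000.arpa lm.bpe_joint_1000.binary
```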
An example of a training script is given in `mt-models/best-smt/1/`.
To train a new phrase table:
- Create a new model folder (e.g. `mt-models/smt-bpe_joint_1000/1`)
- Copy the train script over: `cp mt-models/best-smt/1/train.sh mt-models/smt-bpe_joint_1000/1/`
- Modify the location of your tools directory: `tools=~tools`
- Modify `type=bpe_joint_500` to your chosen segmentation type (e.g. `type=bpe_joint_1000`)
- Modify the final lines of the two `train-model.perl` commands if you wish to change the type of language model used
- Go to the directory and run training: `cd mt-models/smt-bpe_joint_1000/1; bash train.sh`
Tune the models:
- Copy the tuning script over: `cp mt-models/best-smt/1/tune.sh mt-models/smt-bpe_joint_1000/1/`
- As before, modify the tools location and the segmentation type.
- Go to the directory and run tuning: `cd mt-models/smt-bpe_joint_1000/1; bash tune.sh`
This does tuning for one random seed. To do the other two random seeds, create two more subfolders `mt-models/smt-bpe_joint_1000/2` and `mt-models/smt-bpe_joint_1000/3`, copy the tuning script over from `1/` as it is, and rerun tuning (i.e. you do not need to retrain phrase tables and language models).
Create model folders and scripts for different hyper-parameter settings as follows:
bash mt-training-scripts/create_experiments.sh
N.B. You can change the hyper-parameter values in this file to generate different combinations. The dropout, batch size and learning rate are hard-coded as we only try different combinations for a few experiments.
This script will create a model folder named with the specific parameters. Each folder will have a subfolder indicating the random seed and in each of these folders will be the training script.
E.g. mt-models/transformer_char_2enc_2dec_2heads_128embdim_512ff_0.3drop_0.001lr_3000bsz/{1,2,3}/
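The folder name encodes the architecture hyper-parameters. In fairseq terms this corresponds roughly to the following flags (a hedged sketch: the data path is assumed, "3000bsz" is interpreted here as a token-based batch size, and the exact command is in each generated train.sh):

```bash
# hedged sketch, not the repository's exact training command
fairseq-train data/bin/char \
    --arch transformer \
    --encoder-layers 2 --decoder-layers 2 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-embed-dim 128 --decoder-embed-dim 128 \
    --encoder-ffn-embed-dim 512 --decoder-ffn-embed-dim 512 \
    --dropout 0.3 --lr 0.001 --max-tokens 3000
```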
To run training:
cd <model_folder>/<seed>
bash train.sh
Then translate the validation (dev) set for each of the model checkpoints:
cd <model_folder>/<seed>
bash translate_val.sh
To choose the best checkpoint (using symmetrised word accuracy as the criterion):
bash mt-training-scripts/eval_val.sh <model_folder>/<seed>
This will produce a validation file `valid.eval` in the subfolder, which records the scores for each of the checkpoints, finds the best-scoring checkpoint and copies it over to `checkpoint_bestwordacc_sym.pt`. The translation of the validation set by this best checkpoint is `checkpoint_bestwordacc_sym.pt.valid.postproc`.
If you use or refer to this work, please cite the following paper:
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association. Marseille, France.
Bibtex:
@inproceedings{bawden-etal-2022-automatic,
title = {{Automatic Normalisation of Early Modern French}},
author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{\^i}t and Gabay, Simon},
url = {https://hal.inria.fr/hal-03540226},
booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
year = {2022},
address = {Marseille, France},
pages = {3354--3366},
url = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.358.pdf}
}
The models can be found on Zenodo:
Bawden, Rachel, Poinhos, Jonathan, Kogkitsidou, Eleni, Gambette, Philippe, Sagot, Benoît, & Gabay, Simon. (2022). FreEM-corpora/FreEM-automatic-normalisation: normalisation models for Early Modern French (1.0). Zenodo. https://doi.org/10.5281/zenodo.6594765
@software{bawden_rachel_2022_6594765,
author = {Bawden, Rachel and
Poinhos, Jonathan and
Kogkitsidou, Eleni and
Gambette, Philippe and
Sagot, Benoît and
Gabay, Simon},
title = {{FreEM-corpora/FreEM-automatic-normalisation:
normalisation models for Early Modern French}},
month = may,
year = 2022,
publisher = {Zenodo},
version = {1.0},
doi = {10.5281/zenodo.6594765},
url = {https://doi.org/10.5281/zenodo.6594765}
}
And to reference the FreEM-norm and FreEM-max datasets used in the experiments:
For FreEM-norm (used to train ABA, SMT and neural models): Simon Gabay. (2022). FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5865428
@software{simon_gabay_2022_5865428,
author = {Simon Gabay},
title = {{FreEM-corpora/FreEMnorm: FreEM norm Parallel
corpus}},
month = jan,
year = 2022,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.5865428},
url = {https://doi.org/10.5281/zenodo.5865428}
}
For FreEM-max (used to train the large language models for SMT):
@software{gabay_simon_2022_6481135,
author = {Gabay, Simon and
Bartz, Alexandre and
Gambette, Philippe and
Chagué, Alix},
title = {{FreEM-corpora/FreEMmax\_OA: FreEM max OA: A Large
Corpus for Early modern French - Open access
version}},
month = apr,
year = 2022,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.6481135},
url = {https://doi.org/10.5281/zenodo.6481135}
}