Neural Machine Translation with cycle-consistency.
Final project for the Skoltech DL-2023 course.
Traditional methods of Neural Machine Translation (NMT) require a large corpus of paired texts. However, in our information age there is a lot of unstructured, unpaired data on the internet. So, can we employ this unlabeled data to train an NMT model? Or use the data at our disposal more efficiently?
Observation: if you translate a sentence from English to Russian and then back, you should get the same sentence. So, having two translation models, we can take unpaired data, translate it with one model, and force the other to translate it back.
We can also apply the same trick when training two translation models in parallel on paired data, trying to elicit more signal from what we have.
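Below is a minimal sketch of how such a cycle-consistency term could be computed with two Hugging Face seq2seq models. The names (`en2ru`, `ru2en`, `cycle_loss`) and the `t5-base` placeholder checkpoints are illustrative; the actual loss used in this repository (see `two_model_cyc/`) may differ in details.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative setup: two seq2seq models, one per translation direction.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
en2ru = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
ru2en = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


def cycle_loss(en_sentences, max_length=64):
    """English -> Russian -> English reconstruction loss (one direction of the cycle)."""
    en_batch = tokenizer(en_sentences, return_tensors="pt", padding=True, truncation=True)

    # 1) Translate to Russian with the first model (no gradient through generation).
    with torch.no_grad():
        ru_ids = en2ru.generate(**en_batch, max_length=max_length)
    ru_texts = tokenizer.batch_decode(ru_ids, skip_special_tokens=True)

    # 2) Ask the second model to translate back, using the original English as labels.
    ru_batch = tokenizer(ru_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(en_sentences, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

    # Cross-entropy between the reconstruction and the original sentences.
    return ru2en(input_ids=ru_batch.input_ids,
                 attention_mask=ru_batch.attention_mask,
                 labels=labels).loss
```

In this sketch the generation step is detached, so the gradient only flows through the model translating back; the symmetric Russian→English→Russian term is computed analogously.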
There are many ways to develop this idea, so to stay focused we formulate several more concrete questions:
- Can we benefit from additional unlabeled data using cycle consistency?
- Can we benefit from enforcing cycle consistency in low-data regime?
- How do the results depend on the model size?
The original English-Russian dataset, taken from here, contains about 450k paired sentences.
To emulate a low-data regime, we subsampled 10k examples and dubbed this subset `low_resource_train`.
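The actual split is produced by `make_split.py` (see the reproduction section below); the snippet here is only a hypothetical illustration of such subsampling, assuming the tab-separated `rus.txt` file from the downloaded archive.

```python
import random

# Hypothetical illustration of the subsampling; the real logic lives in make_split.py.
random.seed(0)

with open("rus.txt", encoding="utf-8") as f:  # tab-separated "english\trussian\t..." lines
    pairs = [line.split("\t")[:2] for line in f if line.strip()]

random.shuffle(pairs)
low_resource_train = pairs[:10_000]  # emulate the low-data regime
rest = pairs[10_000:]                # remaining pairs, e.g. used as unpaired/evaluation data

with open("low_resource_train.tsv", "w", encoding="utf-8") as f:
    f.writelines("\t".join(p) + "\n" for p in low_resource_train)
```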
The first experiment was to train four t5-base models on `low_resource_train` for 10 epochs.
The first pair of models was trained using the classic CrossEntropy loss, while the other two models used a combination of CrossEntropy and the proposed Cyclic Loss.
Experiment name | BLEU for en2ru model | BLEU for ru2en model |
---|---|---|
T5-base, CrossEntropy loss, 10 epochs | 4.7844 | 4.3415 |
T5-base, CrossEntropy + Cyclic Loss, 10 epochs | 6.1359 | 5.7527 |
As you can see, with a low number of epochs the proposed training scheme yields better BLEU scores.
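The reported numbers are corpus-level BLEU scores. A score of this kind can be computed, for example, with `sacrebleu`; this toy sketch is illustrative and may differ from the evaluation code actually used in the repository.

```python
import sacrebleu

# Toy example: model outputs vs. reference translations (one reference per sentence).
hypotheses = ["кошка сидит на коврике", "я люблю машинное обучение"]
references = ["кошка сидит на коврике", "я обожаю машинное обучение"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(round(bleu.score, 4))  # corpus-level BLEU, as in the tables
```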
The next experiment was to train four t5-base models on `low_resource_train`, but this time for 30 epochs instead of 10. Again, the first two models were trained using the classic CrossEntropy loss, while the other two employed our proposed Cyclic Loss.
Experiment name | BLEU for en2ru model | BLEU for ru2en model |
---|---|---|
T5-base, CrossEntropy loss, 30 epochs | 14.7513 | 18.7511 |
T5-base, CrossEntropy + Cyclic Loss, 30 epochs | 13.7197 | 17.6676 |
Surprisingly, with a higher number of epochs our models yield slightly lower scores.
Here we use the same setup as before, except that the model is now t5-small (`cointegrated/rut5-small`, to be exact).
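Swapping the backbone amounts to changing the checkpoint name passed to `transformers`; a minimal illustration, not the project's actual loading code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "cointegrated/rut5-small"  # the smaller backbone used in this experiment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```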
On this plot we can see that cycle consistency only helps once, at 20 epochs for the English-to-Russian model. We hypothesize that the small model is less robust to noisy training signals.
The third experiment was to train 10 t5-base models using different amounts of data. The first four models use only the small subset of labeled data from the previous experiments and the CrossEntropy loss. The next four models use the Cyclic Loss and are trained in the following manner (a schematic of this schedule is given after the list):
- Train for 10 (or 30) epochs on `low_resource_train` (init stage)
- Train for 1 epoch on a big set (~300k sentences) of unlabeled data (pretrain stage)
- Train for 10 epochs on `low_resource_train` (finetune stage)
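Schematically, with hypothetical helper functions standing in for the real training loop in `two_model_cyc/train.py`:

```python
# Hypothetical stubs; the actual implementation lives in two_model_cyc/train.py.
def train_supervised(models, paired_data, epochs):
    """Cross-entropy training on paired sentences (stub)."""
    ...

def train_with_cycle_loss(models, unpaired_data, epochs):
    """Cycle-consistency training on unpaired sentences (stub)."""
    ...

def multistage(models, low_resource_train, unlabeled_data, init_epochs):
    train_supervised(models, low_resource_train, epochs=init_epochs)  # init stage: 10 or 30 epochs
    train_with_cycle_loss(models, unlabeled_data, epochs=1)           # pretrain stage: ~300k unlabeled sentences
    train_supervised(models, low_resource_train, epochs=10)           # finetune stage
```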
Finally, we trained two models in a mixed regime (a sketch of one such training step is shown after this list):
- At each training step, the model receives a batch of paired data and a batch of unpaired data
- For the paired data, the usual loss is computed
- For both the paired and the unpaired data, the consistency loss is computed
- The total loss is a weighted sum of these losses
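A rough sketch of one mixed-regime step, reusing the illustrative `cycle_loss`, `en2ru`, and `ru2en` from above; `lambda_cyc`, the batch field names, and the assumption that label padding is already set to -100 are all hypothetical:

```python
def mixed_step(paired_batch, unpaired_batch, optimizer, lambda_cyc=1.0):
    """One illustrative training step of the mixed regime."""
    # Supervised cross-entropy on the paired batch, in both directions
    # (labels are assumed to have padding tokens replaced with -100).
    ce_loss = (
        en2ru(input_ids=paired_batch["en_ids"],
              attention_mask=paired_batch["en_mask"],
              labels=paired_batch["ru_labels"]).loss
        + ru2en(input_ids=paired_batch["ru_ids"],
                attention_mask=paired_batch["ru_mask"],
                labels=paired_batch["en_labels"]).loss
    )

    # Cycle-consistency term on both the paired and the unpaired batch.
    cyc_loss = cycle_loss(paired_batch["en_texts"]) + cycle_loss(unpaired_batch["en_texts"])

    # Total loss is a weighted sum of the two terms.
    loss = ce_loss + lambda_cyc * cyc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```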
Experiment name | BLEU for en2ru model | BLEU for ru2en model |
---|---|---|
T5-base, CrossEntropy loss, 10 epochs | 4.7844 | 4.3415 |
T5-base, CrossEntropy + Cyclic Loss, multistage, 10 + 1 + 10 epochs | 6.7953 | 7.2948 |
T5-base, CrossEntropy loss, 30 epochs | 14.7513 | 18.7511 |
T5-base, CrossEntropy + Cyclic Loss, multistage, 30 + 1 + 10 epochs | 15.1079 | 20.2587 |
T5-base, CrossEntropy + Cyclic Loss, mixed, 30 epochs | 15.6286 | 20.0381 |
First, the mixed strategy works best with respect to average BLEU. Second, multistage training improves relative to training for fewer epochs. However, if we compare the 30-epoch baseline against the 10+1+10 multistage schedule, the former wins convincingly, while the wall-clock training time is comparable.
- Can we benefit from unpaired data?
  - The mixed approach brings some benefits. We hypothesize that with a very large number of epochs it would still come out ahead of the baseline, making use of more samples. However, it is dramatically slower (5-6x).
- Can we benefit from enforcing cycle consistency in the low-data regime?
  - Yes, if the number of training epochs is low.
- How does it depend on the model size?
  - In our experiments, small models were harder to train with the cycle constraint.
Download and split the data:
```bash
cd data
wget http://www.manythings.org/anki/rus-eng.zip
unzip rus-eng.zip
python make_split.py
```
Install micromamba. For Linux users:
```bash
curl micro.mamba.pm/install.sh | bash
```
Restart the terminal.
Create and activate the environment with
```bash
micromamba create -f env.yml
micromamba activate cyc
```
Configure PYTHONPATH:
```bash
export PROJECT_DIR=path/to/CycleTranslate
micromamba env config vars set PYTHONPATH=${PROJECT_DIR}:
micromamba deactivate
micromamba activate cyc
```
Then, to reproduce, e.g., the training of t5-small, one can run
```bash
CUDA_VISIBLE_DEVICES=0 python two_model_cyc/train.py --batch_size 512 --model-name cointegrated/t5-small --run-name small_model_small_train
```
- Seleznyov Mikhail
  - experiment design
  - implementing the baseline model
  - running experiments
  - tuning README and presentation
- Sushko Nikita
  - idea of the project
  - implementing helper and utility scripts
  - running experiments
  - writing README
- Kovaleva Maria
  - implementing the cyclic loss model
  - preparing presentation
  - existing paper research
  - running experiments