This repository contains the code for a partial replication of Conneau et al. (2017), Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
Four different sentence encoders are implemented and trained in the "Generic NLI training scheme" as described in the original paper. In particular:
- Baseline: averaging word embeddings to obtain sentence representations.
- LSTM: applied to the word embeddings, with the last hidden state taken as the sentence representation.
- BiLSTM: the last hidden states of the forward and backward layers are concatenated to form the sentence representation.
- Max-Pool BiLSTM: max pooling is applied over the concatenation of word-level hidden states from both directions to obtain the sentence representation (see the sketch below).
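For intuition, here is a minimal PyTorch sketch of the max-pooling BiLSTM encoder. It is an illustration, not the repository's implementation (that lives in models/encoder.py); the hidden size of 2048 follows the original paper.

```python
import torch
import torch.nn as nn

class MaxPoolBiLSTMEncoder(nn.Module):
    """Illustrative BiLSTM encoder with max pooling over time."""

    def __init__(self, embed_dim: int = 300, hidden_dim: int = 2048):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim) pre-looked-up word vectors
        hidden_states, _ = self.lstm(embeddings)  # (batch, seq_len, 2 * hidden_dim)
        # element-wise max over the time dimension gives a fixed-size sentence vector
        sentence_repr, _ = hidden_states.max(dim=1)
        return sentence_repr  # (batch, 2 * hidden_dim)
```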
Evaluation is then done with SentEval.
Details such as Python and package versions can be found in the generated pyproject.toml and poetry.lock files.
We recommend using an environment manager such as conda. After setting up your environment with the correct Python version, proceed with installing the required packages.
For Poetry users, getting set up is as easy as running

```bash
poetry install
```
We also provide a requirements.txt file for pip users who do not wish to use Poetry. In this case, simply run

```bash
pip install -r requirements.txt
```
Once these packages have been installed, we have to set up SentEval manually, since the Facebook researchers and engineers are not paid enough to make a PyPI package like everyone else. To do this, first clone the repository to a folder of your choice:
```bash
git clone git@github.com:facebookresearch/SentEval.git
```
Then navigate to the SentEval repository and install it into the same environment used above:

```bash
cd SentEval
python setup.py install
```
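To verify the installation succeeded, check that the package imports cleanly (SentEval should be importable as senteval):

```bash
python -c "import senteval"
```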
The repository does not include the datasets and pretrained embeddings used to train the models mentioned above, nor the trained model checkpoints themselves, as these are inappropriate for git version control.
The datasets required for NLI training are public and will be downloaded automatically when necessary.
The model checkpoints and evaluation results, hosted on the Internet Archive, can be downloaded from this link. Please download and unzip the file, placing the resulting logs/ directory in the repository root.
The public datasets and embeddings used are SNLI and the 840B-token, 300-dimensional GloVe vectors, respectively. If you already have these locally and do not wish to re-download them, simply move (or symbolically link) them into a shared data directory, then pass this directory and the resulting paths as arguments to the scripts.
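For example, assuming your existing copies live under ~/datasets (the paths below are illustrative), you could symlink them into place:

```bash
# adjust the source paths to wherever your copies actually live
mkdir -p data
ln -s ~/datasets/glove.840B.300d.txt data/glove.840B.300d.txt
ln -s ~/datasets/snli data/snli
```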
We also make use of the SentEval datasets. To download them, return to the SentEval repository you previously cloned and run

```bash
cd data/downstream/
./get_transfer_data.bash
```
Once this is complete, you may then rsync or mv the downstream/ directory to a directory of your choice. Keep this directory in mind, as we will point to it when using SentEval for evaluation. For example, after running

```bash
rsync -r -v -h senteval/data/downstream infersent-replication/data/
```

we would then point to infersent-replication/data when using SentEval for evaluation.
The repository is structured as follows:

```
.
├── models/             # models folder
│   ├── __init__.py
│   ├── encoder.py      # sentence encoder models
│   └── infersent.py    # Generic NLI model
├── data/               # data folder (not committed)
├── logs/               # logs folder (not committed)
├── utils.py            # for miscellaneous utils
├── data.py             # for data loading and processing
├── train.py            # for model training
├── eval.py             # for model evaluation
├── infer.py            # for model inference
├── demo.ipynb          # demo Jupyter notebook
├── error_analysis.md   # error analysis markdown file
├── error_analysis.pdf  # error analysis pdf file
├── images/             # images folder for error_analysis.md
├── pyproject.toml      # repo metadata
├── poetry.lock
├── gen_pip_reqs.sh     # script for gen. pip requirements
└── README.md           # you are here
```
The repository comes with a demo Jupyter Notebook that allows users to load a trained model and run inference on different examples.
The notebook also provides an overview and analysis of the results.
For more fine-grained usage, please refer to the following sections.
When called directly, the data.py script will take care of setting up the data and embedding requirements for you. In particular, it will:
- Download the GloVe embeddings if the embeddings .txt file is not found.
- Parse the embeddings .txt file.
- Download the SNLI dataset if it is not already present.
- Process the SNLI dataset, building the vocabulary in the process.
- Save the vocabulary to disk, to avoid having to build it again.
- Align the GloVe embeddings to the vocabulary.
- Save the aligned GloVe embeddings to disk as a tensor.
For usage:
```
usage: data.py [-h] [-d DATA_DIR] [-g GLOVE] [-gv GLOVE_VARIANT] -ag
               ALIGNED_GLOVE [-b BATCH_SIZE] [-cv CACHED_VOCAB]

Sets up data: downloads data and aligns GloVe embeddings to SNLI vocab.

options:
  -h, --help            show this help message and exit
  -d DATA_DIR, --data-dir DATA_DIR
                        path to data directory
  -g GLOVE, --glove GLOVE
                        path to glove embeddings
  -gv GLOVE_VARIANT, --glove-variant GLOVE_VARIANT
                        which variant of glove embeddings to use
  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE
                        path to save aligned glove embeddings tensor
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size for training
  -cv CACHED_VOCAB, --cached-vocab CACHED_VOCAB
                        path to save/load serialized vocabulary
```
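A minimal invocation might look like the following (the output path for the aligned embeddings tensor is illustrative):

```bash
python data.py --aligned-glove data/aligned_glove.pt
```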
We use train.py for training. Most arguments should work fine with default values.
For usage:
```
usage: train.py [-h] -e ENCODER_TYPE [-c CHECKPOINT_PATH] [-s SEED] [-p]
                [-l LOG_DIR] [-d DATA_DIR] [-g GLOVE] [-gv GLOVE_VARIANT]
                [-ag ALIGNED_GLOVE] [-b BATCH_SIZE] [-cv CACHED_VOCAB]
                [-w NUM_WORKERS]

Trains an InferSent model. Test-set evaluation is deferred to eval.py

options:
  -h, --help            show this help message and exit
  -e ENCODER_TYPE, --encoder-type ENCODER_TYPE
                        one of 'baseline', 'lstm', 'bilstm', 'maxpoolbilstm'
  -c CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH
                        path for loading previously saved checkpoint
  -s SEED, --seed SEED  the random seed to use
  -p, --progress-bar    whether to show the progress bar
  -l LOG_DIR, --log-dir LOG_DIR
                        path to log directory
  -d DATA_DIR, --data-dir DATA_DIR
                        path to data directory
  -g GLOVE, --glove GLOVE
                        path to glove embeddings
  -gv GLOVE_VARIANT, --glove-variant GLOVE_VARIANT
                        which variant of glove embeddings to use
  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE
                        path to aligned glove embeddings tensor
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size for training
  -cv CACHED_VOCAB, --cached-vocab CACHED_VOCAB
                        path to save/load serialized vocabulary
  -w NUM_WORKERS, --num-workers NUM_WORKERS
                        number of workers for data loading
```
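For example, to train the max-pooling BiLSTM encoder from scratch with a progress bar (all paths left at their defaults):

```bash
python train.py --encoder-type maxpoolbilstm --progress-bar
```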
We use eval.py for evaluation, both on the original SNLI task and on SentEval, configurable via the command-line arguments.
For usage:
```
usage: eval.py [-h] [-d DATA_DIR] [-o OUTPUT_DIR] [--snli]
               [--snli-output-dir SNLI_OUTPUT_DIR] [--senteval]
               [--senteval-output-dir SENTEVAL_OUTPUT_DIR] -e ENCODER_TYPE -c
               CHECKPOINT_PATH [-ag ALIGNED_GLOVE] [-cv CACHED_VOCAB]
               [-w NUM_WORKERS] [-b BATCH_SIZE] [-p] [-s SEED] [-g]

Evaluate a trained model on SNLI and SentEval

options:
  -h, --help            show this help message and exit
  -d DATA_DIR, --data-dir DATA_DIR
                        path to data directory
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Parent directory for saving results
  --snli                Evaluate on SNLI
  --snli-output-dir SNLI_OUTPUT_DIR
                        Directory to save SNLI results
  --senteval            Evaluate on SentEval
  --senteval-output-dir SENTEVAL_OUTPUT_DIR
                        Directory to save SentEval results
  -e ENCODER_TYPE, --encoder-type ENCODER_TYPE
                        one of 'baseline', 'lstm', 'bilstm', 'maxpoolbilstm'
  -c CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH
                        path to the checkpoint file
  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE
                        path to the aligned glove file
  -cv CACHED_VOCAB, --cached-vocab CACHED_VOCAB
                        path to save/load serialized vocabulary
  -w NUM_WORKERS, --num-workers NUM_WORKERS
                        number of workers for data loading
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size
  -p, --progress-bar    whether to show the progress bar
  -s SEED, --seed SEED  the random seed to use
  -g, --gpu             whether to use gpu
```
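For example, to evaluate a trained BiLSTM checkpoint on both SNLI and SentEval using the GPU (the checkpoint path is illustrative):

```bash
python eval.py --encoder-type bilstm --checkpoint-path logs/bilstm/best.ckpt \
    --snli --senteval --gpu
```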
We provide a simple script for performing inference, infer.py. It can be used either to predict the entailment relation between a pair of sentences, or to embed a single sentence. For usage:
```
usage: infer.py [-h] -m MODE -c CHECKPOINT_PATH [-ag ALIGNED_GLOVE] -s1
                SENTENCE_1 [-s2 SENTENCE_2] [-map]

Script for inference

options:
  -h, --help            show this help message and exit
  -m MODE, --mode MODE  Mode for inference. One of 'nli' or 'sentembed'
  -c CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH
                        Path to the checkpoint file
  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE
                        path to the aligned glove file
  -s1 SENTENCE_1, --sentence-1 SENTENCE_1
                        Sentence to embed if embedding, premise if NLI'ing
  -s2 SENTENCE_2, --sentence-2 SENTENCE_2
                        Hypothesis. Only required if NLI'ing
  -map, --map           Flag whether to return one of {'entailment',
                        'neutral', 'contradiction'} instead of {0, 1, 2}
```
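For example, to classify a premise/hypothesis pair and get a human-readable label (the checkpoint path is illustrative):

```bash
python infer.py --mode nli --checkpoint-path logs/bilstm/best.ckpt \
    --sentence-1 "A man is playing a guitar." \
    --sentence-2 "A person is making music." \
    --map
```

Switching to --mode sentembed with only --sentence-1 returns the sentence embedding instead.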