This repository contains code and instructions for reproducing the experiments in the paper Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP 2023).
We use one NVIDIA RTX A6000 GPU to run the evaluation code in our experiments. The code is written in Python 3.8. You can install the dependencies as follows.
git clone --recurse-submodules https://github.com/yuzhimanhua/SciMult
cd SciMult
# get the DPR codebase
mkdir third_party
cd third_party
git clone https://github.com/facebookresearch/DPR.git
cd ../
# create the sandbox
conda env create --file=environment.yml --name=scimult
conda activate scimult
# add the `src/` and `third_party/DPR` to the list of places python searches for packages
conda develop src/ third_party/DPR/
# download spacy models
python -m spacy download en_core_web_sm
You need to first download the evaluation datasets and the pre-trained models. After you unzip the dataset file, put the folder (i.e., `data/`) under the repository main folder `./`. After you download the four model checkpoints (i.e., `scimult_vanilla.ckpt`, `scimult_moe.ckpt`, `scimult_moe_pmcpatients_par.ckpt`, and `scimult_moe_pmcpatients_ppr.ckpt`), put them under the model folder `./model/`.
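Before launching an evaluation, it can help to confirm that the downloads landed where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `find_missing_files` is hypothetical, and it only checks for the `data/` folder and the four checkpoint names listed above.

```python
from pathlib import Path
from typing import List

# Checkpoint names as listed in the instructions above.
EXPECTED_CHECKPOINTS = [
    "scimult_vanilla.ckpt",
    "scimult_moe.ckpt",
    "scimult_moe_pmcpatients_par.ckpt",
    "scimult_moe_pmcpatients_ppr.ckpt",
]


def find_missing_files(repo_root: str) -> List[str]:
    """Return expected paths (relative to repo_root) that are not present.

    Hypothetical helper for illustration; it is not shipped with SciMult.
    """
    root = Path(repo_root)
    missing = []
    # The unzipped dataset folder should sit directly under the repository root.
    if not (root / "data").is_dir():
        missing.append("data/")
    # Each checkpoint should sit under ./model/.
    for name in EXPECTED_CHECKPOINTS:
        if not (root / "model" / name).is_file():
            missing.append(f"model/{name}")
    return missing
```

Calling `find_missing_files(".")` from the repository main folder returns an empty list when everything is in place, and otherwise the relative paths that still need to be downloaded or moved.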
Then, you can run the evaluation code for each task:
cd src
# evaluate fine-grained classification (MAPLE [CS-Conference, Chemistry-MeSH, Geography, Psychology])
./eval_classification_fine.sh
# evaluate coarse-grained classification (SciDocs [MAG, MeSH])
./eval_classification_coarse.sh
# evaluate link prediction under the retrieval setting (SciDocs [Cite, Co-cite], PMC-Patients [PPR])
./eval_link_prediction_retrieval.sh
# evaluate link prediction under the reranking setting (Recommendation)
./eval_link_prediction_reranking.sh
# evaluate search (SciRepEval [Search, TREC-COVID], BEIR [TREC-COVID, SciFact, NFCorpus])
./eval_search.sh
The metrics will be shown at the end of the terminal output as well as in `scores.txt`.
If you have some documents (e.g., scientific papers) and want to get the embedding of each document using SciMult, we provide the following sample code for your reference:
cd src
python3.8 get_embedding.py
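Once you have one embedding per document, a typical next step is to rank candidate documents against a query embedding. The sketch below is illustrative only: it does not call `get_embedding.py`, it assumes the embeddings are already available as plain lists of floats, and the function names `dot` and `rank_by_dot_product` are hypothetical. Scoring by inner product is also an assumption here (it is the usual choice in DPR-style retrievers); check the evaluation scripts for the similarity actually used.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar.

    Hypothetical helper: `candidates` maps a document id to its embedding.
    """
    scored = [(doc_id, dot(query, emb)) for doc_id, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy two-dimensional vectors, made up for illustration (not real SciMult output).
toy_query = [1.0, 0.0]
toy_candidates = {
    "paper_a": [0.9, 0.1],
    "paper_b": [0.1, 0.9],
    "paper_c": [0.5, 0.5],
}
toy_ranking = rank_by_dot_product(toy_query, toy_candidates, top_k=2)
```

For large candidate sets, a vectorized or index-based implementation would be a better fit; the pure-Python version is only meant to show the ranking logic.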
NOTE: The performance of SciMult on PMC-Patients reported in our paper is based on the old version of PMC-Patients (i.e., the version available when we wrote the SciMult paper). The PMC-Patients Leaderboard at that time can be found here.
To reproduce our reported performance on the "old" PMC-Patients Leaderboard:
cd src
./eval_pmc_patients.sh
The metrics will be shown at the end of the terminal output as well as in `scores.txt`. The similarity scores that we submitted to the leaderboard can be found at `../output/PMCPatientsPAR_test_out.json` and `../output/PMCPatientsPPR_test_out.json`.
For the performance of SciMult on the new version of PMC-Patients, please refer to the up-to-date PMC-Patients Leaderboard.
To reproduce our performance on the SciDocs benchmark:
cd src
./eval_scidocs.sh
The output embedding files can be found at `../output/cls.jsonl` and `../output/user-citation.jsonl`. Then, run the adapted SciDocs evaluation code:
cd ../
git clone https://github.com/yuzhimanhua/SciDocs.git
cd SciDocs
# install dependencies
conda deactivate
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install
# run evaluation
python eval.py
The metrics will be shown at the end of the terminal output.
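If the SciDocs evaluation fails or the numbers look off, one thing worth checking is the embedding files written by `./eval_scidocs.sh`. The sketch below loads a JSON Lines file and verifies that every vector has the same dimension. It is not part of the repository: the helper name `load_embeddings` is hypothetical, and the key names `paper_id` and `embedding` are an assumption based on the embedding format the SciDocs benchmark commonly uses, so inspect one line of your file before relying on them.

```python
import json
from typing import Dict, List, Optional


def load_embeddings(
    path: str,
    id_key: str = "paper_id",      # assumed key name; verify against your file
    vec_key: str = "embedding",    # assumed key name; verify against your file
) -> Dict[str, List[float]]:
    """Load a JSON Lines embedding file and check that dimensions agree."""
    embeddings: Dict[str, List[float]] = {}
    dim: Optional[int] = None
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            vec = record[vec_key]
            if dim is None:
                dim = len(vec)  # the first record fixes the expected dimension
            elif len(vec) != dim:
                raise ValueError(
                    f"{path}:{line_no}: expected dimension {dim}, got {len(vec)}"
                )
            embeddings[record[id_key]] = vec
    return embeddings
```

For example, `len(load_embeddings("../output/cls.jsonl"))` gives the number of embedded papers, which can be compared against the number of papers the corresponding SciDocs task expects.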
The preprocessed evaluation datasets can be downloaded from here. The aggregate version is released under the ODC-By v1.0 License. By downloading this version you acknowledge that you have read and agreed to all the terms in this license.
Similar to TensorFlow Datasets or Hugging Face's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format, but we do not vouch for their quality or fairness, or claim that you have the license to use them. It remains your responsibility as a user to determine whether you have permission to use each dataset under its license and to cite the right owner of the dataset.
More details about each constituent dataset are as follows.
Dataset | Folder | #Queries | #Candidates | Source | License |
---|---|---|---|---|---|
MAPLE (CS-Conference) | `classification_fine/` | 261,781 | 15,808 | Link | ODC-By v1.0 |
MAPLE (Chemistry-MeSH) | `classification_fine/` | 762,129 | 30,194 | Link | ODC-By v1.0 |
MAPLE (Geography) | `classification_fine/` | 73,883 | 3,285 | Link | ODC-By v1.0 |
MAPLE (Psychology) | `classification_fine/` | 372,954 | 7,641 | Link | ODC-By v1.0 |
SciDocs (MAG Fields) | `classification_coarse/` | 25,001 | 19 | Link | CC BY 4.0 |
SciDocs (MeSH Diseases) | `classification_coarse/` | 23,473 | 11 | Link | CC BY 4.0 |
SciDocs (Cite) | `link_prediction_retrieval/` | 92,214 | 142,009 | Link | CC BY 4.0 |
SciDocs (Co-cite) | `link_prediction_retrieval/` | 54,543 | 142,009 | Link | CC BY 4.0 |
PMC-Patients (PPR, Zero-shot) | `link_prediction_retrieval/` | 100,327 | 155,151 | Link | CC BY-NC-SA 4.0 |
PMC-Patients (PAR, Supervised) | `pmc_patients/` | 5,959 | 1,413,087 | Link | CC BY-NC-SA 4.0 |
PMC-Patients (PPR, Supervised) | `pmc_patients/` | 2,812 | 155,151 | Link | CC BY-NC-SA 4.0 |
SciDocs (Co-view) | `scidocs/` | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
SciDocs (Co-read) | `scidocs/` | 1,000 | reranking, 29.98 for each query on average | Link | CC BY 4.0 |
SciDocs (Cite) | `scidocs/` | 1,000 | reranking, 29.93 for each query on average | Link | CC BY 4.0 |
SciDocs (Co-cite) | `scidocs/` | 1,000 | reranking, 29.95 for each query on average | Link | CC BY 4.0 |
Recommendation | `link_prediction_reranking/` | 137 | reranking, 16.28 for each query on average | Link | N/A |
SciRepEval-Search | `search/` | 2,637 | reranking, 10.00 for each query on average | Link | ODC-By v1.0 |
TREC-COVID in SciRepEval | `search/` | 50 | reranking, 1386.36 for each query on average | Link | ODC-By v1.0 |
TREC-COVID in BEIR | `search/` | 50 | 171,332 | Link | Apache License 2.0 |
SciFact | `search/` | 1,109 | 5,183 | Link | Apache License 2.0, CC BY-NC 2.0 |
NFCorpus | `search/` | 3,237 | 3,633 | Link | Apache License 2.0 |
Our pre-trained models can be downloaded from here. Please refer to the Hugging Face README for more details about the models.
If you find SciMult useful in your research, please cite the following paper:
@inproceedings{zhang2023pre,
title={Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding},
author={Zhang, Yu and Cheng, Hao and Shen, Zhihong and Liu, Xiaodong and Wang, Ye-Yi and Gao, Jianfeng},
booktitle={Findings of EMNLP'23},
pages={12259--12275},
year={2023}
}