This repository contains datasets and code for classifying citation intents in academic papers.
For details on the model and data, refer to our NAACL 2019 paper:
"Structural Scaffolds for Citation Intent Classification in Scientific Publications".
We introduce SciCite, a new large dataset of citation intents. Download from the following link:
The data is in the JSON Lines format (each line is a JSON object).
The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with a context key.
Example entry:
{
'string': 'In chacma baboons, male-infant relationships can be linked to both formation of friendships and paternity success [30,31].',
'sectionName': 'Introduction',
'label': 'background',
'citingPaperId': '7a6b2d4b405439',
'citedPaperId': '9d1abadc55b5e0',
...
}
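Because each line of the file is a separate JSON object, the data can be read with the standard library alone. The following is a minimal sketch, assuming every line of the downloaded file is valid JSON and carries the label key described above; the function names read_jsonl and label_distribution are illustrative and are not part of this repository.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def label_distribution(path):
    """Count how often each value of the `label` key occurs."""
    return Counter(entry["label"] for entry in read_jsonl(path))
```

Calling label_distribution on a training file (for example a hypothetical train.jsonl) gives a quick view of how the citation intents are distributed.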
You may obtain the full information about the paper using the provided paper ids with the Semantic Scholar API.
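A lookup by paper id might look like the sketch below. It assumes the public Semantic Scholar Graph API endpoint (https://api.semanticscholar.org/graph/v1/paper/) and the title and year fields; this README does not say which API version was used to build the dataset, so treat the endpoint and the helper names paper_url and fetch_paper as assumptions.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Graph API endpoint. The README does not specify
# which version of the Semantic Scholar API the authors used.
API_ROOT = "https://api.semanticscholar.org/graph/v1/paper/"


def paper_url(paper_id, fields=("title", "year")):
    """Build the lookup URL for a single paper id."""
    query = urllib.parse.urlencode({"fields": ",".join(fields)}, safe=",")
    return API_ROOT + urllib.parse.quote(paper_id, safe="") + "?" + query


def fetch_paper(paper_id, fields=("title", "year")):
    """Fetch paper metadata as a dict. This performs a network request."""
    with urllib.request.urlopen(paper_url(paper_id, fields), timeout=30) as resp:
        return json.load(resp)
```

Pass fetch_paper a citingPaperId or citedPaperId value taken from the dataset. Unauthenticated requests may be rate limited, so large batches should be throttled.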
We also run experiments on a pre-existing dataset of citation intents in the computational linguistics domain (ACL-ARC), introduced by Jurgens et al. (2018).
The preprocessed dataset is available at ACL-ARC data.
The project needs Python 3.6 and is based on the AllenNLP library.
Use pip to install the dependencies in your desired Python environment:
pip install -r requirements.in -c constraints.txt
Download one of the pre-trained models and run the following command:
allennlp predict [path-to-model.tar.gz] [path-to-data.jsonl] \
--predictor [predictor-type] \
--include-package scicite \
--overrides "{'model':{'data_format':''}}"
Where:
[path-to-data.jsonl] contains the data in the same format as the training data.
[path-to-model.tar.gz] is the path to the pretrained model.
[predictor-type] is one of predictor_scicite (for the SciCite dataset format) or predictor_aclarc (for the ACL-ARC dataset format).
--output-file [out-path.jsonl] is an optional argument giving the path to the output file. If you don't pass it, the output is printed to stdout.
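When predictions are run from a script, the command above can be assembled as an argument list. This is a sketch that uses only the flags listed in this README; the helper names build_predict_command and as_shell_line are hypothetical and not part of the repository.

```python
import shlex


def build_predict_command(model_path, data_path, predictor, output_file=None):
    """Return the `allennlp predict` call described above as an argument list."""
    if predictor not in ("predictor_scicite", "predictor_aclarc"):
        raise ValueError("unknown predictor type: " + predictor)
    args = [
        "allennlp", "predict", model_path, data_path,
        "--predictor", predictor,
        "--include-package", "scicite",
        "--overrides", "{'model':{'data_format':''}}",
    ]
    if output_file is not None:
        # Optional: write predictions to a file instead of stdout.
        args += ["--output-file", output_file]
    return args


def as_shell_line(args):
    """Quote the arguments so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in args)
```

The returned list can be passed to subprocess.run(args, check=True), which avoids shell quoting problems with the --overrides string.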
If you are using your own data, first convert it to the SciCite data format.
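A conversion step could be sketched as follows. It writes only the keys shown in the example entry (string, sectionName, label, citingPaperId, citedPaperId); the full SciCite schema is not listed in this README, so the predictor may expect additional keys. The function names to_scicite_entry and write_jsonl are hypothetical.

```python
import json


def to_scicite_entry(text, section_name, citing_id, cited_id, label=None):
    """Build one entry using the keys shown in the example entry.

    Assumption: these keys are sufficient. The README elides the remaining
    keys, so check the released data for the complete schema.
    """
    entry = {
        "string": text,
        "sectionName": section_name,
        "citingPaperId": citing_id,
        "citedPaperId": cited_id,
    }
    if label is not None:  # unlabeled data has no intent label yet
        entry["label"] = label
    return entry


def write_jsonl(entries, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")
```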
We also release our pretrained models; download from the following path:
First you need a config file for your training configuration.
Check the experiment_configs/ directory for example configurations.
Important options (you can specify them with environment variables) are:
"train_data_path": # path to training data,
"validation_data_path": # path to development data,
"test_data_path": # path to test data,
"train_data_path_aux": # path to the data for section title scaffold,
"train_data_path_aux2": # path to the data for citation worthiness scaffold,
"mixing_ratio": # parameter \lambda_2 in the paper (sensitivity of loss to the first scaffold),
"mixing_ratio2": # parameter \lambda_3 in the paper (sensitivity of loss to the second scaffold)
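To make the role of the two mixing ratios concrete, the sketch below assumes the overall training loss is the main citation intent loss plus each scaffold loss scaled by its ratio. The exact formulation is given in the paper; the function combined_loss and its argument names are illustrative and do not come from this repository.

```python
def combined_loss(main_loss, section_loss, worthiness_loss,
                  mixing_ratio, mixing_ratio2):
    """Weighted sum of the main task loss and the two scaffold losses.

    Assumption: the scaffold losses enter additively, scaled by
    mixing_ratio (section title scaffold) and mixing_ratio2
    (citation worthiness scaffold).
    """
    return (main_loss
            + mixing_ratio * section_loss
            + mixing_ratio2 * worthiness_loss)
```

Under this assumption, setting both ratios to 0 reduces training to the main task alone, while larger values make the model more sensitive to the corresponding scaffold.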
After downloading the data, edit the configuration file with the correct paths. You also need to set an environment variable specifying whether to use ELMo contextualized embeddings:
export elmo=true
Note that training is significantly slower with ELMo.
After making sure you have the correct configuration file, start training the model.
python scripts/train_local.py train_multitask_2 [path-to-config-file.json] \
-s [path-to-serialization-dir/] \
--include-package scicite
The model output and logs will be stored in [path-to-serialization-dir/].
If you find our dataset or code useful, please cite Structural Scaffolds for Citation Intent Classification in Scientific Publications.
@InProceedings{Cohan2019Structural,
author={Arman Cohan and Waleed Ammar and Madeleine Van Zuylen and Field Cady},
title={Structural Scaffolds for Citation Intent Classification in Scientific Publications},
booktitle="NAACL",
year="2019"
}