This is the implementation of the paper "OntoProtein: Protein Pretraining With Gene Ontology Embedding". OntoProtein is an effective method that injects the structure of GO (Gene Ontology) into a text-enhanced protein pre-training model.
- Overview
- Requirements
- Data preparation
- Protein pre-training model
- Usage for protein-related tasks
- Citation
In this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimizes the KE and MLM objectives, bringing excellent improvements to a wide range of protein tasks. We also introduce ProteinKG25, a new large-scale KG dataset, to promote research on protein language pre-training.
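The joint objective amounts to adding the two losses together. Below is a minimal sketch of a training step under that assumption; the method names and the ke_loss_weight term are illustrative placeholders, not the repository's exact API.

# Illustrative sketch of the joint pre-training objective (not the repository's
# exact API): the total loss is the masked-language-modeling loss on protein
# sequences plus the knowledge-embedding loss over GO-GO and Protein-GO triples.
def training_step(model, mlm_batch, ke_batch, ke_loss_weight=1.0):
    mlm_loss = model.masked_lm_loss(mlm_batch)            # MLM objective (protein sequences)
    ke_loss = model.knowledge_embedding_loss(ke_batch)    # KE objective (knowledge graph triples)
    loss = mlm_loss + ke_loss_weight * ke_loss            # jointly optimized
    loss.backward()
    return loss.item()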
To run our code, please install the dependency packages required for each step.

Data preparation: python3.8 / biopython 1.37 / goatools
Pre-training: python3.8 / pytorch 1.9 / transformers 4.5.1+ / deepspeed 0.5.1 / lmdb
Protein-related tasks: python3.8 / pytorch 1.9 / transformers 4.5.1+ / lmdb
Note: for the environment configuration of some baseline models and methods used in our experiments (e.g. BLAST, DeepGraphGO), we provide the related links below:
BLAST / Interproscan / DeepGraphGO / GNN-PPI
For pre-training OntoProtein, fine-tuning on protein-related tasks, and inference, we describe below how to acquire the related data.
To incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct ProteinKG25, a large-scale KG dataset in which GO terms and protein entities are aligned with descriptions and protein sequences, respectively. There are two ways to acquire the pre-training data: 1) download our prepared ProteinKG25, or 2) generate your own pre-training data.
We have released our prepared ProteinKG25 data on Google Drive. The compressed package includes the following files:
- go_def.txt: GO term definitions (text data). We concatenate each GO term name and its corresponding definition with a colon.
- go_type.txt: The ontology type to which each GO term belongs. The index corresponds to the GO ID in the go2id.txt file.
- go2id.txt: The ID mapping of GO terms.
- go_go_triplet.txt: GO-GO triplet data. These triplets constitute the interior structure of Gene Ontology. The data format is <h r t>, where h and t are the head and tail entities respectively, both GO term nodes, and r is the relation between the two GO terms, e.g. is_a and part_of.
- protein_seq.txt: Protein sequence data. The protein sequences are used as inputs to the MLM module and as protein representations in the KE module.
- protein2id.txt: The ID mapping of proteins.
- protein_go_train_triplet.txt: Protein-GO triplet data. These triplets constitute the exterior structure of Gene Ontology, i.e. gene annotations. The data format is <h r t>, where h and t are the head and tail entities respectively. Unlike a GO-GO triplet, a Protein-GO triplet represents a specific gene annotation: the head entity is a specific protein, the tail entity is the corresponding GO term (e.g. the protein binding function), and r is the relation between the protein and the GO term.
- relation2id.txt: The ID mapping of relations. We mix the relations from the two triplet sets.
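As a quick illustration of the data format, the triplet files can be read as below. This is only a sketch assuming whitespace-separated <h r t> lines, not the repository's own loader.

# Minimal sketch for reading the triplet files described above. It assumes each
# line holds whitespace-separated <h r t> fields; use go2id.txt, protein2id.txt
# and relation2id.txt to map between names and IDs.
def load_triplets(path):
    triplets = []
    with open(path) as f:
        for line in f:
            h, r, t = line.split()
            triplets.append((h, r, t))
    return triplets

go_go = load_triplets("go_go_triplet.txt")
protein_go = load_triplets("protein_go_train_triplet.txt")
print(len(go_go), "GO-GO triples,", len(protein_go), "Protein-GO triples")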
To generate your own pre-training data, you need to download the following raw data:
- go.obo: the structure data of Gene Ontology. For the download link and detailed format, see Gene Ontology.
- uniprot_sprot.dat: the protein Swiss-Prot database. [link]
- goa_uniprot_all.gpa: gene annotation data. [link]
Once these raw data are downloaded, you can execute the following script to generate the pre-training data:
python tools/gen_onto_protein_data.py
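As an optional sanity check on the downloaded raw files before running the script above, you can load them with goatools and Biopython. The snippet below is only an illustration; the actual parsing logic lives in tools/gen_onto_protein_data.py.

# Optional sanity check of the downloaded raw data (illustrative only).
from goatools.obo_parser import GODag
from Bio import SwissProt

godag = GODag("go.obo")                      # Gene Ontology structure
print(len(godag), "GO terms loaded")

with open("uniprot_sprot.dat") as handle:    # Swiss-Prot flat file
    record = next(SwissProt.parse(handle))   # first protein record
    print(record.entry_name, len(record.sequence), "residues")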
Our experiments involve several protein-related downstream tasks. [Download datasets]
You can pre-train your own OntoProtein based on the pre-training dataset above. We provide the script bash script/run_pretrain.sh to run pre-training. The detailed arguments are all listed in src/training_args.py, where you can set the pre-training hyperparameters to your needs.
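If you want to see how those arguments are consumed, dataclass-style arguments like the ones in src/training_args.py are typically parsed with the Hugging Face HfArgumentParser pattern. The snippet below is a hypothetical sketch; the class and field names are placeholders, check src/training_args.py for the real ones.

# Hypothetical sketch of the HfArgumentParser pattern for dataclass-style
# training arguments; the class and field names below are placeholders.
# Run e.g.: python parse_example.py --output_dir out
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ExamplePretrainArguments:
    ke_loss_weight: float = field(default=1.0, metadata={"help": "Weight of the KE objective."})

parser = HfArgumentParser((TrainingArguments, ExamplePretrainArguments))
training_args, pretrain_args = parser.parse_args_into_dataclasses()
print(training_args.learning_rate, pretrain_args.ke_loss_weight)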
We have released the checkpoint of the pre-trained model in the Hugging Face model library. [Download model]
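Once downloaded, the checkpoint can be loaded with the standard transformers API. The model identifier below is a placeholder assumption (use the id from the [Download model] link); the input convention of space-separated amino acids follows ProtBert, which OntoProtein is initialized from.

# Illustrative loading of the released checkpoint; replace the model id with
# the one given by the [Download model] link.
from transformers import AutoModel, AutoTokenizer

model_id = "zjunlp/OntoProtein"          # placeholder id, see the download link
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, sequence length + special tokens, hidden size)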
The shell files for training and evaluation of every task are provided in script/ and can be run directly.
Alternatively, you can use the running code run_downstream.py and write your own shell files according to your needs:

- run_downstream.py: supports the {ss3, ss8, contact, remote_homology, fluorescence, stability} tasks.
Run the shell files with bash script/run_{task}.sh; the contents of the shell files are as follows:
sh run_main.sh \
--model ./model/ss3/ProtBertModel \
--output_file ss3-ProtBert \
--task_name ss3 \
--do_train True \
--epoch 5 \
--optimizer AdamW \
--per_device_batch_size 2 \
--gradient_accumulation_steps 8 \
--eval_step 100 \
--eval_batchsize 4 \
--warmup_ratio 0.08 \
--frozen_bert False
You can set more detailed parameters in run_main.sh; its details are as follows:
LR=3e-5
SEED=3
DATA_DIR=data/datasets
OUTPUT_DIR=data/output_data/$TASK_NAME-$SEED-$OI
python run_downstream.py \
--task_name $TASK_NAME \
--data_dir $DATA_DIR \
--do_train $DO_TRAIN \
--do_predict True \
--model_name_or_path $MODEL \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $EB \
--gradient_accumulation_steps $GS \
--learning_rate $LR \
--num_train_epochs $EPOCHS \
--warmup_ratio $WR \
--logging_steps $ES \
--eval_steps $ES \
--output_dir $OUTPUT_DIR \
--seed $SEED \
--optimizer $OPTIMIZER \
--frozen_bert $FROZEN_BERT \
--mean_output $MEAN_OUTPUT
Notice: the best checkpoint is saved in OUTPUT_DIR/.
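For later inference you can reload that checkpoint with the transformers API. The snippet below is illustrative; the directory name depends on $TASK_NAME, $SEED and $OI in run_main.sh, and the task-specific head used by run_downstream.py may differ from the plain encoder loaded here.

# Illustrative reloading of a fine-tuned checkpoint from OUTPUT_DIR/ (placeholder path).
from transformers import AutoModel

ckpt_dir = "data/output_data/<TASK_NAME>-<SEED>-<OI>"   # placeholder, see OUTPUT_DIR above
model = AutoModel.from_pretrained(ckpt_dir)
model.eval()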
@article{DBLP:journals/corr/abs-2201-11147,
author = {Ningyu Zhang and
Zhen Bi and
Xiaozhuan Liang and
Siyuan Cheng and
Haosen Hong and
Shumin Deng and
Jiazhang Lian and
Qiang Zhang and
Huajun Chen},
title = {OntoProtein: Protein Pretraining With Gene Ontology Embedding},
journal = {CoRR},
volume = {abs/2201.11147},
year = {2022},
url = {https://arxiv.org/abs/2201.11147},
eprinttype = {arXiv},
eprint = {2201.11147},
timestamp = {Wed, 02 Feb 2022 15:00:01 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2201-11147.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}