Le Zhuo*, Zewen Chi*, Minghao Xu*, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang
This repository hosts the code, data and model weights of ProtLLM, a versatile cross-modal large language model for both protein-centric and protein-language tasks.
- Release the code for retrieval.
- Release the raw InterPT dataset.
- Update the Hugging Face version of ProtLLM.
- ...
- Clone this repository and navigate to the ProtLLM folder
git clone https://github.com/ProtLLM/ProtLLM.git
cd ProtLLM
- Install Package
conda create -n protllm python=3.10 -y
conda activate protllm
pip install -e .
We release the pre-processed version of our InterPT dataset, all datasets for downstream tasks, and pre-trained checkpoints on Hugging Face.
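As a reference, the snippet below shows one way to fetch a dataset or checkpoint with the `huggingface_hub` client. This is a minimal sketch: the repository IDs and local paths are placeholders, so substitute the ones listed on our Hugging Face page.

```python
# Minimal sketch: download pre-processed data / checkpoints from Hugging Face.
# The repo IDs and local paths below are placeholders -- replace them with the
# actual ProtLLM dataset and model repositories listed on the Hugging Face page.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(
    repo_id="ProtLLM/InterPT",   # placeholder dataset repo ID
    repo_type="dataset",
    local_dir="data/interpt",
)
ckpt_dir = snapshot_download(
    repo_id="ProtLLM/ProtLLM",   # placeholder model repo ID
    local_dir="checkpoints/protllm",
)
print(data_dir, ckpt_dir)
```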
For pre-training, first download the pre-processed dataset from Hugging Face and then run the following script:
bash scripts/pretrain.sh
We provide fine-tuning scripts to reproduce all results of ProtLLM on various protein-centric tasks, including Enzyme Commission (EC) number prediction, Gene Ontology (GO) term prediction, and Protein-Protein Interaction (PPI) prediction. By default, we use the pre-trained ProtST-ESM-2 as the protein encoder, which can be downloaded from the ProtST repository. After downloading the processed dataset from Hugging Face, you can run the following script to fine-tune ProtLLM on a specific downstream task:
bash scripts/finetune.sh
The detailed hyperparameters and settings for each task can be found in the appendix of our paper. Note that we also fine-tune the weights of the protein encoder for the GO and EC prediction tasks, which can be done by setting `--lr_ratio` to 0.1 in the fine-tuning script.
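For intuition, here is a minimal sketch of how a learning-rate ratio for the protein encoder could be wired up with PyTorch parameter groups. The attribute names `protein_encoder` and `llm` are assumptions for illustration; this is not the repository's actual optimizer setup.

```python
import torch

def build_optimizer(model, base_lr=1e-4, lr_ratio=0.1):
    # Sketch only: `model.protein_encoder` and `model.llm` are assumed names.
    param_groups = [
        # Protein encoder trains at a fraction of the base learning rate.
        {"params": model.protein_encoder.parameters(), "lr": base_lr * lr_ratio},
        # Remaining (LLM-side) parameters use the base learning rate.
        {"params": model.llm.parameters(), "lr": base_lr},
    ]
    return torch.optim.AdamW(param_groups)
```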
After fine-tuning ProtLLM on protein-centric tasks, you can evaluate its performance by running the following script:
bash scripts/eval.sh
Remember to set `--task` to the target task name and `--n_labels` to the number of labels for the task. You should also set the LoRA hyperparameters `--sft_lora_r` and `--sft_lora_alpha` to the values you used in the fine-tuning script.
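The `--sft_lora_r` and `--sft_lora_alpha` flags correspond to the standard LoRA rank and scaling factor, so the evaluation-time values must match those used at fine-tuning time. As a rough illustration only (assuming the `peft` library; the target modules below are placeholders, not the exact set used by ProtLLM), they map to a configuration like this:

```python
from peft import LoraConfig, get_peft_model

# Rough illustration: the LoRA rank/alpha used at fine-tuning time must be
# reused at evaluation time. Target modules are placeholders.
lora_config = LoraConfig(
    r=16,                # keep equal to --sft_lora_r from fine-tuning
    lora_alpha=32,       # keep equal to --sft_lora_alpha from fine-tuning
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
# model = get_peft_model(base_llm, lora_config)
```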
Run the following script to perform in-context learning with ProtLLM (using PPI prediction as an example):
bash scripts/icl.sh
You can specify the `--n_demo` argument to control the number of demonstration samples.
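Conceptually, in-context learning builds a prompt from `--n_demo` labeled demonstrations followed by the query. The sketch below uses made-up field names and wording to show the idea for PPI prediction; it does not reproduce ProtLLM's actual prompt template.

```python
# Conceptual sketch of building an in-context-learning prompt for PPI
# prediction. Field names and wording are illustrative only.
def build_icl_prompt(demos, query, n_demo=4):
    lines = []
    for demo in demos[:n_demo]:
        # Each demonstration pairs two proteins with its known label.
        lines.append(
            f"Protein A: {demo['protein_a']} Protein B: {demo['protein_b']} "
            f"Interaction: {'yes' if demo['label'] else 'no'}"
        )
    # The query is appended last, leaving the label for the model to predict.
    lines.append(
        f"Protein A: {query['protein_a']} Protein B: {query['protein_b']} "
        f"Interaction:"
    )
    return "\n".join(lines)
```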
If you have any questions related to the code or the paper, feel free to contact Le Zhuo, Zewen Chi, and Minghao Xu.
If you find our work useful in your research, please consider citing ProtLLM:
@article{zhuo2024protllm,
title={ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training},
author={Le Zhuo and Zewen Chi and Minghao Xu and Heyan Huang and Heqi Zheng and Conghui He and Xian-Ling Mao and Wentao Zhang},
journal={arXiv preprint arXiv:2403.07920},
year={2024}
}