scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data
Reliable cell type annotation is a prerequisite for downstream analysis of single-cell RNA sequencing data. Existing annotation algorithms typically suffer from improper handling of batch effect, lack of curated marker gene lists, or difficulty in leveraging the latent gene-gene interaction information. Inspired by large scale pretrained langurage models, we present a pretrained deep neural network-based model scBERT (single-cell Bidirectional Encoder Representations from Transformers) to overcome the above challenges. scBERT follows the state-of-the-art paradigm of pre-train and fine-tune in the deep learning field. In the first phase of scBERT, it obtains a general understanding of gene-gene interaction by being pre-trained on huge amounts of unlabeled scRNA-seq data. The pre-trained scBERT can then be used for the cell annotation task of unseen and user-specific scRNA-seq data through supervised fine-tuning. For more information, please refer to https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1
The data can be downloaded from these links. If you have any question, please contact [email protected].
https://drive.weixin.qq.com/s?k=AJEAIQdfAAozQt5B8k https://drive.google.com/file/d/1fNZbKx6LPeoS0hbVYJFI8jlDlNctZxlU/view?usp=sharing
The pre-trained model checkpoint can be downloaded from this link. If you have any question, please contact [email protected].
https://drive.weixin.qq.com/s?k=AJEAIQdfAAoUxhXE7r
The test single-cell transcriptomics data file should be pre-processed by first revising gene symbols according to NCBI Gene database updated on Jan. 10, 2020, wherein unmatched genes and duplicated genes will be removed. Then the data should be normalized with the sc.pp.normalize_total
and sc.pp.log1p
method in scanpy
(Python package), detailed in preprocess.py
.
You can download this repo and run the demo task on your computing machine within about 4 hours.
- Fine-tune using pre-trained models
python -m torch.distributed.launch --data_path "fine-tune_data_path" --model_path "pretrained_model_path" finetune.py
- Predict using fine-tuned models
python --data_path "test_data_path" --model_path "finetuned_model_path" predict.py
- Detection of novel cell type
The detection of novel cell type can be done by thresholding the predicted probabilities. (Default threshold=0.5)
python --data_path "test_data_path" --model_path "finetuned_model_path" --novel_type True --unassign_thres "custom_threshold" predict.py
- Expected output
The expected output of model inference is the cell type of each individual cell.
- Guidance for hyperparameter selection
You can select the hyperparameters of the Performer encoder based on your data and task in:
model = PerformerLM(
num_tokens = 7,
dim = 200,
depth = 6,
heads = 10
)
Hyperparameter | Description | Default | Arbitrary range |
---|---|---|---|
num_tokens | Number of bins in expression embedding | 7 | [5, 7, 9] |
dim | Size of scBERT embedding vector | 200 | [100, 200] |
heads | Number of attention heads of Performer | 10 | [8, 10, 20] |
depth | Number of Performer encoder layers | 6 | [4, 6, 8] |
Typical install time on a "normal" desktop computer is about 30 minutes.
Exptected run time for infering 10,000 cells on a "normal" desktop computer is about 25 minutes.
This tool is for research purpose and not approved for clinical use.
This is not an official Tencent product.
This tool is developed in Tencent AI Lab.
The copyright holder for this project is Tencent AI Lab.
All rights reserved.
Yang, F., Wang, W., Wang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell (2022). https://doi.org/10.1038/s42256-022-00534-z