
Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations

Abstract

With the emergence of single-cell foundation models, an important question arises: how do these models perform when trained on datasets with an imbalanced cell-type distribution due to rare cell types or biased sampling? Using single-cell data with skewed cell-type distributions, we benchmark the foundation models scGPT, scBERT, and Geneformer on cell-type annotation. While all models showed reduced performance when challenged with rare cell types, scGPT and scBERT performed better than Geneformer. Notably, in contrast to scGPT and scBERT, Geneformer uses the ordinal positions of tokenized genes rather than the actual raw gene expression values. To mitigate the effect of a skewed distribution, we find that random oversampling, but not random undersampling, improved performance for all three foundation models. Finally, scGPT, using FlashAttention, achieves the fastest computational speed, whereas scBERT is much more memory-efficient. We conclude that tokenization and data representation are essential areas of research, and new strategies are needed to mitigate the effects of imbalanced learning in single-cell foundation models.

In this repository, we share the code and data needed to reproduce our results.

Setup

Although each foundation model has its own configuration steps, the following universal configuration works for all of them:

  • Python version: 3.8.17.
  • CUDA version: 11.7.
  • GPU: NVIDIA A100.
  • All dependencies are listed in requirements.txt; install them with pip install -r requirements.txt.
  • To install FlashAttention without any problems, use the PyTorch Docker container. We used Singularity as follows:
    singularity pull docker://nvcr.io/nvidia/pytorch:22.08-py3
    singularity run --nv pytorch_2.0.0-cuda11.7-cudnn8-devel.sif
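
A quick way to sanity-check the resulting environment (a minimal sketch, not part of the repository; the expected values reflect the configuration above):

    # Verify the PyTorch/CUDA/GPU setup described above (a sketch, not repo code).
    import torch

    print("torch:", torch.__version__)
    print("cuda:", torch.version.cuda)                # expect "11.7"
    if torch.cuda.is_available():
        print("gpu:", torch.cuda.get_device_name(0))  # expect an NVIDIA A100

    try:
        import flash_attn  # used by scGPT's FlashAttention path
        print("FlashAttention: available")
    except ImportError:
        print("FlashAttention: not installed")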
    

Models

For simplicity, we provide Jupyter notebooks with the cell-type annotation code for each foundation model inside the models directory:

To access the broader codebase for each foundation model and their papers, please refer to the following links:

Datasets & Preprocessing

Datasets (MS and Zheng68K) and their sampling variations (undersampled, oversampled, and imputed) can be found inside the data directory. You can also find the preprocessing code for undersampling, oversampling, imputation, and converting the gene expression matrix into ordinal gene tokens (for Geneformer) in data/preprocessing.ipynb.
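
For reference, here is a minimal sketch of the random resampling used to rebalance cell types. It assumes an AnnData object with a "cell_type" column in .obs; the function name and arguments are illustrative, not this repository's exact code:

    # Random oversampling: duplicate minority-class cells until every class
    # matches the largest one (a sketch under the assumptions stated above).
    import numpy as np
    import anndata as ad

    def random_oversample(adata: ad.AnnData, label_key: str = "cell_type",
                          seed: int = 0) -> ad.AnnData:
        rng = np.random.default_rng(seed)
        counts = adata.obs[label_key].value_counts()
        target = counts.max()
        keep = []
        for label, n in counts.items():
            idx = np.where(adata.obs[label_key] == label)[0]
            keep.extend(idx)                  # keep every original cell
            if n < target:                    # resample extras with replacement
                keep.extend(rng.choice(idx, size=target - n, replace=True))
        return adata[np.array(keep, dtype=int)].copy()

Random undersampling is symmetric: draw counts.min() cells per class without replacement. Likewise, a minimal sketch of the ordinal (rank-value) gene tokenization used for Geneformer, where each cell becomes a sequence of gene tokens ordered by median-normalized expression rather than raw values; gene_ids and gene_medians are hypothetical inputs:

    # Rank-value encoding for one cell: order expressed genes by expression
    # normalized with per-gene medians, then truncate to the model's input size.
    def rank_tokenize(expr: np.ndarray, gene_ids: np.ndarray,
                      gene_medians: np.ndarray, max_len: int = 2048) -> np.ndarray:
        norm = expr / gene_medians            # per-gene median normalization
        expressed = np.nonzero(norm)[0]       # only expressed genes get tokens
        order = expressed[np.argsort(-norm[expressed])]
        return gene_ids[order][:max_len]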

Pretrained Model Weights

You can find links to the pretrained model weights for all foundation models inside the pretrained_model_weights directory.

Contact Us

If you encounter any issues and need assistance, please submit an issue on this GitHub repository, or contact us via [email protected]. Thank you for your interest in reproducing our results!

About

Code for benchmarking single-cell foundation models (scGPT, scBERT, and Geneformer) on the cell-type annotation task using data with a skewed single-cell cell-type distribution.
