tGPT:Generative pretraining from large-scale transcriptomes

tGPT

"Generative pretraining from large-scale transcriptomes: Implications for single-cell deciphering and clinical translation".

Introduction of tGPT

Exponential accumulation of single-cell transcriptomes poses great challenge for efficient assimilation. Here, we present an approach entitled tGPT towards integration of 22.3 million single-cell transcriptomes by modeling gene expression rankings as generative pretraining task tGPT is conceptually simple in that it autoregressively models the ranking of a gene in the context of its preceding neighbors. We demonstrated the high performance of tGPT on a range of fundamental single-cell analysis tasks and novel applications on bulk tissues. The single-cell clusters and cell lineage trajectories derived from tGPT are highly aligned with known cell labels and states. The feature patterns of tumor bulk tissues learned by tGPT are associated with a wide range of genomic alteration events, prognosis and treatment outcome of immunotherapy. tGPT represents a new analytical paradigm for integrating and deciphering massive amount of transcriptome data and it will facilitate the interpretation and clinical translation of single-cell transcriptomes

Architecture of tGPT

This figure consists of three components: development of tGPT, applications of tGPT for single-cell and bulk tissue transcriptomes.

The following packages are required:

The following example was tested with:

python==3.7.11 
numpy==1.20.0
torch==1.7.1
scanpy==1.9.1

Using tGPT on own data

Load packages

import re
import os
import sys
import gzip
import torch
import numpy as np
import pandas as pd
import scanpy as sc
from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2Model

Setting parameter and file path

device = "cuda" if torch.cuda.is_available() else "cpu" 
tokenizer_file = "lixiangchun/transcriptome-gpt-1024-8-16-64" 
## Pretrained model with sequence of 62 top expressing genes.
checkpoint = "lixiangchun/transcriptome-gpt-1024-8-16-64"
## Pretrained model with sequence of 126 top expressing genes.
#checkpoint = "lixiangchun/transcriptome-gpt-1024-8-16-64"
celltype_path = "./data/Muris_cell_labels.txt.gz" ## Cell type annotation
max_len = 64 ## Number of top genes used for analysis
text_file = "./data/Muris_gene_rankings.txt.gz"  ## Gene symbols ranked by exprssion

Extract features

class LineDataset(Dataset):
    def __init__(self, lines):
        self.lines = lines
        self.regex = re.compile(r'\-|\.')
    def __getitem__(self, i):
        return self.regex.sub('_', self.lines[i])
    def __len__(self):
        return len(self.lines)

tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_file)
model = GPT2LMHeadModel.from_pretrained(checkpoint,output_hidden_states = True).transformer
model = model.to(device)
model.eval()

lines = [s.decode().strip() for s in gzip.open(text_file, "r").readlines()]

ds = LineDataset(lines)
dl = DataLoader(ds, batch_size=64)

Xs = []
for a in tqdm(dl, total=len(dl)):
    batch = tokenizer(a, max_length= max_len, truncation=True, padding=True, return_tensors="pt")

    for k, v in batch.items():
        batch[k] = v.to(device)

    with torch.no_grad():
        x = model(**batch)

    eos_idxs = batch.attention_mask.sum(dim=1) - 1
    xx = x.last_hidden_state
    result_list = [[] for i in range(len(xx))]

    for j, item in enumerate(xx):
        result_list[j] = item[1:int(eos_idxs[j]),:].mean(dim =0).tolist()

    Xs.extend(result_list)

features = np.stack(Xs)

Visualization

adata = sc.AnnData(features)
celltype = pd.read_csv(celltype_path, header=None)[0].tolist()
adata.obs["celltype"] = celltype
adata.obs["celltype"] = adata.obs["celltype"].astype("category")

sc.pp.neighbors(adata,n_neighbors=20)
sc.tl.leiden(adata,resolution=0.6)
sc.tl.umap(adata)

#################  Cell Type  #######################
sc.pl.umap(adata, color = ["celltype"], show = True)

############ Single-cell Clustering  #############
sc.pl.umap(adata, color = ["leiden"], show = True)

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
__pycache__		__pycache__
data		data
figures		figures
gpt2tokenizer		gpt2tokenizer
LICENSE		LICENSE
README.md		README.md
example.ipynb		example.ipynb
train.py		train.py
train.sh		train.sh
train_parallel.sh		train_parallel.sh
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tGPT:Generative pretraining from large-scale transcriptomes

tGPT

Introduction of tGPT

Architecture of tGPT

The following packages are required:

Using tGPT on own data

Load packages

Setting parameter and file path

Extract features

Visualization

About

Releases

Packages

Contributors 2

Languages

License

deeplearningplus/tGPT

Folders and files

Latest commit

History

Repository files navigation

tGPT:Generative pretraining from large-scale transcriptomes

tGPT

Introduction of tGPT

Architecture of tGPT

The following packages are required:

Using tGPT on own data

Load packages

Setting parameter and file path

Extract features

Visualization

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages