mtSC is a novel, flexible and generalized multitask deep metric learning-based framework for single-cell assignment based on multiple references. Previous strategies for single-cell assignment with multiple references rely on data-level or decision-level integration and have notable limitations. In contrast, mtSC regards each reference dataset as a task, so that different tasks complement each other to improve single-cell assignment while overcorrection of the batch effect is avoided. This integration strategy provides a flexible and reliable way to integrate related reference datasets. In addition, two further advantages of mtSC were demonstrated in this study: (1) mtSC performs increasingly better as the number of reference datasets increases, and (2) mtSC enables cross-species single-cell assignment, which is especially useful for tissues with very few sequencing datasets available for a given species. These two characteristics of mtSC will be of great utility as more sequencing data from different species accumulate.
mtSC comprises two main steps: model learning and cell assignment.
- (1) In the model learning process of mtSC, each dataset is considered a single task, and a corresponding loss is calculated. All the losses are summed and used to update the model parameters through backpropagation, and the resulting parameter-shared deep metric learning network (PS-DMLN) is then used in the subsequent cell assignment process.
- (2) In the cell assignment process of mtSC, the trained PS-DMLN is used to transform the query cells. The transformed query cells are then compared against the transformed reference cells, and each query cell is assigned the cell type with the highest similarity across all transformed reference datasets (a minimal sketch of both steps follows below).
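To make the workflow concrete, here is a minimal, illustrative PyTorch-style sketch of the two steps. It is not the actual mtSC implementation: the network architecture, the triplet-style metric loss and all names in it are assumptions made for illustration.

```python
# Illustrative sketch of the mtSC workflow (not the official implementation).
# Assumptions: a simple MLP embedder stands in for the PS-DMLN, a triplet-style
# metric loss is used per reference dataset ("task"), and cosine similarity is
# used for assignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSDMLN(nn.Module):
    """Parameter-shared embedding network (stand-in for the PS-DMLN)."""
    def __init__(self, n_genes, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_genes, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # unit-length embeddings

def train_step(model, optimizer, task_batches, margin=0.5):
    """One update: sum the metric losses over all reference datasets (tasks)."""
    loss_fn = nn.TripletMarginLoss(margin=margin)
    total_loss = 0.0
    for anchor, positive, negative in task_batches:   # one triplet batch per task
        total_loss = total_loss + loss_fn(model(anchor), model(positive),
                                          model(negative))
    optimizer.zero_grad()
    total_loss.backward()                             # joint backpropagation
    optimizer.step()
    return float(total_loss)

def assign(model, query, references):
    """Assign each query cell to the type of its most similar reference cell."""
    with torch.no_grad():
        q = model(query)                              # transform query cells
        best_sim = torch.full((q.shape[0],), -1.0)
        best_type = [None] * q.shape[0]
        for ref_expr, ref_labels in references:       # each transformed reference set
            r = model(ref_expr)
            sim, idx = (q @ r.T).max(dim=1)           # cosine similarity (unit vectors)
            for i in range(q.shape[0]):
                if sim[i] > best_sim[i]:
                    best_sim[i] = sim[i]
                    best_type[i] = ref_labels[int(idx[i])]
    return best_type
```

In this sketch the per-task losses are simply summed before a single backward pass, mirroring the joint update described above; the specific metric loss is an assumption of the sketch, not a statement about the mtSC code.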
Environment: Python>=3.6
- (1) Clone the repository
git clone https://github.com/bm2-lab/mtSC.git
- (2) Install the dependencies
pip install mtSC
- (3) Create folder
Linux or Mac:
cd mtSC
sh create_folder.sh
Windows:
cd mtSC
create_folder.bat
Routine normalization and quality control should be performed first. For example, three commonly used cell quality criteria are the number of genes detected (default >500), the number of unique molecular identifiers (UMIs) detected (default >1500), and the percentage of mitochondrial genes detected (default <10% of all genes). Datasets should then be normalized, i.e., scaled to 10,000 counts per cell followed by log(counts+1). Training data should be provided in csv, tab-delimited txt, or h5ad (scanpy) format, where the columns correspond to genes and the rows correspond to cells. The cell type column should be the last column and must be named "cell_label". In short, a training file is a transposed normalized expression matrix with a cell type column appended on the right (a minimal preprocessing sketch is shown after the sample table below). A sample file looks something like:
| | tspan6 | dpm1 | cell_label |
|---|---|---|---|
| pbmc1_SM2_Cell_133 | 0.745639 | 0.0 | CD4+_T_cell |
| pbmc1_SM2_Cell_142 | 0.0 | 0.778851 | B_Cell |
The format of test data is the same as that of training data, but without the "cell_label" column.
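As a rough, illustrative sketch of the preprocessing described above (assuming a raw counts AnnData file and a hypothetical `cell_type` annotation; this is not the bundled `preprocess.py`):

```python
# Illustrative preprocessing sketch (assumed file and column names).
import scanpy as sc

adata = sc.read_h5ad("reference_counts.h5ad")          # raw counts, cells x genes

# Quality control with the default thresholds described above.
sc.pp.filter_cells(adata, min_genes=500)               # >500 genes detected
sc.pp.filter_cells(adata, min_counts=1500)             # >1500 UMIs detected
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None,
                           log1p=False, inplace=True)
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()  # <10% mitochondrial counts

# Normalization: scale to 10,000 counts per cell, then log(counts + 1).
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Build the training table: cells x genes with "cell_label" as the last column.
df = adata.to_df()
df["cell_label"] = adata.obs["cell_type"].values       # assumes this annotation exists
df.to_csv("train_set/reference_train.csv")
```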
We also provide a script `preprocess.py` to handle cell quality control and normalization of the original counts matrix. The processed file name will end with "_treated.h5", and the file can be read with the pandas package's `read_hdf` function. After processing by the script, you just need to append the cell type column on the right of the processed matrix (a minimal sketch of this step is shown after the command below). The original counts matrix should be in csv, tab-delimited txt, or h5ad (scanpy) format, where the columns correspond to cells and the rows correspond to genes, for example:
| | pbmc1_SM2_Cell_133 | pbmc1_SM2_Cell_142 |
|---|---|---|
| tspan6 | 4 | 0 |
| dpm1 | 0 | 9 |
The script can be run with the following command, where `filename` is the file name of the original counts matrix dataset.
python preprocess.py -f filename
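After `preprocess.py` finishes, the follow-up step described above might look like this; the file names and the per-barcode label table are assumptions made for illustration.

```python
# Illustrative follow-up: load the processed matrix and append "cell_label".
import pandas as pd

expr = pd.read_hdf("reference_treated.h5")             # cells x genes, after preprocess.py
labels = pd.read_csv("cell_labels.csv", index_col=0)   # assumed per-barcode annotation
expr["cell_label"] = labels.loc[expr.index, "cell_type"]
expr.to_csv("train_set/reference_train.csv")
```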
- Example datasets: The example data are available at https://www.jianguoyun.com/p/DbqGHM0Q7oftCBiFjNQD. Download the whole `train_set` and `test_set` folders. Once downloaded, unzip `train_set.zip` and `test_set.zip`, then use the unzipped `train_set` and `test_set` folders to replace the original `train_set` and `test_set` folders. The demo example can then be run with the following command.
python run.py
- Cell assignment with a model trained on your own datasets: Put the reference dataset files in the `train_set` folder, put the query dataset files in the `test_set` folder, and run the following command.
python run.py
- Cell assignment with pre-trained models: You can directly use the models we have trained. They can be downloaded from https://www.jianguoyun.com/p/DbZJv8QQ7oftCBiPg9QD. Once downloaded, unzip `pre_trained.zip` and use the unzipped `pre_trained` folder to replace the original `pre_trained` folder. There are four trained models: 'brain', 'PBMC-Ding', 'PBMC-Mereu' and 'pancreas'. You can view `cell_type.txt` to select the model that satisfies your needs. For illustration purposes, we take the 'brain' model as an example: just put the query dataset file in the `test_set` folder and run the following command.
python test_with_trained_model.py -m brain
Alternatively, if you have trained your own model by running `run.py`, you can use it with:
python test_with_trained_model.py -m new_model
- Check the results: For all the above tests, the assignment results can be found in `result.txt` after running. The content of `result.txt` consists of two columns: the first column contains the cell barcodes and the second column contains the predicted cell types (a loading sketch follows the example below). It looks like:
pbmc1_SM2_Cell_133 CD4+_T_cell
pbmc1_SM2_Cell_142 B_Cell
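If you want to load the predictions programmatically, a minimal sketch with pandas, assuming the two columns are whitespace-separated as shown above:

```python
# Load the assignment results (assumes whitespace-separated columns).
import pandas as pd

results = pd.read_csv("result.txt", sep=r"\s+", header=None,
                      names=["cell_barcode", "predicted_cell_type"])
print(results.head())
```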
Duan B, Chen S, Chen X, Zhu C, Tang C, Wang S, Gao Y, Fu S, Liu Q. Integrating multiple references for single-cell assignment. Nucleic Acids Res. 2021 May 25:gkab380. doi: 10.1093/nar/gkab380. Epub ahead of print. PMID: 34037791.