This repository contains all of the simulation and analysis code for the manuscript "Assessing Markovian and delay models for single-nucleus RNA sequencing," a discussion of the prospects for solving and applying delay master equations in the context of RNA sequencing. In the first part of the report, we numerically solve a fairly generic set of systems with gene switching and mixed Markovian/non-Markovian downstream processing. In the second part, we analytically solve a set of bursty two-species systems with delayed steps, fit single-cell and single-nucleus RNA sequencing data, and compare the fits to draw conclusions about model suitability and identifiability.
metadata/
: All dataset metadata. Brain metadata were obtained from NeMO (single-cell, single-nucleus). Liver metadata were obtained from GSE185477 at GEO.counting/
: All scripts used to generate and pseudoalign FASTQ files.liver/
: scripts to download and process liver data.sc/
: bash scripts towget
SRA files andfasterq-dump
them to biological and technical read FASTQ files. Whitelist with all barcodes.sn/
: bash scripts towget
BAM file and process withbamtofastq
. Whitelist with all barcodes.
kb/
: scripts to pseudoalign datasets usingkb-python
.
polyA_ref/
: poly(A) count and gene length references, copied from the Monod example repository; used for internal QC but not for analysis.
The following packages were used to perform the analyses:
kb-tools 0.26.0
monod 0.2.6.0
numdifftools 0.9.40
pandas 1.2.4
numpy 1.23.2
scipy 1.9.1
loompy 3.0.6
matplotlib 3.6.0
PyMC 3.11.5
Theano-PyMC 1.1
Monod was used to filter data. The MCMC procedure was performed on fifty 3.7GHz cores on a dedicated server. Raw data and results results are available on Zenodo. The simulations can be immediately reproduced using Google Colaboratory.
To obtain brain FASTQ files, download them from NeMO. To obtain liver FASTQ files, use the scripts in counting/liver/
to convert BAM or SRA files.
To obtain whitelists for the liver dataset, run gg220908_liver_whitelist.ipynb
.
To generate count matrices, build a kb
intronic reference (as described previously) and run the scripts in counting/kb/
to obtain spliced and unspliced count matrices in the loom
format. The results of the kb
pipeline are available on Zenodo.
To run the inference pipeline, run gg230620_mcmc_allen.ipynb
and gg230621_mcmc_liver.ipynb
using the loom
files and annotations. This will perform the following tasks:
- Filter cells.
- Select a set of genes with at least moderate expression in all datasets for a category.
- Extract their count data to a Monod
SearchData
orsd
object. - Use PyMC3 to obtain model parameter posteriors for each gene, using the Monod solvers for likelihood functions.
- Generate
pickle
files for each search, comprising a list of PyMC3 traces.
To generate Figures 1c, 2, 3, S3, and S4 run gg230612_BF_analysis.ipynb
using the data and result files. These files are available on Zenodo, as are the intermediate results. This notebook also generates Supplementary Data File 1 (gg230705_smc_results.xlsx
).
To generate Figure S1, run delay_cat_sim_twospec
and delay_cauchy.ipynb
. To generate Figure S2, run gg220909_telegraph.ipynb
.
Some derivations in the manuscript are partially automated using MATLAB scripts. To reproduce the moment calculations, run gg230629_odesolv.mlx
. To reproduce the characteristic calculations, run gg220629_2species_moments.m
.