This repository contains data and code necessary for assembling the PRECISE-1K E. coli K-12 MG1655 expression and transcriptional regulation knowledgebase, including computing and analyzing iModulons, analyzing expression patterns, and generating all figures for the associated publication.
This repository also includes data, assembly, iModulon computation, and analysis for the larger, 2710-sample "Public K-12" RNA-seq resource.
To analyze new data with the PRECISE-1K knowledgebase, we recommend starting with the lightweight precise1k-analyze repository.
For an overview of iModulons, we recommend iModulonDB's About Page.
Ensure you have Python 3.6+, Jupyter (pip install jupyterlab), and the iModulon analysis package pymodulon installed (pip install pymodulon). pymodulon documentation can be found here. Installing pymodulon using pip will install all other needed packages.
Then clone this repository onto your local machine by running git clone https://github.com/SBRG/precise1k.git in your command prompt/terminal. This command copies the repository into a folder called precise1k in the directory from which the command was executed. The repository contains many data files, so cloning may take a few minutes.
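Before opening the notebooks, it can help to confirm that the requirements above are met. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it assumes the import names pymodulon and jupyterlab match the pip package names.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this README
# Import names are assumed to match the pip package names
REQUIRED_PACKAGES = ("pymodulon", "jupyterlab")


def check_environment(packages=REQUIRED_PACKAGES, min_python=MIN_PYTHON):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: " + name)
    return problems
```

Calling check_environment() in a Python session returns an empty list when everything is installed; otherwise each entry names what is missing.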
The repository is split into 2 subdirectories: data and notebooks. All analyses are presented as Jupyter notebooks in the notebooks subdirectory, using data from the data directory.
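As an illustration of how a notebook or script might reach files in the data directory, here is a hedged sketch that builds the path to one of the gzipped IcaData files listed in this README and loads it with pymodulon's load_json_model. The helper names ica_data_path and load_ica_data are hypothetical, and the sketch assumes each file sits directly inside its dataset subdirectory under data.

```python
from pathlib import Path

# File names as listed in this README; their location directly under
# data/<dataset>/ is an assumption of this sketch
ICA_DATA_FILES = {
    "precise1k": "precise1k.json.gz",
    "k12_modulome": "k12_modulome.json.gz",
    "precise": "precise.json.gz",
}


def ica_data_path(repo_root, dataset="precise1k"):
    """Build the path to a dataset's gzipped IcaData JSON file."""
    return Path(repo_root) / "data" / dataset / ICA_DATA_FILES[dataset]


def load_ica_data(repo_root, dataset="precise1k"):
    """Load an IcaData object; pymodulon is imported lazily so the
    path helper above can be used even when pymodulon is absent."""
    from pymodulon.io import load_json_model

    return load_json_model(str(ica_data_path(repo_root, dataset)))
```

For example, load_ica_data("precise1k") would load the PRECISE-1K object when run from the directory that contains the cloned precise1k folder.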
The data directory contains further subdirectories with the following constituent files:
- annotation: contains gene and regulatory network annotations for K-12 MG1655
- k12_modulome: contains sample metadata, QC statistics, iModulon M and A matrices, and iModulon metadata for the "Public K-12" dataset, which consists of PRECISE-1K (1035 samples) combined with 1675 high-quality publicly available samples
  - A.csv: the iModulon A (activity) matrix computed for the Public K-12 dataset
  - Escherichia_coli_20220127.tsv: raw metadata for publicly available E. coli RNA-seq data downloaded from SRA on January 27, 2022 using the modulome workflow
  - K12_metadata.tsv: curated public sample metadata for RNA-seq samples identified as coming from E. coli strain K-12
  - M.csv: the iModulon M (modulon) matrix computed for the Public K-12 dataset
  - bioproject_list.csv: the raw list of unique BioProjects from Escherichia_coli_20220127.tsv
  - bioproject_list_curated.csv: a manually curated set of BioProjects from bioproject_list.csv
  - component_stats.csv: raw statistics for the Public K-12 iModulons
  - counts.csv: the raw read counts computed from the raw public read data in K12_metadata.tsv using the modulome RNA-seq processing workflow
  - imodulon_table.csv: the iModulon metadata table, containing iModulon regulatory enrichment statistics, functional categorizations, and other metadata
  - k12_modulome.json.gz: the IcaData object for the Public K-12 dataset, containing the imodulon_table, M, and A matrices, for use with pymodulon
  - k12_only_p1k_ctrl.json.gz and k12_only_proj_reg.json.gz: IcaData objects for iModulons computed using just the publicly available data (i.e. without appending it to PRECISE-1K)
  - metadata_qc.csv: the final, curated sample metadata for the Public K-12 dataset, without PRECISE-1K metadata
  - metadata_qc_blah: intermediate metadata files containing samples that were excluded during various QC steps (see Public K-12 QC notebook Pt 1 and Public K-12 QC notebook Pt 2)
  - multiqc_stats.tsv: raw QC statistics for the publicly available samples
  - NOTE: log_tpm files for the public dataset are not provided as they are too large for GitHub; please contact us for these.
- precise: contains iModulon files from the original PRECISE publication (also available at iModulonDB's PRECISE-278 page)
  - A.csv: iModulon A (activity) matrix
  - M.csv: iModulon M (modulon) matrix
  - M_thresholds.csv: thresholds for determining iModulon gene membership from M
  - iM_table.csv / imodulon_table.csv: iModulon metadata, including regulatory enrichments
  - log_tpm.csv: log2[TPM] data (i.e. PRECISE itself)
  - log_tpm_norm.csv: log2[TPM] data, centered to the control condition (M9 glucose); also known as the X matrix, or the direct input to the ICA pipeline
  - metadata_qc.csv: curated sample information for PRECISE, slightly modified from sample_table.csv
  - precise.json.gz: an IcaData object containing PRECISE and its iModulon information, for use with pymodulon
  - sample_table.csv: metadata as downloaded from iModulonDB
- precise1k: expression and iModulon data for the core expression compendium in this resource, PRECISE-1K
  - multiqc_data: raw QC statistics for PRECISE-1K samples
  - subsamples: sample IDs for random subsamples generated from PRECISE-1K for a specific analysis
  - A.csv: the iModulon A matrix for PRECISE-1K
  - M.csv: the iModulon M matrix
  - component_stats.csv: raw statistics for the PRECISE-1K iModulons
  - counts.csv: the raw read count data from which log2[TPM] data was computed; output of the modulome workflow
  - crp_binding.csv: data from a simulated binding curve for CRP analysis (see manuscript Figure 4)
  - deg_dima_result.csv: results from comparing differential iModulon activity (DiMA) analysis to differential gene expression (DGE) analysis
  - imodulon_table.csv: metadata for PRECISE-1K iModulons, including regulatory enrichment statistics and functional categorizations
  - log_tpm.csv: the log2[TPM] data, BEFORE quality control (1055 samples)
  - log_tpm_norm_qc.csv: log2[TPM] data, centered to the reference condition (M9 glucose), after QC; also known as the X matrix, the direct input to ICA
  - log_tpm_qc.csv: the quality-controlled log2[TPM] data, i.e. PRECISE-1K itself (1035 samples)
  - log_tpm_qc_w_short_low_fpkm.csv: log2[TPM] data before removal of ~100 short and very low read count genes
  - metadata.csv: metadata for PRECISE-1K samples BEFORE quality control (1055 samples)
  - metadata_qc.csv: post-QC sample metadata for PRECISE-1K (1035 samples)
  - multiqc_stats.tsv: summary of QC statistics for raw read data from the multiqc_data subdirectory
  - precise1k.json.gz: IcaData object containing iModulon matrices, iModulon and sample metadata, etc., for use with pymodulon
- regulation: contains raw files used to generate the TRN used for regulatory enrichment analysis
- sequence_files: contains the reference genome sequences used for alignment of raw reads in the modulome processing workflow
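The log_tpm_norm files above are described as log2[TPM] data centered to a reference condition. A minimal sketch of that centering step follows, assuming genes are rows, samples are columns, and centering means subtracting each gene's mean over the reference samples. The function name center_to_reference and the toy sample and gene names are made up for illustration; this is not the repository's own code.

```python
import pandas as pd


def center_to_reference(log_tpm, reference_samples):
    """Subtract each gene's mean over the reference samples from every sample."""
    reference_mean = log_tpm[reference_samples].mean(axis=1)
    return log_tpm.sub(reference_mean, axis=0)


# Toy log2[TPM] table: genes as rows, samples as columns (hypothetical names)
log_tpm = pd.DataFrame(
    {"ref_1": [5.0, 2.0], "ref_2": [7.0, 2.0], "treated_1": [9.0, 1.0]},
    index=["geneA", "geneB"],
)
log_tpm_norm = center_to_reference(log_tpm, ["ref_1", "ref_2"])
print(log_tpm_norm)
```

After centering, each gene averages to zero across the reference samples, so the remaining values read as log2 fold changes relative to the reference condition.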
The notebooks directory contains analyses, split between the k12_modulome and precise1k datasets. The PRECISE-1K analysis notebooks are described below (Public K-12 analyses are very similar):
- blah_figs: contains figure panel files used to generate paper figures; has a corresponding blah.ipynb notebook
- 1__QC_expression.ipynb: notebook for visualizing QC statistics and excluding low-quality data after raw read processing
- 1b__differential_gene_expression_R.ipynb: an R notebook for running traditional DGE
- 2__choose_dimensionality.ipynb: final step of the OptICA dimensionality selection method to determine the final M and A matrices
- 3__create_ica_data.ipynb: compiles the final M and A matrices, sample metadata, TRN, and gene information into a pymodulon-ready IcaData object; also computes iModulon thresholds to determine iModulon membership, and regulatory enrichment statistics
- 4__annotate_imodulons.ipynb: manual annotation and curation of iModulons
- fig__characterize_ytfs.ipynb: analysis of putative regulons YmfT and YgeV (Figure 3 in manuscript)
- fig__investigate_activities.ipynb: DiMA analysis, activity clustering (stimulons), CRP regulon segmentation
- fig__investigate_expression.ipynb: analyses of PRECISE-1K itself, i.e. the log2[TPM] expression data (Figure 1)
- fig__subsamples.ipynb: analysis of PRECISE-1K subsample iModulon sets (subsample IcaData objects are not included due to GitHub size limitations; request if needed)
- fig__summarize_dataset.ipynb: overview of sample conditions in PRECISE-1K
- fig__summarize_imodulons.ipynb: overview of iModulon categories, enrichments, and regulatory coverage (Figure 2)
- generate_p1k_subsamples.ipynb: for generation of PRECISE-1K subsamples analyzed in fig__subsamples.ipynb
- workflow__add_new_data.ipynb: example workflow for adding new data, demonstrated on an analysis of the aerobic transition (Figure 6); to add your own data, we recommend starting with the lightweight precise1k-analyze repository
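To make the idea of threshold-based iModulon membership concrete, here is a small sketch on toy data. It assumes the M matrix has genes as rows and iModulons as columns, and that a gene belongs to an iModulon when the absolute value of its weight exceeds that iModulon's threshold. The function name imodulon_members and all values are hypothetical, and the sketch does not show how the thresholds themselves are computed.

```python
import pandas as pd


def imodulon_members(M, thresholds):
    """Map each iModulon to the genes whose absolute M weight exceeds its threshold."""
    members = {}
    for imodulon in M.columns:
        weights = M[imodulon].abs()
        members[imodulon] = sorted(weights.index[weights > thresholds[imodulon]])
    return members


# Toy M matrix: genes as rows, iModulons as columns (names and values are made up)
M = pd.DataFrame(
    {"iM_1": [0.30, -0.02, 0.01], "iM_2": [0.00, -0.25, 0.12]},
    index=["geneA", "geneB", "geneC"],
)
thresholds = pd.Series({"iM_1": 0.10, "iM_2": 0.10})
print(imodulon_members(M, thresholds))
```

Using the absolute value means that strongly negative weights (such as geneB in iM_2) count toward membership as well as strongly positive ones.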
The iModulons computed from PRECISE-1K can be explored through an interactive web interface at iModulonDB.
Please cite our bioRxiv preprint if you find this repository useful! This link will be updated when this work is published.