This repository contains the scripts and data files used in the DeepGOMeta manuscript.
- The code was developed and tested using Python 3.10.
- Clone the repository:
git clone https://github.com/bio-ontology-research-group/deepgometa.git
- Create a virtual environment with Conda or the python3-venv module.
- Install PyTorch:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
- Install DGL:
pip install dgl==1.1.2+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
- Install other requirements:
pip install -r requirements.txt
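Before moving on, you can sanity-check the environment with a few lines of Python. This snippet is not part of the repository; it is a minimal sketch that confirms PyTorch and DGL import correctly and that a GPU is visible:

```python
# check_env.py -- quick sanity check for the DeepGOMeta environment
# (a minimal sketch, not part of the repository)
import torch
import dgl

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"DGL {dgl.__version__}")

# A tiny graph exercises the DGL installation end to end.
g = dgl.graph(([0, 1], [1, 2]))
print(f"Toy graph: {g.num_nodes()} nodes, {g.num_edges()} edges")
```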
Follow these instructions to obtain predictions for your proteins. You'll need around 30 GB of storage and a GPU with more than 16 GB of memory (or you can run on a CPU).
- Download the data.tar.gz archive
- Extract it:
tar xvzf data.tar.gz
- Run the model:
python predict.py -if data/example.fa
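predict.py expects protein sequences in FASTA format, as in data/example.fa. A minimal pre-flight check like the one below (a sketch, not part of the repository; plain Python with no dependencies) can catch malformed input before a long prediction run:

```python
# validate_fasta.py -- minimal FASTA sanity check before running predict.py
# (a sketch, not part of the repository)
import sys

def validate_fasta(path):
    n_records, seq_len = 0, 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if n_records and seq_len == 0:
                    sys.exit(f"Empty record before {line!r}")
                n_records += 1
                seq_len = 0
            elif n_records == 0:
                sys.exit("File does not start with a FASTA header")
            else:
                seq_len += len(line)
    print(f"{path}: {n_records} records")

validate_fasta("data/example.fa")
```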
We also provide a Docker image with all dependencies installed:
docker pull coolmaksat/deepgometa
Inside the container, the repository is installed in the /workspace/deepgometa directory. To run the scripts you'll
need to mount the data directory. Example:
docker run --gpus all -v $(pwd)/data:/workspace/deepgometa/data coolmaksat/deepgometa python predict.py -if data/example.fa
For easier execution, DeepGOMeta can also be run as a Nextflow workflow using the Docker image.
Requirements:
- For amplicon data: an OTU table of relative abundances, with OTUs classified using the RDP database (a sketch of the expected shape follows this list)
- For WGS data: Protein sequences in FASTA format
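The exact OTU-table layout is defined by the workflow; as a rough illustration only (the sample and OTU names below are assumptions, not the workflow's required headers), a relative-abundance table might look like this when built with pandas:

```python
# Hypothetical illustration of an OTU relative-abundance table.
# Row/column names are assumptions for illustration only; consult
# the workflow documentation for the required format.
import pandas as pd

otu_table = pd.DataFrame(
    {
        "Sample1": [0.60, 0.25, 0.15],
        "Sample2": [0.10, 0.70, 0.20],
    },
    index=["OTU_1", "OTU_2", "OTU_3"],  # OTUs classified with the RDP database
)
# Relative abundances should sum to 1 within each sample.
assert (otu_table.sum(axis=0).round(6) == 1.0).all()
otu_table.to_csv("otu_relative_abd.tsv", sep="\t")
```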
- After cloning the repository, navigate to the Nextflow directory:
cd Nextflow
- Update the runOptions paths in nextflow.config
- Navigate to the data directory and download the genome annotations:
cd data_and_scripts
- Run the workflow. Example:
nextflow run DeepGOMeta.nf -profile docker/singularity --amplicon true --OTU_table otu_relative_abd.tsv --pkl_dir /PATH/TO/PKL/DIR/
- Data and metadata: download from SRA and MG-RAST using sample accessions
- Processing reads:
  - 16S reads: generate OTU tables using the Nextflow 16SProcessing workflow
  - WGS reads: obtain protein sequences using the assembly pipeline
- Functional annotation: run DeepGOMeta on the processed data to generate functional profiles
- Clustering and Purity: using a metadata file and the functional profile, apply PCA and k-means clustering, calculate cluster purity, and generate plots for the 16S and WGS datasets (a sketch follows this list)
- Information Content Calculation: create a .txt file for each sample containing the 16S and WGS predicted functions on separate tab-separated lines (e.g. 16Ssample\tGO1\tGO2\nWGSsample\tGO2\tGO3), compute the information content (IC) of each function, then run a t-test (a sketch follows this list)
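The Clustering and Purity step can be sketched as follows. This is a minimal illustration with scikit-learn, not the repository's own script; the file names (functional_profile.tsv, metadata.tsv) and the metadata column "environment" are assumptions:

```python
# Sketch of the PCA + k-means + purity analysis (assumed inputs and
# column names; the repository's own scripts may differ).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Functional profile: samples x GO-term features; metadata maps sample -> label.
profile = pd.read_csv("functional_profile.tsv", sep="\t", index_col=0)
metadata = pd.read_csv("metadata.tsv", sep="\t", index_col=0)
labels = metadata.loc[profile.index, "environment"].to_numpy()

# Reduce dimensionality, then cluster with as many clusters as true labels.
components = PCA(n_components=2).fit_transform(profile.to_numpy())
k = len(np.unique(labels))
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(components)

def purity(y_true, y_pred):
    # Each cluster votes for its majority label; purity is the fraction
    # of samples assigned to their cluster's majority label.
    total = sum(
        pd.Series(y_true[y_pred == c]).value_counts().iloc[0]
        for c in np.unique(y_pred)
    )
    return total / len(y_true)

print(f"Purity: {purity(labels, clusters):.3f}")
```

The IC comparison can be sketched the same way. Here ic_values.tsv (a table mapping GO terms to IC) and sample1.txt are hypothetical inputs following the two-line format described above:

```python
# Sketch of the IC comparison; ic_values.tsv and sample1.txt are
# hypothetical inputs, not files shipped with the repository.
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical per-term IC table: GO term -> information content.
ic = pd.read_csv("ic_values.tsv", sep="\t", index_col=0)["IC"].to_dict()

# Each sample file has two tab-separated lines: 16S functions, then WGS.
with open("sample1.txt") as f:
    line_16s, line_wgs = (line.rstrip("\n").split("\t") for line in f)

ic_16s = [ic[go] for go in line_16s[1:] if go in ic]  # [0] is the sample name
ic_wgs = [ic[go] for go in line_wgs[1:] if go in ic]

t_stat, p_value = ttest_ind(ic_16s, ic_wgs)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```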