Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

running a gubaphage query #40

Open
rchikhi opened this issue May 29, 2021 · 5 comments
Open

running a gubaphage query #40

rchikhi opened this issue May 29, 2021 · 5 comments

Comments

@rchikhi
Copy link

rchikhi commented May 29, 2021

Hi Luiz, Titus,

I was wondering if you could please run a Wort query across all metagenomes for the Serratus people? I seem to have understood from your talk that the best way to ask for it is through Github issues, but let me know if you prefer another channel. The goal here is to search for gubaphages in metagenomes. I'm attaching both DNA and protein signatures: Gubaphage_genomes.sigs.zip that I computed for a collection (multifasta) of gubaphages as follows:

sourmash compute --dna Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.dna.sig
sourmash compute -k 6,9,15  --protein Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.protein.sig

thanks in advance!
Rayan

@bluegenes
Copy link
Contributor

bluegenes commented Jun 1, 2021

Hi @rchikhi,

Happy to run some DNA queries first (we don't have protein running quite yet). However, this search requires scaled signatures, and yours seem to be Mash-compatible signatures (probably due to your reported python 3.6 installation issue, sourmash-bio/sourmash#1561).

Could you try installing sourmash in an isolated environment, and then recalculating sigs, please?

installation: conda create -n sourmash4 -c conda-forge -c bioconda sourmash

activate conda environment: conda activate sourmash4

signature generation:

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund Gubaphage_genomes.fa -o Gubaphage_genomes.dna.sig

@rchikhi
Copy link
Author

rchikhi commented Jun 7, 2021

Great! oh for some reason I didn't get an email notification for this reply. Thanks for the detailed instructions, here are the updated sigs: Gubaphage_genomes.dna.sig.zip

@bluegenes
Copy link
Contributor

Hi @rchikhi! I ran the query at k=21.

see https://github.com/bluegenes/2021-gubaphage-magsearch for code & a notebook where I did some light processing and filtration of the results. Click on the binder if you'd like to run the notebook interactively.

In that notebook, I selected results metagenome results that had at least 30% (output file: gubaphage.sra-search.k21-c30.csv) or at least 10% containment (output file: gubaphage.sra-search.k21-c10.csv) of the gubaphage query. These csvs can be found in the output.gubaphage-magsearch/processed_results folder, which you can see on the associated OSF repo. They can also be regenerated via the binder.

The processed results look like this:

search_genome,metagenome,containment
Gubaphage_genomes.fa,ERR2683220,0.506
Gubaphage_genomes.fa,ERR2683247,0.472
Gubaphage_genomes.fa,ERR2607412,0.439
Gubaphage_genomes.fa,ERR2592250,0.439
Gubaphage_genomes.fa,ERR2683203,0.432
Gubaphage_genomes.fa,ERR2683256,0.426
Gubaphage_genomes.fa,ERR2683152,0.419

...where containment is the fraction of your gubaphage query covered by the metagenome.

What other information would be helpful, or what questions can I answer about this run? I can also search at k=31 and k=51 if you'd like - I'm not sure what (if any) additional metagenomes would be recovered.

cheers,

Tessa

@rchikhi
Copy link
Author

rchikhi commented Jun 14, 2021

Thanks much Tessa! I'll have a look at the results and let you know if more info is needed on our side.

@rchikhi
Copy link
Author

rchikhi commented Jun 16, 2021

Hi Tessa, quick question: the query was a pangenome, i.e. a collection of all known genomes for that clade. I suspect there's also some redundancy, i.e. some genomes inside this pangenome are very similar. Would it be correct to say that a containment score of 0.5 means that.. essentially 50% of the data inside that query pangenome has a hit? (With the truth being somewhere between "50% of the entries inside that FASTA file having a full hit and 50% have no hit", and "100% of the entries match over half their length").

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants