Some missing details #1

Open
sdwfrost opened this issue Feb 16, 2017 · 8 comments

@sdwfrost

Hi @evogytis @rambaut @plemey @trvrb @msuchard

A few things in the repository that I couldn't find in the bioRxiv paper:

  • How was the ML tree reconstructed?
  • Which putative ADAR edited sites/sequences were masked? Are the data in Data/ the masked or unmasked data?
  • How were missing dates imputed?
  • Any update on the missing accession numbers?

Don't mean to be a pain, but I'd much rather use a common resource than try to reproduce the analysis and end up with subtly different results.

@rambaut
Contributor

rambaut commented Feb 16, 2017 via email

@evogytis
Collaborator

Hey @sdwfrost,

Responses in order:

  • I didn't generate the ML tree, but I imagine PhyML + HKY+G was used. @rambaut can confirm if true. We don't mention the ML tree anywhere in text as far as I know.
  • All the data we share is masked. Masking is easy to identify because we use ?s instead of Ns. Any ? used to be a C. I have a Jupyter notebook that takes in unmasked alignments and highlights problematic sequences/areas.
  • As far as I can tell we're missing 431 accessions. Didn't realise it was this bad. Most of the sequences missing accessions are EM, DML, IP, WHO or USAMRIID. We might have USAMRIID accessions but haven't updated the sequence names. Not sure where we're at with the other accessions.
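
The masking convention described above (`?` for masked sites, `N` for genuinely missing data, and every `?` formerly a `C`) can be checked with a few lines of Python. This is only a minimal sketch, not the notebook mentioned in this thread: the function names and the toy alignment are made up for illustration, and it assumes the alignments are plain FASTA.

```python
import io
from collections import Counter
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masking_summary(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Count masked sites ('?') and missing sites ('N') per sequence."""
    summary = {}
    for name, seq in read_fasta(handle):
        counts = Counter(seq.upper())
        summary[name] = {"masked": counts["?"], "missing": counts["N"]}
    return summary


def unmask(seq: str) -> str:
    """Restore masked sites, relying on the statement that any '?' used to be a 'C'."""
    return seq.replace("?", "C")


if __name__ == "__main__":
    # Toy alignment (hypothetical names and sequences, not real EBOV data).
    toy = io.StringIO(">seqA\nACGT??NN\n>seqB\nAC?TNNNN\n")
    for name, counts in masking_summary(toy).items():
        print(name, counts)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` object, e.g. `masking_summary(open(path))`, where `path` is the location of one of the shared alignments.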

@sdwfrost
Author

Thanks @evogytis @rambaut

  • Was the data partitioned for the analysis, as in the BEAST runs? If not, I can run one myself. Just putting in the PhyML log files would be sufficient to see what was done.
  • Thanks for distinguishing masking from Ns! That'll be an easy fix. If you could also share the notebook, that would be really handy.
  • Those are some hefty BEAST runs linked from the DOI... I'm curious as to how many iterations of the models you went through.

@evogytis
Copy link
Collaborator

Hey @sdwfrost

  • Depends if @rambaut used the Geneious or the command line version. I imagine the former is the case.
  • This is the notebook that I've been using: EBOV_scrutiny.ipynb.zip. This is the consensus sequence that I've used: EBOV_consensus.fasta.zip. The script identifies each gene based on how the consensus aligns to the dataset, highlights ADAR sites and can output an alignment in CDS+ig format. Apologies for lack of comments too, didn't think I'd have other eyes on it.

@rambaut
Contributor

rambaut commented Feb 17, 2017

I have a command-line script that bats back and forth between PhyML (to create an initial tree using NJ), RAxML to search topologies, and back to PhyML to improve branch lengths. Am re-running on the 1610 data here and will upload all in a couple of days.
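
For readers wondering what such a three-step workflow might look like, here is a rough sketch that only builds the command lines; it is not the script referred to in this thread. The substitution models, the run name, the seed, the `raxmlHPC` binary name and the PhyML output suffix are all assumptions (the suffix in particular differs between PhyML versions), and the alignment is assumed to be in PHYLIP format.

```python
from typing import List


def build_pipeline(
    alignment: str,
    run_name: str = "run1",                      # hypothetical RAxML run name
    phyml_tree_suffix: str = "_phyml_tree.txt",  # assumption: varies with PhyML version
    seed: int = 12345,                           # arbitrary random seed for RAxML
) -> List[List[str]]:
    """Return argv lists for: PhyML starting tree -> RAxML search -> PhyML branch lengths."""
    start_tree = alignment + phyml_tree_suffix
    raxml_tree = "RAxML_bestTree." + run_name
    model = ["-d", "nt", "-m", "HKY85", "-c", "4", "-a", "e"]  # HKY+G, assumed model
    return [
        # Step 1: PhyML builds its distance-based (BioNJ) starting tree; "-o n" skips optimisation.
        ["phyml", "-i", alignment] + model + ["-o", "n", "-b", "0"],
        # Step 2: RAxML searches topologies starting from that tree ("-t").
        ["raxmlHPC", "-s", alignment, "-n", run_name, "-m", "GTRGAMMA",
         "-p", str(seed), "-t", start_tree],
        # Step 3: PhyML keeps the RAxML topology ("-u") and optimises
        # branch lengths and rate parameters only ("-o lr").
        ["phyml", "-i", alignment] + model + ["-u", raxml_tree, "-o", "lr", "-b", "0"],
    ]


if __name__ == "__main__":
    # "alignment.phy" is a placeholder file name.
    for argv in build_pipeline("alignment.phy"):
        print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` to actually execute the step. Note that step 3 writes its tree to the same PhyML output name as step 1, so the starting tree would need copying first if it should be kept.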

@rambaut
Contributor

rambaut commented Feb 17, 2017

Missing accession numbers are from the Quick et al. MinION sequencing. This is because although the raw data were on ENA, the consensuses were simply on Nick's GitHub. These have recently been deposited in GenBank, so will endeavour to match accession to sequence in the tables. Creating Issue...

@BEAST-Community

Very nice work! @rambaut @evogytis @msuchard @plemey
@rambaut, could you tell me how to go back and forth between PhyML and RAxML to get a better tree? Thank you.

@evogytis
Collaborator

evogytis commented Jun 4, 2018

@BEAST-Community the script is now in the repo with 67a36db.
