New Tool: Reference Comparator #6837
Comments
Good idea. Another use case for this would be to take a bam/cram and a set of references, and see if there is a reference suitable for the bam/cram. That would be super useful for finding the reference for an existing CRAM, and also for finding a suitable reference to use for a bam->cram conversion (which comes up a lot when adding cram tests). |
This is great! Thanks for putting this together, Jonn. Could we also add the option to compare a vcf and a reference to see if the vcf was generated using that reference? Sometimes users have vcfs from a previous study and don't know for sure whether they are using the right ref. We see this often. Users want to use post-variant-calling tools and end up getting weird errors due to a wrong reference. |
@KevinCLydon Reassigning this one to you, as discussed. |
@jonn-smith I am working on this ticket and was looking for the script "compareTwoReferenceDictionaries.tar.gz" you uploaded on the ticket. Looks like this is not the actual file but it gives a soft link with a path on your laptop. Could you point me to the script? Thanks Kishori |
@jonn-smith What is the requirement here exactly?
Is it to generate a liftover chain file from two arbitrary references and then do the liftover of the VCF? |
I think the idea is to just generate the liftover file between two arbitrary references, not to actually do the liftover. |
And yes, ideally it should handle things like hg19 -> hg38, but I'll defer to @jonn-smith on how feasible that is. If the liftover aspect of the ticket is too difficult, it would be fine to have an initial version of the tool that just prints the differences between two references, and then we can add the liftover capability later. |
@kishorikonwar Ah! My bad. I'll post it in a sec. @LeeTL1220 @droazen My initial goal was to compare several "equivalent" genome FASTA files and produce two things:
This would fix the issues that I've left as open questions for how certain versions of "hg19" actually differ. I have since looked into it with that script and I have an answer for some of the positions. I hope we all can handle IUPAC bases in our references! Creating a liftover file in this case would be really nice, and should take minimal effort. If we're opening the tool to dissimilar references for the same organism, then there are some really tricky issues. What if Reference A has a frameshift relative to Reference B? What is the right way to display / report a pairwise comparison between all references together concisely (if it isn't a table)? Creating a liftover file in cases like this (e.g. hg19 -> hg38) is non-trivial. |
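For the easy case described above (references that are equivalent apart from contig names and a few substituted bases), a chain file only needs one ungapped block per contig. The sketch below is illustrative and not taken from GATK; `identity_chain` and its input tuples are hypothetical names, and the score field is filled with the contig length as a placeholder. It follows the UCSC chain layout, where the header fields are `score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id` and an ungapped chain has a single block line holding the block size.

```python
def identity_chain(pairs):
    """Build chain-file text for contigs that map 1:1 with no gaps.

    pairs: iterable of (source_name, target_name, length) tuples (hypothetical
    input shape). Only valid when both contigs have the same length and differ
    at most by substitutions.
    """
    lines = []
    for chain_id, (src, dst, length) in enumerate(pairs, start=1):
        # Header: the "t" fields describe the assembly being lifted from,
        # the "q" fields the assembly being lifted to. The score is a placeholder.
        lines.append(
            f"chain {length} {src} {length} + 0 {length} "
            f"{dst} {length} + 0 {length} {chain_id}"
        )
        # One ungapped alignment block covering the whole contig.
        lines.append(str(length))
        # A blank line terminates each chain.
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(identity_chain([("1", "chr1", 1000), ("MT", "chrM", 500)]))
```

This does not address the hard cases raised above (indels, frameshifts, hg19 -> hg38), which need a real alignment to produce gapped blocks.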
@kishorikonwar I had checked in the script a while ago, so that symlink just points to the version in GATK: https://github.com/broadinstitute/gatk/blob/master/scripts/funcotator/testing/compareTwoReferenceDictionaries.sh More recently I wrote a script for doing the nucleotide diffs in python. I put it in the |
@kishorikonwar It's worth noting that there are some instructions on how to create a chain file / liftover file from UCSC here: |
@jonn-smith Thank you. I will be looking at the ucsc. I also found the following tool that implements the ucsc liftover file creation....the logic seems simple. |
@kishorikonwar No problem. One quick note - I'm not sure that It's worth checking out for how to do the liftover once the chain file has been created, but to make the chain file itself I think you'll need to reference the UCSC howto or other documentation. If we were to have this tool create chain files for liftovers then we could create chain files for arbitrary liftovers (e.g. hg19 -> CanFam3.1), which some people might find useful. Also, another reference to take a quick look at is https://github.com/alshai/levioSAM - I haven't looked closely, but it does something similar. |
@jonn-smith @LeeTL1220 @droazen Thanks for sharing the information above, and I looked at it. It seems to me that once we have a chain file for one reference and another reference, the remaining steps are straightforward. I also noticed the following Picard utility Picard LiftoverVcf that can Lift over a VCF file from one reference to another. Based on this, it appears to me I should think about the following steps: Let me know what you think of this or have any suggestions about how I should proceed. |
As per discussion with Kishori, he is going to work on a different issue, since it is more in line with his work. He can come back to this if nobody has picked it up. |
Re-assigning to @KevinCLydon |
Re-assigning to @orlicohen |
…eferences (#7930)
* This tool generates an MD5-keyed table comparing specified references and does an analysis to summarize the differences between the references provided.
* Comparisons are made against a "special" reference, specified with the -R argument. Subsequent references to be compared may be specified using the --references-to-compare argument.
* The table can be directed to a file or standard output using provided command-line arguments.
* A supplementary table keyed by sequence name can be displayed using the --display-sequences-by-name argument; to display only sequence names for which the references are not consistent, run with the --display-only-differing-sequences argument as well.
* Comprehensive unit and integration tests

First part of #6837
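To make the idea of an MD5-keyed table concrete, here is a minimal sketch. It is not the tool's implementation or its output format; the function names, the `---` placeholder for a missing sequence, and the input shape are all assumptions made for illustration. Each row is one MD5, and each column shows the name that reference uses for the sequence with that checksum.

```python
from collections import defaultdict


def md5_keyed_table(dictionaries):
    """Group contigs from several references by MD5.

    dictionaries: {reference_label: [(contig_name, length, md5), ...]}
    (hypothetical input shape). Returns {md5: {reference_label: contig_name}}.
    """
    table = defaultdict(dict)
    for ref_label, contigs in dictionaries.items():
        for contig_name, _length, md5 in contigs:
            table[md5][ref_label] = contig_name
    return dict(table)


def format_tsv(table, ref_labels):
    """Render the table as TSV, one row per MD5, one column per reference."""
    lines = ["\t".join(["MD5"] + list(ref_labels))]
    for md5 in sorted(table):
        # "---" marks a reference that has no sequence with this MD5.
        row = [md5] + [table[md5].get(label, "---") for label in ref_labels]
        lines.append("\t".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up checksums and lengths, for illustration only.
    dicts = {
        "refA": [("chr1", 100, "aaa"), ("chrM", 50, "bbb")],
        "refB": [("1", 100, "aaa"), ("MT", 48, "ccc")],
    }
    print(format_tsv(md5_keyed_table(dicts), ["refA", "refB"]))
```

Rows with a name in every column are sequences shared by all references, possibly under different names; rows with a placeholder point at sequences that differ.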
…ibility against specified references
* This tool generates a table analyzing the compatibility of a sequence file against provided references.
* The tool works to compare BAM/CRAMs (specified using the -I argument) as well as VCFs (specified using the -V argument) against provided reference(s), specified using the --references-to-compare argument.
* When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length.
* The table can be directed to a file or standard output using provided command-line arguments.
* Comprehensive unit and integration tests

Continued work on #6837.
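The MD5 fallback rule described in this commit message can be sketched roughly as follows. This is an illustration, not the tool's code: `is_compatible` is a hypothetical name, and the sketch assumes a file counts as compatible when every one of its sequences is found in the reference, which may be stricter or looser than what the tool reports.

```python
def is_compatible(query_contigs, reference_contigs):
    """Check a BAM/CRAM/VCF sequence dictionary against a reference dictionary.

    Both arguments are lists of (name, length, md5_or_None) tuples
    (hypothetical input shape).
    """
    ref_by_name = {name: (length, md5) for name, length, md5 in reference_contigs}
    for name, length, md5 in query_contigs:
        if name not in ref_by_name:
            return False
        ref_length, ref_md5 = ref_by_name[name]
        if length != ref_length:
            return False
        # Compare MD5s only when both sides have one; otherwise fall back
        # to the name and length checks already made above.
        if md5 is not None and ref_md5 is not None and md5 != ref_md5:
            return False
    return True


if __name__ == "__main__":
    reference = [("chr1", 100, "aaa"), ("chr2", 200, "bbb")]
    print(is_compatible([("chr1", 100, None)], reference))
```

A match made without MD5s is weaker evidence, since two different sequences can share a name and a length.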
We need a tool to compare multiple references and spit out a TSV (or similar) detailing what the differences are. Additionally it should be able to spit out a liftover file that will properly move a variant from one reference to another.
We should first compare the sequence dictionaries in the references to see if they have equal lengths and checksums - the names may differ and we should track this so we can definitively say which contigs are equivalent. After this, we should walk the references and find out specifically which bases differ between contigs that have different checksums (with some limits on the number of differences between them so we don't get bogged down by hg19 vs hg38 comparisons). Then it should create a liftover file from those comparisons so the data can be easily converted between the references given.
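The base-walking step with a cap on reported differences could look something like the sketch below. It is a simplified illustration rather than the script mentioned in this thread: `diff_bases` and `max_diffs` are hypothetical names, sequences are assumed to be already loaded as strings, the comparison is case-insensitive (so soft-masking is ignored), and contigs are assumed to be the same length so that positions line up without alignment.

```python
from itertools import islice


def diff_bases(seq_a, seq_b, max_diffs=100):
    """Return up to max_diffs mismatches as (1-based position, base_a, base_b)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("contigs differ in length; a position-wise diff is not meaningful")
    mismatches = (
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a != b
    )
    # islice stops the walk as soon as the cap is reached, so very
    # dissimilar contigs do not have to be scanned to the end.
    return list(islice(mismatches, max_diffs))


if __name__ == "__main__":
    print(diff_bases("ACGTN", "ACGTA"))
```

Hitting the cap is itself a useful signal: it suggests the two contigs are not near-identical versions of the same sequence and a plain substitution list is the wrong report for them.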
Additionally, it should be able to take a variant file and a set of references and say:
This will finally lay to rest the questions raised by my blog post about "HG19".
I believe Adam Phillipy had created a perl script that does something similar to this, but a brief view of his github page doesn't show anything like that anymore (maybe it was called refdiff or similar).
I created a bash script that does something similar to this (see attached), but it only looks at the sequence dictionaries. It produces a table similar to that in the above blog post. For example:
compareTwoReferenceDictionaries.tar.gz
@droazen @bhanugandham - we can discuss what other features we would want for this.