Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Tool: Reference Comparator #6837

Open
jonn-smith opened this issue Sep 21, 2020 · 16 comments
Open

New Tool: Reference Comparator #6837

jonn-smith opened this issue Sep 21, 2020 · 16 comments
Assignees
Labels

Comments

@jonn-smith
Copy link
Collaborator

We need a tool to compare multiple references and spit out a TSV (or similar) detailing what the differences are. Additionally it should be able to spit out a liftover file that will properly move a variant from one reference to another.

We should first compare the sequence dictionaries in the references to see if they have equal lengths and checksums - the names may differ and we should track this so we can definitively say which contigs are equivalent. After this, we should walk the references and find out specifically which bases differ between contigs that have different checksums (with some limits on the number of differences between them so we don't get bogged down by hg19 vs hg38 comparisons).
Then it should create a liftover file from those comparisons so the data can be easily converted between the references given.

Additionally, it should be able to take a variant file and a set of references and say:

  • whether the variant file "belongs" to one of the given references
  • if it isn't exactly from one of the given references, which reference is closest
  • optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)

This will finally lay to rest the questions raised by my blog post about "HG19".

I believe Adam Phillipy had created a perl script that does something similar to this, but a brief view of his github page doesn't show anything like that anymore (maybe it was called refdiff or similar).

I created a bash script that does something similar to this (see attached), but it only looks at the sequence dictionaries. It produces a table similar to that in the above blog post. For example:

MD5 HG38(Homo_sapiens_assembly38.dict) HG38_WEIRD(genome.hg38rg.fa.dict)
1e95e047b98ed92148dd84d6c037158c chr1_KI270708v1_random 1_KI270708v1_random
42f7a452b8b769d051ad738ee9f00631 chr1_KI270714v1_random 1_KI270714v1_random
4e2db2933ea96aee8dab54af60ecb37d chr1_KI270709v1_random 1_KI270709v1_random
62def1a794b3e18192863d187af956e6 chr1_KI270706v1_random 1_KI270706v1_random
6aef897c3d6ff0c78aff06ac189178dd chr1 1
78135804eb15220565483b7cdd02f3be chr1_KI270707v1_random 1_KI270707v1_random
...

compareTwoReferenceDictionaries.tar.gz

@droazen @bhanugandham - we can discuss what other features we would want for this.

@cmnbroad
Copy link
Collaborator

Good idea. Another use case for this would be to take a bam/cram and set of references, and see if there is a reference suitable for the bam/cram, which would super useful for finding the reference for an existing CRAM, and also for finding a suitable reference to use to do a bam->cram conversion (which comes up a lot when adding cram tests).

@bhanugandham
Copy link
Contributor

This is great! Thanks for putting this together Jonn.

Could we also add the option to compare a vcf and a reference to see if the vcf was generated using that reference? Sometime users have vcfs from a previous study and don't know for sure if they are using the right ref. We see this often. Users want to use post-variantcalling tools and end up getting weird errors due to a wrong reference.

@droazen
Copy link
Contributor

droazen commented Mar 15, 2021

@KevinCLydon Reassigning this one to you, as discussed.

@kishorikonwar
Copy link

@jonn-smith I am working on this ticket and was looking for the script "compareTwoReferenceDictionaries.tar.gz" you uploaded on the ticket. Looks like this is not the actual file but it gives a soft link with a path on your laptop. Could you point me to the script? Thanks Kishori

@LeeTL1220
Copy link
Contributor

LeeTL1220 commented Mar 15, 2021

@jonn-smith What is the requirement here exactly?

optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)

Is it to generate a liftover chain file from two arbitrary references and then do the liftover of the VCF?
Would this handle big changes, such as hg19 to hg38? Or only for lifting over similar references, such as b37 to hg19?

@droazen
Copy link
Contributor

droazen commented Mar 15, 2021

I think the idea is to just generate the liftover file between two arbitrary references, not to actually do the liftover.

@droazen
Copy link
Contributor

droazen commented Mar 15, 2021

And yes, ideally it should handle things like hg19 -> hg38, but I'll defer to @jonn-smith on how feasible that is.

If the liftover aspect of the ticket is too difficult, it would be fine to have an initial version of the tool that just prints the differences between two references, and then we can add the liftover capability later.

@jonn-smith
Copy link
Collaborator Author

@kishorikonwar Ah! My bad. I'll post it in a sec.

@LeeTL1220 @droazen My initial goal was to compare several "equivalent" genome FASTA files and produce two things:

  1. A table like the one I made here which has all the contigs that are the same and different between all input references
  2. For all contigs that differ, the actual base differences and positions at which they differ. I had envisioned this output to be a VCF file with metadata in the INFO field but now that I'm thinking about it another table with each reference as a column and each row as a position and the base at that position makes more sense (and would be easier to implement).

This would fix the issues that I've left as open questions for how certain versions of "hg19" actually differ. I have since looked into it with that script and I have an answer for some of the positions. I hope we all can handle IUPAC bases in our references! Creating a liftover file in this case would be really nice, and should take minimal effort.

If we're opening the tool to dissimilar references for the same organism, then there are some really tricky issues. What if Reference A has a frameshift relative to Reference B? What is the right way to display / report a pairwise comparison between all references together concisely (if it isn't a table)? Creating a liftover file in cases like this (e.g. hg19 -> hg38) is non-trivial.

@jonn-smith
Copy link
Collaborator Author

@kishorikonwar I had checked in the script a while ago, so that sim link just points to the version in GATK:

https://github.com/broadinstitute/gatk/blob/master/scripts/funcotator/testing/compareTwoReferenceDictionaries.sh
This only does the sequence dictionary check - not the base-level check, and it's not very sophistocated.

More recently I wrote a script for doing the nucleotide diffs in python. I put it in the jts_refdiff branch here:
https://github.com/broadinstitute/gatk/blob/jts_refdiff/scripts/refDiff.py

@jonn-smith
Copy link
Collaborator Author

@kishorikonwar It's worth noting that there are some instructions on how to create a chain file / liftover file from UCSC here:

http://genomewiki.ucsc.edu/index.php/LiftOver_Howto

@kishorikonwar
Copy link

kishorikonwar commented Mar 16, 2021

@jonn-smith Thank you. I will be looking at the ucsc. I also found the following tool that implements the ucsc liftover file creation....the logic seems simple.
https://github.com/konstantint/pyliftover

@jonn-smith
Copy link
Collaborator Author

jonn-smith commented Mar 16, 2021

@kishorikonwar No problem. One quick note - I'm not sure that pyliftover actually creates the chain files to do the liftover. It looks like it can only retrieve and apply chain file liftovers to a given set of coordinates / VCFs (as opposed to creating the chain file which specifies the differences between references).

It's worth checking out for how to do the liftover once the chain file has been created, but to make the chain file itself I think you'll need to reference the UCSC howto or other documentation.

If we were to have this tool create chain files for liftovers then we could create chain files for arbitrary liftovers (e.g. hg19 -> CanFam3.1), which some people might find useful.

Also, another reference to take a quick look at is https://github.com/alshai/levioSAM - I haven't looked closely, but it does something similar.

@kishorikonwar
Copy link

@jonn-smith @LeeTL1220 @droazen Thanks for sharing the information above, and I looked at it. It seems to me that once we have a chain file for one reference and another reference, the remaining steps are straightforward. I also noticed the following Picard utility Picard LiftoverVcf that can Lift over a VCF file from one reference to another.
Therefore, creating the chain file between a pair of references (and limiting ourselves to cases where both references are from the same species, mouse/human) is the key. To that end, according to the following post List of chain file creators most of the chain file creation tools are available as a web interface. However, the UCSC one seems to be more popular, and fortunately, they have the utilities as open source and to some degree explain their steps in the LiverOver_Howto link you sent. With this approach, they first BLAT the pairwise contigs in the reference files and then use the utility DoSameSpeciesLiftOver.pl.

Based on this, it appears to me I should think about the following steps:
a) First, try out their code (UCSC) and make sure it works to produce chain files for two references successfully.
b) Design/propose a solution putting the logic in DoSameSpeciesLiftOver.pl into GATK, which also might need a BLAT run

Let me know what you think of this or have any suggestions about how I should proceed.

@LeeTL1220
Copy link
Contributor

As per discussion with Kishori, he is going to work on a different issue, since it is more inline with his work. He can come back to this if nobody has picked it up.

@droazen
Copy link
Contributor

droazen commented Mar 24, 2021

Re-assigning to @KevinCLydon

@droazen droazen assigned orlicohen and unassigned KevinCLydon Jun 29, 2022
@droazen
Copy link
Contributor

droazen commented Jun 29, 2022

Re-assigning to @orlicohen

droazen pushed a commit that referenced this issue Jul 21, 2022
…eferences (#7930)

* This tool generates an MD5-keyed table comparing specified references and does an analysis to summarize the differences between the references provided. 

* Comparisons are made against a "special" reference, specified with the -R argument. Subsequent references to be compared may be specified using the --references-to-compare argument.

* The table can be directed to a file or standard output using provided command-line arguments.

* A supplementary table keyed by sequence name can be displayed using the --display-sequences-by-name argument; to display only sequence names for which the references are not consistent, run with the --display-only-differing-sequences argument as well.

* Comprehensive unit and integration tests

First part of #6837
orlicohen added a commit that referenced this issue Jul 28, 2022
…ibility against specified references

* This tool generates a table analyzing the compatibility of a sequence file against provided references.
* The tool works to compare BAM/CRAMs (specified using the -I argument) as well as VCFs (specified using the -V argument) against provided reference(s), specified using the -references-to-compare argument.
* When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length.
* The table can be directed to a file or standard output using provided command-line arguments.
* Comprehensive unit and integration tests
Continued work on #6837.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants