New Tool: Reference Comparator #6837

jonn-smith · 2020-09-21T17:35:51Z

We need a tool to compare multiple references and spit out a TSV (or similar) detailing what the differences are. Additionally it should be able to spit out a liftover file that will properly move a variant from one reference to another.

We should first compare the sequence dictionaries in the references to see if they have equal lengths and checksums - the names may differ and we should track this so we can definitively say which contigs are equivalent. After this, we should walk the references and find out specifically which bases differ between contigs that have different checksums (with some limits on the number of differences between them so we don't get bogged down by hg19 vs hg38 comparisons).
Then it should create a liftover file from those comparisons so the data can be easily converted between the references given.

Additionally, it should be able to take a variant file and a set of references and say:

whether the variant file "belongs" to one of the given references
if it isn't exactly from one of the given references, which reference is closest
optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)

This will finally lay to rest the questions raised by my blog post about "HG19".

I believe Adam Phillipy had created a perl script that does something similar to this, but a brief view of his github page doesn't show anything like that anymore (maybe it was called refdiff or similar).

I created a bash script that does something similar to this (see attached), but it only looks at the sequence dictionaries. It produces a table similar to that in the above blog post. For example:

MD5	HG38(Homo_sapiens_assembly38.dict)	HG38_WEIRD(genome.hg38rg.fa.dict)
1e95e047b98ed92148dd84d6c037158c	chr1_KI270708v1_random	1_KI270708v1_random
42f7a452b8b769d051ad738ee9f00631	chr1_KI270714v1_random	1_KI270714v1_random
4e2db2933ea96aee8dab54af60ecb37d	chr1_KI270709v1_random	1_KI270709v1_random
62def1a794b3e18192863d187af956e6	chr1_KI270706v1_random	1_KI270706v1_random
6aef897c3d6ff0c78aff06ac189178dd	chr1	1
78135804eb15220565483b7cdd02f3be	chr1_KI270707v1_random	1_KI270707v1_random
...

compareTwoReferenceDictionaries.tar.gz

@droazen @bhanugandham - we can discuss what other features we would want for this.

The text was updated successfully, but these errors were encountered:

cmnbroad · 2020-09-21T18:55:30Z

Good idea. Another use case for this would be to take a bam/cram and set of references, and see if there is a reference suitable for the bam/cram, which would super useful for finding the reference for an existing CRAM, and also for finding a suitable reference to use to do a bam->cram conversion (which comes up a lot when adding cram tests).

bhanugandham · 2020-09-23T05:14:29Z

This is great! Thanks for putting this together Jonn.

Could we also add the option to compare a vcf and a reference to see if the vcf was generated using that reference? Sometime users have vcfs from a previous study and don't know for sure if they are using the right ref. We see this often. Users want to use post-variantcalling tools and end up getting weird errors due to a wrong reference.

droazen · 2021-03-15T18:35:26Z

@KevinCLydon Reassigning this one to you, as discussed.

kishorikonwar · 2021-03-15T20:02:16Z

@jonn-smith I am working on this ticket and was looking for the script "compareTwoReferenceDictionaries.tar.gz" you uploaded on the ticket. Looks like this is not the actual file but it gives a soft link with a path on your laptop. Could you point me to the script? Thanks Kishori

LeeTL1220 · 2021-03-15T20:11:47Z

@jonn-smith What is the requirement here exactly?

optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)

Is it to generate a liftover chain file from two arbitrary references and then do the liftover of the VCF?
Would this handle big changes, such as hg19 to hg38? Or only for lifting over similar references, such as b37 to hg19?

droazen · 2021-03-15T20:52:55Z

I think the idea is to just generate the liftover file between two arbitrary references, not to actually do the liftover.

droazen · 2021-03-15T20:55:06Z

And yes, ideally it should handle things like hg19 -> hg38, but I'll defer to @jonn-smith on how feasible that is.

If the liftover aspect of the ticket is too difficult, it would be fine to have an initial version of the tool that just prints the differences between two references, and then we can add the liftover capability later.

jonn-smith · 2021-03-15T21:40:17Z

@kishorikonwar Ah! My bad. I'll post it in a sec.

@LeeTL1220 @droazen My initial goal was to compare several "equivalent" genome FASTA files and produce two things:

A table like the one I made here which has all the contigs that are the same and different between all input references
For all contigs that differ, the actual base differences and positions at which they differ. I had envisioned this output to be a VCF file with metadata in the INFO field but now that I'm thinking about it another table with each reference as a column and each row as a position and the base at that position makes more sense (and would be easier to implement).

This would fix the issues that I've left as open questions for how certain versions of "hg19" actually differ. I have since looked into it with that script and I have an answer for some of the positions. I hope we all can handle IUPAC bases in our references! Creating a liftover file in this case would be really nice, and should take minimal effort.

If we're opening the tool to dissimilar references for the same organism, then there are some really tricky issues. What if Reference A has a frameshift relative to Reference B? What is the right way to display / report a pairwise comparison between all references together concisely (if it isn't a table)? Creating a liftover file in cases like this (e.g. hg19 -> hg38) is non-trivial.

jonn-smith · 2021-03-16T14:57:48Z

@kishorikonwar I had checked in the script a while ago, so that sim link just points to the version in GATK:

https://github.com/broadinstitute/gatk/blob/master/scripts/funcotator/testing/compareTwoReferenceDictionaries.sh
This only does the sequence dictionary check - not the base-level check, and it's not very sophistocated.

More recently I wrote a script for doing the nucleotide diffs in python. I put it in the jts_refdiff branch here:
https://github.com/broadinstitute/gatk/blob/jts_refdiff/scripts/refDiff.py

jonn-smith · 2021-03-16T16:12:34Z

@kishorikonwar It's worth noting that there are some instructions on how to create a chain file / liftover file from UCSC here:

http://genomewiki.ucsc.edu/index.php/LiftOver_Howto

kishorikonwar · 2021-03-16T16:34:35Z

@jonn-smith Thank you. I will be looking at the ucsc. I also found the following tool that implements the ucsc liftover file creation....the logic seems simple.
https://github.com/konstantint/pyliftover

jonn-smith · 2021-03-16T17:49:29Z

@kishorikonwar No problem. One quick note - I'm not sure that pyliftover actually creates the chain files to do the liftover. It looks like it can only retrieve and apply chain file liftovers to a given set of coordinates / VCFs (as opposed to creating the chain file which specifies the differences between references).

It's worth checking out for how to do the liftover once the chain file has been created, but to make the chain file itself I think you'll need to reference the UCSC howto or other documentation.

If we were to have this tool create chain files for liftovers then we could create chain files for arbitrary liftovers (e.g. hg19 -> CanFam3.1), which some people might find useful.

Also, another reference to take a quick look at is https://github.com/alshai/levioSAM - I haven't looked closely, but it does something similar.

kishorikonwar · 2021-03-22T16:04:13Z

@jonn-smith @LeeTL1220 @droazen Thanks for sharing the information above, and I looked at it. It seems to me that once we have a chain file for one reference and another reference, the remaining steps are straightforward. I also noticed the following Picard utility Picard LiftoverVcf that can Lift over a VCF file from one reference to another.
Therefore, creating the chain file between a pair of references (and limiting ourselves to cases where both references are from the same species, mouse/human) is the key. To that end, according to the following post List of chain file creators most of the chain file creation tools are available as a web interface. However, the UCSC one seems to be more popular, and fortunately, they have the utilities as open source and to some degree explain their steps in the LiverOver_Howto link you sent. With this approach, they first BLAT the pairwise contigs in the reference files and then use the utility DoSameSpeciesLiftOver.pl.

Based on this, it appears to me I should think about the following steps:
a) First, try out their code (UCSC) and make sure it works to produce chain files for two references successfully.
b) Design/propose a solution putting the logic in DoSameSpeciesLiftOver.pl into GATK, which also might need a BLAT run

Let me know what you think of this or have any suggestions about how I should proceed.

LeeTL1220 · 2021-03-24T16:41:21Z

As per discussion with Kishori, he is going to work on a different issue, since it is more inline with his work. He can come back to this if nobody has picked it up.

droazen · 2021-03-24T17:24:54Z

Re-assigning to @KevinCLydon

droazen · 2022-06-29T16:50:10Z

Re-assigning to @orlicohen

…eferences (#7930) * This tool generates an MD5-keyed table comparing specified references and does an analysis to summarize the differences between the references provided. * Comparisons are made against a "special" reference, specified with the -R argument. Subsequent references to be compared may be specified using the --references-to-compare argument. * The table can be directed to a file or standard output using provided command-line arguments. * A supplementary table keyed by sequence name can be displayed using the --display-sequences-by-name argument; to display only sequence names for which the references are not consistent, run with the --display-only-differing-sequences argument as well. * Comprehensive unit and integration tests First part of #6837

…ibility against specified references * This tool generates a table analyzing the compatibility of a sequence file against provided references. * The tool works to compare BAM/CRAMs (specified using the -I argument) as well as VCFs (specified using the -V argument) against provided reference(s), specified using the -references-to-compare argument. * When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length. * The table can be directed to a file or standard output using provided command-line arguments. * Comprehensive unit and integration tests Continued work on #6837.

droazen assigned jonn-smith Jan 4, 2021

droazen assigned KevinCLydon Mar 15, 2021

droazen added the tools label Mar 15, 2021

droazen unassigned jonn-smith Mar 15, 2021

LeeTL1220 assigned kishorikonwar and unassigned KevinCLydon Mar 15, 2021

LeeTL1220 unassigned kishorikonwar Mar 24, 2021

droazen assigned KevinCLydon Mar 24, 2021

droazen assigned orlicohen and unassigned KevinCLydon Jun 29, 2022

orlicohen mentioned this issue Jul 7, 2022

CompareReferences tool #7930

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New Tool: Reference Comparator #6837

New Tool: Reference Comparator #6837

jonn-smith commented Sep 21, 2020

cmnbroad commented Sep 21, 2020

bhanugandham commented Sep 23, 2020

droazen commented Mar 15, 2021

kishorikonwar commented Mar 15, 2021

LeeTL1220 commented Mar 15, 2021 •

edited

Loading

droazen commented Mar 15, 2021

droazen commented Mar 15, 2021

jonn-smith commented Mar 15, 2021

jonn-smith commented Mar 16, 2021

jonn-smith commented Mar 16, 2021

kishorikonwar commented Mar 16, 2021 •

edited

Loading

jonn-smith commented Mar 16, 2021 •

edited

Loading

kishorikonwar commented Mar 22, 2021

LeeTL1220 commented Mar 24, 2021

droazen commented Mar 24, 2021

droazen commented Jun 29, 2022

New Tool: Reference Comparator #6837

New Tool: Reference Comparator #6837

Comments

jonn-smith commented Sep 21, 2020

cmnbroad commented Sep 21, 2020

bhanugandham commented Sep 23, 2020

droazen commented Mar 15, 2021

kishorikonwar commented Mar 15, 2021

LeeTL1220 commented Mar 15, 2021 • edited Loading

droazen commented Mar 15, 2021

droazen commented Mar 15, 2021

jonn-smith commented Mar 15, 2021

jonn-smith commented Mar 16, 2021

jonn-smith commented Mar 16, 2021

kishorikonwar commented Mar 16, 2021 • edited Loading

jonn-smith commented Mar 16, 2021 • edited Loading

kishorikonwar commented Mar 22, 2021

LeeTL1220 commented Mar 24, 2021

droazen commented Mar 24, 2021

droazen commented Jun 29, 2022

LeeTL1220 commented Mar 15, 2021 •

edited

Loading

kishorikonwar commented Mar 16, 2021 •

edited

Loading

jonn-smith commented Mar 16, 2021 •

edited

Loading