This is a simple tool for subsetting sites from a WTCCC style haplotypes file.
This code was originally written by Olivier Delaneau. All changes from the initial commit were made by Warren Kretzschmar.
For bugs, please contact Warren Kretzschmar at [email protected], or open an issue on GitHub.
./bin/subsetREFERENCE input.map input.hap.gz output.hap.gz
The .map file is space-separated and consists of the first five columns of a VCF: chromosome identifier, position, variant ID, ref allele, alt allele. The .map file has no header line. This is a valid .map file:
20 60309 20:60309_G_T G T
20 60479 20:60479_C_T C T
20 60571 20:60571_C_A C A
20 60828 20:60828_T_G T G
The input .hap.gz file is a WTCCC style haplotypes file. This is a valid .hap.gz file (contents shown uncompressed):
20 20:60309_G_T 60309 G T 0 0 1 0
20 20:60571_C_A 60571 C A 0 0 0 0
The output is a WTCCC style haplotypes file that only includes sites found in the .map file. Matching is performed on chromosome, position, ref allele, and alt allele. The variant ID is ignored.
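The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the tool's actual implementation: the function names are hypothetical, the .hap column order is taken from the example above, and the third haplotype line is made up to show a site being dropped.

```python
def site_key_from_map(line):
    # .map columns: chromosome, position, variant ID, ref allele, alt allele
    chrom, pos, _variant_id, ref, alt = line.split()
    return (chrom, pos, ref, alt)  # the variant ID is deliberately left out


def site_key_from_hap(line):
    # .hap columns (per the example above): chromosome, variant ID,
    # position, ref allele, alt allele, then the haplotype alleles
    chrom, _variant_id, pos, ref, alt = line.split()[:5]
    return (chrom, pos, ref, alt)


def subset_hap_lines(map_lines, hap_lines):
    # Keep only haplotype lines whose site key appears in the .map file
    wanted = {site_key_from_map(l) for l in map_lines if l.strip()}
    return [l for l in hap_lines if l.strip() and site_key_from_hap(l) in wanted]


map_lines = [
    "20 60309 20:60309_G_T G T",
    "20 60479 20:60479_C_T C T",
    "20 60571 20:60571_C_A C A",
    "20 60828 20:60828_T_G T G",
]
hap_lines = [
    "20 20:60309_G_T 60309 G T 0 0 1 0",
    "20 20:60571_C_A 60571 C A 0 0 0 0",
    "20 20:61098_C_T 61098 C T 1 0 0 0",  # hypothetical site, not in the .map file
]

kept = subset_hap_lines(map_lines, hap_lines)
for line in kept:
    print(line)
```

Because the key is built from chromosome, position, ref allele, and alt allele only, two files that name the same variant differently would still match in this sketch.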
After the first three arguments, the following arguments may be given:
Providing this argument will cause matching to be reversed. Only sites that are not in the .map file are output to the output.hap.gz file.
If a site is not found in the .map file, also check whether it matches a site in the .map file with the ref and alt alleles swapped.
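As a rough illustration of how these two optional behaviours affect the keep/drop decision, here is a minimal Python sketch. The function name `keep_site` and the parameters `invert` and `check_swapped` are hypothetical and are not the tool's real argument names. The sketch also assumes that the swapped-allele check is applied before the match is reversed, and it only decides whether a site is kept; it says nothing about whether the tool recodes alleles for swapped matches.

```python
def keep_site(key, wanted, invert=False, check_swapped=False):
    """Decide whether a site is written to the output.

    key    -- (chromosome, position, ref allele, alt allele) of a haplotype line
    wanted -- set of such keys built from the .map file
    """
    chrom, pos, ref, alt = key
    found = key in wanted
    if not found and check_swapped:
        # Retry with ref and alt alleles exchanged
        found = (chrom, pos, alt, ref) in wanted
    # With invert=True, only sites NOT found in the .map file are kept
    return found != invert


wanted = {("20", "60309", "G", "T")}

print(keep_site(("20", "60309", "G", "T"), wanted))
print(keep_site(("20", "60309", "T", "G"), wanted))
print(keep_site(("20", "60309", "T", "G"), wanted, check_swapped=True))
print(keep_site(("20", "60309", "G", "T"), wanted, invert=True))
```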