In addition to including genetic diversity in the graphs, one can also mitigate mapping bias by adjusting the reference genome to the target population, the so-called consensus genome approach. In this part, we replaced bases in the ARS-UCD 1.2 bovine reference genome with the most frequent allele in the population. We consider two types of consensus:
- Reference bases were replaced with the major allele, with allele frequencies calculated from 82 Brown Swiss animals.
- Reference bases were replaced with the major allele, with allele frequencies calculated from 288 animals combined across four cattle populations (BSW, OBV, HOL, and FV).
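The selection of sites to replace can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes biallelic sites and treats the alternate allele as the major allele when its frequency exceeds 0.5. The `Site` type, the `major_allele_sites` function, and the toy data are all hypothetical.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    chrom: str
    pos: int         # 1-based position on the original reference
    ref: str         # reference allele
    alt: str         # alternate allele
    alt_freq: float  # alternate allele frequency in the chosen population


def major_allele_sites(sites: List[Site], threshold: float = 0.5) -> List[Site]:
    """Keep sites where the alternate allele is the major allele.

    Assumption: biallelic sites, so ALT is the major allele whenever
    its frequency is above `threshold` (0.5 here).
    """
    return [s for s in sites if s.alt_freq > threshold]


# Hypothetical toy data; real frequencies would come from the frequency file.
sites = [
    Site("1", 100, "A", "G", 0.80),    # ALT is major: replace
    Site("1", 250, "C", "T", 0.30),    # REF is major: keep reference base
    Site("1", 400, "T", "TAG", 0.55),  # insertion is major: replace
]
selected = major_allele_sites(sites)
```

Running the same selection with frequencies from the Brown Swiss animals or from the combined populations would give the two consensus variant sets.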
- VG toolkit version v1.17.0 "Candida" (we did not test the scripts with other vg versions)
- Java or JDK and `vcf2diploid.jar`
- UCSC `liftOver` tools
- R (we used version 3.4.2) with the `tidyverse` library

Make sure that the programs are in the `$PATH` and that the raw data have been downloaded from Zenodo.
We calculated two consensus sequences, `major-BSW` and `major-pan`, for which allele frequencies were calculated from the Brown Swiss population and the combined populations, respectively. The variants are provided in `../data/part3/vcf_consensus` (with the frequency file and the VCF files).
We modified the original reference with the major variants defined in the VCF file using the `vcf2diploid` tool. vcf2diploid generates paternal and maternal haplotypes by replacing reference bases with variants from a phased VCF. For our purpose, we supplied a single-sample VCF in which all genotypes are homozygous alternate, so every reference allele is replaced with the corresponding variant and the two output FASTA haplotypes are identical. Since replacing reference alleles with insertions and deletions shifts genomic coordinates, we applied the accompanying chain file produced by vcf2diploid to convert the coordinates of the simulated reads from the original to the modified reference, using a local installation of the UCSC liftOver tools.
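The effect of indels on coordinates can be illustrated with a small sketch. It does not reproduce vcf2diploid or liftOver; it is a toy model that applies sorted, non-overlapping variants to a sequence and records the unchanged blocks, which is roughly the information a chain file encodes. The function names `apply_variants` and `lift` and the example sequence are hypothetical, and positions inside a replaced allele are simply reported as unmapped here.

```python
from typing import List, Optional, Tuple

Variant = Tuple[int, str, str]  # (0-based position, REF, ALT)
Block = Tuple[int, int, int]    # (original start, new start, length)


def apply_variants(reference: str, variants: List[Variant]) -> Tuple[str, List[Block]]:
    """Replace REF with ALT at every variant (all homozygous alternate).

    Assumes variants are sorted by position and do not overlap.
    Returns the modified sequence and the list of unchanged blocks.
    """
    pieces: List[str] = []
    blocks: List[Block] = []
    cursor = 0   # next unprocessed position on the original sequence
    new_len = 0  # length of the modified sequence built so far
    for pos, ref, alt in variants:
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        if pos > cursor:
            blocks.append((cursor, new_len, pos - cursor))
        pieces.append(reference[cursor:pos])
        new_len += pos - cursor
        pieces.append(alt)
        new_len += len(alt)
        cursor = pos + len(ref)
    if cursor < len(reference):
        blocks.append((cursor, new_len, len(reference) - cursor))
    pieces.append(reference[cursor:])
    return "".join(pieces), blocks


def lift(pos: int, blocks: List[Block]) -> Optional[int]:
    """Convert an original 0-based position to the modified sequence."""
    for orig_start, new_start, length in blocks:
        if orig_start <= pos < orig_start + length:
            return new_start + (pos - orig_start)
    return None  # position falls inside a replaced allele in this toy model


reference = "ACGTACGTAC"
variants = [(2, "G", "T"), (5, "C", "CTT"), (7, "TA", "T")]  # SNP, insertion, deletion
consensus, blocks = apply_variants(reference, variants)
```

In this example the two-base insertion shifts downstream positions by +2 and the one-base deletion shifts them back by 1, so a read simulated at original position 9 would be expected at position 10 on the modified sequence.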
We mapped the Brown Swiss simulated reads from part1 to the consensus linear genome with `bwa` and `vg` (for `vg`, by first creating an empty graph without variants). The script `consensus_liftover.sh` creates the consensus genome, performs the liftover, and maps the reads to the consensus genome with `bwa` and to the consensus graph with `vg`.
scripts/consensus_liftover.sh ${consensus_type}
where `${consensus_type}` is either `major-BSW` or `major-pan`.
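To run both consensus types in one go, a small driver such as the one below could be used. It is only a sketch: the helper names `build_command` and `run_all` are hypothetical, the script path is taken from the command above, and the default is a dry run that prints the commands without executing them.

```python
import subprocess
from typing import List

VALID_TYPES = ("major-BSW", "major-pan")


def build_command(consensus_type: str,
                  script: str = "scripts/consensus_liftover.sh") -> List[str]:
    """Build the command line for one consensus type, rejecting unknown types."""
    if consensus_type not in VALID_TYPES:
        raise ValueError(f"consensus type must be one of {VALID_TYPES}")
    return [script, consensus_type]


def run_all(dry_run: bool = True) -> List[List[str]]:
    """Run (or, in a dry run, only print) the script for each consensus type."""
    commands = [build_command(t) for t in VALID_TYPES]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing run
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_all(dry_run=False)` from the repository root would execute the script once per consensus type, assuming the required tools are on the `$PATH`.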
The script generates the modified consensus genomes `25_anims_major-BSW.fa` and `25_anims_major-pan.fa`. In addition, mapping statistics are written to `compare.gz` files, which are required for the subsequent data analyses.
The analysis presented in the paper can be followed interactively through the Jupyter notebook `analysis/part3_consensusgenome.ipynb` or via Google Colab.