Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

umi_tools dedup: Incorrect output with using --paired #347

Open
christianbioinf opened this issue Jul 5, 2019 · 10 comments
Open

umi_tools dedup: Incorrect output with using --paired #347

christianbioinf opened this issue Jul 5, 2019 · 10 comments

Comments

@christianbioinf
Copy link
Contributor

Hej hej,

I noticed an issue with umi_tools dedup using the option --paired which in some cases reports the read alignments multiple times in the output. By that, I mean that one gets literally duplicate lines in the SAM/BAM format.

I have traced it down that it is in conjunction with the output writer used with --paired, namely the TwoPassWriter in sam_methods.py. It is supposed to output the second-in-pair read alignment that corresponds to reported first-in-pair read alignments. However, in case the first-in-pair and second-in-pair read alignments share the same reference name and reference start, it may report the first-in-pair read alignment twice. Also, it produces incorrect (or at least unexpected) output in case the same second-in-pair read alignment is referred to by multiple first-in-pair read alignments (which bwa mem does by default in case there are secondary read alignments).

I compiled a small testset of read alignments illustrating both cases and I am currently working on a fix that I will make available as pull request. But for those who are interested and also later for checking the fix by the pull request, please find the alignment file for download (here). It contains the following read alignments (for those who do not want to download it).

TestRead_SECONDARY_ATAACCGGTTAC	163	chr18	13300195	60	127M	=	13300322	183	AGCAGGTGCCTCCTTACAGTTGGAGTCTGTCCAGGCACAGCAGAGAGCCAGGCCCGCAGAGGGGGCCGGCAGCCTCTTCCCACTCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGC	HHHHHDAGHGHHHHHHHHHFGHHHHHHHHEFHGHGHHHHHHHHGFFEHGHHEEFGEGGGGFFGGGGGECEGCDHHHHGHHHEFFHBGFFFHHFEGGGGGGGGGFEGGFFHHHGH0GHBGGHHFEFGH	NM:i:0	MD:Z:127	AS:i:127XS:i:0	RG:Z:HD200	MQ:i:60	MC:Z:44S56M51S
TestRead_SECONDARY_ATAACCGGTTAC	83	chr18	13300322	60	44S56M51S	=	13300195	-183	TCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGCCACCTCACCAGGCAGAACAGATGGCAGTGGTGAGGGCTGGAGCCCTCTGCAGGGCCGCCACTGGATGCTCTCCACATTGCACAGGGCAGGGTTGTTGCTGNACCGCA	9;.CEFECCA<GGGCCCGGGHGFHHFDEGGHHEGGGHGEHHFHGHFFGBGFHFHHHHHBHHHFGFFFFGGHGHHHEHGFEEHFHFGGGHHHHHHHGGGGFGHHGHHHHHHHGHHHFGGHHHHHHGFHHGGGGGGGGGGGBAA>>#CCCBBA	NM:i:0	MD:Z:100	AS:i:100	XS:i:19	RG:Z:HD200	SA:Z:chr7,55214317,+,53M98S,60,2;	MQ:i:60	MC:Z:127M
TestRead_SECONDARY_ATAACCGGTTAC	323	chr7	55214317	60	53M98H	chr18	13300195	0	TGCGGTNCAGCAACAACCCTGCCCTGTGCAATGTGGAGAGCATCCAGTGGCGG	ABBCCC#>>AABGGGGGGGGGGGHHFGHHHHHHGGFHHHGHHHHHHHGHHGFG	NM:i:2	MD:Z:6T24C21	AS:i:46	XS:i:19	RG:Z:HD200	SA:Z:chr18,13300278,-,100M51S,60,0;
TestRead_Missing_Mate_With_Invalid_Flag_ATAACCGGTTAC	99	chr1	2274511	60	119M	=	2274511	119	CCCTAAGCCTTCATCTCCAGTAGCCTCCATCCTTCCTGAACGTTCAGAAAATAAAAGACACTCCAGAGCTCCTGGGGCCTGGGCCGTCGTGGCTGCTCCTGCGTATTTGGCAAGCATGA	ABBBAFFFFFBFGGGFGGGFGFCFHFHHHGHHBDGCFHGHHH2GH3FHEHHHHFHHGGGHHFGBFFFAFGHHHFCGGDEC11AGHGGGG/BEEAFEFFFGFHGGGFFHFBGHHEHFGFH	NM:i:1	MD:Z:10C108	AS:i:114	XS:i:19	RG:Z:HD200	MQ:i:60	MC:Z:119M

To reproduce the issue, simply run the following command with the newest version of umi_tools (I used the master branch).

umi_tools dedup --paired --stdin umi_tools_bug_report.bam --verbose 0 -o

This results in the following output (without SAM header).

TestRead_Missing_Mate_With_Invalid_Flag_ATAACCGGTTAC	99	chr1	2274511	60	119M	=	2274511	119	CCCTAAGCCTTCATCTCCAGTAGCCTCCATCCTTCCTGAACGTTCAGAAAATAAAAGACACTCCAGAGCTCCTGGGGCCTGGGCCGTCGTGGCTGCTCCTGCGTATTTGGCAAGCATGA	ABBBAFFFFFBFGGGFGGGFGFCFHFHHHGHHBDGCFHGHHH2GH3FHEHHHHFHHGGGHHFGBFFFAFGHHHFCGGDEC11AGHGGGG/BEEAFEFFFGFHGGGFFHFBGHHEHFGFH	NM:i:1	MD:Z:10C108	AS:i:114	XS:i:19	RG:Z:HD200	MQ:i:60	MC:Z:119M
TestRead_Missing_Mate_With_Invalid_Flag_ATAACCGGTTAC	99	chr1	2274511	60	119M	=	2274511	119	CCCTAAGCCTTCATCTCCAGTAGCCTCCATCCTTCCTGAACGTTCAGAAAATAAAAGACACTCCAGAGCTCCTGGGGCCTGGGCCGTCGTGGCTGCTCCTGCGTATTTGGCAAGCATGA	ABBBAFFFFFBFGGGFGGGFGFCFHFHHHGHHBDGCFHGHHH2GH3FHEHHHHFHHGGGHHFGBFFFAFGHHHFCGGDEC11AGHGGGG/BEEAFEFFFGFHGGGFFHFBGHHEHFGFH	NM:i:1	MD:Z:10C108	AS:i:114	XS:i:19	RG:Z:HD200	MQ:i:60	MC:Z:119M
TestRead_SECONDARY_ATAACCGGTTAC	163	chr18	13300195	60	127M	=	13300322	183	AGCAGGTGCCTCCTTACAGTTGGAGTCTGTCCAGGCACAGCAGAGAGCCAGGCCCGCAGAGGGGGCCGGCAGCCTCTTCCCACTCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGC	HHHHHDAGHGHHHHHHHHHFGHHHHHHHHEFHGHGHHHHHHHHGFFEHGHHEEFGEGGGGFFGGGGGECEGCDHHHHGHHHEFFHBGFFFHHFEGGGGGGGGGFEGGFFHHHGH0GHBGGHHFEFGH	NM:i:0	MD:Z:127	AS:i:127XS:i:0	RG:Z:HD200	MQ:i:60	MC:Z:44S56M51S
TestRead_SECONDARY_ATAACCGGTTAC	163	chr18	13300195	60	127M	=	13300322	183	AGCAGGTGCCTCCTTACAGTTGGAGTCTGTCCAGGCACAGCAGAGAGCCAGGCCCGCAGAGGGGGCCGGCAGCCTCTTCCCACTCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGC	HHHHHDAGHGHHHHHHHHHFGHHHHHHHHEFHGHGHHHHHHHHGFFEHGHHEEFGEGGGGFFGGGGGECEGCDHHHHGHHHEFFHBGFFFHHFEGGGGGGGGGFEGGFFHHHGH0GHBGGHHFEFGH	NM:i:0	MD:Z:127	AS:i:127XS:i:0	RG:Z:HD200	MQ:i:60	MC:Z:44S56M51S
TestRead_SECONDARY_ATAACCGGTTAC	83	chr18	13300322	60	44S56M51S	=	13300195	-183	TCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGCCACCTCACCAGGCAGAACAGATGGCAGTGGTGAGGGCTGGAGCCCTCTGCAGGGCCGCCACTGGATGCTCTCCACATTGCACAGGGCAGGGTTGTTGCTGNACCGCA	9;.CEFECCA<GGGCCCGGGHGFHHFDEGGHHEGGGHGEHHFHGHFFGBGFHFHHHHHBHHHFGFFFFGGHGHHHEHGFEEHFHFGGGHHHHHHHGGGGFGHHGHHHHHHHGHHHFGGHHHHHHGFHHGGGGGGGGGGGBAA>>#CCCBBA	NM:i:0	MD:Z:100	AS:i:100	XS:i:19	RG:Z:HD200	SA:Z:chr7,55214317,+,53M98S,60,2;	MQ:i:60	MC:Z:127M
TestRead_SECONDARY_ATAACCGGTTAC	323	chr7	55214317	60	53M98H	chr18	13300195	0	TGCGGTNCAGCAACAACCCTGCCCTGTGCAATGTGGAGAGCATCCAGTGGCGG	ABBCCC#>>AABGGGGGGGGGGGHHFGHHHHHHGGFHHHGHHHHHHHGHHGFG	NM:i:2	MD:Z:6T24C21	AS:i:46	XS:i:19	RG:Z:HD200	SA:Z:chr18,13300278,-,100M51S,60,0;

One can see that is unexpected to have two identical second-in-pair read alignments each of the reads.

Just for completion, I am using Python 3.6.7 and umi_tools (e8c2b47) on Mac OSX.

Best regards,
Christian

@IanSudbery
Copy link
Member

First of all, thanks for taking the time to report this and making to effort to offer a fix - your effort is appreciated.

Thanks for the catch with outputting read1 twice if it is at the same location as read2. My guess is that because this happens on close(), this is only and issue if you have chimeric reads (that is the mates are on different chrs).

For the problem with outputting read2 twice if two read1s point to it.... This is a little less clear cut. We have had a long dicussion in the past about when to do about mates in cases of supplementary reads etc. Some mappers will specifically output a read twice if it is the pair of two different mates. Our biggest problem here is that we are inconsistent - if you have two read1s on the same chr, then your read2 will only be output once. Unfortunately if your read1s are on two different chrs, then the read2 will be output twice. There is definitely a strong argument for your solution. Unfortunately it means storing the keys for the whole file in memory, which is going to be a problem once you start dealing with files that contain 200-300M reads.

There was a bit of dicussion on matter related to this when the TwoPassWriter was first introduced in #31 . Perhaps @TomSmithCGAT has views?

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Jul 8, 2019
…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).
@christianbioinf
Copy link
Contributor Author

Thanks for your feedback. I got your points and now tried to sustain the consistency, regardless of the strategy of the aligner, by using more information when searching for read2 alignments, namely the mate information that link back to read1. Hence, in case all read2 alignments are put out as long as they are linking back to read1s that were reported. This also renders the additional set void and thus any memory issue can be avoided.

@IanSudbery
Copy link
Member

Thanks for this. Sorry its taken me a long time to get around to this.

I'm sorry to be a pain, but could you split this into two pull requests? The first problem definately needs your fix. I'm still troubled by the second problem though -

Your new solution certainly doesn't have the memory issues, but it assumes that if read2 is the mate of read1, then read1 will be the mate of read2, but this is not always the case. As you pointed out in your first post, BWA will have multiple read2s pointing to the same read1. Under this new scheme, if we don't output read1(p), but do output read1(s), then read2, which will point to read1(p) as its mate if it is mapped using BWA, will not be output, when it probably should be.

I'm still not sure I have a good alternative that works in all situations though.

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Aug 22, 2019
christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Aug 22, 2019
…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).
christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Aug 23, 2019
IanSudbery added a commit that referenced this issue Aug 23, 2019
…uplicated_lines

Fix for incorrect output of dedup with --paired (#347) in case of duplicated lines
@SPPearce
Copy link

SPPearce commented May 6, 2020

Has this issue been fixed? I think I've just come across the bug, leading to people using CLC not being able to open the bam files that I'm producing.

@IanSudbery
Copy link
Member

Yes. This should be fixed in 1.0.1

@RoelKluin
Copy link

version: 1.1.1 has the same issue still

@cagaser
Copy link

cagaser commented May 11, 2021

Hello, we are having the same issue still in version 1.1.1.

A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 99      19      15077383        60      67S76M  =       15077383        76      CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,-,70S73M,60,0;
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 147     19      15077383        60      67S76M  =       15077383        -76     CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,+,70S73M,60,0;
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 147     19      15077383        60      67S76M  =       15077383        -76     CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,+,70S73M,60,0;

We noticed that it does solve the issue when running on seperate contigs. For example, when running only on chr19.

samtools view 104303-001-001.dedup.small.singularity.bam
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 99      19      15077383        60      67S76M  =       15077383        76      CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,-,70S73M,60,0;
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 147     19      15077383        60      67S76M  =       15077383        -76     CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,+,70S73M,60,0;

We are not sure if running on seperate contigs and then merging is the way to go. This might influence the possibility of detecting inter-chr translocations?

@IanSudbery
Copy link
Member

Sorry, the bug that was fixed in 1.0.1 was that read1 could be output twice if it was mapped to the same location as read2. The problem you have here is outputting read2 twice.

Is this the total output? It seems to me that the only way this could happen would be if there was another alignment for A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC read 1 that also point to the same read2 alignment.

@cagaser
Copy link

cagaser commented May 18, 2021

Hello,

That was just a subset, but there are more of these cases where it was outputting read2 twice.
Here I have the BED format so its easy to see.

sort 104303-001-001.dedup.fromPipeline.chr19.bed | uniq -c | awk '{if ($1>1) print $0}' | head
      2 19      10003296        10003447        A00379:309:HV5GFDSXY:3:2355:6225:33442_CGTTTGA/2        60      -
      2 19      10016516        10016667        A00379:309:HV5GFDSXY:3:1401:7717:27602_CCGCGCA/2        60      +
      2 19      10032445        10032596        A00379:308:HNM35DSXY:3:2545:27371:21652_AGCTAAC/2       60      -
      2 19      10038363        10038514        A00379:308:HNM35DSXY:3:1429:8223:3709_CTGAGTT/2 60      -
      2 19      10039489        10039640        A00379:308:HNM35DSXY:3:1165:6686:27101_CTTTGAT/2        60      +
      2 19      10075939        10076090        A00379:308:HNM35DSXY:3:2667:18050:10394_GTACACG/2       60      +
      2 19      10079213        10079364        A00379:309:HV5GFDSXY:3:1153:19515:22921_GAACACA/2       60      +
      2 19      10095986        10096137        A00379:308:HNM35DSXY:3:2244:9155:34319_TGGGCCG/2        60      -
      2 19      10100576        10100727        A00379:308:HNM35DSXY:3:2225:16188:1031_GTCATGT/2        60      +
      2 19      10108153        10108304        A00379:308:HNM35DSXY:3:1361:1235:5040_ACCATTT/2 60      +

We tried 1.1.1 and also 1.0.1 and indeed, it didn't help for read2.
What worked for us (not sure if this is the right way), was to remove the secondary and supplementary reads before running umitools dedup.

@TomSmithCGAT
Copy link
Member

I agree with @IanSudbery that this shouldn't occur. Despite the lack of activity on this issue, I'm loath to close it in case we're missing something here. If someone can upload a reproducible example of this behaviour with the latest version of UMI-tools, that would be very helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants