umi_tools dedup: Incorrect output with using --paired #347

christianbioinf · 2019-07-05T11:31:03Z

Hej hej,

I noticed an issue with umi_tools dedup using the option --paired which in some cases reports the read alignments multiple times in the output. By that, I mean that one gets literally duplicate lines in the SAM/BAM format.

I have traced it down that it is in conjunction with the output writer used with --paired, namely the TwoPassWriter in sam_methods.py. It is supposed to output the second-in-pair read alignment that corresponds to reported first-in-pair read alignments. However, in case the first-in-pair and second-in-pair read alignments share the same reference name and reference start, it may report the first-in-pair read alignment twice. Also, it produces incorrect (or at least unexpected) output in case the same second-in-pair read alignment is referred to by multiple first-in-pair read alignments (which bwa mem does by default in case there are secondary read alignments).

I compiled a small testset of read alignments illustrating both cases and I am currently working on a fix that I will make available as pull request. But for those who are interested and also later for checking the fix by the pull request, please find the alignment file for download (here). It contains the following read alignments (for those who do not want to download it).

TestRead_SECONDARY_ATAACCGGTTAC	163	chr18	13300195	60	127M	=	13300322	183	AGCAGGTGCCTCCTTACAGTTGGAGTCTGTCCAGGCACAGCAGAGAGCCAGGCCCGCAGAGGGGGCCGGCAGCCTCTTCCCACTCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGC	HHHHHDAGHGHHHHHHHHHFGHHHHHHHHEFHGHGHHHHHHHHGFFEHGHHEEFGEGGGGFFGGGGGECEGCDHHHHGHHHEFFHBGFFFHHFEGGGGGGGGGFEGGFFHHHGH0GHBGGHHFEFGH	NM:i:0	MD:Z:127	AS:i:127XS:i:0	RG:Z:HD200	MQ:i:60	MC:Z:44S56M51S
TestRead_SECONDARY_ATAACCGGTTAC	83	chr18	13300322	60	44S56M51S	=	13300195	-183	TCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGCCACCTCACCAGGCAGAACAGATGGCAGTGGTGAGGGCTGGAGCCCTCTGCAGGGCCGCCACTGGATGCTCTCCACATTGCACAGGGCAGGGTTGTTGCTGNACCGCA	9;.CEFECCA<GGGCCCGGGHGFHHFDEGGHHEGGGHGEHHFHGHFFGBGFHFHHHHHBHHHFGFFFFGGHGHHHEHGFEEHFHFGGGHHHHHHHGGGGFGHHGHHHHHHHGHHHFGGHHHHHHGFHHGGGGGGGGGGGBAA>>#CCCBBA	NM:i:0	MD:Z:100	AS:i:100	XS:i:19	RG:Z:HD200	SA:Z:chr7,55214317,+,53M98S,60,2;	MQ:i:60	MC:Z:127M
TestRead_SECONDARY_ATAACCGGTTAC	323	chr7	55214317	60	53M98H	chr18	13300195	0	TGCGGTNCAGCAACAACCCTGCCCTGTGCAATGTGGAGAGCATCCAGTGGCGG	ABBCCC#>>AABGGGGGGGGGGGHHFGHHHHHHGGFHHHGHHHHHHHGHHGFG	NM:i:2	MD:Z:6T24C21	AS:i:46	XS:i:19	RG:Z:HD200	SA:Z:chr18,13300278,-,100M51S,60,0;
TestRead_Missing_Mate_With_Invalid_Flag_ATAACCGGTTAC	99	chr1	2274511	60	119M	=	2274511	119	CCCTAAGCCTTCATCTCCAGTAGCCTCCATCCTTCCTGAACGTTCAGAAAATAAAAGACACTCCAGAGCTCCTGGGGCCTGGGCCGTCGTGGCTGCTCCTGCGTATTTGGCAAGCATGA	ABBBAFFFFFBFGGGFGGGFGFCFHFHHHGHHBDGCFHGHHH2GH3FHEHHHHFHHGGGHHFGBFFFAFGHHHFCGGDEC11AGHGGGG/BEEAFEFFFGFHGGGFFHFBGHHEHFGFH	NM:i:1	MD:Z:10C108	AS:i:114	XS:i:19	RG:Z:HD200	MQ:i:60	MC:Z:119M

To reproduce the issue, simply run the following command with the newest version of umi_tools (I used the master branch).

umi_tools dedup --paired --stdin umi_tools_bug_report.bam --verbose 0 -o

This results in the following output (without SAM header).

TestRead_Missing_Mate_With_Invalid_Flag_ATAACCGGTTAC	99	chr1	2274511	60	119M	=	2274511	119	CCCTAAGCCTTCATCTCCAGTAGCCTCCATCCTTCCTGAACGTTCAGAAAATAAAAGACACTCCAGAGCTCCTGGGGCCTGGGCCGTCGTGGCTGCTCCTGCGTATTTGGCAAGCATGA	ABBBAFFFFFBFGGGFGGGFGFCFHFHHHGHHBDGCFHGHHH2GH3FHEHHHHFHHGGGHHFGBFFFAFGHHHFCGGDEC11AGHGGGG/BEEAFEFFFGFHGGGFFHFBGHHEHFGFH	NM:i:1	MD:Z:10C108	AS:i:114	XS:i:19	RG:Z:HD200	MQ:i:60	MC:Z:119M
TestRead_Missing_Mate_With_Invalid_Flag_ATAACCGGTTAC	99	chr1	2274511	60	119M	=	2274511	119	CCCTAAGCCTTCATCTCCAGTAGCCTCCATCCTTCCTGAACGTTCAGAAAATAAAAGACACTCCAGAGCTCCTGGGGCCTGGGCCGTCGTGGCTGCTCCTGCGTATTTGGCAAGCATGA	ABBBAFFFFFBFGGGFGGGFGFCFHFHHHGHHBDGCFHGHHH2GH3FHEHHHHFHHGGGHHFGBFFFAFGHHHFCGGDEC11AGHGGGG/BEEAFEFFFGFHGGGFFHFBGHHEHFGFH	NM:i:1	MD:Z:10C108	AS:i:114	XS:i:19	RG:Z:HD200	MQ:i:60	MC:Z:119M
TestRead_SECONDARY_ATAACCGGTTAC	163	chr18	13300195	60	127M	=	13300322	183	AGCAGGTGCCTCCTTACAGTTGGAGTCTGTCCAGGCACAGCAGAGAGCCAGGCCCGCAGAGGGGGCCGGCAGCCTCTTCCCACTCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGC	HHHHHDAGHGHHHHHHHHHFGHHHHHHHHEFHGHGHHHHHHHHGFFEHGHHEEFGEGGGGFFGGGGGECEGCDHHHHGHHHEFFHBGFFFHHFEGGGGGGGGGFEGGFFHHHGH0GHBGGHHFEFGH	NM:i:0	MD:Z:127	AS:i:127XS:i:0	RG:Z:HD200	MQ:i:60	MC:Z:44S56M51S
TestRead_SECONDARY_ATAACCGGTTAC	163	chr18	13300195	60	127M	=	13300322	183	AGCAGGTGCCTCCTTACAGTTGGAGTCTGTCCAGGCACAGCAGAGAGCCAGGCCCGCAGAGGGGGCCGGCAGCCTCTTCCCACTCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGC	HHHHHDAGHGHHHHHHHHHFGHHHHHHHHEFHGHGHHHHHHHHGFFEHGHHEEFGEGGGGFFGGGGGECEGCDHHHHGHHHEFFHBGFFFHHFEGGGGGGGGGFEGGFFHHHGH0GHBGGHHFEFGH	NM:i:0	MD:Z:127	AS:i:127XS:i:0	RG:Z:HD200	MQ:i:60	MC:Z:44S56M51S
TestRead_SECONDARY_ATAACCGGTTAC	83	chr18	13300322	60	44S56M51S	=	13300195	-183	TCTCCTGCCCACCCCGCGCCTGTGCTCTGCCTCAGCCACAAGGCCACCTCACCAGGCAGAACAGATGGCAGTGGTGAGGGCTGGAGCCCTCTGCAGGGCCGCCACTGGATGCTCTCCACATTGCACAGGGCAGGGTTGTTGCTGNACCGCA	9;.CEFECCA<GGGCCCGGGHGFHHFDEGGHHEGGGHGEHHFHGHFFGBGFHFHHHHHBHHHFGFFFFGGHGHHHEHGFEEHFHFGGGHHHHHHHGGGGFGHHGHHHHHHHGHHHFGGHHHHHHGFHHGGGGGGGGGGGBAA>>#CCCBBA	NM:i:0	MD:Z:100	AS:i:100	XS:i:19	RG:Z:HD200	SA:Z:chr7,55214317,+,53M98S,60,2;	MQ:i:60	MC:Z:127M
TestRead_SECONDARY_ATAACCGGTTAC	323	chr7	55214317	60	53M98H	chr18	13300195	0	TGCGGTNCAGCAACAACCCTGCCCTGTGCAATGTGGAGAGCATCCAGTGGCGG	ABBCCC#>>AABGGGGGGGGGGGHHFGHHHHHHGGFHHHGHHHHHHHGHHGFG	NM:i:2	MD:Z:6T24C21	AS:i:46	XS:i:19	RG:Z:HD200	SA:Z:chr18,13300278,-,100M51S,60,0;

One can see that is unexpected to have two identical second-in-pair read alignments each of the reads.

Just for completion, I am using Python 3.6.7 and umi_tools (e8c2b47) on Mac OSX.

Best regards,
Christian

The text was updated successfully, but these errors were encountered:

IanSudbery · 2019-07-05T13:32:57Z

First of all, thanks for taking the time to report this and making to effort to offer a fix - your effort is appreciated.

Thanks for the catch with outputting read1 twice if it is at the same location as read2. My guess is that because this happens on close(), this is only and issue if you have chimeric reads (that is the mates are on different chrs).

For the problem with outputting read2 twice if two read1s point to it.... This is a little less clear cut. We have had a long dicussion in the past about when to do about mates in cases of supplementary reads etc. Some mappers will specifically output a read twice if it is the pair of two different mates. Our biggest problem here is that we are inconsistent - if you have two read1s on the same chr, then your read2 will only be output once. Unfortunately if your read1s are on two different chrs, then the read2 will be output twice. There is definitely a strong argument for your solution. Unfortunately it means storing the keys for the whole file in memory, which is going to be a problem once you start dealing with files that contain 200-300M reads.

There was a bit of dicussion on matter related to this when the TwoPassWriter was first introduced in #31 . Perhaps @TomSmithCGAT has views?

…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).

christianbioinf · 2019-07-08T07:33:19Z

Thanks for your feedback. I got your points and now tried to sustain the consistency, regardless of the strategy of the aligner, by using more information when searching for read2 alignments, namely the mate information that link back to read1. Hence, in case all read2 alignments are put out as long as they are linking back to read1s that were reported. This also renders the additional set void and thus any memory issue can be avoided.

IanSudbery · 2019-08-20T10:39:51Z

Thanks for this. Sorry its taken me a long time to get around to this.

I'm sorry to be a pain, but could you split this into two pull requests? The first problem definately needs your fix. I'm still troubled by the second problem though -

Your new solution certainly doesn't have the memory issues, but it assumes that if read2 is the mate of read1, then read1 will be the mate of read2, but this is not always the case. As you pointed out in your first post, BWA will have multiple read2s pointing to the same read1. Under this new scheme, if we don't output read1(p), but do output read1(s), then read2, which will point to read1(p) as its mate if it is mapped using BWA, will not be output, when it probably should be.

I'm still not sure I have a good alternative that works in all situations though.

…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).

… in the output

…uplicated_lines Fix for incorrect output of dedup with --paired (#347) in case of duplicated lines

SPPearce · 2020-05-06T16:15:02Z

Has this issue been fixed? I think I've just come across the bug, leading to people using CLC not being able to open the bam files that I'm producing.

IanSudbery · 2020-05-07T16:01:23Z

Yes. This should be fixed in 1.0.1

RoelKluin · 2021-02-04T14:47:24Z

version: 1.1.1 has the same issue still

cagaser · 2021-05-11T07:49:00Z

Hello, we are having the same issue still in version 1.1.1.

A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 99      19      15077383        60      67S76M  =       15077383        76      CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,-,70S73M,60,0;
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 147     19      15077383        60      67S76M  =       15077383        -76     CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,+,70S73M,60,0;
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 147     19      15077383        60      67S76M  =       15077383        -76     CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,+,70S73M,60,0;

We noticed that it does solve the issue when running on seperate contigs. For example, when running only on chr19.

samtools view 104303-001-001.dedup.small.singularity.bam
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 99      19      15077383        60      67S76M  =       15077383        76      CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,-,70S73M,60,0;
A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC 147     19      15077383        60      67S76M  =       15077383        -76     CTACTTATCTGTAGTTTTAAAAATTGGATGAGGCATGGTGTCAGGAGAAGCAGGGCCCTGATGGCCAATACAGATGCAGGGAAATGCACTGAAGTTTGACAAGCTAATAACAAGCACAGTTGGAGAGCTGCTGGAGGATGGGA        FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFF        NM:i:0  MD:Z:76 MC:Z:67S76M     AS:i:76 XS:i:20 RG:Z:104303-001-001-1-HV5GFDSXY_L003    SA:Z:4,47146421,+,70S73M,60,0;

We are not sure if running on seperate contigs and then merging is the way to go. This might influence the possibility of detecting inter-chr translocations?

IanSudbery · 2021-05-11T13:54:59Z

Sorry, the bug that was fixed in 1.0.1 was that read1 could be output twice if it was mapped to the same location as read2. The problem you have here is outputting read2 twice.

Is this the total output? It seems to me that the only way this could happen would be if there was another alignment for A00379:309:HV5GFDSXY:3:1678:24361:22013_TAGACTC read 1 that also point to the same read2 alignment.

cagaser · 2021-05-18T07:29:25Z

Hello,

That was just a subset, but there are more of these cases where it was outputting read2 twice.
Here I have the BED format so its easy to see.

sort 104303-001-001.dedup.fromPipeline.chr19.bed | uniq -c | awk '{if ($1>1) print $0}' | head
      2 19      10003296        10003447        A00379:309:HV5GFDSXY:3:2355:6225:33442_CGTTTGA/2        60      -
      2 19      10016516        10016667        A00379:309:HV5GFDSXY:3:1401:7717:27602_CCGCGCA/2        60      +
      2 19      10032445        10032596        A00379:308:HNM35DSXY:3:2545:27371:21652_AGCTAAC/2       60      -
      2 19      10038363        10038514        A00379:308:HNM35DSXY:3:1429:8223:3709_CTGAGTT/2 60      -
      2 19      10039489        10039640        A00379:308:HNM35DSXY:3:1165:6686:27101_CTTTGAT/2        60      +
      2 19      10075939        10076090        A00379:308:HNM35DSXY:3:2667:18050:10394_GTACACG/2       60      +
      2 19      10079213        10079364        A00379:309:HV5GFDSXY:3:1153:19515:22921_GAACACA/2       60      +
      2 19      10095986        10096137        A00379:308:HNM35DSXY:3:2244:9155:34319_TGGGCCG/2        60      -
      2 19      10100576        10100727        A00379:308:HNM35DSXY:3:2225:16188:1031_GTCATGT/2        60      +
      2 19      10108153        10108304        A00379:308:HNM35DSXY:3:1361:1235:5040_ACCATTT/2 60      +

We tried 1.1.1 and also 1.0.1 and indeed, it didn't help for read2.
What worked for us (not sure if this is the right way), was to remove the secondary and supplementary reads before running umitools dedup.

TomSmithCGAT · 2024-03-07T21:30:24Z

I agree with @IanSudbery that this shouldn't occur. Despite the lack of activity on this issue, I'm loath to close it in case we're missing something here. If someone can upload a reproducible example of this behaviour with the latest version of UMI-tools, that would be very helpful!

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Jul 5, 2019

Fix for incorrect output of dedup with --paired (CGATOxford#347)

fb345c7

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Jul 5, 2019

Fix for incorrect output of dedup with --paired (CGATOxford#347)

26f71dc

christianbioinf mentioned this issue Jul 5, 2019

Fix for incorrect output of dedup with --paired (#347) #348

Closed

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Jul 8, 2019

Improvement of consistency in selection of read2 alignments for outpu…

78f457b

…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).

christianbioinf mentioned this issue Aug 20, 2019

Fix for incorrect output of dedup with --paired (#347) #363

Closed

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Aug 22, 2019

Fix for incorrect output of dedup with --paired (CGATOxford#347)

86fb5d0

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Aug 22, 2019

Improvement of consistency in selection of read2 alignments for outpu…

4253ddf

…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).

christianbioinf added a commit to christianbioinf/UMI-tools that referenced this issue Aug 23, 2019

Attempt to fix a part of CGATOxford#347 that leads to duplicate lines…

c6081f2

… in the output

IanSudbery added a commit that referenced this issue Aug 23, 2019

Merge pull request #366 from christianbioinf/{IS}_dedup_with_paired_d…

6945f6a

…uplicated_lines Fix for incorrect output of dedup with --paired (#347) in case of duplicated lines

Redmar-van-den-Berg mentioned this issue Jun 26, 2024

UMI-Tools does not remove all duplicate reads #650

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

umi_tools dedup: Incorrect output with using --paired #347

umi_tools dedup: Incorrect output with using --paired #347

christianbioinf commented Jul 5, 2019

IanSudbery commented Jul 5, 2019

christianbioinf commented Jul 8, 2019

IanSudbery commented Aug 20, 2019

SPPearce commented May 6, 2020

IanSudbery commented May 7, 2020

RoelKluin commented Feb 4, 2021

cagaser commented May 11, 2021

IanSudbery commented May 11, 2021

cagaser commented May 18, 2021

TomSmithCGAT commented Mar 7, 2024

umi_tools dedup: Incorrect output with using --paired #347

umi_tools dedup: Incorrect output with using --paired #347

Comments

christianbioinf commented Jul 5, 2019

IanSudbery commented Jul 5, 2019

christianbioinf commented Jul 8, 2019

IanSudbery commented Aug 20, 2019

SPPearce commented May 6, 2020

IanSudbery commented May 7, 2020

RoelKluin commented Feb 4, 2021

cagaser commented May 11, 2021

IanSudbery commented May 11, 2021

cagaser commented May 18, 2021

TomSmithCGAT commented Mar 7, 2024