Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting the segmented bam file? #143

Open
ashokpatowary opened this issue Jul 26, 2022 · 8 comments
Open

Splitting the segmented bam file? #143

ashokpatowary opened this issue Jul 26, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@ashokpatowary
Copy link

Will it be possible to have a read tag for the MAS adapter used in the segmented bam file so that we can split the bam file if needed? Like if we are using 10 MAS adapter for ligation; can we split the segmented bam file into 10 independent bam file?

Also what is the memory requirement for the "logbow correct" step? Its continuously giving memory error.

Thanks

@jonn-smith
Copy link
Collaborator

There is currently no built-in way to split the segmented bam into files for each barcode. Since the molecules being ligated together are chosen randomly I'm not sure why you would want to split them by the MAS-adapter sequence - it shouldn't make a difference in the read content.

longbow correct currently requires a lot of memory. The 3' barcode lists need about 34 Gb of memory to create the correction index. The other barcode lists require less. This was the trade off to doing barcode correction quickly - we need a lot more memory at runtime.

@ashokpatowary
Copy link
Author

@jonn-smith thanks for your comment.

We are planning to use the MAS Isoseq on ~100 case control single cell cDNA samples. However we are thinking of mixing case and control in one concat product. If we can identify the mas adapter then we can segregate the case control readsin downstream. Thants the reason why we Re looking for it.

thanks

@jonn-smith
Copy link
Collaborator

@ashokpatowary That's great! It's an interesting use case. I'll put it on the todo list. I'm not sure when I can get to it - probably not for a couple of weeks.

In the meantime, everything already exists in the output from longbow already for you to implement such a filter using pysam. After annotating, filtering and segmenting the reads, you can iterate over the segmented reads and parse the SG tag . The first segment in the SG tag will correspond to the leading MAS adapter for that segment, and you can write each one to a separate output file depending on that first segment.

@jonn-smith jonn-smith added the enhancement New feature or request label Aug 4, 2022
@ashokpatowary
Copy link
Author

Thanks @jonn-smith for the suggestion. We just generated 1st batch of test data. We are also not in very hurry. Still trying to understand that data generated.

I am having a general follow up question. I try to look for distribution of the leading MAS adapter within the SG tag. It looks something like this

 239046 5p_Adapter
 280197 A
 299164 B
 326649 C
 331475 D
 337768 E
 333102 F
 329844 G
 312301 H
 312132 I
 440403 J
 444085 K
 423081 L
 373493 M
 360875 N
 334501 O
 21 Poly_T
 2769 random

I am not understanding from the wet lab experimental point of view why are we seeing so many reads with 5p_Adapter as the leading MAS adapter.

Another question regarding the segment step. I have modified the json file as the UMI length was 12 bp. What this warning about [WARNING 2022-08-06 07:59:46 segment] Model does not have Cell Barcode or UMI tags. All segments will be emitted.

[INFO 2022-08-06 07:59:45  segment] Invoked via: longbow segment -t 12 -o segmented_1.bam -v INFO -m longbow_model_mas_15_sc_Ashok.json -f filter_1.bam
[INFO 2022-08-06 07:59:45  segment] Running with 12 worker subprocess(es)
[INFO 2022-08-06 07:59:45  segment] Using simple splitting mode.
[INFO 2022-08-06 07:59:46  segment] Using mas_15_sc_10x3p_single_none: The 3' kit MAS-seq 15 array element model.
[INFO 2022-08-06 07:59:46  segment] Model has UMI annotation.
[INFO 2022-08-06 07:59:46  segment] Model has Cell Barcode annotation.
[WARNING 2022-08-06 07:59:46  segment] Model does not have Cell Barcode or UMI tags.  All segments will be emitted.
[INFO 2022-08-06 08:20:29  segment] Segmented 536976 reads with 5509617 total segments.
[INFO 2022-08-06 08:20:29  segment] MAS-seq gain factor: 10.26x
[INFO 2022-08-06 08:20:29  segment] Done. Elapsed time: 1243.44s.

@jonn-smith
Copy link
Collaborator

@ashokpatowary What we have noticed is that the first MAS adapter can be missing from the beginning of an array read. I don't know why the abundance of the 5p_Adapter would be quite that high, but that could be part of the issue.

For QC and metrics on your data, I suggest running longbow stats on the annotated bam file. This will produce a very useful summary statistics text file and some plots that you can use to evaluate your data.

@jonn-smith
Copy link
Collaborator

@ashokpatowary For the UMI / Cell barcode warning - the latest version of longbow segment extracts any cell barcode and unique molecular identifiers in your model, but they need to be annotated in the model in a very particular way, or they are ignored. For models with a CBC and / or UMI, if a segment doesn't contain one, then that segment is not emitted to the standard output file. This warning message is saying it can't find a UMI or CBC, so this filtering is not taking place. I recommend updating to the latest version of longbow and trying segment on a subset of your data to see if it still happens (I recently made some changes involving this).

If you can post the model json you're using I might be able to see what is missing to cause it to not find the CBC and UMI (if your model has them).

@ashokpatowary
Copy link
Author

@jonn-smith I have updated the longbow version on August 5th. So, I think I am using the latest version of the tool.

I have modify one line i the built in longbow_model_mas_15_sc_10x3p model
instead of "FixedLengthRandomBases": 10 I am using "FixedLengthRandomBases": 12

   "adapters": {
        "5p_Adapter": "TCTACACGACGCTCTTCCGATCT",
        "3p_Adapter": "CCCATGTACTCTGCGTTGATACCACTGCTT",
        "A": "AGCTTACTTGTGAAGA",
        "B": "ACTTGTAAGCTGTCTA",
        "C": "ACTCTGTCAGGTCCGA",
        "D": "ACCTCCTCCTCCAGAA",
        "E": "AACCGGACACACTTAG",
        "F": "AGAGTCCAATTCGCAG",
        "G": "AATCAAGGCTTAACGG",
        "H": "ATGTTGAATCCTAGCG",
        "I": "AGTGCGTTGCGAATTG",
        "J": "AATTGCGTAGTTGGCC",
        "K": "ACACTTGGTCGCAATC",
        "L": "AGTAAGCCTTCGTGTC",
        "M": "ACCTAGATCAGAGCCT",
        "N": "AGGTATGCCGGTTAAG",
        "O": "AAGTCACCGGCACCTT",
        "P": "ATGAAGTGGCTCGAGA",
        "Poly_T": {
            "HomopolymerRepeat": [
                "T",
                30
            ]
        },
        "CBC": {
            "FixedLengthRandomBases": 16
        },
        "UMI": {
            "FixedLengthRandomBases": 12
        },
        "cDNA": "random"
    }

However when I checked the segmet output file; I can see the CR and ZU tag in the reads. Are these tags low confidence one?

m64381e_220728_012507/1000131277008/ccs 4 * 0 255 * * 0 0 ATGTTGAATCCTAGCGTCTACACGACGCTCTTCCGATCTAGCCAATCAGATACCTTCATTTAGCACATTTTTTTTTTTTTTTTTTTTTTTTTTTTGGAATCCAATAAATCCTCATATAGTCACCTTTATGTCTGAGGGCAACATCTGCTACAAAATAATTTACCTATGACCTAAGTCTTATTTAAACATTATACTGAAAAAGAAAATTTATTTCCCTTACTTGCTCTCCCTATGTATACATAGTAAATTAATTCTAATGTGGTTCAAATCCCACTCCTGAAAGCTATCTACATTGAACAGAAAAGTGTAATTCTTGGCCCCTCTTCCTCTCTCGCCCTTTTCCTTTGAAACTTTTTTTAGGCTTTGTATGAGACTTCCAAAAAAGCCCAGCATGGATCCCATTGACGATTTGACCTCTTTTAGCCCCCAAAATTCTGTGTTATTTTCCTCATATTATAGAAATAGGGACATTTCATCTCTTTTCTGAACTTAGAATAGCATAAAATTACTTCCAATTTGTGAATAAAATGACAAAATAGCATCAATTTTTGTATACATCAGTGTTTTTGTCCCATGTACTCTGCGTTGATACCACTGCTT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~7~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ RG:Z:0b940106 ac:B:i,40,2,42,0 ec:f:41.9829 ma:i:0 np:i:41 rq:f:0.999723 sn:B:f,11.8056,17.5088,3.94517,7.26763 we:i:10796555 ws:i:0 zm:i:131277 MM:Z:C+m,8,7,0,0,1,3,29,0,0,2,5,3,3,0,0,5,41,0,3,10,0,0,1,4,5,39,10,31,22,0,0,1,4,3,0,0,0,3,0,6,1,6,0,1,1,2,8,0,7,2,0,0,6,0,0,0,4,14,0,3,0,1,0,0,3,3,6,0,0,0,1,1,4,21,11,30,0,0,0,5,42,3,0,0,0,0,1,4,17,10,6,5,11,0,9,16,0,0,1,4,11,0,14,29,0,0,2,2,11,84,8,2,0,0,3,18,8,15,0,0,1,0,0,0,5,35,35,0,0,5,56,0,2,0,0; ML:B:C,0,0,1,51,25,3,2,13,29,22,12,162,5,1,20,2,0,0,7,0,0,114,2,28,102,180,8,232,19,1,19,15,1,0,0,0,0,201,197,11,1,11,34,11,8,46,1,0,19,22,28,118,3,2,179,12,3,51,15,0,3,1,0,59,147,2,4,2,1,20,1,11,1,154,75,2,2,36,119,1,3,9,1,4,2,32,3,1,14,229,49,62,0,8,1,1,1,22,80,1,12,6,141,1,13,51,0,12,7,2,1,217,13,105,12,82,102,55,14,9,1,1,6,216,0,173,1,1,41,22,0,0,0,3,24 YS:f:-7191.4 RC:i:1 XQ:Z:30/32,46/46,0/0,0/0,56/60,0/0,58/60,30/32,44/46,0/0,0/0,58/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,44/46,0/0,0/0,56/60,0/0,58/60,30/32,44/46,0/0,0/0,60/60,0/0,58/60,30/32,44/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,44/46,0/0,0/0,60/60,0/0,58/60,30/32,44/46,0/0,0/0,54/60,0/0,58/60,30/32,44/46,0/0,0/0,60/60,0/0,58/60,32/32 YQ:Z:0.9700 YN:Z:mas_15_sc_10x3p_single_none YV:i:1 YG:i:16 YK:Z:A XN:Z:m64381e_220728_012507/131277/ccs/3132_3731/H-3p_Adapter SG:Z:H:0-15,5p_Adapter:16-38,CBC:39-54,UMI:55-66,Poly_T:67-97,cDNA:98-569,3p_Adapter:570-599 CR:Z:AGCCAATCAGATACCT XC:Z:AGCCAATCAGATACCT XB:i:39 XF:i:100 ZU:Z:TCATTTAGCACA XM:Z:TCATTTAGCACA XU:i:55 it:Z:AGCCAATCAGATACCT,TCATTTAGCACA ic:i:1 im:Z:m64381e_220728_012507/131277/ccs is:i:1 XA:Z:XC-XM ZS:i:1

@jonn-smith
Copy link
Collaborator

@ashokpatowary Yes - those are low-confidence barcode and UMI tags. You will need to do post-processing to refine them. We have done a few things in the past that have worked well to do this refinement, but there are a lot of options out there.

As far as the tags go, Further down in the model you'll see a section for annotation_segments. This is what causes those tags to be written out. If you don't want to annotate those tags, you can set annoatation_segments to None.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants