-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Questions regarding barcodes with zero reads after deduplication #55
Comments
Hi @wmacnair, Thanks for the detailed question (and for including the nice graphics!). I will also cc @csoneson on my reply for her — always invaluable — take. Regarding your first question; yes. There is often a desire to perform a filtering based on the final UMI counts of each cell. Note, that the UMI counts (after deduplication) can differ, sometimes dramatically, from the read counts before deduplication. Once you have loaded the expression matrix from Regarding your second question, it's a bit less clear what's going on with the number of detected genes. One question I have is, what count values are you using for gene detection? Is it possible that some of the cells have high unspliced gene content that is being discarded if using the default Best, |
Hi @wmacnair 🙂 I'm a bit surprised by the detected gene knee plot. I have seen other datasets where there were cells with basically no detected genes, but there were usually maybe 2-10 such cells, not hundreds as it looks like in your case 🤔 . In those data sets, I was running As for what's being shown in the plots, |
Hi both, thanks for your detailed responses. I need to read them carefully, so it will take me a little while to respond to the tricky bits. On the easy bit, that makes sense regarding the knee plots. I think I would then suggest making this more explicit in the documentation. At present, the help page for I will have a look at which genes are present in the barcodes that collapse to zero. I didn't make any choices around which values are used for gene detection - these are just For some additional context, here are some notes on what I've done:
I'm a complete noob when it comes to mapping, so there are multiple things I'm unsure about:
Hope this helps give some clues. I'll dig around in the genes for the collapsing barcodes and let you know what I find. Will part of my bsub job:
|
A bit of further analysis, looking at all of our samples. It seems that we consistently get ~5% of permit-list barcodes ending up with 0 genes (ignoring two samples with v small permit lists).
I picked the sample with the largest number of zero-gene barcodes (EG004_P), and calculated medians for various measures recorded in Is the higher mapping rate in zero-gene barcodes surprising? This is a chunk higher in the barcodes with zero genes.
What would be the easiest way to see where these highly duplicated barcodes map to in the genome? Thanks! |
Hi @wmacnair, Are any of these samples sharable? There are a few ways one might check what's going on. One would be to take the collated RAD file, and then use the Best, |
If you happen to have a bam file of reads mapped to the genome, I think you could extract the reads corresponding to a given cell barcode using e.g. https://github.com/10XGenomics/subset-bam. Regarding the higher mapping rate I don't know - if the reads are all coming from the same place maybe that does make sense (either they all map, or none of them maps). I'm a bit puzzled by the fact that the deduplicated UMI count is exactly zero (do I read the table correctly?) - in my examples there were few, but not zero, entries retained. 🤔 |
Is it possible that in the QC aggregation that, somehow, unmapped UMIs are being considered as mapped? |
The |
One of @wmacnair's points that I've overlooked is the reference itself. It may indeed be worth considering using something like the 10X reference annotation. A large number of pseudogenes could result in a larger degree of multi-mapping, which, under the standard |
Hi both, thanks again for taking the time! Some answers / responses, slightly reordered to help me make sense of new information:
Sadly not, I think - they're human samples from a clinical study that's only just started.
Yes, these are exact zeros.
I'm not promising success, but I will have a go at this...!
Thanks for these points. Here's my go at synthesising them into a possible explanation, with some questions / comments:
Things I can do to provide more information:
Let me know if there is anything else useful that I can look at. And also if you have any comments on the possible explanation. Best |
I guess I would be a bit surprised if it was because of using the primary file rather than the basic gencode one - it adds 54 genes/59 transcripts (to the 61544/246624 in the |
Ok, thanks that's helpful for understanding the differences. I've just been discussing this with a colleague and we're wondering if it could be the choice of gene biotype that is the issue, specifically the inclusion of pseudogenes. A possible explanation is:
Does this seem plausible? And in general, what is your advice on including / excluding pseudogenes? (Either at the stage of making your reference transcriptome, or later.) |
I think that a situation like that could potentially explain what you see. @rob-p can provide more insight into the UMI deduplication, but I think that the gene/pseudogene pairs must also be very similar, so that there is a tie for the highest count gene for all the UMIs. Running with the |
Ah nice, something testable! I will try to rerun both with |
@rob-p, quick Q - could you point me towards the syntax for the |
Sure thing! So, first please make sure you grab the latest release (0.5.1) which fixed an options conflict between the top-level help and the flag for the
If you don't pass a file to
to get all of the records corresponding to a barcode of interest. |
Thanks for the
i.e. 627302 out of 629498 reads for these barcodes are derived from two genes. Any thoughts? I've had a look at those genes and I don't really get much from what I see... |
Ahhh are these huge counts multimappers? |
I'm not really sure what a multimapper is... I've had a look at those two transcripts, but couldn't really see any way to check if they were very similar. Is that what you meant, or was it something else? |
So each entry output by view is a mapping record. But the same read can have many mapping records if the read matches more than one place. This would show up in the output as the same read id having multiple mappings to different genes. I think we also explicitly print a tag for the number of hits for each read ( |
Ok, yes I think that is the case. These are outputs for the top 40 zero-gene barcodes, with the
|
Ok, so this suggests that most of these reads are mapping to > 1 distinct gene (and therefore being discarded). Also, most of them are mapping to (what looks like) the same subset of distinct genes. This doesn't quite explain why those counts don't show up in When a read maps to multiple locations, the rules for how it are resolved depend on what those locations are. If they are transcripts of the same gene (spliced) it will be assigned to that gene. If they are spliced and unspliced regions of the same gene, it will be assigned to that gene (as an ambiguous UMI). However, if the mapping is to transcripts of different genes, and (as @csoneson noted) neither gene has a larger number of reads displaying that UMI, then the UMI is not uniquely resolvable to a gene and will be discarded. The top two rows, at least, seem to fall into this category as there are two distinct lncRNA to which the reads map equally well. So overall, one observation here is that, for these cells, it looks like the vast majority of mappings come from only a handful of genes (perhaps as few as 2) where almost all of the reads map to more than one gene. |
Ok, so I think I'm happy with how the mapping works, and why we end up excluding those reads. The thing that's confusing me is why we end up with some barcodes mapping just to those two genes... I've just checked, and it's not just most of the reads, for almost all of the top 40 barcodes it's almost exclusively those two genes:
What could be going on here, so that for a small but non-zero proportion of droplets they only map to these two v similar genes? [just saw you had also noticed this in your edited comment!] |
Hi @wmacnair, Thanks for all of the useful diagnostic info. Regarding:
The only way to really dig down into that is to try to look at the sequencing reads themselves that are inducing those mappings. Since the RAD file encodes where the reads maps but, for purposes of keeping the files compact and fast to parse, doesn't write down the reads themselves (like a SAM file). If you really want to track this down, you could certainly go back to the original FASTQ files, grep for the barcode of interest in the technical read to extract the read names, and then pull those read names out of the biological read file. Then you would have the actual sequence reads that are inducing these mappings. |
I just remembered that I still had this issue open, apologies...! I've reread through the thread, and I remain confused about what was going on in my samples. I have moved on by (a) using the default 10x reference, rather than rolling my own from the newest Gencode data, and (b) using It doesn't seem like much was gained from the discussion, apart from teaching me some basics of mapping 😅 However, reading through again it was clear to me that it would be great to have a small, purpose-built standalone R package for loading At the moment you suggest using For example, I would love to see a package, e.g.
I'm aware that you must have a very long list of things to do, and you've probably already considered this as something you may do. So just consider this as an additional point of support for if and when you get it :) Thanks! |
Hi @wmacnair, Thanks for the thoughtful suggestions! Regarding Regarding your other suggestion, I'd like to pull @mikelove and @DongzeHE into this discussion. That is, I know that Mike likes to think of Best, |
We have some stuff in Recently we took out non-R code to move to a separate package for ease of installation. EDS package on Bioc (Avi’s) now has the original alevin data storage format which is read in with C++. |
Hi
I've been reading and working through various tutorials and bits of documentation, and have found the
alevinQC
package very helpful for understanding what is going on with thegenerate-permit-list
step.I'm a bit confused about the mapping step after the permit list step. It feels like there's a missing QC step here. I'm describing my understanding of the process, both to check I've got it right and in case it's helpful to someone else.
In the (very nice)
alevinQC
report, we get something like this for the white list, telling us which of the barcodes are most likely "real cells" rather than empty droplets:These reads are then mapped, and some of them end up not mapping to any genes at all. This gives us the actual counts for each white-listed barcode, and looks like this:
I have a couple of questions:
Thanks!
Will
The text was updated successfully, but these errors were encountered: