Parameter settings for ECCITE-seq like approaches #531
Comments
Hi @rfarouni , Thanks a lot for raising the issue. |
Oh one more thing, is it possible to share a few hundred reads for your experiment, just for some unit testing on my side? |
Commit d6abefd should fix it. |
It seems to work on some of the tests at my end; let me know if it's still a problem. The command to use would be
One thing to note, since it's a 5' protocol, you might have to change |
Thanks @k3yavi! If you could forward me a portable Linux binary, that would be great. Whenever I try to compile something on my computer, it fails half of the time. I have Ubuntu 18.04. I will ask for permission to share part of the data with you and get back to you. Also, does Alevin use the 10x cell barcode whitelist internally to correct barcodes? And do you recommend using the |
Hi @rfarouni, Please use this link to download a Linux-compatible binary; the fix will be available by default in the next release.
In our experiments, we find that, in expectation, the 10x generated experiments are clean enough that we don't need the 10x whitelisted barcode to be explicitly specified or used.
That's a very good question. Basically, the answer lies in how complicated the UMI graph network is. An experiment with antibody-derived barcodes (ADT) and a 20-protein panel generally doesn't need the In general, if you expect very low diversity in the number of barcodes in your experiment, I'd recommend that you use In short, 64 guide sequences are relatively high diversity and I'd advise skipping Hope it helps! |
Hi Avi, Thanks for the detailed reply. I was able to run it (see logs below), but I had to use I am not sure why the
I will be sending you some reads from the experiments for unit testing shortly. Thanks! Run 1:
Run 2:
|
Hi Rob, The joint distribution of the read and UMI counts can contain important information. The majority of observations (CB + guide combinations) lie along a well-defined, experiment-specific mean trend whose slope is given by the coverage (the ratio of reads to UMIs). The same regularity can be observed when aggregating across the cell barcodes. See figure below. The points below the black horizontal line are cells with fewer than 100 reads. At the guide level, it would look like this. In general, I often find myself needing to work with read counts. For example, the read counts can be used to estimate the hopping rate and detect hopped reads in multiplexed scRNA-seq data, as we show in this recent paper: https://www.nature.com/articles/s41467-020-16522-z Rick, |
Thanks for the detailed answer, Rick! I just saw that paper pop up yesterday and it was on my reading list :). Internally, we have access to the number of occurrences of each (UMI, gene) pair within each barcode, so I do not think it would be too difficult to provide read counts (optionally) along with deduplicated counts (though @k3yavi would be best equipped to say how easy or difficult this would be from the implementation perspective). Best, |
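To make the coverage idea above concrete, here is a minimal Python sketch that tallies reads and distinct UMIs per (cell barcode, guide) combination and takes their ratio. The record format and the function names (`reads_and_umis`, `coverage`) are hypothetical, and UMIs are collapsed by exact match only, which is a simplification and not how Alevin deduplicates.

```python
from collections import defaultdict

def reads_and_umis(records):
    """Count reads and distinct UMIs per (cell barcode, guide).

    `records` is a hypothetical iterable of (cell_barcode, umi, guide)
    tuples, one per mapped read. UMIs are collapsed by exact match only.
    """
    reads = defaultdict(int)
    umis = defaultdict(set)
    for cb, umi, guide in records:
        key = (cb, guide)
        reads[key] += 1
        umis[key].add(umi)
    return {key: (reads[key], len(umis[key])) for key in reads}

def coverage(n_reads, n_umis):
    """Coverage as described above: the ratio of reads to UMIs."""
    return n_reads / n_umis

# Toy input (made-up barcodes, UMIs, and guide names).
records = [
    ("AAAC", "UMI1", "g1"),
    ("AAAC", "UMI1", "g1"),
    ("AAAC", "UMI2", "g1"),
    ("AAAC", "UMI2", "g1"),
    ("TTTG", "UMI3", "g2"),
]
counts = reads_and_umis(records)
for key, (n_reads, n_umis) in sorted(counts.items()):
    print(key, n_reads, n_umis, coverage(n_reads, n_umis))
```

Plotting reads against UMIs for each key on a log-log scale would give the kind of trend described in the comment.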
You are welcome! |
Hi @rfarouni , Thanks for the detailed answer.
I'm not sure about this; it's possible that if the guide sequences were already reverse-complemented, then the above behavior would make sense. I am a little less familiar with guide-RNA-based ECCITE-seq data. The mRNA library should be 5' and the sequence does come from the forward strand, but do we expect the guide RNA to be on the forward strand as well? Unclear. I'll ask around at NYGC and let you know.
That's again a great question. In short, the single-cell world is expanding rapidly; alevin was initially designed to work with 10x 3' data, and some of the restrictions are outdated for multiplexed experiments based on combinatorial indexing. To be honest, 100k was just an arbitrary, high-enough number that was put down to throw away the obvious junk data. Having said that, you will notice that in both the logs you attached, a significant fraction of barcodes are thrown away, i.e.,
Congratulations on the awesome paper :). We were actually discussing your paper yesterday, and potentially modifying alevin to include a model for correcting index hopping, although it's still in the discussion phase. To answer your question, thanks for the feature request; I can add that feature on the weekend if it's urgent. However, you can also generate that with the current version using the Let me know your thoughts. |
Hi Avi, Yes, I just asked, and the guide sequences were reverse complemented. I was looking through the results and comparing them with the output of another alignment software. I noticed that there are substantially fewer UMIs per guide (within a cell) throughout (see figures for comparison). Also, the number of UMIs per cell barcode is consistently lower, and there are around 796 barcodes that are not found in the 10x whitelist, the majority of which tend to have only 1 UMI count. Here is the tally, where the TRUE column indicates that the barcode is found in the whitelist. The row names indicate the total number of UMIs. It would be great if you could implement the index hopping correction in Alevin. The software we have works fine if the number of samples is not too large. If I had known how to code in C++, I would have implemented part of the code more efficiently using Rcpp. Please let me know if you ever decide to add this feature to Salmon. I am more than happy to help. Rick, |
Hi @rfarouni, Is it possible to visualize the above two plots on the same scale? I am guessing your motivation here is a bit different, i.e. you are considering very low confidence barcodes (even those with 1 UMI), while we generally discard anything below 10 as noise. Thanks a lot for offering to help with the index hopping idea. I agree, it'd be great to include the model in the alevin framework. So far I have only got the gist of your paper; let us go through it in a bit more detail and we'll get back to you as soon as we have some free cycles for the integration. |
Oh wow, 14k vs. 126k is indeed a big difference. Is it possible to share the Alevin log for your run? From the logs you attached, it's not clear what the mapping rate is. May I also ask to look at another log file inside the logs folder, called salmon_quant.log? That would have more information regarding the mapping rate. |
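A tally like the one described above (rows are UMI totals, columns say whether the barcode is in the whitelist) can be sketched in a few lines of Python. The input dictionary, the barcodes, and the function name `tally_by_whitelist` are hypothetical; the actual tally in this thread was produced by other means.

```python
from collections import Counter

def tally_by_whitelist(umi_counts, whitelist):
    """Cross-tabulate barcodes by total UMI count and whitelist membership.

    `umi_counts` maps cell barcode -> total UMIs (hypothetical input).
    Returns a Counter keyed by (n_umis, in_whitelist).
    """
    table = Counter()
    for barcode, n_umis in umi_counts.items():
        table[(n_umis, barcode in whitelist)] += 1
    return table

# Toy data with made-up 4-base barcodes.
umi_counts = {"AAAA": 1, "CCCC": 1, "GGGG": 5, "TTTT": 1}
whitelist = {"AAAA", "GGGG"}
table = tally_by_whitelist(umi_counts, whitelist)
for (n_umis, in_wl), n_barcodes in sorted(table.items()):
    print(n_umis, in_wl, n_barcodes)
```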
Here you go
salmon_quant.log
|
Thanks ! Aha, so indeed the mapping rate is super low, that explains it. |
Lemme work with the reads you forwarded, is it possible to share the guide sequence as well ? Otherwise I won't be able to check the mapping rate. |
I wonder if the max 1-edit distance restriction is too stringent for 21 length barcodes. One important flag to play with is the
i.e. we use the equation For k=1, we had |
I will try your suggestion out. I might be able to share a toy dataset with fewer guides. I will update you as soon as I get it. |
Thanks @rfarouni! A small dataset with a few thousand reads would be great to have; the one I currently have has too few reads to test things on. |
With --minScoreFraction 0.607 I get a much better mapping rate. I wonder if there is a way to determine the optimal value empirically?
|
But now there are a lot of barcodes that are not in the whitelist. Also, with the default setting of --freqThreshold, no CB correction gets done
|
Thanks @rfarouni for the updates.
Glad to hear that. May I ask what percent of the reads are mapping now? It's not clear from the alevin logs you shared, but I think the total number of deduplicated UMIs is similar to your baseline experiment. I think defining an optimal empirical threshold is a great idea, but the issue is that 21-length barcodes are kind of in the middle, i.e. a tad longer than the regular barcodes and somewhat shorter than a full read. The full read alignment process does allow more erroneous reads to map, but 21 is a bit too short to work with. @rob-p might have more thoughts on this one.
Thanks again for checking this; it is indeed concerning. However, as I was mentioning earlier, in a regular single-cell experiment we end up throwing away almost all of these very low frequency cellular barcodes. I'd say even CBs with 45 reads are most probably noise and will be filtered away, because only a fraction of the reads will map, and after deduplication that leaves a very low count for the cellular barcode.
I can check why this is happening; let me know once you have a toy dataset to play with. |
Mapping rate = 73.2315%. The row numbers indicate UMI counts. Why would a cell barcode with 45 UMIs be considered noise in this context? If we are dealing with expression data, then I can see why, but not when we have so few features. |
Yes, absolutely, above I meant in scRNA-seq context, my apologies if it was not clear. |
I see. I will try providing the whitelist and see what happens. Once I get my hands on the toy dataset, I will share it with you as well :) |
When I add the whitelist using
|
Oh man, too many sanity checks over the years, can you just remove one cellular barcode from the full list and try again? Basically, many people have confused this flag by providing the full 10x whitelist without knowing the consequences, that's why the warning. Here our use case is specific and it should not matter. |
@k3yavi, we should add something like a --force-barcodes flag for this. |
Thanks! It worked, with a mapping rate of 73.4157%. See log below. However, I only get half the number of mapped reads per cell-feature. I still need to examine the existing alignment to understand why.
|
By the way, does alevin try to correct barcodes that are not in the whitelist but are 1 edit distance from the barcodes in the whitelist? Rick, |
Yes, if we provide a whitelist externally, then Alevin will try to correct barcodes that are not in the whitelist but are within 1 edit distance of a whitelisted barcode. |
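For illustration only, here is a minimal Python sketch of whitelist-based correction at distance 1. It is not Alevin's implementation: it assumes substitutions only (Hamming distance 1, no insertions or deletions) and leaves a barcode uncorrected when more than one whitelisted neighbor exists. The function names and toy barcodes are hypothetical.

```python
def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence that differs from `barcode` by one substitution."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]

def correct_barcode(barcode, whitelist):
    """Return the whitelisted barcode for `barcode`, or None.

    Assumption: substitutions only, and ambiguous cases (several
    whitelisted neighbors) are left uncorrected.
    """
    if barcode in whitelist:
        return barcode
    hits = {nb for nb in hamming_neighbors(barcode) if nb in whitelist}
    return hits.pop() if len(hits) == 1 else None

# Toy whitelist with made-up 16-base barcodes.
whitelist = {"ACGTACGTACGTACGT", "TTTTCCCCGGGGAAAA"}
print(correct_barcode("ACGTACGTACGTACGA", whitelist))  # one mismatch, corrected
print(correct_barcode("GGGGGGGGGGGGGGGG", whitelist))  # no close match, None
```

Enumerating the neighbors of the observed barcode keeps the lookup cost independent of the whitelist size, which matters for a whitelist with hundreds of thousands of entries.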
Hi,
I would like to quantify guide RNAs (based on 5'-tagged 10x scRNA-seq feature barcoding) using Alevin. Read 1 is 26 bp long (16 CB + 10 UMI) and Read 2 is 58 bp long (19 constant region + 21 guide sequence). Now, when I use the following settings
salmon alevin -l ISR --barcodeLength 16 --umiLength 10 --end 5 --featureStart 19 --featureLength 21
I get this error
However, when I use the following instead
salmon alevin -l ISR --citeseq --featureStart 19 --featureLength 21
It works, but since --citeseq assumes --umiLength=12, I get the following output
I also tried
salmon alevin -l ISR --chromium --featureStart 19 --featureLength 21 --tgMap guide_to_gene.tsv
But I get the following output
Any suggestions on how to get this working are highly appreciated!
Thanks
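The read layout described in this post can be sketched in Python as follows. This only illustrates the geometry (16 CB + 10 UMI on Read 1, a 21-base guide after a 19-base constant region on Read 2); it assumes the feature start is a 0-based offset into Read 2, and the function name `split_read_pair` and the toy reads are hypothetical.

```python
CB_LEN, UMI_LEN = 16, 10             # Read 1: 16 CB + 10 UMI
FEATURE_START, FEATURE_LEN = 19, 21  # Read 2: guide follows the constant region

def split_read_pair(read1, read2):
    """Split a read pair into (cell barcode, UMI, guide sequence).

    Assumption: FEATURE_START is a 0-based offset into read 2.
    """
    if len(read1) < CB_LEN + UMI_LEN:
        raise ValueError("read 1 is shorter than CB + UMI")
    if len(read2) < FEATURE_START + FEATURE_LEN:
        raise ValueError("read 2 is too short to contain the guide")
    cb = read1[:CB_LEN]
    umi = read1[CB_LEN:CB_LEN + UMI_LEN]
    guide = read2[FEATURE_START:FEATURE_START + FEATURE_LEN]
    return cb, umi, guide

# Made-up reads with the lengths given above (26 bp and 58 bp).
read1 = "A" * 16 + "C" * 10
read2 = "G" * 19 + "T" * 21 + "N" * 18
cb, umi, guide = split_read_pair(read1, read2)
print(cb, umi, guide)
```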