Reconcile Gene Aggregation Feature with Memory Footprint Overhaul #99
Conversation
For sleuth_gene_table, sleuth_results, and sleuth_to_matrix
Updated sleuth_gene_table, sleuth_prep, sleuth_wt, sleuth_fit, and models.
More precise definition of num_transcripts
temporarily disable filtering at gene level
#vc stash gene mode
#vc work in progress
#vc add functionality to filter by target id
#vc update aggregation
#vc comment out a section in aggregation
#vc deal with intersection
#vc first cut at empirical bayes
#vc fix small bug in empirical bayes
#vc testing out pmax
#vc empirical bayes and propagate filter
#vc norm by effective length
massive cleanup
more cleaning
refactor out check target mapping
remove filter by target_id
update warnings in check_target_mapping
more cleanup
Gene aggregation
update vignette
…dded sleuth_prep option to select number of cores for mclapply
…ons for num_cores to throw informative error
…trix did not have dim names that matched the formula or the sample ids
… option into 'spread_abundance_by' to prevent downstream error when preparing just one sample
…trary string can be used as a sample name; proposed solution for pachterlab#89
+ any change to transform_fxn leads to all existing fits having 'FALSE' transform_synced statuses
+ the user is warned that all existing fits need to be redone
+ User is now warned to re-do both sleuth_prep and sleuth_fit, since changing transform_fxn also affects bootstrap summaries
+ Bug 1: transform function was applied twice; now only applied in sleuth_prep
+ Bug 2: checks for transform sync status in sleuth_lrt referred to the wrong variable
…iable for obs_norm_gene and tpm_norm_gene
…e syntax rather than dplyr
… pared down target_mapping table
…es leads to downstream errors when doing gene aggregation
+ make sure blank entries are treated as NA values
+ fix behavior of overall list of gene IDs for gene-level filter_bool
+ this processed data uses the full target mappings
+ also includes TPM calculations for the bootstraps and the extra_bootstrap_summary
+ update Ellahi dataset README with info on the preprocessed results
…eparation for parallelization
Update: the proposed parallelization code is finished. Added back in the num_cores option from #94.
Hi @warrenmcg,
I'm really sorry this took so long. So far this is looking great. I've done a bit of testing locally and so far no issues. I'm going through and making comments to myself (no need for more action on your part) and will be making some minor modifications in the coming days.
Thanks so much for your hard work on this!
# for backwards compatibility
tidy_bs <- dplyr::select(tidy_bs, target_id,
  est_counts, sample = bootstrap_num)
tidy_bs <- merge(data.table::as.data.table(tidy_bs),
Make all calls explicit by using the scope operator (`::`).
done
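For illustration, a minimal sketch of the convention being asked for here: qualifying every non-base call with its package namespace via `::`, so the code does not depend on which packages happen to be attached. The toy `tidy_bs` data frame below is invented for the example.

```r
# toy stand-in for the tidied bootstrap table
tidy_bs <- data.frame(target_id = c("tx1", "tx2"),
                      est_counts = c(10, 20),
                      bootstrap_num = c(1L, 1L))

# unqualified call: only works if library(dplyr) has been attached
# tidy_bs <- select(tidy_bs, target_id, est_counts, sample = bootstrap_num)

# explicit call via the scope operator: unambiguous and attachment-independent
tidy_bs <- dplyr::select(tidy_bs, target_id, est_counts, sample = bootstrap_num)
```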
}

if (extra_bootstrap_summary) {
  bs_quant_est_counts <- aperm(apply(bs_mat, 2,
Look into whether or not there is a faster matrixStats function. I think I tested the one that exists there and this is actually faster; will check soon.
ret$bs_quants[[samp_name]]$est_counts <- bs_quant_est_counts
}
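For reference, a hedged sketch of the two approaches being weighed in the matrixStats comment above; the matrix dimensions and `probs` values are illustrative, not necessarily the ones sleuth uses.

```r
set.seed(1)
bs_mat <- matrix(rpois(30 * 1000, lambda = 50), nrow = 30)  # 30 bootstraps x 1000 targets
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)

# base-R version as in the diff: column-wise quantiles, then transpose
q_base <- aperm(apply(bs_mat, 2, quantile, probs = probs))

# matrixStats alternative: one call returning a targets-by-probabilities matrix
q_ms <- matrixStats::colQuantiles(bs_mat, probs = probs)
```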
bs_mat <- transform_fxn(bs_mat) |
Change to transform_fun for consistency.
done
Don't worry about it at all!! Glad to see you back in action! :)
Merged in -- thanks again for all the hard work! It will be making it into the next version, scheduled for sometime next week!
Hi @pimentel and Sleuth team,
I went ahead and made changes to reconcile the gene aggregation feature currently available in the most recent stable release with the memory footprint overhaul you've already done. I started from commit 8fba175 on the gene_agg branch and accepted all of the commits through to the current stable release (commit 048f055). I then added the two commits from the gene_memory branch (through commit b0d4731). This was to help facilitate an automatic merge. The rest is my work.
Here is a summary of what has changed from that point onward:
- Reworked the `reads_per_base_transform` function to use data.table throughout, which results in a significant speed boost. This is used to process the bootstrap matrix and the bootstrap TPM counts, and increased the speed several fold (20 seconds each for the matrix and TPM calculations in one of my own datasets, compared to several minutes each per sample using the old code). Also, because of the details of how I did this work, the TPM counts are done before the est. counts summary.
- Extended `sleuth_prep` to process the bootstrap data at the gene level. This takes advantage of using data.table throughout for fast processing (total processing is <1 min per sample with my own full datasets). A minimal sketch of this style of aggregation follows this list.
- Extended `sleuth_prep` to update the sleuth object features to gene-level data, not just `bs_summary`: `filter_df`, `filter_bool`, `obs_norm`, and `obs_norm_filt` now report gene-level information. `obs_raw` remains untouched, so it still has the transcript-level data.
- Added a `transform_function` option to give people the option of using another transformation function (e.g. using log2 instead of ln, or a different offset instead of 0.5); a usage sketch also follows this list. I extended the primitive `$<-` function and added a `transform_synced` item to the `sleuth_model` object, to make sure that users can't update the transformation function after having run `sleuth_prep`, in order to protect reproducibility when sharing sleuth results. If this work is accepted, the `gene_summary` function will be obsolete.
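To make the data.table-based gene aggregation concrete, here is a minimal, self-contained sketch of the general approach; the column names, the toy transcript-to-gene mapping, and the reads-per-base aggregation shown are illustrative assumptions, not the PR's exact code.

```r
library(data.table)

# toy bootstrap estimates at the transcript level (invented for illustration)
bs_dt <- data.table(
  target_id = c("tx1", "tx2", "tx3"),
  est_counts = c(10, 20, 5),
  eff_len = c(500, 300, 200)
)
# toy transcript-to-gene mapping
t2g <- data.table(target_id = c("tx1", "tx2", "tx3"),
                  ens_gene = c("geneA", "geneA", "geneB"))

# join the mapping, then sum reads per base within each gene in one pass
bs_dt <- merge(bs_dt, t2g, by = "target_id")
gene_dt <- bs_dt[, .(reads_per_base = sum(est_counts / eff_len)), by = ens_gene]
print(gene_dt)
```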
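And a hedged sketch of how the transformation option could look from the user's side; the argument name (`transform_fun`, per the review comment above), the sample table, and the kallisto paths are assumptions for illustration, not the PR's final interface.

```r
library(sleuth)

# sample-to-covariate table; the paths below are placeholders for kallisto output dirs
s2c <- data.frame(sample = c("A1", "A2", "B1", "B2"),
                  condition = c("ctrl", "ctrl", "treat", "treat"),
                  path = file.path("kallisto", c("A1", "A2", "B1", "B2")),
                  stringsAsFactors = FALSE)

# use log2 with a 0.5 offset instead of the default natural-log transform
so <- sleuth_prep(s2c, ~condition,
                  transform_fun = function(x) log2(x + 0.5))
```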
Other features that are not related:
The features from previous pull requests are included, specifically #71, #94, #95, and #96. If those are all good, you can just focus on this pull request. If they are not good, we can discuss and make changes accordingly.
Plan:
I plan to move the code within the for loop into its own function, and then set things up so that the bootstrap reading can be done in parallel, so we can add back in the num_cores option (a rough sketch of the idea follows below). On my machine, this would reduce what previously took almost three hours with the old code (the current release) down to 1-2 minutes (new code with parallel options).
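A rough sketch of that parallelization idea, assuming the per-sample bootstrap reading is factored into a helper (the `read_bootstrap_for_sample()` stub and the sample paths below are hypothetical); the real implementation in this PR may differ.

```r
library(parallel)

num_cores <- 4  # user-supplied via the sleuth_prep num_cores option

# hypothetical per-sample worker: reads and summarizes one sample's bootstraps
read_bootstrap_for_sample <- function(path) {
  # ... read kallisto bootstraps for this sample and return the summaries ...
  list(path = path)
}

sample_paths <- c("kallisto/A1", "kallisto/A2", "kallisto/B1", "kallisto/B2")

# process samples in parallel (mc.cores must be 1 on Windows)
bs_results <- mclapply(sample_paths, read_bootstrap_for_sample,
                       mc.cores = num_cores)
```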
Let me know what you think!