Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconcile Gene Aggregation Feature with Memory Footprint Overhaul #99

Merged
merged 69 commits into from
May 18, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
69 commits
Select commit Hold shift + click to select a range
8fba175
wip gene_summary #vc
pimentel Jan 21, 2016
cc678f5
Updated documentation
map222 Feb 25, 2016
734f992
Further documentation
map222 Feb 26, 2016
3cde4f6
Updated sleuth_gene_table
map222 Feb 27, 2016
17d1579
modifications to suggestions from @map222 (major thanks!) #vc
pimentel Mar 2, 2016
a7e64fc
Merge branch 'map222-master' (pull request Updated documentation #64)
pimentel Mar 2, 2016
0ebf3c5
gene aggregation #vc for most part
pimentel Jan 23, 2016
07ca1d9
fix gene mode and pca bug
pimentel Jun 16, 2016
a5c91b1
make interface nicer and update vignette
pimentel Jun 17, 2016
659e36f
update version
pimentel Jun 20, 2016
bd811f4
fix to plot_pc_variance
pimentel Jun 20, 2016
58913fc
update vignette
pimentel Jun 20, 2016
2b86978
merge in conflicts
pimentel Jun 20, 2016
ed5ad30
Merge pull request #74 from pachterlab/hjp/geneagg
pimentel Jun 20, 2016
9d4a67a
update vignette
pimentel Jun 21, 2016
048f055
Merge pull request #75 from pachterlab/hjp/geneagg
pimentel Jun 21, 2016
846673e
switch to column wise reading of bootstraps #vc
pimentel Mar 7, 2017
b0d4731
some cleanup and massive speedup using matrixStats #vc
pimentel Mar 8, 2017
932fc47
modified mclapply usage in gene_summary to reduce memory footprint; a…
warrenmcg Mar 13, 2017
90b5b99
corrected default value for num_cores, and strengthened check conditi…
warrenmcg Mar 13, 2017
161d231
fixed 'give a design matrix' test, which failed because the design ma…
warrenmcg Mar 13, 2017
66cea3a
now lintr clean
warrenmcg Mar 13, 2017
c321f90
clean a few lints that I missed
warrenmcg Mar 13, 2017
59ff84c
added code from pull request #71, which drops unused levels of s2c
warrenmcg Mar 13, 2017
62cfab4
add in code from pull request #92, which adds in 'drop = F' option in…
warrenmcg Mar 13, 2017
d6d1d2c
change 'give a design matrix' test to match changes on 'devel' branch
warrenmcg Mar 14, 2017
780b8dc
add code to 'sleuth_results' to add in gene annotations, to address #86
warrenmcg Mar 15, 2017
d77401a
Merge branch 'issue80_81'
warrenmcg Mar 15, 2017
bbd8859
Merge branch 'issue86'
warrenmcg Mar 15, 2017
a5f10e3
add grave accent ` to sample expressions for plot_scatter so any arbi…
warrenmcg Mar 16, 2017
36d9e2d
initialize the bootstrap matrix and other variables for gene summary
warrenmcg Mar 21, 2017
28af2a1
add code to build an equivalent gene-level matrix within the bootstra…
warrenmcg Mar 21, 2017
06dc471
add code to produce gene-level bs_quant_tpm matrix
warrenmcg Mar 21, 2017
ca13ad3
move bs_quant_tpm code to before the gene aggregation bs_mat code to …
warrenmcg Mar 22, 2017
01fbafc
switch to data.table dcast for significant speed boost of bs_quant_tp…
warrenmcg Mar 22, 2017
75f4b22
add in conditionals for code affected by choice of transcript vs gene…
warrenmcg Mar 21, 2017
2a3b7ad
move gene_mode and aggregation_column to initial object definition
warrenmcg Mar 22, 2017
1b28ea1
add documentation to bs_quant_tpm and other bootstrap_summary gene-le…
warrenmcg Mar 22, 2017
43683da
updated documentation, and fixed previous typos that got reverted in …
warrenmcg Mar 23, 2017
b5f1ae9
forgot gene_mode argument
warrenmcg Mar 23, 2017
b2e3b30
update plot functions to accept to scaled_reads_per_base and update d…
warrenmcg Mar 23, 2017
872aeb1
update documentation, including for models function
warrenmcg Mar 23, 2017
9cf346d
Merge branch 'issue89'
warrenmcg Mar 23, 2017
682f972
merge 'merge' branch that had merged the gene_aggregation and memory …
warrenmcg Mar 23, 2017
08a409b
add transform_function argument to allow users to customize the trans…
warrenmcg Mar 23, 2017
d875e25
add 'transform_synced' item to sleuth_model, in case a user changes t…
warrenmcg Mar 23, 2017
b5762a5
correct if conditional typo
warrenmcg Mar 24, 2017
bb57ce8
extend primitive '$<-' and add a replacement function for transform_fxn
warrenmcg Mar 23, 2017
bf82d58
Merge branch 'issue89' after typo correction
warrenmcg Mar 24, 2017
60a4c49
update documentation for transform argument and functions; also fix misc
warrenmcg Mar 24, 2017
e64abe5
fix bug where bootstrap summary overwrites tpm summary
warrenmcg Mar 24, 2017
4d077d1
Merge branch 'merge' after fixing bug
warrenmcg Mar 24, 2017
7eec121
fix bugs that arose with transform function option
warrenmcg Mar 24, 2017
2fe6fa7
fixed attribute bug introduced by reusing obs_norm; created a new var…
warrenmcg Apr 19, 2017
f16ff40
speed boost by converting gene aggregation code to use just data.tabl…
warrenmcg Apr 19, 2017
29453d0
fix bug that leads to incongruent tpm_norm_gene and obs_norm_gene tables
warrenmcg Apr 20, 2017
08d6410
improve memory usage by removing objects sooner, and switching to use…
warrenmcg Apr 20, 2017
ba3b374
fix bug where target_id names with non-synatically valid R column nam…
warrenmcg Apr 20, 2017
ffcb4a4
warn user when target_ids are missing annotations from aggregation_co…
warrenmcg Apr 20, 2017
325c267
fix bug introduced if aggregation column has NA values
warrenmcg Apr 20, 2017
db4d209
add gene mappings to Ellahi test data, and update README
warrenmcg Apr 20, 2017
c005aa4
add processed sleuth data for testing purposes
warrenmcg Apr 20, 2017
9763f7a
load preprocessed results for comparison during testing
warrenmcg Apr 20, 2017
482110f
add code to test gene-level aggregation
warrenmcg Apr 20, 2017
2e89826
add tests to make sure details of sleuth prep are unchanged
warrenmcg Apr 20, 2017
d53d14f
add tests for fit and Wald Test aspects of sleuth pipeline
warrenmcg Apr 20, 2017
6401590
move bootstrap summary code to separate function in bootstrap.R in pr…
warrenmcg Apr 21, 2017
a1e4765
modify code to parallelize the bootstrap summary step
warrenmcg Apr 22, 2017
08edf11
add back in num_cores option; correct typo
warrenmcg Apr 22, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 6 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
Package: sleuth
Title: Tools for investigating RNA-Seq
Version: 0.28.0
Version: 0.28.1
Authors@R: c(person("Harold", "Pimentel", , "[email protected]", role = c("aut", "cre")))
Description: Investigate transcript abundance from "kallisto" and differential expression analysis from RNA-Seq data.
Description: Investigate transcript abundance from "kallisto" and differential
expression analysis from RNA-Seq data.
License: GPL-3
LazyData: true
URL: https://github.com/pachterlab/sleuth
Expand All @@ -17,11 +18,14 @@ Imports:
tidyr,
reshape2,
rhdf5,
parallel,
lazyeval,
matrixStats,
shiny
Suggests:
MASS,
lintr,
testthat,
knitr
VignetteBuilder: knitr
RoxygenNote: 5.0.1
6 changes: 6 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# Generated by roxygen2 (4.1.1): do not edit by hand

S3method("$<-",sleuth)
S3method(bias_table,kallisto)
S3method(bias_table,sleuth)
S3method(get_bootstraps,kallisto)
Expand All @@ -14,6 +15,7 @@ S3method(print,sleuth)
S3method(print,sleuth_model)
S3method(summary,sleuth)
S3method(tests,sleuth)
export("transform_fxn<-")
export(basic_filter)
export(bias_table)
export(bs_sigma_summary)
Expand All @@ -25,6 +27,7 @@ export(get_bootstrap_summary)
export(get_bootstraps)
export(get_quantile)
export(kallisto_table)
export(log_transform)
export(melt_bootstrap_sleuth)
export(models)
export(norm_factors)
Expand Down Expand Up @@ -63,6 +66,9 @@ export(sliding_window_grouping)
export(tests)
export(tpm_to_alpha)
export(transcripts_from_gene)
export(transform_status)
export(transform_status.sleuth)
export(transform_status.sleuth_model)
import(dplyr)
importFrom(data.table,fread)
importFrom(lazyeval,interp)
Expand Down
142 changes: 138 additions & 4 deletions R/bootstrap.R
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ aggregate_bootstrap <- function(kal, mapping, split_by = "gene_id",

if ( any(!complete.cases(mapping)) ) {
warning("Found some NAs in mapping. Removing them.")
mapping <- mapping[complete.cases(mapping),]
mapping <- mapping[complete.cases(mapping), ]
}

m_bs <- melt_bootstrap(kal, column)
Expand Down Expand Up @@ -316,8 +316,8 @@ sample_bootstrap <- function(obj, n_samples = 100L) {
# matrix sample
for (s in 1:n_samples) {
for (idx in 1:nrow(which_samp)) {
b <- which_samp[idx,s]
sample_mat[[s]][,idx] <- obj$kal[[idx]]$bootstrap[[b]]$est_counts
b <- which_samp[idx, s]
sample_mat[[s]][, idx] <- obj$kal[[idx]]$bootstrap[[b]]$est_counts
}
}

Expand Down Expand Up @@ -359,9 +359,143 @@ dcast_bootstrap.kallisto <- function(obj, units, nsamples = NULL) {
mat <- matrix(NA_real_, nrow = n_features, ncol = length(which_bs))

for (j in seq_along(which_bs)) {
mat[ ,j] <- obj[[ "bootstrap" ]][[which_bs[j]]][[ units ]]
mat[, j] <- obj[[ "bootstrap" ]][[which_bs[j]]][[ units ]]
}
rownames(mat) <- obj[["bootstrap"]][[1]][["target_id"]]

mat
}

# Function to process bootstraps for parallelization
process_bootstrap <- function(i, samp_name, kal_path,
num_transcripts, est_count_sf,
read_bootstrap_tpm, gene_mode,
extra_bootstrap_summary,
target_id, mappings, which_ids,
aggregation_column, transform_fxn)
{
dot(i)
bs_quants <- list()

num_bootstrap <- as.integer(rhdf5::h5read(kal_path$path,
"aux/num_bootstrap"))
if (num_bootstrap == 0) {
stop(paste0("File ", kal_path, " has no bootstraps.",
"Please generate bootstraps using \"kallisto quant -b\"."))
}

# TODO: only perform operations on filtered transcripts
eff_len <- rhdf5::h5read(kal_path$path, "aux/eff_lengths")
bs_mat <- read_bootstrap_mat(fname = kal_path$path,
num_bootstraps = num_bootstrap,
num_transcripts = num_transcripts,
est_count_sf = est_count_sf)

if (read_bootstrap_tpm) {
bs_quant_tpm <- aperm(apply(bs_mat, 1, counts_to_tpm,
eff_len))

# gene level code is analogous here to below code
if (gene_mode) {
colnames(bs_quant_tpm) <- target_id
# Make bootstrap_num an explicit column; each is treated as a "sample"
bs_tpm_df <- data.frame(bootstrap_num = c(1:num_bootstrap),
bs_quant_tpm, check.names = F)
rm(bs_quant_tpm)
# Make long tidy table; this step is much faster
# using data.table melt rather than tidyr gather
tidy_tpm <- data.table::melt(bs_tpm_df, id.vars = "bootstrap_num",
variable.name = "target_id",
value.name = "tpm")
tidy_tpm <- data.table::as.data.table(tidy_tpm)
rm(bs_tpm_df)
tidy_tpm$target_id <- as.character(tidy_tpm$target_id)
tidy_tpm <- merge(tidy_tpm, mappings, by = "target_id",
all.x = T)
# Data.table dcast uses non-standard evaluation
# So quote the full casting formula to make sure
# "aggregation_column" is interpreted as a variable
# see: http://stackoverflow.com/a/31295592
quant_tpm_formula <- paste("bootstrap_num ~",
aggregation_column)
bs_quant_tpm <- data.table::dcast(tidy_tpm,
quant_tpm_formula, value.var = "tpm",
fun.aggregate = sum)
bs_quant_tpm <- as.matrix(bs_quant_tpm[, -1])
rm(tidy_tpm) # these tables are very large
}
bs_quant_tpm <- aperm(apply(bs_quant_tpm, 2,
quantile))
colnames(bs_quant_tpm) <- c("min", "lower", "mid",
"upper", "max")
ret$bs_quants[[samp_name]]$tpm <- bs_quant_tpm
}

if (gene_mode) {
# I can combine target_id and eff_len
# I assume the order is the same, since it's read from the same kallisto
# file and each kallisto file has the same order
eff_len_df <- data.frame(target_id, eff_len,
stringsAsFactors = F)
# make bootstrap number an explicit column to facilitate melting
bs_df <- data.frame(bootstrap_num = c(1:num_bootstrap),
bs_mat, check.names = F)
rm(bs_mat)
# data.table melt function is much faster than tidyr's gather function
# output is a long table with each bootstrap's value for each target_id
tidy_bs <- data.table::melt(bs_df, id.vars = "bootstrap_num",
variable.name = "target_id",
value.name = "est_counts")
rm(bs_df)
# not sure why, but the melt function always returns a factor,
# even when setting variable.factor = F, so I coerce target_id
tidy_bs$target_id <- as.character(tidy_bs$target_id)
# combine the long tidy table with eff_len and aggregation mappings
# note that bootstrap number is treated as "sample" here
# for backwards compatibility
tidy_bs <- dplyr::select(tidy_bs, target_id,
est_counts, sample = bootstrap_num)
tidy_bs <- merge(data.table::as.data.table(tidy_bs),
Copy link
Collaborator

@pimentel pimentel May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make all explicit by using scope operator.

done

data.table::as.data.table(eff_len_df), by = "target_id",
all.x = T)
tidy_bs <- merge(tidy_bs, mappings, by = "target_id",
all.x = T)
# create the median effective length scaling factor for each gene
scale_factor <- tidy_bs[, scale_factor := median(eff_len),
by = eval(parse(text=aggregation_column))]
# use the old reads_per_base_transform method to get gene scaled counts
scaled_bs <- reads_per_base_transform(tidy_bs,
scale_factor$scale_factor,
aggregation_column,
mappings)
# this step undoes the tidying to get back a matrix format
# target_ids here are now the aggregation column ids
bs_mat <- data.table::dcast(scaled_bs, sample ~ target_id,
value.var = "scaled_reads_per_base")
# this now has the same format as the transcript matrix
# but it uses gene ids
bs_mat <- as.matrix(bs_mat[, -1])
rm(tidy_bs, scaled_bs)
}

if (extra_bootstrap_summary) {
bs_quant_est_counts <- aperm(apply(bs_mat, 2,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look into whether or not there is a faster matrixStats function. I think I tested the 1 that exists there and this is actually faster. will check soon

quantile))
colnames(bs_quant_est_counts) <- c("min", "lower",
"mid", "upper", "max")
ret$bs_quants[[samp_name]]$est_counts <- bs_quant_est_counts
}

bs_mat <- transform_fxn(bs_mat)
Copy link
Collaborator

@pimentel pimentel May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to transform_fun for consistency

done

# If bs_mat was made at gene-level, already has column names
# If at transcript-level, need to add target_ids
if(!gene_mode) {
colnames(bs_mat) <- target_id
}
# all_sample_bootstrap[, i] bootstrap point estimate of the inferential
# variability in sample i
# NOTE: we are only keeping the ones that pass the filter
bootstrap_result <- matrixStats::colVars(bs_mat[, which_ids])

list(index = i, bs_quants = bs_quants, bootstrap_result = bootstrap_result)
}
14 changes: 14 additions & 0 deletions R/gene_analysis.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
propagate_transcript_filter <- function(filter_df, target_mapping,
grouping_column) {

filtered_target_mapping <- dplyr::inner_join(as.data.table(filter_df), # nolint
as.data.table(target_mapping), by = 'target_id') # nolint

filtered_target_mapping <- dplyr::select_(filtered_target_mapping,
grouping_column)

data.table::setnames(filtered_target_mapping, grouping_column, 'target_id')
filtered_target_mapping <- dplyr::distinct(filtered_target_mapping)

filtered_target_mapping
}
10 changes: 10 additions & 0 deletions R/likelihood.R
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,16 @@ sleuth_lrt <- function(obj, null_model, alt_model) {
model_exists(obj, null_model)
model_exists(obj, alt_model)

if(!obj$fits[[alt_model]]$transform_synced) {
stop("Model '", alt_model, "' was not computed using the sleuth object's",
" current transform function. Please rerun sleuth_fit for this model.")
}

if(!obj$fits[[null_model]]$transform_synced) {
stop("Model '", null_model, "' was not computed using the sleuth object's",
" current transform function. Please rerun sleuth_fit for this model.")
}

if ( !likelihood_exists(obj, null_model) ) {
obj <- compute_likelihood(obj, null_model)
}
Expand Down
5 changes: 5 additions & 0 deletions R/matrix.R
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,11 @@
#' ("obs_norm" or "obs_raw")
#' @param which_units character vector of length one. Which units to use ("tpm"
#' or "est_counts")
#' @return a \code{list} with an attribute 'data', which contains a matrix of target_ids
#' and transcript expression in \code{which_units}
#' @examples
#' sleuth_matrix <- sleuth_to_matrix(sleuth_obj, 'obs_norm', 'tpm')
#' head(sleuth_matrix$data) # look at first 5 transcripts, sorted by name
#' @export
sleuth_to_matrix <- function(obj, which_df, which_units) {
if ( !(which_df %in% c("obs_norm", "obs_raw")) ) {
Expand Down
Loading