
mclapply can't handle many samples; switch order it's applied in gene_summary? #80

Closed
warrenmcg opened this issue Jul 29, 2016 · 2 comments


@warrenmcg
Collaborator

Hello sleuth team,

I am attempting to re-analyze another group's data, which has an unbalanced set of 21 samples split between two groups. My attempts to run gene aggregation fail with the following traceback:

Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) :
  long vectors not supported yet: fork.c:376
Calls: make_sleuth_object ... <Anonymous> -> <Anonymous> -> lapply -> FUN -> sendMaster
No traceback available
Error: is(kal, "kallisto") is not TRUE
In addition: Warning message:
In parallel::mclapply(seq_along(obj_mod$kal), function(i) { :
  scheduled core 1 encountered error in user code, all values of the job will be affected
No traceback available
summarizing results
Error in is(obj, "sleuth") : object 'gene.so' not found
Calls: summarize_sleuth_results -> sleuth_results_modified -> stopifnot -> is
11: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
        ch), call. = FALSE, domain = NA)
10: stopifnot(is(kal, "kallisto"))
9: summarize_bootstrap(obj$kal[[i]], col, transform)
8: mutate_(.data, .dots = lazyeval::lazy_dots(...))
7: dplyr::mutate(summarize_bootstrap(obj$kal[[i]], col, transform),
       sample = cur_samp)
6: FUN(X[[i]], ...)
5: lapply(seq_along(obj$kal), function(i) {
       cur_samp <- obj$sample_to_covariates$sample[i]
       dplyr::mutate(summarize_bootstrap(obj$kal[[i]], col, transform),
           sample = cur_samp)
   })
4: sleuth_summarize_bootstrap_col(obj_mod, "scaled_reads_per_base",
       transform)
3: sleuth:::gene_summary(ret, aggregation_column, function(x) log2(x +
       0.5))
2: sleuth_prep(sample_to_covariates, full_model, target_mapping = target_mapping,
       norm_fun_counts = norm_function, norm_fun_tpm = norm_function,
       aggregation_column = aggregate_column)

I'm not that familiar with the internals of mclapply, but my understanding is that the work is split among several child processes, and each child sends its results back to the master process via sendMaster. To do this, the child serializes its results into a single raw vector, and the fork back-end cannot yet pass "long" vectors (more than 2^31 - 1 bytes, i.e. roughly 2 GB) back through that channel. Because I have so many samples, I suspect a child's share of the final aggregation exceeds that limit, causing the error seen above: long vectors not supported yet: fork.c:376. See a discussion here about this issue.
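As a back-of-the-envelope check (the sample, bootstrap, and transcript counts below are illustrative guesses, not measured from this dataset), it doesn't take an unreasonable experiment to cross the 2^31 - 1 byte limit on what sendMaster can serialize in one raw vector:

```r
# Illustrative sizes only -- not taken from the actual dataset.
n_samples    <- 21      # samples in the experiment
n_bootstraps <- 100     # kallisto bootstraps per sample
n_targets    <- 200000  # transcripts quantified per sample
bytes_total  <- n_samples * n_bootstraps * n_targets * 8  # 8 bytes per double
bytes_total > 2^31 - 1  # TRUE: the combined summaries exceed the ~2 GB limit
```

So if one child process ends up responsible for enough samples, its serialized result alone can exceed what the fork back-end can pass back.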

I've modified the code in gene_summary to switch the order in which mclapply is applied: instead of parallelizing over obj_mod$kal (one child per sample, each returning a large summary), the loop over samples runs serially and mclapply is applied to each kal object's set of bootstraps. With that change the error goes away, and I think the solution scales. If you're interested, I'll send you a pull request with the modified code (after doing the suggested steps in your contributing guidelines).
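To make the reordering concrete, here is a minimal sketch (the shape of the kal objects and the summarize_one helper are stand-ins of my own, not sleuth's actual internals): in the first version every child must serialize a full per-sample summary back through sendMaster; in the second, each child only ever returns one small per-bootstrap vector.

```r
# Stand-in for summarize_bootstrap: collapse one bootstrap matrix to a vector.
summarize_one <- function(boot) colMeans(boot)

# Before: parallelize over samples; each child serializes a big result.
summarize_all_v1 <- function(kals, cores = 2) {
  parallel::mclapply(kals, function(kal) {
    do.call(rbind, lapply(kal$bootstrap, summarize_one))
  }, mc.cores = cores)
}

# After: loop over samples serially; parallelize over the (many, small)
# bootstraps within each sample, so each child returns a small vector.
summarize_all_v2 <- function(kals, cores = 2) {
  lapply(kals, function(kal) {
    do.call(rbind, parallel::mclapply(kal$bootstrap, summarize_one,
                                      mc.cores = cores))
  })
}
```

Both versions produce the same summaries; only the unit of work shipped back to the master changes.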

@warrenmcg
Collaborator Author

warrenmcg commented Jul 29, 2016

Here is a blog post from r-bloggers discussing ways to reduce the memory footprint of mclapply: link. Strategy number 2 there also seems worth considering: putting the bootstraps passed to mclapply in their own environment, to minimize copying.
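My reading of that strategy, sketched below (all names are mine, not from the post or from sleuth): keep the large bootstrap list in its own minimal environment and pass workers an index plus that environment, so the function shipped to each child doesn't drag along a large calling frame.

```r
# Sketch of the "own environment" idea (illustrative only).
make_boot_env <- function(boots) {
  env <- new.env(parent = emptyenv())  # no enclosing frame to copy along
  env$boots <- boots
  env
}

summarize_boots <- function(boots, cores = 2) {
  env <- make_boot_env(boots)
  parallel::mclapply(seq_along(env$boots),
                     function(i, e) colMeans(e$boots[[i]]),
                     e = env, mc.cores = cores)
}
```

Whether this helps in practice would need profiling against the fork back-end's copy-on-write behavior, but it keeps the worker closure small by construction.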

@warrenmcg
Collaborator Author

Because of the changes to the code on the devel branch, I'm closing this issue.
