
mclapply can't handle many samples; switch order it's applied in gene_summary? #80

Closed
warrenmcg opened this issue Jul 29, 2016 · 2 comments


@warrenmcg
Collaborator

Hello sleuth team,

I am attempting to re-analyze another group's data, which has an unbalanced set of 21 samples split between two groups. My attempts to run gene aggregation fail with the following traceback:

Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) :
  long vectors not supported yet: fork.c:376
Calls: make_sleuth_object ... <Anonymous> -> <Anonymous> -> lapply -> FUN -> sendMaster
No traceback available
Error: is(kal, "kallisto") is not TRUE
In addition: Warning message:
In parallel::mclapply(seq_along(obj_mod$kal), function(i) { :
  scheduled core 1 encountered error in user code, all values of the job will be affected
No traceback available
summarizing results
Error in is(obj, "sleuth") : object 'gene.so' not found
Calls: summarize_sleuth_results -> sleuth_results_modified -> stopifnot -> is
11: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
        ch), call. = FALSE, domain = NA)
10: stopifnot(is(kal, "kallisto"))
9: summarize_bootstrap(obj$kal[[i]], col, transform)
8: mutate_(.data, .dots = lazyeval::lazy_dots(...))
7: dplyr::mutate(summarize_bootstrap(obj$kal[[i]], col, transform),
       sample = cur_samp)
6: FUN(X[[i]], ...)
5: lapply(seq_along(obj$kal), function(i) {
       cur_samp <- obj$sample_to_covariates$sample[i]
       dplyr::mutate(summarize_bootstrap(obj$kal[[i]], col, transform),
           sample = cur_samp)
   })
4: sleuth_summarize_bootstrap_col(obj_mod, "scaled_reads_per_base",
       transform)
3: sleuth:::gene_summary(ret, aggregation_column, function(x) log2(x +
       0.5))
2: sleuth_prep(sample_to_covariates, full_model, target_mapping = target_mapping,
       norm_fun_counts = norm_function, norm_fun_tpm = norm_function,
       aggregation_column = aggregate_column)

I'm not that familiar with the internals of mclapply, but my understanding is that the work is split among several child processes, and each child sends its results back to the master process via sendMaster. To do this, the child serializes its results into a single raw vector, and the fork back-end cannot yet pass "long" vectors (more than 2^31 - 1 bytes, i.e. roughly 2 GB) back through that channel. Because I have so many samples, I suspect a child's share of the final aggregation exceeds that limit, causing the error seen above: long vectors not supported yet: fork.c:376. See a discussion here about this issue.
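As a back-of-the-envelope check (the sample, bootstrap, and transcript counts below are illustrative guesses, not measured from this dataset), it doesn't take an unreasonable experiment to cross the 2^31 - 1 byte limit on what sendMaster can serialize in one raw vector:

```r
# Illustrative sizes only -- not taken from the actual dataset.
n_samples    <- 21      # samples in the experiment
n_bootstraps <- 100     # kallisto bootstraps per sample
n_targets    <- 200000  # transcripts quantified per sample
bytes_total  <- n_samples * n_bootstraps * n_targets * 8  # 8 bytes per double
bytes_total > 2^31 - 1  # TRUE: the combined summaries exceed the ~2 GB limit
```

So if one child process ends up responsible for enough samples, its serialized result alone can exceed what the fork back-end can pass back.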

I've modified the code in gene_summary to switch the order in which mclapply is applied: instead of parallelizing over obj_mod$kal (one child per sample, each returning a large summary), the loop over samples runs serially and mclapply is applied to each kal object's set of bootstraps. With that change the error goes away, and I think the solution scales. If you're interested, I'll send you a pull request with the modified code (after doing the suggested steps in your contributing guidelines).
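To make the reordering concrete, here is a minimal sketch (the shape of the kal objects and the summarize_one helper are stand-ins of my own, not sleuth's actual internals): in the first version every child must serialize a full per-sample summary back through sendMaster; in the second, each child only ever returns one small per-bootstrap vector.

```r
# Stand-in for summarize_bootstrap: collapse one bootstrap matrix to a vector.
summarize_one <- function(boot) colMeans(boot)

# Before: parallelize over samples; each child serializes a big result.
summarize_all_v1 <- function(kals, cores = 2) {
  parallel::mclapply(kals, function(kal) {
    do.call(rbind, lapply(kal$bootstrap, summarize_one))
  }, mc.cores = cores)
}

# After: loop over samples serially; parallelize over the (many, small)
# bootstraps within each sample, so each child returns a small vector.
summarize_all_v2 <- function(kals, cores = 2) {
  lapply(kals, function(kal) {
    do.call(rbind, parallel::mclapply(kal$bootstrap, summarize_one,
                                      mc.cores = cores))
  })
}
```

Both versions produce the same summaries; only the unit of work shipped back to the master changes.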

@warrenmcg
Collaborator Author

warrenmcg commented Jul 29, 2016

Here is a blog post from r-bloggers discussing ways to reduce the memory footprint of mclapply: link. Strategy number 2 there also seems worth considering: putting the bootstraps passed to mclapply in their own environment, to minimize copying.
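My reading of that strategy, sketched below (all names are mine, not from the post or from sleuth): keep the large bootstrap list in its own minimal environment and pass workers an index plus that environment, so the function shipped to each child doesn't drag along a large calling frame.

```r
# Sketch of the "own environment" idea (illustrative only).
make_boot_env <- function(boots) {
  env <- new.env(parent = emptyenv())  # no enclosing frame to copy along
  env$boots <- boots
  env
}

summarize_boots <- function(boots, cores = 2) {
  env <- make_boot_env(boots)
  parallel::mclapply(seq_along(env$boots),
                     function(i, e) colMeans(e$boots[[i]]),
                     e = env, mc.cores = cores)
}
```

Whether this helps in practice would need profiling against the fork back-end's copy-on-write behavior, but it keeps the worker closure small by construction.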

@warrenmcg
Collaborator Author

Because of the changes to the code on the devel branch, I'm closing this issue.
