Allow users to generate and set seeds with a full 64 bits of entropy #10
Remove debugging print statement when setting the seed.
I'm happy to throw in some tests as well, if you think that the overall idea is worthwhile.
Thanks for the contribution. I like the idea to give users the possibility to provide more than 32 bits of randomness. However, I am unsure why you are using a raw vector and why
Looks good in principle. I would prefer a few changes:
//' ....
//' @export
// [[Rcpp::export(rng = true)]]
Rcpp::List generateSeedVectors(int nseeds, int nwords = 2) {
  Rcpp::List output(nseeds);
  for (int i = 0; i < nseeds; ++i) {
    Rcpp::IntegerVector current(nwords, R_random_int);
    output[i] = current;
  }
  return output;
}
I think that it is possible to simplify
Remove special dqset.seed() code for integer scalar inputs.
Point 1: Done.
Point 2: Yep, I also noticed that we could replace the old
Point 3: Done. Fascinating - I wasn't aware that it was possible to generate an R function fully from within C++ code! That's pretty cool. I also hardened Should some of the new header code be placed in the
Excellent! But please don't set the version to 0.1.0 yet. For now 0.0.5.1 to indicate "development version" would be more appropriate. Concerning using the
BTW, running the tests locally I get the following warnings:
Can these be avoided?
Regarding the shift warnings: GCC is being a bit silly here, because it's emitting warnings about code that cannot possibly be reached (and this lack of reachability is known at compile time). Just above the shift statements are I don't think this can be avoided in a general sense, if you don't necessarily know whether
sum <<= SHIFT - 1;
sum <<= 1;
... but this is pretty gross. I couldn't shut it up in the The number of warnings can be reduced by removing some of the shifts that are extraneous to the calculation of the seed. However, this would reduce compile and run-time protection against silly inputs.
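The two-step shift mentioned above can be sketched in isolation. This is only an illustration, not code from the PR: the helper name `shift_left_full` and its template parameters are hypothetical. It assumes the problem is a shift count that may equal the full bit width of the type, which is undefined behaviour in C++ and is what compilers typically warn about.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not part of dqrng): shift `value` left by SHIFT bits,
// where SHIFT may be as large as the full width of T.
// A single `value <<= SHIFT` is undefined when SHIFT equals the width of T,
// so the shift is split into two steps that are each strictly smaller.
template <typename T, unsigned SHIFT>
T shift_left_full(T value) {
    static_assert(std::is_unsigned<T>::value, "T must be an unsigned integer type");
    static_assert(SHIFT >= 1 && SHIFT <= std::numeric_limits<T>::digits,
                  "SHIFT must lie between 1 and the width of T");
    value <<= SHIFT - 1;  // at most width - 1 bits, always well defined
    value <<= 1;          // final bit; a full-width shift leaves zero
    return value;
}
```

With this sketch, `shift_left_full<std::uint32_t, 32>(x)` yields zero for any `x` rather than invoking undefined behaviour, at the cost of the somewhat awkward double shift noted in the comment above.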
P.S. Feel free to license the new files as you see fit.
Thanks a lot!
Thanks @rstub for the merge. What timeframe do you have in mind for pushing to CRAN? I imagine that you'll want some time for real-world testing; I will do some high-level tests on my end as well.
Good question. Unfortunately the time I could spend on dqrng went into chasing some weird error messages. Anyway, my plan is to also fix #7 with the next upload. And since MT from boost cannot be used as a replacement for C++11's MT, this means that MT has to go (it's slow and has some statistical issues, anyway). Since MT is the default, this is another disruptive change. However, I have not decided which one should be the new default.
Okay. Perhaps this is related to RcppCore/Rcpp#832? In any case, reading your post to Rcpp-devel suggests that it's not a problem with the dqrng code itself, which is comforting. As for the loss of MT, I don't think it's that bad. We've already changed how the seed is defined for single-integer inputs, so people are going to see a different stream of random numbers in the next dqrng release anyway. Casual users shouldn't be able to see the difference between "different stream due to different seed" and "different stream due to different seed and different PRNG". And advanced users (if there are any, aside from us) probably would have chosen a different PRNG anyway, given the issues you've mentioned.
I think that issue is different, and it is already visible on the CRAN checks page. I am going to mention that in the next Concerning MT: Yes, it makes sense to put these disruptive changes into one release.
FYI, some testing of
library(dqrng)
is.min <- is.zero <- is.max <- total <- 0
for (x in seq_len(200000)) {
  if (x%%1000==0) message("Iteration ", x)
  spawned <- generateSeedVectors(1000, nwords=1000)
  for (j in seq_along(spawned)) {
    current <- spawned[[j]]
    # The smallest 32-bit integer is represented as NA in R.
    lowest <- is.na(current)
    is.min <- is.min + sum(lowest)
    is.zero <- is.zero + sum(!lowest & current==0L)
    is.max <- is.max + sum(!lowest & current==.Machine$integer.max)
    total <- total + length(current)
  }
}
is.min # gives me 40
is.max # gives me 45
is.zero # gives me 43
total/2^32 # theoretical expectation, 46.56613
pbinom(is.min, total, 2^-32) # 0.1882622
pbinom(is.max, total, 2^-32) # 0.4473918
pbinom(is.zero, total, 2^-32) # 0.3337348
Behaves well at the critical points; looks pretty good to me.
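For reference, the theoretical expectation quoted in the test can be worked out by hand, assuming every 32-bit word is drawn uniformly and independently. The symbols N, p and X are introduced here only for illustration:

```latex
N = 200000 \times 1000 \times 1000 = 2 \times 10^{11}, \qquad p = 2^{-32}
\mathbb{E}[X] = N p = \frac{2 \times 10^{11}}{4294967296} \approx 46.566
\sqrt{\operatorname{Var}[X]} = \sqrt{N p (1 - p)} \approx \sqrt{46.566} \approx 6.82
```

The observed counts of 40, 45 and 43 each lie within about one standard deviation of the expected value, which is consistent with the unremarkable `pbinom` values reported in the test.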
@LTLA dqrng v0.1.0 is "on its way to CRAN".
As discussed in the comments in #8, this PR provides a method for generating seeds with 64 bits of entropy in R and passing them into C++ to seed the various PRNGs offered by dqrng. This aims to avoid statistical biases from seeding from a limited part of the state space.
The PR contains three major components:
- generateRawSeeds(), an R function that generates seeds with any specified number of bits of randomness. This is done using raw vectors where each vector represents a seed and each entry contains up to 8 bits (in most significant order).
- convert_seed.h, a header file that contains a C++ template function to convert a raw vector into an unsigned integer of any (appropriate) size. The idea is for other R packages that use dqrng's PRNGs in their C++ code to be able to use the seeds generated by generateRawSeeds() and seed those PRNGs from the full state space.
- dqset.seed() to allow R-only users to also use a raw vector to seed dqrng's PRNGs. This simply calls convert_seed.h to create uint64_ts from a raw vector.
I believe that these are quite modular changes - people don't have to use them if they don't want, but they can (in principle) improve the statistical performance of the PRNGs for more discerning users.
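To make the raw-vector conversion described above concrete, here is a minimal sketch of packing bytes into an unsigned integer in most-significant-first order. It is not the contents of convert_seed.h: the function name `bytes_to_seed` is hypothetical, a `std::vector<std::uint8_t>` stands in for an R raw vector so that the example does not depend on Rcpp, and 8-bit bytes are assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the conversion described in the PR text.
// Packs `bytes` into an unsigned integer of type OUT, treating the first
// element as the most significant byte.
template <typename OUT>
OUT bytes_to_seed(const std::vector<std::uint8_t>& bytes) {
    static_assert(std::is_unsigned<OUT>::value, "OUT must be an unsigned integer type");
    if (bytes.size() > sizeof(OUT)) {
        // Refuse inputs that carry more bits than OUT can hold.
        throw std::out_of_range("too many bytes for the requested seed type");
    }
    OUT seed = 0;
    for (std::uint8_t b : bytes) {
        // Make room for the next byte, then append it in the low 8 bits.
        seed = static_cast<OUT>((seed << 8) | b);
    }
    return seed;
}
```

For example, under these assumptions `bytes_to_seed<std::uint64_t>({0x01, 0x02})` gives `0x0102`, and eight bytes of `0xFF` give the largest 64-bit value, so a seed built this way can reach the full 64-bit state space.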