Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to generate and set seeds with a full 64 bits of entropy #10

Merged
merged 33 commits into from
Feb 7, 2019

Conversation

LTLA
Copy link
Contributor

@LTLA LTLA commented Jan 31, 2019

As discussed in the comments in #8, this PR provides a method for generating seeds with 64 bits of entropy in R and passing them into C++ to seed the various PRNGs offered by dqrng. This aims to avoid statistical biases from seeding from a limited part of the state space.

The PR contains three major components:

  • generateRawSeeds(), an R function that generates seeds with any specified number of bits of randomness. This is done using raw vectors where each vector represents a seed and each entry contains up to 8 bits (in most significant order).
  • convert_seed.h, a header file that contains a C++ template function to convert a raw vector into a unsigned integer of any (appropriate) size. The idea is for other R packages that use dqrng's PRNGs in their C++ code to be able to use the seeds generated by generateRawSeeds() and seed those PRNGs from the full state space.
  • A modification to dqset.seed() to allow R-only users to also use a raw vector to seed dqrng's PRNGs. This simply calls convert_seed.h to create uint64_ts from a raw vector.

I believe that these are quite modular changes - people don't have to use them if they don't want, but they can (in principle) improve the statistical performance of the PRNGs for more discerning users.

@LTLA
Copy link
Contributor Author

LTLA commented Jan 31, 2019

I'm happy to throw in some tests as well, if you think that the overall idea is worthwhile.

@rstub
Copy link
Member

rstub commented Feb 1, 2019

Thanks for the contribution. I like the idea to give users the possibility to provide more than 32 bits of randomness. However, I am unsure why you are using a raw vector and why generateRawSeeds() is an R function. Can you please create an issue to discuss the design first? Thanks.

@rstub
Copy link
Member

rstub commented Feb 5, 2019

Looks good in principle. I would prefer a few changes:

  • Please add yourself to DESCRIPTION as contributor.
  • convert_seed can handle single element vectors as well, so I see no use for the current dqset_seed function. This is a breaking change, since the same user-provided seed will result in a different seed for the RNG, so the next version will probably be 0.1.0. BTW, why do you use SEXP instead of Rcpp::IntegerVector as argument for dqset_seed_vec?
  • generateSeedVectors could be implemented in C++ only and go into a separate C++ file. It might also make sense to split the convert_seed.h header in to two parts: One for the actual seed conversion and one for interfacing with R's RNG. It can also be simplified slightly. Note that I have also removed the bits to words conversion, since the number of bits per word is fixed anyway:
//' ....
//' @export
// [[Rcpp::export(rng = true)]]
Rcpp::List generateSeedVectors(int nseeds, int nwords = 2) {
  Rcpp::List output(nseeds);
  for (int i=0; i<nseeds; ++i) {
    Rcpp::IntegerVector current(nwords, R_random_int);
    output[i]=current;
  }
  return output;
}

I think that it is possible to simplify seed_conversion_internal slightly using make_unsigned, but I haven't tried it.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 5, 2019

Point 1: Done.

Point 2: Yep, I also noticed that we could replace the old dqset_seed entirely, but I didn't want to introduce a breaking change without consultation. I've done it now. I used SEXP purely out of habit (a long time ago, I used to program with the raw C API, rather than using Rcpp); I've now changed all of them to the more natural Rcpp::IntegerVector. I also took the liberty of bumping the version number.

Point 3: Done. Fascinating - I wasn't aware that it was possible to generate an R function fully from within C++ code! That's pretty cool. I also hardened seed_conversion_internal to better protect against arbitrary integers in the template parameters or in the input types (should someone try to use it directly, or should the code be compiled on a system with 64-bit ints or other weirdness).

Should some of the new header code be placed in the dqrng:: namespace?

@rstub
Copy link
Member

rstub commented Feb 6, 2019

Excellent! But please don't set the version to 0.1.0 yet. For now 0.0.5.1 to indicate "development version" would be more appropriate.

Concerning using the dqrng namespace: Both the convert_seed* function templates and the R_random* functions should go in there.

@rstub
Copy link
Member

rstub commented Feb 6, 2019

BTW, running the tests locally I get the following warnings:


/home/ralf/devel/dqrng/include/convert_seed.h: In instantiation of ‘OUT convert_seed_internal(const IN*, size_t) [with OUT = short unsigned int; IN = int; int SHIFT = 32; size_t = long unsigned int]’:
/home/ralf/devel/dqrng/include/convert_seed.h:81:45:   required from ‘T convert_seed(const int*, size_t) [with T = short unsigned int; size_t = long unsigned int]’
/home/ralf/devel/dqrng/include/convert_seed.h:86:27:   required from ‘T convert_seed(Rcpp::IntegerVector) [with T = short unsigned int; Rcpp::IntegerVector = Rcpp::Vector<13>]’
convert.cpp:12:26:   required from ‘std::__cxx11::string convert_base(Rcpp::IntegerVector) [with U = short unsigned int; std::__cxx11::string = std::__cxx11::basic_string<char>; Rcpp::IntegerVector = Rcpp::Vector<13>]’
convert.cpp:20:39:   required from here
/home/ralf/devel/dqrng/include/convert_seed.h:47:25: warning: right shift count >= width of type [-Wshift-count-overflow]
             if (current >> SHIFT != 0) { // using 'current' to avoid shift warnings from GCC on common 32-bit 'int' platforms.
                 ~~~~~~~~^~~~~~~~
/home/ralf/devel/dqrng/include/convert_seed.h:56:17: warning: left shift count >= width of type [-Wshift-count-overflow]
             sum <<= SHIFT;
             ~~~~^~~~~~~~~
/home/ralf/devel/dqrng/include/convert_seed.h: In instantiation of ‘OUT convert_seed_internal(const IN*, size_t) [with OUT = unsigned int; IN = int; int SHIFT = 32; size_t = long unsigned int]’:
/home/ralf/devel/dqrng/include/convert_seed.h:81:45:   required from ‘T convert_seed(const int*, size_t) [with T = unsigned int; size_t = long unsigned int]’
/home/ralf/devel/dqrng/include/convert_seed.h:86:27:   required from ‘T convert_seed(Rcpp::IntegerVector) [with T = unsigned int; Rcpp::IntegerVector = Rcpp::Vector<13>]’
convert.cpp:12:26:   required from ‘std::__cxx11::string convert_base(Rcpp::IntegerVector) [with U = unsigned int; std::__cxx11::string = std::__cxx11::basic_string<char>; Rcpp::IntegerVector = Rcpp::Vector<13>]’
convert.cpp:25:39:   required from here
/home/ralf/devel/dqrng/include/convert_seed.h:47:25: warning: right shift count >= width of type [-Wshift-count-overflow]
             if (current >> SHIFT != 0) { // using 'current' to avoid shift warnings from GCC on common 32-bit 'int' platforms.
                 ~~~~~~~~^~~~~~~~
/home/ralf/devel/dqrng/include/convert_seed.h:56:17: warning: left shift count >= width of type [-Wshift-count-overflow]
             sum <<= SHIFT;
             ~~~~^~~~~~~~~

Can these be avoided?

@LTLA
Copy link
Contributor Author

LTLA commented Feb 6, 2019

Regarding the shift warnings: GCC is being a bit silly here, because it's emitting warnings about code that cannot possibly be reached (and this lack of reachability is known at compile time). Just above the shift statements are if clauses requiring that the word size be greater than the SHIFT value, and these are all constexprs, so I would have thought GCC would be able to figure out that this code was actually safe.

I don't think this can be avoided in a general sense, if you don't necessarily know whether OUT is larger than SHIFT (e.g., uint16_t outputs). I've considered switching to mult/division instead of shifting, but that would require specifying the scaling factor in an integer type large enough to hold 2^SHIFT, which introduces yet another type into the mix. I guess one could break up the shifts with something like:

sum <<= SHIFT - 1;
sum <<= 1;

... but this is pretty gross. I couldn't shut it up in the sourceCpp either.

The number of warnings can be reduced by removing some of the shifts that are extraneous to the calculation of the seed. However, this would reduce compile and run-time protection against silly inputs.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 6, 2019

P.S. Feel free to license the new files as you see fit.

@rstub rstub merged commit 65c5b95 into daqana:master Feb 7, 2019
@rstub
Copy link
Member

rstub commented Feb 7, 2019

Thanks a lot!

@LTLA
Copy link
Contributor Author

LTLA commented Feb 8, 2019

Thanks @rstub for the merge. What timeframe do you have in mind for pushing to CRAN? I imagine that you'll want some time for real-world testing; I will do some high-level tests on my end as well.

@rstub
Copy link
Member

rstub commented Feb 12, 2019

Good question. Unfortunately the time I could spend on dqrng went into chasing some weird error messages. Anyway, my plan is to also fix #7 with the next upload. And since MT from boost cannot be used as a replacement for C++11's MT, this means that MT has to go (it's slow and has some statistical issues, anyway). Since MT is the default, this is another disruptive change. However, I have not decided which one should be the new default.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 12, 2019

Okay. Perhaps this is related to RcppCore/Rcpp#832? In any case, reading your post to Rcpp-devel suggests that it's not a problem with the dqrng code itself, which is comforting.

As for the loss of MT, I don't think it's that bad. We've already changed how the seed is defined for single-integer inputs, so people are going to see a different stream of random numbers in the next dqrng release anyway. Casual users shouldn't be able to see the difference between "different stream due to different seed" and "different stream due to different seed and different PRNG". And advanced users (if there are any, aside from us) probably would have chosen a different PRNG anyway, given the issues you've mentioned.

@rstub
Copy link
Member

rstub commented Feb 12, 2019

I think that issue is different, and it is already visible on the CRAN checks page. I am going to mention that in the next cran-messages.md.

Concerning MT: Yes, it makes sense to put these disruptive changes into one release.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 25, 2019

FYI, some testing of generateSeedVectors to see whether it traverses the full range:

library(dqrng)
is.min <- is.zero <- is.max <- total <- 0
for (x in seq_len(200000)) {
    if (x%%1000==0) message("Iteration ", x)
    spawned <- generateSeedVectors(1000, nwords=1000)

    for (j in seq_along(spawned)) {
        current <- spawned[[j]]
        lowest <- is.na(current)
        is.min <- is.min + sum(lowest)
        is.zero <- is.zero + sum(!lowest & current==0L)
        is.max <- is.max + sum(!lowest & current==.Machine$integer.max)
        total <- total + length(current)
    }
}

is.min # gives me 40
is.max # gives me 45
is.zero # gives me 43
total/2^32 # theoretical expectation, 46.56613

pbinom(is.min, total, 2^-32) # 0.1882622
pbinom(is.max, total, 2^-32) # 0.4473918
pbinom(is.zero, total, 2^-32) # 0.3337348

Behaves well at the critical points; looks pretty good to me.

@rstub
Copy link
Member

rstub commented Mar 4, 2019

@LTLA dqrng v0.1.0 is "on its way to CRAN".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants