-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Differences in results between dqrng::dqrunif() and dqrng::uniform_distribution dist(min, max); #77
Comments
You are seeing only a small difference? For me the difference is large and differs between runs. A typical example is
Reason for that is that you are fixing the seed with I have to think a bit more if there is a better pattern for this. Please let me know where the documentation could be improved. |
Oh yes, but indeed, I also used |
OK, so these are the differences you are talking about:
I am not sure what to make of that. Have you tried using |
Well, yes, this is the solution, indeed. Yet I don't understand the difference because in both cases the code in {dqrng} use runif_impl internally. Here is the results using the code above plus this one:
Note: I am looking for the fastest as possible pseudo-random generator for large vectors of 1e6-1e8 items. {dqrng} is fantastic,... I try to push it a little further with parallelization, but still with a non parallel equivalent version. |
Thanks. I have approached it from the other direction, removing as much as possible: // [[Rcpp::depends(dqrng, BH)]]
#include <Rcpp.h>
#include <dqrng.h>
#include <dqrng_distribution.h>
#include <xoshiro.h>
// [[Rcpp::export]]
Rcpp::NumericVector runif_cpp1(int n, double min = 0.0, double max = 1.0) {
Rcpp::NumericVector res(Rcpp::no_init(n));
dqrng::dqset_seed(Rcpp::IntegerVector{42});
res = dqrng::dqrunif(n, min, max);
return res;
}
// [[Rcpp::export]]
Rcpp::NumericVector runif_cpp2(int n, double min = 0.0, double max = 1.0) {
Rcpp::NumericVector res(Rcpp::no_init(n));
// The generator for the distribution
dqrng::uniform_distribution dist(min, max);
dqrng::xoshiro256plusplus rng(42);
for (int i = 0; i < n; ++i) {
res[i] = dist(rng);
}
return res;
}
/***R
dqrng::dqRNGkind("Xoshiro256++")
res1 <- runif_cpp1(22, 0, 1)
res2 <- runif_cpp2(22, 0, 1)
identical(res1, res2)
all.equal(res1, res2)
res1 - res2
*/ Using
So it is indeed only a difference between |
Right, we get a little bit closer... Note also that differences are larger for, say, |
I am speculating that changing the internals from |
OK, thanks. If I find something I'll let you know. |
No idea why I missed that yesterday, but sometimes it helps to let things rest a bit. Anyway, the explanation is quite straight forward. The changed internals of // [[Rcpp::depends(dqrng, BH)]]
#include <Rcpp.h>
#include <dqrng.h>
#include <dqrng_distribution.h>
#include <xoshiro.h>
// [[Rcpp::export]]
Rcpp::NumericVector rnorm_cpp1(int n) {
Rcpp::NumericVector res(Rcpp::no_init(n));
dqrng::dqset_seed(Rcpp::IntegerVector{42});
res = dqrng::dqrnorm(n);
return res;
}
// [[Rcpp::export]]
Rcpp::NumericVector rnorm_cpp2(int n) {
Rcpp::NumericVector res(Rcpp::no_init(n));
dqrng::normal_distribution dist;
dqrng::xoshiro256plusplus rng(42);
for (int i = 0; i < n; ++i) {
res[i] = dist(rng);
}
return res;
}
// [[Rcpp::export]]
Rcpp::NumericVector rnorm_cpp3(int n) {
Rcpp::NumericVector res(Rcpp::no_init(n));
dqrng::normal_distribution dist;
auto rng = dqrng::generator<dqrng::xoshiro256plusplus>(42);
for (int i = 0; i < n; ++i) {
res[i] = dist(*rng);
}
return res;
}
/***R
dqrng::dqRNGkind("Xoshiro256++")
res1 <- rnorm_cpp1(22)
res2 <- rnorm_cpp2(22)
res3 <- rnorm_cpp3(22)
identical(res1, res2)
all.equal(res1, res2)
identical(res1, res3)
all.equal(res1, res3)
*/ Result:
This works well for you since you have an explicit seed within the function available. In principle I would like to have it such that one can make use of the global RNG even for parallel processing, see the last bullet point in #64. Please let me know where you would have expected documentation for this behavior. |
After merging #79 we now have proper tools to write such code w/o the need to explicitly seed an RNG. Instead, the global RNG is used as basis for stream-specific cloned RNGs as well as for filling in the remaining data. It is also possible to make this a templated function, making it easy to derive many different explicit implementations: // [[Rcpp::depends(dqrng, BH, RcppParallel)]]
#include <dqrng.h>
#include <dqrng_distribution.h>
#include <RcppParallel.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
template<typename Dist, typename... Params>
Rcpp::NumericVector parallel_generate(std::size_t n, int threads, int streams, Params&&... params) {
int stream_size = n/streams;
Rcpp::NumericVector res(Rcpp::no_init(n));
RcppParallel::RVector<double> work(res);
// use global RNG from dqrng
dqrng::random_64bit_accessor rng{};
#ifdef _OPENMP
int maxthreads = omp_get_num_procs();
if (threads > maxthreads)
threads = maxthreads;
// No need for more threads than there are streams
if (threads > streams)
threads = streams;
#endif
#pragma omp parallel num_threads(threads)
{
#pragma omp for schedule(static,1)
for (int i = 0; i < streams; ++i) {
// Our private RNG in each stream
auto prng = rng.clone(i+1);
prng->generate<Dist>(std::begin(work) + i * stream_size,
std::begin(work) + (i + 1) * stream_size,
std::forward<Params>(params)...);
}
}
// Fill remaining items from the main stream
int remainder = n % streams;
if (remainder > 0) {
rng.generate<Dist>(std::begin(work) + streams * stream_size,
std::end(work),
std::forward<Params>(params)...);
}
return res;
}
// [[Rcpp::export]]
Rcpp::NumericVector runif_para(int n, double min = 0, double max = 1,
int threads = 2, int streams = 8) {
return parallel_generate<dqrng::uniform_distribution>(n, threads, streams, min, max);
}
// [[Rcpp::export]]
Rcpp::NumericVector rnorm_para(int n, double mean = 0, double sd = 1,
int threads = 2, int streams = 8) {
return parallel_generate<dqrng::normal_distribution>(n, threads, streams, mean, sd);
}
// [[Rcpp::export]]
Rcpp::NumericVector rexp_para(int n, double rate = 1,
int threads = 2, int streams = 8) {
return parallel_generate<dqrng::exponential_distribution>(n, threads, streams, rate);
}
/***R
n <- 1e6
bench::mark(runif(n),
dqrng::dqrunif(n),
runif_para(n, threads = 8L),
runif_para(n, threads = 4L),
runif_para(n, threads = 2L),
runif_para(n, threads = 1L),
check = FALSE)[, 1:6]
bench::mark(rnorm(n),
dqrng::dqrnorm(n),
rnorm_para(n, threads = 8L),
rnorm_para(n, threads = 4L),
rnorm_para(n, threads = 2L),
rnorm_para(n, threads = 1L),
check = FALSE)[, 1:6]
bench::mark(rexp(n),
dqrng::dqrexp(n),
rexp_para(n, threads = 8L),
rexp_para(n, threads = 4L),
rexp_para(n, threads = 2L),
rexp_para(n, threads = 1L),
check = FALSE)[, 1:6]
dqset.seed(42); norm1 <- rnorm_para(22, threads = 1)
dqset.seed(42); norm2 <- rnorm_para(22, threads = 4)
identical(norm1, norm2)
*/ Here I am contemplating to add this as an example to the Parallel RNG usage vignette or as an explicit header that can easily be included to generate functions like this. What do you think? In what form would you liked to be attributed? |
This is very nice! Just it would be better not to depend on both RcppParallel and OpenMP simultaneously. It is possible to get rid of RcppParallel by using I don't see either how to easily replace the Anyway, the remainder stuff in my initial code is not very nice. Here is a better solution that redistributes the remainder into the first few iterations so that we always use the same number of streams. Hence, generating n = 7 with streams = 3 would generate 7 values using streams 1,1,1,2,2,3,3 while the initial solution generated values using streams 1,1,2,2,3,3,4.
Please, add it to the vignette or as a header. I am fine with it. Just cite me as a contributor. Do you mind if I implement something on this line in the {svFast} package here: https://github.com/SciViews/svFast ? I'll wait for your changes before doing anything on my side. |
There is still a problem: since we do not consume the default rng, two successive calls to, say
Not a perfect solution and it supposes that the default RNG is seeded on stream 0, but still better than previous one. Is it another way than calling |
This is an interesting problem. Thanks for bringing that up! I am not a fan of re-seeding the RNG regularly. So I would prefer, if one of the "local RNGs" would actually be the global one. Maybe I should special case |
Yes, I was thinking in that direction, but I failed to find a way to do so properly. |
Handing out a unique pointer to the global RNG won't work, since this might get the global RNG deleted. Instead, I have come up with the following: #include <sstream>
// [[Rcpp::depends(dqrng, BH, RcppParallel)]]
#include <dqrng.h>
#include <dqrng_distribution.h>
#include <RcppParallel/RVector.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
template<typename Dist, typename... Params>
Rcpp::NumericVector parallel_generate(int n, int threads, int streams, Params&&... params) {
if (n < streams)
streams = n;
int stream_size = n/streams;
int remainder = n % streams;
Rcpp::NumericVector res(Rcpp::no_init(n));
RcppParallel::RVector<double> work(res);
// use global RNG from dqrng
dqrng::random_64bit_accessor rng{};
std::stringstream buffer;
#ifdef _OPENMP
int maxthreads = omp_get_num_procs();
if (threads > maxthreads)
threads = maxthreads;
// No need for more threads than there are streams
if (threads > streams)
threads = streams;
#endif
#pragma omp parallel num_threads(threads)
{
int start,end;
#pragma omp for schedule(static,1)
for (int i = 0; i < streams; ++i) {
if (i < remainder) {
start = i * stream_size + i;
end = start + stream_size + 1;
} else {
start = i * stream_size + remainder;
end = start + stream_size;
}
// Our private RNG in each stream; RNG with i == 0 is identical to global RNG
auto prng = rng.clone(i);
prng->generate<Dist>(std::begin(work) + start, std::begin(work) + end,
std::forward<Params>(params)...);
if (i == 0) {// Save the state of the global RNG's clone
buffer << *prng;
}
}
}
// Make sure that the global RNG advances aas well by applying the state of the global RNG's clone to the global RNG
buffer >> rng;
return res;
}
// [[Rcpp::export]]
Rcpp::NumericVector runif_para(int n, double min = 0, double max = 1,
int threads = 2, int streams = 8) {
return parallel_generate<dqrng::uniform_distribution>(n, threads, streams, min, max);
}
// [[Rcpp::export]]
Rcpp::NumericVector rnorm_para(int n, double mean = 0, double sd = 1,
int threads = 2, int streams = 8) {
return parallel_generate<dqrng::normal_distribution>(n, threads, streams, mean, sd);
}
// [[Rcpp::export]]
Rcpp::NumericVector rexp_para(int n, double rate = 1,
int threads = 2, int streams = 8) {
return parallel_generate<dqrng::exponential_distribution>(n, threads, streams, rate);
}
/***R
n <- 1e6
bench::mark(stats::runif(n),
dqrng::dqrunif(n),
runif_para(n, threads = 8L),
runif_para(n, threads = 4L),
runif_para(n, threads = 2L),
runif_para(n, threads = 1L),
check = FALSE)[, 1:6]
bench::mark(stats::rnorm(n),
dqrng::dqrnorm(n),
rnorm_para(n, threads = 8L),
rnorm_para(n, threads = 4L),
rnorm_para(n, threads = 2L),
rnorm_para(n, threads = 1L),
check = FALSE)[, 1:6]
bench::mark(stats::rexp(n),
dqrng::dqrexp(n),
rexp_para(n, threads = 8L),
rexp_para(n, threads = 4L),
rexp_para(n, threads = 2L),
rexp_para(n, threads = 1L),
check = FALSE)[, 1:6]
dqset.seed(42); norm1 <- rnorm_para(22, threads = 1)
dqset.seed(42); norm2 <- rnorm_para(22, threads = 4)
identical(norm1, norm2)
dqset.seed(42); norm1 <- rnorm_para(22); norm2 <- rnorm_para(22)
!identical(norm1, norm2)
*/ Here I use a buffer to save the state of the BTW, I have also change What do you think? |
Yess! This is exactly what I wanted to do. Many, many thanks! I would very much appreciate if PCG64 would also work here because reading the literature, it is an excellent RNG on par or better than Xoshiro256++, even if a little bit slower. This code is already excellent, certainly for a vignette, but I see room for improvement in the following directions for a final implementation:
|
Please have a look at #82. Does the attribution fit your expectations? BTW, feel free to use this in svFast. Concerning the additions to |
I understand. Let's keep it simple. We both certainly want to finish and close this issue and the result is very nice, I think. One last point, though (really sorry). Should we keep |
Thanks for mentioning the |
Any final comments before I merge this? |
This is perfect! Just in the parallel.Rmd vignette, I would add after the last sentence "This way, the global RNG's state advances as if it had processed one of the streams." This way, the global RNG's state advances as if it had processed one of the streams and successive calls to
Pitfall: should you use
Note how values 1-6 of v2 are identical to values 3-8 of v1 because of an overlap of the streams consumed in each call to
Of course, it is also advised to use a different seed (first argument to |
Thanks! I have added it to the vignette with the exception of the last sentence. I am not a fan of seeding different processes with different seeds and hoping for the best that there is no overlap when starting from the resulting positions in the RNG's state space. As least as long as we do have control of it with appropriate stream switches. |
Right! |
I am building a parallel version of
dq<dist>()
that uses 10-11 streams. I want to get the same results as a non-parallel implementation, of course also using 10 or 11 streams. In the parallel version I usedqrng::uniform_distribution dist(min, max);
and in the non parallel one,dqrng::dqrunif()
which is faster, but if I am correct, is not compatible with OpenMP. I don't understand why results slightly differ (and largely differ in the case of normal or exponential distributions)? Here is a reproducible example:The text was updated successfully, but these errors were encountered: