Give an option not to call set.seed() in functions #52
The majority of the ClusterR package functions rely on RcppArmadillo. I use a specific seed function in the cpp files, based on R's RNGs. There are numerous questions similar to yours in the RcppArmadillo GitHub repository, and the reason I included this R seed function in the compiled code was reproducibility. Does my answer cover your question?
Thank you for the quick reply.
Yes, I know. This code is not a special C++ function, but just calling R's
Yes, reproducibility is important. But my point is that the seeding should be done by users before calling functions that use RNGs. Seeding is not
Here is a schematic code to show why functions should not include
function_with_RNG_force_seed = function(seed = 1) {
set.seed(seed)
runif(3L)
}
function_with_RNG = function() {
runif(3L)
}
## Reproducible, of course, but no option not to call set.seed()
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
## Reproducible, and users can choose when to use set.seed()
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
## Reproducible in larger scale.
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG()
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG()
# [1] 0.9446753 0.6607978 0.6291140
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG()
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG()
# [1] 0.9446753 0.6607978 0.6291140
## Users cannot do this if set.seed() is forced in a function.
And this is my proposal to fix the problem in ClusterR while keeping backward compatibility:
function_with_RNG_compromised = function(seed = 1) {
if (!is.na(seed)) set.seed(seed)
runif(3L)
}
## Reproducible in larger scale.
set.seed(1)
function_with_RNG_compromised(seed = NA)
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_compromised(seed = NA)
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG_compromised(seed = NA)
# [1] 0.9446753 0.6607978 0.6291140
set.seed(1)
function_with_RNG_compromised(seed = NA)
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_compromised(seed = NA)
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG_compromised(seed = NA)
# [1] 0.9446753 0.6607978 0.6291140
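The backward-compatibility claim can be checked with a minimal sketch that reuses the schematic functions above (they are illustrations only, not ClusterR code). With the default `seed = 1` the guarded version should return what the forced-seed version returns, and with `seed = NA` it should simply continue the caller's RNG stream.

```r
# Schematic functions from this comment, repeated so the block is self-contained
function_with_RNG_force_seed = function(seed = 1) {
  set.seed(seed)
  runif(3L)
}
function_with_RNG_compromised = function(seed = 1) {
  if (!is.na(seed)) set.seed(seed)
  runif(3L)
}

# Default usage is unchanged: both versions reseed with 1
old_default = function_with_RNG_force_seed()
new_default = function_with_RNG_compromised()
stopifnot(identical(old_default, new_default))

# seed = NA leaves the stream alone: two calls consume six consecutive draws
set.seed(1)
a = function_with_RNG_compromised(seed = NA)
b = function_with_RNG_compromised(seed = NA)
set.seed(1)
expected = runif(6L)
stopifnot(identical(c(a, b), expected))
```

Existing callers that never pass `seed` would see the same results as before, and only callers that opt in with `seed = NA` get the unforced behaviour.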
I understand your point now. Your main concern is that I use the value 1 as the default seed in the ClusterR package functions. What I could do (based on what you mention in this issue) is update the documentation and inform the user that there is also the option to set the seed to NA (however, I have to test it first to see whether this causes any issues in Cpp).
I'll tell you what I think in general about arguments and default values. I am an open source developer (and I see you are too). That means I spent time implementing this R package, including detailed documentation and a vignette. Wouldn't the user of this R package be responsible for reading the documentation and setting the parameters (such as the 'seed' value) to values that work best for his/her case?
I am hesitant to change the default value because, as you can see, there are a few R packages that currently suggest, link or import ClusterR, and I don't want to cause any issues only due to a default value.
No. Any value is fine as a default value. You should keep What I am suggesting from the beginning is giving an option to avoid calling
No. You cannot directly give
The ClusterR functions are not used only in an R session; they can also be linked at the Rcpp level, as I describe in the README.md file. The seed function exists in the Rcpp code to allow reproducibility at both the R and the Rcpp level. I understand what you mean, but I don't want to change the way the functions that include the seed parameter work.
I know. But it does not justify declining my request. I am NOT suggesting the removal of
Again, the modification I suggest does not affect the current usage at all. Just adding the ability to skip
This should address mlampros#52, but currently it does not work because `NA_integer_` is implemented as the most negative integer value, and a negative seed causes the error "negative length vectors are not allowed". `base::set.seed()` accepts negative seeds without any problem, which indicates that ClusterR is doing something wrong with negative seeds.
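A small R-level sketch of the two facts mentioned here: `base::set.seed()` is reproducible with a negative seed, and an `NA` seed can be detected with `is.na()` before anything is handed to compiled code. The helper name `maybe_set_seed` is hypothetical and not part of ClusterR; how ClusterR would pass the value on to C++ is not shown.

```r
# base::set.seed() accepts a negative seed and stays reproducible
set.seed(-1L)
x1 = runif(3L)
set.seed(-1L)
x2 = runif(3L)
stopifnot(identical(x1, x2))

# Hypothetical helper (assumption, not ClusterR API): skip seeding when seed is NA
maybe_set_seed = function(seed) {
  if (is.na(seed)) return(invisible(FALSE))
  set.seed(seed)
  invisible(TRUE)
}

# NA_integer_ is still recognised as NA at the R level
stopifnot(is.na(NA_integer_))

# With an NA seed the stream is not reset between the two draws
set.seed(1)
first = runif(1L)
maybe_set_seed(NA_integer_)
second = runif(1L)
set.seed(1)
stopifnot(identical(c(first, second), runif(2L)))
```

Guarding at the R level like this avoids relying on how the compiled code interprets the integer that represents `NA`.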
`base::set.seed()` should not be called in package functions because it affects users' global environment in an uncontrollable way. Users, not package functions, should call it explicitly before using RNGs (or functions using RNGs) if necessary.
The following example shows how the current version of `ClusterR::KMeans_*()` inhibits natural consumption of RNG stream:
Here is the proposal: