Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give an option not to call set.seed() in functions #52

Closed
heavywatal opened this issue Jun 7, 2024 · 6 comments
Closed

Give an option not to call set.seed() in functions #52

heavywatal opened this issue Jun 7, 2024 · 6 comments
Labels

Comments

@heavywatal
Copy link

base::set.seed() should not be called in package functions because it affects users' global environment in an uncontrollable way. Users, not package functions, should call it explicitly before using RNGs (or functions using RNGs) if necessary.

The following example shows how the current version of ClusterR::KMeans_*() inhibits natural consumption of RNG stream:

SEED = 19937
set.seed(SEED)
res1 = stochastic_simulation_step1()  # different res1 with different SEED
res2 = stochastic_simulation_step2()  # different res2 with different SEED
km = ClusterR::KMeans_arma(res2, n)   # set.seed(1) implicitly! surprising!
res3 = stochastic_simulation_step3()  # always same res3 with different SEED
res4 = stochastic_simulation_step4()  # always same res4 with different SEED

Here is the proposal:

# NG
ClusterR::KMeans_arma(data, n)  # implicitly set.seed(1)! surprising!
ClusterR::KMeans_arma(data, n, seed = 42)  # not surprising now, but fixed-seed problem remains.

# Workaround to avoid fixed seed (with cost of unnecessary RNG consumption and seed reset)
ClusterR::KMeans_arma(data, n, seed = sample.int(.Machine$integer.max, 1L))

# Ideal usage, but may be a breaking change for some ClusterR users
set.seed(42)                    # explicitly by users
ClusterR::KMeans_arma(data, n)  # no seed option, no set.seed() call, but reproducible.

# Compromised solution with backward compatibility
ClusterR::KMeans_arma(data, n, seed = NA)  # option to avoid calling set.seed()
@github-actions github-actions bot added the triage label Jun 7, 2024
@mlampros
Copy link
Owner

mlampros commented Jun 8, 2024

@heavywatal

the majority of the ClusterR package functions rely on RcppArmadillo. I use a specific seed function in the cpp files based on R RNGs

There are numerous questions similar to yours in the RcppArmadillo Github repository and the reason I included this R seed function in the compiled code was for reproducibility purposes

Does my answer cover your question?

@heavywatal
Copy link
Author

Thank you for the quick reply.

I use a specific seed function in the cpp files based on R RNGs

Yes, I know. This code is not a special C++ function, but just calling R's base::set.seed() from C++ via Rcpp.

for reproducibility purposes

Yes, reproducibility is important. But my point is that the seeding should be done by users before calling functions that use RNGs. Seeding is not ClusterR::KMeans_* functions' job.

Here is a schematic code to show why functions should not include set.seed():

function_with_RNG_force_seed = function(seed = 1) {
  set.seed(seed)
  runif(3L)
}

function_with_RNG = function() {
  runif(3L)
}

## Reproducible, of course, but no option not to call set.seed()
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534

## Reproducible, and users can choose when to use set.seed()
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534

## Reproducible in larger scale.
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG()
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG()
# [1] 0.9446753 0.6607978 0.6291140
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG()
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG()
# [1] 0.9446753 0.6607978 0.6291140
## Users cannot do this if set.seed() is forced in a function.

And this is my proposal to fix the problem in ClusterR while keeping backward compatibility:

function_with_RNG_compromised = function(seed = 1) {
  if (!is.na(seed)) set.seed(seed)
  runif(3L)
}

## Reproducible in larger scale.
set.seed(1)
function_with_RNG_compromised(seed = NA)
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_compromised(seed = NA)
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG_compromised(seed = NA)
# [1] 0.9446753 0.6607978 0.6291140
set.seed(1)
function_with_RNG_compromised(seed = NA)
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_compromised(seed = NA)
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG_compromised(seed = NA)
# [1] 0.9446753 0.6607978 0.6291140

@mlampros
Copy link
Owner

mlampros commented Jun 9, 2024

@heavywatal

I understand your point now. Your main concern is that I use as a default seed the value of 1 in the ClusterR package functions. What I could do (based on what you mention in this issue) is to update the documentation and inform the user that there is also the option to set the seed to NA (however I have to test it first to see if this won't cause any issues in Cpp )

I'll tell you what I think in general about arguments and default values. I am an open source developer (and I see you are too). That means I spent time to implement this R package, to include detailed documentation and a vignette.

Wouldn't be the user of this R package responsible to read the documentation and set the parameters (such as the 'seed' value) to values so that it works best for his/her case?

I am hesitant to change the default value because as you can see there are a few R packages that currently suggest, link or import the ClusterR and I don't won't to cause any issues only due to a default value.

@heavywatal
Copy link
Author

I understand your point now. Your main concern is that I use as a default seed the value of 1 in the ClusterR package

No. Any value is fine as a default value. You should keep 1 for compatibility.

What I am suggesting from the beginning is giving an option to avoid calling set.seed(). Checking NA is just an example to implement that functionality. It can be NULL, integer(0), or any FALSE-like value you choose.

update the documentation and inform the user that there is also the option to set the seed to NA

No. You cannot directly give NA to base::set.seed(). The code has to be changed before documentation. I just want you to add if (!is.na(seed)), if (!is.null(seed)), or if (!isFALSE(seed)) before set.seed(seed) in ClusterR functions. Then, users can give NA, NULL, or FALSE to CluserR functions to avoid calling set.seed(). Current users are not affected by this change. Users can continue with the default 1, or their own seeds. But the functions just acquire a new ability to run without calling set.seed().

@mlampros
Copy link
Owner

@heavywatal

the ClusterR functions are not used only in an R session but can be also linked at the Rcpp level as I describe in the README.md file. The fact that the seed function exists in Rcpp code is to allow reproduciblity both in R and Rcpp level

I understand what you mean but I don't want to change the way the functions which include the seed parameter work.

@heavywatal
Copy link
Author

the ClusterR functions are not used only in an R session but can be also linked at the Rcpp level as I describe in the README.md file. The fact that the seed function exists in Rcpp code is to allow reproduciblity both in R and Rcpp level

I know. But it does not justify declining my request. I am NOT suggesting the removal of set.seed(). I am requesting an optional way to avoid calling it every time in ClusterR functions to achieve reproducibility in a larger scale.

but I don't want to change the way the functions which include the seed parameter work.

Again, the modification I suggest does not affect the current usage at all. Just adding the ability to skip set.seed() when seed=NA is given.

heavywatal added a commit to heavywatal/ClusterR that referenced this issue Jun 10, 2024
This should address mlampros#52 but currently it does not work because
`NA_integer_` is implemented as the maximum negative value, and
a negative seed causes an error "negative length vectors are not allowed".
`base::set.seed()` accepts negative seeds without any problem,
which indicates ClusterR is doing something wrong with negative seeds.
heavywatal added a commit to heavywatal/ClusterR that referenced this issue Jun 19, 2024
This should address mlampros#52 but currently it does not work because
`NA_integer_` is implemented as the maximum negative value, and
a negative seed causes an error "negative length vectors are not allowed".
`base::set.seed()` accepts negative seeds without any problem,
which indicates ClusterR is doing something wrong with negative seeds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants