Give an option not to call `set.seed()` in functions #52

heavywatal · 2024-06-07T19:43:20Z

base::set.seed() should not be called in package functions because it affects users' global environment in an uncontrollable way. Users, not package functions, should call it explicitly before using RNGs (or functions using RNGs) if necessary.

The following example shows how the current version of ClusterR::KMeans_*() inhibits natural consumption of RNG stream:

SEED = 19937
set.seed(SEED)
res1 = stochastic_simulation_step1()  # different res1 with different SEED
res2 = stochastic_simulation_step2()  # different res2 with different SEED
km = ClusterR::KMeans_arma(res2, n)   # set.seed(1) implicitly! surprising!
res3 = stochastic_simulation_step3()  # always same res3 with different SEED
res4 = stochastic_simulation_step4()  # always same res4 with different SEED

Here is the proposal:

# NG
ClusterR::KMeans_arma(data, n)  # implicitly set.seed(1)! surprising!
ClusterR::KMeans_arma(data, n, seed = 42)  # not surprising now, but fixed-seed problem remains.

# Workaround to avoid fixed seed (with cost of unnecessary RNG consumption and seed reset)
ClusterR::KMeans_arma(data, n, seed = sample.int(.Machine$integer.max, 1L))

# Ideal usage, but may be a breaking change for some ClusterR users
set.seed(42)                    # explicitly by users
ClusterR::KMeans_arma(data, n)  # no seed option, no set.seed() call, but reproducible.

# Compromised solution with backward compatibility
ClusterR::KMeans_arma(data, n, seed = NA)  # option to avoid calling set.seed()

The text was updated successfully, but these errors were encountered:

mlampros · 2024-06-08T11:42:14Z

@heavywatal

the majority of the ClusterR package functions rely on RcppArmadillo. I use a specific seed function in the cpp files based on R RNGs

There are numerous questions similar to yours in the RcppArmadillo Github repository and the reason I included this R seed function in the compiled code was for reproducibility purposes

Does my answer cover your question?

heavywatal · 2024-06-08T23:23:39Z

Thank you for the quick reply.

I use a specific seed function in the cpp files based on R RNGs

Yes, I know. This code is not a special C++ function, but just calling R's base::set.seed() from C++ via Rcpp.

for reproducibility purposes

Yes, reproducibility is important. But my point is that the seeding should be done by users before calling functions that use RNGs. Seeding is not ClusterR::KMeans_* functions' job.

Here is a schematic code to show why functions should not include set.seed():

function_with_RNG_force_seed = function(seed = 1) {
  set.seed(seed)
  runif(3L)
}

function_with_RNG = function() {
  runif(3L)
}

## Reproducible, of course, but no option not to call set.seed()
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_force_seed()
# [1] 0.2655087 0.3721239 0.5728534

## Reproducible, and users can choose when to use set.seed()
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534

## Reproducible in larger scale.
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG()
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG()
# [1] 0.9446753 0.6607978 0.6291140
set.seed(1)
function_with_RNG()
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG()
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG()
# [1] 0.9446753 0.6607978 0.6291140
## Users cannot do this if set.seed() is forced in a function.

And this is my proposal to fix the problem in ClusterR while keeping backward compatibility:

function_with_RNG_compromised = function(seed = 1) {
  if (!is.na(seed)) set.seed(seed)
  runif(3L)
}

## Reproducible in larger scale.
set.seed(1)
function_with_RNG_compromised(seed = NA)
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_compromised(seed = NA)
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG_compromised(seed = NA)
# [1] 0.9446753 0.6607978 0.6291140
set.seed(1)
function_with_RNG_compromised(seed = NA)
# [1] 0.2655087 0.3721239 0.5728534
function_with_RNG_compromised(seed = NA)
# [1] 0.9082078 0.2016819 0.8983897
function_with_RNG_compromised(seed = NA)
# [1] 0.9446753 0.6607978 0.6291140

mlampros · 2024-06-09T10:27:58Z

@heavywatal

I understand your point now. Your main concern is that I use as a default seed the value of 1 in the ClusterR package functions. What I could do (based on what you mention in this issue) is to update the documentation and inform the user that there is also the option to set the seed to NA (however I have to test it first to see if this won't cause any issues in Cpp )

I'll tell you what I think in general about arguments and default values. I am an open source developer (and I see you are too). That means I spent time to implement this R package, to include detailed documentation and a vignette.

Wouldn't be the user of this R package responsible to read the documentation and set the parameters (such as the 'seed' value) to values so that it works best for his/her case?

I am hesitant to change the default value because as you can see there are a few R packages that currently suggest, link or import the ClusterR and I don't won't to cause any issues only due to a default value.

heavywatal · 2024-06-10T03:25:09Z

I understand your point now. Your main concern is that I use as a default seed the value of 1 in the ClusterR package

No. Any value is fine as a default value. You should keep 1 for compatibility.

What I am suggesting from the beginning is giving an option to avoid calling set.seed(). Checking NA is just an example to implement that functionality. It can be NULL, integer(0), or any FALSE-like value you choose.

update the documentation and inform the user that there is also the option to set the seed to NA

No. You cannot directly give NA to base::set.seed(). The code has to be changed before documentation. I just want you to add if (!is.na(seed)), if (!is.null(seed)), or if (!isFALSE(seed)) before set.seed(seed) in ClusterR functions. Then, users can give NA, NULL, or FALSE to CluserR functions to avoid calling set.seed(). Current users are not affected by this change. Users can continue with the default 1, or their own seeds. But the functions just acquire a new ability to run without calling set.seed().

mlampros · 2024-06-10T06:13:08Z

@heavywatal

the ClusterR functions are not used only in an R session but can be also linked at the Rcpp level as I describe in the README.md file. The fact that the seed function exists in Rcpp code is to allow reproduciblity both in R and Rcpp level

I understand what you mean but I don't want to change the way the functions which include the seed parameter work.

heavywatal · 2024-06-10T07:26:47Z

the ClusterR functions are not used only in an R session but can be also linked at the Rcpp level as I describe in the README.md file. The fact that the seed function exists in Rcpp code is to allow reproduciblity both in R and Rcpp level

I know. But it does not justify declining my request. I am NOT suggesting the removal of set.seed(). I am requesting an optional way to avoid calling it every time in ClusterR functions to achieve reproducibility in a larger scale.

but I don't want to change the way the functions which include the seed parameter work.

Again, the modification I suggest does not affect the current usage at all. Just adding the ability to skip set.seed() when seed=NA is given.

This should address mlampros#52 but currently it does not work because `NA_integer_` is implemented as the maximum negative value, and a negative seed causes an error "negative length vectors are not allowed". `base::set.seed()` accepts negative seeds without any problem, which indicates ClusterR is doing something wrong with negative seeds.

github-actions bot added the triage label Jun 7, 2024

mlampros closed this as completed Jun 10, 2024

heavywatal mentioned this issue Jun 10, 2024

Skip set.seed() if NA_integer_ is given. #53

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Give an option not to call `set.seed()` in functions #52

Give an option not to call `set.seed()` in functions #52

heavywatal commented Jun 7, 2024

mlampros commented Jun 8, 2024 •

edited

Loading

heavywatal commented Jun 8, 2024

mlampros commented Jun 9, 2024

heavywatal commented Jun 10, 2024

mlampros commented Jun 10, 2024

heavywatal commented Jun 10, 2024

Give an option not to call set.seed() in functions #52

Give an option not to call set.seed() in functions #52

Comments

heavywatal commented Jun 7, 2024

mlampros commented Jun 8, 2024 • edited Loading

heavywatal commented Jun 8, 2024

mlampros commented Jun 9, 2024

heavywatal commented Jun 10, 2024

mlampros commented Jun 10, 2024

heavywatal commented Jun 10, 2024

Give an option not to call `set.seed()` in functions #52

Give an option not to call `set.seed()` in functions #52

mlampros commented Jun 8, 2024 •

edited

Loading