Add SPEC7: Seeding pseudo-random number generation #180

Merged
merged 57 commits into from
Aug 29, 2024

Conversation

stefanv
Member

@stefanv stefanv commented Apr 19, 2023

Under discussion at scipy/scipy#14322

Member

@tupui tupui left a comment

That's a good idea to try to uniformize all that 👍

(I suppose you meant SPEC and not NEP.)

@stefanv stefanv changed the title Add draft NEP7: seeding pseudo-random number generation Add draft SPEC7: seeding pseudo-random number generation Apr 19, 2023
@stefanv stefanv closed this Apr 19, 2023
@stefanv stefanv deleted the nep7-prng branch April 19, 2023 18:36
@stefanv stefanv restored the nep7-prng branch April 19, 2023 18:37
@stefanv stefanv reopened this Apr 19, 2023
@rkern

rkern commented Apr 19, 2023

The big problem with random_state is that it allows for None, which then grabs global state. So, that will always conflict with an rng=None kwarg.

There's a deprecation strategy that can work to migrate from random_state to rng, if one wants that. Functions will (for a time) take both random_state and rng arguments. Provide a check_rng(rng=None, random_state=None) function that returns a Generator given its arguments, for function authors to use. You can change the behavior over time, with DeprecationWarnings. In the first release, if rng=None, it looks at random_state and starts issuing DeprecationWarnings, but otherwise uses the same semantics as check_random_state(random_state) (taking the BitGenerator out of the resulting RandomState and wrapping it in a Generator instead). Then, when you enforce the deprecations, you can migrate to just returning default_rng(rng) and start raising informative errors if something other than None is passed to random_state. Eventually you can drop the random_state= argument entirely.
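
A minimal sketch of what the first-release behaviour of such a `check_rng` helper might look like. The body is an assumption reconstructed from the description above, not code taken from scipy or scikit-learn. It relies on the private NumPy attributes `np.random.mtrand._rand` and `RandomState._bit_generator`, and the warning text is made up for illustration.

```python
import warnings

import numpy as np


def check_rng(rng=None, random_state=None):
    """Return a Generator (hypothetical first release of the transition)."""
    if rng is not None:
        # New-style argument: defer entirely to NumPy's own normalization.
        return np.random.default_rng(rng)

    warnings.warn(
        "`random_state` (and reliance on the global seed) is deprecated; "
        "pass `rng` instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Legacy semantics, mirroring check_random_state(random_state).
    if random_state is None:
        legacy = np.random.mtrand._rand  # the global RandomState singleton
    elif isinstance(random_state, (int, np.integer)):
        legacy = np.random.RandomState(random_state)
    elif isinstance(random_state, np.random.RandomState):
        legacy = random_state
    else:
        raise ValueError(f"{random_state!r} cannot be used to seed a RandomState")
    # Share the legacy BitGenerator so the old state keeps driving the stream.
    return np.random.Generator(legacy._bit_generator)
```

Because the returned `Generator` shares the global bit generator when nothing is passed, `np.random.seed(...)` still controls the stream during the transition; only after the deprecation is enforced would the helper reduce to `np.random.default_rng(rng)`.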

@stefanv
Member Author

stefanv commented Apr 20, 2023

There's a deprecation strategy that can work to migrate from random_state to rng

To make sure I understand, this will change the return values of functions (rng=None, random_seed=None) over time (i.e., violate the Hinsen principle), but we can choose how long that time period is.

@rkern

rkern commented Apr 20, 2023

Yes, it's a deprecation strategy, not a backwards-compatibility-preserving strategy.

@rgommers
Contributor

@stefanv thanks for starting to summarize that long and complex discussion!

Specifically, behavior will change over a long period of time in the case where no seed is specified

Can you please elaborate on this? It's not all that obvious, because when you're not seeding the first intuition I'd have is "I am not expecting specific results, only random numbers with a given distribution". Since you're kinda steering towards a large amount of churn due to changing names here, I think it's important to be specific under what circumstances there is a backwards compatibility impact.

I guess the point here is:

  • the user may be seeding elsewhere with np.random.seed(a_number),
  • and not threading the state for that seed through to this API call,
  • and executing without multiprocessing or a similar parallel mechanism,
  • hence was relying implicitly on the global state controlled by seeding,
  • this global state is now no longer determining the exact random numbers to the unseeded API call,
  • hence the numerical result returned by that API call changes,

And then there's the question whether this scenario matters. It may impact exact reproducibility of some scientific result. However, that reproducibility was only ever guaranteed when using the same version of the same libraries on the same hardware.

I'd suggest finding the most compelling scenario here, that makes it as easy as possible to say that that's not acceptable, and hence we must change from random_state to rng.
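
The scenario in the bullet points can be reproduced in a few lines. This is only an illustration: `legacy_draw` and `generator_draw` are hypothetical stand-ins for an unseeded library call before and after the change.

```python
import numpy as np


def legacy_draw():
    # Legacy convenience functions read the hidden global RandomState.
    return np.random.random_sample()


def generator_draw():
    # default_rng(None) pulls fresh entropy from the OS on every call.
    return np.random.default_rng(None).random()


np.random.seed(1234)
legacy_first = legacy_draw()
np.random.seed(1234)
legacy_second = legacy_draw()

np.random.seed(1234)
new_first = generator_draw()
np.random.seed(1234)
new_second = generator_draw()

# The global seed pins the legacy path but has no effect on the new one.
print(legacy_first == legacy_second)
print(new_first == new_second)
```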

@rkern

rkern commented Apr 25, 2023

The deprecation strategy I outlined does imply a change in semantics of the affected functions above and beyond the change in the precise numbers that come out of them. There are plenty of programs (using scipy and sklearn components that use check_random_state()) that rely on controlling the output deterministically (for the same program, same builds, same environment, not across versions or builds) by calling np.random.seed(seed) once at the top rather than explicitly threading through a RandomState. Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.
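
For concreteness, the "(small) rewriting" could look roughly like the sketch below. `fit_model` is a hypothetical stand-in for a scipy or sklearn routine that accepts `rng`; the point is only that the script creates one `Generator` and passes it along instead of calling `np.random.seed(seed)` at the top.

```python
import numpy as np


def fit_model(n_samples, *, rng=None):
    """Hypothetical library routine that draws random numbers internally."""
    rng = np.random.default_rng(rng)
    return rng.normal(size=n_samples).mean()


def run_experiment(seed):
    # One Generator for the whole script, threaded through explicitly.
    rng = np.random.default_rng(seed)
    first = fit_model(100, rng=rng)
    second = fit_model(100, rng=rng)
    return first, second
```

Running `run_experiment` twice with the same seed reproduces both values, while the two calls inside one run still see different parts of the stream.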

@stefanv
Member Author

stefanv commented Apr 25, 2023

Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

Right, and my take was that this is a desired outcome from our perspective.

@stefanv
Member Author

stefanv commented Apr 25, 2023

And then there's the question whether this scenario matters.

By far the most common use of seeding is to fix test suites. Most of those will keep running as-is. The failures that arise will be legitimate failures, and could be fixed by playing with the seed, or by making the underlying code more robust.

@ilayn

ilayn commented Apr 25, 2023

I am willing to spend time on the rewriting. The test suite is seriously out-of-date in many places anyway. You can even smell the year just by reading the comments.

@stefanv
Member Author

stefanv commented Apr 25, 2023

I've made more explicit the points you mentioned, Ralf. It may benefit from fleshing out even further as we continue to evolve the document. I don't want to tighten things up before we've agreed on a pathway forward!

@rkern

rkern commented Apr 25, 2023

Right, and my take was that this is a desired outcome from our perspective.

Yes, I think so. But I interpreted Ralf's question as whether it was really necessary to go through a deprecation and a name churn to do this instead of just changing what random_state=None does since stream reproducibility across versions is not something that most of the downstream libraries using the check_random_state() pattern guarantee. I was confirming that there was something beyond the stream reproducibility that is a concern.

@rgommers
Contributor

rely on controlling the output deterministically (for the same program, same builds, same environment, not across versions or builds) by calling np.random.seed(seed) once at the top rather than explicitly threading through a RandomState. Following that deprecation, those programs won't be able to be made deterministic anymore without some (small) rewriting.

Yes, I think that's saying exactly the same thing I was saying in my bullet points higher up. I would add that library code doing this is already broken, because it's not robust to (for example) the end user using multiprocessing. So I think what you're worried about is end user code doing this. And missing the change in semantics, then re-run experiments after upgrading to the next scipy version and saving results that they think are deterministic but silently aren't. Right?

@stefanv
Member Author

stefanv commented Apr 27, 2023

Why don't we emit a warning from numpy random.seed?

@rgommers
Contributor

Because it will create utter havoc in the many valid uses in test suites?

@stefanv
Member Author

stefanv commented Apr 27, 2023

Isn't that what you want, eventually?

@stefanv
Member Author

stefanv commented Apr 27, 2023

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

@rkern

rkern commented Apr 28, 2023

So I think what you're worried about is end user code doing this. And missing the change in semantics, then re-run experiments after upgrading to the next scipy version and saving results that they think are deterministic but silently aren't.

Yes. There are plenty of ML programs (in particular) that call np.random.seed(seed) (and random.seed(seed) and torch.manual_seed(seed), etc.) at the top because that's what they've been told to do (and encouraged to do by well-meaning frameworks) and call sklearn and scipy functions, and those scripts are deterministic (and mostly fine). That will silently change if we just change check_random_state(None) to do default_rng(None) without a deprecation switcheroo. I care less about the results they get from any one run (they'll be perfectly valid) so much as they will now show unexpected and hard-to-debug behavior as everyone chases down a bunch of red herrings.

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

One enormous hurdle at a time, please. 😉

@stefanv
Member Author

stefanv commented Apr 28, 2023

One enormous hurdle at a time, please. 😉

Fair enough :)

@rgommers
Contributor

Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?

There is no plan to do so. Deprecating random_state=None in scipy and (hopefully after that) scikit-learn is a very different thing from deprecating the legacy numpy.random functionality. There is no such plan, it's legacy only and NEP 19 explicitly lays out that it will stay, be ultra-stable, and useful for testing.

@rgommers
Contributor

and those scripts are deterministic (and mostly fine)

Yes, okay - I agree, this summary and rationale is enough to explain why we cannot stay with random_state, I think the average reader will understand this well enough.

That also means that item (a) of my reasoning in scipy/scipy#14322 (comment) is not "in the same ballpark" and hence it seems clear now that we should prefer rng over random_state.

@tupui
Member

tupui commented Apr 28, 2023

There is no such plan, it's legacy only and NEP 19 explicitly lays out that it will stay, be ultra-stable, and useful for testing.

I think that would not prevent us from adding a user warning. Average users don't read docs and keep copy-pasting old code until "something" gets in their way. So until it's visible in their code that something is legacy, they will keep using it, I am afraid.

Also, reading NEP 19, to me it's really not clear that the global state would not change. The fate of RandomState is clear, but the rest of the global section even ends with:

This NEP does not propose that these requirements remain in perpetuity. After we have experience with the new PRNG subsystem, we can and should revisit these issues in future NEPs.

@stefanv
Member Author

stefanv commented May 2, 2023

There seems to be some vague consensus around the deprecation approach. I don't want to run things ahead, but at the same time scikit-image has to make a calculated guess of what to do for its forthcoming release. So, without holding anyone to the fire, I will propose that we make the seed (random_state) to rng transition there, and hope that it doesn't cause too much work in the future should the decision here go differently.

I would appreciate it if those involved in the discussion would co-author this SPEC (whether by adding your name to the authors list, or by helping to clarify language). If you want to keep a safe distance, advice on how to solidify the thrust of the argument further would also be welcome.

Thanks!

stefanv added a commit to stefanv/scikit-image that referenced this pull request May 9, 2023
Use `rng` consistently, replacing `random_state` and `seed`.

See also scientific-python/specs#180
@mdhaber
Contributor

mdhaber commented Aug 3, 2024

@seberg do these changes along with stefanv#3 make sense?

Contributor

@seberg seberg left a comment

Looks nice, left some comments, but nothing serious (cool to have tests and have it much more verbose)!

The one thing that I do feel is that it would be good to either make sure rng is fully normalized (i.e. a RandomState instance also in the old paths), or at least mention briefly that normalization is still required.

Fully normalizing may be a bit tedious in old positional use, unless you keep their signature as:

def library_function(..., random_state=None, *, rng=None):
    # Always just ignore random_state

(Although I guess it isn't hard, you just have to build a new args tuple.)

@mdhaber
Contributor

mdhaber commented Aug 4, 2024

The one thing that I do feel is that it would be good to either make sure rng is fully normalized (i.e. a RandomState instance also in the old paths),

I removed the normalization code that was run with the decorator because we don't know what the decorated function was doing with values passed to old arguments like random_state or seed. We shouldn't assume because this could cause immediate backwards incompatibilities. Moreover, the decorator doesn't need to do anything except to arguments passed with the new argument rng (because that is the only thing this SPEC makes recommendations about).

or at least mention briefly that normalization is still required.

We can try to mention this explicitly. Again, the decorator just emits warnings and processes input passed to rng by keyword. It does not touch the argument if it is passed by position or by the random_state keyword. So I wonder why libraries would be tempted to change whatever normalization code they already have (until the end of the deprecation period, at which point they can simply replace it with rng = np.random.default_rng(rng))? Is it because the name of the argument changed from random_state/seed to rng? Inside the function, the variable can immediately be renamed to random_state to work with whatever normalization code is already there; it's just the interface that we are changing.
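
A simplified sketch of that division of labour, for illustration only. `transition_to_rng`, `_legacy_check_random_state` and `sample` are hypothetical names, and this is not the decorator shipped in `spec-0007/transition_to_rng.py`; positional handling and the global-seed warning are left out.

```python
import functools
import warnings

import numpy as np


def transition_to_rng(old_name):
    """Hypothetical decorator: accept `old_name` as a deprecated alias of `rng`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if "rng" in kwargs:
                    raise TypeError(f"Pass `rng` or `{old_name}`, not both.")
                warnings.warn(
                    f"Keyword `{old_name}` is deprecated; use `rng` instead.",
                    DeprecationWarning,
                    stacklevel=2,
                )
                # Hand the legacy value over untouched to preserve old behaviour.
                kwargs["rng"] = kwargs.pop(old_name)
            elif "rng" in kwargs:
                # Only new-style keyword input is normalized by the decorator.
                kwargs["rng"] = np.random.default_rng(kwargs["rng"])
            return func(*args, **kwargs)

        return wrapper

    return decorator


def _legacy_check_random_state(seed):
    """Stand-in for a library's existing normalization code."""
    if isinstance(seed, (np.random.RandomState, np.random.Generator)):
        return seed
    if seed is None:
        return np.random.mtrand._rand  # global state (private NumPy attribute)
    return np.random.RandomState(seed)


@transition_to_rng("random_state")
def sample(n, rng=None):
    # Rename immediately so the existing normalization keeps working.
    random_state = _legacy_check_random_state(rng)
    return random_state.uniform(size=n)
```

With this sketch, `sample(3, rng=1)` draws from `np.random.default_rng(1)`, while `sample(3, random_state=1)` warns and still draws from `np.random.RandomState(1)`.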

Contributor

@rgommers rgommers left a comment

I re-read the current version of the SPEC. It's pretty good, but a bit too brief in places. Most importantly, it'd be nice to expand the motivation, given it's only a few sentences now, for a change with a very large blast radius. The point on avoiding naive seeding strategies is clear, the one on avoiding global state is not. It requires explaining why this is a net win.

A consequence of no longer using global state is that if a library function uses random number generation internally, all calls from other library functions (including those in downstream libraries) now must expose an rng keyword as well and thread through the rng=rng keywords. Otherwise code that could previously be made to run deterministically no longer can be made to do so - which is not great. It was always a good idea to thread through random-state keywords, but it'll be much more important to do so now. It's worth calling this out explicitly.
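
A small sketch of what threading `rng=rng` through nested calls might look like. `resample` and `bootstrap_means` are hypothetical functions, standing in for an upstream routine and a downstream wrapper around it.

```python
import numpy as np


def resample(data, *, rng=None):
    """Hypothetical upstream routine that needs randomness."""
    rng = np.random.default_rng(rng)
    return rng.choice(data, size=len(data), replace=True)


def bootstrap_means(data, n_resamples, *, rng=None):
    """Hypothetical downstream routine: exposes `rng` and passes it on."""
    # Normalize once; default_rng returns an existing Generator unaltered,
    # so the inner calls all advance the same stream.
    rng = np.random.default_rng(rng)
    return np.array([resample(data, rng=rng).mean() for _ in range(n_resamples)])
```

If `bootstrap_means` did not expose `rng` and forward it, callers would have no way to make it deterministic once the global state is no longer consulted.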

From an adoption point of view, as a SciPy maintainer I'd really like to see scikit-learn commit to implementing this proposal. SciPy now mostly matches scikit-learn and took the current random_state= infra from scikit-learn IIRC. Hence if we'd change SciPy without scikit-learn doing the same, the API uniformity doesn't get all that much better.

Legacy behavior in packages such as scikit-learn (`sklearn.utils.check_random_state`) typically handle `None` (use the global seed state), an int (convert to `RandomState`), or `RandomState` object.

Two strong motivations for moving over to `Generator`s are:
Contributor

I recommend splitting this off into a separate ## Motivation section which goes above the Implementation section. That will help the reader, and it will make clearer to the authors how short the motivation given now is (it needs expanding).

Contributor

Can do.

Contributor

@mdhaber mdhaber Aug 7, 2024

Partially addressed by stefanv@2dcf128, but I leave addition of motivation for a future PR. I'm not the best person to comment on this; I was comfortable controlling using np.random.rand and only changed due to peer pressure : )

Contributor

That links back to my comment - I think you wanted to link a commit?

Contributor

@mdhaber mdhaber Aug 7, 2024

Yup, fixed it - stefanv@2dcf128. But I just moved the section; didn't add additional motivation.

Member Author

The SPEC template doesn't have a motivation section (/cc @jarrodmillman perhaps we should consider), but I added some motivational text to the description.

Contributor

@mdhaber mdhaber left a comment

Recent round of changes look good to me. Good to include /... for old names other than random_state and seed as you have, and I think reordering the points was a good idea.

Member

@lagru lagru left a comment

This reads great and I think is ready to merge as a draft. Nevertheless, I have some non-critical comments. Those don't necessarily have to hold this up and could also be addressed in follow-up PRs. 😊

Sorry if my comments were already addressed in previous discussions. I didn't check those.

- If neither `rng` nor `random_state`/`seed`/`...` is specified and `np.random.seed` has been used to set the seed, emit a `FutureWarning` about the upcoming change in behavior.
- If `random_state`/`seed`/`...` is passed by keyword or by position, treat it as before, but:

- Emit a `DeprecationWarning` if passed by keyword, warning about the deprecation of keyword `random_state` in favor of `rng`.
Member

Suggested change
- Emit a `DeprecationWarning` if passed by keyword, warning about the deprecation of keyword `random_state` in favor of `rng`.
- Emit a `FutureWarning` if passed by keyword, warning about the deprecation of keyword `random_state` in favor of `rng`.

Not sure why using the deprecated keyword shouldn't also be a FutureWarning when using the deprecated position is. It's both aimed at end users (Python warning categories).

Member Author

I'll let @mdhaber take that one.

Contributor

I was not familiar with that being the official distinction. This was written based on what I've seen in SciPy.

In our release notes, there is a standard note about users checking for DeprecationWarnings:

Before upgrading, we recommend that users check that their own code does not use deprecated SciPy functionality (to do so, run your code with python -Wd and check for DeprecationWarnings).

In SciPy, I have always seen a DeprecationWarning when something is going to raise an error in the future (i.e. in line with the Hinsen principle) and a FutureWarning when something is going to change behavior but not raise an error (i.e. in violation of the Hinsen principle).

Often DeprecationWarnings are used for functions and keywords being removed. FutureWarnings are a lot less common. Here's an example of a decision made between the two.

In this case, a keyword is being removed in the sense that if a user passes random_state=10 they will get an error, so SciPy would probably emit a DeprecationWarning. If they pass 10 to random_state by position, on the other hand, they will not get an error in the future, but the behavior will change, so SciPy would probably emit a FutureWarning.

I suppose we can mention that projects are welcome to follow their conventions about which type of warning to emit.
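
To make the distinction concrete, here is a hedged sketch of how a library might pick the category under the SciPy convention described above. `warn_on_legacy_use` is a hypothetical helper, not part of SciPy or of this SPEC, and the messages are invented.

```python
import warnings


def warn_on_legacy_use(args, kwargs, *, position, old_name="random_state"):
    """Warn about legacy usage and return the category used (None if no warning).

    `position` is the index in `args` where the old argument used to be accepted.
    """
    if old_name in kwargs:
        # The keyword will eventually raise an error: a deprecation.
        category = DeprecationWarning
        message = f"Keyword `{old_name}` is deprecated; pass `rng` instead."
    elif len(args) > position:
        # Positional use keeps working, but will be interpreted differently.
        category = FutureWarning
        message = (
            f"The positional argument at index {position} will be passed to "
            "`np.random.default_rng` in a future release."
        )
    else:
        return None
    # stacklevel=3: helper -> library function -> user's call site.
    warnings.warn(message, category, stacklevel=3)
    return category
```

A project that prefers `FutureWarning` everywhere could simply use that category in both branches.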

Member Author

The main distinction here seems to be: by default, users won't see the DeprecationWarnings (without explicitly hunting for them with -Wd), whereas they will see FutureWarnings "out of the box".

Member

Yeah, that's usually why we default to FutureWarning in scikit-image. Though, I'm happy for this SPEC to recommend whatever. 😉

RNGLike = np.random.Generator | np.random.BitGenerator


def my_func(rng: RNGLike | SeedLike | None = None):
Member

Suggested change
def my_func(rng: RNGLike | SeedLike | None = None):
def my_func(*, rng: RNGLike | SeedLike | None = None):

Use rng as keyword-only here to set a "good example"?

Unless switching the behavior in-place is actually intended like the decorator implies? Not using keyword-only increases the number of cases in which the Hinsen principle isn't maintained.

Member Author

Yes, probably better to enforce it; I've added the change.

Contributor

@mdhaber mdhaber Aug 27, 2024

BTW, is there a good reference for the name "Hinsen principle"? The idea is great, but this is the Google search result for the "famous" Hinsen principle : )

[Image: screenshot of the Google search result]

Member

I also had to google for it, but then found a couple of interesting reads that now populate a few browser tabs set aside to be read later :)

Member Author

I suspect it comes from conversations with @khinsen, or perhaps his blog post https://blog.khinsen.net/posts/2017/11/16/a-plea-for-stability-in-the-scipy-ecosystem.html

Konrad complains about the scientific Python ecosystem a fair bit, especially about how we don't pay attention to backward compatibility. So, I suppose it is ironic that we spend so much time on this problem, and only fitting that we name one of our biggest headaches after him ;)

[Konrad doesn't seem like someone who'd take offense at my humor, but I'll make clear that that's what it is either way.]

Member Author

Well, I am an academic, any citation is good for me ;-)

😂

Since we have you here, Konrad, would you say the above is the best citation of yours to the principle? I can add the URL to the SPEC and to the skimage discussion, so people will hopefully find it more easily in the future.

Alternatively, we can do something a tiny bit more ambitious and write a blog post at https://blog.scientific-python.org to explain the problem.

I don't know if anyone has been following, but scikit-image has been in a major holding pattern over this for years now. We're trying terribly hard to get a new version of the package out, with API designs we consider superior, but in many cases run into silent backward incompatibility. Our only viable solution so far is to release into an entirely new namespace, skimage2, which in itself is a headache. We haven't proceeded with that yet, because we are somewhat terrified of estranging existing users, but at the same time know we have to move forward.

Member Author

I should also probably apologize to @khinsen for using the name Scientific Python without his permission (ref https://pypi.org/project/ScientificPython/). An oversight at the time, but hopefully we will receive forgiveness, given that we're trying to fulfill a mission I believe he is strongly in favor of.

Member

@mdhaber, I'm not clear on your view on the current change to recommend keyword-only here. Is that okay or would you prefer to revert it before we merge this?

Contributor

I think keyword-only is preferable, too. If the original argument was keyword-only, I think the decorator can keep it keyword-only, so this is not necessarily inconsistent with what the decorator can do.

I'll think about extending it to make it able to deprecate positional use of an argument if desired. But that doesn't need to hold anything up.

Copy link

would you say the above is the best citation of yours to the principle?

Yes!

Our only viable solution so far is to release into an entirely new namespace, skimage2, which in itself is a headache.

I see why it's a headache, but I also believe it's the best solution in the long run. Explicit is better than implicit. New API, new name. And it's done elsewhere with good success.

hopefully we will receive forgiveness, given that we're trying to fulfill a mission I believe he is strongly in favor of.

Indeed I am! The name clash has regularly been a source of confusion, but nothing serious. And by now, my ScientificPython has disappeared from most people's radar, being Python-2-only.

"changing: the argument will be validated using "
f"`np.random.default_rng` beginning in SciPy {end_version}, "
"and the resulting `Generator` will be used to generate "
"random numbers."
Member

Maybe add, that the underlying distribution of numbers will not change?

Member Author

@stefanv stefanv Aug 27, 2024

I thought about adding:

"The `Generator`'s underlying bitstream may "
"be different, but the *distributions* of pseudo-random "
"numbers generated hold the same properties as before."

But, I'm not sure we can say that with confidence. The user does not know precisely how the library is using their RNG.
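
A quick illustration of the part that can be checked directly: the same integer seed yields different streams from `RandomState` and `default_rng`, while direct uniform draws from both behave alike statistically. The seed and sample size below are arbitrary, and this says nothing about how a given library consumes the stream, which is exactly the caveat above.

```python
import numpy as np

seed = 2024
n = 100_000

legacy = np.random.RandomState(seed).uniform(size=n)   # MT19937-based legacy stream
modern = np.random.default_rng(seed).uniform(size=n)   # default Generator stream

# Same seed, different numbers; both samples should look uniform on [0, 1).
print(np.array_equal(legacy, modern))
print(legacy.mean(), modern.mean())
```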

Member

@lagru lagru left a comment

Thanks for addressing the feedback. Happy to get this in.

@lagru
Member

lagru commented Aug 29, 2024

Thanks everyone! I'm going ahead and will merge this as all threads seem more or less resolved. Any follow-up work can be done in new PRs. :)

@lagru lagru merged commit 15e7048 into scientific-python:main Aug 29, 2024
2 checks passed