-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SPEC7: Seeding pseudo-random number generation #180
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea to try to uniformize all that 👍
(I suppose you meant SPEC and not NEP.)
There's a deprecation strategy that can work to migrate from |
To make sure I understand, this will change the return values of functions ( |
Yes, it's a deprecation strategy, not a backwards-compatibility-preserving strategy. |
@stefanv thanks for starting to summarize that long and complex discussion!
Can you please elaborate on this? It's not all that obvious, because when you're not seeding the first intuition I'd have is "I am not expecting specific results, only random numbers with a given distribution". Since you're kinda steering towards a large amount of churn due to changing names here, I think it's important to be specific under what circumstances there is a backwards compatibility impact. I guess the point here is:
And then there's the question whether this scenarios matter. It may impact exact reproducibility of some scientific result. However that reproducibility was only ever guaranteed when using the same version of the same libraries on the same hardware. I'd suggest finding the most compelling scenario here, that makes it as easy as possible to say that that's not acceptable, and hence we must change from |
The deprecation strategy I outlined does imply a change in semantics of the affected functions above and beyond the change in the precise numbers that come out of them. There are plenty of programs (using |
Right, and my take was that this is a desired outcome from our perspective. |
By far the most common use of seeding is to fix test suites. Most of those will keep running as-is. The failures that arise will be legitimate failures, and could be fixed by playing with the seed, or by making the underlying code more robust. |
I am willing to spend time on the rewriting. The test suite is seriously out-of-date in many places anyways. You can even smell the year from just by reading the comments. |
I've made more explicit the points you mentioned, Ralf. It may benefit from fleshing out even further as we continue to evolve the document. I don't want to tighten things up before we've agreed on a pathway forward! |
Yes, I think so. But I interpreted Ralf's question as whether it was really necessary to go through a deprecation and a name churn to do this instead of just changing what |
Yes, I think that's saying exactly the same thing I was saying in my bullet points higher up. I would add that library code doing this is already broken, because it's not robust to (for example) the end user using |
Why don't we emit a warning from numpy random.seed? |
Because it will create utter havoc in the many valid uses in test suites? |
Isn't that what you want, eventually? |
Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed? |
Yes. There are plenty of ML programs (in particular) that call
One enormous hurdle at a time, please. 😉 |
Fair enough :) |
There is no plan to do so. Deprecating |
Yes, okay - I agree, this summary and rationale is enough to explain why we cannot stay with That also means that item (a) of my reasoning in scipy/scipy#14322 (comment) is not "in the same ballpark" and hence it seems clear now that we should prefer |
I think that would not prevent from having a user warning. Average users don't read docs and keep copy pasting old code until "something" is getting in their ways. So until it's visible in their code that something is legacy they will keep using that I am afraid. Also reading at the NEP19, to me it's really not clear that the global state would not change. The fate of
|
There seems to be some vague consensus around the deprecation approach. I don't want to run things ahead, but at the same time scikit-image has to make a calculated guess of what to do for its forthcoming release. So, without holding anyone to the fire, I will propose that we make the I would appreciate it if those involved in the discussion would co-author this SPEC (whether by adding your name to the authors list, or by helping to clarify language). If you want to keep a safe distance, advice on how to solidify the thrust of the argument further would also be welcome. Thanks! |
Use `rng` consistently, replacing `random_state` and `seed`. See also scientific-python/specs#180
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice, left some comments, but nothing serious (cool to have tests and have it much more verbose)!
The one thing that I do feel is that it would be good to either make sure rng
is fully normalized (i.e. a RandomState
instance also in the old paths), or at least mention briefly that normalization is still required.
Fully normalizing may be a bit tedious in old positional use, unless you keep their signature as:
def library_function(..., random_state=None, *, rng=None):
# Always just ignore random_state
(Although I guess it isn't hard, you just have to build a new args tuple.)
I removed the normalization code that was run with the decorator because we don't know what the decorated function was doing with values passed to old arguments like
We can try to mention this explicitly. Again, the decorator just emits warnings and processes input passed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-read the current version of the SPEC. It's pretty good, but a bit too brief in places. Most importantly, it'd be nice to expand the motivation, given it's only a few sentences now, for a change with a very large blast radius. The point on avoiding naive seeding strategies is clear, the one on avoiding global state is not. It requires explaining why this is a net win.
A consequence of no longer using global state is that if a library function uses random number generation internally, all calls from other library functions (including those in downstream libraries) now must expose an rng
keyword as well and thread through the rng=rng
keywords. Otherwise code that could previously be made to run deterministically no longer can be made to do so - which is not great. It was always a good idea to thread through random-state keywords, but it'll be much more important to do so now. It's worth calling this out explicitly.
From an adopting point of view, as a SciPy maintainer I'd really like to see scikit-learn commit to implementing this proposal. SciPy now mostly matches scikit-learn and took the current random_state=
infra from scikit-learn IIRC. Hence if we'd change SciPy without scikit-learn doing the same, the API uniformity doesn't get all that much better.
spec-0007/index.md
Outdated
|
||
Legacy behavior in packages such as scikit-learn (`sklearn.utils.check_random_state`) typically handle `None` (use the global seed state), an int (convert to `RandomState`), or `RandomState` object. | ||
|
||
Two strong motivations for moving over to `Generator`s are: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend splitting this off into a separate ## Motivation
section which goes above the Implementation section. That will help the reader, and it will make clearer to the authors how short the motivation given now is (it needs expanding).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partially addressed by stefanv@2dcf128, but I leave addition of motivation for a future PR. I'm not the best person to comment on this; I was comfortable controlling using np.random.rand
and only changed due to peer pressure : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That links back to my comment - I think you wanted to link a commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, fixed it - stefanv@2dcf128. But I just moved the section; didn't add additional motivation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SPEC template doesn't have a motivation section (/cc @jarrodmillman perhaps we should consider), but I added some motivational text to the description.
…or `end_version` optional
Adjustments to SPEC7 per recent reviews
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recent round of changes look good to me. Good to include /...
for old names other than random_state
and seed
as you have, and I think reordering the points was a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reads great and I think is ready to merge as a draft. Nevertheless, I have some non-critical comments. Those don't necessarily have to hold this up and could also be addressed in follow-up PRs. 😊
Sorry if my comments where already addressed in previous discussions. I didn't check those.
- If neither `rng` nor `random_state`/`seed`/`...` is specified and `np.random.seed` has been used to set the seed, emit a `FutureWarning` about the upcoming change in behavior. | ||
- If `random_state`/`seed`/`...` is passed by keyword or by position, treat it as before, but: | ||
|
||
- Emit a `DeprecationWarning` if passed by keyword, warning about the deprecation of keyword `random_state` in favor of `rng`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Emit a `DeprecationWarning` if passed by keyword, warning about the deprecation of keyword `random_state` in favor of `rng`. | |
- Emit a `FutureWarning` if passed by keyword, warning about the deprecation of keyword `random_state` in favor of `rng`. |
Not sure why using the deprecated keyword shouldn't also be a FutureWarning
when using the deprecated position is. It's both aimed at end users (Python warning categories).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let @mdhaber take that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not familiar with that being the official distinction. This was written based on what I've seen in SciPy.
In our release notes, there is a standard note about users checking for DeprecationWarning
s:
Before upgrading, we recommend that users check that their own code does not use deprecated SciPy functionality (to do so, run your code with
python -Wd
and check forDeprecationWarning
s).
In SciPy, I have always seen a DeprecationWarning
when something is going to raise an error in the future (i.e. in line with Hinsen principle) and a FutureWarning
when something is going to change behavior but not raise an error (i.e. in violation of the Hinsen principle).
Often DeprecationWarning
s are used for functions and keywords being removed. FutureWarning
s are a lot less common. Here's an example of a decision made between the two.
In this case, a keyword is being removed in the sense that if a user passes random_state=10
they will get an error, so SciPy would probably emit a DeprecationWarning
. If they pass 10
to random_state
by position, on the other hand, they will not get an error in the future, but the behavior will change, so SciPy would probably emit a FutureWarning
.
I suppose we can mention that projects are welcome to follow their conventions about which type of warning to emit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main important distinction here seem to be: by default, users won't see the DeprecationWarning
s (without explicitly hunting for them with -Wd
), whereas they will see FutureWarning
s "out the box".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's usually why we default to FutureWarning
in scikit-image. Though, I'm happy for this SPEC to recommend whatever. 😉
spec-0007/index.md
Outdated
RNGLike = np.random.Generator | np.random.BitGenerator | ||
|
||
|
||
def my_func(rng: RNGLike | SeedLike | None = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def my_func(rng: RNGLike | SeedLike | None = None): | |
def my_func(*, rng: RNGLike | SeedLike | None = None): |
Use rng
as keyword-only here to set a "good example"?
Unless switching the behavior in-place is actually intended like the decorator implies? Not using keyword-only increases the the number of cases in which the Hinsen principle isn't maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, probably better to enforce it; I've added the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also had to google for it, but then found a couple of interesting reads that now populates a few browser tabs set aside to be read later :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect it comes from conversations with @khinsen, or perhaps his blog post https://blog.khinsen.net/posts/2017/11/16/a-plea-for-stability-in-the-scipy-ecosystem.html
Konrad complains about the scientific Python ecosystem a fair bit, especially about how we don't pay attention to backward compatibility. So, I suppose it is ironic that we spend so much time on this problem, and only fitting that we name one of our biggest headaches after him ;)
[Konrad doesn't seem like someone who'd take offense at my humor, but I'll make clear that that's what it is either way.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I am an academic, any citation is good for me ;-)
😂
Since we have you here, Konrad, would you say the above is the best citation of yours to the principle? I can add the URL to the SPEC and to the skimage discussion, so people will hopefully find it more easily in the future.
Alternatively, we can do something a tiny bit more ambitious and write a blog post at https://blog.scientific-python.org to explain the problem.
I don't know if anyone has been following, but scikit-image has been in a major holding pattern over this for years now. We're trying terribly hard to get a new version of the package out, with API designs we consider superior, but in many cases run into silent backward incompatibility. Our only viable solution so far is to release into an entirely new namespace, skimage2
, which in itself is a headache. We haven't proceeded with that yet, because we are somewhat terrified of estranging existing users, but at the same time know we have to move forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should also probably apologize to @khinsen for using the name Scientific Python without his permission (ref https://pypi.org/project/ScientificPython/). An oversight at the time, but hopefully we will receive forgiveness, given that we're trying to fulfill a mission I believe he is strongly in favor of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdhaber, I'm not clear on your view on the current change to recommend keyword-only here. Is that okay or would you prefer to revert it before we merge this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keyword-only is preferable, too. If the original argument was keyword-only, I think the decorator can keep it keyword-only, so this is not necessarily inconsistent with what the decorator can do.
I'll think about extending it to make it able to deprecate positional use of an argument if desired. But that doesn't need to hold anything up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you say the above is the best citation of yours to the principle?
Yes!
Our only viable solution so far is to release into an entirely new namespace, skimage2, which in itself is a headache.
I see why it's a headache, but I also believe it's the best solution in the long run. Explicit is better than implicit. New API, new name. And it's done elsewhere with good success.
hopefully we will receive forgiveness, given that we're trying to fulfill a mission I believe he is strongly in favor of.
Indeed I am! The name clash has regularly been a source of confusion, but nothing serious. And by now, my ScientificPython has disappeared from most people's radar, being Python-2-only.
"changing: the argument will be validated using " | ||
f"`np.random.default_rng` beginning in SciPy {end_version}, " | ||
"and the resulting `Generator` will be used to generate " | ||
"random numbers." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add, that the underlying distribution of numbers will not change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about adding:
"The `Generator`s underlying bitstream may "
"be different, but the *distributions* of pseudo-random "
"numbers generated hold the same properties as before."
But, I'm not sure we can say that with confidence. The user does not know precisely how the library is using their RNG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the feedback. Happy to get this in.
Thanks everyone! I'm going ahead and will merge this as all threads seem more or less resolved. Any follow-up work can be done in new PRs. :) |
Under discussion at scipy/scipy#14322