Add SPEC7: Seeding pseudo-random number generation #180

Merged
merged 57 commits into from
Aug 29, 2024
Changes from 11 commits
b233242
Add draft SPEC7: seeding pseudo-random number generation
stefanv Apr 19, 2023
1e0d3f9
Add deprecation strategy
stefanv Apr 21, 2023
009453f
Explain why globally seeded code will become unseeded
stefanv Apr 25, 2023
cfa4fd8
Integrate some reviewer feedback
stefanv Jun 4, 2024
9bf299a
Adjust based on https://github.com/scientific-python/specs/pull/180#d…
stefanv Jun 4, 2024
fb64c31
Fix typo
stefanv Jun 4, 2024
d55c0b2
Add code snippet to SPEC 7 (as proposed for SciPy)
seberg Jun 4, 2024
935fed2
Add a library function example
seberg Jun 4, 2024
a17eba9
Apply suggestions from code review
seberg Jun 6, 2024
0703ee8
Fix implementation note about random.seed()
seberg Jun 6, 2024
3fd8fc7
Merge pull request #1 from seberg/nep7-prng
stefanv Jul 2, 2024
7c163df
Add Pamphile as co-author
stefanv Jul 12, 2024
e2d6541
Add @lagru's keyword-only suggestion
stefanv Jul 12, 2024
9894af0
Appease linter
stefanv Jul 12, 2024
8d89cc8
Add @ilayn's suggestion to separate high-level goals and technical re…
stefanv Jul 12, 2024
151d68d
Add type annotation
stefanv Jul 12, 2024
5491549
How to transition away from np.random.seed
stefanv Jul 12, 2024
f4cf078
Using random_state kwd will eventually raise an error
stefanv Jul 12, 2024
9d60785
Clarify language describing decorator
stefanv Jul 12, 2024
7531111
Add @mdhaber's docstring to the example function
stefanv Jul 12, 2024
c196b74
Assorted tweaks and decorator code generalization
mdhaber Jul 13, 2024
415f173
Adjustments per review
mdhaber Jul 13, 2024
efc6428
Correct decorator
mdhaber Jul 15, 2024
83a1d92
Apply suggestions from code review
stefanv Jul 16, 2024
7a4258e
Pull code into external file
stefanv Jul 17, 2024
4679d40
Mention that type annotation describes additional types
stefanv Jul 17, 2024
14ab4a4
DOC: tweak
mdhaber Jul 17, 2024
07a2235
MAINT: add common message about different behavior of default_rng
mdhaber Jul 17, 2024
07744f2
Apply suggestions from code review
mdhaber Jul 17, 2024
be41cd6
Update spec-0007/transition_to_rng.py
mdhaber Jul 17, 2024
4b1c1c4
Merge pull request #2 from mdhaber/nep7-prng
stefanv Jul 17, 2024
a6e1caa
Fix unterminated string
stefanv Jul 17, 2024
38341b1
Add tests for _transition_to_rng
stefanv Jul 17, 2024
63a7bf5
Test for multiple specifications of rng/random_state
stefanv Jul 17, 2024
ef35bb9
General transition_to_rng docstring edits
stefanv Jul 17, 2024
4f8a441
Rename NEW_NAME to caps, as suggested by @mdhaber
stefanv Jul 17, 2024
e6cf422
Add tests for keyword-only decorator usage
stefanv Jul 17, 2024
a7d844f
Move Hinsen footnote closer to mention
stefanv Jul 17, 2024
83b28b6
Indent list in numerated list
stefanv Jul 17, 2024
cef9c51
Avoid usage of overloaded term "recommend"
stefanv Jul 17, 2024
f894a7a
More careful phrasing around Hinsen principle
stefanv Jul 17, 2024
044ff76
Apply @mdhaber's suggestions from code review
stefanv Jul 18, 2024
27934c7
MAINT: change optional dep_version to required end_version
mdhaber Jul 20, 2024
636bc5d
MAINT: some positional arguments should not warn
mdhaber Jul 20, 2024
bd32f7b
DOC: describe other cases in which a warning is emitted when arg is p…
mdhaber Jul 20, 2024
6ec3b6a
MAINT: specify when warnings should begin to be emitted; make decorat…
mdhaber Aug 7, 2024
4aa7051
Compute cmn_msg at import time to save work
mdhaber Aug 7, 2024
37ff883
Clarify extent of deprecations
mdhaber Aug 7, 2024
2dcf128
Move motivation section
mdhaber Aug 7, 2024
cc43293
Show example 'library_function' in different stages
mdhaber Aug 7, 2024
7b4c61b
Adjust tests
mdhaber Aug 7, 2024
ab3f448
Apply suggestions from code review
mdhaber Aug 26, 2024
0ae7919
Merge pull request #3 from mdhaber/nep7-prng
stefanv Aug 27, 2024
b7ddc1f
Improve motivation; general edits
stefanv Aug 27, 2024
81da885
Apply pre-commit changes
stefanv Aug 27, 2024
b106497
Merge remote-tracking branch 'origin/main' into nep7-prng
stefanv Aug 27, 2024
1180190
Edits based on @lagru's feedback
stefanv Aug 27, 2024
183 changes: 183 additions & 0 deletions spec-0007/index.md
@@ -0,0 +1,183 @@
---
title: "SPEC 7 — Seeding pseudo-random number generation"
date: 2023-04-19
author:
- "Stéfan van der Walt <[email protected]>"
- "Sebastian Berg <[email protected]>"
- "Other participants in the discussion <[email protected]>"
discussion: https://github.com/scipy/scipy/issues/14322
endorsed-by:
---

## Description

<!--
Briefly and clearly describe the proposal.
Explain the general need and the advantages of this specific proposal.
If relevant, include examples of how the new functionality would be used,
intended use-cases, and pseudo-code illustrating its use.
-->

Libraries in the ecosystem provide disparate APIs for seeding pseudo-random number generation.
This SPEC suggests a single, pragmatic API for the ecosystem, taking into account technical and historical factors.
Adopting such a uniform API will simplify the user experience, especially for those who rely on multiple projects.

Specifically, we recommend that projects:

- Deprecate the use of `RandomState` and `np.random.seed`.
- Standardize usage and interpretation of an `rng` keyword for seeding.

### Concepts

- `BitGenerator`: Generates a stream of pseudo-random bits. The default generator in NumPy (`np.random.default_rng`) uses PCG64.

Review comment (Member): For simplicity, can this reference to `BitGenerator` be removed? For this SPEC, the user-facing API is around `Generator` and `RandomState`.

Reply (Member Author): Suggestion taken; I am waiting to see how the rest of the text pans out, to see whether we ever need to explain the notion of a bit generator. I quite like explaining it this way, since those are fundamental building blocks, but it is also fine to leave it out if it does not add anything.

- `Generator`: Derives pseudo-random numbers from the bits produced by a `BitGenerator`.
- `RandomState`: A [legacy object in NumPy](https://numpy.org/doc/stable/reference/random/index.html), similar to `Generator`, that produces random numbers based on the Mersenne Twister.
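
As a brief illustration of how the three concepts relate, the sketch below constructs each object explicitly. It uses only public NumPy names; the seed value `42` and the variable names are arbitrary choices for this example and do not come from the SPEC.

```python
import numpy as np

# A BitGenerator produces raw pseudo-random bits; PCG64 is NumPy's default.
bit_generator = np.random.PCG64(42)

# A Generator wraps a BitGenerator and turns its bits into numbers
# drawn from distributions.
rng = np.random.Generator(bit_generator)
sample = rng.normal(size=3)

# default_rng is a convenience constructor that builds the same pairing.
default = np.random.default_rng(42)
uses_pcg64 = isinstance(default.bit_generator, np.random.PCG64)

# RandomState is the legacy interface, backed by the Mersenne Twister.
legacy = np.random.RandomState(42)
legacy_sample = legacy.normal(size=3)
```

Note that `Generator` and `RandomState` expose similar methods, but they are not expected to produce the same numbers for the same seed.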

### Constraints

NumPy, SciPy, scikit-learn, scikit-image, and NetworkX all implement pseudo-random seeding in slightly different ways.
Common keyword arguments include `random_state` and `seed`.
In practice, the seed is unfortunately also often controlled using `np.random.seed`.

## Implementation

<!--
Discuss how this would be implemented.
-->

Legacy behavior in packages such as scikit-learn (`sklearn.utils.check_random_state`) typically handles `None` (use the global seed state), an int (convert to `RandomState`), or a `RandomState` object.

Two strong motivations for moving over to `Generator`s are:

Review thread:

Contributor: I recommend splitting this off into a separate `## Motivation` section which goes above the Implementation section. That will help the reader, and it will make clearer to the authors how short the motivation given now is (it needs expanding).

Contributor: Can do.

Contributor (mdhaber, Aug 7, 2024): Partially addressed by stefanv@2dcf128, but I leave addition of motivation for a future PR. I'm not the best person to comment on this; I was comfortable controlling using np.random.rand and only changed due to peer pressure : )

Contributor: That links back to my comment - I think you wanted to link a commit?

Contributor (mdhaber, Aug 7, 2024): Yup, fixed it - stefanv@2dcf128. But I just moved the section; didn't add additional motivation.

Member Author: The SPEC template doesn't have a motivation section (/cc @jarrodmillman perhaps we should consider), but I added some motivational text to the description.

(1) they avoid naïve seeding strategies, such as using successive integers, via the underlying [SeedSequence](https://numpy.org/doc/stable/reference/random/parallel.html#seedsequence-spawning);
(2) they avoid using global state (from `np.random.mtrand._rand`).
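
The two motivations can be illustrated with a short sketch. It assumes only public NumPy APIs (`SeedSequence.spawn` and `default_rng`); the seed values and variable names are arbitrary and chosen for this example.

```python
import numpy as np

# (1) Derive independent child seeds from one root seed instead of using
# successive integers (0, 1, 2, ...) as seeds.
root = np.random.SeedSequence(12345)
children = root.spawn(3)
streams = [np.random.default_rng(child) for child in children]
draws = [stream.random(2) for stream in streams]

# (2) Each Generator carries its own state; nothing is stored globally,
# so drawing from one Generator does not disturb another.
a = np.random.default_rng(7)
b = np.random.default_rng(7)
_ = a.random(100)  # advance `a` only
first_b = b.random()
fresh = np.random.default_rng(7).random()
same_as_fresh = first_b == fresh
```

Here `b` still yields the first value of its stream even though `a`, created from the same seed, has already been advanced.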

Our recommendation here is a deprecation strategy which does not in _all_ cases adhere to the Hinsen[^hinsen] principle.

The [deprecation strategy](https://github.com/scientific-python/specs/pull/180#issuecomment-1515248009) is:

1. Accept both `rng` and `random_state` keyword arguments.
2. If `rng=None`, handle `random_state` as in legacy behavior (see above), except use a compatible `Generator` instead of `RandomState`.
   Emit a `DeprecationWarning` to warn about a future change in behavior.
3. After <X time>, use only `rng`, seeding with `default_rng(rng)`.
   Raise an error if `random_state` is provided.
4. At a time of the library's choosing, remove any machinery related to `random_state`.
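
To make the end state of step 3 concrete, the following sketch shows how `np.random.default_rng` interprets the values a user might pass as `rng`. The helper name `resolve_rng` is hypothetical and exists only for this illustration; the SPEC does not define it.

```python
import numpy as np


def resolve_rng(rng=None):
    """Hypothetical helper: normalize `rng` as described in step 3."""
    return np.random.default_rng(rng)


# None: a new Generator seeded from fresh, unpredictable entropy.
unseeded = resolve_rng()

# int: a reproducible stream; the same seed gives the same numbers.
seeded_a = resolve_rng(2024)
seeded_b = resolve_rng(2024)
same_first_draw = seeded_a.random() == seeded_b.random()

# Generator: returned unaltered, so callers can share one stream.
existing = np.random.default_rng(5)
passed_through = resolve_rng(existing)
```

Because an existing `Generator` is passed through unchanged, a library function can call `default_rng(rng)` unconditionally without resetting the caller's stream.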

### Impact

The following users will be affected:

1. Those who use `np.random.seed`. The proposal will do away with that global seeding mechanism, meaning that code that relies on it will, after a certain deprecation period, start seeing a different stream of random numbers than before. To ensure that this does not go unnoticed, the library should emit a `FutureWarning` if `np.random.seed` was called earlier (we show how to do that further down).

   Such code will, in effect, go from being seeded to being unseeded.
   To prevent that, the code will have to be modified to explicitly pass an `rng` argument on each function call.

2. Those who do not seed. Their code will, after the deprecation period, use the newly proposed default. Since they were already not requesting repeatable sequences, and since the underlying _distributions_ of pseudo-random numbers did not change, they should be unaffected.

3. Users of `random_state=...`. Support for the `random_state` argument may be dropped eventually, but meanwhile we can provide clear guidance, via deprecation warnings and documentation, on how to migrate to the new `rng` keyword.
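
For users in the first group, the migration away from `np.random.seed` might look like the sketch below. The function `noisy_mean` is a hypothetical stand-in for any library function that follows the `rng` convention; it is not part of SciPy or any other library.

```python
import numpy as np


def noisy_mean(values, *, rng=None):
    """Hypothetical library function following the `rng` convention."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(scale=0.1, size=len(values))
    return float(np.mean(np.asarray(values) + noise))


# Before (relies on global state, which the SPEC phases out):
#     np.random.seed(0)
#     result = noisy_mean(data)

# After: create one Generator and pass it to every call explicitly.
data = [1.0, 2.0, 3.0]
rng = np.random.default_rng(0)
first = noisy_mean(data, rng=rng)
second = noisy_mean(data, rng=rng)  # continues the same stream

# Re-running with the same seed reproduces the sequence of results.
rng_again = np.random.default_rng(0)
repeat = noisy_mean(data, rng=rng_again)
```

Passing an integer such as `rng=0` to each call would also be reproducible, but it would restart the stream every time, so both calls would see identical noise.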

[^hinsen]: The Hinsen principle states, loosely, that code should, whether executed now or in the future, return the same result, or raise an error.

### Code

As an example, consider how SciPy would transition from the `seed` to the `rng` keyword using a decorator.
This is implemented using:

1. A `check_random_state` function, which normalizes either old (`seed`) or new (`rng`) input to a `Generator` or `RandomState` object.
   If neither `seed` nor `rng` was passed but the user has previously called `np.random.seed()`,
   this function emits a `FutureWarning`, because the behavior will change as noted in
   point 1 of the Impact section.
2. A decorator to deal with the `seed` to `rng` keyword rename. In future versions, this will deprecate the `seed` keyword. Meanwhile, it ensures that the documentation and auto-completion only advertise the new parameter name.
   Delaying the deprecation ensures that downstream users can switch to `rng=` on all supported SciPy versions when the deprecation happens.

```python
import functools
import numbers
import warnings

import numpy as np

_NoValue = object()  # singleton to indicate not explicitly passed


def check_random_state(seed=_NoValue, rng=_NoValue):
    if rng is not _NoValue and seed is not _NoValue:
        raise TypeError(
            "cannot pass both `rng=` and `random_state=` at the same time.")
    if rng is not _NoValue:
        return np.random.default_rng(rng)

    if seed is _NoValue:
        # If the user passed nothing, we have to reach into NumPy here:
        # 1. If np.random.seed(None) was called (or never called), then we can
        #    just use the default_rng (the result is random anyway).
        # 2. If it was called with a seed, we must return the global random
        #    state object and warn that the seed will be ignored in future.
        if np.random.mtrand._rand._bit_generator._seed_seq is not None:
            # The user did not seed, so no need to warn.
            return np.random.default_rng()
        warnings.warn(
            "The NumPy global RNG was seeded by calling `np.random.seed`. "
            "In the future, this function will ignore this seed and return "
            "random values as if a new `np.random.default_rng()` was created.",
            FutureWarning, stacklevel=5)
        return np.random.mtrand._rand
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, (np.random.RandomState, np.random.Generator)):
        return seed

    raise ValueError(f"'{seed}' cannot be used to seed a numpy.random.RandomState"
                     " instance")


def _prepare_rng(old_name, dep_version=None):
    new_name = "rng"

    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if dep_version:
                    # Support ends two minor releases after `dep_version`.
                    end_version = dep_version.split('.')
                    end_version[1] = str(int(end_version[1]) + 2)
                    end_version = '.'.join(end_version)
                    message = (f"Use of keyword argument `{old_name}` is "
                               f"deprecated and replaced by `{new_name}`. "
                               f"Support for `{old_name}` will be removed "
                               f"in SciPy {end_version}.")
                    warnings.warn(message, DeprecationWarning, stacklevel=2)
                if new_name in kwargs:
                    message = (f"{fun.__name__}() got multiple values for "
                               f"argument now known as `{new_name}`")
                    raise TypeError(message)

            # Normalize whichever keyword was given into a single `rng`.
            kwargs[new_name] = check_random_state(
                kwargs.pop(old_name, _NoValue),
                rng=kwargs.pop(new_name, _NoValue)
            )
            return fun(*args, **kwargs)
        return wrapper
    return decorator


# Review thread on the decorator line below:
#   Member: Note that this decorator uses the deprecated `random_state`,
#   while point 2 above talks about `seed` as the deprecated parameter.
#   Is this a typo, or am I confused myself?
#   Contributor: Ah, should align it to `random_state` probably. SciPy has
#   both, which is where this slip-up came from probably.
@_prepare_rng("random_state")
def library_function(*, rng=None):
    # The decorated library function takes an `rng` argument which is
    # guaranteed to be either a Generator or a RandomState.
    # `random_state=` is supported input (the old name can be customized).
    assert isinstance(rng, (np.random.Generator, np.random.RandomState))
```
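
The behavior expected from such a keyword rename can be checked with a few assertions-style probes. The sketch below is self-contained and therefore uses `rename_to_rng`, a hypothetical and deliberately simplified stand-in for the decorator above: it always warns on the old keyword, it normalizes every input with `np.random.default_rng` rather than preserving legacy `RandomState` behavior, and it does not inspect the global NumPy state.

```python
import functools
import warnings

import numpy as np


def rename_to_rng(old_name):
    """Hypothetical, simplified stand-in for the decorator above."""
    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if "rng" in kwargs:
                    raise TypeError(
                        f"{fun.__name__}() got multiple values for "
                        "argument now known as `rng`")
                warnings.warn(
                    f"`{old_name}` is deprecated; use `rng` instead.",
                    DeprecationWarning, stacklevel=2)
                kwargs["rng"] = kwargs.pop(old_name)
            # Assumption: only None, ints, and Generators are passed in.
            kwargs["rng"] = np.random.default_rng(kwargs.get("rng"))
            return fun(*args, **kwargs)
        return wrapper
    return decorator


@rename_to_rng("random_state")
def draw(*, rng=None):
    return rng.integers(0, 100)


# New spelling: reproducible for a fixed seed.
new_value = draw(rng=3)

# Old spelling: still works, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = draw(random_state=3)
old_warned = any(issubclass(w.category, DeprecationWarning) for w in caught)

# Both spellings at once are rejected.
try:
    draw(rng=3, random_state=3)
    both_rejected = False
except TypeError:
    both_rejected = True
```

In this simplified version the old and new spellings give identical results for the same integer seed; the decorator in the SPEC instead keeps legacy semantics for the old keyword during the transition.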

### Core Project Endorsement

<!--
Discuss what it means for a core project to endorse this SPEC.
-->

### Ecosystem Adoption

<!--
Discuss what it means for a project to adopt this SPEC.
-->

## Notes

<!--
Include a bulleted list of annotated links, comments,
and other ancillary information as needed.
-->