Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Family::get_or_create #102

Closed
wants to merge 1 commit into from

Conversation

nox
Copy link
Contributor

@nox nox commented Oct 13, 2022

The method should not overwrite an existing key after it obtains the write lock.

Consider the following scenario with Family<LabelSet, Gauge> used by threads A and B:

  1. A and B both call family.get_or_insert(&label_set)
  2. A and B both acquire read lock and finds no key
  3. A gets write lock and inserts a new gauge with value 0
  4. A increments returned gauge to 1
  5. B gets write lock and inserts a new gauge with value 0
  6. B increments returned gauge to 1
  7. A decrements gauge back to 0
  8. B decrements gauge which now overflows to u64::MAX

@nox nox force-pushed the fix-get-or-create branch 4 times, most recently from d9ec6e0 to d777de3 Compare October 13, 2022 08:56
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh silly me.

Great catch. Thanks for the patch.

Can you include a changelog entry?

Out of curiosity, did you hit this race condition, or found it while reading?

@nox
Copy link
Contributor Author

nox commented Oct 13, 2022

Out of curiosity, did you hit this race condition, or found it while reading?

We discovered it the hard way with gauges reporting increments of 18.4 quintillions after making the switch in 3 repos.

@nox
Copy link
Contributor Author

nox commented Oct 13, 2022

I'll add a changelog entry. Can you release 0.18.1 with that fix?

@nox nox force-pushed the fix-get-or-create branch from d777de3 to fe41d10 Compare October 13, 2022 09:07
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, did you hit this race condition, or found it while reading?

We discovered it the hard way with gauges reporting increments of 18.4 quintillions after making the switch in 3 repos.

Sorry to hear that. My bad. Thanks for debugging.

Do you have some time to tackle the second race condition as well? If not, I will give it a shot tomorrow.

I'll add a changelog entry. Can you release 0.18.1 with that fix?

Yes, I will do that as soon as we have both race conditions fixed. See below.

self.metrics
.write()
.entry(label_set.clone())
.or_insert_with(|| self.constructor.new_metric());

RwLockReadGuard::map(self.metrics.read(), |metrics| {
Copy link
Member

@mxinden mxinden Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Family::remove there is actually a second race condition.

In the scenarios where we have two threads:

  1. A: Call Family::get_or_create with a new label set.
    1. Acquiring the read lock. Nothing there.
    2. Acquiring the write lock. Inserting the new metric.
    3. Being preempted by the OS.
  2. B: Call Family::remove with the above label set.
  3. A: Acquiring the read lock for the second (last) time, .expect()ing the previously inserted metric to be there.

This can be prevented by leveraging RwLockWriteGuard::downgrade of the previously acquired write lock instead of doing acquiring another read lock via self.metrics.read.

Does the above make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense yes. The patch now uses RwLockWriteGuard::downgrade.

@nox nox force-pushed the fix-get-or-create branch from fe41d10 to e08410a Compare October 13, 2022 22:17
The method should not overwrite an existing key after it obtains the write lock.

Consider the following scenario with `Family<LabelSet, Gauge>` used by threads A and B:

1. A and B both call `family.get_or_insert(&label_set)`
2. A and B both acquire read lock and finds no key
3. A gets write lock and inserts a new gauge with value 0
4. A increments returned gauge to 1
5. B gets write lock and inserts a new gauge with value 0
6. B increments returned gauge to 1
7. A decrements gauge back to 0
8. B decrements gauge which now overflows to `u64::MAX`

Signed-off-by: Anthony Ramine <[email protected]>
@nox nox force-pushed the fix-get-or-create branch from e08410a to 290369f Compare October 13, 2022 22:21
This was referenced Oct 14, 2022
@mxinden
Copy link
Member

mxinden commented Oct 14, 2022

Released with #103 as v0.18.1. Merged into master via #104.

I am closing here.

Again, thanks @nox for debugging and thanks for providing a fix. Great work.

@mxinden mxinden closed this Oct 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants