Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all … #1823

Merged
merged 1 commit into from
Oct 11, 2021

Conversation

badboy
Copy link
Member

@badboy badboy commented Oct 8, 2021

…the time

This fixes a crashing bug, that probably started being an issue since
March 2020 / Glean v26, when we removed finalizers from the Kotlin
library1. Depending on how GC runs it was theoretically possible to
hit that even with finalizers in the SDK.

Contrary to the commit message in 1 we do have one more place where we
dynamically construct metrics: Labeled Metrics!
It used to be the case that whenever you access the submetric of a
labeled metric, that is Metrics.Category.Name["label"], we created a
new Rust object, put that into the internal ConcurrentHandleMap and
returned a handle to it.

A ConcurrentHandleMap does a lot of work to ensure that you can't misuse
handles against the wrong map, when the item was already replaced, etc.
It also sets an upper limit of entries it can handle2.
That limit is high enough for the static use of Rust (you would have to
use 32767+ metrics of a single type), but quickly[3] breaks down

The solution is to not re-create labeled submetrics on every access.
We cache those ourselves now, identified by the parent's metric handle
and the label.
Now accessing the same label 32768+ times is fine, you get the same
Rust object every time (as long as it exists).

There's one last piece: We do expose a way to destroy metrics, even if
that shouldn't happen for any statically defined ones.
This is not happening on Kotlin, where we removed finalizers some time
ago 1.
We're now removing those finalizers on Swift and Python as well.
We could try to get smart and do some double-checks that the handle is
not stale when the submetric is accessed.
But that breaks down quickly, because the metric could be destroyed
between the access and the recording, still leading to issues.
The only way to handle that would be some reference counting on the
object itself, but the current implementation doens't easily allow that.
So for now we just don't destroy boolean/counter/string metrics, ever.

The included tests fail without the corresponding library changes,
giving us a bit more confidence that the fix works.

Now there's still one way to trigger a crash, in theory:
Keep accessing previously-unused labels 32768+ times.
If you do that you're really on your own.

[3]: Well, "quickly". Reproducing that on get.webgl.org takes 8+ minutes.

@badboy badboy requested review from Dexterp37 and travis79 October 8, 2021 15:29
@badboy badboy force-pushed the fenix-crashes branch 2 times, most recently from 0079897 to 03b6378 Compare October 8, 2021 15:49
@mdboom
Copy link
Contributor

mdboom commented Oct 8, 2021

Question regarding this:

Now there's still one way to trigger a crash, in theory:
Keep accessing previously-unused labels 32768+ times.
If you do that you're really on your own.

Isn't that impossible given that we limit the number of unknown labels and then move to __other__?

Copy link
Contributor

@mdboom mdboom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc

glean-core/python/tests/metrics/test_labeled.py Outdated Show resolved Hide resolved
Copy link
Member

@travis79 travis79 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking, just a question about using bit shifting mostly. LGTM!

Copy link
Contributor

@Dexterp37 Dexterp37 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: with the change requests from Mike and Travis addressed. Great job.

…the time

This fixes a crashing bug, that probably started being an issue since
March 2020 / Glean v26, when we removed finalizers from the Kotlin
library[1]. Depending on how GC runs it was theoretically possible to
hit that even with finalizers in the SDK.

Contrary to the commit message in [1] we do have one more place where we
dynamically construct metrics: Labeled Metrics!
It used to be the case that whenever you access the submetric of a
labeled metric, that is `Metrics.Category.Name["label"]`, we created a
new Rust object, put that into the internal ConcurrentHandleMap and
returned a handle to it.

A ConcurrentHandleMap does a lot of work to ensure that you can't misuse
handles against the wrong map, when the item was already replaced, etc.
It also sets an upper limit of entries it can handle[2].
That limit is high enough for the static use of Rust (you would have to
use 32767+ metrics of a single type), but quickly[3] breaks down

The solution is to not re-create labeled submetrics on every access.
We cache those ourselves now, identified by the parent's metric handle
and the label.
Now accessing the same label 32768+ times is fine, you get the same
Rust object every time (as long as it exists).

There's one last piece: We do expose a way to destroy metrics, even if
that shouldn't happen for any statically defined ones.
This is not happening on Kotlin, where we removed finalizers some time
ago [1].
We're now removing those finalizers on Swift and Python as well.
We could try to get smart and do some double-checks that the handle is
not stale when the submetric is accessed.
But that breaks down quickly, because the metric could be destroyed
between the access and the recording, still leading to issues.
The only way to handle that would be some reference counting on the
object itself, but the current implementation doens't easily allow that.
So for now we just don't destroy boolean/counter/string metrics, ever.

The included tests fail without the corresponding library changes,
giving us a bit more confidence that the fix works.

Now there's still one way to trigger a crash, in theory:
Keep accessing previously-unused labels 32768+ times.
If you do that you're really on your own.

[1]: 25bf9ea
[2]: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L160-L166
[3]: Well, "quickly". Reproducing that on get.webgl.org takes 8+ minutes.
@badboy badboy merged commit 6c3be03 into main Oct 11, 2021
@badboy badboy deleted the fenix-crashes branch October 11, 2021 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants