Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all … #1823

badboy · 2021-10-08T15:29:16Z

…the time

This fixes a crashing bug, that probably started being an issue since
March 2020 / Glean v26, when we removed finalizers from the Kotlin
library1. Depending on how GC runs it was theoretically possible to
hit that even with finalizers in the SDK.

Contrary to the commit message in 1 we do have one more place where we
dynamically construct metrics: Labeled Metrics!
It used to be the case that whenever you access the submetric of a
labeled metric, that is Metrics.Category.Name["label"], we created a
new Rust object, put that into the internal ConcurrentHandleMap and
returned a handle to it.

A ConcurrentHandleMap does a lot of work to ensure that you can't misuse
handles against the wrong map, when the item was already replaced, etc.
It also sets an upper limit of entries it can handle2.
That limit is high enough for the static use of Rust (you would have to
use 32767+ metrics of a single type), but quickly[3] breaks down

The solution is to not re-create labeled submetrics on every access.
We cache those ourselves now, identified by the parent's metric handle
and the label.
Now accessing the same label 32768+ times is fine, you get the same
Rust object every time (as long as it exists).

There's one last piece: We do expose a way to destroy metrics, even if
that shouldn't happen for any statically defined ones.
This is not happening on Kotlin, where we removed finalizers some time
ago 1.
We're now removing those finalizers on Swift and Python as well.
We could try to get smart and do some double-checks that the handle is
not stale when the submetric is accessed.
But that breaks down quickly, because the metric could be destroyed
between the access and the recording, still leading to issues.
The only way to handle that would be some reference counting on the
object itself, but the current implementation doens't easily allow that.
So for now we just don't destroy boolean/counter/string metrics, ever.

The included tests fail without the corresponding library changes,
giving us a bit more confidence that the fix works.

Now there's still one way to trigger a crash, in theory:
Keep accessing previously-unused labels 32768+ times.
If you do that you're really on your own.

[3]: Well, "quickly". Reproducing that on get.webgl.org takes 8+ minutes.

mdboom · 2021-10-08T15:50:30Z

Question regarding this:

Now there's still one way to trigger a crash, in theory:
Keep accessing previously-unused labels 32768+ times.
If you do that you're really on your own.

Isn't that impossible given that we limit the number of unknown labels and then move to __other__?

mdboom

r+wc

glean-core/python/tests/metrics/test_labeled.py

glean-core/python/glean/metrics/boolean.py

travis79

Nothing blocking, just a question about using bit shifting mostly. LGTM!

glean-core/android/src/test/java/mozilla/telemetry/glean/private/LabeledMetricTypeTest.kt

glean-core/ios/GleanTests/Metrics/LabeledMetricTests.swift

glean-core/python/tests/metrics/test_labeled.py

Dexterp37

with the change requests from Mike and Travis addressed. Great job.

…the time This fixes a crashing bug, that probably started being an issue since March 2020 / Glean v26, when we removed finalizers from the Kotlin library[1]. Depending on how GC runs it was theoretically possible to hit that even with finalizers in the SDK. Contrary to the commit message in [1] we do have one more place where we dynamically construct metrics: Labeled Metrics! It used to be the case that whenever you access the submetric of a labeled metric, that is `Metrics.Category.Name["label"]`, we created a new Rust object, put that into the internal ConcurrentHandleMap and returned a handle to it. A ConcurrentHandleMap does a lot of work to ensure that you can't misuse handles against the wrong map, when the item was already replaced, etc. It also sets an upper limit of entries it can handle[2]. That limit is high enough for the static use of Rust (you would have to use 32767+ metrics of a single type), but quickly[3] breaks down The solution is to not re-create labeled submetrics on every access. We cache those ourselves now, identified by the parent's metric handle and the label. Now accessing the same label 32768+ times is fine, you get the same Rust object every time (as long as it exists). There's one last piece: We do expose a way to destroy metrics, even if that shouldn't happen for any statically defined ones. This is not happening on Kotlin, where we removed finalizers some time ago [1]. We're now removing those finalizers on Swift and Python as well. We could try to get smart and do some double-checks that the handle is not stale when the submetric is accessed. But that breaks down quickly, because the metric could be destroyed between the access and the recording, still leading to issues. The only way to handle that would be some reference counting on the object itself, but the current implementation doens't easily allow that. So for now we just don't destroy boolean/counter/string metrics, ever. The included tests fail without the corresponding library changes, giving us a bit more confidence that the fix works. Now there's still one way to trigger a crash, in theory: Keep accessing previously-unused labels 32768+ times. If you do that you're really on your own. [1]: 25bf9ea [2]: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L160-L166 [3]: Well, "quickly". Reproducing that on get.webgl.org takes 8+ minutes.

badboy requested review from Dexterp37 and travis79 October 8, 2021 15:29

badboy force-pushed the fenix-crashes branch 2 times, most recently from 0079897 to 03b6378 Compare October 8, 2021 15:49

mdboom approved these changes Oct 8, 2021

View reviewed changes

glean-core/python/tests/metrics/test_labeled.py Outdated Show resolved Hide resolved

mdboom reviewed Oct 8, 2021

View reviewed changes

glean-core/python/glean/metrics/boolean.py Show resolved Hide resolved

travis79 approved these changes Oct 8, 2021

View reviewed changes

Dexterp37 approved these changes Oct 11, 2021

View reviewed changes

badboy force-pushed the fenix-crashes branch from 03b6378 to 4993094 Compare October 11, 2021 08:49

badboy merged commit 6c3be03 into main Oct 11, 2021

badboy deleted the fenix-crashes branch October 11, 2021 09:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all … #1823

Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all … #1823

badboy commented Oct 8, 2021

mdboom commented Oct 8, 2021

mdboom left a comment

travis79 left a comment

Dexterp37 left a comment

Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all … #1823

Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all … #1823

Conversation

badboy commented Oct 8, 2021

mdboom commented Oct 8, 2021

mdboom left a comment

Choose a reason for hiding this comment

travis79 left a comment

Choose a reason for hiding this comment

Dexterp37 left a comment

Choose a reason for hiding this comment