feat: Extract metrics from sampled transactions [INGEST-331] #1161
This looks good to me, do we also need to conditionally disable trace sampling here?
relay/relay-server/src/actors/envelopes.rs
Line 2422 in 4639455
utils::sample_trace(
So basically,
- if metrics extraction is enabled, defer trace sampling to envelope processor.
- else, do it in envelope manager as we do now.
Could be a separate PR though.
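The branching suggested above could be sketched roughly as follows. This is only an illustration of the decision, not Relay's actual code: `SamplingStage` and `trace_sampling_stage` are hypothetical names, and the real check would presumably read the project configuration rather than a plain boolean.

```rust
/// Where trace sampling runs for an envelope.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SamplingStage {
    /// Sample early, in the envelope manager (what happens today).
    EnvelopeManager,
    /// Defer sampling to the envelope processor, so that metrics can be
    /// extracted before the transaction is dropped.
    EnvelopeProcessor,
}

/// Picks the sampling stage based on whether metrics extraction is enabled.
/// The boolean flag is an assumption standing in for the real config lookup.
pub fn trace_sampling_stage(metrics_extraction_enabled: bool) -> SamplingStage {
    if metrics_extraction_enabled {
        SamplingStage::EnvelopeProcessor
    } else {
        SamplingStage::EnvelopeManager
    }
}
```

With this shape, the envelope manager would only call `utils::sample_trace` when the returned stage is `EnvelopeManager`.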
@jjbayer goddamnit. i'll try to fix it in the same PR, that was an oversight.
@jjbayer since we are currently attempting to focus on releasehealth, and in order to not leave this PR hanging around, @jan-auer and I decided that it'd be best to merge this PR in its semi-broken state (assuming we will get buy-in from @getsentry/visibility), and later fix the trace-sampling case.
@untitaker I'm fine with this behavioural change. But I'll check with the Performance team now that the entire team is 100% back from OOO. If I haven't heard any objections by EOD (Toronto timezone), I'll approve this PR as is.
@untitaker overall looks good. I just want to confirm that, with extra dev work, there would still be a way to add breakdowns from SDKs in the future?
@silent1mezzo It does not become impossible, though at that point we need to define how a span ops breakdown from the SDK is merged with one computed by the server. It depends on those details as to how hard it will be.
Extract metrics from transactions that are affected by sampling rules. Since metrics are extracted from breakdowns, and breakdowns are computed during store normalization, this PR duplicates the breakdown computation into the metrics extraction codepath.
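As a rough illustration of what a breakdown computation shared by both codepaths might look like, here is a minimal sketch that reads the transaction immutably and returns the totals instead of writing them into the event. All names (`Span`, `Transaction`, `compute_span_op_breakdown`) are hypothetical, timestamps are simplified to integer milliseconds, and overlapping spans are simply summed, which is an assumption made for brevity and may differ from what Relay actually does.

```rust
use std::collections::BTreeMap;

/// Simplified span: an operation name plus start/end timestamps in
/// milliseconds. Hypothetical type for this sketch only.
pub struct Span {
    pub op: String,
    pub start_ms: u64,
    pub end_ms: u64,
}

/// Simplified transaction payload. Hypothetical type for this sketch only.
pub struct Transaction {
    pub spans: Vec<Span>,
}

/// Sums span durations per configured op prefix without mutating the
/// transaction. Spans matching no prefix are ignored; overlapping spans are
/// counted in full (an assumption, not necessarily Relay's behavior).
pub fn compute_span_op_breakdown(
    tx: &Transaction,
    op_prefixes: &[&str],
) -> BTreeMap<String, u64> {
    let mut totals = BTreeMap::new();
    for span in &tx.spans {
        // The first configured prefix that matches the span's op wins.
        if let Some(prefix) = op_prefixes.iter().find(|p| span.op.starts_with(**p)) {
            let duration = span.end_ms.saturating_sub(span.start_ms);
            *totals.entry(prefix.to_string()).or_insert(0) += duration;
        }
    }
    totals
}
```

Because the function only borrows the transaction, store normalization could write its result into the event while metrics extraction emits the same values as metrics, which keeps the two outputs from drifting apart.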
We could instead move breakdown extraction upfront, such that the event is mutated before metrics extraction happens, but this way metrics extraction (from dropped transactions) is easier to optimize should that be necessary. Concretely, I fear that down the road we realize that a) event deserialization is too slow and b) making it faster is impossible. If we manage to avoid mutating the event for metrics extraction, it becomes easier to later write a custom Deserialize impl that extracts the data before it is passed to FromValue at all, if we really need to go down that road. If that happens, there's still some mutation that we'd have to port to metrics extraction though (clock drift processing/finalize_event).
We need to be careful about not accidentally writing something different into the event than what we emit for metrics, but we'll just have to see how that plays out. We probably want to ensure that both versions of breakdown computation happen in the same relay instance (i.e. not compute breakdowns for metrics in a customer relay, then write the breakdowns in a processing relay), to avoid version mismatches, but it's unclear what else is needed beyond that.
This change does not yet introduce transaction metric extraction to customer relays. So the interaction between dynamic sampling and metrics is only correct if no customer relay is involved. It can still happen that a customer relay drops the transaction without sending metrics, in which case the processing relay can't do much about it. That will be addressed later.
As a side effect, and this is the part that's interesting for @getsentry/visibility, this PR also removes support for ingesting breakdowns from SDKs. There are explicit tests for this functionality, but there's no SDK that actually sends them, and from a quick conversation with Alberto it seems that we may not need it anytime soon. Removing this feature simplifies idempotency of breakdown extraction over multiple relays. If it turns out that we need or will need this feature, I'd like to know the timeframe so we can appropriately decide whether we should keep it (which means more work now) or re-add it later (more work in the future).