Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(spans): Use gauges to report self and total time to lower costs #3448

Merged
merged 6 commits into from
Apr 22, 2024

Conversation

phacops
Copy link
Contributor

@phacops phacops commented Apr 17, 2024

We're currently storing self and total times for spans as distribution metrics. Those store a lot of information in order to be flexible enough to calculate percentiles.

Since querying percentiles has been too slow to display on any screen, we decided not to do it this way and use a different method. It's not necessary to keep storing them as distributions and we could use gauges instead to lower our cost.

The plan is to add those new gauge metrics, record them for a while, add support to query them in the product behind a feature flag and switch to them when we're happy with the result.

@phacops phacops requested a review from a team as a code owner April 17, 2024 16:19
Copy link
Member

@jjbayer jjbayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea! This will pretty much double the amount of buckets we send between Relays and to Kafka. Can we put the new metrics behind a feature flag?

},
MetricSpec {
category: DataCategory::Span,
mri: "g:spans/self_time@millisecond".into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we stick with exclusive_time and duration instead of self_time and total_time? Or would that cause naming clashes in some downstream component that ignores the type prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can stick with them but we do rename them to total_time and self_time in the UI. I thought that was a good occasion to stop having to do that.

Any reason why not to change the names?

Copy link
Member

@jjbayer jjbayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the purpose of this PR to experiment with what the product would look like without distributions, or is it already decided and we now need the double write for the transition period?

If it's just for experimenting, I don't think we actually need to record a gauge metric to see what the product would look like without distributions. Distributions are (almost) a superset of gauges, so we should be able to query min, max, sum, and count from the distributions table just as easily as from the gauges table.

@phacops
Copy link
Contributor Author

phacops commented Apr 19, 2024

This is not to experiment, this is to "downgrade" the metric to lower its cost since we only use averages and a solution with percentiles won't come by improving distributions. Gauges are cheaper than distributions to store.

@phacops phacops requested a review from jjbayer April 19, 2024 21:26
@phacops phacops self-assigned this Apr 19, 2024
@phacops phacops enabled auto-merge (squash) April 22, 2024 11:39
];

if double_write_distributions_as_gauges {
metrics.append(&mut vec![
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metrics.append(&mut vec![
metrics.extend([

@phacops phacops merged commit 03643c3 into master Apr 22, 2024
20 checks passed
@phacops phacops deleted the pierre/spans-use-gauges-to-lower-cost branch April 22, 2024 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants