
rusty: Integrate stats with the metrics framework #377

Merged (1 commit into sched-ext:main, Jun 21, 2024)

Conversation

@jfernandez (Contributor):

We need a layer of indirection between the stats collection and their output destinations. Currently, stats are only printed to stdout. Our goal is to integrate with various telemetry systems such as Prometheus, StatsD, and custom metric backends like those used by Meta and Netflix. Importantly, adding a new backend should not require changes to the existing stats code.

This patch introduces the metrics [1] crate, which provides a framework for defining metrics and publishing them to different backends.

The initial implementation includes the dispatched_tasks_count metric, tagged with type. This metric increments every time a task is dispatched, emitting the raw count instead of a percentage. A monotonic counter is the most suitable metric type for this use case, as percentages can be calculated at query time if needed. Existing logged metrics continue to print percentages and remain unchanged.

A new flag, --enable-prometheus, has been added. When enabled, it starts a Prometheus endpoint on port 9000 (default is false). This endpoint allows metrics to be charted in Prometheus or Grafana dashboards.
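For reference, a Prometheus server could scrape that endpoint with a job entry along these lines (the job name and target are placeholders, not part of this patch):

```yaml
scrape_configs:
  - job_name: scx_rusty
    static_configs:
      - targets: ['localhost:9000']
```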

Example of charting the 1s rate of dispatched tasks by type (screenshot omitted).
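A chart like that can be produced with a PromQL query along these lines (the range selector depends on the scrape interval; a later comment in this thread uses 1m):

```promql
sum by (type) (rate(dispatched_tasks_count[1m]))
```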

Future changes will migrate additional stats to this framework and add support for other backends.

[1] https://metrics.rs/

@dschatzberg (Contributor) left a comment:

This looks very similar to what I added to scx_layered. I'm mostly happy with the approach, but there's still a fair bit of boilerplate. One idea I kicked around was making a procedural macro to define stats from a Struct which would generate a lot of this boilerplate.

counter!("dispatched_tasks_count", "type" => "direct_greedy_far").increment(direct_greedy_far);
counter!("dispatched_tasks_count", "type" => "dsq").increment(dsq);
counter!("dispatched_tasks_count", "type" => "greedy_local").increment(greedy_local);
counter!("dispatched_tasks_count", "type" => "greedy_xnuma").increment(greedy_xnuma);
Contributor:

This registers a new counter on each report(). I would think we want to register the counters ahead of time and only set their absolute value here, no?

Contributor Author:

Let me look into that and see if we'd benefit from instantiating them once and reusing them. I was under the impression that the library authors want this pattern and have optimized the implementation to avoid the perceived overhead of registering new counters on each call.

Contributor Author:

@dschatzberg I created a Metrics struct to hold the metrics and made it a member of the Scheduler struct. We now only register the metrics once.
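The register-once pattern can be illustrated without the `metrics` crate itself; in this sketch, `Counter` is a hypothetical stand-in (an atomic `u64`) for the handle that `counter!` returns, and the struct and field names are illustrative, not the PR's actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the counter handle the `counter!` macro returns.
struct Counter(AtomicU64);

impl Counter {
    fn new() -> Self {
        Counter(AtomicU64::new(0))
    }
    fn increment(&self, v: u64) {
        self.0.fetch_add(v, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

// Hypothetical Metrics struct: every counter is registered once, up front,
// instead of being looked up on every report() pass.
struct Metrics {
    dispatched_tasks_dsq: Counter,
    dispatched_tasks_greedy_local: Counter,
}

impl Metrics {
    fn new() -> Self {
        Metrics {
            dispatched_tasks_dsq: Counter::new(),
            dispatched_tasks_greedy_local: Counter::new(),
        }
    }
}

fn main() {
    let metrics = Metrics::new();
    // In report(), only the pre-registered handles are touched.
    metrics.dispatched_tasks_dsq.increment(3);
    metrics.dispatched_tasks_greedy_local.increment(1);
    assert_eq!(metrics.dispatched_tasks_dsq.get(), 3);
}
```

Making `Metrics` a member of the `Scheduler` struct, as described above, keeps registration out of the hot reporting path.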


let kick_greedy = stat(bpf_intf::stat_idx_RUSTY_STAT_KICK_GREEDY);
let repatriate = stat(bpf_intf::stat_idx_RUSTY_STAT_REPATRIATE);
Contributor:

You're not exposing these to the metrics backend?

Contributor Author:

I can add them in this PR as well

Contributor Author:

Added

stat_pct(bpf_intf::stat_idx_RUSTY_STAT_DL_CLAMP),
stat_pct(bpf_intf::stat_idx_RUSTY_STAT_DL_PRESET),
stat_pct(dl_clamped),
stat_pct(dl_preset),
);

info!("slice_length={}us", self.tuner.slice_ns / 1000);
Contributor:

I wonder if it's possible to make all this logging of stats into a metrics backend itself. That would reduce some of the boilerplate needed to add more stats.
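One way to picture that idea (with a hypothetical trait and names, not the `metrics` crate's actual recorder API): the stdout logging becomes just another backend that receives a stats snapshot, and percentage formatting moves behind it:

```rust
use std::collections::BTreeMap;

// Hypothetical backend trait; a real version would plug into the `metrics`
// crate's recorder/exporter interfaces instead.
trait StatsBackend {
    fn report(&mut self, stats: &BTreeMap<&'static str, u64>);
}

// Percentage rendering stays with the log backend, computed at output time.
fn format_line(name: &str, value: u64, total: u64) -> String {
    let pct = if total > 0 {
        100.0 * value as f64 / total as f64
    } else {
        0.0
    };
    format!("{name}={value} ({pct:.1}%)")
}

// Terminal logging as just another backend that receives a snapshot.
struct LogBackend;

impl StatsBackend for LogBackend {
    fn report(&mut self, stats: &BTreeMap<&'static str, u64>) {
        let total: u64 = stats.values().sum();
        for (name, value) in stats {
            println!("{}", format_line(name, *value, total));
        }
    }
}

fn main() {
    let mut stats = BTreeMap::new();
    stats.insert("dsq", 75u64);
    stats.insert("greedy_local", 25u64);
    let mut backend = LogBackend;
    backend.report(&stats);
}
```

With this shape, adding a new stat only touches the snapshot, not every backend.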

Contributor Author:

Yes, absolutely. My plan for the next PR was to make a terminal log exporter and have that own this logic.

@jfernandez (Contributor Author):

> This looks very similar to what I added to scx_layered. I'm mostly happy with the approach, but there's still a fair bit of boilerplate. One idea I kicked around was making a procedural macro to define stats from a Struct which would generate a lot of this boilerplate.

@dschatzberg can you give an idea of what that macro would look like? In general, I prefer not to add additional abstractions to metric clients and to be as close to the API as possible. But I'd be open to adding a macro if you give me some ideas of what it would look like.

@Byte-Lab (Contributor):

This looks excellent, thanks again for working on it. Once we've addressed the few things that @dschatzberg pointed out, I think we can land this and iterate in tree. Something I do think we'll want to consider addressing longer-term (which we already discussed offline, but just recording here for others' benefit): I do think that we could benefit from having a metrics crate that abstracts some of this stuff and avoids boilerplate, as I expect that a lot of the schedulers will be exporting stats in nearly the same way. In general it seems like schedulers do the following:

- Define various stats as a big enum -> record those stats in a per-cpu array map in BPF -> read those stats in user space and record them elsewhere (i.e. to stdout right now). It'd be pretty slick if we could integrate with the build system and have some of this boilerplate be auto-generated.

If we could somehow make that declarative (sort of similar to https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest_stats_table.h and https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.c#L210-L219 in scx_nest), I think it'd let us get rid of a good amount of code, and would make it a lot easier both to add stats and to understand them.

In any case, this looks good for now! Once Dan's changes are addressed, I'll stamp and merge.

@htejun (Contributor) commented Jun 20, 2024:

This looks better than OM and no objection from me. Some things to consider for the future:

  • Compact and declarative stat definition would be lovely.
  • Down the line, if all rust scheds follow the same recipe, it'd be great.
  • It'd be nicer if we could just pack stats into a struct and hand it to a stat backend, which can then present it however it wants (one of those backends could be a stdout formatter that the scheduler implementation provides). Structs like that are much easier to work with, e.g. for calculating deltas and fractions, serializing into something else, and so on.

@jfernandez (Contributor Author) commented Jun 21, 2024:

@Byte-Lab @htejun agreed and aligned on the long-term vision. My plan is to use scx_rusty to learn the common observability patterns and iterate to generalize this solution for all schedulers. I'll start working soon on removing boilerplate.

@dschatzberg I applied your feedback and I also created a Struct to hold the metrics. If you are not opposed, I'd like to first create a terminal logging backend so that we can clean up the report fn, and then circle back to removing boilerplate with macros as I mentioned above.

@dschatzberg (Contributor):

@jfernandez Yeah, that's totally fine by me. The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.

BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

@jfernandez (Contributor Author) commented Jun 21, 2024:

> The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.

Ah yeah, I'm familiar with this pattern and I like it as well. I will explore this when I focus on removing boilerplate.

> BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

@dschatzberg no, .increment() is still the correct function to call. Internally, counters should be a monotonically incrementing value; a gauge! is what we'd use to instrument an absolute value. In V1 of this PR, calling counter! was correctly registering the metric only once, and subsequent calls would return the cached metric. But initializing once and storing the handles in the struct is a bit more efficient and sets us up for removing boilerplate down the line.

How these monotonic counters are exported is up to the target backend. For Prometheus, since it's a poll-based system, the counter increases the value in memory until there is a scrape event, then it gets reset back to 0. Then the Prometheus backend simply adds to the value and you need to chart it with rate(dispatched_tasks_count[1m]) to see the rate per second.

For something like Netflix's spectatord, state is handled by the daemon, and we'd emit the metric on every increment! call without needing to store any state.

@jfernandez (Contributor Author):

I pushed one more change: I forgot to append `_count` to the end of the metric names after my refactor. That is a convention required by Prometheus that I think we should follow to be more explicit about metric types.

@jfernandez (Contributor Author):

One more change: I got the convention wrong. It's `_total`, not `_count`. Updated.

@Byte-Lab (Contributor) commented Jun 21, 2024:

> The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.
>
> Ah yeah, I'm familiar with this pattern and I like it as well. I will explore this when I focus on removing boilerplate.
>
> BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?
>
> @dschatzberg no, .increment() is still the correct function to call. Internally, counters should be a monotonic incrementing value. A gauge! is what we'd use if we want to instrument an absolute value. In V1 of this PR, calling counter! was correctly only registering the metric once and subsequent calls would return the cached metric. But initializing once and storing in the struct is a bit more efficient and sets us up for removing boilerplate down the line.

@jfernandez I'm a bit confused here. These counter!s are monotonic, but they're incremented and never reset from the BPF side. So if we incremented the values then we'd be incrementing using absolute values. Can we not use the absolute() function? In my mind a gauge is meant for something that's non-monotonic.

Edit: I see why we'd want to increment here instead of setting absolute if it's scraped and reset by the target backend, but if we do that I think we'd need to track what the actual increment was relative to the last time we collected the stats.

> How these monotonic counters are exported is up to the target backend. For Prometheus, since it's a poll-based system, the counter increases the value in memory until there is a scrape event, then it gets reset back to 0. Then the Prometheus backend simply adds to the value and you need to chart it with rate(dispatched_tasks_count[1m]) to see the rate per second.

Hmm, I'm still not quite following how we won't need state to track this (meaning, we record what the value was last time, and then increment based on that). Even if we emit the metric on every call to increment!(), we're still emitting an absolute value.

> For something like Netflix's spectatord, state is handled by the daemon, and we'd emit the metric on every increment! call without needing to store any state.

@jfernandez (Contributor Author):

@Byte-Lab ah! I assumed that the bpf stats were being reset for each loop. If they are always incrementing, then I need to rework this. Let me get back to you on this.

@Byte-Lab (Contributor):

> @Byte-Lab ah! I assumed that the bpf stats were being reset for each loop. If they are always incrementing, then I need to rework this. Let me get back to you on this.

Sorry @jfernandez, as we discussed on Slack you were totally correct. The values are reset on each read here: https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_rusty/src/main.rs#L433-L435. What you have now looks good.
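The two semantics debated above differ like this (illustrative sketch, not code from the PR): with cumulative kernel-side counters, user space would have to remember the previous reading and emit the difference, whereas with reset-on-read stats (what scx_rusty does) each reading is already the per-interval increment and can be passed straight to `.increment()`:

```rust
// Needed only in the cumulative case: remember the previous reading and
// emit the difference on each collection pass.
struct DeltaTracker {
    prev: u64,
}

impl DeltaTracker {
    fn new() -> Self {
        DeltaTracker { prev: 0 }
    }

    fn delta(&mut self, cumulative: u64) -> u64 {
        let d = cumulative.saturating_sub(self.prev);
        self.prev = cumulative;
        d
    }
}

fn main() {
    // Cumulative readings 10, 25, 25 yield increments 10, 15, 0.
    let mut t = DeltaTracker::new();
    assert_eq!(t.delta(10), 10);
    assert_eq!(t.delta(25), 15);
    assert_eq!(t.delta(25), 0);

    // Reset-on-read: the same activity is observed directly as 10, 15, 0,
    // so no tracker state is needed before calling increment().
    let readings = [10u64, 15, 0];
    assert_eq!(readings.iter().sum::<u64>(), 25);
}
```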

@Byte-Lab merged commit 5038f54 into sched-ext:main Jun 21, 2024
1 check passed
@jfernandez deleted the metrics-rs branch June 21, 2024 20:28