
rusty: Integrate stats with the metrics framework #377

Merged (1 commit into sched-ext:main, Jun 21, 2024)

Conversation

@jfernandez (Contributor):

We need a layer of indirection between the stats collection and their output destinations. Currently, stats are only printed to stdout. Our goal is to integrate with various telemetry systems such as Prometheus, StatsD, and custom metric backends like those used by Meta and Netflix. Importantly, adding a new backend should not require changes to the existing stats code.

This patch introduces the metrics [1] crate, which provides a framework for defining metrics and publishing them to different backends.

The initial implementation includes the dispatched_tasks_count metric, tagged with type. This metric increments every time a task is dispatched, emitting the raw count instead of a percentage. A monotonic counter is the most suitable metric type for this use case, as percentages can be calculated at query time if needed. Existing logged metrics continue to print percentages and remain unchanged.

A new flag, --enable-prometheus, has been added. When enabled, it starts a Prometheus endpoint on port 9000 (default is false). This endpoint allows metrics to be charted in Prometheus or Grafana dashboards.
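For reference, a Prometheus server could scrape that endpoint with a job entry along these lines (the job name and target are placeholders, not part of this patch):

```yaml
scrape_configs:
  - job_name: scx_rusty
    static_configs:
      - targets: ['localhost:9000']
```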

Example of charting the 1s rate of dispatched tasks by type (screenshot omitted).
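A chart like that can be produced with a PromQL query along these lines (the range selector depends on the scrape interval; a later comment in this thread uses 1m):

```promql
sum by (type) (rate(dispatched_tasks_count[1m]))
```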

Future changes will migrate additional stats to this framework and add support for other backends.

[1] https://metrics.rs/

@dschatzberg (Contributor) left a comment:

This looks very similar to what I added to scx_layered. I'm mostly happy with the approach, but there's still a fair bit of boilerplate. One idea I kicked around was making a procedural macro to define stats from a Struct which would generate a lot of this boilerplate.

counter!("dispatched_tasks_count", "type" => "direct_greedy_far").increment(direct_greedy_far);
counter!("dispatched_tasks_count", "type" => "dsq").increment(dsq);
counter!("dispatched_tasks_count", "type" => "greedy_local").increment(greedy_local);
counter!("dispatched_tasks_count", "type" => "greedy_xnuma").increment(greedy_xnuma);
Contributor:

This registers a new counter on each report(). I would think we want to register the counters ahead of time and only set their absolute value here, no?

Contributor Author:

Let me look into that and see if we'd benefit from instantiating them once and reusing them. I was under the impression that the library authors want this pattern and have optimized the implementation to avoid the perceived overhead of registering new counters on each call.

Contributor Author:

@dschatzberg I created a Metrics struct to hold the metrics and made it a member of the Scheduler struct. We now only register the metrics once.
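The register-once pattern can be illustrated without the `metrics` crate itself; in this sketch, `Counter` is a hypothetical stand-in (an atomic `u64`) for the handle that `counter!` returns, and the struct and field names are illustrative, not the PR's actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the counter handle the `counter!` macro returns.
struct Counter(AtomicU64);

impl Counter {
    fn new() -> Self {
        Counter(AtomicU64::new(0))
    }
    fn increment(&self, v: u64) {
        self.0.fetch_add(v, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

// Hypothetical Metrics struct: every counter is registered once, up front,
// instead of being looked up on every report() pass.
struct Metrics {
    dispatched_tasks_dsq: Counter,
    dispatched_tasks_greedy_local: Counter,
}

impl Metrics {
    fn new() -> Self {
        Metrics {
            dispatched_tasks_dsq: Counter::new(),
            dispatched_tasks_greedy_local: Counter::new(),
        }
    }
}

fn main() {
    let metrics = Metrics::new();
    // In report(), only the pre-registered handles are touched.
    metrics.dispatched_tasks_dsq.increment(3);
    metrics.dispatched_tasks_greedy_local.increment(1);
    assert_eq!(metrics.dispatched_tasks_dsq.get(), 3);
}
```

Making `Metrics` a member of the `Scheduler` struct, as described above, keeps registration out of the hot reporting path.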


let kick_greedy = stat(bpf_intf::stat_idx_RUSTY_STAT_KICK_GREEDY);
let repatriate = stat(bpf_intf::stat_idx_RUSTY_STAT_REPATRIATE);
Contributor:

You're not exposing these to the metrics backend?

Contributor Author:

I can add them in this PR as well

Contributor Author:

Added

stat_pct(bpf_intf::stat_idx_RUSTY_STAT_DL_CLAMP),
stat_pct(bpf_intf::stat_idx_RUSTY_STAT_DL_PRESET),
stat_pct(dl_clamped),
stat_pct(dl_preset),
);

info!("slice_length={}us", self.tuner.slice_ns / 1000);
Contributor:

I wonder if it's possible to make all this logging of stats into a metrics backend itself. That would reduce some of the boilerplate needed to add more stats.
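One way to picture that idea (with a hypothetical trait and names, not the `metrics` crate's actual recorder API): the stdout logging becomes just another backend that receives a stats snapshot, and percentage formatting moves behind it:

```rust
use std::collections::BTreeMap;

// Hypothetical backend trait; a real version would plug into the `metrics`
// crate's recorder/exporter interfaces instead.
trait StatsBackend {
    fn report(&mut self, stats: &BTreeMap<&'static str, u64>);
}

// Percentage rendering stays with the log backend, computed at output time.
fn format_line(name: &str, value: u64, total: u64) -> String {
    let pct = if total > 0 {
        100.0 * value as f64 / total as f64
    } else {
        0.0
    };
    format!("{name}={value} ({pct:.1}%)")
}

// Terminal logging as just another backend that receives a snapshot.
struct LogBackend;

impl StatsBackend for LogBackend {
    fn report(&mut self, stats: &BTreeMap<&'static str, u64>) {
        let total: u64 = stats.values().sum();
        for (name, value) in stats {
            println!("{}", format_line(name, *value, total));
        }
    }
}

fn main() {
    let mut stats = BTreeMap::new();
    stats.insert("dsq", 75u64);
    stats.insert("greedy_local", 25u64);
    let mut backend = LogBackend;
    backend.report(&stats);
}
```

With this shape, adding a new stat only touches the snapshot, not every backend.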

Contributor Author:

Yes, absolutely. My plan for the next PR was to make a terminal log exporter and have that own this logic.

@jfernandez (Contributor Author):

> This looks very similar to what I added to scx_layered. I'm mostly happy with the approach, but there's still a fair bit of boilerplate. One idea I kicked around was making a procedural macro to define stats from a Struct which would generate a lot of this boilerplate.

@dschatzberg can you give an idea of what that macro would look like? In general, I prefer not to add additional abstractions to metric clients and to be as close to the API as possible. But I'd be open to adding a macro if you give me some ideas of what it would look like.

@Byte-Lab (Contributor):

This looks excellent, thanks again for working on it. Once we've addressed the few things that @dschatzberg pointed out, I think we can land this and iterate in tree. Something I do think we'll want to consider addressing longer-term (which we already discussed offline, but just recording here for others' benefit): I do think that we could benefit from having a metrics crate that abstracts some of this stuff and avoids boilerplate, as I expect that a lot of the schedulers will be exporting stats in nearly the same way. In general it seems like schedulers do the following:

- Define various stats as a big enum -> record those stats in a per-cpu array map in BPF -> read those stats in user space and record them elsewhere (i.e. to stdout right now). It'd be pretty slick if we could integrate with the build system and have some of this boilerplate be auto-generated.

If we could somehow make that declarative (sort of similar to https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest_stats_table.h and https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.c#L210-L219 in scx_nest), I think it'd let us get rid of a good amount of code, and would make it a lot easier both to add stats and to understand them.

In any case, this looks good for now! Once Dan's changes are addressed, I'll stamp and merge.

@htejun (Contributor) commented Jun 20, 2024:

This looks better than OM and no objection from me. Some things to consider for the future:

  • Compact and declarative stat definition would be lovely.
  • Down the line, if all rust scheds follow the same recipe, it'd be great.
  • It'd be nicer if we could just pack stats into a struct and hand it to a stat backend, which can then present it however it wants (one of those backends could be a stdout formatter that the scheduler implementation provides). Structs like that are much easier to work with, e.g. for calculating deltas and fractions, serializing into something else, and so on.

@jfernandez (Contributor Author) commented Jun 21, 2024:

@Byte-Lab @htejun agreed and aligned on the long-term vision. My plan is to use scx_rusty to learn the common observability patterns and iterate to generalize this solution for all schedulers. I'll start working soon on removing boilerplate.

@dschatzberg I applied your feedback and I also created a Struct to hold the metrics. If you are not opposed, I'd like to first create a terminal logging backend so that we can clean up the report fn, and then circle back to removing boilerplate with macros as I mentioned above.

@dschatzberg (Contributor):

@jfernandez Yeah, that's totally fine by me. The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.

BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

@jfernandez (Contributor Author) commented Jun 21, 2024:

> The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.

Ah yeah, I'm familiar with this pattern and I like it as well. I will explore this when I focus on removing boilerplate.

> BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

@dschatzberg no, .increment() is still the correct function to call. Internally, counters should be a monotonically incrementing value; a gauge! is what we'd use to instrument an absolute value. In V1 of this PR, calling counter! was correctly registering the metric only once, and subsequent calls would return the cached metric. But initializing once and storing the handles in the struct is a bit more efficient and sets us up for removing boilerplate down the line.

How these monotonic counters are exported is up to the target backend. For Prometheus, since it's a poll-based system, the counter increases the value in memory until there is a scrape event, then it gets reset back to 0. Then the Prometheus backend simply adds to the value and you need to chart it with rate(dispatched_tasks_count[1m]) to see the rate per second.

For something like Netflix's spectatord, state is handled by the daemon, and we'd emit the metric on every increment! call without needing to store any state.

@jfernandez (Contributor Author):

I pushed one more change: I forgot to append `_count` to the end of the metric names after my refactor. That is a convention required by Prometheus that I think we should follow to be more explicit about metric types.

@jfernandez (Contributor Author):

One more change: I got the convention wrong. It's `_total`, not `_count`. Updated.

@Byte-Lab (Contributor) commented Jun 21, 2024:

> The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.
>
> Ah yeah, I'm familiar with this pattern and I like it as well. I will explore this when I focus on removing boilerplate.
>
> BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?
>
> @dschatzberg no, .increment() is still the correct function to call. Internally, counters should be a monotonic incrementing value. A gauge! is what we'd use if we want to instrument an absolute value. In V1 of this PR, calling counter! was correctly only registering the metric once and subsequent calls would return the cached metric. But initializing once and storing in the struct is a bit more efficient and sets us up for removing boilerplate down the line.

@jfernandez I'm a bit confused here. These counter!s are monotonic, but they're incremented and never reset from the BPF side. So if we incremented the values then we'd be incrementing using absolute values. Can we not use the absolute() function? In my mind a gauge is meant for something that's non-monotonic.

Edit: I see why we'd want to increment here instead of setting absolute if it's scraped and reset by the target backend, but if we do that I think we'd need to track what the actual increment was relative to the last time we collected the stats.

> How these monotonic counters are exported is up to the target backend. For Prometheus, since it's a poll-based system, the counter increases the value in memory until there is a scrape event, then it gets reset back to 0. Then the Prometheus backend simply adds to the value and you need to chart it with rate(dispatched_tasks_count[1m]) to see the rate per second.

Hmm, I'm still not quite following how we won't need state to track this (meaning, we record what the value was last time, and then increment based on that). Even if we emit the metric on every call to increment!(), we're still emitting an absolute value.

> For something like Netflix's spectatord, state is handled by the daemon, and we'd emit the metric on every increment! call without needing to store any state.

@jfernandez (Contributor Author):

@Byte-Lab ah! I assumed that the bpf stats were being reset for each loop. If they are always incrementing, then I need to rework this. Let me get back to you on this.

@Byte-Lab (Contributor):

> @Byte-Lab ah! I assumed that the bpf stats were being reset for each loop. If they are always incrementing, then I need to rework this. Let me get back to you on this.

Sorry @jfernandez, as we discussed on Slack you were totally correct. The values are reset on each read here: https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_rusty/src/main.rs#L433-L435. What you have now looks good.
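The two semantics debated above differ like this (illustrative sketch, not code from the PR): with cumulative kernel-side counters, user space would have to remember the previous reading and emit the difference, whereas with reset-on-read stats (what scx_rusty does) each reading is already the per-interval increment and can be passed straight to `.increment()`:

```rust
// Needed only in the cumulative case: remember the previous reading and
// emit the difference on each collection pass.
struct DeltaTracker {
    prev: u64,
}

impl DeltaTracker {
    fn new() -> Self {
        DeltaTracker { prev: 0 }
    }

    fn delta(&mut self, cumulative: u64) -> u64 {
        let d = cumulative.saturating_sub(self.prev);
        self.prev = cumulative;
        d
    }
}

fn main() {
    // Cumulative readings 10, 25, 25 yield increments 10, 15, 0.
    let mut t = DeltaTracker::new();
    assert_eq!(t.delta(10), 10);
    assert_eq!(t.delta(25), 15);
    assert_eq!(t.delta(25), 0);

    // Reset-on-read: the same activity is observed directly as 10, 15, 0,
    // so no tracker state is needed before calling increment().
    let readings = [10u64, 15, 0];
    assert_eq!(readings.iter().sum::<u64>(), 25);
}
```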

@Byte-Lab merged commit 5038f54 into sched-ext:main Jun 21, 2024
1 check passed
@jfernandez deleted the metrics-rs branch June 21, 2024 20:28