Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scx_stats, scx_layered: Implement independent stats client sessions #514

Merged
merged 12 commits into from
Aug 19, 2024

Conversation

htejun
Copy link
Contributor

@htejun htejun commented Aug 19, 2024

Currently, scx_layered generates stats at a fixed interval and every stats client reads out the content of the latest update. This means that a client whose poll interval is longer than the scheduler's would see sampled stats (e.g. the average and nr_cpus ranges wouldn't represent the whole period) and a client with shorter poll interval may see the same outputs repeated.

This PR implements ScxStatsServer::add_stats_ops() which enables the user to specify open() and close() closures. The open() closure creates a read() closure per client session which can carry mutable states. In addition, it provides multiplexed communication channels between the clients and the server user which can be useful when implementing stateful stats generation.

scx_layered is updated to make use of the new feature. After this PR, every client always gets accurate stats, including the time averages and period nr_layer_cpus min/max tracking, independent of each other.

This allows stats reader to maintain persistent states per connection.
…mmunication channel

This will make implementing connection sessions easier where each stats
client connection maintains a set of states.
Instead of keeping one copy of sched_stats, each stats server session
carries their own so that stats can be generated independently by each
client at any interval. CPU allocation min/max tracking is broken for now.
open_ops tracks which ops have been opened by the client session; however,
it was being created on each handle_request() making every request to open
each time. Fix it by moving it to the caller.
So that new stats session doesn't start with an inflated utilization number.
With this, every client sees the correct nr_layer_cpus_ranges without
interfering with each other.
let (req_pair, res_pair) = ChannelPair::<Req, Res>::bidi();
match add_req.send(res_pair) {
Ok(()) => debug!("sent new channel to proxy"),
Err(e) => warn!("ScxStatsServer::proxy() failed ({})", &e),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this is kind of meta, but I wonder if it is worth adding a set of stats on failures on the stats server itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, some statistics can be useful. I'll put that on the todo list.

- If --monitor is specified with layer specs, the scheduler also starts
  stats monitoring on a thread.

- Standalone monitoring mode no longer exits when the scheduler isn't there.
};
retry_cnt = 0;

while !shutdown.load(Ordering::Relaxed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this inner one needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Otherwise, the inner loop will never exit on ctrl-c.

info!("Stats server not avaliable, retrying...");
}
retry_cnt += 1;
sleep(Duration::from_secs(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe give up after a max number of retries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. After how many? I think the mental model is like following a log file. Wouldn't it make sense to just let it sit there until told to exit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see the flow of this now... Yeah, this makes sense!

Copy link
Contributor

@hodgesds hodgesds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems legit. I think the unbounded channel is a little scary from a resource leaking perspective, but imo is easier to handle/debug something that could block.

@htejun
Copy link
Contributor Author

htejun commented Aug 19, 2024

seems legit. I think the unbounded channel is a little scary from a resource leaking perspective, but imo is easier to handle/debug something that could block.

We should be able to switch to bounded channels with 0 size without noticeable behavior changes. There's only supposed to be one message in flight in each connection pair at any given time and all the operations are fully synchronous (one send followed by one receive or vice-versa). Yeah, using bounded is probably better here. Will do that later.

@htejun htejun merged commit 3498a2b into main Aug 19, 2024
1 of 2 checks passed
@htejun htejun deleted the htejun/scx_stats branch August 19, 2024 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants