scx_utils: Add GPU topology #575
Conversation
rust/scx_utils/src/topology.rs (outdated)

@@ -217,10 +220,48 @@ impl Cache {
    }
}

#[derive(Debug, Clone)]
pub struct GPU {
Can we do Gpu? We use Cpu in other places.
rust/scx_utils/src/topology.rs (outdated)

max_graphics_clock: usize,
// AMD uses CU for this value
max_sm_clock: usize,
memory: u64,
Can we just pub these fields instead of adding accessors? That way, it's a lot easier to e.g. unpack the fields in match arms.
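For illustration, a minimal sketch of what public fields enable; the field set here is assumed from the diff above, not the final API:

    #[derive(Debug, Clone)]
    pub struct Gpu {
        pub max_graphics_clock: usize, // fields assumed from the diff above
        pub memory: u64,
    }

    fn describe(gpu: &Gpu) -> String {
        // With pub fields, callers can destructure directly in match arms
        // instead of calling one accessor per field.
        match gpu {
            Gpu { max_graphics_clock, memory } if *memory >= (16u64 << 30) => {
                format!("large GPU: clock={} mem={}", max_graphics_clock, memory)
            }
            Gpu { max_graphics_clock, memory } => {
                format!("GPU: clock={} mem={}", max_graphics_clock, memory)
            }
        }
    }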
for node in &self.nodes {
    gpus.extend(node.gpus.values().clone());
}
gpus
It can be a bit misleading to give out a vector which isn't indexed by IDs for entities w/ IDs. Maybe provide an iter or return a BTreeMap instead? The IDs are supposed to be unique system-wide, right?
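A rough sketch of the BTreeMap variant, with simplified stand-ins for the topology types (names and shapes here are assumed, not from the PR):

    use std::collections::BTreeMap;

    #[derive(Debug, Clone)]
    pub struct Gpu; // placeholder for the real struct

    pub struct Node {
        pub gpus: BTreeMap<usize, Gpu>, // per-node GPUs keyed by id
    }

    pub struct Topology {
        pub nodes: Vec<Node>,
    }

    impl Topology {
        // Returning a map keyed by id preserves the id -> GPU association,
        // which a flat Vec loses. Assumes ids are unique system-wide.
        pub fn gpus(&self) -> BTreeMap<usize, &Gpu> {
            let mut gpus = BTreeMap::new();
            for node in &self.nodes {
                for (id, gpu) in &node.gpus {
                    gpus.insert(*id, gpu);
                }
            }
            gpus
        }
    }

An iter over (id, &Gpu) pairs would avoid the allocation entirely.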
> The IDs are supposed to be unique system-wide, right?

I wasn't 100% sure about that, especially if a system has a mix of NVIDIA/AMD GPUs. For the NVIDIA case it's the same id that is used by device_by_index. Maybe it's better to use the PCIe bus id instead? I think it would still work with the NVL helpers as well. My thought is that any scheduler-specific use cases that need extra data should still be able to look up the device by the id.
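For reference, a minimal sketch of reading the PCIe bus id via nvml-wrapper, assuming a recent crate version where the entry point is Nvml::init and PciInfo exposes a bus_id field:

    use nvml_wrapper::Nvml;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let nvml = Nvml::init()?;
        for i in 0..nvml.device_count()? {
            let device = nvml.device_by_index(i)?;
            // The PCIe bus id (e.g. "0000:3B:00.0") identifies the device
            // independently of the NVML enumeration index.
            println!("GPU {}: PCIe bus id {}", i, device.pci_info()?.bus_id);
        }
        Ok(())
    }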
Hmmm... it does node_gpus.insert(gpu.id(), gpu.clone()), so it does assume that the ID is unique at least within the node. If NVIDIA/AMD ids may overlap, maybe the ID should be an enum, i.e. Vendor(u64)? PCI ID is fine too, but can be a bit unwieldy.
I like the enum idea, will try that out!

> so it does assume that the ID is unique at least within the node

Yeah, wasn't sure about that either. There don't seem to be great AMD libraries, so maybe this is a problem for the future, but we should try to do it right the first time.
This is really cool, I'll do some tests with this if I can find some beefy NVIDIA machines tomorrow. Thanks for working on this!
This is cool! Do AMD GPUs also support something similar to nvml-wrapper?
Add GPU awareness to the topology crate.

Signed-off-by: Daniel Hodges <[email protected]>
Had to fix the NUMA node lookup as it was incorrect, but it looks good now.
It does; I don't know how good the bindings are, though. I found this one and will test it on some hardware when I'm at home.
pub struct Gpu {
    pub index: GpuIndex,
    pub node_id: usize,
    pub max_graphics_clock: usize,
These fields probably need some standardized units appended to them at some point...
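One way to read that suggestion is to bake the unit into the field name; the renamed fields below are hypothetical, not from this PR:

    // GpuIndex shape assumed from the diff above.
    #[derive(Debug, Clone, Copy)]
    pub enum GpuIndex {
        Nvidia(u64),
    }

    pub struct Gpu {
        pub index: GpuIndex,
        pub node_id: usize,
        pub max_graphics_clock_mhz: usize, // hypothetical: graphics clock in MHz
        pub max_sm_clock_mhz: usize,       // hypothetical: SM (NVIDIA) / CU (AMD) clock in MHz
        pub memory_bytes: u64,             // hypothetical: total memory in bytes
    }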
Add GPU awareness to the topology crate.

Tested on a non-GPU machine:

GPU machine: