scx_utils: Add GPU topology #575

hodgesds · 2024-08-27T21:44:53Z

Add GPU awareness to the topology crate.

Tested on a non-GPU machine:

$ sudo ./bin_local/bin/scx_layered  f:user.json 
21:43:37 [INFO] CPUs: online/possible=80/80 nr_cores=40
GPUS: []
21:43:37 [INFO] configuring node 0, LLCs 1
21:43:37 [INFO] configuring llc 0 for node 0
21:43:37 [INFO] configuring node 1, LLCs 1
21:43:37 [INFO] configuring llc 1 for node 1

GPU machine:

$ ./scx_layered f:layered.json 
21:44:25 [INFO] CPUs: online/possible=224/224 nr_cores=112
GPUS: [GPU { id: 0, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 1, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 2, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 3, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 4, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 5, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 6, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, GPU { id: 7, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }]

htejun · 2024-08-27T22:28:15Z

rust/scx_utils/src/topology.rs

@@ -217,10 +220,48 @@ impl Cache {
    }
 }

+#[derive(Debug, Clone)]
+pub struct GPU {


Can we do Gpu? We use Cpu in other places.

htejun · 2024-08-27T22:29:46Z

rust/scx_utils/src/topology.rs

+    max_graphics_clock: usize,
+    // AMD uses CU for this value
+    max_sm_clock: usize,
+    memory: u64,


Can we just pub these fields instead of adding accessors? That way, it's a lot easier to e.g. unpack the fields in match arms.
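
A minimal sketch of what this buys, assuming the struct is renamed to `Gpu` as requested above and that `id` is a `usize` (the diff excerpt does not show its type). The `describe` helper is hypothetical and only illustrates unpacking public fields in match arms:

```rust
// Sketch only: field names follow the diff and log output in this PR.
// The `id` type and the `describe` helper are assumptions for illustration.
#[derive(Debug, Clone)]
pub struct Gpu {
    pub id: usize,
    pub node_id: usize,
    pub max_graphics_clock: usize,
    pub max_sm_clock: usize,
    pub memory: u64,
}

/// Describe a GPU by destructuring its public fields directly in match arms.
pub fn describe(gpu: &Gpu) -> String {
    match gpu {
        // Public fields can be matched on and bound without accessor calls.
        Gpu { node_id: 0, memory, .. } => format!("node 0, {} bytes", memory),
        Gpu { id, node_id, .. } => format!("GPU {} on node {}", id, node_id),
    }
}

fn main() {
    let gpu = Gpu {
        id: 0,
        node_id: 0,
        max_graphics_clock: 1980,
        max_sm_clock: 1980,
        memory: 102625181696,
    };
    println!("{}", describe(&gpu));
}
```

With private fields and accessors, the same logic would need a chain of `if` expressions or match guards calling each getter.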

htejun · 2024-08-27T22:36:52Z

rust/scx_utils/src/topology.rs

+        for node in &self.nodes {
+            gpus.extend(node.gpus.values().clone());
+        }
+        gpus


It can be a bit misleading to give out a vector that isn't indexed by ID for entities that have IDs. Maybe provide an iterator or return a BTreeMap instead? The IDs are supposed to be unique system-wide, right?

> The IDs are supposed to be unique system-wide, right?

I wasn't 100% sure on that, especially if a system has a mix of NVIDIA/AMD GPUs. For the NVIDIA case it's the same id that is used by device_by_index. Maybe it's better to use the PCIe bus id instead? I think it would still work with the NVML helpers as well. My thought is that any scheduler-specific use cases that need extra data should still be able to look up the device by the id.

Hmmm.... it does node_gpus.insert(gpu.id(), gpu.clone());, so it does assume the ID is unique at least in the node. If nvidia/amd may overlap, maybe the ID should be an enum, i.e. Vendor(u64)? PCI ID is fine too but can be a bit unwieldy.

I like the enum idea, will try that out!

> so it does assume the ID is unique at least in the node

Yeah, I wasn't sure about that either. There don't seem to be great AMD libraries, so maybe this is a problem for the future, but we should try to do it right the first time.
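
The two suggestions in this thread (a vendor-tagged enum ID and a map keyed by that ID instead of a plain vector) can be sketched together as below. The names `GpuIndex`, `Nvidia { nvml_id }`, `index`, and `node_id` appear in the log output posted later in this PR; the `Amd` variant, the `u32` ID types, and the `all_gpus` helper are assumptions for illustration, not the merged code:

```rust
use std::collections::BTreeMap;

// Vendor-tagged GPU ID, so IDs from different vendor libraries cannot collide.
// The `Amd` variant and the `u32` types are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum GpuIndex {
    Nvidia { nvml_id: u32 },
    Amd { hypothetical_id: u32 },
}

#[derive(Debug, Clone)]
pub struct Gpu {
    pub index: GpuIndex,
    pub node_id: usize,
}

/// Merge per-node GPU maps into one system-wide map keyed by `GpuIndex`.
pub fn all_gpus(nodes: &[BTreeMap<GpuIndex, Gpu>]) -> BTreeMap<GpuIndex, Gpu> {
    let mut gpus = BTreeMap::new();
    for node_gpus in nodes {
        // Clone each entry so the caller owns the merged map.
        gpus.extend(node_gpus.iter().map(|(idx, gpu)| (*idx, gpu.clone())));
    }
    gpus
}

fn main() {
    let nv = GpuIndex::Nvidia { nvml_id: 0 };
    let amd = GpuIndex::Amd { hypothetical_id: 0 };

    let mut node0 = BTreeMap::new();
    node0.insert(nv, Gpu { index: nv, node_id: 0 });
    let mut node1 = BTreeMap::new();
    node1.insert(amd, Gpu { index: amd, node_id: 1 });

    // Both devices have the raw ID 0, yet they remain distinct keys.
    for (idx, gpu) in &all_gpus(&[node0, node1]) {
        println!("{:?} -> node {}", idx, gpu.node_id);
    }
}
```

Because the key carries the vendor, a caller can still look a device up through the matching vendor library using the inner ID.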

arighi

This is really cool, I'll do some tests with this if I can find some beefy NVIDIA machines tomorrow. Thanks for working on this!

multics69 · 2024-08-28T05:07:35Z

This is cool! Does AMD GPU also support something similar with nvml-wrapper?

hodgesds · 2024-08-28T13:36:56Z

Had to fix the NUMA node lookup as it was incorrect, but it looks good now:

13:30:46 [INFO] CPUs: online/possible=224/224 nr_cores=112
GPUS: {Nvidia { nvml_id: 0 }: Gpu { index: Nvidia { nvml_id: 0 }, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 1 }: Gpu { index: Nvidia { nvml_id: 1 }, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 2 }: Gpu { index: Nvidia { nvml_id: 2 }, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 3 }: Gpu { index: Nvidia { nvml_id: 3 }, node_id: 0, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 4 }: Gpu { index: Nvidia { nvml_id: 4 }, node_id: 1, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 5 }: Gpu { index: Nvidia { nvml_id: 5 }, node_id: 1, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 6 }: Gpu { index: Nvidia { nvml_id: 6 }, node_id: 1, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }, Nvidia { nvml_id: 7 }: Gpu { index: Nvidia { nvml_id: 7 }, node_id: 1, max_graphics_clock: 1980, max_sm_clock: 1980, memory: 102625181696 }}

hodgesds · 2024-08-28T13:38:28Z

> This is cool! Does AMD GPU also support something similar with nvml-wrapper?

It does, I don't know how good the bindings are though. I found this one and will test it on some hardware when I'm at home.

hodgesds · 2024-08-28T13:39:06Z

rust/scx_utils/src/topology.rs

+pub struct Gpu {
+    pub index: GpuIndex,
+    pub node_id: usize,
+    pub max_graphics_clock: usize,


These fields probably need some standardized units appended to them at some point...
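
One way this could look, as a sketch only: the values in the log output above are consistent with clocks in MHz and memory in bytes, but that is an assumption here, and the struct name `GpuLimits`, the suffixed field names, and the `memory_gib` helper are hypothetical:

```rust
// Sketch: encode the assumed units (MHz, bytes) in the field names.
#[derive(Debug, Clone)]
pub struct GpuLimits {
    pub max_graphics_clock_mhz: usize,
    pub max_sm_clock_mhz: usize,
    pub memory_bytes: u64,
}

impl GpuLimits {
    /// Memory size converted from bytes to GiB (2^30 bytes).
    pub fn memory_gib(&self) -> f64 {
        self.memory_bytes as f64 / (1u64 << 30) as f64
    }
}

fn main() {
    let limits = GpuLimits {
        max_graphics_clock_mhz: 1980,
        max_sm_clock_mhz: 1980,
        memory_bytes: 102625181696,
    };
    println!(
        "{} MHz, {} MHz, {:.1} GiB",
        limits.max_graphics_clock_mhz,
        limits.max_sm_clock_mhz,
        limits.memory_gib()
    );
}
```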

hodgesds requested review from htejun, multics69 and arighi August 27, 2024 21:45

htejun approved these changes Aug 27, 2024

arighi approved these changes Aug 27, 2024

multics69 approved these changes Aug 28, 2024

hodgesds force-pushed the gpu-topo branch from 29f2d40 to fd7e712 Compare August 28, 2024 13:33

scx_utils: Add GPU topology

12f8cb7

Add GPU awareness to the topology crate. Signed-off-by: Daniel Hodges <[email protected]>

hodgesds force-pushed the gpu-topo branch from fd7e712 to 12f8cb7 Compare August 28, 2024 13:35

hodgesds merged commit 5391816 into sched-ext:main Aug 28, 2024
2 checks passed

hodgesds deleted the gpu-topo branch August 28, 2024 13:59
