
UCT/CUDA: treat stitched VA as managed memory #10459

Open — wants to merge 1 commit into base: master
Conversation

Akshay-Venkatesh
Contributor

What?

Detect VA ranges composed of multiple physical allocations and treat them as managed memory to force pipeline protocols.

Initial tests on nvlink connected GPUs show protocols being selected correctly:

[1738101119.349005] [52879:0]   | ucp_context_0 inter-node cfg#2 | rendezvous data send(fast-completion|multi) from cuda-managed/GPU0 to cuda    |
[1738101119.349007] [52879:0]   +--------------------------------+---------------------------------------------------------------+---------------+
[1738101119.349010] [52879:0]   |                        0..8176 | fragmented copy-in copy-out                                   | tcp/enP2s2f1  |
[1738101119.349011] [52879:0]   |                       8177..4M | cuda_copy, flushed write to remote, frag cuda                 | cuda_ipc/cuda |
[1738101119.349012] [52879:0]   |                   4194305..inf | pipeline cuda_copy, write to remote, frag cuda                | cuda_ipc/cuda |
[1738101119.349013] [52879:0]   +--------------------------------+---------------------------------------------------------------+---------------+

@Akshay-Venkatesh
Contributor Author

@SeyedMir Can you please test these on x86 platforms and see if expected protocols are being chosen?

Comment on lines +695 to +696
base_address = (CUdeviceptr)address;
alloc_length = length;
Contributor


Maybe goto out_default_range instead?
Also, some refactoring could be done, e.g. to have a trace for all cases, including the new one.

base_address = (CUdeviceptr)address;
alloc_length = length;
}

Contributor


Is cuMemGetAddressRange support for VMM allocations documented somewhere? The driver API docs mention legacy allocators only.
> Returns the base address in *pbase and size in *psize of the allocation by cuMemAlloc() or cuMemAllocPitch().

Contributor Author


Indeed, it's not well documented. The behavior we've seen is that it returns the base pointer and size corresponding to the base address and length of the address range mapped by the corresponding cuMemMap call.

@@ -689,6 +689,13 @@ uct_cuda_copy_md_query_attributes(uct_cuda_copy_md_t *md, const void *address,
return UCS_ERR_INVALID_ADDR;
}

if ((uintptr_t)base_address + alloc_length < (uintptr_t)address + length) {
Contributor


Suggested change
if ((uintptr_t)base_address + alloc_length < (uintptr_t)address + length) {
if (UCS_PTR_BYTE_OFFSET(base_address, alloc_length) <
UCS_PTR_BYTE_OFFSET(address, length)) {

@yosefe
Contributor

yosefe commented Jan 30, 2025

Some test failures seem relevant
