
UCT/CUDA: treat stitched VA as managed memory #10459

Open — wants to merge 1 commit into base: master
Conversation

Akshay-Venkatesh
Contributor

What?

Detect VA ranges composed of multiple physical allocations and treat them as managed memory to force pipeline protocols.

Initial tests on nvlink connected GPUs show protocols being selected correctly:

[1738101119.349005] [52879:0]   | ucp_context_0 inter-node cfg#2 | rendezvous data send(fast-completion|multi) from cuda-managed/GPU0 to cuda    |
[1738101119.349007] [52879:0]   +--------------------------------+---------------------------------------------------------------+---------------+
[1738101119.349010] [52879:0]   |                        0..8176 | fragmented copy-in copy-out                                   | tcp/enP2s2f1  |
[1738101119.349011] [52879:0]   |                       8177..4M | cuda_copy, flushed write to remote, frag cuda                 | cuda_ipc/cuda |
[1738101119.349012] [52879:0]   |                   4194305..inf | pipeline cuda_copy, write to remote, frag cuda                | cuda_ipc/cuda |
[1738101119.349013] [52879:0]   +--------------------------------+---------------------------------------------------------------+---------------+

@Akshay-Venkatesh
Contributor Author

@SeyedMir Can you please test these on x86 platforms and see if expected protocols are being chosen?

Comment on lines +695 to +696
base_address = (CUdeviceptr)address;
alloc_length = length;
Contributor


Maybe goto out_default_range instead?
Also, some refactoring could be done, e.g. to have a trace for all cases, including the new one.

base_address = (CUdeviceptr)address;
alloc_length = length;
}

Contributor


Is cuMemGetAddressRange support for VMM allocations documented somewhere? The driver API docs mention legacy allocators only.
> Returns the base address in *pbase and size in *psize of the allocation by cuMemAlloc() or cuMemAllocPitch().

Contributor Author


Indeed, it's not well documented. The behavior we've seen is that it returns the base pointer and size corresponding to the base address and length of the address range mapped by the corresponding cuMemMap call.

@@ -689,6 +689,13 @@ uct_cuda_copy_md_query_attributes(uct_cuda_copy_md_t *md, const void *address,
return UCS_ERR_INVALID_ADDR;
}

if ((uintptr_t)base_address + alloc_length < (uintptr_t)address + length) {
Contributor


Suggested change
if ((uintptr_t)base_address + alloc_length < (uintptr_t)address + length) {
if (UCS_PTR_BYTE_OFFSET(base_address, alloc_length) <
UCS_PTR_BYTE_OFFSET(address, length)) {

@yosefe
Contributor

yosefe commented Jan 30, 2025

Some test failures seem relevant
