
UCT/CUDA_COPY: add multi-device support in cuda_copy #9645

Open · wants to merge 3 commits into base: master

Conversation

@Akshay-Venkatesh (Contributor) commented on Jan 30, 2024:

What/Why?

Allow a single UCP context to handle multiple CUDA devices for the cuda_copy transport. This enables use cases under Legion/Realm, OpenACC, and MPI workloads that prefer a 1:N process-to-GPU mapping over the current default 1:1 mapping.

How?

CUDA stream and event resources, which were previously tied to the iface, are now tied to each newly detected CUDA device context. When resources are needed, the context ID is looked up in a hashtable and the appropriate resources are selected.

TODO

  1. Need a way to detect whether a CUDA context has been destroyed before destroying the stream/event resources associated with it (we are not going to clean up these resources and will leave that to the OS).
  2. Need to check whether a stream bitmap is needed for flush operations, flushing each stream individually with stream synchronize (removed).

ucs_status_t status;

pthread_mutex_lock(&lock);
Contributor:

Could you please explain why a lock is necessary here?

Contributor Author:

@rakhmets As we're modifying context resources and multiple threads may want the same stream, we don't want the same stream variable to be initialized more than once.

Contributor:

But UCT is not thread-safe, and in UCP we have a global lock per operation. Is there a specific use case requiring this lock?

Contributor:

I believe the lock can be removed.

ucs_memory_type_t src_type, ucs_memory_type_t dst_type)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    CUstream *stream = NULL;
Contributor:

Is the NULL assignment necessary? Line 70 will always overwrite it.

Contributor Author:

Not needed. Will leave it uninitialized.

} else {
    status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
    if (UCS_OK != status) {
        ucs_error("unable to get resources associated with cuda context");
        return UCS_ERR_IO_ERROR;
    }
}
Contributor:

This block of code (lines 128-137) is repeated in put_short and get_short as well. Maybe put it in a function or macro?

UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
    return UCS_ERR_IO_ERROR;
Contributor:

Should we return an error or attempt to set the context, since we have the buffers? Though it may not be in the scope of this PR.

Contributor Author:

The scope of this PR assumes the thread has set the right context. Will address this in a follow-up PR.

static UCS_CLASS_INIT_FUNC(uct_cuda_copy_iface_t, uct_md_h md, uct_worker_h worker,
                           const uct_iface_params_t *params,
                           const uct_iface_config_t *tl_config)
void uct_cuda_copy_cleanup_per_ctx_rscs(uct_cuda_copy_per_ctx_rsc_t *ctx_rsc)
Contributor:

Seems like this can be a static function.

UCS_BITMAP_CLEAR(&self->streams_to_sync);
}

ucs_status_t uct_cuda_copy_init_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
Contributor:

Seems like this can be a static function. Also, I think this function does not need the iface; the max_cuda_events value can be passed directly as an argument instead. If you decide to keep passing the iface, then let's add const.


return UCS_OK;
}

static UCS_CLASS_CLEANUP_FUNC(uct_cuda_copy_iface_t)
ucs_status_t uct_cuda_copy_get_ctx_rscs(uct_cuda_copy_iface_t *iface,
Contributor:

Should this have _per_ in the name to be consistent with the init_per_ctx_rscs and cleanup_per_ctx_rscs functions?

UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(ctx_rsc->short_stream));
}

ucs_mpool_cleanup(&ctx_rsc->cuda_event_desc, 1);
Contributor:

Shouldn't this mpool cleanup be called even if ctx_rsc->cuda_ctx is not valid anymore?

Contributor Author:

If the context is not present, then it could have been destroyed by the user before ucp_context_destroy or MPI_Finalize.

 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
Contributor:

Shall we move cuCtxPopCurrent(&cuda_popped_ctx); to here? We want to pop the pushed context regardless of the success or failure of cuMemGetAddressRange.

Contributor Author:

Thanks for the suggestion. @brminich made the same suggestion too.

Comment on lines 277 to 289:

/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
    return UCS_ERR_IO_ERROR;
} else {
    status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
    if (UCS_OK != status) {
        ucs_error("unable to get resources associated with cuda context");
        return UCS_ERR_IO_ERROR;
    }
}

Contributor:

It could be a common inline function to get the current ctx.

Contributor Author:

Will change to inline function.


/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
Contributor:

Suggested change:
-    if (current_ctx == NULL) {
+    if (ucs_unlikely(current_ctx == NULL)) {

ucs_memory_type_t src, dst;
ucs_mpool_params_t mp_params;
unsigned long long ctx_id;
Contributor:

Suggested change:
-    unsigned long long ctx_id;
+    unsigned long long ctx_id;

Comment on lines 377 to 391:

/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
if (cu_err != CUDA_SUCCESS) {
    cuCtxPopCurrent(&cuda_popped_ctx);
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}

cuCtxPopCurrent(&cuda_popped_ctx);

Contributor:

Suggested change:

/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);
cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
cuCtxPopCurrent(&cuda_popped_ctx);
if (cu_err != CUDA_SUCCESS) {
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}
@Akshay-Venkatesh (Contributor Author) left a comment:

Thanks for the feedback. I'll make these changes today.


@Akshay-Venkatesh marked this pull request as ready for review February 22, 2024 20:52
@Akshay-Venkatesh (Contributor Author):

@brminich I see one of the commits had an extra colon, and 2 commit-style tests are failing because of that. Would it be OK to rebase? I can wait to do this until all the reviewers have had a chance to look at my comments and code changes.

cc @rakhmets @SeyedMir

@SeyedMir (Contributor):

@Akshay-Venkatesh Rebase is fine with me.

@brminich (Contributor):

@Akshay-Venkatesh, no problem from my side

@rakhmets (Contributor) left a comment:

Rebase is OK for me too.

@Akshay-Venkatesh force-pushed the topic/cuda-copy-multi-dev branch from bb7c190 to fb1d3be on February 26, 2024 19:02
@Akshay-Venkatesh (Contributor Author) commented Feb 28, 2024:

@brminich @rakhmets @SeyedMir

FYI, in dd8b66d I had to remove all code that does EventDestroy or StreamDestroy, as CUDA doesn't have a way to query whether a given CUcontext has been destroyed, and calling Stream/EventDestroy on streams/events whose context has been destroyed is potentially unsafe. For this reason we will have to leave cleanup to process teardown. This should be safe from UCX's viewpoint, as all UCT resources are tied to some UCP context and there is no concern of reusing streams/events that haven't been cleaned up (they are not global).

Also, it looks like cuCtxGetId is only supported for CUDA >= 12.0. Without a context ID, we don't have a way to identify which context we're trying to use and pick the associated stream/event resources for transport operations. We cannot use the CUcontext handle itself instead of the context ID, because we cannot assume that a handle returned by, say, cuCtxGetCurrent will always be the same handle rather than a different handle with the same properties. So it seems that multi-device support will need CUDA >= 12.0. We should discuss this further.


if (status != UCS_OK) {
    return status;
} else if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
Contributor:

I think this log message can be confusing since the function can be called by different callers. It is better to print the following message: there is no cuda context bound to the calling cpu thread.

Comment on lines 106 to 108:

static inline
ucs_status_t uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                                       uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

Contributor:

Suggested change:

static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                          uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

Comment on lines 124 to 126:

static inline
ucs_status_t uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

Contributor:

uct_cuda_copy_get_short_stream should have a CUstream *stream parameter, and a uct_cuda_copy_per_ctx_rsc_t *ctx_rsc local variable.

Suggested change:

static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface, CUstream *stream)

    ucs_free(ctx_rsc);
err_kh_del:
    kh_del(cuda_copy_ctx_rscs, &iface->ctx_rscs, iter);
out:

Contributor:

Suggested change:
-out:
+err:

if (cu_err != CUDA_SUCCESS) {
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}

Contributor:

Please remove the extra empty line.

Comment on lines 490 to 492:

ucs_status_t uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
                                            CUcontext cuda_ctx,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)

Contributor:

Please remove the declaration from the .h file, and make the function static.

Suggested change:

static ucs_status_t
uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface, CUcontext cuda_ctx,
                               uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)

@@ -48,23 +49,30 @@ typedef struct uct_cuda_copy_queue_desc {
    ucs_queue_elem_t queue;
} uct_cuda_copy_queue_desc_t;

typedef struct uct_cuda_copy_per_ctx_rsc {
    CUcontext          cuda_ctx;
    unsigned long long ctx_id;
Contributor:

Please remove the ctx_id field.

Contributor Author:

@rakhmets But we use this as the key to get the right context from the hashmap. Are you suggesting that we use a different key?

@@ -131,6 +131,7 @@ static ucs_status_t uct_cuda_copy_iface_query(uct_iface_h tl_iface,
return UCS_OK;
}

#if 0
Contributor:

Please remove unused code.

} uct_cuda_copy_per_ctx_rsc_t;


KHASH_MAP_INIT_INT64(cuda_copy_ctx_rscs, struct uct_cuda_copy_per_ctx_rsc*);
Contributor:

You can store uct_cuda_copy_per_ctx_rsc (not the pointer); then you would not need to do alloc/free during put.

Contributor Author:

@brminich Will incorporate this change.

CUcontext cuda_ctx;
unsigned long long ctx_id;
/* pool of cuda events to check completion of memcpy operations */
ucs_mpool_t cuda_event_desc;
Contributor:

Do we really need to have this mpool per context? Maybe one common mpool is enough?

Contributor Author (@Akshay-Venkatesh, Jan 30, 2025):

No. Event and stream resources are associated with a context. We do need an mpool for each context.
