UCT/CUDA_COPY: add multi-device support in cuda_copy #9645
base: master
Conversation
    ucs_status_t status;

    pthread_mutex_lock(&lock);
Could you please explain why a lock is necessary here?
@rakhmets As we're modifying context resources and multiple threads may want the same stream, we don't want the same stream variable to be initialized more than once.
But UCT is not thread-safe, and in UCP we have a global lock per operation. Is there a specific use case requiring this lock?
I believe the lock can be removed.
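For illustration, a minimal sketch of what lock-free lazy initialization could look like under that assumption (UCT entry points serialized by the UCP-level per-operation lock); the function name is a hypothetical stand-in, not the PR's actual code:

```c
/* Hypothetical sketch: lazy stream creation without a mutex, assuming
 * the caller (UCP) serializes UCT calls with its per-operation lock,
 * so a plain NULL check cannot race. */
static ucs_status_t uct_cuda_copy_lazy_stream_init(CUstream *stream_p)
{
    if (*stream_p == NULL) {
        /* first use on this path: create the stream exactly once */
        if (cuStreamCreate(stream_p, CU_STREAM_NON_BLOCKING) != CUDA_SUCCESS) {
            return UCS_ERR_IO_ERROR;
        }
    }
    return UCS_OK;
}
```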
                           ucs_memory_type_t src_type, ucs_memory_type_t dst_type)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    CUstream *stream = NULL;
Is the NULL assignment necessary? Line 70 will always overwrite it.
Not needed. Will leave it uninitialized.
    } else {
        status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
        if (UCS_OK != status) {
            ucs_error("unable to get resources associated with cuda context");
            return UCS_ERR_IO_ERROR;
        }
    }
This block of code (lines 128-137) is repeated in put_short and get_short as well. Maybe put it in a function or macro?
    UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
    if (current_ctx == NULL) {
        ucs_error("attempt to perform cuda memcpy without active context");
        return UCS_ERR_IO_ERROR;
Should we return an error, or attempt to set the context since we have the buffers? Though, it may not be in the scope of this PR.
The scope of this PR assumes the thread has already set the right context. Will address this in a follow-up PR.
static UCS_CLASS_INIT_FUNC(uct_cuda_copy_iface_t, uct_md_h md, uct_worker_h worker,
                           const uct_iface_params_t *params,
                           const uct_iface_config_t *tl_config)
void uct_cuda_copy_cleanup_per_ctx_rscs(uct_cuda_copy_per_ctx_rsc_t *ctx_rsc)
Seems like this can be a static function.
    UCS_BITMAP_CLEAR(&self->streams_to_sync);
}

ucs_status_t uct_cuda_copy_init_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
Seems like this can be a static function. Also, I think this function does not need the iface; the max_cuda_events value can be passed directly as an argument instead. If you decide to keep passing the iface, then let's add const. (Two possible shapes are sketched below.)
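For concreteness, the two signature shapes this comment suggests might look like the following (hypothetical declarations, not the PR's final code):

```c
/* Option 1: drop the iface and pass the needed value directly. */
static ucs_status_t
uct_cuda_copy_init_per_ctx_rscs(int max_cuda_events,
                                uct_cuda_copy_per_ctx_rsc_t *ctx_rsc);

/* Option 2: keep the iface, but make it const. */
static ucs_status_t
uct_cuda_copy_init_per_ctx_rscs(const uct_cuda_copy_iface_t *iface,
                                uct_cuda_copy_per_ctx_rsc_t *ctx_rsc);
```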
    return UCS_OK;
}

static UCS_CLASS_CLEANUP_FUNC(uct_cuda_copy_iface_t)
ucs_status_t uct_cuda_copy_get_ctx_rscs(uct_cuda_copy_iface_t *iface,
Should this have _per_ in the name, to be consistent with the init_per_ctx_rscs and cleanup_per_ctx_rscs functions?
    UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(ctx_rsc->short_stream));
}

ucs_mpool_cleanup(&ctx_rsc->cuda_event_desc, 1);
Shouldn't this mpool cleanup be called even if the ctx_rsc->cuda_ctx is not valid anymore?
If the context is not present, then it could've been destroyed by the user before ucp_context_destroy or MPI_Finalize.
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
Shall we move cuCtxPopCurrent(&cuda_popped_ctx); to here? Because we want to pop the pushed context regardless of the success or failure of cuMemGetAddressRange.
Thanks for the suggestion. @brminich made the same suggestion too.
/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
    return UCS_ERR_IO_ERROR;
} else {
    status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
    if (UCS_OK != status) {
        ucs_error("unable to get resources associated with cuda context");
        return UCS_ERR_IO_ERROR;
    }
}
it could be a common inline function to get current ctx
Will change to inline function.
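A minimal sketch of the common inline helper being discussed here (and in the earlier put_short/get_short thread), assembled from names already used in this PR; the final implementation may differ:

```c
/* Sketch: one inline helper that fetches the calling thread's current
 * CUDA context and its per-context resources, replacing the block that
 * was duplicated in put_short/get_short and the copy paths. */
static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_current_ctx_rscs(uct_cuda_copy_iface_t *iface,
                                   uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)
{
    CUcontext current_ctx;
    ucs_status_t status;

    UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
    if (ucs_unlikely(current_ctx == NULL)) {
        ucs_error("no cuda context bound to the calling cpu thread");
        return UCS_ERR_IO_ERROR;
    }

    status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, ctx_rsc_p);
    if (status != UCS_OK) {
        ucs_error("unable to get resources associated with cuda context");
        return UCS_ERR_IO_ERROR;
    }

    return UCS_OK;
}
```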
/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
Suggested change, from:

    if (current_ctx == NULL) {

to:

    if (ucs_unlikely(current_ctx == NULL)) {
    ucs_memory_type_t src, dst;
    ucs_mpool_params_t mp_params;
    unsigned long long ctx_id;
Suggested change (whitespace-only; the suggested line differs from the original only in alignment):

    unsigned long long ctx_id;
/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
if (cu_err != CUDA_SUCCESS) {
    cuCtxPopCurrent(&cuda_popped_ctx);
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}

cuCtxPopCurrent(&cuda_popped_ctx);
Suggested change, from:

/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);
cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
if (cu_err != CUDA_SUCCESS) {
    cuCtxPopCurrent(&cuda_popped_ctx);
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}
cuCtxPopCurrent(&cuda_popped_ctx);

to:

/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);
cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
cuCtxPopCurrent(&cuda_popped_ctx);
if (cu_err != CUDA_SUCCESS) {
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}
Thanks for the feedback. I'll make these changes today.
@Akshay-Venkatesh Rebase is fine with me.
@Akshay-Venkatesh, no problem from my side.
Rebase is OK for me too.
Force-pushed from bb7c190 to fb1d3be.
FYI, in dd8b66d I had to remove all code that does EventDestroy or StreamDestroy, as CUDA doesn't have a way to query whether a given CUcontext has been destroyed, and calling Stream/EventDestroy on streams/events whose context has been destroyed is potentially unsafe. For this reason we will have to leave the cleanup to the point when the process is cleaned up. This should be safe from UCX's viewpoint, as all UCT resources are tied to some UCP context, and there isn't a concern of reusing streams/events that haven't been cleaned up (as they are not global). Also, it looks like
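Under that constraint, per-context cleanup presumably reduces to releasing host-side tracking only; a hedged sketch (the leak-check flag and the exact shape are assumptions, not the code in dd8b66d):

```c
/* Sketch: cleanup without cuEventDestroy/cuStreamDestroy. The CUDA
 * objects are deliberately left to the driver/OS at process teardown,
 * since their owning CUcontext may already have been destroyed. */
static void
uct_cuda_copy_cleanup_per_ctx_rscs(uct_cuda_copy_per_ctx_rsc_t *ctx_rsc)
{
    /* leak check disabled: outstanding CUDA events are intentionally
     * not reclaimed here */
    ucs_mpool_cleanup(&ctx_rsc->cuda_event_desc, 0);
    ucs_free(ctx_rsc);
}
```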
    if (status != UCS_OK) {
        return status;
    } else if (current_ctx == NULL) {
        ucs_error("attempt to perform cuda memcpy without active context");
I think this log message can be confusing since the function can be called by different callers. It is better to print the following message: "there is no cuda context bound to the calling cpu thread".
static inline
ucs_status_t uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                                       uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)
Suggested change, from:

static inline
ucs_status_t uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                                       uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

to:

static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                          uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)
static inline
ucs_status_t uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)
uct_cuda_copy_get_short_stream should have a CUstream *stream parameter, and a uct_cuda_copy_per_ctx_rsc_t *ctx_rsc local variable.
Suggested change, from:

static inline
ucs_status_t uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

to:

static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface, CUstream *stream)
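Combining the two points above, a sketch of what the revised helper could look like (the body is illustrative; it assumes the uct_cuda_copy_get_ctx_rsc helper from the previous thread and the short_stream field seen elsewhere in this PR):

```c
/* Sketch: return the short stream directly; the per-context resources
 * become a local variable instead of an output parameter. */
static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface, CUstream *stream)
{
    uct_cuda_copy_per_ctx_rsc_t *ctx_rsc;
    ucs_status_t status;

    status = uct_cuda_copy_get_ctx_rsc(iface, &ctx_rsc);
    if (status != UCS_OK) {
        return status;
    }

    *stream = ctx_rsc->short_stream;
    return UCS_OK;
}
```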
    ucs_free(ctx_rsc);
err_kh_del:
    kh_del(cuda_copy_ctx_rscs, &iface->ctx_rscs, iter);
out:
Suggested change: rename the out: label to err:.
if (cu_err != CUDA_SUCCESS) {
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}


Please remove extra empty line.
ucs_status_t uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
                                            CUcontext cuda_ctx,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)
Please remove the declaration from the .h file, and make the function static.
Suggested change, from:

ucs_status_t uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
                                            CUcontext cuda_ctx,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)

to:

static ucs_status_t
uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface, CUcontext cuda_ctx,
                               uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)
@@ -48,23 +49,30 @@ typedef struct uct_cuda_copy_queue_desc {
    ucs_queue_elem_t queue;
} uct_cuda_copy_queue_desc_t;

typedef struct uct_cuda_copy_per_ctx_rsc {
    CUcontext cuda_ctx;
    unsigned long long ctx_id;
Please remove the ctx_id field.
@rakhmets But we use this as the key to get the right context from the hashmap. Are you suggesting that we use a different key?
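One possibility, if the handle itself is stable for the lifetime of the entry, is to key the hash on the CUcontext pointer and drop ctx_id entirely; a sketch of such a lookup (an assumption about the reviewer's intent, not a confirmed direction):

```c
/* Sketch: use the CUcontext handle (an opaque pointer) as the 64-bit
 * khash key, removing the need for a separate ctx_id field. */
khiter_t iter = kh_get(cuda_copy_ctx_rscs, &iface->ctx_rscs,
                       (uint64_t)(uintptr_t)cuda_ctx);
if (iter == kh_end(&iface->ctx_rscs)) {
    /* not found: initialize per-context resources for this context */
}
```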
@@ -131,6 +131,7 @@ static ucs_status_t uct_cuda_copy_iface_query(uct_iface_h tl_iface,
    return UCS_OK;
}

#if 0
Please remove unused code.
} uct_cuda_copy_per_ctx_rsc_t;

KHASH_MAP_INIT_INT64(cuda_copy_ctx_rscs, struct uct_cuda_copy_per_ctx_rsc*);
You can store uct_cuda_copy_per_ctx_rsc by value (not the pointer); then you wouldn't need to do alloc/free during put.
@brminich Will incorporate this change.
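A sketch of the by-value variant (hypothetical code; note that a pointer into the map stays valid only until the next rehash):

```c
/* At file scope: store the per-context struct by value; khash owns the
 * storage, so insertion needs no separate malloc and removal no free. */
KHASH_MAP_INIT_INT64(cuda_copy_ctx_rscs, uct_cuda_copy_per_ctx_rsc_t);

/* Inside the insertion path: */
int ret;
khiter_t iter = kh_put(cuda_copy_ctx_rscs, &iface->ctx_rscs,
                       (uint64_t)(uintptr_t)cuda_ctx, &ret);
/* initialize in place; the pointer is valid until the map rehashes */
uct_cuda_copy_per_ctx_rsc_t *ctx_rsc = &kh_value(&iface->ctx_rscs, iter);
```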
    CUcontext cuda_ctx;
    unsigned long long ctx_id;
    /* pool of cuda events to check completion of memcpy operations */
    ucs_mpool_t cuda_event_desc;
do we really need to have this mpool per context? Maybe one common mpool is enough?
No. Event and stream resources are associated with a context. We do need an mpool for each context.
What/Why?
Allow a single UCP context to handle multiple CUDA devices for the cuda_copy transport. This enables use cases under Legion/Realm, OpenACC, and MPI workloads that prefer a 1:N process-to-GPU mapping over the current default 1:1 mapping.
How?
CUDA stream and event resources, which were previously tied to the iface, are now tied to each newly detected CUDA device context. When resources are needed, the context ID is looked up in a hash table and the appropriate per-context resources are selected (a rough sketch of this flow follows).
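As a rough sketch of that per-operation flow (names follow the snippets in this PR; the surrounding iface and error handling are simplified):

```c
/* Sketch: select per-context resources for one copy operation. */
CUcontext current_ctx;
uct_cuda_copy_per_ctx_rsc_t *ctx_rsc;
ucs_status_t status;

/* the calling thread's bound context identifies the device */
cuCtxGetCurrent(&current_ctx);
status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
if (status == UCS_OK) {
    /* use ctx_rsc's streams and ctx_rsc->cuda_event_desc for the copy */
}
```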
TODO
- Need a way to detect whether a CUDA context has been destroyed before destroying the stream/event resources associated with it (not going to clean up these resources; leaving that to the OS to handle).
- Need to check whether the stream bitmap is needed for flush operations, flushing each stream individually with stream synchronize (removed).