
UCT/CUDA_COPY: add multi-device support in cuda_copy #9645

Open · wants to merge 3 commits into base: master

Conversation

@Akshay-Venkatesh (Contributor) commented on Jan 30, 2024:

What/Why?

Allow a single UCP context to handle multiple CUDA devices for the cuda_copy transport. This enables use cases under Legion/Realm, OpenACC, and MPI workloads that prefer a 1:N process-to-GPU mapping over the current default 1:1 mapping.

How?

CUDA stream and event resources, which were previously tied to the iface, are now tied to each newly detected CUDA device context. When resources are needed, the context ID is looked up in a hashtable and the appropriate resources are selected.

TODO

  1. Need a way to detect whether a CUDA context has been destroyed before destroying the stream/event resources associated with it (we are not going to clean up these resources and will leave that to the OS).
  2. Need to check whether a stream bitmap is needed for flush operations, flushing each stream individually with stream synchronize (removed).

ucs_status_t status;

pthread_mutex_lock(&lock);
Contributor:

Could you please explain why a lock is necessary here?

Contributor Author:

@rakhmets As we're modifying context resources and multiple threads may want the same stream, we don't want the same stream variable to be initialized more than once.

Contributor:

But UCT is not thread-safe, and in UCP we have a global lock per operation. Is there a specific use case requiring this lock?

Contributor:

I believe the lock can be removed.

ucs_memory_type_t src_type, ucs_memory_type_t dst_type)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    CUstream *stream = NULL;
Contributor:

Is the NULL assignment necessary? Line 70 will always overwrite it.

Contributor Author:

Not needed. Will leave it uninitialized.

} else {
    status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
    if (UCS_OK != status) {
        ucs_error("unable to get resources associated with cuda context");
        return UCS_ERR_IO_ERROR;
    }
}
Contributor:

This block of code (lines 128-137) is repeated in put_short and get_short as well. Maybe put it in a function or macro?

UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
    return UCS_ERR_IO_ERROR;
Contributor:

Should we return an error or attempt to set the context, since we have the buffers? Though it may not be in the scope of this PR.

Contributor Author:

The scope of this PR assumes the thread has set the right context. Will address this in a follow-up PR.

static UCS_CLASS_INIT_FUNC(uct_cuda_copy_iface_t, uct_md_h md, uct_worker_h worker,
                           const uct_iface_params_t *params,
                           const uct_iface_config_t *tl_config)
void uct_cuda_copy_cleanup_per_ctx_rscs(uct_cuda_copy_per_ctx_rsc_t *ctx_rsc)
Contributor:

Seems like this can be a static function.

UCS_BITMAP_CLEAR(&self->streams_to_sync);
}

ucs_status_t uct_cuda_copy_init_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
Contributor:

Seems like this can be a static function. Also, I think this function does not need the iface; the max_cuda_events value can be passed directly as an argument instead. If you decide to keep passing the iface, then let's add const.


return UCS_OK;
}

static UCS_CLASS_CLEANUP_FUNC(uct_cuda_copy_iface_t)
ucs_status_t uct_cuda_copy_get_ctx_rscs(uct_cuda_copy_iface_t *iface,
Contributor:

Should this have _per_ in the name to be consistent with the init_per_ctx_rscs and cleanup_per_ctx_rscs functions?

UCT_CUDADRV_FUNC_LOG_ERR(cuStreamDestroy(ctx_rsc->short_stream));
}

ucs_mpool_cleanup(&ctx_rsc->cuda_event_desc, 1);
Contributor:

Shouldn't this mpool cleanup be called even if ctx_rsc->cuda_ctx is not valid anymore?

Contributor Author:

If the context is not present, then it could have been destroyed by the user before ucp_context_destroy or MPI_Finalize.

 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
Contributor:

Shall we move cuCtxPopCurrent(&cuda_popped_ctx); to here? We want to pop the pushed context regardless of the success or failure of cuMemGetAddressRange.

Contributor Author:

Thanks for the suggestion. @brminich made the same suggestion too.

Comment on lines 277 to 289:

/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
    return UCS_ERR_IO_ERROR;
} else {
    status = uct_cuda_copy_get_ctx_rscs(iface, current_ctx, &ctx_rsc);
    if (UCS_OK != status) {
        ucs_error("unable to get resources associated with cuda context");
        return UCS_ERR_IO_ERROR;
    }
}

Contributor:

It could be a common inline function to get the current ctx.

Contributor Author:

Will change to inline function.


/* ensure context is set before creating events/streams */
UCT_CUDADRV_FUNC_LOG_ERR(cuCtxGetCurrent(&current_ctx));
if (current_ctx == NULL) {
Contributor:

Suggested change:
-    if (current_ctx == NULL) {
+    if (ucs_unlikely(current_ctx == NULL)) {

ucs_memory_type_t src, dst;
ucs_mpool_params_t mp_params;
unsigned long long ctx_id;
Contributor:

Suggested change:
-    unsigned long long ctx_id;
+    unsigned long long ctx_id;

Comment on lines 377 to 391:

/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);

cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
if (cu_err != CUDA_SUCCESS) {
    cuCtxPopCurrent(&cuda_popped_ctx);
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}

cuCtxPopCurrent(&cuda_popped_ctx);

Contributor:

Suggested change:

/* GetAddressRange requires context to be set. On DGXA100 it takes 0.03 us
 * to push and pop the context associated with address (which should be
 * non-NULL if we are at this point) */
cuCtxPushCurrent(cuda_mem_ctx);
cu_err = cuMemGetAddressRange(&base_address, &alloc_length,
                              (CUdeviceptr)address);
cuCtxPopCurrent(&cuda_popped_ctx);
if (cu_err != CUDA_SUCCESS) {
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}
@Akshay-Venkatesh (Contributor Author) left a comment:

Thanks for the feedback. I'll make these changes today.


@Akshay-Venkatesh marked this pull request as ready for review February 22, 2024 20:52
@Akshay-Venkatesh (Contributor Author):

@brminich I see one of the commits had an extra colon, and 2 commit-style tests are failing because of that. Would it be OK to rebase? I can wait to do this until all the reviewers have had a chance to look at my comments and code changes.

cc @rakhmets @SeyedMir

@SeyedMir (Contributor):

@Akshay-Venkatesh Rebase is fine with me.

@brminich (Contributor):

@Akshay-Venkatesh, no problem from my side

@rakhmets (Contributor) left a comment:

Rebase is OK for me too.

@Akshay-Venkatesh force-pushed the topic/cuda-copy-multi-dev branch from bb7c190 to fb1d3be on February 26, 2024 19:02
@Akshay-Venkatesh (Contributor Author) commented Feb 28, 2024:

@brminich @rakhmets @SeyedMir

FYI, in dd8b66d I had to remove all code that does EventDestroy or StreamDestroy, as CUDA doesn't have a way to query whether a given CUcontext has been destroyed, and calling Stream/EventDestroy on streams/events whose context has been destroyed is potentially unsafe. For this reason we will have to leave cleanup to process teardown. This should be safe from UCX's viewpoint, as all UCT resources are tied to some UCP context and there is no concern of reusing streams/events that haven't been cleaned up (they are not global).

Also, it looks like cuCtxGetId is only supported for CUDA >= 12.0. Without a context ID, we don't have a way to identify which context we're trying to use and pick the associated stream/event resources for transport operations. We cannot use the CUcontext handle itself instead of the context ID, because we cannot assume that a handle returned by, say, cuCtxGetCurrent will always be the same handle rather than a different handle with the same properties. So it seems that multi-device support will need CUDA >= 12.0. We should discuss this further.


if (status != UCS_OK) {
    return status;
} else if (current_ctx == NULL) {
    ucs_error("attempt to perform cuda memcpy without active context");
Contributor:

I think this log message can be confusing since the function can be called by different callers. It is better to print the following message: there is no cuda context bound to the calling cpu thread.

Comment on lines 106 to 108:

static inline
ucs_status_t uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                                       uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

Contributor:

Suggested change:

static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_ctx_rsc(uct_cuda_copy_iface_t *iface,
                          uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

Comment on lines 124 to 126:

static inline
ucs_status_t uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc)

Contributor:

uct_cuda_copy_get_short_stream should have a CUstream *stream parameter, and a uct_cuda_copy_per_ctx_rsc_t *ctx_rsc local variable.

Suggested change:

static UCS_F_ALWAYS_INLINE ucs_status_t
uct_cuda_copy_get_short_stream(uct_cuda_copy_iface_t *iface, CUstream *stream)

    ucs_free(ctx_rsc);
err_kh_del:
    kh_del(cuda_copy_ctx_rscs, &iface->ctx_rscs, iter);
out:

Contributor:

Suggested change:
-out:
+err:

if (cu_err != CUDA_SUCCESS) {
    ucs_error("cuMemGetAddressRange(%p) error: %s", address,
              uct_cuda_base_cu_get_error_string(cu_err));
    return UCS_ERR_INVALID_ADDR;
}

Contributor:

Please remove the extra empty line.

Comment on lines 490 to 492:

ucs_status_t uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface,
                                            CUcontext cuda_ctx,
                                            uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)

Contributor:

Please remove the declaration from the .h file, and make the function static.

Suggested change:

static ucs_status_t
uct_cuda_copy_get_per_ctx_rscs(uct_cuda_copy_iface_t *iface, CUcontext cuda_ctx,
                               uct_cuda_copy_per_ctx_rsc_t **ctx_rsc_p)

@@ -48,23 +49,30 @@ typedef struct uct_cuda_copy_queue_desc {
    ucs_queue_elem_t queue;
} uct_cuda_copy_queue_desc_t;

typedef struct uct_cuda_copy_per_ctx_rsc {
    CUcontext          cuda_ctx;
    unsigned long long ctx_id;
Contributor:

Please remove the ctx_id field.

Contributor Author:

@rakhmets But we use this as the key to get the right context from the hashmap. Are you suggesting that we use a different key?

@@ -131,6 +131,7 @@ static ucs_status_t uct_cuda_copy_iface_query(uct_iface_h tl_iface,
return UCS_OK;
}

#if 0
Contributor:

Please remove unused code.

} uct_cuda_copy_per_ctx_rsc_t;


KHASH_MAP_INIT_INT64(cuda_copy_ctx_rscs, struct uct_cuda_copy_per_ctx_rsc*);
Contributor:

You can store uct_cuda_copy_per_ctx_rsc (not the pointer); then you would not need to do alloc/free during put.

Contributor Author:

@brminich Will incorporate this change.

CUcontext cuda_ctx;
unsigned long long ctx_id;
/* pool of cuda events to check completion of memcpy operations */
ucs_mpool_t cuda_event_desc;
Contributor:

Do we really need to have this mpool per context? Maybe one common mpool is enough?

Contributor Author (@Akshay-Venkatesh, Jan 30, 2025):

No. Event and stream resources are associated with a context. We do need an mpool for each context.
