This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

[Performance] Remove unnecessary synchronization using thrust::cuda::par_nosync policy #148

Open
chang-l opened this issue Mar 15, 2024 · 1 comment

Comments

@chang-l
Contributor

chang-l commented Mar 15, 2024

We always launch Thrust algorithms on a CUDA stream through the thrust::cuda::par(...).on(stream) policy, which performs an extra cudaStreamSynchronize inside each Thrust call, e.g.,

thrust::cuda::par(allocator).on(stream), seq_indices, seq_indices + indices_desc.size, 0);
thrust::exclusive_scan(thrust::cuda::par(thrust_allocator).on(stream),

It would be better to switch to thrust::cuda::par_nosync, which lets Thrust skip synchronization that is not needed for correctness, so these calls are easier to overlap with other operations.
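As a rough illustration of the proposed change, here is a minimal sketch, assuming a Thrust version that provides thrust::cuda::par_nosync (1.16 or later, as far as I know). The helper name `sequence_then_scan` and the buffers are hypothetical and not taken from the repository, and the custom allocator from the snippets above is left out to keep the example small. Both algorithms are enqueued on the same stream, and the host waits only once, right before it reads the result.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/system/cuda/execution_policy.h>

#include <vector>

// Hypothetical helper (illustration only): fills a device buffer with
// 0..n-1, takes its exclusive prefix sum, and returns the sums on the host.
// Error checking of the CUDA runtime calls is omitted for brevity.
std::vector<int> sequence_then_scan(int n) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  thrust::device_vector<int> indices(n);
  thrust::device_vector<int> offsets(n);

  // par_nosync allows Thrust to return without waiting for the stream,
  // so the two calls below can be enqueued back to back.
  auto policy = thrust::cuda::par_nosync.on(stream);
  thrust::sequence(policy, indices.begin(), indices.end(), 0);
  thrust::exclusive_scan(policy, indices.begin(), indices.end(),
                         offsets.begin());

  // Copy back on the same stream; stream ordering guarantees the scan has
  // finished before the copy starts.
  std::vector<int> host(n);
  cudaMemcpyAsync(host.data(), thrust::raw_pointer_cast(offsets.data()),
                  n * sizeof(int), cudaMemcpyDeviceToHost, stream);

  // The caller is now responsible for synchronizing before touching results.
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return host;
}
```

This needs to be compiled with nvcc as a .cu file. The trade-off is that any code that previously relied on the implicit synchronization (for example, reading the output on the host right after the Thrust call) has to synchronize the stream explicitly.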

@linhu-nv
Contributor

linhu-nv commented Apr 3, 2024

Sorry for the late reply. wg 24.04 is closing. Is it OK if we fix this in 24.06?

rapids-bot bot referenced this issue Apr 30, 2024