[Performance] Remove unnecessary synchronization using thrust::cuda::par_nosync policy #148

chang-l · 2024-03-15T20:29:22Z

We always launch Thrust algorithms on a CUDA stream with the thrust::cuda::par policy, which adds an extra cudaStreamSynchronize inside the Thrust calls, e.g.,

Line 63 in 9f290c4

    
           thrust::cuda::par(allocator).on(stream), seq_indices, seq_indices + indices_desc.size, 0);

Line 340 in 9f290c4

thrust::exclusive_scan(thrust::cuda::par(thrust_allocator).on(stream),

It would be better to switch to thrust::cuda::par_nosync, which lets Thrust skip stream synchronization that is not required for correctness and makes it easier to overlap these calls with other operations.
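A minimal sketch of the proposed change, not the project's actual code: the function name `sequence_then_scan`, the buffers, and the stream setup are hypothetical. The allocator argument from the original calls is omitted for brevity. It assumes a Thrust version that provides `thrust::cuda::par_nosync` (1.16 or later). Because the algorithms may return before their kernels finish, the caller synchronizes the stream explicitly before reading results on the host.

```cpp
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/system/cuda/execution_policy.h>

#include <vector>

// Hypothetical helper: fills [0, n) on the device, takes its exclusive prefix
// sum, and returns the result on the host.
std::vector<int> sequence_then_scan(int n) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  thrust::device_vector<int> indices(n);
  thrust::device_vector<int> offsets(n);

  // par_nosync: Thrust may return without synchronizing the stream, so the
  // two calls below can be enqueued back to back.
  auto policy = thrust::cuda::par_nosync.on(stream);
  thrust::sequence(policy, indices.begin(), indices.end(), 0);
  thrust::exclusive_scan(policy, indices.begin(), indices.end(), offsets.begin());

  // Stream order guarantees the copy runs after the scan.
  std::vector<int> host(n);
  cudaMemcpyAsync(host.data(), thrust::raw_pointer_cast(offsets.data()),
                  n * sizeof(int), cudaMemcpyDeviceToHost, stream);

  // Explicit synchronization: required before the host reads the result.
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return host;
}
```

With `thrust::cuda::par` each of the two algorithm calls would block the host until the stream drains. Here the only host-side wait is the single `cudaStreamSynchronize` at the end.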


linhu-nv · 2024-04-03T03:43:44Z

Sorry for the late reply. The wg 24.04 release is closing; is it OK if we fix this in 24.06?
