Fix codegen bug in "ptx-kernel" abi related to arg passing #94703
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @davidtwco (or someone else) soon. Please see the contribution instructions for more information.
I may be missing it in the PTX docs, but does PTX actually want aggregates less than 8 bytes to be extended to 8 bytes?
You are absolutely right. I thought the issue with kernel codegen was limited to the kind of structs I had an issue with, and kept some of the old code. I think I will need to expand the test cases a lot to cover more of Rust's types and the rules in PTX Interoperability. I will change this PR to draft now and continue to work on it tomorrow. I also have to ask about the interest in getting this properly fixed in the compiler. I'm eager to do the work and would love to contribute to the Rust compiler, but it would be nice to know that if I get it right it will most likely be merged as well. It's also worth mentioning that I have not yet contributed to the Rust compiler and will realistically need a bit more explaining than a seasoned contributor. Perhaps @davidtwco is the right person to give an indication of this interest.
ptx-kernel has pretty much been a dead ABI until now because nvptx64 was broken and unusable. Fixing ptx-kernel is one of the steps to upstreaming the rust-cuda work back to rustc, so this needs to be done either way. I don't currently use ptx-kernel because, as mentioned, it's broken, but in the future I'd like to transition to it, specifically because a CUDA codegen cannot register a special kernel-marking proc macro (I mean, it could, but it's ugly and probably wouldn't be accepted). Therefore we need a special ABI to say that a function is a kernel. However, the ptx-kernel ABI needs a lot more work, including:
I'm glad to hear that it is useful for rust-cuda. I would love to help out on things related to improving PTX generation through LLVM and preparing for upstreaming the nvcc backend. I'm quite new to both PTX and Rust compiler contributions, so all guidance is highly appreciated.

I don't disagree with any of your points on how values should be passed. Especially, making it possible to pass arrays as values without wrapping them in structs is a huge improvement compared to C/C++. However, I don't see the logic of treating tuples as multiple values rather than as a struct, but perhaps a little explanation would help. I also assume this can easily be done right now for this unstable calling convention; since it's practically unusable in its current state, I assume nobody is relying on it.

On the other side, would it require a lot more time to stabilize the calling convention if all of these decisions had to be made immediately? I would believe stabilizing the calling convention with a stable ABI only for the following types (integers up to 64 bits, floating points, pointers, and structs/arrays of these) would be quite simple. Then tuples, 128-bit integers, and slices could come as additional RFCs. This would still mean they should be implemented properly from the start.

Perhaps unrelated to the "ptx-kernel" ABI but related to device code: the PTX Interoperability document says little about return values. I assume it is beneficial to be able to use some specified ABI to call into .ptx device functions written in C/C++. That would require some standard ABI covering both arguments and returns. Is the idea here to simply follow the C ABI?
I meant that tuples should be passed as structs with their fields ordered logically one after the other. Also, the return convention is not a problem: ptx-kernel is for kernels, which cannot return values. For calling into foreign functions, it's just the same as normal FFI; it can use the C callconv, which CUDA C++ can generate just fine.
After yesterday's discovery that kernel argument passing is more broken than initially expected, I did some research. First of all, 128-bit integers are supported in nvcc starting from 11.5. I did a quick test with nvcc, and they are being packed as 16-byte-aligned byte arrays. That makes it a lot simpler for us, as doing the thing you proposed is essentially the same as what nvcc does anyway. What got me completely confused was checking how nvcc compiles arguments of the integer types compared to what the PTX Interoperability document states. In section 3.3 it states that integral types of less than 32 bits should be zero/sign extended to 32 bits depending on the type, and that both .u32 and .s32 should be used depending on the type. What nvcc does in practice is to use the exact size, not differentiate between u/s, and use .u for everything. Taking a function using
This is why I think the PTX interop guide is either outdated, wrong, or misleading. Extending any arguments would mean that the host code would also need to do so when calling the kernel, which is easily unsound and confusing. I think we should just go with what nvcc does in terms of the sizes of everything. Signedness does not really matter, as it does not affect the kernel caller. Size does, however, since using incorrect param sizes when calling the kernel is UB (thanks for not doing the smallest of safety checks, CUDA).
A small status update. Since last time, I have thrown the PTX Interoperability document in the trash and am looking at what nvcc and clang are actually doing. I have confirmed that no 64-bit width extension is required and removed it from the code for both kernels and device functions. I have added more tests to confirm that the correct PTX code is generated. I'm not sure yet if it will be possible to have these tests enabled in the compiler by default; I must investigate more why the other nvptx tests are ignored. In contrast to nvcc and clang, there are no checks in Rust that a kernel does not have a return value. I'm investigating how to best add this to Rust. Perhaps this is outside the scope of this PR, but I would like to know a bit more about what it boils down to before deciding to leave it out. If anyone with rustc experience could point me in the right direction, that would be lovely. @RDambrosio016, can you please explain a bit more what problems there will be by:
Nothing, I just thought passing it separately would be better in case someone wants to call the Rust kernels from CUDA C++. But I'm starting to think it's not worth it and we should just pass it as one 128-bit param.
I think we are misunderstanding each other; I meant that tuples should be passed as byte arrays precisely. Any other way of passing them would just be confusing.
Slices don't have a stable ABI for any calling convention currently. For the current Rust ABI, any type whose layout says
Yeah, though a stable way of passing slices to GPU kernels is kind of a prereq to stable CUDA support in Rust, since slices are at the core of passing immutable data to kernels. Especially since they result in
Yeah, this does not affect rustcall at all; this is what we do in rust-cuda currently. The Rust ABI is left untouched for normal function calls in a kernel.
I would like to see slices stabilized as passing pointer and length as two separate arguments on all platforms, like the current implementation. I believe there is an open issue for this on the unsafe-code-guidelines repo. However, I don't think nvptx should do this before a guarantee is made on all platforms.
This is fine for callconvs that are generally aimed at interacting with other languages; however, ptx-kernel is mostly just aimed at calling Rust kernels with Rust wrappers to CUDA. Also, CUDA natively supports passing args of any size, unlike other calling conventions that don't and instead require more exotic things, so CUDA is a bit special in this regard.
The thing is that currently the layout of slices is not guaranteed in any way. We are even allowed to add fields beyond a pointer and size (e.g. a value derived by encrypting the pointer and size for exploit mitigation), or to have, for example, the size be measured in bytes rather than elements. I think we need to stabilize that we don't do these things before we stabilize how they are passed on any ABI.
Yeah, that's true. It's kind of surprising that something so central to Rust is still super unstable when it comes to layout/FFI.
Also, I think slices should be passed as two i64 rather than a single i128. Doing anything else requires extra code in the call conv calculation and slightly bloats IR, as rustc implements PassMode::Cast by storing to the stack and then loading again, while PassMode::Pair is implemented as extractelement/insertelement.
Yeah, there are pros and cons to both; I'd be fine with either one though.
Thanks for clearing up the confusion around slice passing. For the scope of this PR, it's more about doing the most sensible thing and not about assuring stable behavior in regard to slices. The
It's always possible to pass as ptr + len manually until slices are supported. Wouldn't it be UB to pass a
What I'm talking about is that it seems like Rust handles tuples differently depending on the size of the tuple. For two elements it will pass them, much like the slice, as two separate arguments. For three elements Rust will pass it as a struct. See the tests below for a more concise explanation. I also think expanding the tuples to separate arguments, like Rust does for tuples of size 2, would be quite reasonable behavior. Since tuples also have an unstable ABI, perhaps it's best to do what @bjorn3 refers to in terms of slices, and avoid doing anything special to avoid special handling in call conv code? Test below:

```rust
// CHECK: .visible .entry f_u32_u16_tuple_arg(
// CHECK: .param .u32 f_u32_u16_tuple_arg_param_0
// CHECK: .param .u16 f_u32_u16_tuple_arg_param_1
#[no_mangle]
pub unsafe extern "ptx-kernel" fn f_u32_u16_tuple_arg(a: (u32, u16)) {}

// CHECK: .visible .entry f_u8_u8_u32_tuple_arg(
// CHECK: .param .align 4 .b8 f_u8_u8_u32_tuple_arg_param_0[8]
#[no_mangle]
pub unsafe extern "ptx-kernel" fn f_u8_u8_u32_tuple_arg(a: (u8, u8, u32)) {}
```
@bjorn3, it seems like a struct with two elements (like the one below) will also be implemented as PassMode::Pair. I think the reason this does not matter for C is that it's just stack memory anyway in an ordinary call. Are there any ways to keep slices as PassMode::Pair? Here is an example of a struct that is made into PassMode::Pair:

```rust
#[repr(C)]
pub struct DoubleFloat {
    f: f32,
    g: f32,
}
```
You could add a method to
⌛ Testing commit cea692a01169bdcf5a109b856204c0c5f4e19c57 with merge 8783abad5edb3f8d2c6097066708ab49753d6d44...
💔 Test failed - checks-actions |
It seems the CI fails due to not having ptx-linker. Should I ignore the test until the LLVM bug can be resolved and we can use
Hm, curious. Other
@nagisa
@bors r+ |
📌 Commit 5bf5acc has been approved by |
… r=nagisa Fix codegen bug in "ptx-kernel" abi related to arg passing

I found a codegen bug in the nvptx ABI related to args being passed as ptrs ([see comment](rust-lang#38788 (comment))); this is not as specified in the [ptx-interoperability doc](https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/) or how C/C++ does it. It will also almost always fail in practice, since device and host use different memory spaces on most hardware. This PR fixes the bug and adds tests for passing structs to PTX kernels.

I observed that all nvptx assembly tests had been marked as [ignore a long time ago](rust-lang#59752 (comment)). I'm not sure if the new one should be marked as ignore; it passed on my computer, but it might fail if ptx-linker is missing on the server? I guess this is outside the scope of this PR and should be looked at in a different issue/PR.

I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existing 32-bit target. Even though 32-bit nvptx is not a supported target, there is still some code under the hood supporting codegen for 32-bit PTX. I was advised to create an MCP to find out if this code should be removed or updated.

Perhaps `@RDambrosio016` would have interest in taking a quick look at this.
…laumeGomez Rollup of 8 pull requests

Successful merges:
- rust-lang#94022 (Clarify that `Cow::into_owned` returns owned data)
- rust-lang#94703 (Fix codegen bug in "ptx-kernel" abi related to arg passing)
- rust-lang#95949 (Implement Default for AssertUnwindSafe)
- rust-lang#96361 (Switch JS code to ES6)
- rust-lang#96372 (Suggest calling method on nested field when struct is missing method)
- rust-lang#96386 (simplify `describe_field` func in borrowck's diagnostics part)
- rust-lang#96400 (Correct documentation for `Rvalue::ShallowInitBox`)
- rust-lang#96415 (Remove references to git.io)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup