Fix codegen bug in "ptx-kernel" abi related to arg passing #94703
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @davidtwco (or someone else) soon. Please see the contribution instructions for more information.
I may be missing it in the PTX docs, but does PTX actually want aggregates less than 8 bytes to be extended to 8 bytes?
You are absolutely right. I thought the issue with kernel codegen was limited to the kind of structs I had an issue with, and kept some of the old code. I think I will need to expand the test cases a lot to cover more of Rust's types and the rules in PTX Interoperability. I will change this PR to draft now and continue to work on it tomorrow. I also have to ask about the interest in getting this properly fixed in the compiler. I'm eager to do the work and would love to contribute to the Rust compiler, but it would be nice to know that if I get it right it will most likely be merged as well. It's also worth mentioning that I have not yet contributed to the Rust compiler and will realistically need a bit more explaining than a seasoned contributor. Perhaps @davidtwco is the right person to give an indication of this interest.
ptx-kernel has pretty much been a dead ABI until now because nvptx64 was broken and unusable. Fixing ptx-kernel is one of the steps to upstreaming the rust-cuda work back to rustc, so this needs to be done either way. I don't currently use ptx-kernel because, as mentioned, it's broken, but in the future I'd like to transition to it, specifically because a CUDA codegen cannot register a special kernel-marking proc macro (I mean, it could, but it's ugly and probably wouldn't be accepted). Therefore we need a special ABI to say that a function is a kernel. However, the ptx-kernel ABI needs a lot more work, including:
I'm glad to hear that it is useful for rust-cuda. I would love to help out on things related to improving PTX generation through LLVM and preparing for upstreaming the nvcc backend. I'm quite new to both PTX and Rust compiler contributions, so all guidance is highly appreciated.

I don't disagree with any of your points on how values should be passed. Especially, making it possible to pass arrays as values without wrapping them in structs is a huge improvement compared to C/C++. However, I don't see the logic of treating tuples as multiple values rather than as a struct, but perhaps a little explanation would help. I also assume this can easily be done right now for this unstable calling convention; since it's practically unusable in its current state, I assume nobody is relying on it.

On the other side, would it require a lot more time to stabilize the calling convention if all of these decisions had to be made immediately? I would believe stabilizing the calling convention with a stable ABI only for the following types (integers up to 64 bits, floating points, pointers, and structs/arrays of these) would be quite simple. Then tuples, 128-bit integers, and slices could come as additional RFCs. This would still mean they should be implemented properly from the start.

Perhaps unrelated to the "ptx-kernel" ABI but related to device code: the PTX Interoperability document says little about return values. I assume it is beneficial to be able to use some specified ABI to call into .ptx device functions written in C/C++. That would require some standard ABI covering both arguments and returns. Is the idea here to simply follow the C ABI?
I meant that tuples should be passed as structs with their fields ordered logically one after the other. Also, the return convention is not a problem: ptx-kernel is for kernels, which cannot return values. For calling into foreign functions, it's just the same as normal FFI; it can use the C callconv, which CUDA C++ can generate just fine.
After yesterday's discovery that kernel argument passing is more broken than initially expected, I did some research. First of all, 128-bit integers are supported in nvcc starting from 11.5. I did a quick test with nvcc, and they are being packed as 16-byte-aligned byte arrays. That makes it a lot simpler for us, as doing the thing you proposed is essentially the same as what nvcc does anyway. What got me completely confused was checking how nvcc compiles arguments of the integer types compared to what the PTX Interoperability document states. In section 3.3 it states that integral types of less than 32 bits should be zero/sign extended to 32 bits depending on the type, and that both .u32 and .s32 should be used depending on the type. What nvcc does in practice is to use the exact size, not differentiate between u/s, and use .u for everything. Taking a function using
This is why I think the PTX interop guide is either outdated, wrong, or misleading. Extending any arguments would mean that the host code would also need to do so when calling the kernel, which is easily unsound and confusing. I think we should just go with what nvcc does in terms of the sizes of everything. Signedness does not really matter, as it does not affect the kernel caller. Size does, however, since using incorrect param sizes when calling the kernel is UB (thanks for not doing the smallest of safety checks, CUDA).
A small status update. Since last time, I have thrown the PTX Interoperability document in the trash and am looking at what nvcc and clang are actually doing. I have confirmed that no 64-bit width extension is required and removed it from the code for both kernels and device functions. I have added more tests to confirm that the correct PTX code is generated. I'm not sure yet if it will be possible to have these tests enabled in the compiler by default; I must investigate more why the other nvptx tests are ignored. In contrast to nvcc and clang, there are no checks in Rust that a kernel does not have a return value. I'm investigating how to best add this to Rust. Perhaps this is outside the scope of this PR, but I would like to know a bit more about what it boils down to before deciding to leave it out. If anyone with rustc experience could point me in the right direction, that would be lovely. @RDambrosio016, can you please explain a bit more what problems there will be by:
Nothing, I just thought passing it separately would be better in case someone wants to call the Rust kernels from CUDA C++. But I'm starting to think it's not worth it and we should just pass it as one 128-bit param.
I think we are misunderstanding each other; I meant that tuples should be passed as byte arrays precisely. Any other way of passing them would just be confusing.
Slices don't have a stable ABI for any calling convention currently. For the current Rust ABI, any type whose layout says
Yeah, though a stable way of passing slices to GPU kernels is kind of a prereq to stable CUDA support in Rust, since slices are at the core of passing immutable data to kernels. Especially since they result in
Yeah, this does not affect rustcall at all; this is what we do in rust-cuda currently. The Rust ABI is left untouched for normal function calls in a kernel.
I would like to see slices stabilized as passing pointer and length as two separate arguments on all platforms, like the current implementation. I believe there is an open issue for this on the unsafe-code-guidelines repo. However, I don't think nvptx should do this before a guarantee is made on all platforms.
This is fine for callconvs that are generally aimed at interacting with other languages; however, ptx-kernel is mostly just aimed at calling Rust kernels with Rust wrappers to CUDA. Also, CUDA natively supports passing args of any size, unlike other calling conventions that don't and instead require more exotic things, so CUDA is a bit special in this regard.
The thing is that currently the layout of slices is not guaranteed in any way. We are even allowed to add fields beyond a pointer and size (e.g. a value derived by encrypting the pointer and size for exploit mitigation), or to have, for example, the size be measured in bytes rather than elements. I think we need to stabilize that we don't do these things before we stabilize how they are passed on any ABI.
Yeah, that's true. It's kind of surprising that something so central to Rust is still super unstable when it comes to layout/FFI.
Also, I think slices should be passed as two i64 rather than a single i128. Doing anything else requires extra code in the call conv calculation and slightly bloats IR, as rustc implements PassMode::Cast by storing to the stack and then loading again, while PassMode::Pair is implemented as extractelement/insertelement.
Yeah, there are pros and cons to both; I'd be fine with either one though.
Thanks for clearing up the confusion around slice passing. For the scope of this PR, it's more about doing the most sensible thing and not about assuring stable behavior in regard to slices. The
It's always possible to pass as ptr + len manually until slices are supported. Wouldn't it be UB to pass a
What I'm talking about is that it seems like Rust handles tuples differently depending on the size of the tuple. For two elements it will pass them, much like the slice, as two separate arguments. For three elements Rust will pass it as a struct. See the tests below for a more concise explanation. I also think expanding the tuples to separate arguments, like Rust does for tuples of size 2, would be quite reasonable behavior. Since tuples also have an unstable ABI, perhaps it's best to do what @bjorn3 refers to in terms of slices, and avoid doing anything special to avoid special handling in call conv code? Test below:

```rust
// CHECK: .visible .entry f_u32_u16_tuple_arg(
// CHECK: .param .u32 f_u32_u16_tuple_arg_param_0
// CHECK: .param .u16 f_u32_u16_tuple_arg_param_1
#[no_mangle]
pub unsafe extern "ptx-kernel" fn f_u32_u16_tuple_arg(a: (u32, u16)) {}

// CHECK: .visible .entry f_u8_u8_u32_tuple_arg(
// CHECK: .param .align 4 .b8 f_u8_u8_u32_tuple_arg_param_0[8]
#[no_mangle]
pub unsafe extern "ptx-kernel" fn f_u8_u8_u32_tuple_arg(a: (u8, u8, u32)) {}
```
@bjorn3, it seems like a struct with two elements (like the one below) will also be implemented as PassMode::Pair. I think the reason this does not matter for C is that it's just stack memory anyway in an ordinary call. Are there any ways to keep slices as PassMode::Pair? Here is an example of a struct that is made into PassMode::Pair:

```rust
#[repr(C)]
pub struct DoubleFloat {
    f: f32,
    g: f32,
}
```
You could add a method to
⌛ Testing commit cea692a01169bdcf5a109b856204c0c5f4e19c57 with merge 8783abad5edb3f8d2c6097066708ab49753d6d44...
💔 Test failed - checks-actions |
It seems the CI fails due to not having ptx-linker. Should I ignore the test until the LLVM bug can be resolved and we can use
Hm, curious. Other
@nagisa
@bors r+ |
📌 Commit 5bf5acc has been approved by |
… r=nagisa Fix codegen bug in "ptx-kernel" abi related to arg passing

I found a codegen bug in the nvptx ABI related to args being passed as ptrs ([see comment](rust-lang#38788 (comment))); this is not as specified in the [ptx-interoperability doc](https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/) or how C/C++ does it. It will also almost always fail in practice, since device and host use different memory spaces on most hardware. This PR fixes the bug and adds tests for passing structs to PTX kernels.

I observed that all nvptx assembly tests had been marked as [ignore a long time ago](rust-lang#59752 (comment)). I'm not sure if the new one should be marked as ignore; it passed on my computer, but it might fail if ptx-linker is missing on the server? I guess this is outside the scope of this PR and should be looked at in a different issue/PR.

I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existing 32-bit target. Even though 32-bit nvptx is not a supported target, there is still some code under the hood supporting codegen for 32-bit PTX. I was advised to create an MCP to find out if this code should be removed or updated.

Perhaps `@RDambrosio016` would have interest in taking a quick look at this.
…laumeGomez Rollup of 8 pull requests

Successful merges:
- rust-lang#94022 (Clarify that `Cow::into_owned` returns owned data)
- rust-lang#94703 (Fix codegen bug in "ptx-kernel" abi related to arg passing)
- rust-lang#95949 (Implement Default for AssertUnwindSafe)
- rust-lang#96361 (Switch JS code to ES6)
- rust-lang#96372 (Suggest calling method on nested field when struct is missing method)
- rust-lang#96386 (simplify `describe_field` func in borrowck's diagnostics part)
- rust-lang#96400 (Correct documentation for `Rvalue::ShallowInitBox`)
- rust-lang#96415 (Remove references to git.io)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup