Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for the "ptx-kernel" ABI #38788

Open
1 of 14 tasks
japaric opened this issue Jan 2, 2017 · 6 comments
Open
1 of 14 tasks

Tracking issue for the "ptx-kernel" ABI #38788

japaric opened this issue Jan 2, 2017 · 6 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html S-tracking-design-concerns Status: There are blocking design concerns. S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@japaric
Copy link
Member

japaric commented Jan 2, 2017

Feature gate #![feature(abi_ptx)]

This ABI is intended to be used when generating code for device (GPU) targets like nvptx64-nvidia-cuda. It is used to generate kernels ("global functions") that work as an entry point from host (cpu) code. Functions that do not use the "ptx-kernel" ABI are "device functions" and only callable from kernels and device functions. Device functions are specifically not usable from host (cpu) code.

Public API

The following code

#![no_std]
#![feature(abi_ptx)]

#[no_mangle]
pub extern "ptx-kernel" fn foo() {}

Produces

.version 3.2
.target sm_30
.address_size 64

	// .globl	foo

.visible .entry foo()
{
	ret;
}

Steps / History

Unresolved Questions

  • Resolve what kind of stability guarantees can be made about the generated ptx.
    • The ABI of kernels have been previously changed for a major version bump and the ptx-interoperability doc is still outdated.
    • PTX is an ISA with many versions. The newest is major version 7. Do we need to reserve the possibility of breaking things when moving to a new major version?
    • Figure out what llvm does in relations to the nvptx64-nvidia-cuda target and the __global__ modifier.
  • What kind of types should be allowed to use as arguments in kernels. Should it be a hard error to use these types or only a warning (Global and device kernels are unsound rust-cuda/wg#11)
    • The most important part is to find a minimal but useful subset of Rust types that can be used in kernels. raw pointers, primitive types and #[repr(C)] types seems like a good start (no slices, tuples, references, etc).
    • Using mutable references is almost certain UB except for a few unusable special cases (spawning a single thread only)
    • There are many convenient types in Rust which do not have a stable ABI (&[T], (T, U), etc). Are there some types that do not have a stable representation but can be relied on having an identical representation for sequential compilation with a given rustc version? If so are there any way we could pass them safely between host and device code compiled with the same rustc version?
  • This unstable feature is one of the last stoppers to using nvptx64-nvidia-cuda on stable Rust. The target seems to still have a few bugs (NVPTX backend metabug #38789). Should this feature be kept unstable to avoid usage of nvptx64-nvidia-cuda until it has been verified to be usable.
  • How should shared be supported? Is it necessary to do that from the go?
  • nvptx "ptx-kernel" ABI (feature: abi_ptx) uses PassMode::Direct for Aggregates #117271

Notes

  • It is not possible to emulate kernels with #[naked] functions as the .entry directive needs to be emitted for nvptx kernels, which requires this ABI.
@japaric japaric added O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Jan 2, 2017
@sfackler sfackler added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 3, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Jul 22, 2017
bors added a commit that referenced this issue May 27, 2018
…lexcrichton

Ensure every unstable language feature has a tracking issue.

Filled in the missing numbers:

* `abi_ptx` → #38788
* `generators` → #43122
* `global_allocator` → #27389

Reused existing tracking issues because they were decomposed from a larger feature

* `*_target_feature` → #44839 (reusing the old `target_feature` number)
* `proc_macros_*` → #38356 (reusing the to-be-stabilized `proc_macros` number)

Filed new issues

* `exhaustive_patterns` → #51085
* `pattern_parentheses` → #51087
* `wasm_custom_section` and `wasm_import_module` → #51088
@steveklabnik
Copy link
Member

Triage: not aware of any movement on stabilizing this.

@lahwran
Copy link

lahwran commented Mar 7, 2020

would be very interested in this! any of y'all know what needs doing to bring this back to life?

edit: following links through the metabug, and then through a project that mentioned the metabug, looks like there are in fact folks still interested. cool!

@kjetilkjeka
Copy link
Contributor

kjetilkjeka commented Feb 23, 2022

I'm curious if I might have found a codegen bug related to this feature. It is related to how the struct is assumed to be passed to the kernel.

When compiling the following Rust code
#![no_std]
#![feature(abi_ptx, stdsimd)]

#[panic_handler]
fn panic(info: &core::panic::PanicInfo) -> ! {
    unreachable!()
}

#[repr(C)]
pub struct Foo{
    array: [f32; 9],
}

#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(a: *mut f32, b: Foo) {
    *a = b.array[5];
}
Rustc will produce the following

When compiling with rustc +nightly --target nvptx64-nvidia-cuda <filename> -C target-cpu=sm_52 --crate-type cdylib

//
// Generated by LLVM NVPTX Back-End
//

.version 4.1
.target sm_52
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .u64 add_param_1
)
{
	.reg .f32 	%f<2>;
	.reg .b64 	%rd<5>;

	ld.param.u64 	%rd1, [add_param_0];
	ld.param.u64 	%rd2, [add_param_1];
	cvta.to.global.u64 	%rd3, %rd2;
	cvta.to.global.u64 	%rd4, %rd1;
	ld.global.f32 	%f1, [%rd3+32];
	st.global.f32 	[%rd4], %f1;
	ret;

}

On the other hand, when compiling the following C++ code
struct foo {
    float array[9];
};

extern "C" __global__ void add( float *a, struct foo b) {
    *a = b.array[5];
}
I get the following output for nvcc

When compiling with nvcc -ptx <filename>

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-30794723
// Cuda compilation tools, release 11.6, V11.6.55
// Based on NVVM 7.0.1
//

.version 7.6
.target sm_52
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .align 4 .b8 add_param_1[36]
)
{
	.reg .f32 	%f<2>;
	.reg .b64 	%rd<3>;


	ld.param.u64 	%rd1, [add_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	ld.param.f32 	%f1, [add_param_1+20];
	st.global.f32 	[%rd2], %f1;
	ret;

}


I get the following output for clang

When compiling with clang++ --cuda-device-only -nocudalib --cuda-gpu-arch=sm_52 <filename> -S -o <outfile>

//
// Generated by LLVM NVPTX Back-End
//

.version 4.1
.target sm_52
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .align 4 .b8 add_param_1[36]
)
{
	.local .align 8 .b8 	__local_depot0[48];
	.reg .b64 	%SP;
	.reg .b64 	%SPL;
	.reg .f32 	%f<11>;
	.reg .b64 	%rd<8>;

	mov.u64 	%SPL, __local_depot0;
	cvta.local.u64 	%SP, %SPL;
	mov.b64 	%rd2, add_param_1;
	ld.param.u64 	%rd1, [add_param_0];
	ld.param.f32 	%f1, [add_param_1];
	ld.param.f32 	%f2, [add_param_1+8];
	ld.param.f32 	%f3, [add_param_1+12];
	ld.param.f32 	%f4, [add_param_1+16];
	ld.param.f32 	%f5, [add_param_1+20];
	ld.param.f32 	%f6, [add_param_1+24];
	ld.param.f32 	%f7, [add_param_1+28];
	ld.param.f32 	%f8, [add_param_1+32];
	ld.param.f32 	%f9, [add_param_1+4];
	add.u64 	%rd3, %SP, 0;
	or.b64  	%rd4, %rd3, 4;
	st.f32 	[%rd4], %f9;
	st.f32 	[%SP+32], %f8;
	st.f32 	[%SP+28], %f7;
	st.f32 	[%SP+24], %f6;
	st.f32 	[%SP+20], %f5;
	st.f32 	[%SP+16], %f4;
	st.f32 	[%SP+12], %f3;
	st.f32 	[%SP+8], %f2;
	st.f32 	[%SP+0], %f1;
	cvta.to.global.u64 	%rd5, %rd1;
	cvta.global.u64 	%rd6, %rd5;
	st.u64 	[%SP+40], %rd6;
	ld.f32 	%f10, [%SP+20];
	ld.u64 	%rd7, [%SP+40];
	st.f32 	[%rd7], %f10;
	ret;

}
Here's the different compiler versions I'm using
$ rustc +nightly --version
rustc 1.61.0-nightly (68369a041 2022-02-22)
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
$ clang++ --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

I would be happy to try to figure out whats going on if anyone can confirm that this is a bug? If anyone would be able to point me in the right direction that would be a nice bonus as well.

@pnkfelix
Copy link
Member

Visiting as part of a T-compiler backlog bonanza prepass.

This at very least is in a "needs summary" state. But based on @RDambrosio016 's comment from 10 days ago, there may even be design concerns? (At least depending on what scope of potential parameter types we want this ABI to cover at the outset.)

@rustbot label: S-tracking-design-concerns S-tracking-needs-summary

@rustbot rustbot added S-tracking-design-concerns Status: There are blocking design concerns. S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. labels Mar 17, 2022
@kjetilkjeka
Copy link
Contributor

@pnkfelix I'm currently working on fixing bugs in ptx-kernel in #94703 and would like to help out providing a summary. I have also discovered some design concerns that I can contribute to the summary.

Do you have a template or example of how it should look like?

@kjetilkjeka
Copy link
Contributor

kjetilkjeka commented Mar 25, 2022

Here's a suggestion for an update to the tracking issue to include concerns. Partially copied for japaric's original post and added concerns from and links to relevant issues.

If you have the possibility you should take a look @RDambrosio016


Feature gate #![feature(abi_ptx)]

This ABI is intended to be used when generating code for device (GPU) targets like nvptx64-nvidia-cuda. It is used to generate kernels ("global functions") that work as an entry point from host (cpu) code. Functions that do not use the "ptx-kernel" ABI are "device functions" and only callable from kernels and device functions. Device functions are specifically not usable from host (cpu) code.

Public API

The following code

#![no_std]
#![feature(abi_ptx)]

#[no_mangle]
pub extern "ptx-kernel" fn foo() {}

Produces

.version 3.2
.target sm_30
.address_size 64

	// .globl	foo

.visible .entry foo()
{
	ret;
}

Steps / History

Unresolved Questions

  • Resolve what kind of stability guarantees can be made about the generated ptx.
    • The ABI of kernels have been previously changed for a major version bump and the ptx-interoperability doc is still outdated.
    • PTX is an ISA with many versions. The newest is major version 7. Do we need to reserve the possibility of breaking things when moving to a new major version?
    • Figure out what llvm does in relations to the nvptx64-nvidia-cuda target and the __global__ modifier.
  • What kind of types should be allowed to use as arguments in kernels. Should it be a hard error to use these types or only a warning (Global and device kernels are unsound rust-cuda/wg#11)
    • The most important part is to find a minimal but useful subset of Rust types that can be used in kernels. raw pointers, primitive types and #[repr(C)] types seems like a good start (no slices, tuples, references, etc).
    • Using mutable references is almost certain UB except for a few unusable special cases (spawning a single thread only)
    • There are many convenient types in Rust which do not have a stable ABI (&[T], (T, U), etc). Are there some types that do not have a stable representation but can be relied on having an identical representation for sequential compilation with a given rustc version? If so are there any way we could pass them safely between host and device code compiled with the same rustc version?
  • This unstable feature is one of the last stoppers to using nvptx64-nvidia-cuda on stable Rust. The target seems to still have a few bugs (NVPTX backend metabug #38789). Should this feature be kept unstable to avoid usage of nvptx64-nvidia-cuda until it has been verified to be usable.
  • How should shared be supported? Is it necessary to do that from the go?

Notes

  • It is not possible to emulate kernels with #[naked] functions as the .entry directive needs to be emited for nvptx kernels.

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Apr 26, 2022
… r=nagisa

Fix codegen bug in "ptx-kernel" abi related to arg passing

I found a codegen bug in the nvptx abi related to that args are passed as ptrs ([see comment](rust-lang#38788 (comment))), this is not as specified in the [ptx-interoperability doc](https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/) or how C/C++ does it. It will also almost always fail in practice since device/host uses different memory spaces for most hardware.

This PR fixes the bug and add tests for passing structs to ptx kernels.

I observed that all nvptx assembly tests had been marked as [ignore a long time ago](rust-lang#59752 (comment)). I'm not sure if the new one should be marked as ignore, it passed on my computer but it might fail if ptx-linker is missing on the server? I guess this is outside scope for this PR and should be looked at in a different issue/PR.

I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existing 32bit target. Even though 32bit nvptx is not a supported target there are still some code under the hood supporting codegen for 32 bit ptx. I was advised to create an MCP to find out if this code should be removed or updated.

Perhaps `@RDambrosio016` would have interest in taking a quick look at this.
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Apr 26, 2022
… r=nagisa

Fix codegen bug in "ptx-kernel" abi related to arg passing

I found a codegen bug in the nvptx abi related to that args are passed as ptrs ([see comment](rust-lang#38788 (comment))), this is not as specified in the [ptx-interoperability doc](https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/) or how C/C++ does it. It will also almost always fail in practice since device/host uses different memory spaces for most hardware.

This PR fixes the bug and add tests for passing structs to ptx kernels.

I observed that all nvptx assembly tests had been marked as [ignore a long time ago](rust-lang#59752 (comment)). I'm not sure if the new one should be marked as ignore, it passed on my computer but it might fail if ptx-linker is missing on the server? I guess this is outside scope for this PR and should be looked at in a different issue/PR.

I only fixed the nvptx64-nvidia-cuda target and not the potential code paths for the non-existing 32bit target. Even though 32bit nvptx is not a supported target there are still some code under the hood supporting codegen for 32 bit ptx. I was advised to create an MCP to find out if this code should be removed or updated.

Perhaps ``@RDambrosio016`` would have interest in taking a quick look at this.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Oct 27, 2023
…exception, r=workingjubilee,RalfJung

NVPTX: Allow PassMode::Direct for ptx kernels for now

Upgrading the nvptx toolchain to the newest nightly makes it hit the assert that links to rust-lang#115666

It seems like most targets get around this by using `PassMode::Indirect`. That is impossible for the kernel as it's not a normal call, but instead the arguments are copied from CPU to GPU and the passed pointer would be invalid when it reached the GPU.

I also made an experiment with `PassMode::Cast` but at least the most simple version of this broke the assembly API tests.

I added  fixing the pass mode in my unofficial tracking issue list (I do not have the necessary permissions to update to official one). rust-lang#38788 (comment)

Since the ptx_abi is currently unstable and have been working with `PassMode::Direct` for more than a year now, the steps above is hopefully sufficient to enable it as an exception until I can prioritize to fix it. I'm currently looking at steps to enable the CI for nvptx64 again and would prefer to finish that first.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Oct 27, 2023
Rollup merge of rust-lang#117247 - kjetilkjeka:nvptx_direct_passmode_exception, r=workingjubilee,RalfJung

NVPTX: Allow PassMode::Direct for ptx kernels for now

Upgrading the nvptx toolchain to the newest nightly makes it hit the assert that links to rust-lang#115666

It seems like most targets get around this by using `PassMode::Indirect`. That is impossible for the kernel as it's not a normal call, but instead the arguments are copied from CPU to GPU and the passed pointer would be invalid when it reached the GPU.

I also made an experiment with `PassMode::Cast` but at least the most simple version of this broke the assembly API tests.

I added  fixing the pass mode in my unofficial tracking issue list (I do not have the necessary permissions to update to official one). rust-lang#38788 (comment)

Since the ptx_abi is currently unstable and have been working with `PassMode::Direct` for more than a year now, the steps above is hopefully sufficient to enable it as an exception until I can prioritize to fix it. I'm currently looking at steps to enable the CI for nvptx64 again and would prefer to finish that first.
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Oct 28, 2023
…, r=workingjubilee,RalfJung

NVPTX: Allow PassMode::Direct for ptx kernels for now

Upgrading the nvptx toolchain to the newest nightly makes it hit the assert that links to rust-lang/rust#115666

It seems like most targets get around this by using `PassMode::Indirect`. That is impossible for the kernel as it's not a normal call, but instead the arguments are copied from CPU to GPU and the passed pointer would be invalid when it reached the GPU.

I also made an experiment with `PassMode::Cast` but at least the most simple version of this broke the assembly API tests.

I added  fixing the pass mode in my unofficial tracking issue list (I do not have the necessary permissions to update to official one). rust-lang/rust#38788 (comment)

Since the ptx_abi is currently unstable and have been working with `PassMode::Direct` for more than a year now, the steps above is hopefully sufficient to enable it as an exception until I can prioritize to fix it. I'm currently looking at steps to enable the CI for nvptx64 again and would prefer to finish that first.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html S-tracking-design-concerns Status: There are blocking design concerns. S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants