Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more SIMD platform-intrinsics #117953

Merged
merged 1 commit into from
Dec 9, 2023
Merged

Conversation

farnoy
Copy link
Contributor

@farnoy farnoy commented Nov 15, 2023

  • simd_masked_load
    • LLVM codegen - llvm.masked.load
    • cranelift codegen - implemented but untested
  • simd_masked_store
    • LLVM codegen - llvm.masked.store
    • cranelift codegen

Also added a run-pass test to test both intrinsics, and additional build-fail & check-fail to cover validation for both intrinsics

@farnoy farnoy changed the title Add more SIMD platform-intrinsic Add more SIMD platform-intrinsics Nov 15, 2023
@rustbot
Copy link
Collaborator

rustbot commented Nov 15, 2023

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @b-naber (or someone else) soon.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Nov 15, 2023
@farnoy
Copy link
Contributor Author

farnoy commented Nov 15, 2023

r? rust-lang/project-portable-simd

@rustbot
Copy link
Collaborator

rustbot commented Nov 15, 2023

Failed to set assignee to miguelraz: invalid assignee

Note: Only org members with at least the repository "read" role, users with write permissions, or people who have commented on the PR may be assigned.

@farnoy
Copy link
Contributor Author

farnoy commented Nov 15, 2023

r? @calebzulawski @programmerjake

@rustbot
Copy link
Collaborator

rustbot commented Nov 15, 2023

Failed to set assignee to calebzulawski: invalid assignee

Note: Only org members with at least the repository "read" role, users with write permissions, or people who have commented on the PR may be assigned.

@Noratrieb
Copy link
Member

r? @workingjubilee as a SIMD person that has bors perms :D

@rustbot rustbot assigned workingjubilee and unassigned b-naber Nov 16, 2023
@farnoy farnoy marked this pull request as ready for review November 17, 2023 12:50
@rustbot
Copy link
Collaborator

rustbot commented Nov 17, 2023

Some changes occurred in compiler/rustc_codegen_cranelift

cc @bjorn3

@farnoy
Copy link
Contributor Author

farnoy commented Nov 17, 2023

Marking this as ready for review. I tested it end to end in my project, with changes to the portable-simd crate linked above. It's been working well.

@farnoy
Copy link
Contributor Author

farnoy commented Nov 19, 2023

@RalfJung Letting you know about a new SIMD intrinsic because of rust-lang/miri#1912 (comment)

@RalfJung
Copy link
Member

RalfJung commented Nov 19, 2023

Please add somewhere documentation for the behavior of these intrinsics -- in particular documentation for any cases where it is UB.

@RalfJung
Copy link
Member

RalfJung commented Nov 19, 2023

I guess the question is where to add those docs... it's a bit annoying that the extern block that imports all these intrinsics is not in the same repository where they are defined :/ Maybe library/portable-simd/crates/core_simd/src/intrinsics.rs should be moved into the rustc repo? That would certainly be the natural place to put these docs.

@farnoy
Copy link
Contributor Author

farnoy commented Nov 19, 2023

I'll add it to portable-simd for now, then? This will be an interesting one for Miri because memory is not accessed when the corresponding mask element is disabled. For example, it's legal to use a memory address that is invalid if the mask is all zeros.

@RalfJung
Copy link
Member

RalfJung commented Nov 19, 2023

Yes that sounds very important to describe precisely. I assume you ensured that all codegen backends are implementing this in a way that matches that spec (i.e., avoids the UB when the mask is zero).

@farnoy
Copy link
Contributor Author

farnoy commented Nov 19, 2023

Yes, LLVM describes it this way:

The memory addresses corresponding to the “off” lanes are not accessed.

And for Cranelift, I implemented it with a scalar loop and guarded the memory access by an if statement.

@farnoy
Copy link
Contributor Author

farnoy commented Nov 29, 2023

Hi @workingjubilee , is there anything I can do to help expedite the review process?

Copy link
Member

@workingjubilee workingjubilee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I had like 200 notifications.

Now I have <100.

I agree with Ralf that the documentation for these intrinsics ought to live inside this repo. That we more-meticulously documented them is more just an accident of no one else beating us to it. A brief sentence and then "this corresponds roughly to XYZ instruction" is fine for now, we can coalesce them later.

compiler/rustc_codegen_cranelift/src/intrinsics/simd.rs Outdated Show resolved Hide resolved
compiler/rustc_codegen_llvm/src/intrinsic.rs Outdated Show resolved Hide resolved
@farnoy
Copy link
Contributor Author

farnoy commented Dec 3, 2023

@workingjubilee Thanks, I've pushed out a commit addressing your comments.

@farnoy
Copy link
Contributor Author

farnoy commented Dec 7, 2023

@workingjubilee OK I think that should cover it, I've added a check-fail test to verify return types and a build-fail test to verify the argument validation.

What's the final verdict on the order of these arguments? I think we want (mask, pointer, values) and like you said, this aligns with the simd_select taking the mask first and then "using" the later argument when the mask lane is true, and the last argument for the fallback value when it's false?

@workingjubilee
Copy link
Member

Yes, that ordering is good by me.

Hypothetically you could use this to argue that the proper order is

  • simd_masked_load(mask, pointer, values)
  • simd_masked_store(mask, values, pointer)

But I think (mask, pointer, values) for both is also fine and consistent.

@farnoy
Copy link
Contributor Author

farnoy commented Dec 8, 2023

@workingjubilee I'm happy with the state of the PR now. The validation coverage feels fine to me, the only downside are the diagnostic messages shown to the user.

I'm fairly sure that other intrinsics (especially the multi-argument ones) also suffer from this, but they don't have build-fail tests where we could see that directly.

I don't think this PR regresses on what we already have, I'm just highlighting a deficiency that already existed.

@workingjubilee
Copy link
Member

Probably! This is almost certainly an inadequately tested part of the compiler.

In any case, it is fine for monomorphization errors to be gross, or even simply unhelpful, we just want them to be correct if we are emitting them at all.

@farnoy
Copy link
Contributor Author

farnoy commented Dec 8, 2023

Are we good to merge then, or is there something else we should do in this PR?

Copy link
Member

@workingjubilee workingjubilee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good now modulo a nit.

A test seems to be absent?
Please rebase to minimize the intervening history a bit.

r=me with that added.

@workingjubilee
Copy link
Member

@bors delegate=farnoy

@bors
Copy link
Contributor

bors commented Dec 9, 2023

✌️ @farnoy, you can now approve this pull request!

If @workingjubilee told you to "r=me" after making some further change, please make that change, then do @bors r=@workingjubilee

This maps to the LLVM intrinsics: llvm.masked.load and llvm.masked.store
@farnoy farnoy force-pushed the masked-load-store branch from e271119 to 97ae509 Compare December 9, 2023 11:40
@farnoy
Copy link
Contributor Author

farnoy commented Dec 9, 2023

@bors r=@workingjubilee

@bors
Copy link
Contributor

bors commented Dec 9, 2023

📌 Commit 97ae509 has been approved by workingjubilee

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 9, 2023
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 9, 2023
…llaumeGomez

Rollup of 6 pull requests

Successful merges:

 - rust-lang#117953 (Add more SIMD platform-intrinsics)
 - rust-lang#118057 (dedup for duplicate suggestions)
 - rust-lang#118638 (More `rustc_mir_dataflow` cleanups)
 - rust-lang#118702 (Strengthen well known check-cfg names and values test)
 - rust-lang#118734 (Unescaping cleanups)
 - rust-lang#118766 (Lower some forgotten spans)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit c57b054 into rust-lang:master Dec 9, 2023
11 checks passed
@rustbot rustbot added this to the 1.76.0 milestone Dec 9, 2023
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Dec 9, 2023
Rollup merge of rust-lang#117953 - farnoy:masked-load-store, r=workingjubilee

Add more SIMD platform-intrinsics

- [x] simd_masked_load
  - [x] LLVM codegen - llvm.masked.load
  - [x] cranelift codegen - implemented but untested
- [ ] simd_masked_store
  - [x] LLVM codegen - llvm.masked.store
  - [ ] cranelift codegen

Also added a run-pass test to test both intrinsics, and additional build-fail & check-fail to cover validation for both intrinsics
@RalfJung RalfJung mentioned this pull request Dec 10, 2023
bors added a commit to rust-lang/miri that referenced this pull request Dec 10, 2023
Rustup

Pulls in rust-lang/rust#117953 (in preparation for implementing those intrinsics)
// CHECK-LABEL: @load_f32x2
#[no_mangle]
pub unsafe fn load_f32x2(mask: Vec2<i32>, pointer: *const f32,
values: Vec2<f32>) -> Vec2<f32> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of "values" here? It seems strange that a load is told the values to load...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the fallback/passthrough values used for lanes that are masked off, so have the corresponding bits in mask all zeros

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the name values doesn't really convey that. Thanks!

#[no_mangle]
pub unsafe fn store_f32x2(mask: Vec2<i32>, pointer: *mut f32, values: Vec2<f32>) {
// CHECK: call void @llvm.masked.store.v2f32.p0(<2 x float> {{.*}}, ptr {{.*}}, i32 {{.*}}, <2 x i1> {{.*}})
simd_masked_store(mask, pointer, values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the alignment requirements on the pointer for the store to be valid? (And same question for the load.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pointer should be aligned to the individual element, not the whole vector. So when loading f32x2, size_of is 8, but alignment of f32 is 4, therefore the pointer should be aligned to 4. This is both for load & store

Copy link
Member

@RalfJung RalfJung Dec 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the mask is all-0, i.e., no load/store actually happens. Is the pointer still required to be aligned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with alignment checking, masked elements never generate an access for the load, thus no hardware implementation requires it. If you have a strong reasoning for arguing otherwise, we could make it UB regardless?

Otherwise: no, it's a non-event, like unsafe { invalid_ptr.add(0) }.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw x86-64 for these doesn't ever generate an alignment check exception even if you turn alignment checking on, while the others are explicit that the logic is that predication prevents the access, therefore the alignment check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe everything that's been raised so far has been addressed. Thanks @RalfJung and apologies if my answers earlier were unhelpful. I wasn't aware that the alignment requirement is only relaxed through the from_array/to_array path, relying on the optimizer to elide those later.

Copy link
Member

@RalfJung RalfJung Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@farnoy can you link to the PR(s) that address this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly I've been asking for documentation. #118853 addresses that, please double-check there that the docs for these new intrinsics match what you implemented now. :)


// Alignment of T, must be a constant integer value:
let alignment_ty = bx.type_i32();
let alignment = bx.const_i32(bx.align_of(values_ty).bytes() as i32);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be bx.align_of(values_elem).bytes() not values_ty. Technically, the LLVM intrinsic accepts any power of two as alignment, so we could relax this down to 1.

https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html#packed-simd-vector-types

Am I understanding this correctly or does it not make a difference @workingjubilee?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having checked the Arm ARM, RISCV V spec, and Intel SDM, we should continue to require element alignment at least. We can expect any masked-load-supporting implementation to support masked loads and stores on element boundaries. However, they may reject accesses on byte boundaries (except where the element is u8, of course), or they may just implement the load/store inefficiently if unaligned, which is honestly about as bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are definitely reasons to do unaligned loads (deserialization, packet framing, etc) that benefit from vectorization, so I think we should at least make writing {}_unaligned versions possible. Though I can't remember, what's the requirement for scatter/gather?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, element alignment.

Copy link
Member

@workingjubilee workingjubilee Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can adjust them to take a const parameter for alignment, if we want, I guess?

Copy link
Member

@RalfJung RalfJung Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only guess the intent, but I am fairly sure about what Simd::load currently does. :)

If it is intended to allow entirely unaligned pointers, it should be implemented as ptr.cast::<Self>().read_unaligned(). Though I guess that would run into issues with non-power-of-2 vectors as the comment indicates, but those aren't properly supported currently anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened rust-lang/portable-simd#382 for the load question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open a PR to fix the alignment passed down to LLVM IR. The optimizer converts the masked load to an umasked, aligned load because the alignment is too high

https://rust.godbolt.org/z/KEeGbevbb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with element alignment for now, does this require a code change? The documentation makes it seem like bx.align_of(values_ty) is not guaranteed to equal bx.align_of(values_elem) (implementation-defined), so this line should probably be corrected?

Correct.

I don't have opinions either way here. It's just important that we pick a rule and then apply it consistently everywhere. (Which is also IMO why we shouldn't land new intrinsics without documentation of their exact safety requirements. Just because portable SIMD intrinsics got a pass on this in the past doesn't mean we should continue doing that.)

I certainly wasn't intending on being freewheeling here, or I would not have asked a zillion small questions myself. :^) The problem is slightly of the "remembering all the questions I should be asking" nature, especially when there has already been multiple rounds of review.

I guess the main list of questions to keep in mind is:

  1. For memory (i.e. pointer) operations: What is the alignment asserted by the op? (And for SIMD types, you should almost always be picking the alignment of the element or the alignment of the vector.)
  2. For memory operations, is there a condition in which that alignment is not asserted by the intrinsic? (Also, please answer no, especially if there is any doubt about how LLVM will interpret it.)
  3. For memory operations, what happens if you operate on uninit memory? It's UB, right?
  4. For masked operations, what happens for masks derived from alternating [true, false, ..]? (i.e. "what is the normal behavior?")
  5. For masked operations, what happens on all-1s? Is there any special event, or does it strengthen any assertions?
  6. For masked operations, what happens on all-0s? Is there any special event, or does it weaken any assertions?
  7. For indexing operations, what happens when your index is "out-of-bounds"?
  8. Sooo... have you checked the LLVMIR for the answers to all of the above questions? How about after optimizations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For memory operations, what happens if you operate on uninit memory? It's UB, right?

Note that we do not ever treat uninit memory specially for memory operations in Rust. The UB on uninit derives entirely from the fact that uninit memory violates the validity invariant of most types. So the question that should be asked is: is this an untyped operation, working on raw abstract machine bytes, or is it a typed operation, and at which type?

workingjubilee added a commit to workingjubilee/rustc that referenced this pull request Dec 13, 2023
…workingjubilee

Fix alignment passed down to LLVM for simd_masked_load

Follow up to rust-lang#117953

The alignment for a masked load operation should be that of the element/lane, not the vector as a whole

It can produce miscompilations after the LLVM optimizer notices the higher alignment and promotes this to an unmasked, aligned load followed up by blend/select - https://rust.godbolt.org/z/KEeGbevbb
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Dec 13, 2023
Rollup merge of rust-lang#118864 - farnoy:masked-load-store-fixes, r=workingjubilee

Fix alignment passed down to LLVM for simd_masked_load

Follow up to rust-lang#117953

The alignment for a masked load operation should be that of the element/lane, not the vector as a whole

It can produce miscompilations after the LLVM optimizer notices the higher alignment and promotes this to an unmasked, aligned load followed up by blend/select - https://rust.godbolt.org/z/KEeGbevbb
@workingjubilee workingjubilee added A-SIMD Area: SIMD (Single Instruction Multiple Data) PG-portable-simd Project group: Portable SIMD (https://github.com/rust-lang/project-portable-simd) labels Dec 14, 2023
RalfJung pushed a commit to RalfJung/rust that referenced this pull request Dec 17, 2023
Rustup

Pulls in rust-lang#117953 (in preparation for implementing those intrinsics)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-SIMD Area: SIMD (Single Instruction Multiple Data) PG-portable-simd Project group: Portable SIMD (https://github.com/rust-lang/project-portable-simd) S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants