Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISCV] Vector saturating truncation not using VNCLIP #73424

Closed
boomshroom opened this issue Nov 26, 2023 · 7 comments · Fixed by #75145
Closed

[RISCV] Vector saturating truncation not using VNCLIP #73424

boomshroom opened this issue Nov 26, 2023 · 7 comments · Fixed by #75145

Comments

@boomshroom
Copy link

Closely related to #68466 and #58266, but for a different platform.

Much like x86's (V)PACK(U,S)S and ARM's (U,S)QXTN(2), RISC-V's V extension includes VNCLIP(U).W(V,X,I), which performs a right shift and then performs either signed or unsigned saturation to vector of integers with half the width. With a shift value of 0 (whether from an immediate or the register; the V extension specification doesn't specifically talk about this use case), the instruction can be used to perform a saturating truncation from vectors of one integer type, the the type with half the width.

Given the complexity of matching saturating truncation, and the apparent duplication between targets of this functionality, it could be worth considering matching it earlier in the pipeline. Matching a saturating truncation preceded by a shift should be much simpler and can be done with TableGen.

Example code with current code gen, next to x86 and AArch64 equivalents.

https://rust.godbolt.org/z/bKdsq8GGM

@llvmbot
Copy link
Member

llvmbot commented Nov 26, 2023

@llvm/issue-subscribers-backend-risc-v

Author: Angelo Bulfone (boomshroom)

Closely related to #68466 and #58266, but for a different platform.

Much like x86's (V)PACK(U,S)S and ARM's (U,S)QXTN(2), RISC-V's V extension includes VNCLIP(U).W(V,X,I), which performs a right shift and then performs either signed or unsigned saturation to vector of integers with half the width. With a shift value of 0 (whether from an immediate or the register; the V extension specification doesn't specifically talk about this use case), the instruction can be used to perform a saturating truncation from vectors of one integer type, the the type with half the width.

Given the complexity of matching saturating truncation, and the apparent duplication between targets of this functionality, it could be worth considering matching it earlier in the pipeline. Matching a saturating truncation preceded by a shift should be much simpler and can be done with TableGen.

Example code with current code gen, next to x86 and AArch64 equivalents.

https://rust.godbolt.org/z/bKdsq8GGM

@sun-jacobi
Copy link
Member

sun-jacobi commented Dec 7, 2023

@boomshroom Hi, can I work on this issue?

@boomshroom
Copy link
Author

boomshroom commented Dec 7, 2023

Sure. It probably wouldn't be too hard to copy and paste the logic from one of the several other targets that match this pattern, though the fact that it does seem to be duplicated so many times makes me suspicious that doing so is really the best idea. I personally think adding a target-independent "saturating truncation" node would be better, though I'm less sure how to go about that and it would entail a larger change to the overall infrastructure.

@sun-jacobi
Copy link
Member

@boomshroom Can you assign this issue to me?

@dtcxzyw dtcxzyw assigned dtcxzyw and sun-jacobi and unassigned dtcxzyw Dec 9, 2023
@sun-jacobi
Copy link
Member

sun-jacobi commented Dec 11, 2023

The previous pattern matching around the trunc mostly uses vnsrl.
I think using vnsrl might be a better choice, since vnclip may bring new complexity due to the vxrm.

@boomshroom
Copy link
Author

If the shift is 0, then it should never have to round and thus vxrm shouldn't pose an issue.

Also vnsrl always floors when inexact and wraps when the value is out of range. Saturating when out of range is what vnclip does, like the other instructions on other architectures do. Worth noting is that it only saturates unsigned to unsigned and signed to signed, so a min or max is still required to reduce the range to what's representable in the desired signedness.

I still wonder why such an operation that seems fairly useful, relatively common in ISAs, and is non-trivial to match, doesn't have a more target-independent representation in LLVM. Could save some work matching the patterns before lowering to individual ISAs and reduce the duplication between them.

sun-jacobi added a commit to sun-jacobi/llvm-project that referenced this issue Dec 29, 2023
Fixes llvm#73424.

If the range created by a min and max is precisely the range of trunc target,
the min/max could be removed.
ChunyuLiao pushed a commit that referenced this issue Dec 29, 2023
This patch closed #73424, which is also a missed-optimization case
similar to #68466 on X86.

## Source Code
```
define void @trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load <8 x i16>, ptr %x, align 16
  %2 = tail call <8 x i16> @llvm.smax.v8i16(<8 x i16> %1, <8 x i16> <i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128>)
  %3 = tail call <8 x i16> @llvm.smin.v8i16(<8 x i16> %2, <8 x i16> <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>)
  %4 = trunc <8 x i16> %3 to <8 x i8>
  store <8 x i8> %4, ptr %y, align 8
  ret void
}
```
## Before this patch: 
```
trunc_sat_i8i16:                  # @trunc_maxmin_id_i8i16
        vsetivli        zero, 8, e16, m1, ta, ma
        vle16.v v8, (a0)
        li      a0, -128
        vmax.vx v8, v8, a0
        li      a0, 127
        vmin.vx v8, v8, a0
        vsetvli zero, zero, e8, mf2, ta, ma
        vnsrl.wi        v8, v8, 0
        vse8.v  v8, (a1)
        ret
```

## After this patch: 
```
trunc_sat_i8i16:                  # @trunc_maxmin_id_i8i16
	vsetivli	zero, 8, e8, mf2, ta, ma
	vle16.v	v8, (a0)
	csrwi	vxrm, 0
	vnclip.wi	v8, v8, 0
	vse8.v	v8, (a1)
	ret
```
@sun-jacobi
Copy link
Member

sun-jacobi commented Jan 18, 2024

I'm planning to add a new opcode ISD::TRUNCSAT to SelectionDAG, and do a target-independent combine for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants