Initial Draft for Vector SIMD Codegen Enhancements #268
CC. @JulieLeeMSFT, @jeffhandley. Also CC. @davidwrighton for CG2/R2R considerations. CC. @BruceForstall and @dakersnar, who are part of the working group.
cc @dotnet/jit-contrib.

In this design document, we propose to extend `Vector<T>` to serve as a vessel for frictionless SIMD adoption, both internal to .NET libraries, and to external .NET developers. As a realization of this goal, we propose the following:

1. Upgrading `Vector<T>` to serve as a sufficiently powerful interface for writing both internal hardware accelerated libraries and external developer code.
Does this approach rely on JIT to work with the "template" or Source Generators (sounds much much easier to implement)
I have less experience with the source generators as a concept, so when I wrote this, I envisioned the JIT took care of this.
In the spirit of the document, would a source generator approach require the developer to "regenerate" their code as future ISAs become available/implemented, or are the source generators able to be tightly integrated into the pipeline that it's not necessarily a burden for the developer to do so?
"source generators" in the Roslyn sense are components that plug directly into the compiler; they're handed all the information the compiler has about the source in the project and are able to add code to the compilation unit. A key limitation is they can't replace code, only add code, at least today (and the expectation is that even if some replacement is enabled in the future, it'll be very constrained). One of the most common forms of this is writing a partial class or method, often with an attribute applied to it, and the generator then fills in the implementation. You can see this, for example, with the JsonSerialization generator that shipped in .NET 6, where a developer just writes a partial class attributed in a certain way, and the generator emits into it all of the logic for serializing the relevant types, or the new LibraryImport and RegexGenerators in .NET 7, where the developer writes a partial method and the generator fills in the implementation of that method.
ComputeSharp by @Sergio0694 does something that is conceptually similar in that it takes C# code and generates code that can run on the GPU.
I had also prototyped a very basic source generator a while back which handles directive driven vectorization.
For vectorized code we need a software fallback for when vectorization isn't supported, so it's feasible for a dev to do something like:
public static partial int Sum(ReadOnlySpan<int> values);
[Vectorize(nameof(Sum), ...)]
private static int Sum_SoftwareFallback(ReadOnlySpan<int> values)
{
// ...
}
The generator can then process `Sum_SoftwareFallback` and provide an implementation of the public `Sum` method powered by `Vector64/128/256/512<T>` or `Vector<T>`.
It would be quite a bit of work to enable, but would have some benefits over typical "auto-vectorization" approaches. An analyzer that looks for potentially vectorizable code and suggests a refactoring to put it in the above shape would also be feasible to help suggest to users where vectorization is feasible.
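To make the shape above concrete, here is a minimal hand-written sketch of the kind of `Sum` body such a generator might emit. The `[Vectorize]` attribute and the generator itself are hypothetical and do not appear in the code; the class name `SumExample` is an assumption for illustration. Only existing `System.Numerics` APIs are used.

```csharp
using System;
using System.Numerics;

public static class SumExample
{
    // Public entry point: roughly what a hypothetical generator could
    // produce from Sum_SoftwareFallback. Written by hand here.
    public static int Sum(ReadOnlySpan<int> values)
    {
        // Too little data or no hardware acceleration: use the scalar path.
        if (!Vector.IsHardwareAccelerated || values.Length < Vector<int>.Count)
        {
            return Sum_SoftwareFallback(values);
        }

        Vector<int> acc = Vector<int>.Zero;
        int i = 0;
        int lastBlock = values.Length - Vector<int>.Count;

        // Main loop: one full Vector<int> per iteration.
        for (; i <= lastBlock; i += Vector<int>.Count)
        {
            acc += new Vector<int>(values.Slice(i, Vector<int>.Count));
        }

        // Horizontal add of the accumulator, then the scalar tail.
        return Vector.Sum(acc) + Sum_SoftwareFallback(values.Slice(i));
    }

    // The scalar algorithm the developer would write by hand.
    private static int Sum_SoftwareFallback(ReadOnlySpan<int> values)
    {
        int sum = 0;
        foreach (int v in values)
        {
            sum += v;
        }
        return sum;
    }
}
```

Calling `SumExample.Sum(new[] { 1, 2, 3 })` takes the scalar path on most hardware, since the input is shorter than one vector; longer inputs go through the vector loop and finish with the scalar tail.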
In this sense, we are really talking about source-level auto-vectorization as opposed to JIT-level template generation, correct?
Source-level directive-driven vectorization (rather than auto-vectorization). It could still be "template driven" (rather than relying on analyzing scalar code patterns) if we felt that was the best approach, and there are multiple options/possibilities (particularly when viewed in combination with the IL trimmer/linker).
That is, in all cases (involving a source generator) it requires some level of "user opt-in" (such as via a `Vectorize` attribute the generator recognizes). Note that we could provide some form of assistance here by recognizing vectorizable patterns and suggesting that the user add the `Vectorize` attribute.
Whether we then do the vectorization based on some `Vector<T>` template (including generating a scalar path) or recognize a scalar algorithm and convert it to `Vector64/128/256/512<T>` or `Vector<T>` code is something we could do either way.

#### 2. PGO Codegen from `Vector<T>`

We propose to introduce a `#[Vectorize]` attribute which instructs the JIT to dynamically profile a method to select an optimal length for `Vector<T>`. Returning to the example above, we add `#[Vectorize]` to the `NarrowUtf16ToAscii` method like so:
Do we really need this attribute? I think it can be just an API e.g.
if (RuntimeHelpers.MostlyTrue(len > 128)) // JIT will insert a probe for the argument condition
{
}
else
{
// this block will be eliminated by jit
}
and teach the JIT to probe the argument of `MostlyTrue` in tier0 for PGO. We don't currently have infrastructure for that, but it shouldn't be hard to implement and it opens opportunities for other optimizations (e.g. recognizing never-negative signed types).
Agreed. I have commented below.
In order to perform profile guided optimization for methods annotated with `#[Vectorize]`, the JIT must detect which method parameters to sample and what thresholds should trigger recompilation based on those sample points.
##### Detecting Instrumentation Points and Thresholds

The presence of a `#[Vectorize]` attribute instructs the JIT to perform a dependence analysis upon first encountering the method to determine what method parameters to sample.
A probing scheme that requires data flow is going to be a challenge.
The current implementation of PGO instruments at Tier0 where there is no ability to do any sort of data flow.
As @EgorBo noted above it might be more feasible to do this sort of analysis either by hand at the source level or within Roslyn and express the results as intrinsic calls.
Agreed. I have commented below.

##### Transitive Method Codegen

As both proposals allow `Vector<T>` to be specialized per-method, we cannot simply pass `Vector<T>` as an argument to helper methods as before. To address this issue, we propose an additional `[Vectorizeable]` attribute which allows the JIT to specialize the method per selected vector width if used in a `Vectorize.If` or `[Vectorize]`.
This specialization would likely have to be driven by VM -- the JIT has no way of producing multiple method bodies per method.
This raises the question of how we would reconcile this with the current requirement that all `Vector<T>` instances within a given runtime instance are the same size.
I'd rather expect `Vector<T>` to be lowered to Vector128, Vector256, Vector512 by some IL processing/SG, which is much simpler to do. And yes, good point about variadic Vector<>: if it escapes the current method or is exposed as an argument then we can't do it.
I think these are fair points: how to reconcile a PGO-driven approach for `Vector<T>` with the current requirement that `Vector<T>` be a chosen size for the duration of a runtime instance is an open question.
@EgorBo part of the reason I came up with the `#[Vectorize]` attribute was for this very reason: if a method has `#[Vectorize]`, its use of `Vector<T>` is of a `Vector<T>` whose size is selected by PGO. Without `#[Vectorize]`, `Vector<T>` will codegen to the size determined by the runtime process, as it currently does.
I think @EgorBo's suggestion of a runtime probe, `RuntimeHelpers.MostlyTrue`, as a way to mitigate the need for a dataflow analysis for PGO, and the `#[Vectorize]` attribute can probably co-exist. The former drives where the probes happen; the latter indicates that this `Vector<T>` is selected per-method, not per-process.
> variadic Vector<> - if it escapes current method or exposed as an argument then we can't do it

Noting that `VectorMask<T>` will have similar considerations, `SVE` for `Arm64` as well, where the technical description disallows its usage as a field and various other scenarios.
`Vector<T>` is our current "best fit" for SVE on Arm64, and it would be good if we can continue using it there and here to solve the Vector128/256/512 considerations for x64 as well.
If we can make it so that dealing with leading/trailing elements is efficient, the exact size of the Vector isn't a huge concern, as we no longer lose vectorization on "small data" when the hardware register size increases.
We then just need to consider the scenario where the "largest register" isn't the "best choice" for a given scenario and where using a smaller vector would be better. We could potentially handle this with block/loop cloning, special handling in the JIT/tiered compilation process, or one of several other ways. The biggest "pit of failure" then becomes how `Vector<T>` behaves when encountered as a field, pointer, or when passed/returned between methods.
In particular, SVE limits it accordingly:
Because of their unknown size at compile time, SVE types must not be used:
- to declare or define a static or thread-local storage variable
- as the type of an array element
- as the operand to a new expression
- as the type of object deleted by a delete expression
- as the argument to sizeof and _Alignof
- with pointer arithmetic on pointers to SVE objects (this affects the +, -, ++, and -- operators)
- as members of unions, structures and classes
- in standard library containers like std::vector.
Naturally, due to back-compat, `Vector<T>` violates most/all of these, but some of them don't matter as much in the context of a JIT environment, only for an AOT environment. But if we can resolve how to handle many of these such that the JIT can enable selecting `Vector128/256/512<T>` as the backing "per method" (with limitations), then we can likely use an analyzer and other functionality to help drive users toward success.
```C#
for (int i = 0; i < s1.Length; i++)
{
    if (s1[i] < 0 && s2[i])
```
Was this supposed to be `if (s1[i] < s2[i])`?
Yes, it was meant to be `if (s1[i] < s2[i])`.

Where `VectorMask<T>` expresses that the number of elements the condition applies to is variable-length, and determined by the JIT at runtime (though it must be compatible with `Vector<T>` selected length).

Lastly, we propose to create a `VectorMask` using builtin C# boolean expressions by passing a lambda to a special `MaskExpr` API:
Is `MaskExpr` going to be actually useful? To me, it seems to be:
- Fairly hard to use: How do users figure out what is a valid argument for `MaskExpr`?
- Fragile: Is it going to work when some future version of C# changes how lambdas are emitted? Is it going to work for F#, which emits different IL for lambdas than C#?
- More limited than the `ByVectorMask` methods, since it seems it can't express e.g. the `v1.LessThanByVectorMask(v2)` case.
When I wrote this, `MaskExpr` represented an idea to move the masking/conditional SIMD processing into an embedded DSL that might be easier to manipulate, particularly for those who are less used to lower-level SIMD processing. To your points:
- We can document this or refine it, so I don't see it as a barrier or hard to use.
- As written it depends on lambdas, but that isn't a strict requirement if there are alternative ideas.
- This isn't necessarily true. For example, `v1.MaskExpr(x => v2.MaskExpr(y => x < y))` could encode "a mask where each element of v1 is less than each element of v2".

Now, all that being said, I am not advocating strongly for this idea, but proposing it as a consideration for the developer from a language design standpoint. Ideally, I see this work as making lower-level SIMD optimization more attainable for more developers, and I feel these kinds of ideas are at least worth thinking on (hence my response to your post below).
I think there are a few issues with `MaskExpr` related to how the JIT operates today and what optimizations it can enable.
I expect we'd have something "simpler" and easier to migrate to if we simply had a `.AsMask()` API which effectively creates a `VectorMask<T>` from the most significant bits of each element.
This would then be:
Vector256<int> vmask = Vector256.GreaterThan(v1, Vector256<int>.Zero) & (v1 != Vector256.Create(5));
VectorMask256<int> mask = vmask.AsMask();
This would be fairly "natural" to translate over from:
int mask = vmask.ExtractMostSignificantBits();
but would work for variable length vectors and would provide the mask as an abstraction rather than strictly as an `int`.
It is somewhat unfortunate that `==`, `!=`, `<`, `<=`, `>`, and `>=` return `bool` rather than `Vector###<T>`, and so we can't express it as `(v1 > Vector256<int>.Zero) && (v1 != Vector256.Create(5))`, but that was done to follow existing .NET guidelines/conventions and since overloading by return type isn't feasible.
We still expect to be able to compare masks directly as well, right? Or do you see that going away with the `AsMask()` API? I.e., would we allow something like...
VectorMask256<int> vmask = Vector256.GreaterThan(v1, Vector256<int>.Zero).AsMask();
VectorMask256<int> vmask2 = Vector256.NotEquals(v1, Vector256.Create(5)).AsMask();
VectorMask256<int> mask = vmask & vmask2;
Edit: I see your comment below now.
> We still expect to be able to compare masks directly as well right

Right, I believe everything you proposed being possible with `VectorMask` will still be possible; it's only a difference in how you get the mask.
Doing:
Vector256<int> vmask = Vector256.GreaterThan(v1, Vector256<int>.Zero);
Vector256<int> vmask2 = Vector256.NotEquals(v1, Vector256.Create(5));
VectorMask256<int> mask = (vmask & vmask2).AsMask();
-or- doing:
VectorMask256<int> mask1 = Vector256.GreaterThan(v1, Vector256<int>.Zero).AsMask();
VectorMask256<int> mask2 = Vector256.NotEquals(v1, Vector256.Create(5)).AsMask();
VectorMask256<int> mask = mask1 & mask2;
should be basically identical; the only difference is when you create the `VectorMask256` type. I'd expect them to be "equally performant" on AVX-512. I'd expect the former to be "more performant" on AVX2 and prior (assuming the JIT handled the operations as specified without other optimizations).
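For comparison, the "vector as mask" half of this discussion can already be written with the `Vector256` APIs that exist today (.NET 7 and later). The sketch below is only illustrative: `VectorMask256<T>` and `AsMask()` are proposals and are not used, the class and method names are assumptions, and "not equal" is spelled as the complement of `Vector256.Equals`.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class MaskExample
{
    // Returns one bit per element, set where the element is > 0 and != 5.
    // Element 0 maps to the least significant bit of the result.
    public static uint PositiveAndNotFive(Vector256<int> v)
    {
        // Each comparison yields a vector whose lanes are all-ones or all-zeros.
        Vector256<int> positive = Vector256.GreaterThan(v, Vector256<int>.Zero);
        Vector256<int> notFive = ~Vector256.Equals(v, Vector256.Create(5));

        // Combine the per-lane results, then collapse them to a bitmask.
        Vector256<int> vmask = positive & notFive;
        return vmask.ExtractMostSignificantBits();
    }
}
```

Under the `AsMask()` proposal, the final `ExtractMostSignificantBits()` call is the step that would instead produce an abstract mask type rather than a raw integer.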

Logically, the lambda passed to `MaskExpr` selects for which elements of `v1` to include in the `VectorMask`, and allows developers to program conditional SIMD logic with familiar boolean condition operations.

### Leading/Trailing Element Processing with `VectorMask<T>`
This is probably outside the scope here, but what I would like to see is a simple, uniform way to process a `Span<T>` of any length using `Vector<T>`. E.g.:
int SumVector(ReadOnlySpan<int> source)
{
Vector<int> vresult = Vector<int>.Zero;
foreach (Vector<int> slice in source.SliceAsVector())
{
vresult += slice;
}
return vresult.Sum();
}
The JIT would then do whatever it needs to make this code efficient (including e.g. templated codegen).
If this is not feasible, maybe one could get close to that using `VectorMask<T>`?
int SumVector(ReadOnlySpan<int> source)
{
Vector<int> vresult = Vector<int>.Zero;
foreach ((Vector<int> slice, VectorMask<int> sliceMask) in source.SliceAsMaskedVector())
{
vresult = Vector<int>.Add(vresult, slice, sliceMask);
}
return vresult.Sum();
}
I don't disagree with these ideas, but each represents a slight "layer above" the current way the `Vector` API is used. I like the idea of allowing the JIT to perform more powerful vectorization from more declarative programming constructs (the slice, sliceMask you propose is a nice idea), but it seems that this could be a next step if others agree.
I agree that APIs providing "micro-kernels" for key functionality like `Sum` are a different layer and likely "out of scope" of the primitive building block considerations here.
You should feel free to open a proposal for such APIs as they are independent of the work expressed here. LINQ does provide some acceleration today but doesn't support `Span<T>` or `ROSpan<T>`.
| `VectorMask<T> VectorMask<T>.And(VectorMask<T>, VectorMask<T>)` |
| `VectorMask<T> VectorMask<T>.Or(VectorMask<T>, VectorMask<T>)` |
| `VectorMask<T> VectorMask<T>.Not(VectorMask<T>)` |
| `VectorMask<T> VectorMask<T>.Xor(VectorMask<T>, VectorMask<T>)` |
These should likely be operators, where the "friendly names" are provided for parity with `Vector<T>` and `Vector64/128/256<T>`.
Agreed.
| `VectorMask<T> VectorMask<T>.FirstIndexOf(VectorMask<T> mask, bool val)` |
| `VectorMask<T> VectorMask<T>.SetElementCond(VectorMask<T> mask, ulong pos, bool val)` |
I wonder if rather than `FirstIndexOf` this could be simply `LeadingZeroCount`, `TrailingZeroCount`, and `PopCount`. While not necessarily as "intuitive" as a named API, it is much more extensible overall and allows the most common operations that are needed.
Do you envision `LeadingZeroCount` here returning the actual leading zero count? Because I had in mind that `FirstIndexOf` would return an index into the VectorMask by type, e.g., this
VectorMask<short> mask = ...;
int idx = VectorMask<short>.FirstIndexOf(mask, true);
vs
VectorMask<short> mask = ...;
int idx = VectorMask<short>.LeadingZeroCount(mask) / Unsafe.SizeOf<short>();
I think it would be the `LeadingZeroCount` based on type.
That is, a `VectorMask128<byte>` would assume a mask of 16 bits and so the `lzcnt` would be "0-15". A `VectorMask128<int>` on the other hand would assume a mask of 4 bits and so the `lzcnt` would be 0-3.
If we instead always returned a 32-bit based `lzcnt` for `VectorMask128`, it would be easier to use with `ExtractMostSignificantBits` but less usable with the variable sized mask and with the abstract mask more generally.
Ok, so we are in agreement over the behavior roughly, just instead of allowing to check first index of true/false condition, we are working with the leading zero count per type.
I think that's pretty reasonable.
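As a point of reference for the zero-count discussion, here is a small sketch of how a "first matching element" query can be written today on top of `ExtractMostSignificantBits` and `System.Numerics.BitOperations`. It is not the proposed `VectorMask<T>` API; the class and method names are assumptions. Because the extracted bitmask has one bit per element, with element 0 in the least significant bit, the trailing zero count is already an element index and needs no division by the element size.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class FirstMatchExample
{
    // Index of the first element of `left` that is less than the
    // corresponding element of `right`, or -1 if there is none.
    public static int FirstLessThan(Vector128<short> left, Vector128<short> right)
    {
        // One bit per 16-bit element, so only the low 8 bits can be set.
        uint bits = Vector128.LessThan(left, right).ExtractMostSignificantBits();
        return bits == 0 ? -1 : BitOperations.TrailingZeroCount(bits);
    }

    // Number of elements of `left` that are less than `right`.
    public static int CountLessThan(Vector128<short> left, Vector128<short> right)
    {
        uint bits = Vector128.LessThan(left, right).ExtractMostSignificantBits();
        return BitOperations.PopCount(bits);
    }
}
```

A per-type `TrailingZeroCount`/`PopCount` on an abstract mask would play the same role as the two `BitOperations` calls here, without exposing the mask as an integer.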

| Method |
| ------ |
| `VectorMask<T> Vector<T>.EqualsByVectorMask(Vector<T> v1, Vector<T> v2)` |
If we went with the `AsMask()` proposal from above, then these just become `Vector<T>.Equals(x, y).AsMask()`, which, while slightly more verbose, doesn't require an "explosion" of new APIs and can be more easily integrated into existing code relying on `ExtractMostSignificantBits` and `MoveMask`.
From an API standpoint I think this is fine and intuitive, so long as the JIT can potentially optimize `Vector<T>.Equals(x, y).AsMask()` into something that takes advantage of the most performant masking features available, e.g., masking registers for AVX512: `Vector<T>.Equals(x, y).AsMask()` can be lowered to `vpcmpX` if available.
Yep, definitely. We also do and rely on this for a couple similar cases like `ToScalar` today, so I don't think it will be particularly problematic.
| `VectorMask<T> Vector<T>.LoadTrailing(Span<T> v1, VectorMask<T> mask)` |
| `VectorMask<T> Vector<T>.StoreTrailing(Span<T> v1, VectorMask<T> mask)` |

### Internals Upgrades for EVEX/AVX512 Enabling
This work should be coordinated with @JulieLeeMSFT. I would expect that Microsoft is providing code review and answering JIT design questions here. It would be great to confirm if the actual implementation is expected to be a collaborative effort or primarily driven by Intel.
@kunalspathak is the most likely person to loop in for the register support.
There are a few people that semi-regularly work on the emitter including myself, @kunalspathak, and @EgorBo.
Someone needs to be looped in for the debugger work.
The VM work is expected to be small and likely doesn't need anyone dedicated, the exception being extending CG2 to support tracking the additional ISA flags (@davidwrighton).
How `AVX512-VL` is tracked by the ISA flags might be of particular interest since it's a single CPUID flag but is somewhat like the `x64` flag in that it impacts multiple ISAs.
Co-authored-by: Tanner Gooding <[email protected]>
Please take a moment and add a bullet point list of teams and individuals you
think should be involved in the design process and ensure they are involved
(which might mean being tagged on GitHub issues, invited to meetings, or sent
early drafts).
Someone from Arm should look at this. Support for Arm SVE would require a Vector like interface, so it'd be good to ensure the design is suitable if SVE support gets implemented. I suspect most concerns will be around the mask implementation, and I see that Tanner has some comments below too. I'll see who we've got available to take a look....
Hi, Arm person here... and I apologize now for really not knowing anything about .net, so please pardon my ignorance! I really like the sound of focusing on the generic vector API, it seems most architectures are at least considering old-school vectors... I've been having a look through the existing Vector APIs and it looks like I'm not sure I follow why you'd need all off the Continuing on the theme of predication, what is the defined behavior for 'false' lanes, can it be programmed? I'm thinking in the context of SVE, where predication can enable zeroing or merging lanes. Does AVX512 enable something similar? |
So, what happens if the compiler decides that sometimes we should use AVX-512 for some operations, but AVX2 for others? Is there enough type information to enable this with I also have concerns that this approach will not be performant for SVE in an AOT setting, as everything depends on being able eventually have a compile-time fixed size vector. It's possible to set the width SVE at runtime, but I understand this is only possible at the kernel level, and we can't depend on all platforms providing hooks to do this. The other option would be to use predication to fix the width in the generated code, but this is likely to produce slower code. Flexible (sizeless) vectors could be coming to WebAssembly and we've (hopefully) made the necessary changes in cranelift to enable this for both AVX and SVE. That approach is really based on the idea of using WebAssembly's 128-bit vectors with the addition of a scaling factor to produce a target-defined vector type. This there any way that dotnet could handle a notion of a scaling factor, that could be compile- or runtime-defined? |
I'd typically expect a On the other hand, a
Yes. AVX512 has a concept of both "merge-masking" and "zero-masking". We would likely have functionality exposed in the generic APIs to enable the same.
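The difference between the two masking modes can be illustrated with the existing `Vector128.ConditionalSelect` API, which emulates masking by blending. This is only a sketch of the semantics, not of how AVX-512 or SVE predication would be exposed; the class and method names are assumptions.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class MaskingExample
{
    // Merge-masking: lanes where the mask is false keep their value from `a`.
    public static Vector128<int> AddMerge(Vector128<int> a, Vector128<int> b, Vector128<int> mask)
        => Vector128.ConditionalSelect(mask, a + b, a);

    // Zero-masking: lanes where the mask is false are set to zero.
    public static Vector128<int> AddZero(Vector128<int> a, Vector128<int> b, Vector128<int> mask)
        => Vector128.ConditionalSelect(mask, a + b, Vector128<int>.Zero);
}
```

Here `mask` is expected to be a comparison result, that is, each lane is either all-ones or all-zeros, since `ConditionalSelect` blends bit by bit.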
This would likely end up PGO or data driven in some other fashion. It would default to "Max" (or "ReasonableDefault") otherwise. The types by themselves could never provide enough data.
While we could enable something like Users are able to trivially query for type support, hardware acceleration, and have it treated as a JIT time constant so that they can detect and choose the best fit explicitly, whether that be a specific sized vector or an agnostic vector that "grows" to the maximum capabilities of the platform.
Yes. AOT in general is a problematic consideration for such code. In C/C++ SVE puts restrictions on how the underlying "vector" type can be used. For example, it disallows its usage as a field, in sizeof expressions, and a few other scenarios.
Simply scaling up a I think source generators or an approach similar to |
Fair enough wrt to masking vs leading/trailing, thanks for the clarification.
So, IIUC, that means that any path in the API explicitly needs to handle and treat any
Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.
|
Correct.
Could you clarify on what you mean by "stronger" types here? There are multiple ways this could be interpreted and I'd like to ensure we're on the same page. Some examples on what you believe might be error prone would be good as well.
How is this different from |
I mean enough information so that a compiler would pick up any type mismatches, and I'm mainly thinking in the context of using different vector sizes. Is a user able to use Narrowing/widening operations are maybe a good example of where not having a single size for To help my understanding, how would this currently get vectorized with
|
This is not the case for SVE though. In SVE the direction is reversed. You are given a mask to use by the control flow managing instructions such as The only time you have any cross page file issues are when you have an unknown number of iterations, so e.g. a
But for that we have first faulting loads.
The problem is that for SVE none of this is needed at all. At an instruction level something like a That's why I think the comment made at #268 (comment) is actually quite a salient one. I agree with that this PR is mostly about building blocks, but I'd hope that no actual code is written directly with these blocks but only a higher level API. In particular manually managing trailing loops will always be suboptimal for SVE. and in that sense
int SumVector(ReadOnlySpan<int> source)
{
Vector<int> vresult = Vector<int>.Zero;
foreach (Vector<int> slice in source.SliceAsVector())
{
vresult += slice;
}
return vresult.Sum();
}
is a better API for both SVE and AVX I think. The expansion of the That said I personally hope more for something like this
int unsafe SumVector(int* source, int n)
{
Vector<int> vresult = Vector<int>.Zero;
foreach (auto iter in new VectorSource (source, n))
{
Vector<int> val = Vector<int>.Load (source, iter);
vresult.Add (val, iter);
}
return vresult.Sum();
}
where
We'll ultimately need something that works reasonably well across a range of hardware. That includes For the case where developers want the utmost fine-grained control, we'll have the platform specific APIs available so that developers get raw access and can fully take advantage of the functionality available.
In the sense of a single "micro-kernel", yes. Once you start getting into more complex algorithms, it ends up worse off. Consider an algorithm that needs to Lerp and Sum. If you have 1 Lerp and 1 Sum API, then you get code that is generally "more efficient" while still being extremely suboptimal. This is particularly if you need to store intermediate results as you start needing to access and walk Effectively any logic that ends up outside the "simple operation" path is in a similar boat, where you ultimately need/want some customized logic to better account for the sequence of operations you need.
This ends up being "more expensive" for the "core" body of the loop because it will constantly have to check if It is likely simpler and cheaper to just have the main loop where |
Agreed, so the expectation is that you expect users to still write ISA specific loops? The current Vector128 generic functions work well enough on all ISAs yes. I'm however not entirely convinced the definition "Well enough" on fully masked ISAs and not/partially masked ones can't mean the same.
I'm not sure I follow you here.. The only thing I was suggesting is that all APIs should have a fully masked version, or a way for the JIT to recognize the loop mask. With an iterator the loop's governing mask is clear, so at least you can mask the operations appropriately. With an explicit counter you have to pass a true predicate to every SVE call. We don't for instance, have unmasked loads.
I'm guessing you mean here for non-fully masked ISAs. But I don't follow why? I'm guessing this has to do with how IEnumerator is lowered? I would have expected to be able to transform the iterator into a naïve counted loop for ISAs that don't fully support predication; that's also why the example gave the number of elements. I'm guessing you can't do this because the semantics of the iterator have to be preserved in case the iterator escapes the loop? (genuine question). In which case, you can do the same without the iterator by instead using a custom class with a while loop? I'd have expected the same number of comparisons as you would normally have for the same for loop. I was also expecting that for these ISAs the JIT could simply peel the loop for the "remainder". But perhaps that's not something that can be done?
This PR describes two proposals meant to enhance SIMD codegen in RyuJIT:
- Enhance the capability of `Vector<T>` to serve as either a template to generate multiple SIMD ISA pathways or dynamically recompile its chosen SIMD ISA at runtime via profile guided optimization: https://github.com/anthonycanino/designs/blob/main/accepted/2022/enhance-vector-codegen.md
- Introduce additional functionality to `Vector<T>` through new abstractions (`VectorMask`) and 512-bit vectors: https://github.com/anthonycanino/designs/blob/main/accepted/2022/enable-512-vectors.md

Looking forward to feedback and discussion on the ideas.