RFC: Cranelift sizeless vector types #19
Conversation
Copyright (c) 2021, Arm Limited.
cranelift-sizeless-vector.md (outdated)
- The new types don't report themselves as vectors, so `ty.is_vector()` = false, but are explicitly reported via `ty.is_sizeless_vector()`.
- The TypeSet and ValueTypeSet structs gain a bool to represent whether the type is sizeless.
- TypeSetBuilder also gains a bool to control the building of those types.
- At the encoding level, a bit is used to represent whether the type is sizeless; this bit has been taken from the max number of vector lanes supported, so they'd be reduced to 128 from 256.
Please take it from the special types range instead.
Updated in latest version.
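The bit-stealing scheme described in the quoted hunk can be sketched as follows. This is a hypothetical, simplified encoding (a single `u8` with lane counts stored as a log2), not Cranelift's actual type representation:

```rust
// Hypothetical encoding sketch: reserving one bit of the lane-count field to mark
// a sizeless vector halves the representable lane counts from 256 to 128.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Ty(u8);

const SIZELESS_BIT: u8 = 0b1000_0000;

impl Ty {
    fn vector(lane_log2: u8, sizeless: bool) -> Ty {
        // At most 2^7 = 128 lanes once a bit is reserved for sizelessness.
        assert!(lane_log2 < 8);
        Ty(if sizeless { lane_log2 | SIZELESS_BIT } else { lane_log2 })
    }
    fn is_sizeless_vector(self) -> bool {
        self.0 & SIZELESS_BIT != 0
    }
    fn lane_count(self) -> u32 {
        1 << (self.0 & !SIZELESS_BIT)
    }
}
```

For a sizeless type, `lane_count` would be interpreted as the minimum lane count rather than a fixed one.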
cranelift-sizeless-vector.md (outdated)
The proposal is to introduce new sizeless vector types into Cranelift, that:
- Express a vector, with a lane type and size, but a target-defined number of lanes.
- Are denoted by prefixing the lane type with 'sv' (sizeless vector): svi16, svi32, svf32, etc...
- Have a minimum width of 128-bits, meaning we can simply map to existing simd-128 implementations.
Riscv allows smaller vectors, right?
Yes, apparently so, but the flexible vector spec does not.
Also, the flexible vectors specification might introduce the requirement that the underlying vector length is a power of 2, even though the Arm SVE and the RISC-V V extension, I suppose, do not have similar restrictions. Whether our IR should cover more than the common use cases (such as the Wasm flexible vectors, power-of-2 vector sizes, etc.) is something to discuss.
It seems to me that supporting more general expressivity at the CLIF level is a good thing to establish early on if we plan to do it at all, or else we'll bake a lot of assumptions into code that operates on these types over time. So non-power-of-2 sizes is something to include (and test for) early. It does seem valuable to me to support this if underlying hardware commonly does.
On the other hand, we don't want to come up with a design that requires handling difficult edge-cases; emulating smaller vector support than the fundamental unit seems like a good example of that. (Edge-cases if one were to try to emulate with bigger ops involve e.g. ensuring loads/stores don't run beyond the end of the heap, and having smaller alignments than usual.) So scaling down to 64 bits seems like an unreasonable goal just because one possible future platform supports it. So, I'd tentatively suggest a 128-bit minimum size.
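With the 128-bit minimum suggested above, the minimum lane count of any sizeless type is recoverable from its lane width alone; a sketch (hypothetical helper, not Cranelift API):

```rust
/// Minimum number of lanes for a sizeless vector with the given lane width in
/// bits, given the proposal's 128-bit minimum vector width.
fn min_lanes(lane_bits: u32) -> u32 {
    // Lane widths in the proposal are powers of two, at most 64 bits.
    assert!(lane_bits.is_power_of_two() && lane_bits <= 64);
    128 / lane_bits
}
```

So `svi32` has at least 4 lanes, `svi8` at least 16, matching the simd-128 baseline.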
cranelift-sizeless-vector.md (outdated)
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The main design decision is to have a vector type that isn't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such. Although setting the minimum size of the opaque type to 128 bits still allows us to infer the minimum number of lanes for a given lane type. The flexible vector specification also provides specific operations for lane accesses and shuffles so these aren't operations that we have to handle with our existing operations.
Will operations like `iadd` work just like with regular vectors?
Yes, I think most polymorphic instructions can just be extended for sizeless types too. I've currently implemented: iadd, isub, imul, fadd, fsub and splat.
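Why the polymorphic instructions extend naturally can be sketched with hypothetical types (not the real Cranelift IR): the IR-level typing rule for ops like `iadd` only requires both operands to share a type, so whether the lane count is fixed or target-defined doesn't matter.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum VecTy {
    /// e.g. i32x4: lane type and lane count both known.
    Sized { lane_bits: u8, lanes: u8 },
    /// e.g. svi32: lane type known, lane count target-defined.
    Sizeless { lane_bits: u8 },
}

/// The typing rule for a polymorphic binary op: operands must match exactly;
/// sized and sizeless vectors are never interchangeable.
fn check_binary_op(op: &str, a: VecTy, b: VecTy) -> Result<VecTy, String> {
    if a == b {
        Ok(a)
    } else {
        Err(format!("{}: operand types must match", op))
    }
}
```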
What new instructions will be introduced for creating, loading and storing such sizeless vectors?
I think we may need to add stack related instructions, since we have explicit ones, though I'm still not completely sure why. But I expect that the main changes will be encapsulated in new heap_addr and stack_addr operations. We could probably modify the existing ones, but new ones would help isolate the ambiguous semantics of the sizeless nature.
One thing to note related to the stack: both stack address computation (
cc: @penzn
@cfallin I've only just started looking into stack handling, so I don't have any concrete answers for you. For the backend part, my hand-wavy plan is to try to do what is done in LLVM for SVE, where fixed-size and sizeless slots are assigned in different areas, using different pointers, and the runtime value of 'VL' can be used to scale offsets into the sizeless areas. I've only just looked at the machinst ABI layer, though, and I will continue with that this week.

At an IR level, the slots would be as sizeless (minimum 128 bits, not sure about a maximum) as any other sizeless value. But, as I don't know the IR well, this is the main contention that I am concerned about :) From my basic understanding, it looks like it would be easiest to create sizeless-specific instructions to handle anything stack related, so that we can preserve the current semantics for fixed-size objects and avoid having to worry about immediate offsets into a slot of unknown size. But I'm also assuming that the different slots could be freely mingled because they're just SSA values...? Please feel free to point me at any areas of the code which you think could be problematic.

If the ambiguous size of objects is going to be a serious problem for us, then there is the option of forcing a lowering to a fixed size, though this could be sub-optimal for architectures like SVE when we're not functioning as a JIT.
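The slot-assignment plan above (separate fixed and sizeless areas, with VL-scaled offsets into the latter) could be sketched like this. The layout and names are assumptions modelled on LLVM's SVE frame lowering, not Cranelift code:

```rust
/// Byte offset of a sizeless stack slot: the fixed-size area is laid out first
/// at static offsets, and each sizeless slot occupies one VL-sized unit after
/// it, so its offset scales with the runtime vector length ("VL", in bytes).
fn sizeless_slot_offset(fixed_area_bytes: u32, slot_index: u32, vl_bytes: u32) -> u32 {
    fixed_area_bytes + slot_index * vl_bytes
}
```

The same code then works whether VL is 16 bytes (a 128-bit implementation) or 32 bytes (256-bit), since only `vl_bytes` changes.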
@sparker-arm Thanks for the clarifications! I think that we can definitely find a way to accommodate variable-sized types; we just need to consider the problem explicitly and make sure to fix the places it breaks any assumptions :-) A quick-and-dirty way of getting a feel for that would be to make

Just for clarity -- I'm not actually sure I could say for certain despite being deep into this conversation! -- when exactly is the vector size made concrete? Do we ultimately compile the code while knowing the size (e.g., we decide we're compiling for microarchitecture X, and so we choose for vectors to be 512 bits wide)? Or are we actually generating code that will only know at runtime? I had been assuming the former, but your mention of computed offsets, etc., makes me think possibly it's the latter.

If we ultimately know the size statically, then it's just a phase-ordering/staging problem: we might need to rework how some of the ABI code operates, but it's not fundamentally incompatible with our model. If the latter, then it sounds like it's actually
It is determined at runtime. For Arm I believe it is fixed for each CPU, while for RISC-V the code can ask for any length (up to a certain maximum), and then the minimum of what the user requested and what the CPU supports is chosen, as far as I understand. The whole point is to make executables agnostic to the vector size supported by the CPU and thus not require recompilation when a new CPU is released with bigger vectors.
It should be the former, in the sense that it would be known what SIMD length a given CPU supports, and then the code can be generated with the right instructions, ideally without dynamic handling when the generated code executes. "Determined at runtime" is from the point of view of the developer, not the wasm runtime.
That would only work for JIT compilation and not AOT compilation. With AOT compilation it may not be known what CPU it will run on.
Do we actually need to know the real size though? Or does it just need to understand there's some scaling factor for any register/spill slots that hold a sizeless type..? And, as we've discussed before, the new register allocator will have to accept and understand differences between the fixed and sizeless registers and how they potentially alias.
@penzn I thought it was still a possibility that the flexible spec would support setting the length at runtime? Either way, both these options are available, and both will be implemented. For architectures that support runtime vector lengths, we can generate agnostic code (SVE uses the 'VL' runtime variable for most of this), while Neon, AVX2 and AVX-512 will have to choose a size during codegen. I'm currently implementing SVE and Neon, in tandem, to support these types, and my idea is, still, that we'd have a legalisation pass that can convert sizeless types to fixed sizes, which would enable all the SIMD-compatible backends to add (basic) support with little or no effort.
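The legalisation idea above could be sketched as follows (a hypothetical helper, not a real Cranelift pass): once a backend commits to a concrete vector width, 128 bits for Neon/SSE or the implementation's width for SVE, each sizeless type collapses to an ordinary fixed-size vector type.

```rust
/// Map a sizeless type name (e.g. "svi32") to a fixed vector type name
/// (e.g. "i32x4") for a target with the given concrete vector width in bits.
fn legalize_sizeless(ty: &str, target_bits: u32) -> Option<String> {
    // "svi32" -> lane type "i32", lane width 32 bits.
    let lane = ty.strip_prefix("sv")?;
    if lane.len() < 2 {
        return None;
    }
    let bits: u32 = lane[1..].parse().ok()?;
    if target_bits % bits != 0 {
        return None;
    }
    Some(format!("{}x{}", lane, target_bits / bits))
}
```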
And thanks for this @cfallin
Right, it can be made to work; it's just a bit of an API change that we'll need to design and coordinate. Right now both regalloc.rs and regalloc2 are built around the notion that they manage spill space (in units of slots) and can ask how many slots a given value will need. We'll need to have some sort of notion of a separate user-managed spill space for type/regclass X.

Re: stack layout more generally: none of this is impossible, but it's going to need some careful design rework in the ABI code. Specifically, right now the ABI implementation is mostly shared between aarch64, x64 and s390x, and on all architectures,

If part of the frame is variable, we'll need to either start accessing the fixed part via negative offsets from

So we can definitely work this out, but we'll probably need to add "modes" to the ABI: where is FP, from which base register are (i) args, (ii) fixed slots, (iii) variable slots accessed, how unwind info is emitted correctly in all cases, etc. Lots of interacting moving parts.

All of the above is needed for e.g.
Ah, another option I missed: variable part in the middle, fixed part at the bottom of the stack frame; then FP at the top and used to access args, as today, and SP at the bottom with fixed (independent of variable sizes) offsets for normal stackslots/spillslots. I think that satisfies all the constraints we have today (but not
@sparker-arm, there is an idea to support setting the 'current' length at runtime, RISC-V style, but the maximum available was always meant to be determined beforehand, in the spirit of straightforward compilation support. Doing something like that isn't impossible, but it would most likely require dynamic dispatch, and was considered out of scope.
@bjorn3, valid point - how does AOT compilation currently handle target features?
There is a list of target features. When doing AOT compilation you can choose a list of allowed target features. The produced executable will then run on all CPUs supporting all those target features. This list is generally chosen very conservatively to maximize the number of CPUs it works on. For example, only SSE and SSE2 are enabled by default by rustc on x86_64. This allows at most 128-bit vectors, despite the fact that modern CPUs support 512-bit vectors through AVX-512. You can enable AVX (for 256-bit vectors) or AVX-512 (for 512-bit vectors), but then it won't run on older CPUs. Arm's SVE, however, allows you to compile once with the SVE target feature enabled, and then it will use e.g. 512-bit vectors on CPUs that support them while retaining compatibility with CPUs that only support 128-bit vectors.
It sounds like the same would work here: when the user requests AVX instructions, they are going to get 256-bit vectors; with SSE, 128-bit; and so on.
@cfallin I think you're generally describing how AArch64 is currently handling this, with a third register used in the presence of dynamic objects: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp

@penzn Does the flexible spec state that the sizes of all the flexible types are the same? I had assumed not, because of the various type.length operations, or are these available just to be more user friendly?
Yes, those are there to express things in numbers of elements instead of bytes, which is more natural for loops and arrays (it also cuts one instruction out, but that isn't a big win really). The spec does not say whether or not different types can have different byte lengths, but in practice the hardware length is the same for architectures that use the same SIMD registers for different types.
Thanks @penzn, though I feel having the size defined in the spec would be very useful in reducing the amount of ambiguity for any target-independent parts of the compiler; the case of stack objects is a good example where, at the IR level, we can have a homogeneous unsized slot type which can be (re)used by any flexible type. When it comes closer to codegen, it also makes the layout of the frame easier/smaller/more efficient, because if the types potentially had different sizes, we'd likely have to pad objects to a fixed alignment or have expensive ways of calculating the address of each object.
But now I remember that, for SVE, the sizeless registers aren't always equal - the predicate registers are 8x smaller than the data regs... EDIT: This actually shouldn't matter at the moment, while comparisons produce vector masks in the vector regs, but it is something we should consider if predicate types are introduced.
cranelift-sizeless-vector.md (outdated)
We can add sizeless vector types by modifying existing structs to hold an extra bit of information to represent the sizeless nature:

- The new types don't report themselves as vectors, so `ty.is_vector()` = false, but are explicitly reported via `ty.is_sizeless_vector()`.
To avoid ambiguity, we should rename `is_vector` to `is_sized_vector` if we move forward with this.
@cfallin I've spent the last few days playing with the ABI layer, and I think I've stumbled upon (again) the limitation with the RegClass types, specifically while looking at callee saves... according to the AAPCS, in many cases SVE regs are treated the same as Neon, but there are cases where the whole 'Z' register needs to be saved, and I'm not sure how to handle this while we just have shared V128 registers for Neon and SVE. So two questions really: can you think of a way around this, and/or do we need to follow the AAPCS? RegClass is used quite a bit in the ABI layer, so I'm assuming there are going to be more problems like this.
Hmm, interesting; can you say more about what factors influence prologue/epilogue register save details? Is it just "if you clobber the upper bits, save them" or something like that? I think it might be reasonable to, as we scan the function body building up the clobber set during ABI processing, look at some information provided by an instruction ("requires special save/restore") and use that info as well.

The basic abstraction at the regalloc layer is that we have disjoint classes of units to allocate; overlapping classes are a major refactor that would require quite a significant investment of time (a month or so in regalloc code, with significant correctness risk); so I think that we likely need to find a way to make this work without two different register classes that mean "Z/vec reg as traditional vec reg" and "Z/vec reg as Z reg". But it's fundamentally a single register, so this doesn't seem wrong to me. In a sense, the way that we use it is type information layered on top of the allocation itself.
It is when the routine receives and/or returns Z or predicate registers that the callee needs to save the whole Z register (z8-z23). From what you said, though, I guess we could just grab the type information from the instruction so that we don't have to rely on RegClass? I definitely have no interest in trying to bend/break regalloc to the will of SVE, but if we can pass the high-level type information down then I think this will be fine. I still haven't looked at spills/restores though :)

edit: But now, from quickly glancing, I see that type information is passed around in the spill/restore APIs, so I'm hopeful.

edit: And the only place that I've found which doesn't provide a virtual reg for spill/reload is in regalloc's get_stackmap_artefacts_at function...
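The AAPCS rule discussed above could be checked in ABI code roughly as follows; the types here are assumed, simplified stand-ins, not Cranelift's actual ABI data structures:

```rust
#[derive(Clone, Copy, PartialEq)]
enum AbiTy {
    Scalar,   // e.g. i64, f64
    V128,     // fixed 128-bit vector
    Scalable, // SVE Z or predicate value
}

/// Per the AAPCS: when a function takes or returns a scalable (Z/predicate)
/// value, the callee must save the full Z registers z8-z23, not just their
/// low 128 bits.
fn callee_must_save_full_z(params: &[AbiTy], rets: &[AbiTy]) -> bool {
    params.iter().chain(rets.iter()).any(|t| *t == AbiTy::Scalable)
}
```

This matches the suggestion of scanning parameter/return types during prologue/epilogue generation, keeping the decision entirely inside Cranelift rather than in the register allocator.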
@cfallin Do you think adding something like 'add_sizeless_def' to RegUsageCollector would be a suitable way to communicate between regalloc and Cranelift? I think this would allow us to convert 'clobbered_registers' to hold a RealReg and a bool to represent whether it is sizeless.
Not anymore actually; see the recent fuzzbug fix (and followup issue #3645) where we determined that this type information can be inaccurate in the presence of moves and move elision.

There's a bit of subtlety here: the register allocator should not know about IR types in a non-opaque way; that's an abstraction leak and a correctness risk (what if it tries to do something with that info?). The allocator is a thing that hands out registers; registers are just black boxes that hold bits; modern aarch64 machines have Z registers; Cranelift can decide to put a Z-register-sized value, or a v128-sized value, or an f64-sized value, or anything else into that register. That's a clear delineation of responsibilities, and if we blur that line, I fear we could have bigger correctness problems later (similar to the one CVE and another almost-CVE this area has given us when we do blur the line). The suggested fix in the issue above involves having the regalloc record type info alongside registers, but only to verify that moves are valid; we don't want it to actually peek into that type info.

But, the ABI code is free to reason about how the function body uses registers. So what I imagine could work is that when we generate prologue/epilogue code, we can scan over parameter/return types ("do we take or return a Z reg" condition above) and determine what we actually clobber and what we need to save. If something also depends on whether instructions actually e.g. touch high bits or whatnot, we can scan instructions for that too. But that all stays within Cranelift; it doesn't involve regalloc changes.

It's possible I'm missing some requirement here though -- is there some reason that we can't determine the info we need from scanning the code on the Cranelift side?
I don't think we do anything of the sort currently -- it's a new idea :-) But, yes, it certainly seems possible to me. This will change some of the internal signatures, e.g.
I think so, yes. Just to make sure I understand, the
Yes, I think this actually makes more sense overall: associating the slot with a type feels more appropriate than requiring the IR producer to match the size given to the slot entity with the known size of the loads and stores used to access it.
Okay :) sounds like I'll be kept busy for a while hacking on the type system!
Yes, in most cases it will just be a constant set in the target backend. If/when we add fully dynamic support for SVE, it will be an instruction or two, in the prologue, that reads the runtime VL.
Hi @cfallin, now that I've had some time to revisit the meta level of Cranelift, I'm still struggling to see how these dynamically created types would work; it just seems too against how types and instructions are currently implemented.

During parsing we can check that dyn_scale and our dynamic type have been declared with the same base vector type, that's okay... At the IR level, the mechanical bits seem fine too: types are just u8 values, we reserve some of those for dynamic types, as I have done for the original sizeless types, and these should be named such as I8X16XN. Implementing methods to handle the dynamic addition is not an issue. But this doesn't seem to tie the dyn_scale value to the type in the way you would like; these are concrete types and not connected to a global value.

I got so lost in the meta layers that I can't remember which part made my mind finally fall over :) I think my problem comes when implementing the polymorphic instruction generation, when we want to iterate through concrete types. From the description above, they seem like concrete types (just a u8 with a bit for dynamic) but from the user perspective they are not. In the meta level, I've tried to generate a set of types that have a reference to an IR entity, which then seems superficial since that entity isn't actually part of the type. Does any of this successfully convey my pain?
@sparker-arm , I'm happy to help work out implementation/prototype details -- do you have a WIP branch you can point me to that demonstrates what you're thinking / what issues you're running into?
I don't think that the IR literally needs to have a
```rust
struct Function {
    // ...
    dyn_types: Vec<DynType>,
}

/// A dynamically-sized type. Type `dtN` refers to the definition in `dyn_types[N]`.
struct DynType {
    /// The dynamically-sized type is defined in terms of a base type and a global value
    /// that indicates how many of those base-type elements this type contains.
    base_type: Type,
    length: GlobalValue,
}
```

It isn't really an issue that the

Possibly I'm missing some other difficulty here though? Are there places where it would be awkward to look up the
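A minimal, self-contained usage sketch of the `dyn_types` side-table idea, with the referenced Cranelift entities (`Type`, `GlobalValue`) stubbed out as plain newtypes: `dtN` is just an index into the function's table of dynamic-type definitions.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Type(u8); // stub for a concrete base vector type, e.g. I8X16
#[derive(Clone, Copy, PartialEq, Debug)]
struct GlobalValue(u32); // stub for a global-value entity, e.g. gv0

struct DynType {
    base_type: Type,
    length: GlobalValue,
}

struct Function {
    dyn_types: Vec<DynType>,
}

impl Function {
    /// Resolve `dtN` to its definition in the side table.
    fn dyn_type(&self, n: usize) -> &DynType {
        &self.dyn_types[n]
    }
}
```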
When generating code, I imagine there would be a case that matches on dynamic types, and then switches on the base type; so just as we today have cases for |
Okay, thanks, this sounds much more feasible, and sorry I kinda missed this suggestion in your previous comment. One thing regarding the global value names: is it a fundamental property of CLIF that we need a prefix followed by a number? I was wondering whether we could have
I greatly appreciate the offer, but I think your above suggestion is enough to get me moving forward again. I will be on holiday for a couple weeks though, starting Friday, so please don't interpret my silence as being stuck again. |
The CLIF parser seems to be built assuming the
I'm generally happy with this now; thanks for the patience in iterating on the high-level approach here!
I too am eager to see the code; some of the details here I think will be most clear once we have a concrete thing in front of us, and I think everything is solidified to the point that we won't have too much design churn that wastes effort. The biggest question to me is how the pipeline lowers this to the existing stackslot abstractions; as we discussed earlier if all the sizes are initially compile-time constants then in theory the ABI implementation could remain unchanged, if we legalize beforehand. But for full generality maybe the ABI needs to be aware of the separate category of slots. In any case, we can iterate that as we need -- let's hopefully get this RFC accepted soon, if others agree!
cranelift-sizeless-vector.md (outdated)
@@ -96,17 +99,19 @@ With the notion of a sizeless stack slot possible, but not a sizeless spill slot
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The main design decision is to have a vector type that isn't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such. Although setting the minimum size of the opaque type to 128 bits still allows us to infer the minimum number of lanes for a given lane type. The flexible vector specification also provides specific operations for lane accesses and shuffles so these aren't operations that we have to handle with our existing operations.
The main change here is the introduction of dynamically created types, using an existing vector type as a base and a scaling factor represented by a global value. Using a global value fits with clif IR in that we have a value which is not expected (allowed?) to change during the execution of the function. The alternative is to add types which have an implicit scaling factor which could make verification more complicated, or impossible.
Indeed, "not allowed" rather than just "not expected" -- the semantics are as-if loaded once in the prologue.
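The "as-if loaded once in the prologue" semantics can be sketched as follows (a hypothetical structure, not Cranelift code): the scale is captured on function entry, and every later use observes that captured value, so it cannot change mid-function.

```rust
/// Per-activation context: the dynamic scale is read exactly once, on entry.
struct FrameCtx {
    dyn_scale: u32,
}

impl FrameCtx {
    /// "Prologue": capture the scale once; all later queries see this value.
    fn enter(read_scale: impl Fn() -> u32) -> FrameCtx {
        FrameCtx { dyn_scale: read_scale() }
    }

    /// Size in bytes of a dynamic value whose base type is `base_bytes` wide.
    fn bytes_for(&self, base_bytes: u32) -> u32 {
        base_bytes * self.dyn_scale
    }
}
```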
cranelift-sizeless-vector.md (outdated)
The main design decision is to have a vector type that isn't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such. Although setting the minimum size of the opaque type to 128 bits still allows us to infer the minimum number of lanes for a given lane type. The flexible vector specification also provides specific operations for lane accesses and shuffles so these aren't operations that we have to handle with our existing operations.
The main change here is the introduction of dynamically created types, using an existing vector type as a base and a scaling factor represented by a global value. Using a global value fits with clif IR in that we have a value which is not expected (allowed?) to change during the execution of the function. The alternative is to add types which have an implicit scaling factor which could make verification more complicated, or impossible.

The new vector types also aren't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such.
s/accidently/accidentally/
Force-pushed from ca2e00b to b7cb4c6
Thanks @cfallin !
Indeed, the current ABI changes that I've made aren't really needed for this initial fixed-size implementation, but I've tried to implement it with true dynamic sizes, and general SVE, in mind. Speaking of which, my current plan for the code RFC is to avoid including any SVE codegen, except some changes to the ABI API, purely to avoid distractions from the IR and ABI changes (it already feels big enough to me). Do you think this is wise, or would you prefer to see some SVE codegen too?
I think your incremental approach sounds good: best to get support for the dynamically-sized types in, handling/storing/spilling them, then we can add ISA support as a followup. Actually maybe there are (at least) three pieces: the dynamically-sized types and ABI support; then a "polyfill" without SVE, just NEON (and others could build the same with Intel SIMD if interested); then actually using the new hardware instructions. But, we don't need to fix the details here, they can change as needed I think.
LGTM; I agree with @cfallin's sentiment above that this can be incrementally improved, but this is a good place to start.
Hi, it's been a week since I updated this so I'd now like to move this RFC into the final comment period. Thanks!
@sparker-arm could you post a final-comment-period approval checklist? Then we can start to collect approvals and hopefully get this in soon! (+1 from me as well, i.e. feel free to mark me as checked already) |
Stakeholders sign-off

Arm
Fastly
Intel
Unaffiliated
IBM
The RFC looks good from a spec point of view.
I don't consider myself a stakeholder w.r.t. Cranelift, so my opinion should carry much lower weight than others'. I would be interested in following this development one way or the other though.
I believe enough time has now passed? Can this now be merged?
Indeed, the FCP has now elapsed with no objections, so this RFC should now be merged! (I'm happy to click the button if you don't have permissions to do so, but we should also fix that if so)
Great, I didn't think I had permissions! Thanks for all the help with this @cfallin |
It was just added apparently (thanks to Till); you're part of the Cranelift core org group now so it shouldn't be an issue in the future :-) |