[RFC] Dynamic Vector Support #4200
Conversation
This is a great start -- thanks for the patience as we've worked out the design details! It seems we're pretty close now. I read over this PR to get an initial sense of the implementation approach, but I'll do a more detailed pass once we resolve a higher-level question that I have.
Also, it seems there are some blocks of commented-out code and a few FIXMEs -- happy to help think through unresolved questions if needed; otherwise, let us know when those are ready for review.
The high-level question has to do with the "minimum lane count" and the 2-bit encoding scheme this PR adds to the type index -- see below:
```
@@ -76,6 +76,13 @@ pub enum GlobalValueData {
        /// Does this symbol refer to a thread local storage value?
        tls: bool,
    },

    /// Value is a multiple of how many instances of `vector_type` will fit in
    /// a target vector register.
```
This is slightly unclear to me: does this mean that it takes the size of the target's register width and divides by `vector_type`? I.e., that it has a concrete value determined by the compiler configuration / target microarchitecture? Or is it passed in / determined in some other way?

Or I guess my confusion comes from: I would expect `DynScale` as a name to imply that this is some generic sort of scale parameter, while this seems to say that it is a specific scale value determined by some algorithm ("what will fit in ..."). So maybe a better name would help: `DynScaleTargetReg`? And in the future we may have some other `DynScale...` options based on loads of parameters from vmctx, or a special vector-length register, or a `DynScaleConstant`, or ...
> does this mean that it takes the size of the target's register width and divides by `vector_type`?

Yes. So for a given dynamic type, the backend will scale the fixed part of the type by `DynScale`. The name is likely misleading.

I feel we shouldn't need to prescribe different global values for constant scale or truly dynamic ones via a register, as it's not immediately obvious, to me, why this would be helpful at the IR level. I feel we should be able to encapsulate the semantics we want with a single construct and allow a backend to lower it how it wishes... but...

It would almost certainly make it far easier to implement the machinst ABI and, as that is a particular point of pain, it would be good to make that as simple as possible :)

For this first implementation we could rename to `DynScaleConstant`, which would better describe the current limitations.
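To make the intended computation concrete, here is a tiny self-contained sketch (an illustration, not code from this PR) of the scale a backend would resolve:

```rust
/// Illustrative helper (hypothetical name): how many copies of a base
/// fixed-width vector type fit in one target vector register.
fn dyn_scale(target_register_bits: u32, base_vector_bits: u32) -> u32 {
    debug_assert!(target_register_bits >= base_vector_bits);
    target_register_bits / base_vector_bits
}

fn main() {
    assert_eq!(dyn_scale(128, 128), 1); // 128-bit Neon: a scale of 1
    assert_eq!(dyn_scale(512, 128), 4); // a 512-bit SVE machine: a scale of 4
}
```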
Yes, I think `DynScaleTargetWidth` or something might be best here. As to why different global values for the different cases -- if I'm understanding the question right, the answer is that the details are really ABI-visible (e.g., is there a hidden register that controls how much data we read at once that must be set up? is it part of a vmcontext in a runtime?) so they should be surfaced in the IR. And having it clearly here also lets us define CLIF semantics for it, which is useful for interpreters and for formal verification.
Anyway with the above rename I think this is settled for now!
cranelift/codegen/src/ir/types.rs (outdated)
```
@@ -22,6 +22,11 @@ use target_lexicon::{PointerWidth, Triple};
///
/// SIMD vector types have power-of-two lanes, up to 256. Lanes can be any int/float/bool type.
///
/// Dynamic SIMD vector types have a power-of-two minimum lane count, which is scaled by a target-defined
```
Can you say more about the "minimum lane count" concept -- why it exists, and cases where one might provide different minimum lane counts? And why only {1/2, 1, 2} as factors?

In particular, if this comes from a machine restriction, I'm somewhat concerned that we would be baking a specific set of details into the type-system framework forever. I'd rather have a somewhat higher-level notion of dynamic types in CLIF, e.g. `i32xN`; and then when we evaluate the `dyn_scale_...` global value, we can either make whatever guarantees we need for `N` if the value is at the discretion of the ISA backend, or for values loaded from the user, we can place additional restrictions as needed (e.g., a backend provides the embedder a "minimum multiple for dynamic vector size" and the embedder uses this).
The minimum lane count comes from the fact that we build a dynamic vector type from a fixed base vector type. The fixed base type provides the minimum lane count.

I added the halving/doubling to try to mimic the functionality of the existing SIMD types, and the ones commonly used, but I don't think it's needed and it possibly doesn't make sense. The main reason was encoding space...

My problem here was that cranelift has a very large range of SIMD types, which made it hard, for me, to add another group of types with a reasonable range. What I wanted to do was just use a single bit, along with the existing SIMD types, to express a dynamic vector type, but I struggled and failed there... So I'm very happy to hear suggestions; I would really prefer not to use this 2-bit system.
What if we define dynamic vector types not on top of a base fixed-vector type, but on top of the lane type alone, and then make the dynamic factor (the symbolic `N`) provide a bit width, rather than a multiplier? This is sort of starting from what I was getting at with `i32xN` above (rather than `i32x4xN`), but then shifting the meaning of `N` a bit so that we can have the same `N` (symbolically at least; in practice this is all resolved at compile time) for every lane type.

Maybe we need a slightly different notation to make this more clear: perhaps (please suggest other punctuation, anyone!) `i32/N` (`i32`s over `N` bits). Then we could have

```
N = dyn_width_constant 1024
dt0 = i32/N
dt1 = i64/N
v1 = load.dt0 ...
v2 = bitcast.dt1 v1
store.dt1 v2, ...
```

I like this because it removes any notion of "minimum lane count" or some other factor (why `4` in `i32x4xN`, other than the fact that historically we started with 128-bit vectors and we're expanding from there?). We just have the important bits: the lane type (`i32`) and the machine width we're scaling to (`N`). Thoughts?
I'm not sure I follow... for one, with `dyn_width_constant` taking an immediate value, shouldn't we really just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths - otherwise people have to make a hard choice of whether to use it or not. I think with your suggestion, we now only have types that use the full width of the register, and so we wouldn't be able to support widen and narrow operations?
> I'm not sure I follow... for one, with `dyn_width_constant` taking an immediate value, shouldn't we really just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

Ah, so I didn't mean for the `dyn_width_constant` to be the interesting part of that example; perhaps it results from a pass that legalizes to a particular microarchitecture with a known width. The point was to have the global value that defines machine width, as we've had so far; maybe `dyn_width_vl` for a vector-length register or somesuch...

The basic idea I was going for was to have a parameter for the machine width, and say "vector of lane type T up to width `N`". Whereas the `i64x2xN` sort of syntax says "an `N`-times multiple of a base 128-bit vector machine". Just holding a different set of two of the three variables in "machine width = lane width * lanes" constant, I guess.

Said another way, the existing vector types `i8x16`, `i16x8`, `i32x4`, `i64x2` can be written as `i8/128`, `i16/128`, `i32/128`, `i64/128`; and then this is making that width configurable, rather than the multiplier over the 128-bit base width.

That said...

> The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths - otherwise people have to make a hard choice of whether to use it or not.

I also see the advantages of the `i64x2xN` approach with respect to the existing SIMD types: we're basically "slapping a multiplier on it" and everything else remains the same. The downside is the encoding and the from-first-principles comprehensibility of the types.

All else being equal, I think either could work, and an easier migration path is important; and you've done the work to prototype the latter; so `i64x2xN` is totally workable. I won't push the above ideas too hard against it :-)

> I think with your suggestion, we now only have types that use the full width of the register and so we wouldn't be able to support widen and narrow operations?

Ah, so here it actually gets interesting: what happens to the {1/2, 1, 2} multipliers if we do two narrow or widen ops in a row?
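To spell out the arithmetic behind that question (a worked example of one reading, not from the thread): if a widen doubles the lane width while keeping the lane count, the total width doubles each time, so the multiplier over the 128-bit base escapes the {1/2, 1, 2} range after two widens:

```rust
// Multiplier of a (lane_bits x lanes) dynamic base vector over 128 bits.
fn multiplier(lane_bits: u32, lanes: u32) -> f64 {
    (lane_bits * lanes) as f64 / 128.0
}

fn main() {
    // i16x8xN --widen--> i32x8xN --widen--> i64x8xN
    assert_eq!(multiplier(16, 8), 1.0);
    assert_eq!(multiplier(32, 8), 2.0);
    assert_eq!(multiplier(64, 8), 4.0); // already outside {1/2, 1, 2}
}
```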
Ah, I think I missed that the enum discriminant in `ValueData` is an implicit `u8`, so right now it fits in a `u64`, but making `Type` a `u16` inflates that (probably at least to 12 bytes, if not 16?).

I have some thoughts on bitpacking and giving `Type` 14 bits instead; I'll do a PR and do some measurements, and get this in as a separate refactor if it works, to pave the way for your work :-)
@sparker-arm I took a look at this just now in #4269. The impact on compiling SpiderMonkey.wasm was actually 4.5%, which is quite serious and so we definitely need to do something here -- the bitpacking in that PR gets it down to 1.0% but I'm hoping we can do better (ideally I wouldn't want to pay more than 0.1% for this, given that compile-time wins are generally hard to come by).
So, I re-ran my experiments looking at reported time instead of raw cycles, and I see `u16` causing a 1.1% or 1.3% slowdown, depending on whether I use the mean or minimum values. (I don't understand why cycles and time are so different...) And I'm a bit concerned that your results are significantly different from mine. What machine are you running on? And are you running sightglass in a way that mimics a 'real-life' scenario, with the options `--iterations-per-process 10 --processes 2`?

To add to my confusion, when I benchmark your 14-bit patch I confirm your results, with a 0.78% slowdown...
Results for me were on a desktop Ryzen 3900X 12-core system. The iteration count and process count were a tradeoff to get some results and not wait forever, but all else equal, larger counts are better (it lets Sightglass get a tighter confidence interval). If run with just two engines rather than three it'll do a statistical-significance test too.
Compilation input can make a big difference in the slowdowns observed (e.g. if functions get big enough for IR data structures to exceed L1). Are you testing with SpiderMonkey.wasm or something else?
Yeah, I'm using spidermonkey.wasm and I'm on a big and shiny Ampere Altra. Maybe this has something to do with x64 vs aarch64, because I assume that the size of L1 caches varies very little between CPUs these days. Considering the benchmark takes < 10 seconds, is the tradeoff really necessary?
(force-pushed 82c43a6 → fbd0e5a)
```rust
        self.stack_slots.push(data)

    /// Creates a sized stack slot in the function, to be used by `stack_load`, `stack_store`
    /// and `stack_addr` instructions.
    pub fn create_sized_stack_slot(&mut self, data: StackSlotData) -> StackSlot {
```
Could you please keep the original name? This is an unnecessary breaking change that would break the nightly cranelift tests of cg_clif.
I have no strong opinion, but opposing feedback that I've previously received from @fitzgen (I can't remember in which area) was that we should clearly differentiate between fixed and dynamic. If that was just around type naming, then it is a non-issue and I'm happy to revert. It's already been a problem for me while rebasing!
I'd prefer we have the explicit name as well; @bjorn3 each Cranelift upgrade is a semver break currently and our APIs are still considered unstable. We shouldn't needlessly break compatibility, but getting things right and finding clearer names and less error-prone APIs is a higher priority right now.
I guess so @cfallin.
```rust
        )
        .operands_in(vec![DSS])
        .operands_out(vec![addr]),
    );
```
Maybe I missed this at some point, but why do we need new stack manipulation instructions? Wouldn't the old instructions have the attached dynamic type which would be enough to determine where the dynamic slots exist?
The issue is the operand typing -- these take a `DynamicStackSlot` rather than a `StackSlot`, and that distinction in turn is needed for the ABI code. We could have made each `StackSlot` be polymorphic over statically-sized or dynamically-sized types, though that adds a bit of complication. Given that we do the layout differently for each, and lower each slightly differently, I think it's reasonable to keep them separated, but I don't feel too strongly about that.
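For illustration, a minimal sketch (assumed shapes and names, not the actual ABI code) of why keeping the two slot kinds separate simplifies layout: sized slots have compile-time sizes, while every dynamic slot's size comes from the target-resolved vector scale:

```rust
// Distinct index types prevent mixing up the two kinds of slot.
#[allow(dead_code)]
struct StackSlot(u32);
#[allow(dead_code)]
struct DynamicStackSlot(u32);

/// Lay out sized slots, then dynamic slots; returns the total frame size.
/// `dyn_slot_bytes` would come from the target (base type size * dyn scale).
fn frame_size(sized_bytes: &[u32], dyn_slots: u32, dyn_slot_bytes: u32) -> u32 {
    let sized_region: u32 = sized_bytes.iter().sum();
    sized_region + dyn_slots * dyn_slot_bytes
}

fn main() {
    // Two 8-byte sized slots plus one i32x4xN slot on Neon (scale 1 => 16 bytes).
    assert_eq!(frame_size(&[8, 8], 1, 16), 32);
}
```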
(force-pushed fbd0e5a → 160588a)
@cfallin I've added widen and narrow support and I've maintained the existing constraint system by sinking …
Thanks -- we're definitely getting closer!

At a top level, in addition to all the comments below, I noticed that `Type` became `u16` but I didn't see anything like the tricks in #4269 -- given my measurements I don't think we can just make it a `u16` and be done with it. Could you incorporate either that bitpacking, or a sentinel-value-and-sparse-hashmap approach?
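For reference, here is a minimal sketch of the bitpacking idea; the exact bit layout below is illustrative (see #4269 for the real encoding), but it shows how a 2-bit variant tag, a 14-bit type code, a 16-bit number, and a 32-bit index can share a single `u64` so the struct stays eight bytes:

```rust
#[derive(Clone, Copy)]
struct PackedValueData(u64);

const TAG_INST: u64 = 0; // further tags (Param, Alias, ...) would use 1, 2, ...

impl PackedValueData {
    /// Pack an `Inst` value: 2-bit tag | 14-bit type | 16-bit num | 32-bit inst.
    fn inst(ty: u16, num: u16, inst: u32) -> Self {
        debug_assert!(ty < (1 << 14), "type code must fit in 14 bits");
        PackedValueData(
            (TAG_INST << 62) | ((ty as u64) << 48) | ((num as u64) << 32) | inst as u64,
        )
    }

    /// Recover the 14-bit type code.
    fn ty(self) -> u16 {
        ((self.0 >> 48) & 0x3fff) as u16
    }
}

fn main() {
    let v = PackedValueData::inst(0x1abc, 7, 42);
    assert_eq!(v.ty(), 0x1abc);
    // The whole payload still fits in one machine word.
    assert_eq!(std::mem::size_of::<PackedValueData>(), 8);
}
```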
```
@@ -558,6 +565,7 @@ impl ABIMachineSpec for AArch64MachineDeps {
    }

    fn gen_get_stack_addr(mem: StackAMode, into_reg: Writable<Reg>, _ty: Type) -> Inst {
        // FIXME: Do something different for dynamic types?
```
Can we file a followup issue to track what to do about dynamic vector types as args?
(Am I assuming right that usually they won't be args, but rather the args themselves will be pointers to the data we use with dynamically-sized vectors?)
AFAIK, yes. It's either in a scalable register (data or predicate) or we pass a pointer.
```rust
let ty = f
    .get_concrete_dynamic_ty(data.dyn_ty)
    .unwrap_or_else(|| panic!("invalid dynamic vector type: {}", data.dyn_ty));
dynamic_stack_offset += isa.vector_scale(ty);
```
I may be missing it but where does the global value (vector scale) actually come in here to determine the size of the dynamically-sized type?
Yes, this is a problem, as they are currently disjoint. I've currently renamed this to `dynamic_vector_bytes` to differentiate it from the scaling factor. The only way to tie them together that I can think of is to pass the value from `TargetIsa` when legalizing the `GlobalValue`, which I'll try out.
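A sketch of the direction being described, with assumed names: expose a single scale query on the target, consumed both when sizing dynamic stack slots and when legalizing the dyn-scale global value, so the two values cannot drift apart:

```rust
/// Hypothetical trait standing in for the relevant part of TargetIsa.
trait VectorScale {
    /// Resolved dynamic scale for a base vector of `base_bytes` bytes.
    fn dyn_scale(&self, base_bytes: u32) -> u32;
}

struct Neon;
impl VectorScale for Neon {
    fn dyn_scale(&self, _base_bytes: u32) -> u32 {
        1 // fixed 128-bit registers
    }
}

/// Slot layout consumes the scale to size a dynamic stack slot...
fn dynamic_slot_bytes(isa: &dyn VectorScale, base_bytes: u32) -> u32 {
    base_bytes * isa.dyn_scale(base_bytes)
}

/// ...and legalization turns the dyn-scale global value into the same constant.
fn legalized_scale(isa: &dyn VectorScale, base_bytes: u32) -> u64 {
    isa.dyn_scale(base_bytes) as u64
}

fn main() {
    let isa = Neon;
    assert_eq!(dynamic_slot_bytes(&isa, 16), 16); // i32x4 base on Neon
    assert_eq!(legalized_scale(&isa, 16), 1);
}
```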
OK, I think this is fine to start, but let's add a TODO and a tracking issue for truly dynamic stack frame layout based on some runtime value. At least, if I understand correctly, one option all along has been for runtime-determined size (from a vmcontext configuration field or some other input), which is what motivated us to take the GV-based approach; so we should build that eventually. I'm happy to see it done incrementally though in this case.
(force-pushed 160588a → a74f032)
Would you mind committing those changes so I can rebase on top? I'm happy to add my optimization on top of it.
(force-pushed a74f032 → c7e7496)
Squashed the commits and rebased with …
(force-pushed c7e7496 → d09db01)
@sparker-arm I've been over most of this again recently and I think it's looking good. I think the extended versions of some of the features (truly dynamic stackframe layout based on runtime size configuration, at least) can come in as subsequent PRs, now that the basic framework is here, as on balance what's here is self-consistent and I don't want to keep the PR hanging for too long and have you pay the rebase cost.
So, with some tracking issues filed as per the outstanding codereview comments (and with the merge conflict in the interpreter resolved), I think we're good to merge this!
(force-pushed d09db01 → 1eba452)
I spoke to @sparker-arm and he is planning to file followup tracking issues tomorrow; for now, this is (finally) good to merge. Thanks for the patience throughout the process!
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types, and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the
user-defined types map onto. The dynamic types are also used
explicitly to create dynamic stack slots, which, unlike their
existing counterparts, have no fixed size. New IR instructions are
added to access these new stack entities.

Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant, so the dynamic slots do eventually have a
compile-time-known size, as do spill slots.

The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.

Copyright (c) 2022, Arm Limited.
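To summarize the mechanism in runnable form, here is a self-contained model of the concept described above (not Cranelift's actual API): a dynamic type pairs a fixed base vector type with a scale that the backend resolves at compile time:

```rust
/// A fixed-width base vector type, e.g. i32x4.
#[derive(Clone, Copy)]
struct FixedVector {
    lane_bits: u32,
    lanes: u32,
}

/// A dynamic type: a base type 'multiplied' by a dyn-scale global value.
struct DynamicType {
    base: FixedVector,
}

impl DynamicType {
    /// Concrete size once the backend resolves the scale; at that point
    /// dynamic stack slots and spill slots have compile-time-known sizes.
    fn byte_size(&self, dyn_scale: u32) -> u32 {
        (self.base.lane_bits / 8) * self.base.lanes * dyn_scale
    }
}

fn main() {
    // I32X4XN under the current aarch64 lowering (Neon, scale 1): 16 bytes.
    let dt = DynamicType { base: FixedVector { lane_bits: 32, lanes: 4 } };
    assert_eq!(dt.byte_size(1), 16);
    // The same type on a hypothetical 512-bit target (scale 4): 64 bytes.
    assert_eq!(dt.byte_size(4), 64);
}
```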