
[RFC] Dynamic Vector Support #4200

Merged
merged 1 commit into from
Jul 7, 2022

Conversation

sparker-arm
Contributor

Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.

Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.

The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
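The shape of the concept can be sketched in a few lines of Rust (hypothetical types for illustration only, not Cranelift's actual API): a dynamic vector type pairs a fixed-width base type with a scaling factor that is carried symbolically in the IR as a global value and only resolved to a constant during codegen.

```rust
#[derive(Clone, Copy)]
struct FixedVectorType {
    lane_bytes: u32, // e.g. 4 for an i32 lane
    lanes: u32,      // e.g. 4 for i32x4
}

#[derive(Clone, Copy)]
struct DynamicVectorType {
    base: FixedVectorType, // the fixed-width part, e.g. i32x4 in I32X4XN
}

impl DynamicVectorType {
    // Byte size of a dynamic stack slot once the backend has lowered
    // the scaling factor to a compile-time constant.
    fn byte_size(&self, resolved_scale: u32) -> u32 {
        self.base.lane_bytes * self.base.lanes * resolved_scale
    }
}

fn main() {
    let i32x4 = FixedVectorType { lane_bytes: 4, lanes: 4 };
    let i32x4xn = DynamicVectorType { base: i32x4 };
    // The current aarch64 (Neon) lowering resolves the scale to 1:
    assert_eq!(i32x4xn.byte_size(1), 16);
    // A 256-bit vector target would resolve it to 2:
    assert_eq!(i32x4xn.byte_size(2), 32);
}
```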

@sparker-arm sparker-arm added the cranelift (Issues related to the Cranelift code generator), cranelift:meta (Everything related to the meta-language), and cranelift:area:aarch64 (Issues related to AArch64 backend) labels on Jun 1, 2022
@sparker-arm sparker-arm requested review from cfallin and abrown June 1, 2022 10:23
Member

@cfallin cfallin left a comment

This is a great start -- thanks for the patience as we've worked out the design details! It seems we're pretty close now. I read over this PR to get an initial sense of the implementation approach, but I'll do a more detailed pass once we resolve a higher-level question that I have.

Also it seems there are some blocks of commented-out code, and a few FIXMEs -- happy to help think about unresolved questions if needed, otherwise let us know when those are ready for review I guess?

The high-level question has to do with the "minimum lane count" and the 2-bit encoding scheme this PR adds to the type index -- see below:

@@ -76,6 +76,13 @@ pub enum GlobalValueData {
/// Does this symbol refer to a thread local storage value?
tls: bool,
},

/// Value is a multiple of how many instances of `vector_type` will fit in
/// a target vector register.
Member

This is slightly unclear to me: does this mean that it takes the size of the target's register width and divides by vector_type? i.e. that it has a concrete value determined by the compiler configuration / target microarchitecture? Or is it passed in / determined in some other way?

Or I guess my confusion comes from: I would expect DynScale as a name to imply that this is some generic sort of scale parameter, while this seems to say that it is a specific scale value determined by some algorithm ("what will fit in ..."). So maybe a better name would help: DynScaleTargetReg? And in the future we may have some other DynScale... options based on loads of parameters from vmctx, or a special vector-length register, or a DynScaleConstant, or ...

Contributor Author

does this mean that it takes the size of the target's register width and divides by vector_type?

Yes. So for a given dynamic type, the backend will scale the fixed part of the type by DynScale. The name is likely misleading.

I feel we shouldn't need to prescribe different global values for constant scale or truly dynamic ones via a register, as it's not immediately obvious, to me, why this would be helpful at the IR level. I feel we should be able to encapsulate the semantics we want with a single construct and allow a backend to lower it how it wishes.... but...

It would almost certainly make it far easier to implement the machinst ABI and, as that is a particular point of pain, it would be good to make that as simple as possible :)

For this first implementation we could rename to DynScaleConstant and it would much better describe the current limitations.

Member

Yes, I think DynScaleTargetWidth or something might be best here. As to why different global values for the different cases -- if I'm understanding the question right, the answer is that the details are really ABI-visible (e.g., is there a hidden register that controls how much data we read at once that must be set up? is it part of a vmcontext in a runtime?) so they should be surfaced in the IR. And having it clearly here also lets us define CLIF semantics for it, which is useful for interpreters and for formal verification.

Anyway with the above rename I think this is settled for now!

@@ -22,6 +22,11 @@ use target_lexicon::{PointerWidth, Triple};
///
/// SIMD vector types have power-of-two lanes, up to 256. Lanes can be any int/float/bool type.
///
/// Dynamic SIMD vector types have a power-of-two minimum lane count, which is scaled by a target-defined
Member

Can you say more about the "minimum lane count" concept -- why it exists, and cases where one might provide different minimum lane counts? And why only {1/2, 1, 2} as factors?

In particular if this comes from a machine restriction, I'm somewhat concerned that we would be baking a specific set of details into the type-system framework forever. I'd rather have a somewhat higher-level notion of dynamic types in CLIF, e.g. i32xN; and then when we evaluate the dyn_scale_... global value, we can either make whatever guarantees we need for N if the value is at the discretion of the ISA backend, or for values loaded from the user, we can place additional restrictions as needed (e.g., a backend provides the embedder a "minimum multiple for dynamic vector size" and the embedder uses this).

Contributor Author

The minimum lane count comes from the fact that we build a dynamic vector type from a fixed base vector type. The fixed base type provides the minimum lane count.

I added the halving/doubling to try to mimic the functionality of the existing SIMD types, and the ones commonly used, but I don't think it's needed and possibly doesn't make sense. But the main reason was for encoding space...

My problem here was that cranelift has a very large range of SIMD types, which made it hard for me to add another group of types with a reasonable range. What I wanted to do was use a single bit, along with the existing SIMD types, to express a dynamic vector type, but I struggled and failed there... So I'm very happy to hear suggestions; I would really prefer not to use this 2-bit system.

Member

What if we define dynamic vector types not on top of a base fixed-vector type, but on top of the lane type alone, and then make the dynamic factor (the symbolic N) provide a bit width, rather than a multiplier? This is sort of starting from what I was getting at with i32xN above (rather than i32x4xN), but then shifting the meaning of N a bit so that we can have the same N (symbolically at least; in practice this is all resolved at compile time) for every lane type.

Maybe we need a slightly different notation to make this more clear: perhaps (please suggest other punctuation, anyone!) i32/N (i32s over N bits). Then we could have

  N = dyn_width_constant 1024
  dt0 = i32/N
  dt1 = i64/N

  v1 = load.dt0 ...
  v2 = bitcast.dt1 v1
  store.dt1 v2, ...

I like this because it removes any notion of "minimum lane count" or some other factor (why 4 in i32x4xN, other than the fact that historically we started with 128-bit vectors and we're expanding from there?). We just have the important bits: the lane type (i32), and the machine width we're scaling to (N). Thoughts?
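The two formulations can be contrasted with a small sketch (hypothetical helper functions, purely illustrative): the `i32/N` reading fixes the lane type and the machine width and lets the lane count fall out, while the `i32x4xN` reading fixes a 128-bit base vector and multiplies it.

```rust
// `i32/N`: lanes = machine width / lane width.
fn lanes_from_width(lane_bits: u32, n_bits: u32) -> u32 {
    assert!(n_bits % lane_bits == 0);
    n_bits / lane_bits
}

// `i32x4xN`: lanes = base lane count * multiplier.
fn lanes_from_multiplier(base_lanes: u32, n_mult: u32) -> u32 {
    base_lanes * n_mult
}

fn main() {
    // With N resolved to 1024 bits (as in `dyn_width_constant 1024`):
    assert_eq!(lanes_from_width(32, 1024), 32); // dt0 = i32/N
    assert_eq!(lanes_from_width(64, 1024), 16); // dt1 = i64/N
    // The existing fixed types are the N = 128 special case:
    assert_eq!(lanes_from_width(32, 128), 4); // i32x4 == i32/128
    // The same 1024-bit machine in multiplier terms: N = 8 over i32x4.
    assert_eq!(lanes_from_multiplier(4, 8), 32);
}
```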

Contributor Author

I'm not sure I follow... for one, with dyn_width_constant taking an immediate value, shouldn't we just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths - otherwise people have to make a hard choice of whether to use it or not. I think with your suggestion, we now only have types that use the full width of the register, and so we wouldn't be able to support widen and narrow operations?

Member

I'm not sure I follow... for one, with dyn_width_constant taking an immediate value, shouldn't we just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

Ah, so I didn't mean for the dyn_width_constant to be the interesting part of that example; perhaps it results from a pass that legalizes to a particular microarchitecture with a known width. The point was to have the global value that defines machine width, as we've had so far; maybe dyn_width_vl for a vector-length register or somesuch...

The basic idea I was going for was to have a parameter for the machine width, and say "vector of lane type T up to width N". Whereas the i64x2xN sort of syntax says "an N-times multiple of a base 128-bit vector machine". Just holding a different set of two of the three variables in "machine width = lane width * lanes" constant, I guess.

Said another way, the existing vector types i8x16, i16x8, i32x4, i64x2 can be written as i8/128, i16/128, i32/128, i64/128; and then this is making that width configurable, rather than the multiplier over 128 bits base width configurable.

That said...

The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths - otherwise people have to make a hard choice of whether to use it or not.

I also see the advantages of the i64x2xN approach with respect to the existing SIMD types: we're basically "slapping a multiplier on it" and everything else remains the same. The downside is the encoding and from-first-principles comprehensibility of the types.

All else being equal, I think either could work, and an easier migration path is important; and you've done the work to prototype the latter; so i64x2xN is totally workable. I won't push the above ideas too hard against it :-)

I think with your suggestion, we now only have types that use the full width of the register and so we wouldn't be able to support widen and narrow operations?

Ah so here it actually gets interesting: what happens to the {1/2, 1, 2} multipliers if we do two narrow or widen ops in a row?
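The concern can be made concrete with a toy model (hypothetical, not from the PR) of a multiplier confined to the 2-bit {1/2, 1, 2} range: one step is representable, but a second step in the same direction has nowhere to go.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mult {
    Half,
    One,
    Two,
}

// Double the multiplier, if the result is still encodable.
fn double(m: Mult) -> Option<Mult> {
    match m {
        Mult::Half => Some(Mult::One),
        Mult::One => Some(Mult::Two),
        Mult::Two => None, // 4x has no encoding in the 2-bit scheme
    }
}

fn main() {
    // One step from the base type is representable...
    assert_eq!(double(Mult::One), Some(Mult::Two));
    // ...but a second step in the same direction is not.
    assert_eq!(double(Mult::Two), None);
}
```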

Member

@cfallin cfallin Jun 14, 2022

Ah, I think I missed that the enum discriminant in ValueData is an implicit u8, so right now it fits in u64 but making Type a u16 inflates that (probably at least to 12 bytes if not 16?).

I have some thoughts on bitpacking and giving Type 14 bits instead; I'll do a PR and do some measurements, and get this in as a separate refactor if it works to pave the way for your work :-)
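The bitpacking idea can be sketched as follows (a hypothetical layout for illustration, not the actual scheme in #4269): a 14-bit type index, a 32-bit payload, and a 2-bit variant tag packed into one u64, so widening the type index past 8 bits does not grow the per-value storage.

```rust
const TYPE_BITS: u64 = 14;
const TYPE_MASK: u64 = (1 << TYPE_BITS) - 1;

// Pack the three fields into a single word: type in bits 0..14,
// payload in bits 14..46, tag in the top two bits.
fn pack(ty: u16, payload: u32, tag: u8) -> u64 {
    debug_assert!((ty as u64) <= TYPE_MASK);
    debug_assert!(tag < 4);
    (ty as u64) | ((payload as u64) << TYPE_BITS) | ((tag as u64) << 62)
}

fn unpack(bits: u64) -> (u16, u32, u8) {
    (
        (bits & TYPE_MASK) as u16,
        ((bits >> TYPE_BITS) & 0xffff_ffff) as u32,
        (bits >> 62) as u8,
    )
}

fn main() {
    // Round-trips even at the 14-bit type-index limit.
    let bits = pack(0x3fff, 0xdead_beef, 3);
    assert_eq!(unpack(bits), (0x3fff, 0xdead_beef, 3));
}
```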

Member

@sparker-arm I took a look at this just now in #4269. The impact on compiling SpiderMonkey.wasm was actually 4.5%, which is quite serious and so we definitely need to do something here -- the bitpacking in that PR gets it down to 1.0% but I'm hoping we can do better (ideally I wouldn't want to pay more than 0.1% for this, given that compile-time wins are generally hard to come by).

Contributor Author

So, I re-ran my experiments looking at reported time instead of raw cycles, and I see u16 causing a 1.1% or 1.3% slowdown, depending on whether I use the mean or minimum values. (I don't understand why cycles and time are so different...) And I'm a bit concerned that your results are significantly different from mine. What machine are you running on? And are you running sightglass in a way that mimics a 'real-life' scenario, with the options --iterations-per-process 10 --processes 2?

To add to my confusion, when I benchmark your 14-bit patch I confirm your results, with a 0.78% slowdown...

Member

Results for me were on a desktop Ryzen 3900X 12-core system. The iteration count and process count were a tradeoff to get some results and not wait forever, but all else equal, larger counts are better (it lets Sightglass get a tighter confidence interval). If run with just two engines rather than three it'll do a statistical-significance test too.

Compilation input can make a big difference in slowdowns observed (e.g. if functions get big enough for IR data structures to exceed L1), are you testing with SpiderMonkey.wasm or something else?

Contributor Author

Yeah, I'm using spidermonkey.wasm and I'm on a big and shiny Ampere Altra. Maybe this has something to do with x64 vs aarch64, though I assume that the size of L1 caches varies very little between CPUs these days. Considering the benchmark takes < 10 seconds, is the tradeoff really necessary...?

self.stack_slots.push(data)
/// Creates a sized stack slot in the function, to be used by `stack_load`, `stack_store`
/// and `stack_addr` instructions.
pub fn create_sized_stack_slot(&mut self, data: StackSlotData) -> StackSlot {
Contributor

Could you please keep the original name? This is an unnecessary breaking change that would break the nightly cranelift tests of cg_clif.

Contributor Author

I have no strong opinion, but opposing feedback that I've previously received from @fitzgen (I can't remember in which area) was that we should make it clear how to differentiate between fixed and dynamic. If that was just around type naming, then it's a non-issue and I'm happy to revert. It's already been a problem for me while rebasing!

Member

I'd prefer we have the explicit name as well; @bjorn3 each Cranelift upgrade is a semver break currently and our APIs are still considered unstable. We shouldn't needlessly break compatibility, but getting things right and finding clearer names and less error-prone APIs is a higher priority right now.

Contributor

I guess so @cfallin.

)
.operands_in(vec![DSS])
.operands_out(vec![addr]),
);
Contributor

Maybe I missed this at some point, but why do we need new stack manipulation instructions? Wouldn't the old instructions have the attached dynamic type which would be enough to determine where the dynamic slots exist?

Member

The issue is the operand typing -- these take a DynamicStackSlot rather than a StackSlot, and that distinction in turn is needed for the ABI code. We could have made each StackSlot be polymorphic over statically-sized or dynamically-sized type, though that adds a bit of complication. Given that we do the layout differently for each, and lower each slightly differently, I think it's reasonable to keep them separated, but I don't feel too strongly about that.
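The operand-typing point can be sketched with a pair of newtypes (hypothetical, not the real entity definitions): distinct reference types mean the new instructions can only ever name a dynamic slot, and the ABI code is free to lay out the two kinds of slot differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct StackSlot(u32); // size fixed when the function is built

#[derive(Debug, Clone, Copy, PartialEq)]
struct DynamicStackSlot(u32); // size resolved during codegen

fn sized_slot_bytes(_slot: StackSlot, declared_size: u32) -> u32 {
    // Sized slots know their size up front.
    declared_size
}

fn dynamic_slot_bytes(_slot: DynamicStackSlot, resolved_vector_bytes: u32) -> u32 {
    // Dynamic slots defer layout until the backend has resolved the scale.
    resolved_vector_bytes
}

fn main() {
    assert_eq!(sized_slot_bytes(StackSlot(0), 16), 16);
    assert_eq!(dynamic_slot_bytes(DynamicStackSlot(0), 32), 32);
    // Passing a StackSlot to `dynamic_slot_bytes` would not compile,
    // which is the operand-typing distinction in miniature.
}
```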

@sparker-arm
Contributor Author

@cfallin I've added widen and narrow support and I've maintained the existing constraint system by sinking DynamicTypes from Function into DataFlowGraph. This requires the user to declare any necessary dynamic types but they don't need to be explicit about their output types.

Member

@cfallin cfallin left a comment

Thanks -- we're definitely getting closer!

At a top level, in addition to all the comments below, I noticed that Type became u16 but I didn't see anything like the tricks in #4269 -- given my measurements I don't think we can just make it a u16 and be done with it. Could you incorporate either that bitpacking, or a sentinel-value-and-sparse-hashmap approach?


cranelift/codegen/src/ir/types.rs (outdated; resolved)
@@ -558,6 +565,7 @@ impl ABIMachineSpec for AArch64MachineDeps {
}

fn gen_get_stack_addr(mem: StackAMode, into_reg: Writable<Reg>, _ty: Type) -> Inst {
// FIXME: Do something different for dynamic types?
Member

Can we file a followup issue to track what to do about dynamic vector types as args?

(Am I assuming right that usually they won't be args, but rather the args themselves will be pointers to the data we use with dynamically-sized vectors?)

Contributor Author

AFAIK, yes. It's either in a scalable register (data or predicate) or we pass a pointer.

cranelift/codegen/src/isa/aarch64/inst/args.rs (outdated; resolved)
cranelift/codegen/src/isa/aarch64/lower_inst.rs (outdated; resolved)
cranelift/codegen/src/isa/aarch64/mod.rs (outdated; resolved)
cranelift/codegen/src/legalizer/mod.rs (outdated; resolved)
cranelift/codegen/src/machinst/abi_impl.rs (outdated; resolved)
let ty = f
.get_concrete_dynamic_ty(data.dyn_ty)
.unwrap_or_else(|| panic!("invalid dynamic vector type: {}", data.dyn_ty));
dynamic_stack_offset += isa.vector_scale(ty);
Member

I may be missing it but where does the global value (vector scale) actually come in here to determine the size of the dynamically-sized type?

Contributor Author

Yes, this is a problem, as they are currently disjoint. I've currently renamed this to dynamic_vector_bytes to differentiate it from the scaling factor. The only way to tie them together, that I can think of, is to pass the value from TargetIsa when legalizing the GlobalValue, which I'll try out.

Member

OK, I think this is fine to start, but let's add a TODO and a tracking issue for truly dynamic stack frame layout based on some runtime value. At least, if I understand correctly, one option all along has been for runtime-determined size (from a vmcontext configuration field or some other input), which is what motivated us to take the GV-based approach; so we should build that eventually. I'm happy to see it done incrementally though in this case.

cranelift/codegen/src/machinst/abi_impl.rs (outdated; resolved)
@sparker-arm
Contributor Author

I noticed that Type became u16 but I didn't see anything like the tricks in #4269

Would you mind committing those changes so I can rebase on top? I'm happy to add my optimization on top of it.

@sparker-arm
Contributor Author

Squashed the commits and rebased with Type as u16. Also corrected the global legalize code and updated the 'test' that should have caught the problem.

Member

@cfallin cfallin left a comment

@sparker-arm I've been over most of this again recently and I think it's looking good. I think the extended versions of some of the features (truly dynamic stackframe layout based on runtime size configuration, at least) can come in as subsequent PRs, now that the basic framework is here, as on balance what's here is self-consistent and I don't want to keep the PR hanging for too long and have you pay the rebase cost.

So, with some tracking issues filed as per the outstanding codereview comments (and with the merge conflict in the interpreter resolved), I think we're good to merge this!


Copyright (c) 2022, Arm Limited.
@cfallin
Member

cfallin commented Jul 7, 2022

I spoke to @sparker-arm and he is planning to file followup tracking issues tomorrow; for now, this is (finally) good to merge. Thanks for the patience throughout the process!

@cfallin cfallin merged commit 9c43749 into bytecodealliance:main Jul 7, 2022
@sparker-arm
Contributor Author

Thanks @cfallin for all your help! I've raised an issue for the next steps.

Labels: cranelift, cranelift:meta, cranelift:area:aarch64, cranelift:area:clif
4 participants