
[RFC] Dynamic Vector Support #4200

Merged
merged 1 commit into from
Jul 7, 2022

Conversation

sparker-arm
Contributor

Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.

Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.

The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
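The shape of the concept can be sketched in a few lines of Rust (hypothetical types for illustration only, not Cranelift's actual API): a dynamic vector type pairs a fixed-width base type with a scaling factor that is carried symbolically in the IR as a global value and only resolved to a constant during codegen.

```rust
#[derive(Clone, Copy)]
struct FixedVectorType {
    lane_bytes: u32, // e.g. 4 for an i32 lane
    lanes: u32,      // e.g. 4 for i32x4
}

#[derive(Clone, Copy)]
struct DynamicVectorType {
    base: FixedVectorType, // the fixed-width part, e.g. i32x4 in I32X4XN
}

impl DynamicVectorType {
    // Byte size of a dynamic stack slot once the backend has lowered
    // the scaling factor to a compile-time constant.
    fn byte_size(&self, resolved_scale: u32) -> u32 {
        self.base.lane_bytes * self.base.lanes * resolved_scale
    }
}

fn main() {
    let i32x4 = FixedVectorType { lane_bytes: 4, lanes: 4 };
    let i32x4xn = DynamicVectorType { base: i32x4 };
    // The current aarch64 (Neon) lowering resolves the scale to 1:
    assert_eq!(i32x4xn.byte_size(1), 16);
    // A 256-bit vector target would resolve it to 2:
    assert_eq!(i32x4xn.byte_size(2), 32);
}
```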

@sparker-arm sparker-arm added the cranelift (Issues related to the Cranelift code generator), cranelift:meta (Everything related to the meta-language), and cranelift:area:aarch64 (Issues related to AArch64 backend) labels on Jun 1, 2022
@sparker-arm sparker-arm requested review from cfallin and abrown June 1, 2022 10:23
Member

@cfallin cfallin left a comment

This is a great start -- thanks for the patience as we've worked out the design details! It seems we're pretty close now. I read over this PR to get an initial sense of the implementation approach, but I'll do a more detailed pass once we resolve a higher-level question that I have.

Also it seems there are some blocks of commented-out code, and a few FIXMEs -- happy to help think about unresolved questions if needed, otherwise let us know when those are ready for review I guess?

The high-level question has to do with the "minimum lane count" and the 2-bit encoding scheme this PR adds to the type index -- see below:

@@ -76,6 +76,13 @@ pub enum GlobalValueData {
/// Does this symbol refer to a thread local storage value?
tls: bool,
},

/// Value is a multiple of how many instances of `vector_type` will fit in
/// a target vector register.
Member

This is slightly unclear to me: does this mean that it takes the size of the target's register width and divides by vector_type? i.e. that it has a concrete value determined by the compiler configuration / target microarchitecture? Or is it passed in / determined in some other way?

Or I guess my confusion comes from: I would expect DynScale as a name to imply that this is some generic sort of scale parameter, while this seems to say that it is a specific scale value determined by some algorithm ("what will fit in ..."). So maybe a better name would help: DynScaleTargetReg? And in the future we may have some other DynScale... options based on loads of parameters from vmctx, or a special vector-length register, or a DynScaleConstant, or ...

Contributor Author

does this mean that it takes the size of the target's register width and divides by vector_type?

Yes. So for a given dynamic type, the backend will scale the fixed part of the type by DynScale. The name is likely misleading.

I feel we shouldn't need to prescribe different global values for constant scale or truly dynamic ones via a register, as it's not immediately obvious, to me, why this would be helpful at the IR level. I feel we should be able to encapsulate the semantics we want with a single construct and allow a backend to lower it how it wishes.... but...

It would almost certainly make it far easier to implement the machinst ABI and, as that is a particular point of pain, it would be good to make that as simple as possible :)

For this first implementation we could rename to DynScaleConstant and it would much better describe the current limitations.

Member

Yes, I think DynScaleTargetWidth or something might be best here. As to why different global values for the different cases -- if I'm understanding the question right, the answer is that the details are really ABI-visible (e.g., is there a hidden register that controls how much data we read at once that must be set up? is it part of a vmcontext in a runtime?) so they should be surfaced in the IR. And having it clearly here also lets us define CLIF semantics for it, which is useful for interpreters and for formal verification.

Anyway with the above rename I think this is settled for now!

@@ -22,6 +22,11 @@ use target_lexicon::{PointerWidth, Triple};
///
/// SIMD vector types have power-of-two lanes, up to 256. Lanes can be any int/float/bool type.
///
/// Dynamic SIMD vector types have a power-of-two minimum lane count, which is scaled by a target-defined
Member

Can you say more about the "minimum lane count" concept -- why it exists, and cases where one might provide different minimum lane counts? And why only {1/2, 1, 2} as factors?

In particular if this comes from a machine restriction, I'm somewhat concerned that we would be baking a specific set of details into the type-system framework forever. I'd rather have a somewhat higher-level notion of dynamic types in CLIF, e.g. i32xN; and then when we evaluate the dyn_scale_... global value, we can either make whatever guarantees we need for N if the value is at the discretion of the ISA backend, or for values loaded from the user, we can place additional restrictions as needed (e.g., a backend provides the embedder a "minimum multiple for dynamic vector size" and the embedder uses this).

Contributor Author

The minimum lane count comes from the fact that we build a dynamic vector type from a fixed base vector type. The fixed base type provides the minimum lane count.

I added the halving/doubling to try to mimic the functionality of the existing SIMD types, and the ones commonly used, but I don't think it's needed and possibly doesn't make sense. But the main reason was for encoding space...

My problem here was that cranelift has a very large range of SIMD types, which made it hard for me to add another group of types with a reasonable range. What I wanted to do was use a single bit, along with the existing SIMD types, to express a dynamic vector type, but I struggled and failed there... So I'm very happy to hear suggestions; I would really prefer not to use this 2-bit system.

Member

What if we define dynamic vector types not on top of a base fixed-vector type, but on top of the lane type alone, and then make the dynamic factor (the symbolic N) provide a bit width, rather than a multiplier? This is sort of starting from what I was getting at with i32xN above (rather than i32x4xN), but then shifting the meaning of N a bit so that we can have the same N (symbolically at least; in practice this is all resolved at compile time) for every lane type.

Maybe we need a slightly different notation to make this more clear: perhaps (please suggest other punctuation, anyone!) i32/N (i32s over N bits). Then we could have

  N = dyn_width_constant 1024
  dt0 = i32/N
  dt1 = i64/N

  v1 = load.dt0 ...
  v2 = bitcast.dt1 v1
  store.dt1 v2, ...

I like this because it removes any notion of "minimum lane count" or some other factor (why 4 in i32x4xN, other than the fact that historically we started with 128-bit vectors and we're expanding from there?). We just have the important bits: the lane type (i32), and the machine width we're scaling to (N). Thoughts?
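The two formulations can be contrasted with a small sketch (hypothetical helper functions, purely illustrative): the `i32/N` reading fixes the lane type and the machine width and lets the lane count fall out, while the `i32x4xN` reading fixes a 128-bit base vector and multiplies it.

```rust
// `i32/N`: lanes = machine width / lane width.
fn lanes_from_width(lane_bits: u32, n_bits: u32) -> u32 {
    assert!(n_bits % lane_bits == 0);
    n_bits / lane_bits
}

// `i32x4xN`: lanes = base lane count * multiplier.
fn lanes_from_multiplier(base_lanes: u32, n_mult: u32) -> u32 {
    base_lanes * n_mult
}

fn main() {
    // With N resolved to 1024 bits (as in `dyn_width_constant 1024`):
    assert_eq!(lanes_from_width(32, 1024), 32); // dt0 = i32/N
    assert_eq!(lanes_from_width(64, 1024), 16); // dt1 = i64/N
    // The existing fixed types are the N = 128 special case:
    assert_eq!(lanes_from_width(32, 128), 4); // i32x4 == i32/128
    // The same 1024-bit machine in multiplier terms: N = 8 over i32x4.
    assert_eq!(lanes_from_multiplier(4, 8), 32);
}
```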

Contributor Author

I'm not sure I follow... for one, with dyn_width_constant taking an immediate value, shouldn't we just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths - otherwise people have to make a hard choice of whether to use it or not. I think with your suggestion, we now only have types that use the full width of the register, and so we wouldn't be able to support widen and narrow operations?

Member

I'm not sure I follow... for one, with dyn_width_constant taking an immediate value, shouldn't we just use fixed sizes at the IR level? This doesn't appear to enable anything dynamic.

Ah, so I didn't mean for the dyn_width_constant to be the interesting part of that example; perhaps it results from a pass that legalizes to a particular microarchitecture with a known width. The point was to have the global value that defines machine width, as we've had so far; maybe dyn_width_vl for a vector-length register or somesuch...

The basic idea I was going for was to have a parameter for the machine width, and say "vector of lane type T up to width N". Whereas the i64x2xN sort of syntax says "an N-times multiple of a base 128-bit vector machine". Just holding a different set of two of the three variables in "machine width = lane width * lanes" constant, I guess.

Said another way, the existing vector types i8x16, i16x8, i32x4, i64x2 can be written as i8/128, i16/128, i32/128, i64/128; and then this is making that width configurable, rather than the multiplier over 128 bits base width configurable.

That said...

The other problem is that I think we should aim to enable writing clif in dynamic terms without losing too much of the functionality of fixed widths - otherwise people have to make a hard choice of whether to use it or not.

I also see the advantages of the i64x2xN approach with respect to the existing SIMD types: we're basically "slapping a multiplier on it" and everything else remains the same. The downside is the encoding and from-first-principles comprehensibility of the types.

All else being equal, I think either could work, and an easier migration path is important; and you've done the work to prototype the latter; so i64x2xN is totally workable. I won't push the above ideas too hard against it :-)

I think with your suggestion, we now only have types that use the full width of the register and so we wouldn't be able to support widen and narrow operations?

Ah so here it actually gets interesting: what happens to the {1/2, 1, 2} multipliers if we do two narrow or widen ops in a row?
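The concern can be made concrete with a toy model (hypothetical, not from the PR) of a multiplier confined to the 2-bit {1/2, 1, 2} range: one step is representable, but a second step in the same direction has nowhere to go.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mult {
    Half,
    One,
    Two,
}

// Double the multiplier, if the result is still encodable.
fn double(m: Mult) -> Option<Mult> {
    match m {
        Mult::Half => Some(Mult::One),
        Mult::One => Some(Mult::Two),
        Mult::Two => None, // 4x has no encoding in the 2-bit scheme
    }
}

fn main() {
    // One step from the base type is representable...
    assert_eq!(double(Mult::One), Some(Mult::Two));
    // ...but a second step in the same direction is not.
    assert_eq!(double(Mult::Two), None);
}
```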

Member

@cfallin cfallin Jun 14, 2022

Ah, I think I missed that the enum discriminant in ValueData is an implicit u8, so right now it fits in u64 but making Type a u16 inflates that (probably at least to 12 bytes if not 16?).

I have some thoughts on bitpacking and giving Type 14 bits instead; I'll do a PR and do some measurements, and get this in as a separate refactor if it works to pave the way for your work :-)
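The bitpacking idea can be sketched as follows (a hypothetical layout for illustration, not the actual scheme in #4269): a 14-bit type index, a 32-bit payload, and a 2-bit variant tag packed into one u64, so widening the type index past 8 bits does not grow the per-value storage.

```rust
const TYPE_BITS: u64 = 14;
const TYPE_MASK: u64 = (1 << TYPE_BITS) - 1;

// Pack the three fields into a single word: type in bits 0..14,
// payload in bits 14..46, tag in the top two bits.
fn pack(ty: u16, payload: u32, tag: u8) -> u64 {
    debug_assert!((ty as u64) <= TYPE_MASK);
    debug_assert!(tag < 4);
    (ty as u64) | ((payload as u64) << TYPE_BITS) | ((tag as u64) << 62)
}

fn unpack(bits: u64) -> (u16, u32, u8) {
    (
        (bits & TYPE_MASK) as u16,
        ((bits >> TYPE_BITS) & 0xffff_ffff) as u32,
        (bits >> 62) as u8,
    )
}

fn main() {
    // Round-trips even at the 14-bit type-index limit.
    let bits = pack(0x3fff, 0xdead_beef, 3);
    assert_eq!(unpack(bits), (0x3fff, 0xdead_beef, 3));
}
```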

Member

@sparker-arm I took a look at this just now in #4269. The impact on compiling SpiderMonkey.wasm was actually 4.5%, which is quite serious and so we definitely need to do something here -- the bitpacking in that PR gets it down to 1.0% but I'm hoping we can do better (ideally I wouldn't want to pay more than 0.1% for this, given that compile-time wins are generally hard to come by).

Contributor Author

So, I re-ran my experiments looking at reported time instead of raw cycles, and I see u16 causing a 1.1% or 1.3% slowdown, depending on whether I use the mean or minimum values. (I don't understand why cycles and time are so different...) And I'm a bit concerned that your results are significantly different from mine. What machine are you running on? And are you running sightglass in a way that mimics a 'real-life' scenario, with the options --iterations-per-process 10 --processes 2?

To add to my confusion, when I benchmark your 14-bit patch I confirm your results, with a 0.78% slowdown...

Member

Results for me were on a desktop Ryzen 3900X 12-core system. The iteration count and process count were a tradeoff to get some results and not wait forever, but all else equal, larger counts are better (it lets Sightglass get a tighter confidence interval). If run with just two engines rather than three it'll do a statistical-significance test too.

Compilation input can make a big difference in slowdowns observed (e.g. if functions get big enough for IR data structures to exceed L1), are you testing with SpiderMonkey.wasm or something else?

Contributor Author

Yeah, I'm using spidermonkey.wasm and I'm on a big and shiny Ampere Altra. Maybe this has something to do with x64 vs aarch64, though I assume that the size of L1 caches varies very little between CPUs these days. Considering the benchmark takes < 10 seconds, is the tradeoff really necessary...?

self.stack_slots.push(data)
/// Creates a sized stack slot in the function, to be used by `stack_load`, `stack_store`
/// and `stack_addr` instructions.
pub fn create_sized_stack_slot(&mut self, data: StackSlotData) -> StackSlot {
Contributor

Could you please keep the original name? This is an unnecessary breaking change that would break the nightly cranelift tests of cg_clif.

Contributor Author

I have no strong opinion, but opposing feedback that I've previously received from @fitzgen (I can't remember in which area) was that we should make it clear how to differentiate between fixed and dynamic. If that was just around type naming, then it's a non-issue and I'm happy to revert. It's already been a problem for me while rebasing!

Member

I'd prefer we have the explicit name as well; @bjorn3 each Cranelift upgrade is a semver break currently and our APIs are still considered unstable. We shouldn't needlessly break compatibility, but getting things right and finding clearer names and less error-prone APIs is a higher priority right now.

Contributor

I guess so @cfallin.

)
.operands_in(vec![DSS])
.operands_out(vec![addr]),
);
Contributor

Maybe I missed this at some point, but why do we need new stack manipulation instructions? Wouldn't the old instructions have the attached dynamic type which would be enough to determine where the dynamic slots exist?

Member

The issue is the operand typing -- these take a DynamicStackSlot rather than a StackSlot, and that distinction in turn is needed for the ABI code. We could have made each StackSlot be polymorphic over statically-sized or dynamically-sized type, though that adds a bit of complication. Given that we do the layout differently for each, and lower each slightly differently, I think it's reasonable to keep them separated, but I don't feel too strongly about that.
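The operand-typing point can be sketched with a pair of newtypes (hypothetical, not the real entity definitions): distinct reference types mean the new instructions can only ever name a dynamic slot, and the ABI code is free to lay out the two kinds of slot differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct StackSlot(u32); // size fixed when the function is built

#[derive(Debug, Clone, Copy, PartialEq)]
struct DynamicStackSlot(u32); // size resolved during codegen

fn sized_slot_bytes(_slot: StackSlot, declared_size: u32) -> u32 {
    // Sized slots know their size up front.
    declared_size
}

fn dynamic_slot_bytes(_slot: DynamicStackSlot, resolved_vector_bytes: u32) -> u32 {
    // Dynamic slots defer layout until the backend has resolved the scale.
    resolved_vector_bytes
}

fn main() {
    assert_eq!(sized_slot_bytes(StackSlot(0), 16), 16);
    assert_eq!(dynamic_slot_bytes(DynamicStackSlot(0), 32), 32);
    // Passing a StackSlot to `dynamic_slot_bytes` would not compile,
    // which is the operand-typing distinction in miniature.
}
```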

@sparker-arm
Contributor Author

@cfallin I've added widen and narrow support and I've maintained the existing constraint system by sinking DynamicTypes from Function into DataFlowGraph. This requires the user to declare any necessary dynamic types but they don't need to be explicit about their output types.

Member

@cfallin cfallin left a comment

Thanks -- we're definitely getting closer!

At a top level, in addition to all the comments below, I noticed that Type became u16 but I didn't see anything like the tricks in #4269 -- given my measurements I don't think we can just make it a u16 and be done with it. Could you incorporate either that bitpacking, or a sentinel-value-and-sparse-hashmap approach?


cranelift/codegen/src/ir/types.rs (outdated; resolved)
@@ -558,6 +565,7 @@ impl ABIMachineSpec for AArch64MachineDeps {
}

fn gen_get_stack_addr(mem: StackAMode, into_reg: Writable<Reg>, _ty: Type) -> Inst {
// FIXME: Do something different for dynamic types?
Member

Can we file a followup issue to track what to do about dynamic vector types as args?

(Am I assuming right that usually they won't be args, but rather the args themselves will be pointers to the data we use with dynamically-sized vectors?)

Contributor Author

AFAIK, yes. It's either in a scalable register (data or predicate) or we pass a pointer.

cranelift/codegen/src/isa/aarch64/inst/args.rs (outdated; resolved)
cranelift/codegen/src/isa/aarch64/lower_inst.rs (outdated; resolved)
cranelift/codegen/src/isa/aarch64/mod.rs (outdated; resolved)
cranelift/codegen/src/legalizer/mod.rs (outdated; resolved)
cranelift/codegen/src/machinst/abi_impl.rs (outdated; resolved)
let ty = f
.get_concrete_dynamic_ty(data.dyn_ty)
.unwrap_or_else(|| panic!("invalid dynamic vector type: {}", data.dyn_ty));
dynamic_stack_offset += isa.vector_scale(ty);
Member

I may be missing it but where does the global value (vector scale) actually come in here to determine the size of the dynamically-sized type?

Contributor Author

Yes, this is a problem, as they are currently disjoint. I've currently renamed this to dynamic_vector_bytes to differentiate it from the scaling factor. The only way to tie them together, that I can think of, is to pass the value from TargetIsa when legalizing the GlobalValue, which I'll try out.

Member

OK, I think this is fine to start, but let's add a TODO and a tracking issue for truly dynamic stack frame layout based on some runtime value. At least, if I understand correctly, one option all along has been for runtime-determined size (from a vmcontext configuration field or some other input), which is what motivated us to take the GV-based approach; so we should build that eventually. I'm happy to see it done incrementally though in this case.

cranelift/codegen/src/machinst/abi_impl.rs (outdated; resolved)
@sparker-arm
Contributor Author

I noticed that Type became u16 but I didn't see anything like the tricks in #4269

Would you mind committing those changes so I can rebase on top? I'm happy to add my optimization on top of it.

@sparker-arm
Contributor Author

Squashed the commits and rebased with Type as u16. Also corrected the global legalize code and updated the 'test' that should have caught the problem.

Member

@cfallin cfallin left a comment

@sparker-arm I've been over most of this again recently and I think it's looking good. I think the extended versions of some of the features (truly dynamic stackframe layout based on runtime size configuration, at least) can come in as subsequent PRs, now that the basic framework is here, as on balance what's here is self-consistent and I don't want to keep the PR hanging for too long and have you pay the rebase cost.

So, with some tracking issues filed as per the outstanding codereview comments (and with the merge conflict in the interpreter resolved), I think we're good to merge this!


Copyright (c) 2022, Arm Limited.
@cfallin
Member

cfallin commented Jul 7, 2022

I spoke to @sparker-arm and he is planning to file followup tracking issues tomorrow; for now, this is (finally) good to merge. Thanks for the patience throughout the process!

@cfallin cfallin merged commit 9c43749 into bytecodealliance:main Jul 7, 2022
@sparker-arm
Contributor Author

Thanks @cfallin for all your help! I've raised an issue for the next steps.

Labels: cranelift, cranelift:meta, cranelift:area:aarch64, cranelift:area:clif
4 participants