This fixes emitGetVexPrefixSize to support detecting when the 2-byte prefix can be used #79478
Conversation
- …tSimdPrefixIfNeeded to emitOutputRexOrSimdPrefixIfNeeded
- …xAware to emitGetAdjustedSize
- …can be computed correctly
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

As raised on #79363 and in various past PRs/issues, we did not correctly estimate the size of the VEX prefix when used. This had negative side effects, such as allocating more memory than necessary, and when loop alignment support was added, it meant that we could no longer use the 2-byte prefix when also aligning loops. This resolves that by updating `emitGetVexPrefixSize` to check the relevant `instrDesc` inputs to determine if the 2-byte or 3-byte prefix will be used.

As a side effect, this also removes some code that was dead and does some other minor cleanup to improve the general handling of the VEX prefix.
CC @kunalspathak, @BruceForstall. This passed the full run. We'll want to also run the JitStress tests once CI shows there aren't any missed edge cases.
src/coreclr/jit/emitxarch.cpp (outdated)

```cpp
//
unsigned emitter::emitOutputSimdPrefixIfNeeded(instruction ins, BYTE* dst, code_t& code)
emitter::code_t emitter::emitExtractEvexPrefix(instruction ins, code_t& code)
```
I initially extracted these methods as I thought the easiest thing was going to be to just pass the `code_t` through to `emitGetVexPrefixSize` and then do something like:

```cpp
code_t vexPrefix = emitExtractVexPrefix(ins, code);
assert(vexPrefix != 0);

if ((vexPrefix & 0xFFFF7F80) == 0x00C46100)
{
    return 2;
}

return 3;
```
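As an aside (my annotation, not from the PR): the magic constants in that check fall directly out of the standard VEX layout. A minimal sketch of how they decode:

```cpp
// Sketch: why (vexPrefix & 0xFFFF7F80) == 0x00C46100 detects when the
// 2-byte prefix is usable. This is the architectural VEX layout, not
// PR-specific code.
//
//   3-byte VEX: 0xC4 [R X B mmmmm] [W vvvv L pp]
//   2-byte VEX: 0xC5 [R vvvv L pp]
//
// The 2-byte form can only be used when every field it cannot express
// holds its default: X = 1 and B = 1 (VEX stores REX.X/REX.B inverted,
// so 1 means "no extension"), mmmmm = 00001 (the 0F opcode map), and W = 0.
bool canUseTwoByteVex(unsigned vexPrefix)
{
    // 0xFFFF0000 of the mask: the leading bytes must be 0x00C4 (3-byte VEX)
    // 0x00007F00: checks X, B, mmmmm (R is masked off; 2-byte VEX has R too)
    // 0x00000080: checks W
    // 0x00C46100: the value those fields take when X=1, B=1, mmmmm=00001, W=0
    return (vexPrefix & 0xFFFF7F80) == 0x00C46100;
}
```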
However, there is quite a bit more logic that goes into correctly building up `code`, and since we don't cache it anywhere, rebuilding it was going to negatively impact throughput.

I ended up leaving the helper method here as it may still be useful in the future, and it isolates a large chunk of complex logic that was just splatted inline before.
src/coreclr/jit/emitxarch.cpp (outdated)

```cpp
unsigned emitter::emitGetEvexPrefixSize(instrDesc* id)
{
    instruction ins = id->idIns();
    assert(IsEvexEncodedInstruction(ins));
    return 4;
}
```
We only called this from `emitGetAdjustedSize`, and only under an existing `IsEvexEncodedInstruction` check, so I simplified it to just assert and return the constant (the EVEX prefix is always 4 bytes: 0x62 plus three payload bytes).
```cpp
// code -- The current opcode and any known prefixes
//
// Returns:
//    Updated size.
//
unsigned emitter::emitGetAdjustedSizeEvexAware(instruction ins, emitAttr attr, code_t code)
```
When `emitGetAdjustedSizeEvexAware` was added, `emitGetAdjustedSize` became fully dead code. It was duplicating quite a bit of complex logic, and if we really need it again we can grab it from the git history, so I removed the dead code and renamed the `EvexAware` method back to the original name.
src/coreclr/jit/emitxarch.cpp (outdated)

```cpp
// Returns:
//    Prefix size in bytes.
//
unsigned emitter::emitGetVexPrefixSize(instrDesc* id)
```
This function had to be moved "down" so it could access `hasCodeMR`.

Like I mentioned above, I was originally going to go a different route, but settled on the simpler approach here where we switch on the `insFmt` and do a couple of minor checks rather than trying to build up the full `code_t`.

If we were caching the `code_t` somewhere so we didn't need to rebuild it 2-3 times, then another approach would be better. That is also a much more involved/complex change, but one that may be worthwhile long term.
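To make the shape concrete, here's a rough sketch of the `insFmt`-driven approach described above (illustrative only, not the PR's exact code; the fragments quoted below slot into this shape):

```cpp
// Illustrative sketch of estimating the VEX prefix size from the
// instruction format instead of building the full code_t.
unsigned emitter::emitGetVexPrefixSize(instrDesc* id)
{
    instruction ins = id->idIns();

    if (EncodedBySSE38orSSE3A(ins))
    {
        // The 0x0F38/0x0F3A opcode maps can only be expressed by the
        // 3-byte form (mmmmm != 00001).
        return 3;
    }

    // For each format, work out which operand (if any) lands in the ModRM
    // r/m (012) bits or the SIB index bits; an extended register there
    // needs REX.B or REX.X, which the 2-byte form cannot encode.
    switch (id->idInsFmt())
    {
        default:
            // (per-format operand checks elided in this sketch)
            break;
    }

    return 2;
}
```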
|
||
if (EncodedBySSE38orSSE3A(ins)) | ||
{ | ||
// When the prefix is 0x0F38 or 0x0F3A, we must use the 3-byte encoding |
This filters out the majority of the complex instructions, particularly those that take 3 inputs or do other "special things" with register representation.
```cpp
if ((regForSibBits != REG_NA) && IsExtendedReg(regForSibBits))
{
    // When the REX.X bit is present, we must use the 3-byte encoding
    return 3;
}
```

```cpp
if ((regFor012Bits != REG_NA) && IsExtendedReg(regFor012Bits))
{
    // When the REX.B bit is present, we must use the 3-byte encoding
    return 3;
}
```
- `insEncodeReg345` uses the `REX.R` bit and is always available.
- `insEncodeReg3456` uses the `vvvv` field and is always available.
- `insEncodeReg012` uses the `REX.B` bit for extended registers and is only available in the 3-byte encoding.
- `insEncodeRegSIB` uses the `REX.X` bit for extended registers and is only available in the 3-byte encoding.

Since `SIB` is only used for address encodings, we typically don't need to worry about it. Likewise, we normally only have to worry about the `012` case for scenarios where an operand can come from register or memory.

For VEX encoded binary instructions, like `vaddps`, this is normally the second source operand, i.e. the `ins tgt, op1, op2/mem` scenario.

However, there are also some unary instructions, like `vmovd`, where this can be the destination or first operand:

`ins tgt/mem, op1`
`ins tgt, op1/mem`
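A worked example of the resulting size decision, using `vaddps` (my own illustration based on the rules above, not from the PR):

```cpp
// Worked illustration:
//   vaddps xmm0, xmm1, xmm2 -> 2-byte prefix: no operand needs REX.X/REX.B
//   vaddps xmm8, xmm1, xmm2 -> 2-byte prefix: xmm8 is the target, encoded
//                              via REX.R, which the 2-byte form supports
//   vaddps xmm0, xmm1, xmm8 -> 3-byte prefix: xmm8 lands in the r/m (012)
//                              bits and needs REX.B
```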
```cpp
    break;
}

case IF_RRW_RRW_CNS:
```
I wanted to call out this format in particular.

It seems like we have some cases where the `IF_*` format used doesn't quite "make sense". This one, for example, should probably be `IF_RWR_RRD_CNS`, since for the extracts the destination is only written and the source only read, rather than both being read-modify-write (the naming convention is sketched below).

It is currently used by `emitIns_R_R_I` and applies to instructions like:

- `pextrb`
- `pextrd`
- `pextrq`
- `pextrw_sse41`
- `extractps`
- `vextractf128`
- `vextracti128`
- `shld`
- `shrd`
- `psrldq`
- `pslldq`

There are other formats I saw as well that don't exactly match the semantics of the instruction that's using them.
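For reference, here is how these format names decode (my summary of the emitter's naming convention, not text from the PR):

```cpp
// Reading the insFormat names:
//   R = register operand, followed by its access mode:
//     RD = read, WR = written, RW = read and written
//   CNS = immediate constant operand
//
// So IF_RRW_RRW_CNS claims both registers are read *and* written, while
// an extract like pextrb only writes its destination and only reads its
// source -- which is what IF_RWR_RRD_CNS describes.
```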
We should definitely identify and fix cases where the `IF_*` formats are incorrectly used; they are complicated enough as-is, without some being wrong.
Diffs are hugely positive, with similar diffs on Linux x64.

There is a throughput regression, which is to be expected since we now have to do more checks/computation.
We could probably always estimate the 3-byte encoding in MinOpts to save time and reduce the MinOpts impact.
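Hypothetically, that shortcut could look something like this (a sketch of the suggestion, not code from the PR; as the later commits show, it didn't actually pay off):

```cpp
// Hypothetical MinOpts shortcut: bail out early with the conservative
// 3-byte estimate, trading at most one wasted byte per VEX instruction
// for less per-instruction analysis time.
unsigned emitter::emitGetVexPrefixSize(instrDesc* id)
{
    if (emitComp->opts.OptimizationDisabled())
    {
        return 3; // conservative: the 3-byte prefix always works
    }

    // Otherwise run the exact insFmt-based analysis shown earlier.
    return emitGetVexPrefixSizeExact(id); // hypothetical helper name
}
```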
src/coreclr/jit/emitxarch.cpp (outdated)

```cpp
// prefix if optimizations are enabled or we know we won't negatively impact the
// estimated alignment sizes.

if (emitComp->opts.OptimizationEnabled() || (emitCurIG->igNum > emitLastAlignedIgNum))
```
Talked with @kunalspathak, and this check is currently needed as we still try to do alignment when optimizations are disabled.

We may want to revisit that, since OSR + TC should handle all the important cases and aligning debug code likely isn't worth the cycles required.
-- Don't try to estimate the 2-byte VEX prefix when optimizations are disabled

This commit actually regressed throughput even more, which was unexpected. For example, MinOpts throughput changed from the above to…

I've pushed a new commit that tries just…
Turns out the… I manually reordered it and the codegen is a lot better. That being said,…
Better, but still regressed throughput for MinOpts, more so than for full opts, which doesn't really make sense since the check should mean we're doing "less work". I'd guess it's negatively interacting with something else, like the alignment support, and so the 4k savings we get makes up for the difference in time.
/azp run runtime-coreclr jitstress, Fuzzlyn
Azure Pipelines successfully started running 2 pipeline(s).
Fuzzlyn failures are happening in…
CC @dotnet/jit-contrib, this is ready for review. Good size savings in known hot code at the cost of a small TP regression. Resolving the TP regression would require some non-trivial work/refactoring in the emitter.
Have you looked at a detailed throughput trace (e.g. using @SingleAccretion's tool)? +0.2% to +0.4% in MinOpts is quite a bit (e.g. it is more than we spent in FullOpts on tail merging recently, which was a rather large optimization).
What is the tool, and where is the documentation for running it, etc.?
The general tool is the pintool; building it is documented here: https://github.com/SingleAccretion/Dotnet-Runtime.Dev#dotnet-runtimedev. The particular part Jakob refers to is this script: https://github.com/SingleAccretion/Dotnet-Runtime.Dev#analyze-pin-trace-diffps1---diff-the-traces-produced-by-the-pin-tool, which compares two traces captured using PIN and prints statistics on which methods are most responsible for regressions/improvements.
Numbers for 4d0c099 show the following (noting some methods were renamed and one method is new, so I tried to break it apart slightly):…

A slight refactoring (57d5725) changes it instead to be:…

It doesn't look to be profitable to skip the exact size estimation, even if the… Like I mentioned in Discord, the three biggest regressions are:…

All three of these are really unrelated to this change and are pre-existing issues. They are showing regressions because they aren't doing the same thing as all the other paths, and so the VEX-only changes are showing up where you wouldn't expect them to. We should ideally work on cleaning these up so that the only real impact is in the new code paths.
[unrelated to PR]

@SingleAccretion Looks like an awesome set of scripts for working with .NET and the JIT. I'm sure everyone on the CodeGen team has their own similar set. It's too bad we don't share these kinds of scripts more broadly, e.g., in jitutils, where they could end up in jitutils/bin, which will (likely) be on our PATH.
LGTM. Thanks for doing this.