-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to reduce cost of Async #101605
Try to reduce cost of Async #101605
Conversation
Contributes to dotnet#79204. I don't know if this will be considered mergeable, but I wanted to at least try something. For apps that use async a lot (like the Stage2 app we use for Goldilocks), the async infrastructure can cost 10% of the entire executable. Shuffling a couple things in `GetStateMachineBox` I was able to get 0.23% saving. It's miniscule. In general async is death by a thousand papercuts so I don't see a silver bullet.
That's because RyuJIT inlines the box helper for large structs. I think this should be fixed in the JIT. Outlining of the event tracing helper looks good. |
...raries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs
Outdated
Show resolved
Hide resolved
@@ -180,64 +181,67 @@ public struct AsyncTaskMethodBuilder<TResult> | |||
// result in a boxing allocation when storing the TStateMachine if it's a struct, but | |||
// this only happens in active debugging scenarios where such performance impact doesn't | |||
// matter. | |||
if (taskField is AsyncStateMachineBox<IAsyncStateMachine> weaklyTypedBox) | |||
else if (taskField is AsyncStateMachineBox<IAsyncStateMachine> weaklyTypedBox) | |||
{ | |||
// If this is the first await, we won't yet have a state machine, so store it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're trying to squeeze water from a stone, this whole block should be extremely cold (only used in debugging scenarios), so if you can come up with a good way to outline it all to a separate helper, that'd be reasonable. Not sure if that'd help much, though, given that stateMachine is generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it would likely make things worse because it's still the same code, but with extra unwinding/GC tracking data structures because we have a new method.
Do you have a heuristic in mind? It seems like what RyuJIT is doing now is strictly better from throughput perspective (we call the right allocation helper, assign all the fields, only do write barriers for GC pointers). Falling back to the regular boxing helper would be slower. Or are you thinking about a |
I presume Jan means that for a struct with many gc fields (JIT can check that) it's better for VM/Box helper to do a batch move ( |
I expect that the bulk copy in the box helper starts to win in throughput above certain size. The JIT should probably give up inlining the copy way below this threshold for good code size / throughput tradeoff. Also, the current implementation of the box is slow since the box helper is almost never used today. Jeremy hit that problem when switching a few places in reflection to use the box helper (WIP #101137). |
Quick test: [Benchmark]
public object Box() => ms;
MyStruct ms;
struct MyStruct
{
string a1;
string a2;
string a3;
string a4;
string a5;
string a6;
} Benchmark for current impl vs using
|
I wonder if this also means the JIT should use the batch move when assigning structs with many GC fields. Looks like there is a The method I'm changing here does two struct assignments with these large structs - one as part of the box, and the other as part of the non-boxing path. Looks like it might be beneficial to treat both cases the same in the JIT and if one is using Line 189 in 81c1d87
Line 225 in 81c1d87
|
We discussed this in #99096 tldr: there is always a trade-off E.g. the code we currently generate is interruption-friendly and is more precise from GC's point of view (we update card table for each gc handle in a struct) |
For box helper, this precision loss in card table updates should not be a concern. The object was just allocated and so it is very unlikely that will be any card table updates necessary. (There was even a discussion at one point about what it would take to skip the write barrier calls completely in cases like this.) |
I made a draft PR to use the regular boxing helper above a certain number of GC fields at #101669. I hope we'll be able to tune the number with the bots/automation that exists in this repo. A struct with 6 fields, all of them GC fields should obviously use the box helper path but I'm less clear on the criteria for others (does mixing GC and non-GC fields make any difference, etc.?). |
// generating this extra code until a better solution is implemented. | ||
var box = new AsyncStateMachineBox<TStateMachine>(); | ||
// DebugFinalizableAsyncStateMachineBox looks like a small type, but it actually is not because | ||
// it will have a copy of all the slots from its parent. It will add another hundred(s) bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be out of scope for this PR, but how about using an unspecialized DebugFinalizableAsyncStateMachineBox<IAsyncStateMachine>
for NativeAOT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know much about this - if it's only used by managed debuggers, it would not be useful for native AOT anyway. It's ifdeffed out so I'm not concerned about it.
Contributes to dotnet#79204. I don't know if this will be considered mergeable, but I wanted to at least try something. For apps that use async a lot (like the Stage2 app we use for Goldilocks), the async infrastructure can cost 10% of the entire executable. Shuffling a couple things in `GetStateMachineBox` I was able to get 0.23% saving. It's miniscule. In general async is death by a thousand papercuts so I don't see a silver bullet.
Contributes to dotnet#79204. I don't know if this will be considered mergeable, but I wanted to at least try something. For apps that use async a lot (like the Stage2 app we use for Goldilocks), the async infrastructure can cost 10% of the entire executable. Shuffling a couple things in `GetStateMachineBox` I was able to get 0.23% saving. It's miniscule. In general async is death by a thousand papercuts so I don't see a silver bullet.
Contributes to #79204.
I don't know if this will be considered mergeable, but I wanted to at least try something. For apps that use async a lot (like the Stage2 app we use for Goldilocks), the async infrastructure can cost 10% of the entire executable.
Shuffling a couple things in
GetStateMachineBox
I was able to get 0.23% saving. It's miniscule. In general async is death by a thousand papercuts so I don't see a silver bullet.Click to expand existing assembly
Addresses following problems:
RhpByRefAssignRefAVLocation1
- that's the write barrier. I'm making the box take a less efficient path based on comments.This is just a couple dozen bytes in savings per method, but ends up saving 50 kB for the whole app because that's how many specializations of the method we have.
Cc @stephentoub @jkotas for thoughts.