Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize field access in types not marked with beforefieldinit #1327

Closed
mikernet opened this issue Jan 6, 2020 · 24 comments
Closed

Optimize field access in types not marked with beforefieldinit #1327

mikernet opened this issue Jan 6, 2020 · 24 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@mikernet
Copy link
Contributor

mikernet commented Jan 6, 2020

Is it possible for tiered compilation to partially alleviate my issue regarding beforefieldinit by addressing the performance aspect of that issue discussed here: dotnet/csharplang#3080?

I would still like the option of explicit control over class initialization timing as discussed there but it would be much less of an issue in most cases if tiered compilation could at least solve the performance aspects. For reference, here is a copy-paste of the benchmark code in that thread that shows the big performance discrepancy between static field access in classes with and without beforefieldinit:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

| Method |      Mean |     Error |    StdDev |
|------- |----------:|----------:|----------:|
| Static | 18.870 us | 0.1066 us | 0.0997 us |
|  Field |  2.713 us | 0.0099 us | 0.0092 us |

Code:

const int iterations = 10000;

public static class StaticCctor
{
    public static bool Value { get; set; }

    static StaticCctor()
    {
        Value = true;
    }
}

public static class FieldInit
{
    public static bool Value { get; set; } = true;
}

public class BeforeFieldInit
{
    [Benchmark]
    public void Static()
    {
        bool value;

        for (int i = 0; i < iterations; i++)
            value = StaticCctor.Value;

    }

    [Benchmark]
    public void Field()
    {
        bool value;

        for (int i = 0; i < iterations; i++)
            value = FieldInit.Value;
    }
}

category:cq
theme:optimization
skill-level:intermediate
cost:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 6, 2020
@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 6, 2020
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jan 6, 2020
@AndyAyersMS AndyAyersMS self-assigned this Jan 6, 2020
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jan 6, 2020

Thanks; yes this is something we can fix a consequence of us not re-jitting methods with loops, by default (see note below).

The difference right now is that the Field case hoists the static base accessor out of the loop, and the Static case does not:

;;; Static

G_M33157_IG03:
       48B9B0532B42F97F0000 mov      rcx, 0x7FF9422B53B0
       BA01000000           mov      edx, 1
       E8B5E9885F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       FFC6                 inc      esi
       81FE10270000         cmp      esi, 0x2710
       7CE2                 jl       SHORT G_M33157_IG03

;; Field

G_M65279_IG02:
       33F6                 xor      esi, esi
       48B9B0532B42F97F0000 mov      rcx, 0x7FF9422B53B0
       BA02000000           mov      edx, 2
       E875E9885F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE

G_M65279_IG03:
       FFC6                 inc      esi
       81FE10270000         cmp      esi, 0x2710
       7CF6                 jl       SHORT G_M65279_IG03

The code in optComputeLoopSideEffectsOfBlock is too conservative, it checks if the helper call could cause class initialization, but doesn't also check if the class happens to already have been initialized.

@jkotas
Copy link
Member

jkotas commented Jan 6, 2020

@AndyAyersMS If the class have been initialized already, the importer won't create CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE at all.

To fix this specific case, we would need to run the static initialization during the first iteration of the loop only.

@mikernet
Copy link
Contributor Author

mikernet commented Jan 6, 2020

If this issue is fixed, are there still going to be significant performance penalties for static field access on non-beforefieldinit types in some situations or has that largely been mitigated in newer .NET Core releases? I haven't seen any recent discussions on the performance implications of having a static constructor. The latest thing I could find is this PR dotnet/corefx#6016 where it seems like people are still trying to avoid static constructors where possible for performance reasons.

@jkotas
Copy link
Member

jkotas commented Jan 6, 2020

Depends what you mean by significant. Non-beforefieldinit type are more expensive than beforefieldinit types by design. Note that the costs are not just in throughput, but also in code size.

@mikernet
Copy link
Contributor Author

mikernet commented Jan 6, 2020

That's what I thought, but @YairHalberstadt seemed to think that tiered compilation could/should remove the overhead of accessing the field.

@jkotas
Copy link
Member

jkotas commented Jan 6, 2020

tiered compilation could/should remove the overhead of accessing the field.

Yes, it does. It takes some time for the tiered compilation to kick in and you pay the costs until that happens. I do not think it kicked in for your test at the top of this issue.

@mikernet
Copy link
Contributor Author

mikernet commented Jan 6, 2020

Hmm, I see. What's the order of magnitude of "some time" / what conditions affect how long it takes?

I'm trying to gauge if it is still worth complicating class initialization code, removing readonly field modifiers, making it more cumbersome to work with NRT fields, etc in order to avoid having a static constructor in performance-sensitive code like a DB engine. If it takes on the order of seconds for that to kick in and optimize the code then I guess it usually isn't that big of a deal.

I mean, ideally, we just get the attribute I proposed in the related issue so the tradeoff becomes unnecessary, but I wouldn't mind refactoring some nasty code if it ultimately doesn't make a difference after a few second warmup.

@AndyAyersMS
Copy link
Member

If the class have been initialized already, the importer won't create CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE at all.

Right, just came to the same realization. Assembly code above is for the case where the .cctors have not yet been run. We only jit these methods once: they have a loop, and by default COMPlus_TC_QuickJitForLoops=0, so they bypass Tier0.

It takes some time for the tiered compilation to kick in

Not in this case, though. We jit at Tier1 initially because of the loop, and won't ever rejit. If you set COMPlus_TC_QuickJitForLoops=1 and run, then tiering can remove the class init overhead:

Method Mean Error StdDev
Static 2.631 us 0.0427 us 0.0378 us
Field 2.574 us 0.0508 us 0.0696 us

@YairHalberstadt
Copy link
Contributor

We jit at Tier1 initially because of the loop, and won't ever rejit.

Presumably in real life code if this is a bottleneck we will rejit, since the loop is unlikely to be top level?

@mikernet
Copy link
Contributor Author

mikernet commented Jan 6, 2020

Is there any way to force tiered compilation to run before a test starts so that I can benchmark the post-tiered compilation performance impact of adding static initializers to some of my code?

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jan 6, 2020

Presumably in real life code if this is a bottleneck we will rejit, since the loop is unlikely to be top level?

Not currently, no. Once we jit at Tier1 we never rejit. So with default settings, methods with loops are jitted just once.

The intent of QJFL=0 is to avoid getting trapped in Tier0 code; the downside is you may end up with sub-optimal Tier1 code.

The general fix for this is to implement something like on-stack replacement, so we can jit at Tier0 initially and transition (mid-method, if necessary) to Tier1 code if the method is called frequently, or loops around frequently.

Is there any way to force tiered compilation to run before a test starts

BenchmarkDotNet should generally get you to Tier1 code. But as noted above, not all Tier1 code is equally performant; there's a benefit in some cases to running the Tier0 code first. Setting COMPlus_TC_QuickJitForLoops=1 will ensure this.

@YairHalberstadt
Copy link
Contributor

I see. Thanks for the info @AndyAyersMS

Do you think this is a problem that will be resolved in the future, or would it be worth considering adding a language feature for this case?

@AndyAyersMS
Copy link
Member

Do you think this is a problem that will be resolved in the future

Possibly. I have been looking into on-stack replacement, but mostly thinking of it as something that could help improve startup, not as something that could help improve steady-state.

I would be interested in learning about more realistic "real-world" cases where performance improves when running with QJFL=1. Measurements on simple benchmarks can highlight problem areas, but often overstate their impact.

@jkotas
Copy link
Member

jkotas commented Jan 6, 2020

a language feature for this case?

I do not think that a language feature for this makes sense.

Another way to fix this issue is to teach the JIT to inline the "already initialized" check into the code. It would eliminate the call overhead on the fast path. The cost of the well-predicted inlined check would be negligible.

FWIW, .NET Native did the inlining of the "already initialized" check optimization.

@mikernet
Copy link
Contributor Author

mikernet commented Jan 6, 2020

What case/language feature are we talking about here? I think a [BeforeFieldInit(bool)] attribute to control the beforefieldinit flag is quite sensible. Is that what you are referring to @jkotas / @YairHalberstadt or something else?

I bet 95%+ of classes that have static constructors don't actually want/need type initialization on first field access, it's just there for convenience (but that's a bit of a wild guess, kind of hard to measure). Even if this gets optimized on a JIT level, letting people just set beforefieldinit will make less work for the JIT and reduce code size.

@jkotas
Copy link
Member

jkotas commented Jan 6, 2020

I do not think that [BeforeFieldInit(bool)] is worth the complexity that it would adds to the language spec. https://blogs.msdn.microsoft.com/ericgu/2004/01/12/minus-100-points/ will give you an idea for how the language team evaluates whether the language feature is worth it.

@mikernet
Copy link
Contributor Author

mikernet commented Jan 6, 2020

I understand that, but I respectfully disagree. I see library authors constantly dealing with this and it doesn't really add any complexity to anyone that doesn't care about optimizing performance at that level...it's just an optional attribute that does something really simple, i.e. control whether the compiler emits beforefieldinit on a method or not.

@AndyAyersMS
Copy link
Member

To recap the above: in general, runtime overhead of class init checks should largely be mitigated by tiered compilation.

We can still see impact of class init checks in methods that are not yet tiered up; impact of this should be limited to startup phases as the code will eventually get replaced by Tier1 code:

  • prejitted methods (in non-AOT modes)
  • tier0 jittted methods

We can also see impact of class init checks in methods that aren't eligible for tiering; in such cases impact of class init checks can persist indefinitely:

  • all methods, in AOT modes
  • jitted methods with loops which bypass tier0 (unless QJFL=1)
  • methods marked with AggressiveOptimization which also bypass tier0
  • dynamic methods and other special cases where we don't allow tiering

Seems like we should look more deeply into implementing support for partially inlined class init checks in the jit, so am marking this as 5.0 for now.

@AndyAyersMS AndyAyersMS added this to the 5.0 milestone May 6, 2020
@AndyAyersMS
Copy link
Member

Seems like we should look more deeply into implementing support for partially inlined class init checks in the jit

I current don't see us getting to this for 5.0, so moving to future.

@AndyAyersMS AndyAyersMS modified the milestones: 5.0, Future Jun 3, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020
@AndyAyersMS
Copy link
Member

We should consider this for 6.0, so updating milestone.

@AndyAyersMS AndyAyersMS modified the milestones: Future, 6.0.0 Jan 25, 2021
@JulieLeeMSFT JulieLeeMSFT assigned EgorBo and unassigned AndyAyersMS Jan 26, 2021
@EgorBo
Copy link
Member

EgorBo commented Feb 3, 2021

@jkotas @AndyAyersMS a few questions:

  1. Can we just hoist call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE from the loop body in Static just like it was done for Field? https://www.diffchecker.com/mTHFbDfu (Static is on the left)

  2. Regarding the inlined "is init" check - if I understand it correctly it should be something like this?

if (moduleID->m_pDataBlob[classId] & ClassInitFlags::INITIALIZED_FLAG == 0)
    CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE(moduleId, classId);

@EgorBo
Copy link
Member

EgorBo commented Feb 4, 2021

Quick prototype for "inlined check": EgorBo@78436b0 (jit only, doesn't do hoisting opt)

       | Method |     Mean |     Error |
       |------- |---------:|----------:|
master | Static | 8.808 us | 0.1702 us |
    PR | Static | 4.285 us | 0.0163 us |

diff for Static method: https://www.diffchecker.com/9swAJvSl

@jkotas
Copy link
Member

jkotas commented Feb 4, 2021

Can we just hoist call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE from the loop body in Static just like it was done for Field?

In general case, it would be observable behavior change in the timing of static constructors. You can do this in some limited cases like when you can prove the loop will run at least once and the static is accessed as the first thing in the loop, etc.

inlined check

The offsets and masks should be abstracted via JIT/EE interface. It will also allow enabling this optimization for non-shared generics.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 5, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 10, 2021
@EgorBo EgorBo modified the milestones: 6.0.0, Future May 24, 2021
@EgorBo EgorBo modified the milestones: Future, 8.0.0 Apr 27, 2023
@EgorBo
Copy link
Member

EgorBo commented Apr 27, 2023

This issue is basically fixed by QuickJit enabled by default for Loops in .NET 7.0
For NativeAOT and TieredCompilation=0 it's fixed in .NET 8.0 via #83911

TC=0 .NET 7.0:

| Method |     Mean |     Error |    StdDev |
|------- |---------:|----------:|----------:|
| Static | 9.212 us | 3.1957 us | 0.1752 us |
|  Field | 1.842 us | 0.0979 us | 0.0054 us |

TC=0 .NET 8.0:


| Method |     Mean |     Error |    StdDev |
|------- |---------:|----------:|----------:|
| Static | 1.837 us | 0.0926 us | 0.0051 us |
|  Field | 1.829 us | 0.4411 us | 0.0242 us |

@EgorBo EgorBo closed this as completed Apr 27, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants