Add a proposal which suggests updating the xarch baseline target #272

Open: wants to merge 2 commits into main
Conversation

tannergooding
Member

As described in the doc, I propose that the minimum required hardware for x86/x64 on .NET be changed from x86-64-v1 to x86-64-v2.

@tannergooding
Member Author

@masonwheeler

Looks reasonable in general, though I'd like to touch on one specific point:

However, with the introduction of Native AOT we now have a higher consideration for scenarios where recompilation is not possible and therefore whatever the precompiled code targets is what all it has access to. There are some ways around this such as dynamic ISA checks or compiling a given method twice with dynamic dispatch selecting the appropriate implementation, but this comes with various downsides and often requires restructuring code in a way that can make it less maintainable.
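The "dynamic ISA checks" mentioned above are, concretely, branches on the `IsSupported` properties in `System.Runtime.Intrinsics.X86`. A minimal C# sketch (illustrative only; `MinOf` and its shape are not from the proposal): under the JIT the branch folds away based on the machine the code runs on, while under Native AOT it folds according to whatever the precompiled target allows.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Demo
{
    // Compiled once; the IsSupported branch is resolved at (pre)compile time
    // for AOT, so an x86-64-v1 AOT target loses the SSE4.1 path entirely.
    public static int MinOf(int[] values)
    {
        int i = 0, min = int.MaxValue;
        if (Sse41.IsSupported)
        {
            Vector128<int> acc = Vector128.Create(int.MaxValue);
            for (; i + 4 <= values.Length; i += 4)
            {
                var v = Vector128.Create(values[i], values[i + 1], values[i + 2], values[i + 3]);
                acc = Sse41.Min(acc, v);        // PMINSD: integer Min is SSE4.1-only
            }
            for (int j = 0; j < 4; j++)         // horizontal reduce of the accumulator
                min = Math.Min(min, acc.GetElement(j));
        }
        for (; i < values.Length; i++)          // scalar tail, also the pre-SSE4.1 fallback
            min = Math.Min(min, values[i]);
        return min;
    }

    public static void Main()
    {
        Console.WriteLine(MinOf(new[] { 7, 3, 9, 1, 5, 2, 8, 6 })); // prints 1
    }
}
```

The alternative the doc mentions, compiling a method twice with dynamic dispatch, amounts to hoisting that `IsSupported` check out of the method and selecting between two whole implementations, which is where the maintainability cost comes from.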

Why not go for the best-of-both-worlds approach? Build and ship IL, and have AOT compilation occur on the destination machine at installation time, rather than JITting at runtime or AOTing at build time? This is essentially what Android does and it works quite well there.

@jkotas
Member

jkotas commented Sep 20, 2022

This is essentially what Android does and it works quite well there.

This works when the runtime is part of the OS (or part of large app with complex installer) and the OS can manage the app lifecycle.

It does not work well for runtimes that ship independently of the OS, as the .NET runtime does today.

@tannergooding
Member Author

Build and ship IL, and have AOT compilation occur on the destination machine at installation time, rather than JITting at runtime or AOTing at build time? This is essentially what Android does and it works quite well there.

One consideration is that Android owns the OS and so is able to guarantee the tools required to do that are available. They also don't support concepts like "xcopy" deployment of apps, and they centralize acquisition via their app store.

I think doing the same for .NET would be pretty awesome, but it also comes with considerations like a larger deployment mechanism and potential other negative side effects.

Crossgen2 for a higher baseline is pretty much like this already, but without many of the drawbacks.

@jkotas
Member

jkotas commented Sep 20, 2022

A few angles to consider:

  • Microsoft official build vs. source build: We can bump the baseline for Microsoft official build, keep the code for x86 v1 around for the time being and tell anybody who really needs it to build their own bits from sources. (In other words, drop the x86 v1 support level to community supported.)

  • 32-bit vs. 64-bit: We can consider keeping the baseline for 32-bit and raising it only on 64-bit.


## Alternatives

We could maintain the `x86-64-v1` baseline for the JIT (optionally removing pre-`v2` SIMD acceleration) while changing the default for AOT. This could, by default, emit a diagnostic explaining to users that the output won't support older hardware and indicating how to explicitly retarget `x86-64-v1` when that is important for their domain.
Member

@jkotas jkotas Sep 20, 2022


Nit: Emitting a diagnostic like this by default is just output clutter. If we were to do this, it would just be mentioned in the docs.

@masonwheeler

Sure, NativeAOT isn't built into the OS. How much work would it be to integrate it with a standard installer-generator system like MSI, though? It would never be The Standard, but it would at least be available for developers in the know.

@jkotas
Member

jkotas commented Sep 20, 2022

Sure, NativeAOT isn't built into the OS. How much work would it be to integrate it with a standard installer-generator system like MSI, though?

It depends on what your requirements are. You can certainly do some variant of it on your own.

I do not expect we (Microsoft .NET team) will provide or recommend a solution like this. It would not pass our security signoff.

@masonwheeler

masonwheeler commented Sep 20, 2022

Huh. That's not the objection I'd have expected to see. What are the security concerns here?

@jkotas
Member

jkotas commented Sep 20, 2022

For example, the binaries cannot be signed.

@masonwheeler

masonwheeler commented Sep 20, 2022

Wasn't signing eliminated from Core a few versions ago anyway? I remember that one pretty clearly because there were breaking changes in .NET 6 that broke my compiler, and when I complained about it the team refused to make even the most inconsequential of changes to alleviate the compatibility break.

@jkotas
Member

jkotas commented Sep 20, 2022

I am not talking about strong name signing. I am talking about Microsoft Authenticode, Apple app code signing, and similar types of signatures.

@masonwheeler

All right. So how does Android handle it?

@jkotas
Member

jkotas commented Sep 20, 2022

I do not know the details on how Android handles this. I can tell you what it involved to make this scheme work with .NET Framework: NGen service process was recognized as a special process by the Windows OS that was allowed to vouch for authenticity of its output. It involved hardening like disallowing debugger attach to the NGen service process (again, another special service provided by the Windows OS) so that you cannot tamper with its execution.

@masonwheeler

Yeah, that makes sense. The AOT compiler has to be in a position of high trust for a scheme like that to work. Joe Duffy said something very similar about the Midori architecture.

@gfoidl
Member

gfoidl commented Sep 20, 2022

Is this related to #173?
Or is this one more about AOT?

@MichalStrehovsky
Member

Would this also cover how we compile the native parts of the (non-AOT) runtimes (GC, CoreCLR VM, etc.)?

My main concern would be about the user experience for the minority of users that don't meet this requirement - I'd like to avoid the user experience being STATUS_ILLEGAL_INSTRUCTION with a crash dump. NativeAOT does a failfast with a message to stderr. It's not great. (It's obviously not visible for OutputType=WinExe, for example.)

@MichalStrehovsky
Member

Do we have any motivating scenarios that we expect to meaningfully improve? I tried TechEmpower Json benchmark, but I'm seeing some very confusing results (crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario json --profile aspnet-citrine-lin --application.framework net7.0 --application.environmentVariables COMPlus_EnableSSE3=0 shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).

@EgorBo
Member

EgorBo commented Sep 21, 2022

shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).

That's weird because SSE3 specifically doesn't bring any value (except, maybe, HADD for floats but it's unlikely to be touched in TE benchmarks). For shuffle it's SSSE3 that is interesting because it provides overloads we need.

@MichalStrehovsky
Member

shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).

That's weird because SSE3 specifically doesn't bring any value (except, maybe, HADD for floats but it's unlikely to be touched in TE benchmarks). For shuffle it's SSSE3 that is interesting because it provides overloads we need.

Wait, does COMPlus_EnableSSE3=0 only disable SSE3? I thought it works similarly to how we do detection in codeman.cpp - not detecting SSE3 means we also consider SSSE3/4/4.2/AVX etc. unavailable. Or do I need to COMPlus_EnableXXX everything one by one to get the measurement I wanted?

@tannergooding
Member Author

Is related to #173?
Or is this one here more about AOT?

It's similar, but this one is about upgrading the baseline for AOT and therefore directly impacts all consumers of .NET.

#173 only impacts the default for crossgen, where the worst case is degraded startup performance on older hardware.

@tannergooding
Member Author

Would this also cover how we compile the native parts of the (non-AOT) runtimes (GC, CoreCLR VM, etc.)?

That would likely be up for debate. MSVC only provides /arch:SSE2 and /arch:AVX/AVX2; there is no equivalent for SSE3-SSE4.2. Clang/GCC do support these intermediaries, but I don't expect them to be as big a win for native code given the typical dynamic linking and limited explicit vectorization.

My main concern would be about the user experience for the minority of users that don't meet this requirement - I'd like to avoid the user experience to be STATUS_ILLEGAL_INSTRUCTION with a crashdump. NativeAOT does a failfast with a message to stderr. It's not great. (It's obviously not visible for OutputType=WinExe, for example.)

We have a similar message raised by the VM today as well; it's just unlikely to ever be encountered since it's only checking for SSE2.

@tannergooding
Member Author

Do we have any motivating scenarios that we expect to meaningfully improve? I tried TechEmpower Json benchmark, but I'm seeing some very confusing results (crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario json --profile aspnet-citrine-lin --application.framework net7.0 --application.environmentVariables COMPlus_EnableSSE3=0 shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).

Most code paths that use Vector128<T>. As @EgorBo called out, there is some pretty "core" functionality only available in these later ISAs.

Notably:

  • SSE3 - Floating-Point Alternating Add/Subtract, Horizontal Add, Horizontal Subtract
  • SSSE3 - Integer Conditional Negate, Absolute Value, Bytewise Shuffle, Horizontal Add, Horizontal Subtract
  • SSE4.1 - Dot Product (FP only), Blend, Round (FP only), Insert, Extract, Test, Integer Min/Max
  • SSE4.2 - 64-bit Compare Greater Than
  • POPCNT

For SSSE3, the most important is Bytewise Shuffle. Doing arbitrary vector reordering is very expensive otherwise and so this is key for doing things like Reverse Endianness or handling edge cases to emulate other functionality.

For SSE4.1, the most important are Blend, Insert, Extract, and Test. Dot Product and Round are both important for many scenarios, including WPF and other image-manipulation scenarios, due to heavy use of floating-point. In the case of Blend, it allows simplifying (a & mask) | (b & ~mask) to a single instruction, effectively giving you a "vectorized ternary select". This is a key part of handling leading/trailing elements or operating only on matched data. Insert/Extract are key for getting data into and out of the vector registers efficiently, and Test is key for efficiently determining (a & b) == 0 or (a & b) != 0, which many paths use to determine whether any match exists and therefore whether more expensive computation has to be done.

Not having these means the codegen for many core algorithms, especially in the string/span handling can be significantly pessimized against newer hardware that a majority of customers are expected to have.
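As a rough C# illustration of two of the operations called out above (a sketch, not code from the runtime): `Ssse3.Shuffle` performs a bytewise reverse-endianness in a single PSHUFB, and `Sse41.BlendVariable` acts as the single-instruction "vectorized ternary select".

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class V2Ops
{
    // SSSE3 bytewise shuffle: reverse the endianness of four 32-bit lanes
    // with one PSHUFB instead of per-lane scalar byte swaps.
    public static Vector128<uint> ReverseEndianness(Vector128<uint> value)
    {
        Vector128<byte> mask = Vector128.Create(
            (byte)3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
        return Ssse3.Shuffle(value.AsByte(), mask).AsUInt32();
    }

    // SSE4.1 blend: (a & mask) | (b & ~mask) as a single PBLENDVB.
    // BlendVariable picks the second operand where a mask byte's high bit is set.
    public static Vector128<byte> Select(
        Vector128<byte> mask, Vector128<byte> a, Vector128<byte> b)
        => Sse41.BlendVariable(b, a, mask);
}
```

Without PSHUFB, the reverse-endianness mask above has to be emulated with several shift/or/unpack steps, which is exactly the pessimization being described.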

@tannergooding
Member Author

Wait does COMPlus_EnableSSE3=0 only disable SSE3? I thought it works similar to how we do detection in codeman.cpp - not detecting SSE3 means we also consider SSSE3/4/4.2/AVX etc. unavailable. Or do I need to COMPlus_EnableXXX everything one by one to get the measurement I wanted to measure?

They are hierarchical and so EnableSSE3=0 will also disable SSSE3, SSE4.1, SSE4.2, AVX, AVX2, etc.

Could you provide more concrete numbers and possibly codegen? This sounds unexpected and doesn't match what I've seen in past benchmarking comparisons.

We could certainly get more concrete numbers by running all of dotnet/performance with COMPlus_EnableSSE3=0 and see if anything exceptional pops out as having improved performance.

@MichalStrehovsky
Member

I'd rather this decision wasn't made solely on microbenchmarks. I have no doubts it helps microbenchmarks. They're good supporting evidence, but something that impacts the users is better as the main evidence. That's why I'm trying TechEmpower (it's an E2E number we care about).

Could you provide more concrete numbers and possibly codegen? This sounds unexpected and doesn't match what I've seen in past benchmarking comparisons.

I can give you what I did, but not much more than that. Hopefully it's enough to find what I'm doing wrong:

dotnet tool install -g Microsoft.Crank.Controller --version "0.2.0-*" 

And then just run the above crank command with/without the --application.environmentVariables COMPlus_EnableSSE3=0.

Without the EnableSSE3=0 argument:

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 79         |
| Cores usage (%)        | 2,219      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 74         |
| Requests/sec           | 1,214,438  |
| Requests               | 18,336,914 |
| Mean latency (ms)      | 0.71       |
| Max latency (ms)       | 54.51      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 169.09     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.44       |
| Latency 90th (ms)      | 0.56       |
| Latency 99th (ms)      | 10.11      |

With the EnableSSE3=0 argument:

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 80         |
| Cores usage (%)        | 2,229      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 75         |
| Requests/sec           | 1,229,283  |
| Requests               | 18,560,989 |
| Mean latency (ms)      | 0.61       |
| Max latency (ms)       | 32.51      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 171.16     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.44       |
| Latency 90th (ms)      | 0.52       |
| Latency 99th (ms)      | 8.05       |

@tannergooding
Member Author

tannergooding commented Sep 21, 2022

I'd rather this decision wasn't made solely on microbenchmarks. I have no doubts it helps microbenchmarks. They're good supporting evidence, but something that impacts the users is better as the main evidence.

I agree that it shouldn't be made solely based on microbenchmarks. However, we also know well how frequently span/string APIs are used and the cost of branches in hot loops, even predicted branches. We likewise know the importance of these operations in scenarios like ML, Image Processing, Games, etc. I imagine coming up with real world benchmarks showing improvements won't be difficult.

With that being said, we also really should not restrict ourselves to a 20 year old baseline regardless. Such hardware is all officially out of support and discontinued by the respective hardware manufacturers. Holding out for such a small minority of hardware that likely isn't even running on a supported OS is ultimately pretty silly (and I expect such users aren't likely to be using new versions of .NET anyways). At some point, we need to have the freedom/flexibility to tell users that newer versions of .NET won't support hardware that old (at the very least "officially", Jan's suggestion of leaving the support in but making it community supported is reasonable; as would simply making it not the default).

I can give you what I did, but not much more than that. Hopefully it's enough to find what I'm doing wrong:

Does this work on Windows, or is it Linux only? Do you also need crank-agent like the docs I found suggest?

On Windows, I see

The specified endpoint url 'http://asp-citrine-lin:5001' for 'application' is invalid or not responsive: "No such host is known. (asp-citrine-lin:5001)"

Likewise, how much variance is there here run-to-run (that is against separate attempts to profile using the same command line)?
How much of this is R2R compiled (doing EnableSSE3=0 will throw out the corelib CG2 images/etc)?
Is this accounting for rejit/tiered compilation costs?

@MichalStrehovsky
Member

We likewise know the importance of these operations in scenarios like ML, Image Processing, Games, etc. I imagine coming up with real world benchmarks showing improvements won't be difficult.

I'd like us to have such an E2E number - we're discussing making .NET-produced executables FailFast on 1 out of 100 machines in the wild by default - "string.IndexOf is a lot faster" is a less convincing argument that it's the right choice than "X RPS improvement in web scenario Y", "X fps improvement in game Y", etc. That's the argument we'll give whenever someone complains about this choice. (For me the important bit is that this is a requirement for where the code runs, not a requirement for the .NET developer's machine - the .NET developers likely won't even know about this hardware floor until they hear from their users.)

Does this work on Windows, or is it Linux only? Do you also need crank-agent like the docs I found suggest?

I run it on Windows. You need VPN on because asp-citrine-lin is a corpnet machine. AFAIK crank-agent is needed on the machine where you run the test (which is asp-citrine-lin in this case, so no need to worry about it).

Likewise, how much variance is there here run-to-run (that is against separate attempts to profile using the same command line)?

I made 2 runs each and there was some noise, but the difference between the two runs looked conclusive. I think crank does a warmup, but it's really an ASP.NET team tool that I don't have much experience with (only to the extent that we track it and it's part of our release criteria, and therefore looks relevant).

@tannergooding
Member Author

tannergooding commented Sep 21, 2022

I'd like us to have such E2E number - we're discussing making .NET-produced executables FailFast on 1 out of 100 machines in the wild by default - "string.IndexOf is a lot faster" is less convincing argument that it's the right choice than "X RPS improvement in web scenario Y", "X fps improvement in game Y", etc. That's the argument we'll give whenever someone complains about this choice (for me the important bit is that this is a requirement for where the code runs, not a requirement for the .NET developers machine - the .NET developers likely won't even know about this hardware floor until they hear from their user).

I expect it's much less than this in practice, especially when taking into account Enterprise/cloud hardware, the users that are likely to be running on a supported OS (especially if you consider officially supported hardware, of which only Linux supports hardware this old), and those likely to be running/using the latest versions of .NET.

It's worth noting I opened dotnet/sdk#28055 so that we can, longer term, get more definitive information on this and other important hardware characteristics.

I run it on Windows. You need VPN on because asp-citrine-lin is a corpnet machine. AFAIK crank-agent is needed on the machine where you run the test (which is asp-citrine-lin in this case, so no need to worry about it).

👍. The below is the median of 5 results for each. I didn't notice any obvious outliers.

I notably ran both JSON and Plaintext to get two different comparisons. There is a clear disambiguator when SIMD is disabled entirely and there is a small but measurable difference between -v1 and -v2 with -v2 winning.

The default (which should be -v3 assuming these machines have AVX2 support) tends to be a bit slower than -v2 and this is likely because the payloads aren't large enough for Vector256<T> to benefit. Instead, the larger size coupled with the additional checks causes a small regression.

Json

.NET 7 - Default

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 66         |
| Cores usage (%)        | 1,847      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 115        |
| Requests/sec           | 985,938    |
| Requests               | 14,886,941 |
| Mean latency (ms)      | 0.42       |
| Max latency (ms)       | 51.21      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 142.92     |
| Latency 50th (ms)      | 0.24       |
| Latency 75th (ms)      | 0.27       |
| Latency 90th (ms)      | 0.32       |
| Latency 99th (ms)      | 7.16       |

.NET 7 - EnableAVX=0 (effectively target x86-64-v2)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 66         |
| Cores usage (%)        | 1,847      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 113        |
| Requests/sec           | 990,885    |
| Requests               | 14,962,330 |
| Mean latency (ms)      | 0.46       |
| Max latency (ms)       | 39.10      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 143.64     |
| Latency 50th (ms)      | 0.23       |
| Latency 75th (ms)      | 0.27       |
| Latency 90th (ms)      | 0.32       |
| Latency 99th (ms)      | 8.18       |

.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 66         |
| Cores usage (%)        | 1,835      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 116        |
| Requests/sec           | 980,136    |
| Requests               | 14,799,763 |
| Mean latency (ms)      | 0.49       |
| Max latency (ms)       | 41.72      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 142.08     |
| Latency 50th (ms)      | 0.24       |
| Latency 75th (ms)      | 0.27       |
| Latency 90th (ms)      | 0.32       |
| Latency 99th (ms)      | 8.72       |

.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 64         |
| Cores usage (%)        | 1,783      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 211        |
| Requests/sec           | 944,005    |
| Requests               | 14,253,837 |
| Mean latency (ms)      | 0.50       |
| Max latency (ms)       | 48.26      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 136.84     |
| Latency 50th (ms)      | 0.25       |
| Latency 75th (ms)      | 0.29       |
| Latency 90th (ms)      | 0.34       |
| Latency 99th (ms)      | 9.27       |

Plaintext

.NET 7 - Default

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 44         |
| Cores usage (%)        | 1,221      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 90         |
| Requests/sec           | 4,625,118  |
| Requests               | 69,838,073 |
| Mean latency (ms)      | 0.60       |
| Max latency (ms)       | 29.17      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 582.23     |
| Latency 50th (ms)      | 0.52       |
| Latency 75th (ms)      | 0.76       |
| Latency 90th (ms)      | 1.05       |
| Latency 99th (ms)      | 0.00       |

.NET 7 - EnableAVX=0 (effectively target x86-64-v2)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 44         |
| Cores usage (%)        | 1,232      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 93         |
| Requests/sec           | 4,679,347  |
| Requests               | 70,655,667 |
| Mean latency (ms)      | 0.58       |
| Max latency (ms)       | 35.71      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 589.06     |
| Latency 50th (ms)      | 0.51       |
| Latency 75th (ms)      | 0.75       |
| Latency 90th (ms)      | 1.03       |
| Latency 99th (ms)      | 0.00       |

.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 44         |
| Cores usage (%)        | 1,225      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 91         |
| Requests/sec           | 4,635,911  |
| Requests               | 69,999,632 |
| Mean latency (ms)      | 0.59       |
| Max latency (ms)       | 32.99      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 583.59     |
| Latency 50th (ms)      | 0.53       |
| Latency 75th (ms)      | 0.76       |
| Latency 90th (ms)      | 1.09       |
| Latency 99th (ms)      | 0.00       |

.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 42         |
| Cores usage (%)        | 1,178      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 158        |
| Requests/sec           | 4,370,389  |
| Requests               | 65,991,281 |
| Mean latency (ms)      | 0.63       |
| Max latency (ms)       | 32.96      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 550.17     |
| Latency 50th (ms)      | 0.55       |
| Latency 75th (ms)      | 0.80       |
| Latency 90th (ms)      | 1.10       |
| Latency 99th (ms)      | 0.00       |

@masonwheeler

I'm actually trying to do exactly this kind of real-world codebase benchmarking, to see if the .NET 7 performance benefits touted in the blog posts make a measurable difference in some performance-sensitive code. Unfortunately, I've been stymied by the inability to actually get anything to run in .NET 7. Any help would be welcome, and I promise to report back with relevant numbers once I have some to share.

@MichalStrehovsky
Member

It's worth noting I opened dotnet/sdk#28055 so that we can, longer term, get more definitive information on this and other important hardware characteristics.

I'm not sure if that one would help - it's the hardware the .NET developers use, not the hardware where .NET code runs. Developers are more likely to skew towards the latest and greatest. Users are the "secretary's machine" and "school computer". The Windows org is more likely to have that kind of telemetry.

The below is the median of 5 results for each. I didn't notice any obvious outliers.

What command line arguments did you use for crank? The numbers for JSON are all a bit lower than I would expect (compare with mine above).

@tannergooding
Member Author

What command line arguments did you use for crank? The numbers for JSON are all a bit lower than I would expect (compare with mine above).

Ah, you know what, I ran json.benchmarks.yml and plaintext.benchmarks.yml rather than platform.benchmarks.yml. That might've had something to do with it.

I'm not sure if that one would help - it's the hardware the .NET developers use, not hardware where .NET code runs. Developers are more likely to skew towards the latest and greatest. Users are the "secretary's machine" and "school computer". Windows org is more likely to have the kind of telemetry.

Which, again, doesn't really matter when you consider that most operating systems don't support hardware that old.

In the case of macOS, it looks to be impossible for any OS we currently support to be running on pre-AVX2 hardware.

In the case of Windows, 8.1 is the oldest client SKU we still support. For 8.1, Windows itself updated the baseline CPU required for x64 (it must have CMPXCHG16B and LAHF/SAHF). Various articles quote a comment stating "the number of affected processors are extremely small since this instruction has been supported for greater than 10 years." For 7, it's only supported with an ESU subscription, in which case other factors like the Windows Processor Requirements list come into play, and those are all post -v3 processors. Even stricter requirements/expectations exist for Server.

Linux is really the only interesting case where the kernel still officially supports running on an 80386 (older than we support) and where many distros intentionally keep their specs "low". This is also a case where many recommend using alternative GUIs or specialized distro-builds for such low-spec computers to help. Ubuntu's docs go so far as to describe 10 and 15 year old systems and the scenarios that will likely prevent their usage in a default configuration. The biggest of which is typically that they don't support and have no way of supporting an SSD.

In short, such hardware is simply too old to be meaningful and given our official OS support matrix, is already unlikely to have a good experience with the latest versions of .NET.

@masonwheeler

@tannergooding Agreed. "Not supported" loses all meaning if no changes can be made based on disregarding the existence of things officially not supported.

@EgorBo
Member

EgorBo commented Sep 21, 2022

Notes about TechEmpower benchmarks:

They operate on extremely small data. E.g., here is what the JSON benchmark tests: https://github.com/aspnet/Benchmarks/blob/e3095f4021fef7171bb3ae86616b9156df39b7bd/src/Benchmarks/Middleware/JsonMiddleware.cs#L51 - its string representation probably doesn't even fit into an AVX vector. And here is the Plaintext one - https://github.com/aspnet/Benchmarks/blob/e3095f4021fef7171bb3ae86616b9156df39b7bd/src/Benchmarks/Middleware/PlaintextMiddleware.cs#L16-L41 - even smaller. It's nowhere near being a "real" workload. I mean, TE benchmarks are great for spotting obvious regressions, and they've already helped us a lot to spot problems like those in GC Regions and Crossgen, and to measure internal aspnet overhead and threadpool scaling, but IMO they are definitely not something we can use to make decisions around vector width.

Same regarding HTTP headers; in the TechEmpower benchmarks they're:

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Host: 10.0.0.102:5000
Connection: keep-alive

All UTF-8, and only Accept is "probably" long enough to be worth using AVX for.

While normally your browser sends like 16 or more headers.

@tannergooding
Member Author

They operate on extremely small data, e.g. here is what JSON benchmarks test

It's worth noting it's not just about the size of the data; it's primarily about the amount of data that has to be processed before the algorithm can exit.

Even for very large inputs, if you're just finding the first index of a common character, then the number of iterations you execute is small and the overhead of the required checks slows things down. Whereas if it's an uncommon character, or if you have to process the whole input, then you can get a 1.5x or greater perf improvement as payoff.
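That dynamic is visible in the shape of a typical vectorized search loop. The sketch below (a simplified illustration, not the actual corlib code) pays a fixed per-iteration cost for the load/compare/movemask/branch, so the win over scalar code only materializes when many 16-byte chunks are scanned before the first hit:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Search
{
    public static int IndexOf(byte[] haystack, byte needle)
    {
        int i = 0;
        if (Sse2.IsSupported && haystack.Length >= 16)
        {
            var target = Vector128.Create(needle);
            fixed (byte* p = haystack)
            {
                for (; i + 16 <= haystack.Length; i += 16)
                {
                    // Compare 16 bytes at once; MoveMask condenses the result
                    // into one scalar so a single branch tests the whole chunk.
                    int mask = Sse2.MoveMask(Sse2.CompareEqual(Sse2.LoadVector128(p + i), target));
                    if (mask != 0) // early exit: common characters hit within a chunk or two
                        return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }
        for (; i < haystack.Length; i++) // scalar tail, also the no-SIMD fallback
            if (haystack[i] == needle)
                return i;
        return -1;
    }
}
```

If the needle appears in the first chunk, the SIMD setup overhead dominates; if the scan runs long, the 16-bytes-per-branch throughput is where the 1.5x-or-better payoff comes from.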

Ah, you know what I ran json.benchmarks.yml and plaintext.benchmarks.yml rather than platform.benchmarks.yml. That might've had something to do with it.

Reran ensuring I used platform.benchmarks. Similar results overall, showing that AVX2 is the primary reason perf shows as "worse", and with inputs this small that makes sense. In addition to what I ran before, I also ran with EnableAVX2=0, which gives us effectively x86-64-v2 but still allows the VEX encoding to be used. This further helps cement that 256-bit vectors are the "problem", not that EnableSSE3=0 somehow makes things faster (and indeed, EnableSSE3=0 is slower than EnableAVX=0, as expected).

Json

.NET 7 - Default

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 79         |
| Cores usage (%)        | 2,214      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 77         |
| Requests/sec           | 1,196,802  |
| Requests               | 18,071,073 |
| Mean latency (ms)      | 0.86       |
| Max latency (ms)       | 46.53      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 166.64     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.45       |
| Latency 90th (ms)      | 0.69       |
| Latency 99th (ms)      | 11.81      |

.NET 7 - EnableAVX2=0 (effectively target x86-64-v2 but allowing VEX encoding)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 80         |
| Cores usage (%)        | 2,229      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 85         |
| Requests/sec           | 1,215,177  |
| Requests               | 18,348,339 |
| Mean latency (ms)      | 0.69       |
| Max latency (ms)       | 48.18      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 169.20     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.44       |
| Latency 90th (ms)      | 0.55       |
| Latency 99th (ms)      | 9.71       |

.NET 7 - EnableAVX=0 (effectively target x86-64-v2)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 80         |
| Cores usage (%)        | 2,238      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 78         |
| Requests/sec           | 1,221,743  |
| Requests               | 18,447,377 |
| Mean latency (ms)      | 0.66       |
| Max latency (ms)       | 36.48      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 170.11     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.44       |
| Latency 90th (ms)      | 0.54       |
| Latency 99th (ms)      | 9.17       |

.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 79         |
| Cores usage (%)        | 2,206      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 79         |
| Requests/sec           | 1,205,550  |
| Requests               | 18,203,349 |
| Mean latency (ms)      | 0.75       |
| Max latency (ms)       | 46.42      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 167.86     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.45       |
| Latency 90th (ms)      | 0.60       |
| Latency 99th (ms)      | 10.20      |

.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)

| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 78         |
| Cores usage (%)        | 2,198      |
| Working Set (MB)       | 38         |
| Private Memory (MB)    | 363        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 99         |
| Requests/sec           | 1,206,129  |
| Requests               | 18,211,020 |
| Mean latency (ms)      | 0.77       |
| Max latency (ms)       | 41.32      |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 167.94     |
| Latency 50th (ms)      | 0.38       |
| Latency 75th (ms)      | 0.45       |
| Latency 90th (ms)      | 0.57       |
| Latency 99th (ms)      | 10.92      |

Plaintext

.NET 7 - Default

| load                   |             |
| ---------------------- | ----------- |
| CPU Usage (%)          | 93          |
| Cores usage (%)        | 2,602       |
| Working Set (MB)       | 38          |
| Private Memory (MB)    | 370         |
| Start Time (ms)        | 0           |
| First Request (ms)     | 78          |
| Requests/sec           | 10,914,328  |
| Requests               | 164,806,480 |
| Mean latency (ms)      | 1.28        |
| Max latency (ms)       | 61.41       |
| Bad responses          | 0           |
| Socket errors          | 0           |
| Read throughput (MB/s) | 1,310.72    |
| Latency 50th (ms)      | 0.76        |
| Latency 75th (ms)      | 1.14        |
| Latency 90th (ms)      | 1.88        |
| Latency 99th (ms)      | 14.39       |

.NET 7 - EnableAVX2=0 (effectively target x86-64-v2 but allowing VEX encoding)

| load                   |             |
| ---------------------- | ----------- |
| CPU Usage (%)          | 93          |
| Cores usage (%)        | 2,611       |
| Working Set (MB)       | 38          |
| Private Memory (MB)    | 370         |
| Start Time (ms)        | 0           |
| First Request (ms)     | 74          |
| Requests/sec           | 10,979,623  |
| Requests               | 165,786,328 |
| Mean latency (ms)      | 1.27        |
| Max latency (ms)       | 54.10       |
| Bad responses          | 0           |
| Socket errors          | 0           |
| Read throughput (MB/s) | 1,320.96    |
| Latency 50th (ms)      | 0.75        |
| Latency 75th (ms)      | 1.13        |
| Latency 90th (ms)      | 1.84        |
| Latency 99th (ms)      | 14.13      |

.NET 7 - EnableAVX=0 (effectively target x86-64-v2)

| load                   |             |
| ---------------------- | ----------- |
| CPU Usage (%)          | 94          |
| Cores usage (%)        | 2,637       |
| Working Set (MB)       | 38          |
| Private Memory (MB)    | 370         |
| Start Time (ms)        | 0           |
| First Request (ms)     | 66          |
| Requests/sec           | 10,994,770  |
| Requests               | 165,985,771 |
| Mean latency (ms)      | 1.35        |
| Max latency (ms)       | 55.85       |
| Bad responses          | 0           |
| Socket errors          | 0           |
| Read throughput (MB/s) | 1,320.96    |
| Latency 50th (ms)      | 0.75        |
| Latency 75th (ms)      | 1.14        |
| Latency 90th (ms)      | 1.98        |
| Latency 99th (ms)      | 15.22       |

.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)

| load                   |             |
| ---------------------- | ----------- |
| CPU Usage (%)          | 92          |
| Cores usage (%)        | 2,585       |
| Working Set (MB)       | 38          |
| Private Memory (MB)    | 370         |
| Start Time (ms)        | 0           |
| First Request (ms)     | 71          |
| Requests/sec           | 10,916,742  |
| Requests               | 164,843,707 |
| Mean latency (ms)      | 1.20        |
| Max latency (ms)       | 51.97       |
| Bad responses          | 0           |
| Socket errors          | 0           |
| Read throughput (MB/s) | 1,310.72    |
| Latency 50th (ms)      | 0.76        |
| Latency 75th (ms)      | 1.13        |
| Latency 90th (ms)      | 1.78        |
| Latency 99th (ms)      | 13.05       |

.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)

| load                   |             |
| ---------------------- | ----------- |
| CPU Usage (%)          | 84          |
| Cores usage (%)        | 2,359       |
| Working Set (MB)       | 38          |
| Private Memory (MB)    | 370         |
| Start Time (ms)        | 0           |
| First Request (ms)     | 109         |
| Requests/sec           | 9,972,152   |
| Requests               | 150,576,179 |
| Mean latency (ms)      | 1.25        |
| Max latency (ms)       | 76.28       |
| Bad responses          | 0           |
| Socket errors          | 0           |
| Read throughput (MB/s) | 1,198.08    |
| Latency 50th (ms)      | 0.86        |
| Latency 75th (ms)      | 1.27        |
| Latency 90th (ms)      | 1.74        |
| Latency 99th (ms)      | 13.06       |
