Overhead match workload #2309
Conversation
Matches the overhead action implementation to the workload action implementation. Supports more consume types. Cleans up byref returns.
Force-pushed from b44d380 to cec7f9b.
@timcassell thanks for the PR! It will take some time for me to properly review and test it, but I will try to do it soon. For now, I have one high-level concern: if you need C# 11 for these changes, please do it in a separate PR. It's a significant upgrade for the source code base, and I would prefer to apply it independently from the overhead-related refactoring.
I just did that for raw string literal convenience. I didn't need it otherwise. Reverted.
Added a comment noting that `Type.IsClass` returns true for pointer types.
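The pointer quirk that commit documents can be demonstrated directly with reflection. This is a standalone illustration (not code from this PR); `MakePointerType()` is used to avoid needing an `unsafe` context:

```csharp
using System;

class IsClassQuirkDemo
{
    static void Main()
    {
        // Pointer types are neither value types nor interfaces,
        // so Type.IsClass reports true for them.
        Type intPointer = typeof(int).MakePointerType(); // System.Int32*
        Console.WriteLine(intPointer.IsClass);     // True
        Console.WriteLine(intPointer.IsPointer);   // True
        Console.WriteLine(intPointer.IsValueType); // False

        // A code generator that branches on IsClass to emit reference-type
        // handling (e.g. returning null) must therefore also check IsPointer.
    }
}
```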
Force-pushed from 48c10d0 to fbdb74c.
I don't know if anyone actually ran into it, but I noticed the InProcessEmit tests were missing pointer returns, so I added that test and fixed it here, too (extra confidence that my changes work for both the csproj-generation and in-process toolchains).
…n (ByRefLike and pointers), use Unsafe.SkipInit for everything else. Check for pointer type when setting default local in InProcessEmitToolchain.
@timcassell just wanted to say that the review is still in progress. I tend to procrastinate merging pull requests touching toolchains. Since we do not have 100% test coverage, changes to these critical parts demand additional verification to prevent potential not-so-obvious bugs in corner cases.
@timcassell, as far as I can see, if the benchmark method returns `int`, the generated overhead method now looks like this:

```csharp
private System.Int32 __Overhead() // __ is to avoid possible name conflict
{
    System.Runtime.CompilerServices.Unsafe.SkipInit(out System.Int32 value);
    return value;
}
```

I don't think that it's a good idea to use `Unsafe.SkipInit` here: on Mono, the call is not eliminated and adds measurable overhead. I would suggest restoring the original behavior (returning `default`).
Interesting, I thought that would be inlined and eliminated as a noop. I guess the Mono JIT isn't that powerful.
Should I do that for any type whose size is <= IntPtr.Size? Or <= 8 (which would include double and long on a 32-bit runtime)? Or only primitives and references?
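Tim's three candidate thresholds could be written as simple predicates. This is a hypothetical sketch of the options under discussion, not code from the PR:

```csharp
using System;
using System.Runtime.CompilerServices;

static class SkipInitCandidates
{
    // Option 1: only types no larger than a native pointer.
    public static bool FitsInPointer<T>() => Unsafe.SizeOf<T>() <= IntPtr.Size;

    // Option 2: up to 8 bytes, so long/double qualify even on a 32-bit runtime.
    public static bool FitsInEightBytes<T>() => Unsafe.SizeOf<T>() <= 8;

    // Option 3: only primitives and reference types.
    public static bool IsPrimitiveOrReference(Type t) => t.IsPrimitive || !t.IsValueType;
}
```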
@AndreyAkinshin I ran some benches on my Windows machine and got these results. Are you sure what you're seeing isn't just some noise in the system?
```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public struct Struct16
{
    public (long, long) field;
}

public struct Struct32
{
    public (Struct16, Struct16) field;
}

public struct Struct64
{
    public (Struct32, Struct32) field;
}

public struct Struct128
{
    public (Struct64, Struct64) field;
}

[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(long))]
[GenericTypeArguments(typeof(Struct16))]
[GenericTypeArguments(typeof(Struct32))]
[GenericTypeArguments(typeof(Struct64))]
[GenericTypeArguments(typeof(Struct128))]
public class DefaultVsSkipInit<T>
{
    public volatile byte _byte;

    [Benchmark]
    public void Default()
    {
        Consume(default(T));
    }

    [Benchmark]
    public void SkipInit()
    {
        Unsafe.SkipInit(out T value);
        Consume(value);
    }

    private void Consume(in T value)
    {
        _byte = Unsafe.As<T, byte>(ref Unsafe.AsRef(value));
    }
}
```

[Edit] I ran the benchmarks again with the types written out directly (instead of generic), and it still just looks like noise to me:
```csharp
// Struct16 and Struct32 are defined in the previous snippet.
// EmptyStruct is assumed to be an empty struct (its definition was not shown).
public struct EmptyStruct { }

public class DefaultVsSkipInit
{
    public volatile byte _byte;

    [Benchmark]
    public void DefaultInt()
    {
        Consume(default(int));
    }

    [Benchmark]
    public void SkipInitInt()
    {
        Unsafe.SkipInit(out int value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultLong()
    {
        Consume(default(long));
    }

    [Benchmark]
    public void SkipInitLong()
    {
        Unsafe.SkipInit(out long value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultObject()
    {
        Consume(default(object));
    }

    [Benchmark]
    public void SkipInitObject()
    {
        Unsafe.SkipInit(out object value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultStruct16()
    {
        Consume(default(Struct16));
    }

    [Benchmark]
    public void SkipInitStruct16()
    {
        Unsafe.SkipInit(out Struct16 value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultStruct32()
    {
        Consume(default(Struct32));
    }

    [Benchmark]
    public void SkipInitStruct32()
    {
        Unsafe.SkipInit(out Struct32 value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultEmptyStruct()
    {
        Consume(default(EmptyStruct));
    }

    [Benchmark]
    public void SkipInitEmptyStruct()
    {
        Unsafe.SkipInit(out EmptyStruct value);
        Consume(value);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void Consume<T>(in T value)
    {
        _byte = Unsafe.As<T, byte>(ref Unsafe.AsRef(value));
    }
}
```

And even with the noise, the two approaches still look equivalent to me.
Hmm. This is strange. I performed the following integration experiment. I wrote this benchmark:

```csharp
[Config(typeof(Config))]
public class Benchmarks
{
    private class Config : ManualConfig
    {
        public Config()
        {
            var job = Job.Default
                .WithRuntime(MonoRuntime.Default)
                .WithIterationCount(100)
                .WithInvocationCount(160_000_000);
            AddJob(job);
        }
    }

    [Benchmark] public int Foo() => 0;
}
```

Also, I hacked the BenchmarkDotNet sources to force it to always perform 100 overhead iterations:

```diff
diff --git a/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs b/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs
--- a/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs (revision 56756271f63663c406b82077209a6e630d703b1d)
+++ b/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs (date 1686227871484)
@@ -32,7 +32,7 @@
     }

     public IReadOnlyList<Measurement> RunOverhead(long invokeCount, int unrollFactor)
-        => RunAuto(invokeCount, IterationMode.Overhead, unrollFactor);
+        => RunSpecific(invokeCount, IterationMode.Overhead, 100, unrollFactor);

     public IReadOnlyList<Measurement> RunWorkload(long invokeCount, int unrollFactor, bool forceSpecific = false)
         => Run(invokeCount, IterationMode.Workload, false, unrollFactor, forceSpecific);
```

Thus, we get 100 overhead measurements and 100 workload measurements. Here are the corresponding density plots for the master branch: (plot image omitted) And here are the plots for your PR: (plot image omitted) As you can see, the left mode of the overhead distribution is approximately 1 CPU cycle larger than the left mode of the workload distribution. Raw data: (attachment omitted)
@AndreyAkinshin How are you generating those plots? I want to try that on my machine.
First, I generated a csv from the summary:

```csharp
var summary = BenchmarkRunner.Run<Benchmarks>();
var builder = new StringBuilder();
Thread.CurrentThread.CurrentCulture = DefaultCultureInfo.Instance;
builder.AppendLine("duration,type");
foreach (var measurement in summary.Reports.First().AllMeasurements.Where(m => m.IsOverhead() && m.IterationStage == IterationStage.Actual))
    builder.AppendLine((measurement.Nanoseconds / measurement.Operations).ToString("N4") + ",overhead");
foreach (var measurement in summary.Reports.First().AllMeasurements.Where(m => m.IsWorkload() && m.IterationStage == IterationStage.Actual))
    builder.AppendLine((measurement.Nanoseconds / measurement.Operations).ToString("N4") + ",workload");
File.WriteAllText("bdn.csv", builder.ToString());
```

(Yes, we could introduce a new exporter or reuse the existing ones, but this was the fastest way for me to obtain the first results.) Next, I used the following R script to generate the plots:

```r
# Run `install.packages("ggplot2")` to install the package for the first time
library(ggplot2)
df <- read.csv("bdn.csv")
ggplot(df, aes(duration, col = type)) +
  geom_density(bw = "SJ") +
  geom_rug() +
  labs(
    x = "Duration, ns",
    y = "Density",
    col = "Method"
  )
```
Very interesting. When I run with those forced iteration and invocation counts, I'm seeing results similar to yours. Strange that my heuristic benchmarks produced different results. Anyway, I'll do some more testing with these forced invoke counts to see which types should use `default` vs `SkipInit`, and update this PR accordingly.

[Edit] Oh, it looks like it's because I was consuming the value directly in my benchmark instead of returning it. It had nothing to do with the invoke counts. I'm not really sure why that causes the discrepancy, though.
Well, it looks like `Unsafe.SkipInit` doesn't pay off after all. Results: …

I even tried adding …

So, I went ahead and removed `Unsafe.SkipInit`.
Force-pushed from e20aea3 to c4f5f0d.
Ok, now we should discuss the original problem that I described in #2305. Let's consider the following benchmark:

```csharp
public struct HugeStruct
{
    public long L00;
    public long L01;
    public long L02;
    public long L03;
    public long L04;
    public long L05;
    public long L06;
    public long L07;
    public long L08;
    public long L09;
}

public class Benchmarks
{
    private class Config : ManualConfig
    {
        public Config()
        {
            var job = Job.Default
                .WithRuntime(MonoRuntime.Default)
                .WithIterationCount(100)
                .WithInvocationCount(16_000_000);
            AddJob(job);
        }
    }

    private HugeStruct s = new HugeStruct();

    [Benchmark] public HugeStruct Foo() => s;
}
```

Using the same approach as before, I got the following picture (I took the latest state of this PR): (plot image omitted)
Hmm... the result of that measurement seems to depend on the runtime. In .NET 7 and .NET Framework, I observe that …

I also experimented with another unsafe approach, which gets close to the field read in Mono, but is still slower on other runtimes:

```csharp
[Benchmark]
public Struct1024 UnsafeAsStruct1024()
{
    // Experiment only: reinterprets a single stack byte as a large struct.
    byte b = 0;
    return Unsafe.As<byte, Struct1024>(ref b);
}
```

Can we branch the generated code depending on the runtime? We could use the field read in Mono, and `default` elsewhere.

[Edit] Mono with .NET 7 results (wow, it's slooooow!): …
Force-pushed from 54c0b54 to 8403c15.
Force-pushed from 8403c15 to 878842f.
Force-pushed from 44f74c4 to cc207fa.
Force-pushed from cc207fa to 8b4af58.
@AndreyAkinshin The latest version uses a field read if the struct is > 64 bytes and the runtime is old Mono, and `default` in all other cases. Is that ok? (Tbh I'm not sure how 32-bit works, or whether we can even mix 32/64-bit host/benchmark processes. I only tested 64-bit.)
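The dispatch Tim describes could be sketched like this. The names, the Mono detection flag, and the `Marshal.SizeOf`-based size check are hypothetical; only the 64-byte threshold and the field-read-on-old-Mono rule come from the comment above:

```csharp
using System;
using System.Runtime.InteropServices;

enum OverheadReturnStrategy { Default, FieldRead }

static class OverheadCodeGen
{
    const int LargeStructThreshold = 64; // bytes, per the discussion above

    // isOldMono would come from runtime detection in the real toolchain.
    public static OverheadReturnStrategy Choose(Type returnType, bool isOldMono)
    {
        bool isLargeStruct = returnType.IsValueType
                             && !returnType.IsPrimitive
                             && Marshal.SizeOf(returnType) > LargeStructThreshold;
        // Old Mono zeroes large struct returns slowly, so read a pre-initialized
        // field there; `return default;` everywhere else.
        return isOldMono && isLargeStruct
            ? OverheadReturnStrategy.FieldRead
            : OverheadReturnStrategy.Default;
    }
}

[StructLayout(LayoutKind.Sequential)]
struct HugeStruct80 { public long L0, L1, L2, L3, L4, L5, L6, L7, L8, L9; } // 80 bytes
```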
I don't like this approach. It feels too hacky and fragile. I suspect that it would be quite challenging to maintain and keep relevant for all the runtimes we support. Another severe issue is the fairness of comparison between different benchmarks. Could we compare absolute nanobenchmark measurements between Mono and non-Mono? Or a benchmark with a struct size of 64 bytes against one with 65 bytes?

The original idea behind the overhead evaluation was to measure the overhead of the general BenchmarkDotNet infrastructure, like method/delegate calls and the main loop that performs the invocations. The delegate call evaluation includes the instruction pointer jumps. In order to make the overhead evaluation as "honest" as possible, we try to match the signature of the overhead method to the signature of the target workload method (where such an approach is applicable). We use this signature mimicry to ensure that the measured duration of an empty method (e.g., …) is effectively zero.

In the case of an empty …

Thus, I don't think we should use the actual struct type of the target workload method for the overhead method signature. This gives us two options for the overhead return type: …
@timcassell what do you think?
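To make the signature-mimicry idea concrete, here is an illustrative pair of methods (not actual BenchmarkDotNet-generated code): the overhead method copies the workload's signature with a trivial body, so subtracting the two per-call times isolates the cost of the benchmark body:

```csharp
using System;

class SignatureMimicryDemo
{
    // Workload method as a user would write it.
    public static int Foo() => 42;

    // Overhead counterpart: same signature, trivial body. The delegate call,
    // return-value plumbing, and loop cost are identical for both.
    static int Overhead() => default;

    static void Main()
    {
        Func<int> workload = Foo;
        Func<int> overhead = Overhead;
        // Conceptually: reported time = t(workload) - t(overhead).
        Console.WriteLine(workload()); // 42
        Console.WriteLine(overhead()); // 0
    }
}
```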
So, if we were to measure the cost of creating 2 different structs:

```csharp
[Benchmark] public SmallStruct Small() => default;
[Benchmark] public LargeStruct Large() => default;
```

In the current version, they would both report 0, but we may want them to report different time values (since creating the large struct does actually take some time).
I don't like either of these options, since they don't solve the issue. Using a primitive type instead of the actual type is what we're already doing. And we can't just remove …
Would it be reasonable for the overhead delegate to always return `void`, with the consumed value hoisted out of the loop? Something like:

```csharp
private void OverheadActionNoUnroll(System.Int64 invokeCount)
{
    LargeStruct consumeMe = default;
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        overheadDelegate();
        consumer.Consume(consumeMe);
    }
}
```

[Edit] As for the time spent in the …
I just checked how it works with the latest version of BenchmarkDotNet from the `master` branch:

```csharp
public struct SmallStruct
{
    public long Field00 { get; }
}

public struct LargeStruct
{
    public long Field00 { get; }
    public long Field01 { get; }
    public long Field02 { get; }
    public long Field03 { get; }
    public long Field04 { get; }
    public long Field05 { get; }
    public long Field06 { get; }
    public long Field07 { get; }
    public long Field08 { get; }
    public long Field09 { get; }
    public long Field10 { get; }
    public long Field11 { get; }
    public long Field12 { get; }
    public long Field13 { get; }
    public long Field14 { get; }
    public long Field15 { get; }
    public long Field16 { get; }
    public long Field17 { get; }
    public long Field18 { get; }
    public long Field19 { get; }
    public long Field20 { get; }
    public long Field21 { get; }
    public long Field22 { get; }
    public long Field23 { get; }
    public long Field24 { get; }
    public long Field25 { get; }
    public long Field26 { get; }
    public long Field27 { get; }
    public long Field28 { get; }
    public long Field29 { get; }
    public long Field30 { get; }
    public long Field31 { get; }
    public long Field32 { get; }
    public long Field33 { get; }
    public long Field34 { get; }
    public long Field35 { get; }
    public long Field36 { get; }
    public long Field37 { get; }
    public long Field38 { get; }
    public long Field39 { get; }
    public long Field40 { get; }
    public long Field41 { get; }
    public long Field42 { get; }
    public long Field43 { get; }
    public long Field44 { get; }
    public long Field45 { get; }
    public long Field46 { get; }
    public long Field47 { get; }
    public long Field48 { get; }
    public long Field49 { get; }
}

public class Benchmarks
{
    [Benchmark] public SmallStruct Small() => default;
    [Benchmark] public LargeStruct Large() => default;
}
```

Here are the results:

```
BenchmarkDotNet=v0.13.5.20230616-develop, OS=ubuntu 22.04
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.304
  [Host]     : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2

| Method |      Mean |     Error |    StdDev |
|------- |----------:|----------:|----------:|
|  Small | 0.0000 ns | 0.0000 ns | 0.0000 ns |
|  Large | 2.8153 ns | 0.0254 ns | 0.0237 ns |

// * Warnings *
ZeroMeasurement
  Benchmarks.Small: Default -> The method duration is indistinguishable from the empty method duration
```

This looks like a reasonable result. I would like to keep this behavior.
I have some doubts about this idea. Let's consider the following benchmarks:

```csharp
[Benchmark] public object EmptyObject() => null;
[Benchmark] public int EmptyInt() => 0;
[Benchmark] public double EmptyDouble() => 0.0;
```

With the current version of BenchmarkDotNet, all of these benchmarks have "zero" duration. I would like to keep this behavior. If a `void` overhead allows us to achieve this on all the runtimes, we can consider such an approach (however, I would expect some non-zero results).

Sorry for keeping #2111 without a review for such a long time. As you can see, it's quite challenging to verify engine/toolchain-related changes: any modifications in these subsystems may lead to surprisingly unpleasant side effects. I will try to make another attempt to review it and share my thoughts.
Yeah, I mean the current version of this PR, sorry.

Would that be so bad, though? After all, I would expect this … to take some non-zero time longer than this … I mean, yes, people may see "regressions", but it would be more accurate, wouldn't it?

Excellent! I'll go fix the merge issues, then! You may want to review #2108 first, though.
I experimented with that idea, and I didn't like the results I was getting. The difference between loading a local to pass to … So I studied the WorkloadAction implementation to try to find another method, and I thought: why is the …

The default primitives and … Can we refactor to remove …
@AndreyAkinshin This PR turned out to be the wrong direction, so I'm closing it in favor of #2336. I also put up a separate PR #2337 for just the …
- Matched overhead to workload return type and implementation.
- Optimized `Consumer.Consume<T>(in T value)`.
- Use `Consumer` more liberally in benchmarks.
- Cleaned up byref returns.
- Fixed pointer returns in InProcessEmitToolchain.

Fixes #2305