
Reduce contention for Datacontract Serialization #73893

Merged
merged 12 commits into from
Aug 11, 2023

Conversation

Daniel-Svensson
Contributor

@Daniel-Svensson Daniel-Svensson commented Aug 13, 2022

This is just #70668, which was closed after #71752 was merged, creating a merge conflict. Since the latter did not actually deliver any large improvement, as explained here, I am reopening the PR.

In short, it improves RPS by around 5% without contention and by well over 400% (from 137 to nearly 600 RPS) in a scenario with 16 threads.

New: (benchmark screenshot)

Old: (benchmark screenshot)

See PR #70688 for more details.

If it is better to reopen the original PR, feel free to do so (I did not have that option); maybe @StephenMolloy can if that is preferable.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Aug 13, 2022
            TryCreateBuiltInDataContract(type, out dataContract);
            s_typeToBuiltInContract.Add(type, dataContract);
            }
            TryCreateBuiltInDataContract(key, out DataContract? dataContract);
Contributor Author


TryCreateBuiltInDataContract seems thread safe (it only contains a switch statement and creates new DataContracts), so it should not be a problem to call it concurrently; the slim chance of creating duplicate instances is harmless.
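As a hypothetical sketch (not the actual runtime code) of why that benign race is acceptable: when the factory is side-effect free, a lock-free cache can tolerate two threads building equivalent instances, because the dictionary keeps a single winner. The names ContractCache and CreateContract are illustrative only.

```csharp
using System;
using System.Collections.Concurrent;

class ContractCache
{
    private readonly ConcurrentDictionary<Type, object?> _cache = new();

    // Stand-in for TryCreateBuiltInDataContract: pure, so it is safe to call
    // concurrently; at worst two threads build equivalent instances and the
    // dictionary discards the loser.
    private static object? CreateContract(Type t) =>
        t == typeof(int) ? new object() : null;

    public object? Get(Type t) => _cache.GetOrAdd(t, CreateContract);
}
```

Note that GetOrAdd may invoke the factory more than once under contention, but every caller observes the same stored value afterwards.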

@HongGit
Contributor

HongGit commented Feb 2, 2023

@mconnew and @StephenMolloy, can you please take a look?

@StephenMolloy
Member

Traditionally ConcurrentDictionary has struggled to keep up with access from multiple threads. Single-thread read/write looks good, but actually using the concurrency features slows things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case. Hashtable is just faster for multiple readers and a single writer.

However, this isn't really an "apples to apples" case. In order to use Hashtable, we have to wrap value types and provide our own comparison method. The extra overhead involved does indeed make ConcurrentDictionary faster in this case, as far as I can tell.
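A small illustration of the wrapping overhead being described, assuming a struct key such as RuntimeTypeHandle: Hashtable's keys are typed as object, so every insert and lookup boxes the struct (or requires a hand-written reference-type wrapper with its own Equals/GetHashCode), while a generic ConcurrentDictionary keeps the key unboxed. KeyDemo is a made-up name for this sketch.

```csharp
using System;
using System.Collections;
using System.Collections.Concurrent;

static class KeyDemo
{
    static readonly Hashtable s_table = new();
    static readonly ConcurrentDictionary<RuntimeTypeHandle, string> s_dict = new();

    public static (object? boxed, string? generic) Lookup()
    {
        RuntimeTypeHandle h = typeof(int).TypeHandle;
        s_table[h] = "boxed";      // boxes the struct key on insert...
        s_dict[h] = "generic";     // ...while the generic dictionary does not
        object? fromTable = s_table[h];           // boxes again for the lookup
        s_dict.TryGetValue(h, out string? fromDict);
        return (fromTable, fromDict);
    }
}
```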

However however... I suspect these collections are part of the problem causing #77877. So I'm going to need to re-do how this all works anyway. Trying to keep in mind the lessons learned from this performance dive.

@stephentoub
Member

Traditionally ConcurrentDictionary has struggled to keep up with access from multiple threads. Single-thread read/write looks good, but actually using the concurrency features slows things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case. Hashtable is just faster for multiple readers and a single writer.

Can you share the benchmarks you're using that led to these conclusions?

@mconnew
Member

mconnew commented Apr 3, 2023

Traditionally ConcurrentDictionary has struggled to keep up with access from multiple threads. Single-thread read/write looks good, but actually using the concurrency features slows things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case. Hashtable is just faster for multiple readers and a single writer.

Can you share the benchmarks you're using that led to these conclusions?

This information was coming from me. Many years ago, on .NET Framework, we replaced a lock in the DataContractSerializer with lock-on-write semantics. At the time we benchmarked different solutions, and ConcurrentDictionary performed worse than both the existing always-lock code and the final solution of using a Hashtable and only locking on write. This was in the 4.6.x time frame, if my memory is correct. I don't have any of the benchmarks available now.
The dev who was working on it at the time coded it a couple of different ways and ran the .NET Framework perf benchmarks, and the Hashtable with lock-on-write was the fastest.
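The lock-on-write pattern described above can be sketched roughly like this (a simplified illustration, not the actual NetFx code): Hashtable is documented as safe for many concurrent readers plus one writer, so reads go lock-free and only additions take the lock.

```csharp
using System;
using System.Collections;

class LockOnWriteCache
{
    private readonly Hashtable _table = new();
    private readonly object _writeLock = new();

    public object GetOrAdd(object key, Func<object, object> factory)
    {
        object? value = _table[key];     // lock-free read (safe: one writer at a time)
        if (value == null)
        {
            lock (_writeLock)            // serialize writers only
            {
                value = _table[key];     // re-check under the lock
                if (value == null)
                {
                    value = factory(key);
                    _table[key] = value;
                }
            }
        }
        return value;
    }
}
```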

I did a quick visual comparison of the two implementations (.NET vs. NetFx) and I can see multiple changes which would account for the performance difference. On NetFx, every read has a Volatile.Read call, which on multi-NUMA node machines can be really expensive as it causes an L1 cache write flush and an L1 cache line ownership change that is negotiated across all cores; it is especially costly if the old owning core and the new owning core are on different NUMA nodes. This has been eliminated in the .NET implementation, with an explanation here. As you're aware, this can cause contention problems between different CPU cores even when accessing different values, if they are in the same cache line. Removal of the Volatile.Read call removes the cache flush and reduces memory contention.

A second improvement which will apply in this case is the code here. There is a cheaper code path when getting a value when TKey is a value type and no comparer has been set. The code key.GetHashCode() is going to be cheaper than the code here.

I think it's fair to say that the performance of ConcurrentDictionary (at least for this scenario) has been improved for multiple threads reading at the same time, especially on multi-NUMA node machines (generally Xeon-class processors, and usually not what we have on our dev machines). With those improvements, ConcurrentDictionary looks like the better option now, as it also results in simpler code.

I also think it might be worth making another change in the code. WCF broke when the Hot Reload feature was introduced: when code is changed, some runtime reflection classes get replaced with new instances. In the WCF case we were using a MethodInfo as the key of a dictionary to look up the SOAP operation, but after a hot reload, comparing the MethodInfo objects from before and after the reload for the exact same method reported them as different. Details here.

Looking at RuntimeTypeHandle.Equals, it calls ReferenceEquals to compare RuntimeTypeHandle.m_type of the two instances. If the Type instance changes during a hot reload (it's possible only MethodInfo instances get invalidated; I'm presuming here), then this will be fragile and result in creating a new duplicated DataContract effectively identical to the previous one.

Would we be better off storing the IntPtr value from RuntimeTypeHandle.Value instead? When calling a method passing a value type, the entire struct has to be copied for each call, and we have quite a bit of indirection, hopping through multiple methods before we get to the actual lookup code. Switching to RuntimeTypeHandle.Value would make those method calls cheaper too: an IntPtr can be passed in a register, but a RuntimeTypeHandle can't; it must be passed on the stack.
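A sketch of that suggestion, with hypothetical names (ContractIds, GetId): key the cache on the IntPtr from RuntimeTypeHandle.Value so the key fits in a register when passed between the lookup helpers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class ContractIds
{
    // Keyed on the raw IntPtr rather than the RuntimeTypeHandle struct,
    // so callers can pass the key in a register.
    private static readonly ConcurrentDictionary<IntPtr, int> s_ids = new();
    private static int s_next;

    public static int GetId(RuntimeTypeHandle handle) =>
        s_ids.GetOrAdd(handle.Value, _ => Interlocked.Increment(ref s_next));
}
```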

@StephenMolloy
Member

@stephentoub - Nothing fancy. Just a simple single-threaded loop doing Hashtable/ConcurrentDictionary gets and adds. This single-access read/write model is the 90% case. I don't think Hashtable being faster than CD in this general scenario is controversial.

Or were you asking about the applied-to-this-situation scenario where ConcurrentDictionary comes out faster? I didn't do anything fancy. I pretty much just took @Daniel-Svensson's simple benchmarks and added some parallelism, which is supposed to be where Hashtable shines over traditional synchronized r/w collections. Instead of the original nested for loops in his benchmark, I did something like this:

            [Params(1, 4, 16, 32)]
            public int concurrency;

            Parallel.For(0, concurrency, c =>
            {
                int l = _typeHandles.Length;
                int s = (l * c) / concurrency;
                for (int i = 0; i < l; ++i)
                    for (int j = 0; j <= i; ++j)
                        sum += GetId_Implementation_Of_Interest(_typeHandles[(j+s)%l]);
            });

Again though, nothing fancy or sophisticated. All just playing around on my virtual 4-core dev box to get an idea of what I was looking at. I ran against the .NET 6, 7, and 8 runtimes with the various 'GetId()' implementations. The .NET 6 numbers were still in favor of CD over HT, but they were closer to a wash. CD performance has gotten better though, so it's a pretty noticeable difference in my basic exploration. The real difference is not having to mess around with the 'TypeHandleRef' class et al.

@StephenMolloy
Member

Or perhaps you were asking for something more along the lines of @mconnew remarks he was apparently typing up at the same time. ;)

Either way, I believe this code is part of the problem with DCS and AssemblyLoadContext. So we'll be doing something different here anyway.

@Daniel-Svensson
Contributor Author

Daniel-Svensson commented May 7, 2023

I have applied the suggestion from @mconnew and changed the key from RuntimeTypeHandle to RuntimeTypeHandle.Value.
I had looked at the Mono source code version, which just implemented Equals and GetHashCode using .Value.

If you prefer, the code can be changed to keep RuntimeTypeHandle as the key and use a custom comparer instead; the performance difference would be small.

The proposed code should be both simpler and much faster than the existing code, so maybe you can consider merging it even if you make changes later on? It should make any future work easier rather than harder.

@Daniel-Svensson
Contributor Author

Daniel-Svensson commented May 29, 2023

Or were you asking about the applied-to-this-situation scenario where ConcurrentDictionary comes out faster? I didn't do anything fancy. I pretty much just took @Daniel-Svensson's simple benchmarks and added some parallelism, which is supposed to be where Hashtable shines over traditional synchronized r/w collections. Instead of the original nested for loops in his benchmark, I did something like this:

            [Params(1, 4, 16, 32)]
            public int concurrency;

            Parallel.For(0, concurrency, c =>
            {
                int l = _typeHandles.Length;
                int s = (l * c) / concurrency;
                for (int i = 0; i < l; ++i)
                    for (int j = 0; j <= i; ++j)
                        sum += GetId_Implementation_Of_Interest(_typeHandles[(j+s)%l]);
            });

Again though, nothing fancy or sophisticated. All just playing around on my virtual 4-core dev box to get an idea of what I was looking at. I ran against the .NET 6, 7, and 8 runtimes with the various 'GetId()' implementations. The .NET 6 numbers were still in favor of CD over HT, but they were closer to a wash. CD performance has gotten better though, so it's a pretty noticeable difference in my basic exploration.

@StephenMolloy you are not benchmarking what you think you are. Your example runs the loop body only "concurrency" times in total, and it will use at most as many threads as your CPU has cores (4).
To actually run on more threads you are better off:

  1. Creating N threads manually.
  2. Waiting for all threads to be started.
  3. Having all threads wait on a SemaphoreSlim, Barrier, or similar, and ensuring they are all released at the same time.
     • Make sure the inner loop takes long enough (maybe seconds) for any concurrency issues to show up; it must be much larger than the synchronization overhead of an empty loop.
  4. (Optionally, measuring and reporting the time for each thread as well as the total time.)

You can also use the full end-to-end benchmark originally used; it needs to be started with the "/server" parameter, and you also need a load-test application such as Netling.

While a micro benchmark might be good for quickly testing and iterating on a solution, I think the most important measure is how it impacts real-world scenarios, and in this case that was more than a 400% improvement in RPS. If you manage to make a realistic micro benchmark, you should expect the improvement to be even larger than that.
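The manual-thread approach outlined above might look roughly like this (a sketch with a stand-in workload, not the benchmark actually used): all threads block on a Barrier so they are released together, and each thread's elapsed time is recorded separately.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ThreadBench
{
    public static TimeSpan[] Run(int threadCount)
    {
        var barrier = new Barrier(threadCount);
        var times = new TimeSpan[threadCount];
        var threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            int id = t;
            threads[t] = new Thread(() =>
            {
                barrier.SignalAndWait();        // all threads released at once
                var sw = Stopwatch.StartNew();
                long sum = 0;
                for (int i = 0; i < 1_000_000; i++)  // stand-in workload; size it so it
                    sum += i;                        // dwarfs the synchronization overhead
                times[id] = sw.Elapsed;
                GC.KeepAlive(sum);
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
        return times;                            // per-thread times; sum or max as needed
    }
}
```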

The real difference is not having to mess around with the 'TypeHandleRef' class et al.

That not only gave cleaner, easier code, but is also part of what makes the solution faster.
Hopefully the benchmark is enough to show that the single-threaded performance is better, and the load-test results prove that there is a large multithreaded performance win on top of the simpler code.

I really hope that is enough to get a review of the code.

@mconnew
Member

mconnew commented May 30, 2023

This can be overcome by passing an instance of ParallelOptions before the delegate, setting MaxDegreeOfParallelism to concurrency.
Also, I don't know which dev machine @StephenMolloy used, but my current dev machine has 10 physical cores (20 logical cores with hyperthreading), so all but the 32-concurrency test variant would have had a sufficient CPU core count to not be limited by that if run on my own work dev box.
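Concretely, the shape of that fix might look like this (ParallelDemo is a made-up wrapper; the counting only demonstrates that all iterations run). Note that MaxDegreeOfParallelism raises the cap on concurrent workers; it does not force Parallel.For to use exactly that many threads.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ParallelDemo
{
    public static int Run(int concurrency)
    {
        // Without this, Parallel.For may cap parallelism below `concurrency`.
        var options = new ParallelOptions { MaxDegreeOfParallelism = concurrency };
        var seen = new ConcurrentDictionary<int, bool>();
        Parallel.For(0, concurrency, options, c => seen[c] = true); // workload goes here
        return seen.Count;
    }
}
```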

@StephenMolloy
Member

Just chiming in with a reminder again: this is an interesting and informative discussion, so the PR is still open for the moment. But it's not likely that we merge this, because I believe the code involved here is the cause of #77877. We will likely be making a different change here, but I do want that change to be informed by this discussion.

@StephenMolloy
Member

It seems this does not interfere with fixing DCS/ALC after all. Pushing an additional commit to apply the same logic to DCJS and to remove some of the supporting code for structs that are no longer used.

@StephenMolloy StephenMolloy merged commit 21f3bbe into dotnet:main Aug 11, 2023
100 of 103 checks passed
@Daniel-Svensson
Contributor Author

Thank you for coming back and merging this, @StephenMolloy.

If you want to benchmark concurrent code in the future you can do something similar to https://github.com/Daniel-Svensson/DataContractBench/blob/main/Program.cs

Example output for this specific case, comparing mutex+dictionary vs. ConcurrentDictionary: https://github.com/Daniel-Svensson/DataContractBench/blob/main/Results.csv. It shows the old mutex approach dropping to less than half its throughput with 4 threads, while ConcurrentDictionary instead becomes 4 times as fast.

@Daniel-Svensson Daniel-Svensson deleted the datacontract_contention branch August 15, 2023 09:20
@Daniel-Svensson
Contributor Author

I reran my benchmarks for 20,000 objects comparing net7 and net8, and the results look really promising: ~460% improvement with a 16-thread load, and slightly more with a 320-thread load, on my 8-core Ryzen 5800X, with much lower latencies.

Net7.0.10 vs 8.0.0-rc.2.23423.11 (daily build)
(benchmark screenshot)

Thank you @StephenMolloy, @stephentoub, and @mconnew for your time on this.

(16 threads with a 1-object request only shows an 8% performance increase, so the rest should mostly be improvements in serialization.)

@ghost ghost locked as resolved and limited conversation to collaborators Oct 12, 2023
Labels
area-Serialization community-contribution Indicates that the PR has been added by a community member