Reduce contention for Datacontract Serialization #73893
Conversation
…ing DataContractSerializer
```csharp
TryCreateBuiltInDataContract(type, out dataContract);
s_typeToBuiltInContract.Add(type, dataContract);
}
TryCreateBuiltInDataContract(key, out DataContract? dataContract);
```
TryCreateBuiltInDataContract seems thread-safe (it only contains a switch statement and creates new DataContracts), so it should not be a problem to call it concurrently; at worst there is a slim chance of creating duplicate instances.
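The benign race described here can be sketched with `ConcurrentDictionary.GetOrAdd`. This is an illustration, not the serializer's actual code; the cache and factory names are made up, and a `string` stands in for `DataContract`:

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of a benign race: the factory is pure and thread-safe, so if two
// threads race, each may build a value, but GetOrAdd stores exactly one and
// the loser's duplicate instance is simply discarded.
var cache = new ConcurrentDictionary<Type, string>();

static string CreateContract(Type t) => "contract:" + t.Name;  // no side effects

string a = cache.GetOrAdd(typeof(int), CreateContract);
string b = cache.GetOrAdd(typeof(int), CreateContract);  // cache hit; factory not run

Console.WriteLine(ReferenceEquals(a, b));  // prints True
```

Note that `GetOrAdd` documents that the factory may be invoked more than once under contention, which is exactly why a side-effect-free factory matters here.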
@mconnew and @StephenMolloy, can you please take a look?
Traditionally, ConcurrentDictionary has struggled to keep up with access from multiple threads. Single-thread read/write looks good, but actually using the concurrency features slows things down compared to Hashtable. And although ConcurrentDictionary has made strides in recent releases, this is still the "apples to apples" case: Hashtable is just faster for multiple readers and a single writer.

However, this isn't really an "apples to apples" case. In order to use Hashtable, we have to wrap value types and provide our own comparison method. The extra overhead involved does indeed make ConcurrentDictionary faster in this case, as far as I can tell.

However however... I suspect these collections are part of the problem causing #77877. So I'm going to need to re-do how this all works anyway, trying to keep in mind the lessons learned from this performance dive.
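For context, the "multiple readers, single writer" Hashtable pattern being compared against looks roughly like this. This is a sketch of the general technique, not the serializer's actual code; `Hashtable` is documented as thread-safe for many readers plus one writer, which is what makes the unlocked read legal:

```csharp
using System;
using System.Collections;

// Sketch of the classic "read without a lock, lock only on write" pattern.
Hashtable table = new();
object writeLock = new();

object GetOrAdd(object key, Func<object, object> factory)
{
    object value = table[key];        // lock-free read: the common fast path
    if (value is null)
    {
        lock (writeLock)              // writers are serialized
        {
            value = table[key];       // re-check under the lock
            if (value is null)
            {
                value = factory(key);
                table[key] = value;
            }
        }
    }
    return value;
}

Console.WriteLine(GetOrAdd("key", _ => 42));  // prints 42
```

The downside mentioned above is that `Hashtable` keys and values are `object`, so value-type keys must be wrapped and compared by hand, which is exactly the overhead that tips the balance toward ConcurrentDictionary here.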
Can you share the benchmarks you're using that led to these conclusions?
This information was coming from me. Many years ago on .NET Framework we replaced a lock in the DataContractSerializer with lock-on-write-only semantics. At the time we benchmarked different solutions, and ConcurrentDictionary performed worse than both the existing always-lock code and the final solution of using a Hashtable and only locking on write. This was in the 4.6.x time frame if my memory is correct. I don't have any of the benchmarks available now.

I did a quick visual comparison of the two implementations (.NET vs. NetFx) and I can see multiple changes which would account for the performance difference. On NetFx, every read has a Volatile.Read call, which can be really expensive as it causes an L1 cache write flush and an L1 cache line ownership change which is negotiated across all cores. On multi-NUMA-node machines this can be very expensive if the old owning core and the new owning core are on different NUMA nodes. This has been eliminated in the .NET implementation, with an explanation here. As you're aware, this can cause contention problems between different CPU cores even when accessing different values, if they are in the same cache line. Removal of the Volatile.Read call removes the cache flush and reduces memory contention.

A second improvement which will apply in this case is the code here: there is a cheaper code path when getting a value when TKey is a value type and no comparer has been set.

I think it's fair to say that the performance (at least for this scenario) of ConcurrentDictionary has been improved for multiple threads reading from the ConcurrentDictionary at the same time, especially for multi-NUMA-node machines (generally Xeon-class processors, and usually not what we have on our dev machines). With the improvements to ConcurrentDictionary, it looks like it's the better option now, as it results in simpler code.

I also think it might be worth making another change in the code.
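The value-type-key fast path mentioned above applies when no custom comparer is supplied to the constructor. A minimal illustration (the dictionary and values are examples, not the serializer's actual cache; supplying any explicit comparer can opt a lookup out of this cheaper path):

```csharp
using System;
using System.Collections.Concurrent;

// With a value-type TKey and no comparer passed to the constructor,
// ConcurrentDictionary can take a cheaper lookup path (default comparison,
// no virtual comparer dispatch per probe).
var fastPath = new ConcurrentDictionary<IntPtr, string>();  // no comparer: eligible for the fast path

IntPtr key = typeof(string).TypeHandle.Value;  // RuntimeTypeHandle.Value is an IntPtr
fastPath[key] = "System.String";

Console.WriteLine(fastPath[key]);  // prints System.String
```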
WCF had a break when the Hot Reload feature was introduced. When code is changed, some runtime reflection classes get replaced with new instances. In the WCF case we were using a
@stephentoub - Nothing fancy. Just a simple single-threaded loop doing Hashtable/ConcurrentDictionary gets and adds. This single-access read/write model is the 90% case. I don't think Hashtable being faster than ConcurrentDictionary in this general scenario is controversial. Or were you asking about the applied-to-this-situation scenario where ConcurrentDictionary comes out faster?

I didn't do anything fancy. I pretty much just took @Daniel-Svensson's simple benchmarks and added some parallelism, which is supposed to be where Hashtable shines over traditional synchronized read/write collections. Instead of the original nested for loops in his benchmark, I did something like this:

```csharp
[Params(1, 4, 16, 32)]
public int concurrency;

Parallel.For(0, concurrency, c =>
{
    int l = _typeHandles.Length;
    int s = (l * c) / concurrency;
    for (int i = 0; i < l; ++i)
        for (int j = 0; j <= i; ++j)
            sum += GetId_Implementation_Of_Interest(_typeHandles[(j + s) % l]);
});
```

Again though, nothing fancy or sophisticated. All just playing around on my virtual 4-core dev box to get an idea of what I was looking at. I ran against the .NET 6, 7, and 8 runtimes with the various `GetId()` implementations. The .NET 6 numbers were still in favor of ConcurrentDictionary over Hashtable, but they were closer to a wash. ConcurrentDictionary performance has gotten better though, so it's a pretty noticeable difference in my basic exploration. The real difference is not having to mess around with the `TypeHandleRef` class et al.
Or perhaps you were asking for something more along the lines of @mconnew's remarks, which he was apparently typing up at the same time. ;) Either way, I believe this code is part of the problem with DCS and AssemblyLoadContext. So we'll be doing something different here anyway.
I have applied the suggestion from @mconnew and changed the key from RuntimeTypeHandle to RuntimeTypeHandle.Value. If you prefer, the code can be changed to keep RuntimeTypeHandle as the key and use a custom comparer instead; the performance difference would be small. The proposed code should hopefully be both simpler and much faster than the existing code, so maybe you can consider merging it even if you might make changes later on? It should hopefully make any future work easier to do rather than harder.
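The custom-comparer alternative mentioned here could look roughly like this. A sketch only, not the committed code: `EqualityComparer<T>.Create` requires .NET 8+, a `string` value stands in for `DataContract`, and comparing via the underlying `Value` pointer avoids any chance of boxing through `Equals(object)`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Keep RuntimeTypeHandle as the key, but compare via the underlying
// IntPtr so lookups stay cheap. On older targets, a small class
// implementing IEqualityComparer<RuntimeTypeHandle> does the same job.
var handleComparer = EqualityComparer<RuntimeTypeHandle>.Create(
    (x, y) => x.Value == y.Value,
    h => h.Value.GetHashCode());

var contracts = new ConcurrentDictionary<RuntimeTypeHandle, string>(handleComparer);
contracts[typeof(int).TypeHandle] = "int contract";

Console.WriteLine(contracts[typeof(int).TypeHandle]);  // prints int contract
```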
@StephenMolloy you are not benchmarking what you think you are. Your example runs the whole inner loop "concurrency" times, so the total work scales with the parameter, and it will use at most as many threads as your CPU has cores (4).

You can also use the full end-to-end benchmark originally used; it needs to be started with the "/server" parameter, and you also need a load-test application such as Netling. While a micro benchmark might be good for quickly testing and iterating on a solution, I think the most important measure is how it might impact real-world scenarios, and in this case it was more than a 400% improvement in RPS. If you manage to make a realistic micro benchmark, you should expect the improvement to be larger than that.

That not only gave cleaner, easier code, it is also part of what makes the solution faster. I really hope that is enough to get a review of the code.
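One way to address the objection above is to keep the total work fixed and split it evenly across a requested number of dedicated threads, so wall time reflects contention rather than extra work. This is a sketch under assumptions: `GetId` is a dummy stand-in for the implementations under test, not the benchmark's actual code:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Dummy workload standing in for the GetId implementation under test.
static long GetId(int handle) => handle * 2654435761L;

// Fixed total work, split evenly across `threads` workers.
static TimeSpan Run(int threads, int totalOps)
{
    int opsPerThread = totalOps / threads;
    var tasks = new Task<long>[threads];
    var sw = Stopwatch.StartNew();
    for (int t = 0; t < threads; t++)
    {
        int start = t * opsPerThread;
        tasks[t] = Task.Factory.StartNew(() =>
        {
            long sum = 0;
            for (int i = start; i < start + opsPerThread; i++)
                sum += GetId(i);    // each worker touches a disjoint range
            return sum;
        }, TaskCreationOptions.LongRunning);  // hint: dedicated thread per task
    }
    Task.WaitAll(tasks);
    sw.Stop();
    return sw.Elapsed;
}

Console.WriteLine(Run(4, 100_000));
```

With this shape, comparing `Run(1, N)` against `Run(16, N)` measures scaling of the same workload, rather than sixteen times the work on however many cores happen to be available.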
This can be overcome by passing an instance of
Just chiming in with a reminder again... this is an interesting and informative discussion, so the PR is still open for the moment. But it's not likely that we'll merge this, because I believe the code involved here is the cause of #77877. We will likely be making a different change here, but I do want that change to be informed by this discussion.
…ss that we no longer need.
It seems like this does not interfere with fixing DCS/ALC after all. Pushing an additional commit to apply the same logic to DCJS and to remove some of the supporting code for structs that are no longer used.
Thank you for coming back and merging this, @StephenMolloy. If you want to benchmark concurrent code in the future, you can do something similar to https://github.com/Daniel-Svensson/DataContractBench/blob/main/Program.cs. Example output for this specific case (mutex+dictionary vs. ConcurrentDictionary) is at https://github.com/Daniel-Svensson/DataContractBench/blob/main/Results.csv, showing the old mutex approach dropping to less than half the throughput with 4 threads, while ConcurrentDictionary is instead 4 times as fast.
I reran my benchmarks for 20,000 objects comparing net7 and net8, and the results look really promising: ~460% improvement with a 16-thread load, and slightly more with a 320-thread load, on my 8-core Ryzen 5800X, with much lower latencies. Net 7.0.10 vs. 8.0.0-rc.2.23423.11 (daily build). Thank you @StephenMolloy, @stephentoub, and @mconnew for your time with this. (16 threads with a 1-object request only shows an 8% performance increase, so the rest should mostly be improvements in serialization.)
This is just #70668, which was closed after #71752 was merged, resulting in a merge conflict.
Since the latter did not really make any large improvement, as explained here, I am reopening the PR.
In short, it improves RPS by around ~5% without contention and by well over 400% (from 137 to nearly 600 RPS) for a scenario with 16 threads.
New:
Old:
See PR #70688 for more details.
If it is better to reopen the original PR, then feel free to do so (I did not have the option); maybe @StephenMolloy can do that if it is preferable.