Reduce peak commit on memory-alloc intensive apps #150
Thank you, I will dive right into this and see what I can find (and improve) |
Ping! I'm wondering if you had any time to look into this? (at least I wanted to know if there's a potential for improvement) |
Yeah, I've started looking into it but been occupied with other things lately. I will try to get back to this soon. |
I've gotten to the point where I'm going to start comparing some changes to rpmalloc, but I can't seem to correctly set up my environment. When trying to bootstrap the llvm repo, running the initial cmake for stage1, I get
|
Are you running inside a Visual Studio cmd shell? (x64 Native Tools Command Prompt for VS 2019) |
I tried using MSVC as well, but then it fails when running ninja check-all, with
|
At this point, it should pick up the Is |
Uh, I came back to this today and realized it was just me being completely stupid; I was running the wrong command shell (it was set up for the x86 MSVC environment, not x86-64) |
Seems I'm on track once again, stage1 completed and will continue this investigation tomorrow |
Got usable timing and dumps from stage2, investigating and trying out some ideas. |
Great news! Many thanks again for taking the time on this ;-) |
I've made some progress on this, I hope to have something ready in a few days |
@aganea I'm having issues trying to reduce the small block granularity from 16 to 8 bytes to see if that causes the additional memory usage, but it seems lld-link.exe assumes allocations are 16-byte aligned? When running lld-link.exe with rpmalloc handing back 8-byte-aligned memory blocks, it bails out when trying to zero out memory; the disassembly looks like
It bails on the Do you have any insights into this? I thought mimalloc had 8-byte granularity/alignment as well. |
I think because the X86_64 ABI defines |
I guess I must have missed something when looking at the mimalloc source, then; I got the impression it had a natural block alignment of sizeof(uintptr_t) |
@aganea could you try the mjansson/llvm-opt branch and see what results you get on your end? You will have to change the preload/overload enable defines to 1, I guess; otherwise it should be good to go inside llvm |
Great! I will do that and get back. |
Do you think with -std=c++17 and 'aligned new', things could be improved? |
I think it's better to leave it at 16 bytes and keep that as the natural granularity in the allocator, to allow the optimizer to utilize SSE instructions freely |
Btw this branch will probably not reduce peak commit, but it will hopefully make it faster :) |
I did some tests with the mjansson/llvm-opt branch. It seems the performance is better overall; there's a consistent decrease in link times with LLD (-10 sec over 148 sec of link time), down to about the same times as mimalloc. However, there's a slight increase in commit size. Overall it looks good; I'll do more tests and update the LLVM review. I've also tested snmalloc and will post that as well (overall, all are in the same ballpark in terms of performance, but mimalloc generally does better in terms of commit, rpmalloc is behind, and then comes snmalloc, which has the highest commit usage) |
I'll take another look at this and see if I can tune the sizes of class buckets to reduce the overhead and in the end the total commit. |
Please see the results from last night's ABBA test. It ran for 10 hours. It seems llvm-opt saves about 2 sec of link time on the 36-core. I'll run the test again on the 6-core; the difference should be higher, I imagine:
|
10-hour test on the 36-core with snmalloc and mimalloc. Same conditions as above.
|
So if I'm reading the latest numbers here (and from your last comment on https://reviews.llvm.org/D71786) it seems wall clock time performance of rpmalloc is now on par with the others (mimalloc, snmalloc) and I should focus efforts on the peak commit? |
Yes indeed! I could send you the allocation pattern for all three allocators by email. As mentioned previously, it seems mimalloc does lots of commits in very small ranges. But perhaps the internal structures are important as well. Do you think you can sort out something with the license? It seems like the most contentious issue. Does git support switching the license to MIT during the lifetime of a repo? Or is it fixed once the repo is created? |
rpmalloc has been dual licensed for quite some time: either released to the public domain or, if you are a grumpy lawyer, under the MIT license. I now noticed the LICENSE file had not been updated (or was lost in some merge) and only the README had the dual license. I've updated the LICENSE file now to include the MIT option as well. |
I've reworked the caches a bit in the mjansson/array-cache branch - give it a try and see what impact it has on performance and peak commit |
I've recently integrated rpmalloc and mimalloc into LLVM, please see thread: https://reviews.llvm.org/D71786
I discovered along the way that rpmalloc takes more memory than mimalloc when linking with LLD & ThinLTO. For example:
There's a difference in execution time, in favor of mimalloc. The difference seems proportional to the difference in commit size between the two.
To repro (Windows bash, but you could probably repro all this on Linux as well),
To compare with mimalloc, you'd first need to compile mimalloc as a static lib (disable /GL).
You can reference it then in place of rpmalloc, by using the following patch (simply revert this file from the previous patch, before applying):
You don't need to rebuild stage1, only stage2. You don't need to call cmake again, you can simply call
ninja all -C stage2
after applying the mimalloc modification above. You can then switch between rpmalloc and mimalloc by commenting out the relevant sections in this file and re-running ninja. At this point, you should see a difference in terms of peak committed memory. I'm using UIforETW (https://github.com/google/UIforETW) to take profiles on Windows.
You can probably repro this on Linux as well, and maybe link a smaller program instead of clang.exe if you want faster iteration. Please don't hesitate to poke me by email if any of this doesn't work or if you're stuck.